Yaamini’s Notebook: DMG Analysis Part 4

Characterizing GOslim terms

I was working through my discussion outline, but was unsure what to make of the lists of parent GOterms I matched to DMG and genes with DML. Steven suggested I look at GOslim terms instead.

I used this Jupyter notebook to get GOslim terms for transcripts. I created separate lists for biological processes and molecular functions. I returned to my DML gene enrichment and DMG gene enrichment to figure out which GOslim terms go with DML and DMG. To do that, I merged GOslim terms with GO-MWU output, then counted the frequency of each GOslim term present. Here are the resulting tables:
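The merge-and-count step boils down to a frequency table. Here is a minimal shell sketch of the counting, with a made-up two-column file (gene, GOslim term) standing in for the merged GO-MWU output:

```shell
# Hypothetical merged file: tab-delimited, GOslim term in column 2.
printf 'gene1\tcell adhesion\ngene2\tstress response\ngene3\tcell adhesion\n' > merged-goslim.txt

# Count how many genes fall under each GOslim term, most frequent first.
cut -f2 merged-goslim.txt \
| sort \
| uniq -c \
| sort -rn
```

The actual counting happened in R, so treat the file name and column layout here as placeholders.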

Biological processes:

Molecular function:

I’ll take the categories with the most transcripts and start there for a discussion.

Going forward

  1. Update methods and results
  2. Finalize figures
  3. Update paper repository
  4. Outline the discussion
  5. Write the discussion
  6. Write the introduction
  7. Revise my abstract
  8. Share the draft with collaborators and get feedback
  9. Post the paper on bioRxiv
  10. Prepare the manuscript for publication


from the responsible grad student https://ift.tt/2ZjX88G

cat 20190814_BmrkCalig.sh
#!/bin/bash
## Job Name
#SBATCH --job-name=BismarkAlign_Calig
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes 
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=14-23:30:00
## Memory per node
#SBATCH --mem=100G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=strigg@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/strigg/analyses/20190814_Calig
#SBATCH --constraint="skylake|broadwell"

#align with bismark


find /gscratch/scrubbed/strigg/TRIMG_adapt_5bp/TRIM_cat/Sealice*R1_001_val_1.fq.gz \
| xargs basename -s _R1_001_val_1.fq.gz| xargs -I{} /gscratch/srlab/programs/Bismark-0.19.0/bismark \
--path_to_bowtie /gscratch/srlab/programs/bowtie2- \
--samtools_path /gscratch/srlab/programs/samtools-1.9 \
--score_min L,0,-0.6 \
-p 4 \
--non_directional \
--dovetail \
--genome /gscratch/srlab/strigg/data/Caligus/GENOMES \
-1 /gscratch/scrubbed/strigg/TRIMG_adapt_5bp/TRIM_cat/{}_R1_001_val_1.fq.gz \
-2 /gscratch/scrubbed/strigg/TRIMG_adapt_5bp/TRIM_cat/{}_R2_001_val_2.fq.gz \
-o /gscratch/scrubbed/strigg/analyses/20190814_Calig

#run deduplication
/gscratch/srlab/programs/Bismark-0.19.0/deduplicate_bismark \
--bam -p \
/gscratch/scrubbed/strigg/analyses/20190814_Calig/*.bam \
-o /gscratch/scrubbed/strigg/analyses/20190814_Calig/ \
2> /gscratch/scrubbed/strigg/analyses/20190814_Calig/dedup.err \
--samtools_path /gscratch/srlab/programs/samtools-1.9/

#create summary report
cat /gscratch/scrubbed/strigg/analyses/20190814_Calig/*PE_report.txt | \
grep -E 'Mapping\ efficiency\:|paired-end|Sequence|C methylated' | \
cat - /gscratch/scrubbed/strigg/analyses/20190814_Calig/*.deduplication_report.txt | \
grep 'Mapping\ efficiency\:\|removed' \
> /gscratch/scrubbed/strigg/analyses/20190814_Calig/mapping_dedup_summary.txt

#run methylation extractor
/gscratch/srlab/programs/Bismark-0.19.0/bismark_methylation_extractor \
--paired-end --bedGraph --counts --scaffolds \
--multicore 28 \
/gscratch/scrubbed/strigg/analyses/20190814_Calig/*deduplicated.bam \
-o /gscratch/scrubbed/strigg/analyses/20190814_Calig/ \
--samtools /gscratch/srlab/programs/samtools-1.9/samtools \
2> /gscratch/scrubbed/strigg/analyses/20190814_Calig/bme.err

#create bismark reports for individual samples

#create bismark summary report for all samples

#Run coverage2cytosine command to generate cytosine coverage files
find /gscratch/scrubbed/strigg/analyses/20190814_Calig/*.cov.gz \
| xargs basename -s _R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz \
| xargs -I{} /gscratch/srlab/programs/Bismark-0.19.0/coverage2cytosine --gzip \
--genome_folder /gscratch/srlab/strigg/data/Caligus/GENOMES \
-o /gscratch/scrubbed/strigg/analyses/20190814_Calig/{}_cytosine_CpG_cov_report \
/gscratch/scrubbed/strigg/analyses/20190814_Calig/{}_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz

#compile and sort bams for methylkit
find /gscratch/scrubbed/strigg/analyses/20190814_Calig/*deduplicated.bam| \
xargs basename -s _R1_001_val_1_bismark_bt2_pe.deduplicated.bam | xargs -I{} /gscratch/srlab/programs/samtools-1.9/samtools \
sort /gscratch/scrubbed/strigg/analyses/20190814_Calig/{}_R1_001_val_1_bismark_bt2_pe.deduplicated.bam \
-o /gscratch/scrubbed/strigg/analyses/20190814_Calig/{}.dedup.sorted.bam


Laura’s Notebook: Measuring QuantSeq larvae

The QuantSeq run I am prepping for will look at gene expression in O. lurida larvae, which were produced by adults that had previously been held in varying winter temperature and pCO2 conditions. All larvae were collected and frozen within a day or two of being released from the brood chamber, so they should all be at the same developmental stage. Size upon release, however, could differ slightly depending on larval growth rate, and on whether larval release was triggered by something (e.g. food, tank cleaning). Since larval size could correspond with developmental stage, which affects gene expression profiles, I measured all the larvae that will also be sequenced.

While homogenizing frozen larvae with mortar + pestle, I preserved some of the larvae in ethanol, held in a fridge. To measure, I used the Nikon camera + microscope in Jackie’s lab, and the NIS-Element software’s automatic measurement capabilities. Steps:

  • Suspend larvae in ethanol with a pipette, transfer to a slide with cover slip
  • Capture image of larvae that are lying on their side, so the full width/length could be measured
  • Apply a binary layer to the image using the pre-determined threshold setting (based on color). The threshold also uses circularity and size to determine where the binary boundaries occur, and what to include.
  • Use the automatic measurement option to draw perimeters around all binary objects, then generate measurement statistics about each object. Here are details of the measurements I pulled for each object, from the NIS-Element user manual. Note: I believe the MaxFeret is the best representation of shell width, and MaxFeret90 is shell height.


When measuring objects, it’s important to do quality control – some larvae should not be included because they are at a weird angle, or tissue/junk is included in the auto-generated object, making the measurements erroneous. After doing this QC, I exported 2 images simultaneously: 1) the microscope image without any annotations, layers, etc.; 2) the binary layer showing the objects that were measured, object IDs, and a scale bar. It took me a long time to figure out the software, which I don’t think is very intuitive. I made a screen recording of the process – check it out:

I measured at least 50 larvae from each of my 58 larval samples. Here are some quick and dirty plots of MaxFeret90 (aka shell height) and MaxFeret (aka shell width):




I need to measure the following larvae – missed these since I already had RNA:

| Date Collected | Spawning Group | Population | Treatment | Larval sample # | RNA Sample # | Date homogenized |
| --- | --- | --- | --- | --- | --- | --- |
| 5/21/17 | SN-10 Low B | Oyster Bay C1 | 10 Low | 06-A | 41 | na |
| 5/23/17 | SN-10 Low B | Oyster Bay C1 | 10 Low | 08-A | 43 | na |
| 5/26/17 | SN-10 Low B | Oyster Bay C1 | 10 Low | 24-A | 46 | na |
| 5/27/17 | SN-10 Low B | Oyster Bay C1 | 10 Low | 26-A | 47 | na |
| 5/31/17 | SN-10 Low A | Oyster Bay C1 | 10 Low | 32-A | 44 | na |
| 6/14/17 | SN-10 Low B | Oyster Bay C1 | 10 Low | 62-A | 506 | 07/20/19 |
| 6/24/17 | SN-10 Low A | Oyster Bay C1 | 10 Low | 79-A | 45 | na |
| 5/23/17 | SN-6 Ambient B | Oyster Bay C1 | 6 Ambient | 10-A | 35 | na |
| 6/5/17 | SN-6 Ambient B | Oyster Bay C1 | 6 Ambient | 45-A | 513 | 07/23/19 |
| 6/6/17 | SN-6 Ambient B | Oyster Bay C1 | 6 Ambient | 48-A | 39 | na |
| 6/15/17 | SN-6 Ambient B | Oyster Bay C1 | 6 Ambient | 69-A | 37 | na |
| 6/19/17 | SN-6 Ambient A | Oyster Bay C1 | 6 Ambient | 77-A | 34 | na |

from The Shell Game https://ift.tt/2L8hesv

Sam’s Notebook: Data Wrangling – Panopea generosa Genome Feature Sequence Lengths

In preparation for a paper we’re writing, we needed some summary stats on the various genome assembly feature sequences. I determined the max/min, mean, and median sequence lengths for all the GFF feature files we currently have for Pgenerosa_v070, Pgenerosa_v070_top18_scaffolds, and Pgenerosa_v074. This info will be compiled into a table for the manuscript. See our Genomic Resources wiki for more info on GFFs:

Calculations were performed using Python in a Jupyter Notebook.

Jupyter Notebook (GitHub):
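The notebook itself isn’t embedded here, but the same summary stats can be sketched in shell, assuming a standard 9-column GFF where feature length is end minus start plus one; the toy file below is made up:

```shell
# Hypothetical mini-GFF (9 tab-separated columns); real files come from the repo.
printf 'scaf1\tsrc\tgene\t1\t10\t.\t+\t.\tID=g1\nscaf1\tsrc\tgene\t20\t25\t.\t+\t.\tID=g2\nscaf1\tsrc\tgene\t30\t33\t.\t+\t.\tID=g3\n' > features.gff

# Feature length = end - start + 1; report min, max, mean, and median.
awk -F'\t' '{print $5 - $4 + 1}' features.gff \
| sort -n \
| awk '{len[NR] = $1; sum += $1}
       END {
         n = NR
         median = (n % 2) ? len[(n + 1) / 2] : (len[n / 2] + len[n / 2 + 1]) / 2
         printf "min=%d max=%d mean=%.2f median=%.1f\n", len[1], len[n], sum / n, median
       }'
```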

Yaamini’s Notebook: DMG Analysis Part 3

Enrichment and description of differentially methylated genes

I talked to Shelly about my non-significant results. She said I could use uncorrected p-values in my results and argue that this is an exploratory study, just like she did with her crab metabolomics paper. I returned to my R Markdown script to do just that.

Methylation differences and annotation

I wanted to use GO-MWU again. I knew I needed methylation difference information between treatments to correct p-values if I wanted to use signed log p-values. To do this, I subsetted methylation information for ambient and treatment samples separately, then calculated median percent methylation by gene:

 ambientSamples <- subset(fullPercentMethExpanded, subset = fullPercentMethExpanded$treatment == "Ambient") #Subset ambient samples
 ambientSamplesPercentMeth <- aggregate(percentMeth ~ geneID, data = ambientSamples, FUN = median) #Calculate median methylation by gene for ambient samples
 treatmentSamples <- subset(fullPercentMethExpanded, subset = fullPercentMethExpanded$treatment == "Treatment") #Subset treatment samples
 treatmentSamplesPercentMeth <- aggregate(percentMeth ~ geneID, data = treatmentSamples, FUN = median) #Calculate median methylation by gene for treatment samples

Returning to the Kruskal-Wallis statistical output, I subsetted genes with uncorrected significant p-values — my DMG. I then added annotation information to the DMG, including Uniprot Accession codes, GO terms, and methylation differences. I repeated this process with DMG with DML. There are 223 DMG and 6 DMG with DML! I saved the annotated DMG here and annotated DMG with DML here.

Gene enrichment with GO-MWU

Gene enrichment time! I generated a GO annotation table and a table of significance measures using signed log p-values from Kruskal-Wallis tests. All genes were used, not just the DMG. I noticed that half of the genes in the GO annotation table did not have associated GOterms. Unsurprisingly, there were no significantly enriched parent GOslim terms for the biological processes, cellular components, or molecular function categories. I counted the frequency of parental GOslim terms for all genes tested and for DMG:

Biological processes:

Cellular components:

Molecular function:

Alright, now it’s time to write (and maybe make some new figures who knows).

Going forward

  1. Update methods and results
  2. Finalize figures
  3. Update paper repository
  4. Outline the discussion
  5. Write the discussion
  6. Write the introduction
  7. Revise my abstract
  8. Share the draft with collaborators and get feedback
  9. Post the paper on bioRxiv
  10. Prepare the manuscript for publication


from the responsible grad student https://ift.tt/2PgXhVN

Grace’s Notebook: DIA Oysterseed Project – Finishing up Paper and cleaning up repo – in progress

I’ve been working on finally wrapping up the 2015 Oysterseed DIA project. I feel like I’ve tied off the paper fairly well this past week, and have been working on cleaning up and organizing the repository today. Details in post.


I’ve cleaned up some old comments, added some more details in the captions, and double checked all of the links and protocols.

Google doc: Oyster-Larval-Proteomics-2015


I made a repository for the paper during pubathon, and it has turned into a repository for the project as well. The original repository combined both this project and the one that Shelley and Kaitlyn worked on.

Repository: grace-ac/paper-pacific.oyster-larvae

I deleted some extra files that were never used for anything related to the final paper, and added information to all of the readme.md files. I still have some tasks to finish up, as noted by my notes (my “to-do” items are boxed in orange):


from Grace’s Lab Notebook https://ift.tt/2ZcMyAd

Yaamini’s Notebook: DMG Analysis Part 2

Identifying differentially methylated genes

After messing around with code to annotate coverage files with gene information, I finally have files I can work with to do some number crunching in this R Markdown script.

Formatting files

I wanted to use a shortcut to import multiple files at once, but I quickly found out that doing so makes it impossible to include additional arguments with read.delim or read.table. Additionally, the original coverage files were not tab-delimited, so I had tab and space delimiters in my files! I decided that the best thing to do would be to format, then import.

The first formatting issue I tackled was the fact that the original coverage files had space delimiters instead of tab-delimiters. Importing files with multiple delimiters into R, or trying to separate columns once imported, seemed hellish. I went to my code to create the join key. Instead of printing all columns in the file as-is after creating the new column, I specified tab delimiters between columns:

 #For each file that ends in bedgraph (the original coverage files)
 #Print the first three columns (chr, start, end) with dashes ("-") in between. Then, print the rest of the columns (chr, start, end, percentMeth) in the file with tabs ("\t") in between.
 #Save the file with the base name + .sorted
 for f in *bedgraph
 do
   awk '{print $1"-"$2"-"$3"\t"$1"\t"$2"\t"$3"\t"$4}' ${f} \
   | sort -k1,1 \
   > ${f}.sorted
 done

I re-joined my files and was ready to tackle the second issue: column headers. When I tried importing all 10 files at once with list2env (more on that below), I couldn’t modify arguments of read.delim. Instead of messing with R code, I decided to add headers to each file. The minds at Stack Overflow suggested a few workarounds. I tried to keep column names consistent between previous versions of the files. Between each column name, I typed \t to include tab delimiters.

 #For each file that ends in percentMeth
 #List tab-delimited column headers
 #Add the file after the headers
 #Save output as a new file with the base name + ".header"
 for f in *percentMeth
 do
   echo -e "key\tchr\tstart\tend\tpercentMeth\tchr2\tstart2\tend2\tgene-chr\tgene-start\tgene-end\tmRNA-start\tmRNA-end\tgeneID\tGenbank\tmRNALength\tProduct\tUniprot\tGO" \
   | cat - ${f} \
   > ${f}.header
 done

Finally, I was ready to import files into R! Instead of importing 10 files separately, I wanted to import them all together. Once again, Stack Overflow helped.

 #Create a file list for all 10 files to import
 #Import files to the .GlobalEnv with list2env. Use lapply to setNames of the files by taking all the common parts of their names out. Read files with read.delim (can only use defaults: header = TRUE, sep = "\t"). Files will be named zr2096_N
 filesToImport <- list.files(pattern = "*header")
 list2env(lapply(setNames(filesToImport,
                          make.names(gsub("_s1_R1_val_1_bismark_bt2_pe.deduplicated.bismark.cov_5x.bedgraph.annotated-percentMeth.header", "", filesToImport))),
                 read.delim),
          envir = .GlobalEnv)

The code allowed me to import 10 files at once and condense the file names (let’s be real they were informative but super long and annoying). I removed extraneous columns for each sample.

Calculating median percent methylation

This step was pretty straightforward! I looked at Hollie’s code and found she used aggregate. I did too:

 #Calculate median percent methylation by gene for each sample
 zr2096_1_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_1, FUN = median)
 zr2096_2_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_2, FUN = median)
 zr2096_3_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_3, FUN = median)
 zr2096_4_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_4, FUN = median)
 zr2096_5_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_5, FUN = median)
 zr2096_6_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_6, FUN = median)
 zr2096_7_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_7, FUN = median)
 zr2096_8_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_8, FUN = median)
 zr2096_9_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_9, FUN = median)
 zr2096_10_PercentMeth <- aggregate(percentMeth ~ geneID, data = zr2096_10, FUN = median)
 head(zr2096_1_PercentMeth) #Confirm percent methylation calculation

The resultant data frames had geneID and percentMeth. It was annoying that 1) I couldn’t figure out a way to loop through this command for all 10 samples and 2) all of my gene annotations were gone, but oh well. I guess I can add that in later.

The not-so-straightforward part: making sure every gene is listed in all ten dataframes. I realized that each dataframe had a different number of lines, or gene IDs. I used a series of merge commands to retain only the gene IDs common to all 10 samples. Again, frustrating that I can’t do this in a loop:

 #Merge sample percent methylation information so that all geneIDs have data
 fullPercentMeth <- merge(x = zr2096_1_PercentMeth, y = zr2096_2_PercentMeth, by = "geneID")
 fullPercentMeth <- merge(x = fullPercentMeth, y = zr2096_3_PercentMeth, by = "geneID")
 fullPercentMeth <- merge(x = fullPercentMeth, y = zr2096_4_PercentMeth, by = "geneID")
 fullPercentMeth <- merge(x = fullPercentMeth, y = zr2096_5_PercentMeth, by = "geneID")
 fullPercentMeth <- merge(x = fullPercentMeth, y = zr2096_6_PercentMeth, by = "geneID")
 fullPercentMeth <- merge(x = fullPercentMeth, y = zr2096_7_PercentMeth, by = "geneID")
 fullPercentMeth <- merge(x = fullPercentMeth, y = zr2096_8_PercentMeth, by = "geneID")
 fullPercentMeth <- merge(x = fullPercentMeth, y = zr2096_9_PercentMeth, by = "geneID")
 fullPercentMeth <- merge(x = fullPercentMeth, y = zr2096_10_PercentMeth, by = "geneID")

There are 11,622 genes with data for all 10 samples, meaning I shouldn’t have any missing data for my statistical tests. I subsetted the gene ID and each sample’s percent methylation columns to create individual dataframes for each sample. After changing column names, I added the sample number and treatment in different columns. Finally, I used rbind to create a dataframe with geneID, percentMeth, sampleNumber, and treatment information for all 10 samples. I saved the file here.
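The chain of merge() calls is effectively one inner join repeated nine times. The same geneID intersection can be sketched with the Linux join command; the sample files and values below are made up, and real files would need to be sorted by geneID first:

```shell
# Hypothetical per-sample files: tab-delimited "geneID<TAB>percentMeth", pre-sorted by geneID.
printf 'g1\t10\ng2\t50\ng3\t80\n' > s1.txt
printf 'g1\t12\ng3\t70\n'          > s2.txt
printf 'g1\t11\ng3\t75\ng4\t90\n'  > s3.txt

# Inner-join all samples on geneID; genes missing from any sample drop out,
# mirroring the default behavior of R's merge().
join -t $'\t' s1.txt s2.txt > tmp.txt
join -t $'\t' tmp.txt s3.txt
```

Here only g1 and g3 survive, since they are the genes present in every sample.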

Statistical testing

Liew et al. checked normality and variance assumptions before performing a t-test. Since I did MBD-BSseq, which biases towards highly methylated regions, I’m pretty confident my data will not meet a normality assumption.

Choosing a method

My raw data is most definitely not normal if I look at a histogram:

Screen Shot 2019-08-23 at 12 09 20 PM

To deal with the left-skewness, I applied separate log, square root, and cube root transformations to see if any of them look better:

Screen Shot 2019-08-23 at 12 10 14 PM

Screen Shot 2019-08-23 at 12 10 23 PM

Screen Shot 2019-08-23 at 12 11 27 PM

While the log transformation looks the best, it is still left-skewed. I decided to use the non-parametric Kruskal-Wallis test instead of a t-test. With only two treatment groups, the Kruskal-Wallis test is equivalent to an unpaired Mann-Whitney U test.

Conducting the statistical test

I decided to use kruskal.test in R to conduct the Kruskal-Wallis test. I took a quick peek at the example:

 ## Hollander & Wolfe (1973), 116.
 ## Mucociliary efficiency from the rate of removal of dust in normal
 ## subjects, subjects with obstructive airway disease, and subjects
 ## with asbestosis.
 x <- c(2.9, 3.0, 2.5, 2.6, 3.2) # normal subjects
 y <- c(3.8, 2.7, 4.0, 2.4) # with obstructive airway disease
 z <- c(2.8, 3.4, 3.7, 2.2, 2.0) # with asbestosis
 kruskal.test(list(x, y, z))

I needed to isolate my ambient and treatment samples in different dataframes to follow this method. After sorting the dataframe by geneID then treatment, I used subset to create dataframes for each treatment.

For kruskal.test, I thought the best approach would be to loop through the dataset by geneID, extract the chi-squared statistic, df, and p-value from each test, and save the output in a new dataframe:

 #For each unique geneID
 #Conduct a KW test for a set of 5 rows from each dataframe (5 ambient and 5 treatment median % methylation values for the same gene)
 #Store the chi-squared statistic in the ith row
 #Store the df in the ith row
 #Store the p.value in the ith row
 for (i in 1:length(KWTestResults$geneID)) {
   temp <- kruskal.test(list(ambientPercentMethFull$percentMeth[((5*i)-4):(5*i)],
                             treatmentPercentMethFull$percentMeth[((5*i)-4):(5*i)]))
   KWTestResults$chi.squared[i] <- temp$statistic
   KWTestResults$df[i] <- temp$parameter
   KWTestResults$p.value[i] <- temp$p.value
 }

I then used p.adjust to apply a Benjamini-Hochberg correction to my p-values to control the false discovery rate. When I looked at the range of my adjusted p-values, none of them were significant! I’m a little bummed, but it’s not surprising given that I did over 10,000 comparisons. I saved the output of my Kruskal-Wallis tests here.
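The Benjamini-Hochberg adjustment itself is simple: sort the p-values, compute p × n / rank, and enforce monotonicity with a running minimum from the largest p-value down. A toy shell sketch with four made-up p-values:

```shell
# Four made-up p-values, one per line.
printf '0.01\n0.04\n0.03\n0.005\n' > pvals.txt

# Benjamini-Hochberg: sort p-values largest-first, compute p * n / rank,
# and carry a running minimum so adjusted values stay monotone (capped at 1).
sort -gr pvals.txt \
| awk -v n=4 '{
    raw = $1 * n / (n - NR + 1)
    if (NR == 1 || raw < m) m = raw
    if (m > 1) m = 1
    print m
  }'
```

Output is in descending p-value order; the point is just to show why running over 10,000 tests pushes adjusted p-values up. In practice R does this for you.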

mRNA positions with intersectBed

After I looked at my statistical test results, I had a mild panic that the mRNA start and end positions I used to annotate coverage files had biased which loci were included. There was a chance intersectBed had matched CpG loci against mRNA positions as well as gene positions. I removed the mRNA information from my annotations and reran intersectBed. I counted the lines of the resultant sorted and joined files. They did not differ from the counts I previously had, so I put that mild panic to rest and reverted to including mRNA positions in my files.

Filtering genes before statistical testing

Liew et al. filtered data before they conducted their statistical tests by only retaining genes with at least five methylated positions. Even though keeping genes with no missing data pared down my list by ~10,000 genes, I figured it couldn’t hurt to replicate their efforts. I don’t really understand why they chose genes with at least five methylated positions, since it seems like that would bias the analysis towards heavily methylated genes.
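Their position-count filter could be sketched in shell, assuming a hypothetical locus-level file with one row per methylated CpG and the geneID in column 1:

```shell
# Hypothetical locus-level file: one row per methylated CpG, geneID in column 1.
printf 'g1\ng1\ng1\ng1\ng1\ng2\ng2\n' > loci-geneIDs.txt

# Keep only genes with at least 5 methylated positions (the Liew et al. threshold).
sort loci-geneIDs.txt \
| uniq -c \
| awk '$1 >= 5 {print $2}'
```

With this toy input, only g1 (5 positions) passes the filter.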

I thought an easier, and perhaps more meaningful way to pare down my list and see if it had an effect was to only look at genes with DML. I extracted unique geneID from my DML list and merged that with my kruskal.test results. I then applied a Benjamini-Hochberg correction to the p-values from the 481 genes. None of the adjusted p-values for genes with DML were significant (I saved a separate file here). My paper will focus solely on the DML.

Speaking of which…time to WRITE!

Going forward

  1. Update methods and results
  2. Update paper repository
  3. Outline the discussion
  4. Write the discussion
  5. Write the introduction
  6. Revise my abstract
  7. Share the draft with collaborators and get feedback
  8. Post the paper on bioRxiv
  9. Prepare the manuscript for publication


from the responsible grad student https://ift.tt/30vDeUD

Sam’s Notebook: Data Wrangling – Create a CpG GFF from Pgenerosa_v074 using EMBOSS fuzznuc on Swoose

Steven wanted me to create a CpG GFF for use in IGV visualizations for continued Pgenerosa_v074 analysis. I did that by running the EMBOSS tool fuzznuc:

 ~/programs/EMBOSS-6.6.0/emboss/fuzznuc \
 -sequence ~/data/P_generosa/genomes/Pgenerosa_v074.fa \
 -pattern CG \
 -outfile 20190821_fuzznuc_pgenv074.gff \
 -rformat gff

This was run on my computer, swoose.

Yaamini’s Notebook: DMG Analysis

Starting the differentially methylated gene analysis

After weeks of thinking about it, I’m finally looking at differentially methylated genes! I’m replicating the differentially methylated gene analysis from this paper. The authors used python scripts for the analysis, so I started by reviewing their steps:

  1. Annotate coverage files from bismark with gene and exon information
  2. Calculate median % methylation for each gene
  3. Check normality and variance assumptions
  4. Perform t-tests to see if median % methylation differs by treatment, and correct p-values for multiple comparisons

For my analysis, I created this repository folder and downloaded initial python scripts from this repository. I figured I could do the rest of the analysis in R, so I didn’t download any other python scripts.

Choosing a workflow

The first step is to annotate coverage files from bismark with gene information. I am using this python script. The majority of the script imports other functions and scripts.

I started running the script but encountered this error:

Screen Shot 2019-08-14 at 2 38 21 PM

I think the issue came from the .gff3 I downloaded from NCBI.

Screen Shot 2019-08-14 at 2 42 10 PM

I posted this issue, and Sam confirmed that the parse.gff3 script is the reason why I got an error. The script I downloaded isn’t a general GFF3 parser, so I would need to modify it for my own use. He asked me what I wanted to do, and I realized I don’t need to use the exact script. Talking with Shelly on Friday, I also found out that Hollie was working on something similar. In the issue’s comments, she shared this Markdown document and this R script with me. Since I’m more proficient in R, I decided to modify Hollie’s workflow for my use.

Annotating coverage files

Master gene annotation list

I created this R Markdown script for my analysis. The first thing I did was create a list of annotated genes. This was relatively straightforward for me because I’ve done it before, twice! I matched genes with mRNA, and ended up with a preliminary list of 60,201 entries — the exact number of mRNA. The file can be found here.

Then came the first tricky part: selecting the longest mRNA for each gene. When looking at differentially methylated genes, Liew et al. assigned a function to the gene by using information from the longest mRNA. I wanted to do the same thing. I initially used sort --unique to try and do this, but that didn’t work. I posted this issue, and Sam found a solution.

 #For the 6th field [$6] (gene ID), remove all duplicate entries in a csv file (-F","). These entries are saved and can be extracted as a different output if needed.
 #Save output to a new file
 awk -F"," '!x[$6]++' 2019-08-14-All-Annotated-Gene-and-mRNA.csv \
 > 2019-08-14-All-Annotated-Gene-and-mRNA-Unique.csv

I was left with ~34,000 lines, so under half of all mRNA were from duplicate genes. I appended Uniprot Accession codes and GOterms to each line and saved the file here.

Issues with full_join

To match loci in coverage files with the genes they’re in, Hollie used full_join from dplyr, then %>% filter to retain loci that were within the gene’s start and end positions. I tried doing the same thing and got an error:

 temp <- full_join(sample1, geneAnnotationsCondensedUniprotGOterms, by="gene.chr") %>%
   filter(start >= gene.start & start <= gene.end)

 Error in full_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches, environment()) :
   negative length vectors are not allowed

Baffled, I went to Stack Overflow and found this could happen because the resultant table has too many lines. I posted this issue, in which Sam suggested I try the same code with subsets of my input files, and try the code on ostrich since it has more RAM. On genefish, I tried running the full_join using data from one chromosome in a sample, but it didn’t work. I tried running full samples on ostrich but that didn’t work either. Finally, I got code to work on ostrich using one chromosome from a sample and the annotated gene list with Uniprot Accession codes and GOterms. I had to use merge instead of full_join because dplyr is not compatible with that version of R, which also meant I couldn’t use %>% to pipe the table into filter and needed a separate filter step. Two hours after I started running the merge code, it still wasn’t done! Fed up, I was reminded of my best friend: bedtools.

Using intersectBed

Why did I spend time trying to figure out an R workaround when I could use intersectBed — a tool made for this exact purpose! I quickly found there was one caveat with intersectBed. I couldn’t use the coverage file as is, since it contained numerical values that were not positions on the chromosome. I needed to remove the percent methylation column from each file, run intersectBed, then add the percent methylation information back in.

I used the following code to isolate the first three columns (chromosome, start, end) from each coverage file. I included \t to ensure the file will be tab-delimited (requirement for bedtools):

```{bash}
#For each file that ends in bedgraph
#Print the first three columns, tab-delimited, and save the columns in a new file that has the same base name and ends in posOnly
for f in *bedgraph
do
  awk '{print $1"\t"$2"\t"$3}' ${f} > ${f}.posOnly
done
```

With my newly created files, I was able to run `intersectBed`:

```{bash}
#For each file that ends in posOnly
#Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to annotated genes
#wb: Print all lines in the second file
#a: file that ends in posOnly
#b: annotated gene list
#Save output in a new file that has the same base name and ends in -Annotated.txt
for f in *posOnly
do
  /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
  -wb \
  -a ${f} \
  -b 2019-08-14-Annotated-Gene-Longest-mRNA-Uniprot-GOterms.txt \
  > ${f}-Annotated.txt
done
```

Adding back percent methylation information

After annotating coverage files with intersectBed, I need to add percent methylation information back to the annotated files to calculate median percent methylation by gene. I could import 20 files into R, or I could use a series of bash commands. There are two steps, since Linux join does not allow you to join by multiple fields. The first step involves creating a new column with the information from the multiple columns you want to match in both files. The second step is to join.

Both my original coverage files and annotated coverage files have the same first three columns: chromosome, start, and end for each CpG locus. For all 20 files, I created a new column with information formatted as chromosome-start-end:
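The key-creation step could look something like this; the input file below is made up, and the real column layout may differ:

```shell
# Hypothetical bedgraph-style input (chr, start, end, percentMeth), tab-delimited.
printf 'chr1\t5\t10\t75.0\nchr1\t1\t4\t50.0\n' > sample.bedgraph

# Build a chr-start-end key in column 1, keep the original columns after it,
# and sort on the key so the file is ready for join.
awk -F'\t' '{print $1 "-" $2 "-" $3 "\t" $0}' sample.bedgraph \
| sort -k1,1
```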

Then, I used join, specifying the columns I just created:

 for f in *bedgraph
 do
   join -j 1 -t $'\t' \
   ${f}.sorted \
   ${f}.posOnly-Annotated.txt.sorted \
   > ${f}.annotated-percentMeth
 done

I used wc -l to check that the number of lines in the annotated files are the same as the number of lines in the files I just generated. There were no lines lost between the file pairs, so this part of my DMG analysis is done!
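A sketch of that line-count check as a loop, with hypothetical file names standing in for the real annotated/joined pairs:

```shell
# Hypothetical pair of files that should have matching line counts.
printf 'a\nb\nc\n' > sample1.annotated
printf 'x\ny\nz\n' > sample1.annotated-percentMeth

# Compare line counts for each file pair; flag any mismatch.
for f in *.annotated
do
  before=$(wc -l < "$f" | tr -d ' ')
  after=$(wc -l < "${f}-percentMeth" | tr -d ' ')
  if [ "$before" -eq "$after" ]
  then
    echo "${f}: OK (${before} lines)"
  else
    echo "${f}: MISMATCH (${before} vs ${after})"
  fi
done
```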

Going forward

  1. Calculate median percent methylation
  2. Conduct t-tests by gene to identify DMG
  3. Gene enrichment for DMG with GO-MWU
  4. Update methods and results
  5. Update paper repository
  6. Outline the discussion
  7. Write the discussion
  8. Write the introduction
  9. Revise my abstract
  10. Share the draft with collaborators and get feedback
  11. Post the paper on bioRxiv
  12. Prepare the manuscript for publication


from the responsible grad student https://ift.tt/2TQkI77

Importing multiple files to R

Modified from this link.

coverageFiles <- list.files(pattern = "*bedgraph") #Create a file list for all 10 files to import
list2env(lapply(setNames(coverageFiles,
                         make.names(gsub("s1_R1_val_1_bismark_bt2_pe.deduplicated.bismark.cov_", "", coverageFiles))),
                read.table),
         envir = .GlobalEnv) #Import files with list2env. Use lapply to setNames of the files by taking all the common parts of their names out. Files will be named zr2096_#_5x.bedgraph