https://github.com/eimd-2019/project-EWD-transcriptomics/blob/master/analyses/WGCNA/WGCNA.md
Monthly Archives: July 2020
Yaamini’s Notebook: MethCompare Part 24
Another (and hopefully final) iteration of union bedgraph analyses
Last night, Shelly pointed out that there were different versions of union begraphs in this issue. Steven originally created union bedgraphs, but in one of our meetings we realized that there were duplicate entries in each file. Shelly removed those duplicates and saved another version of files. I knew about these versions, but only used the older version with duplicates for my analysis! Even though my bedtools
code was written such that overlaps with genome features are only listed once, the presence of duplicates would affect the proportions of highly methylated, moderately methylated, and lowly methylated loci. I changed the file path I used to source the union bedfiles in this Jupyter notebook and reran the script. Then, I took that output and ran it through this R Markdown script to generate my figures and count information. As expected, no drastic changes to the methylation status stacked barplots and no change at all to the genomic location barplots. I updated the figures in the manuscript and started working on the results. Now that we have the correct input and all the code is (hopefully) refined, perhaps I won’t have to run this again in the future…
Going forward
- Look into program for mCpG identification
Please enable JavaScript to view the comments powered by Disqus.
from the responsible grad student https://ift.tt/2ZTcgay
via IFTTT
Yaamini’s Notebook: MethCompare Part 23
Calculating median percent methylation
We’re wrapping up our analyses. Hollie is looking at gene methylation by method and asked me to calculate median methylation by gene for each sample, since I did that before for C. virginica. I set out to do just that in this script. I started by testing code on M. capitata samples, then used the finalized code for P. acuta as well.
Looking at my files on gannet
, I realized I never transferred the intermediate genonme feature overlap files! I started rerunning this Jupyter notebook to get the gene-sample overlap files. Even after I generated those files, I was missing percent methylation information. I decided to include percent methylation information in the intermediate BEDfiles before using intersectBED
. After using intersectBed
, I had chromosome, start, stop, and percent methylation information for each sample.
I returned to my R Markdown script to import the overlap files. Again, I realized I was missing something! I needed to associate CpG loci with genes. I used intersectBed
to match CpGs with genes:
#For each sample-gene overlap file (ends in *bedgraph.bed-mcGenes) #Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to genes #wb: Print all lines in the second file #a: sample-gene overlap file #b: gene track #Save output in a new file that has the same base name and ends in -geneID for f in ../analyses/Characterizing-CpG-Methylation-5x/Mcap/*bedgraph.bed-mcGenes do /usr/local/bin/intersectBed \ -wb \ -a ${f} \ -b ../genome-feature-files/Mcap.GFFannotation.gene.gff \ > ${f}-geneID done
I then took the files and moved them into a Median-Methylation-Calculations subdirectory. Next, I needed to import the nine files for each species into R. I used code I learned about when working with C. virginica to import all the files at once:
setwd("../analyses/Characterizing-CpG-Methylation-5x/Mcap/Median-Methylation-Calculations") #Set working directory within the notebook chunk for list.files to find the necessary files filesToImport <- list.files(pattern = "*geneID") #Create a file list for all 9 files to import. Only import overlaps for full samples (not divided by methylation status) list2env(lapply(setNames(filesToImport, make.names(gsub("_R1_001_val_1_bismark_bt2_pe._5x.bedgraph.bed-mcGenes-geneID", "", filesToImport))), read.delim, header = FALSE), envir = .GlobalEnv) #Import files to the .GlobalEnv with list2env. Use lapply to setNames of the files by taking all the common parts of their names out. Read files with read.delim and include header = FALSE. Files will be named Meth# head(Meth10) #Confirm import
My next steps were simple: remove redundant columns, add column names, and use aggregate
to calculate median percent methylation for each gene ID. That would amount to three lines of code for each sample, but I didn’t want to write an individual block of code for each sample! Based on this Stack Overflow post, I created a vector of dataframes, then pulled the dataframes out of the vector in a loop to format the files:
samplesMcap <- c("Meth10", "Meth11", "Meth12", "Meth13", "Meth14", "Meth15", "Meth16", "Meth17", "Meth18") #Create a vector of sample names
for(sample in samplesMcap) { #For each sample listed in samplesMcap sample.tmp <- get(sample) #Extract sample based on vector contents sample.tmp <- sample.tmp[,-c(5:12)] #Remove extraneous columns colnames(sample.tmp) <- c("chr", "start", "stop", "percentMeth", "geneID") #Rename columns assign(sample, sample.tmp) #Replace sample with edited sample.tmp contents } head(Meth17) #Confirm formatting changes
Finally, I used similar loops to calculate median percent methylation by gene ID and write out tab-delimited files for each sample:
for(sample in samplesMcap) { #For each sample listed in samplesMcap sample.tmp <- get(sample) #Extract sample based on vector contents sample.tmp <- aggregate(percentMeth ~ geneID, data = sample.tmp, FUN = median) #Use aggregate to group geneID and calculate median percent methylation assign(sample, sample.tmp) #Replace sample with edited sample.tmp contents } head(Meth17) #Confirm median methylation calculation
for (i in 1:length(samplesMcap)) { #For each sample listed in samplesMcap sample <- get(samplesMcap[i]) #Extract sample based on vector contents fileName <- paste("../analyses/Characterizing-CpG-Methylation-5x/Mcap/Median-Methylation-Calculations/", samplesMcap[i], "-Median-Methylation", ".txt", sep = "") #Assign filename for each sample write.table(sample, file = fileName, sep = "\t", row.names = FALSE, col.names = TRUE) #Write out files into the Median-Methylation-Calculations subdirectory }
The trick of putting dataframes in a vector to run the same commands on them is super useful. Definitely using that in the future. I posted the links to the M. capitata and P. acuta subdirectories in this issue for Hollie to use.
Aside from calculating methylation, I also modified figures based on suggestions from Hollie and updated the methods section. Not much left to work on (hopefully)!
Going forward
- Update results
- Look into program for mCpG identification
Please enable JavaScript to view the comments powered by Disqus.
from the responsible grad student https://ift.tt/2ZdwSLo
via IFTTT
WGCNA code
Yaamini’s Notebook: July 2020 Goals
Labs are opening back up so I’m wrapping up my bioinformatics time and transitioning into pipetting time.
JUne Goals Recap:
Coral Methylation Comparison:
- Finalized statistical analyses!
- Revised
bedtools
code and remade M. capitata genome tracks to include GeMoMa information - Started updating the scripts README and the paper
Other:
- Made progress on side projects (ocean acidification and reproduction review and eelgrass transcriptomics)
- Collaboratively wrote a proposal for a DEI-themed Bevan Series
- Put together a talk for the Friday Harbor Labs Summer Virtual Seminars!
July Goals
Coral Methylation Comparison:
- Update the methods and results in the paper
- Add to the discussion points
Gigas and Virginica Labwork:
- Review and update health and safety plan
- Pitch plan to Katie and Alan
- Create an action plan for samples
- Take inventory of materials and start ordering what’s needed
Other:
- Wrap up side projects (ocean acidification and reproduction review and eelgrass transcriptomics)
Please enable JavaScript to view the comments powered by Disqus.
from the responsible grad student https://ift.tt/2NRvc3W
via IFTTT