CG/CV transcriptomics

Helping Colleen with a comparative transcriptomics project for OSHV-infected and uninfected Pacific and eastern oysters to build skills and pipelines I’ll need for my RNASeq analysis later.

Revised WGCNA with GO-MWU analysis

Yaamini’s Notebook: MethCompare Part 24

Another (and hopefully final) iteration of union bedgraph analyses

Last night, Shelly pointed out that there were different versions of union begraphs in this issue. Steven originally created union bedgraphs, but in one of our meetings we realized that there were duplicate entries in each file. Shelly removed those duplicates and saved another version of files. I knew about these versions, but only used the older version with duplicates for my analysis! Even though my bedtools code was written such that overlaps with genome features are only listed once, the presence of duplicates would affect the proportions of highly methylated, moderately methylated, and lowly methylated loci. I changed the file path I used to source the union bedfiles in this Jupyter notebook and reran the script. Then, I took that output and ran it through this R Markdown script to generate my figures and count information. As expected, no drastic changes to the methylation status stacked barplots and no change at all to the genomic location barplots. I updated the figures in the manuscript and started working on the results. Now that we have the correct input and all the code is (hopefully) refined, perhaps I won’t have to run this again in the future…

Going forward

  1. Look into program for mCpG identification

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student

Yaamini’s Notebook: MethCompare Part 23

Calculating median percent methylation

We’re wrapping up our analyses. Hollie is looking at gene methylation by method and asked me to calculate median methylation by gene for each sample, since I did that before for C. virginica. I set out to do just that in this script. I started by testing code on M. capitata samples, then used the finalized code for P. acuta as well.

Looking at my files on gannet, I realized I never transferred the intermediate genonme feature overlap files! I started rerunning this Jupyter notebook to get the gene-sample overlap files. Even after I generated those files, I was missing percent methylation information. I decided to include percent methylation information in the intermediate BEDfiles before using intersectBED. After using intersectBed, I had chromosome, start, stop, and percent methylation information for each sample.

I returned to my R Markdown script to import the overlap files. Again, I realized I was missing something! I needed to associate CpG loci with genes. I used intersectBed to match CpGs with genes:

#For each sample-gene overlap file (ends in *bedgraph.bed-mcGenes) #Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to genes #wb: Print all lines in the second file #a: sample-gene overlap file #b: gene track #Save output in a new file that has the same base name and ends in -geneID for f in ../analyses/Characterizing-CpG-Methylation-5x/Mcap/*bedgraph.bed-mcGenes do /usr/local/bin/intersectBed \ -wb \ -a ${f} \ -b ../genome-feature-files/Mcap.GFFannotation.gene.gff \ > ${f}-geneID done 

I then took the files and moved them into a Median-Methylation-Calculations subdirectory. Next, I needed to import the nine files for each species into R. I used code I learned about when working with C. virginica to import all the files at once:

setwd("../analyses/Characterizing-CpG-Methylation-5x/Mcap/Median-Methylation-Calculations") #Set working directory within the notebook chunk for list.files to find the necessary files filesToImport <- list.files(pattern = "*geneID") #Create a file list for all 9 files to import. Only import overlaps for full samples (not divided by methylation status) list2env(lapply(setNames(filesToImport, make.names(gsub("_R1_001_val_1_bismark_bt2_pe._5x.bedgraph.bed-mcGenes-geneID", "", filesToImport))), read.delim, header = FALSE), envir = .GlobalEnv) #Import files to the .GlobalEnv with list2env. Use lapply to setNames of the files by taking all the common parts of their names out. Read files with read.delim and include header = FALSE. Files will be named Meth# head(Meth10) #Confirm import 

My next steps were simple: remove redundant columns, add column names, and use aggregate to calculate median percent methylation for each gene ID. That would amount to three lines of code for each sample, but I didn’t want to write an individual block of code for each sample! Based on this Stack Overflow post, I created a vector of dataframes, then pulled the dataframes out of the vector in a loop to format the files:

samplesMcap <- c("Meth10", "Meth11", "Meth12", "Meth13", "Meth14", "Meth15", "Meth16", "Meth17", "Meth18") #Create a vector of sample names 
for(sample in samplesMcap) { #For each sample listed in samplesMcap sample.tmp <- get(sample) #Extract sample based on vector contents sample.tmp <- sample.tmp[,-c(5:12)] #Remove extraneous columns colnames(sample.tmp) <- c("chr", "start", "stop", "percentMeth", "geneID") #Rename columns assign(sample, sample.tmp) #Replace sample with edited sample.tmp contents } head(Meth17) #Confirm formatting changes 

Finally, I used similar loops to calculate median percent methylation by gene ID and write out tab-delimited files for each sample:

for(sample in samplesMcap) { #For each sample listed in samplesMcap sample.tmp <- get(sample) #Extract sample based on vector contents sample.tmp <- aggregate(percentMeth ~ geneID, data = sample.tmp, FUN = median) #Use aggregate to group geneID and calculate median percent methylation assign(sample, sample.tmp) #Replace sample with edited sample.tmp contents } head(Meth17) #Confirm median methylation calculation 
for (i in 1:length(samplesMcap)) { #For each sample listed in samplesMcap sample <- get(samplesMcap[i]) #Extract sample based on vector contents fileName <- paste("../analyses/Characterizing-CpG-Methylation-5x/Mcap/Median-Methylation-Calculations/", samplesMcap[i], "-Median-Methylation", ".txt", sep = "") #Assign filename for each sample write.table(sample, file = fileName, sep = "\t", row.names = FALSE, col.names = TRUE) #Write out files into the Median-Methylation-Calculations subdirectory } 

The trick of putting dataframes in a vector to run the same commands on them is super useful. Definitely using that in the future. I posted the links to the M. capitata and P. acuta subdirectories in this issue for Hollie to use.

Aside from calculating methylation, I also modified figures based on suggestions from Hollie and updated the methods section. Not much left to work on (hopefully)!

Going forward

  1. Update results
  2. Look into program for mCpG identification

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student

WGCNA code

Yaamini’s Notebook: July 2020 Goals


Labs are opening back up so I’m wrapping up my bioinformatics time and transitioning into pipetting time.

JUne Goals Recap:

Coral Methylation Comparison:


  • Made progress on side projects (ocean acidification and reproduction review and eelgrass transcriptomics)
  • Collaboratively wrote a proposal for a DEI-themed Bevan Series
  • Put together a talk for the Friday Harbor Labs Summer Virtual Seminars!

July Goals

Coral Methylation Comparison:

  • Update the methods and results in the paper
  • Add to the discussion points

Gigas and Virginica Labwork:

  • Review and update health and safety plan
  • Pitch plan to Katie and Alan
  • Create an action plan for samples
  • Take inventory of materials and start ordering what’s needed


  • Wrap up side projects (ocean acidification and reproduction review and eelgrass transcriptomics)

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student

Yaamini’s Notebook: MethCompare Part 22

Checking stacked barplots and post-hoc tests

Previously I remade M. capitata genome tracks and my genomic location stacked barplot. Something looked funky, but I tabled my investigation until later. Later is now, and my goal is to resolve that issue and finalize post-hoc tests to understand GLM output for methylation status and genomic location.

M. capitata genomic location

The first thing I did was check my bedtools code in this Jupyter notebook. That code looked good, so I figured there must have been an issue when I processed line count output in this R Markdown script. Before I began, I cleared by environment. I then reran the script to produce the overlap counts and overlap percents tables. Looking at the tables, WGBS detected 10% of CpGs that overlapped with CDS, while MBD-BS detected 14% of CpGs overlapping with CDS. This was not reflected in the stacked barplot I made! I ran the code fo my barplot and saw that it now reflected those percentages. My hunch is that clearing my environment did the trick. Since my union data and individual data scripts use the same dataframe names, I think the barplot I made last week called the correct dataframe name, but the content was for individual samples. I posted the revised barplot in this issue and moved on.

Screen Shot 2020-06-29 at 1 50 21 PM

Figure 1. M. capitata and P. acuta overlaps with various genome features (pdf here).

Post-hoc tests

I felt good about my PCoA, perMANOVAs, and models, but I still didn’t have all the information I needed. For each CpG methylation status or genomic feature overlap, the models test WGBS vs. RRBS or WGBS vs. MBD-BS, since WGBS is the base condition. However, the models don’t compare RRBS with MBD-BS. I wasn’t sure what post-hoc test to run, so I asked Evan. He suggested I use emmeans to run post-hoc tests; I did so in this script. The package runs estimated marginal means tests on each model and corrects p-values. I can then obtain log odds ratio test output and corrected p-values with a FDR. The command itself is very simple:

emmeans(McapCpGHighModel, pairwise ~ seqMethod, adjust = "FDR") 

The output provides confidence intervals for proportion overlap for each method on the logit scale, as well as log odds ratio results. I only wanted the test results since we’re not concerned with absolute proportions. To do this, I specified $contrasts. I also took the output with adjusted p-values and saved it as a dataframe:

McapCpGHighPostHoc <- data.frame(emmeans(McapCpGHighModel, pairwise ~ seqMethod, adjust = "FDR")$contrasts) #Run pairwise comparisons (estimated marginal means). Obtain log odd ratio results and not confidence intervals for individual methods in a dataframe format. Specify FDR instead of Tukey post-hoc test (default) head(McapCpGHighPostHoc) #Look at log odd ratio results 

I used rbind to combine statistical output for each model:

M. capitata:

P. acuta:

After I knit my script, I shared the output in this issue and on Slack. I still think Meth 8 may be an outlier, and I don’t know if the PCoAs should be included with the stacked barplot as a multipanel plot, or if they should just go in the supplement.

I think that’s it for my statistical analyses! Based on what we want for figures I can go back and make the PCoAs look pretty. Now I need to understand these results and update the paper.

Going forward

  1. Update methods
  2. Update results
  3. Update figures
  4. Look into program for mCpG identification

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student

Yaamini’s Notebook: MethCompare Part 21

Refining multivariate and modeling approaches

During our meeting Friday, I realized I interpreted the data analysis task incorrectly. Instead of combining CpG methylation status and genomic location information to use for multivariate and modeling approaches, I needed to consider the data separately!

Remaking M. capitata tracks

Before I could revise my analysis, Hollie pointed out that the M. capitata gene track did not have the correct amount of genes in this issue. In addition to using genes with the AUGUSTUS tag, I needed to pull genes that had the GeMoMa tag (from a different related species). In this Jupyter notebook, I matched multiple patterns with grep to revise the gene, CDS, and intron tracks.

#Match both AUGUSTUS and GeMoMa gene annotations !grep -e "AUGUSTUS gene" -e "GeMoMa gene" Mcap.GFFannotation.gff > Mcap.GFFannotation.gene.gff 

I used these revised tracks to create my three flank tracks and intergenic region track. Since the files have the same name, I didn’t need to update my IGV sessions. Instead, I uploaded the revised tracks to gannet so they would populate IGV.

Re-running scripts

Now that I had revised tracks, I needed to go back to my Jupyter notebook and characterize M. capitata feature overlaps! In this Jupyter notebook I characterize CG motif overlaps with the revised tracks. Then, I characterized overlaps between individual samples and feature tracks. I did the same thing in this Jupyter notebook for 5x union data. For the union data, I recreated CpG genomic location stacked barplots with this R Markdown script.

Screen Shot 2020-06-22 at 9 52 49 AM

Figure 1. *CpG genomic location.

Looking at my output, it no longer looks like MBD-BS is no longer the best for capturing gene regions. Since I ran this code quickly and late at night, I need to double check that this stacked barplot is trustworthy.

Revised statistical analyses

Now, the big stuff. I took the output for individual samples and recreated my summary tables in this R Markdown script. Based on feedback from Friday, I ran individual NMDS, ANOSIM, and model sets for the CpG methylation status and genomic location datasets. Similar to the methylation status models, I ran individual models for each genome feature since all the proportions add up to 1. I included replicate information as a random effect.

I ran my analyses and questions by Evan, and he suggested I switch over from NMDS and ANOSIM to PCoA and perMANOVA. Apparently, the creators of the vegan package recommend this. The perMANOVA has a more robust statistical output format as well. I kept my centered log-ratio transformation and euclidean distance matrix, but fed it into cmdscale to run a PCoA instead. I investigated the PCs, eigenvalues, and loadings. I then used adonis to run a global perMANOVA. To understand if a significant perMANOVA result was due to centroid or variance differences between groups, I ran a beta dispersion model. A non-significant result indicates that differences are due to centroids, not variance. After my global perMANOVA, I ran pairwise perMANOVA test to understand the significant global perMANOVA result. Similar to when I was running ANOSIM, I ended up with complete enumeration. However, it was easier to interpret the perMANOVA output and see how much variance the test explained. After running my regressions, I extracted p-values and corrected them using a false discovery rate.

Although most of my questions are addressed, I still have two more. When running the beta dispersion model for my P. acuta data, I found that group variances were significantly different from eachother! My guess is that sample 8 is an influential point or outlier that is contributing to variance differences between groups. I need to investigate this further. While I didn’t need to make any changes to the models (so nice!), the models are set up in a way that compares WGBS to RRBS and MBD-BS and not compare the two enrichment methods. I’m not sure which post-hoc tests I need to use to compare these two methods. I knit the output into this document, emailed Evan, and posted my follow-up questions here. The end is (hopefully) near!

Going forward

  1. Finalize analysis
  2. Check that union stacked barplots were generated correctly
  3. Update methods
  4. Update results
  5. Update figures
  6. Look into program for mCpG identification

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student

Yaamini’s Notebook: MethCompare Part 20

Preliminary follow-up GLMs

Previously, I tried using NMDS and ANOSIM to understand if sequencing method influenced differences in proportions of CpGs in various methylation statuses or genomic locations. I’m still waiting on a response to some questions I had, but in the meantime I’ll try using generalized linear models to get at some of these differences. Since we have small sample sizes, using a combination of approaches gives us more power. Specifically I will be testing if sequencing method or genomic location influences the proportion of CpGs in a particular methylation status.

In this script, I started creating my models. Evan suggested I use glmmTMB to run beta family models with a logit link. Since the proportions of highly, moderately, and lowly methylated CpGs add up to 1, I need to run separate models:

McapCpGHighModel <- glmmTMB(percentMeth ~ CDS + Introns + Upstream.Flanks + Downstream.Flanks + Intergenic + seqMethod, family = beta_family(link = "logit"), data = McapCpGMasterHigh) #Run the full model (all genomic locations) using a beta distribution and a logit link 

When I ran the full model, I got a convergence issue error. Digging into glmmTMB vignettes, it suggested that my model could be overparametrized. This made sense, since I have minimal data and five model parameters. I decided to run two models: one for gene features and another for non-gene features. First, I ran a null model (without sequencing method) and a full model (with sequencing method):

McapCpGHighModel.GeneNull <- glmmTMB(percentMeth ~ CDS + Introns, family = beta_family(link = "logit"), data = McapCpGMasterHigh) #Run null model (without sequencing method) using a beta distribution and a logit link McapCpGHighModel.Gene <- glmmTMB(percentMeth ~ CDS + Introns + seqMethod, family = beta_family(link = "logit"), data = McapCpGMasterHigh) #Run model using a beta distribution and a logit link 

Then, I compared these models with a likelihood ratio test:

McapCpGHighModel.GeneANOVA <- anova(McapCpGHighModel.Gene, McapCpGHighModel.GeneNull, test = "LRT") #Compare the null and treatment models with a likelihood ratio test. McapCpGHighModel.GeneANOVA #Look at ANOVA output 

Finally, I looked at the summary output for the best model using summary. Usualy, the best-fit model was one that included the sequencing method. I’m not sure how (or if) to compare the gene and non-gene models, but AIC might be useful in that respect.

Still have questions, but at least it’s something to discuss at our meeting!

Going forward

  1. Finalize analysis
  2. Locate TE tracks
  3. [Characterize intersections between data and TE, and create summary tables]
  4. Look into program for mCpG identification

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student

Sam’s Notebook: Metagenomics – Data Extractions Using MEGAN6

Decided to finally take the time to methodically extract data from our metagenomics project so that I have the tables handy when I need them and I can easily share them with other people. Previously, I hadn’t done this due to limitations on looking at the data remotely. I finally downloaded all of the RMA6 files from 20191014 after being fed up with the remote desktop connection and upgrading the size of my hard drive (5 of the six RMA6 files are >40GB in size).

Here’s an explanation of what was done and how the output files (see RESULTS section below) are named.

  • All RMA6 files were imported Files using absolute read counts.
  • For taxonomic extraction, data was extracted at the Class level.
  • All output files were generated using the “summarized” option when exporting, meaning that all the reads at and below the terminal taxonomic class were summed to generate the counts for a given Class.
  • All output files are tab-delimited.

Files are named in the following fashions:


  • abs-reads_abs-comparison_: Files were imported using absolute read counts and comparison was run using absolute read counts.
  • abs-reads-ind-samples_: Files were imported using absolute read counts and individual samples were analyzed.


  • day: Data is grouped by day.
  • pH: Data is grouped by pH.


  • interpro2go: Gene ontology assignments.
  • reads: Read counts.
  • PathToCount: Shows full “path” (either GO terms or taxonomy) leading to, and including, the terminal term and corresponding summarized counts (i.e. sum of all reads at terminal term and all below that term) of reads assigned to the terminal term.
  • PathToPercent: Shows full “path” (either GO terms or taxonomy) leading to, and including, the terminal term and corresponding percentage of summarized counts (i.e. sum of all reads at terminal term and all below that term) of reads assigned to the terminal term. This is percentage of all reads assigned within the designated group/sample. It is NOT a percentage of all reads in the entire experiment!!

Here’s how the sample names breakdown:

Sample Developmental Stage (days post-fertilization) pH Treatment
MG1 13 8.2
MG2 17 8.2
MG3 6 7.1
MG5 10 8.2
MG6 13 7.1
MG7 17 7.1

NOTE: Days used in analysis correspond to Emma’s day conversion:

Date Rhonda’s day Emma’s day
5/15 6 1
5/19 10 5
5/22 13 8
5/26 17 12