Yaamini’s Notebook: MethCompare Part 22

Checking stacked barplots and post-hoc tests

Previously I remade M. capitata genome tracks and my genomic location stacked barplot. Something looked funky, but I tabled my investigation until later. Later is now, and my goal is to resolve that issue and finalize post-hoc tests to understand GLM output for methylation status and genomic location.

M. capitata genomic location

The first thing I did was check my bedtools code in this Jupyter notebook. That code looked good, so I figured there must have been an issue when I processed line count output in this R Markdown script. Before I began, I cleared my environment. I then reran the script to produce the overlap counts and overlap percents tables. Looking at the tables, WGBS detected 10% of CpGs that overlapped with CDS, while MBD-BS detected 14% of CpGs overlapping with CDS. This was not reflected in the stacked barplot I made! I ran the code for my barplot and saw that it now reflected those percentages. My hunch is that clearing my environment did the trick. Since my union data and individual data scripts use the same dataframe names, I think the barplot I made last week called the correct dataframe name, but the content was for individual samples. I posted the revised barplot in this issue and moved on.


Figure 1. M. capitata and P. acuta overlaps with various genome features (pdf here).

Post-hoc tests

I felt good about my PCoA, perMANOVAs, and models, but I still didn’t have all the information I needed. For each CpG methylation status or genomic feature overlap, the models test WGBS vs. RRBS or WGBS vs. MBD-BS, since WGBS is the base condition. However, the models don’t compare RRBS with MBD-BS. I wasn’t sure what post-hoc test to run, so I asked Evan. He suggested I use emmeans to run post-hoc tests; I did so in this script. The package runs estimated marginal means tests on each model and corrects the p-values; I can then obtain log odds ratio test output and p-values corrected with an FDR. The command itself is very simple:

emmeans(McapCpGHighModel, pairwise ~ seqMethod, adjust = "FDR") 

The output provides confidence intervals for proportion overlap for each method on the logit scale, as well as log odds ratio results. I only wanted the test results since we’re not concerned with absolute proportions. To do this, I specified $contrasts. I also took the output with adjusted p-values and saved it as a dataframe:

McapCpGHighPostHoc <- data.frame(emmeans(McapCpGHighModel, pairwise ~ seqMethod, adjust = "FDR")$contrasts) #Run pairwise comparisons (estimated marginal means). Obtain log odds ratio results, not confidence intervals for individual methods, in a dataframe format. Specify FDR instead of the default Tukey adjustment
head(McapCpGHighPostHoc) #Look at log odds ratio results

I used rbind to combine statistical output for each model:

M. capitata:

P. acuta:
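The combining step can be sketched like this; the post-hoc dataframe names for the moderately and lowly methylated models are my guesses at the naming pattern, not necessarily the ones in the script:

```r
#Sketch of combining post-hoc output across models; the Mod/Low dataframe
#names follow the High naming pattern and are assumptions
McapCpGHighPostHoc$status <- "High" #Label each contrast table with its methylation status
McapCpGModPostHoc$status <- "Moderate"
McapCpGLowPostHoc$status <- "Low"
McapCpGPostHocResults <- rbind(McapCpGHighPostHoc,
                               McapCpGModPostHoc,
                               McapCpGLowPostHoc) #Stack the labeled contrast tables
head(McapCpGPostHocResults) #Confirm the combined output
```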

After I knit my script, I shared the output in this issue and on Slack. I still think Meth 8 may be an outlier, and I don’t know if the PCoAs should be included with the stacked barplot as a multipanel plot, or if they should just go in the supplement.

I think that’s it for my statistical analyses! Based on what we want for figures I can go back and make the PCoAs look pretty. Now I need to understand these results and update the paper.

Going forward

  1. Update methods
  2. Update results
  3. Update figures
  4. Look into program for mCpG identification


from the responsible grad student https://ift.tt/2VsCv68

Yaamini’s Notebook: MethCompare Part 21

Refining multivariate and modeling approaches

During our meeting Friday, I realized I interpreted the data analysis task incorrectly. Instead of combining CpG methylation status and genomic location information to use for multivariate and modeling approaches, I needed to consider the data separately!

Remaking M. capitata tracks

Before I could revise my analysis, Hollie pointed out in this issue that the M. capitata gene track did not have the correct number of genes. In addition to using genes with the AUGUSTUS tag, I needed to pull genes that had the GeMoMa tag (from a different related species). In this Jupyter notebook, I matched multiple patterns with grep to revise the gene, CDS, and intron tracks.

#Match both AUGUSTUS and GeMoMa gene annotations
!grep -e "AUGUSTUS gene" -e "GeMoMa gene" Mcap.GFFannotation.gff > Mcap.GFFannotation.gene.gff

I used these revised tracks to create my three flank tracks and intergenic region track. Since the files have the same name, I didn’t need to update my IGV sessions. Instead, I uploaded the revised tracks to gannet so they would populate IGV.

Re-running scripts

Now that I had revised tracks, I needed to go back to my Jupyter notebook and characterize M. capitata feature overlaps! In this Jupyter notebook, I characterized CG motif overlaps with the revised tracks. Then, I characterized overlaps between individual samples and feature tracks. I did the same thing in this Jupyter notebook for 5x union data. For the union data, I recreated CpG genomic location stacked barplots with this R Markdown script.


Figure 1. CpG genomic location.

Looking at my output, it no longer looks like MBD-BS is the best method for capturing gene regions. Since I ran this code quickly and late at night, I need to double-check that this stacked barplot is trustworthy.

Revised statistical analyses

Now, the big stuff. I took the output for individual samples and recreated my summary tables in this R Markdown script. Based on feedback from Friday, I ran individual NMDS, ANOSIM, and model sets for the CpG methylation status and genomic location datasets. Similar to the methylation status models, I ran individual models for each genome feature since all the proportions add up to 1. I included replicate information as a random effect.

I ran my analyses and questions by Evan, and he suggested I switch from NMDS and ANOSIM to PCoA and perMANOVA. Apparently, the creators of the vegan package recommend this. The perMANOVA also has a more robust statistical output format. I kept my centered log-ratio transformation and euclidean distance matrix, but fed it into cmdscale to run a PCoA instead. I investigated the PCs, eigenvalues, and loadings. I then used adonis to run a global perMANOVA. To understand if a significant perMANOVA result was due to centroid or variance differences between groups, I ran a beta dispersion model. A non-significant result indicates that differences are due to centroids, not variance. After my global perMANOVA, I ran pairwise perMANOVA tests to understand the significant global perMANOVA result. Similar to when I was running ANOSIM, I ended up with complete enumeration. However, it was easier to interpret the perMANOVA output and see how much variance the test explained. After running my regressions, I extracted p-values and corrected them using a false discovery rate.
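As a rough sketch, the workflow above looks something like this in R. The object names are illustrative, and I’m assuming a samples-by-proportions matrix plus a seqMethod grouping factor:

```r
library(compositions) #For clr
library(vegan) #For adonis and betadisper

McapTrans <- data.frame(clr(McapCpGMultiMaster)) #Centered log-ratio transformation
McapDist <- dist(McapTrans, method = "euclidean") #Euclidean distance matrix

McapPCoA <- cmdscale(McapDist, eig = TRUE) #Run the PCoA and keep eigenvalues
McapPCoA$eig / sum(McapPCoA$eig) #Proportion of variance explained by each axis

adonis(McapDist ~ seqMethod) #Global perMANOVA
anova(betadisper(McapDist, seqMethod)) #Beta dispersion: a non-significant result means centroids, not variances, drive the perMANOVA result
```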

Although most of my questions are addressed, I still have two more. When running the beta dispersion model for my P. acuta data, I found that group variances were significantly different from each other! My guess is that sample 8 is an influential point or outlier that is contributing to variance differences between groups. I need to investigate this further. While I didn’t need to make any changes to the models (so nice!), the models are set up to compare WGBS to RRBS and MBD-BS, but they don’t compare the two enrichment methods. I’m not sure which post-hoc tests I need to use to compare these two methods. I knit the output into this document, emailed Evan, and posted my follow-up questions here. The end is (hopefully) near!

Going forward

  1. Finalize analysis
  2. Check that union stacked barplots were generated correctly
  3. Update methods
  4. Update results
  5. Update figures
  6. Look into program for mCpG identification


from the responsible grad student https://ift.tt/31bzfQn

Yaamini’s Notebook: MethCompare Part 20

Preliminary follow-up GLMs

Previously, I tried using NMDS and ANOSIM to understand if sequencing method influenced differences in proportions of CpGs in various methylation statuses or genomic locations. I’m still waiting on a response to some questions I had, but in the meantime I’ll try using generalized linear models to get at some of these differences. Since we have small sample sizes, using a combination of approaches gives us more power. Specifically I will be testing if sequencing method or genomic location influences the proportion of CpGs in a particular methylation status.

In this script, I started creating my models. Evan suggested I use glmmTMB to run beta family models with a logit link. Since the proportions of highly, moderately, and lowly methylated CpGs add up to 1, I need to run separate models:

McapCpGHighModel <- glmmTMB(percentMeth ~ CDS + Introns + Upstream.Flanks + Downstream.Flanks + Intergenic + seqMethod, family = beta_family(link = "logit"), data = McapCpGMasterHigh) #Run the full model (all genomic locations) using a beta distribution and a logit link 

When I ran the full model, I got a convergence error. Digging into the glmmTMB vignettes, I found that my model could be overparameterized. This made sense, since I have minimal data and five model parameters. I decided to run two models: one for gene features and another for non-gene features. First, I ran a null model (without sequencing method) and a full model (with sequencing method):

McapCpGHighModel.GeneNull <- glmmTMB(percentMeth ~ CDS + Introns, family = beta_family(link = "logit"), data = McapCpGMasterHigh) #Run the null model (without sequencing method) using a beta distribution and a logit link
McapCpGHighModel.Gene <- glmmTMB(percentMeth ~ CDS + Introns + seqMethod, family = beta_family(link = "logit"), data = McapCpGMasterHigh) #Run the full model using a beta distribution and a logit link

Then, I compared these models with a likelihood ratio test:

McapCpGHighModel.GeneANOVA <- anova(McapCpGHighModel.Gene, McapCpGHighModel.GeneNull, test = "LRT") #Compare the null and treatment models with a likelihood ratio test
McapCpGHighModel.GeneANOVA #Look at ANOVA output

Finally, I looked at the summary output for the best model using summary. Usually, the best-fit model was the one that included sequencing method. I’m not sure how (or if) to compare the gene and non-gene models, but AIC might be useful in that respect.
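If AIC does turn out to be appropriate, the comparison itself is a one-liner; the non-gene model name here is a placeholder, since only the gene models are shown above:

```r
#Compare relative fit of the gene and non-gene models (lower AIC = better fit);
#McapCpGHighModel.NonGene is a placeholder name for the non-gene feature model
AIC(McapCpGHighModel.Gene, McapCpGHighModel.NonGene)
```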

Still have questions, but at least it’s something to discuss at our meeting!

Going forward

  1. Finalize analysis
  2. Locate TE tracks
  3. [Characterize intersections between data and TE, and create summary tables]
  4. Look into program for mCpG identification


from the responsible grad student https://ift.tt/3hR8mqQ

Sam’s Notebook: Metagenomics – Data Extractions Using MEGAN6

Decided to finally take the time to methodically extract data from our metagenomics project so that I have the tables handy when I need them and can easily share them with other people. Previously, I hadn’t done this due to limitations on looking at the data remotely. I finally downloaded all of the RMA6 files from 20191014 after getting fed up with the remote desktop connection and upgrading the size of my hard drive (five of the six RMA6 files are >40GB in size).

Here’s an explanation of what was done and how the output files (see RESULTS section below) are named.

  • All RMA6 files were imported using absolute read counts.
  • For taxonomic extraction, data was extracted at the Class level.
  • All output files were generated using the “summarized” option when exporting, meaning that all the reads at and below the terminal taxonomic class were summed to generate the counts for a given Class.
  • All output files are tab-delimited.

Files are named in the following fashions:


  • abs-reads_abs-comparison_: Files were imported using absolute read counts and comparison was run using absolute read counts.
  • abs-reads-ind-samples_: Files were imported using absolute read counts and individual samples were analyzed.


  • day: Data is grouped by day.
  • pH: Data is grouped by pH.


  • interpro2go: Gene ontology assignments.
  • reads: Read counts.
  • PathToCount: Shows full “path” (either GO terms or taxonomy) leading to, and including, the terminal term and corresponding summarized counts (i.e. sum of all reads at terminal term and all below that term) of reads assigned to the terminal term.
  • PathToPercent: Shows full “path” (either GO terms or taxonomy) leading to, and including, the terminal term and corresponding percentage of summarized counts (i.e. sum of all reads at terminal term and all below that term) of reads assigned to the terminal term. This is percentage of all reads assigned within the designated group/sample. It is NOT a percentage of all reads in the entire experiment!!

Here’s how the sample names break down:

Sample | Developmental Stage (days post-fertilization) | pH Treatment
MG1    | 13 | 8.2
MG2    | 17 | 8.2
MG3    | 6  | 7.1
MG5    | 10 | 8.2
MG6    | 13 | 7.1
MG7    | 17 | 7.1

NOTE: Days used in analysis correspond to Emma’s day conversion:

Date | Rhonda’s day | Emma’s day
5/15 | 6  | 1
5/19 | 10 | 5
5/22 | 13 | 8
5/26 | 17 | 12

Grace’s Notebook: Crab Analyses Progress

I made some progress in getting new results for the crab part of my thesis. I ran DESeq2 on libraries 8-11 based on infection, but taking temperature into account. I then took those results and contrasted the two temperature treatments to get the effect of temperature on infection gene expression. I am also trying to do a time series with the individual crab RNAseq, and that’s been a bit tricky: in some cases we don’t have any replicates, and in ones where we do, there are three time points and I’m struggling to figure out how to analyze them. Links to Rmd and results files are in this post:

DESeq2 on libraries 8-11 based on infection taking temperature into account:

Rmd: scripts/DESeq-libraries8-11.Rmd
.HTML preview: /scripts/DESeq-libraries8-11.htm

Heatmaps are viewable in the .html linked above.

Infection comparison taking temperature into account
DEGlist (.tab)

DEGlist (.txt with Trinity_ID column):

With count data:

Annotated with transcriptome v. 1.5 blast output and uniprot-GO:

Contrast temperature treatments from the above results:

DEGlist (.tab)

DEGlist (.txt)

With count data:

Annotated with transcriptome v. 1.5 blast output and uniprot-GO:

Time series with individual crab RNAseq data attempts:

Rmd: scripts/DESeq-time-series.Rmd

.HTML: scripts/DESeq-time-series.html

To Do for time series:

  • try to figure out how to plot gene counts of crab ABC over time
  • Gene list for ABC vs GHI (I think I have this… but because it’s two temperatures at two time points, I have to do what I did above for libraries 8-11: run the initial DESeq, then do a contrast to get the final gene list)
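A rough sketch of that two-step approach (fit with both factors in the design, then pull contrasts), assuming a count matrix and a colData table with temperature and infection columns; all object and level names here are illustrative, not the ones in the actual Rmds:

```r
library(DESeq2)

#Illustrative sketch: crabCounts, crabColData, and the factor levels are assumptions
dds <- DESeqDataSetFromMatrix(countData = crabCounts,
                              colData = crabColData,
                              design = ~ temperature + infection) #Model infection while accounting for temperature
dds <- DESeq(dds)
resInfection <- results(dds, contrast = c("infection", "infected", "uninfected")) #Effect of infection
resTemperature <- results(dds, contrast = c("temperature", "elevated", "ambient")) #Contrast the temperature treatments
```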

from Grace’s Lab Notebook https://ift.tt/3dcVsA1

Yaamini’s Notebook: MethCompare Part 19

Multivariate analysis for compositional data

Last week I reached out to Evan and Julian to see if they could help me decide on a statistical approach. I posted Evan’s suggestions in this issue. In short, he suggested pairing multivariate and generalized linear model approaches, giving me more power to discern if sequencing method influences CpG detection in genome features or methylation status.

Data transformation

I opened this R Markdown script to get started. To create my dataframe, I used cbind to combine the CpG type and location percentages for each sample. I used the genomic location data for the full sample, not the individual columns that gave the proportion of a specific methylation status in that genomic location. I figured the point of a multivariate analysis would be to determine if these variables are related in any sense, so I didn’t want to add any variables that would be redundant.

Since I was working with proportion data, Evan suggested I borrow methods from compositional data analysis techniques. To do this, I used a centered log-ratio transformation on my matrix:

McapCpGMultiTrans <- data.frame(clr(McapCpGMultiMaster)) #Use centered log-ratio transformation 

This is where I ran into my first point of confusion. Looking over Julian’s email, he suggested I use a Hellinger’s transformation on my data to give low weights to smaller proportions:

McapCpGMultiTrans <- data.trans(McapCpGMultiMaster, method = "hellingers", plot = FALSE) #Hellinger (asymmetric) transformation 

When I tried this code, I got a data matrix filled with N/As! I wasn’t sure why this was happening, so I decided to just continue with the CLR transformation and ask Evan which method would be more suitable later.
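For reference, vegan also implements the Hellinger transformation through decostand, which may be worth trying if data.trans keeps returning N/As; the dataframe name is illustrative, and the input needs to be all-numeric:

```r
library(vegan)

#Hellinger transformation: square root of row-normalized proportions.
#One thing to check if N/As appear: whether a non-numeric column
#(e.g., sample names) snuck into the input matrix
McapCpGMultiHel <- decostand(McapCpGMultiMaster, method = "hellinger")
```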


Although we already had stacked barplots for visualization, I wanted to go through the exercise of creating an NMDS to look at any influential loadings. I followed the same process I used for my DNR paper:

  1. Create a scree plot to determine the optimal number of axes
  2. Confirm that the number of axes returns a low stress value using a Monte Carlo permutation
  3. Calculate loadings
  4. Plot an NMDS with loadings that have p-values < 0.05
  5. Conduct global and pairwise ANOSIM tests
  6. Conduct post-hoc tests for significant ANOSIM if necessary
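Steps 2-5 map onto vegan functions roughly like this, assuming the transformed samples-by-variables matrix and a seqMethod grouping factor; object names are illustrative:

```r
library(vegan)

McapNMDS <- metaMDS(McapTrans, distance = "euclidean", k = 2, autotransform = FALSE) #Run the NMDS
McapNMDS$stress #Confirm a low stress value
McapFit <- envfit(McapNMDS, McapTrans, permutations = 999) #Calculate loadings
plot(McapNMDS) #Plot the ordination
plot(McapFit, p.max = 0.05) #Overlay only loadings with p < 0.05
anosim(dist(McapTrans), grouping = seqMethod) #Global ANOSIM
```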

At steps 1 and 2, I got errors suggesting I had insufficient data: I was unable to create a scree plot, and my stress value was very close to zero. I recorded these errors but decided to proceed.

For M. capitata, all loadings were significant below 0.05, but for P. acuta, CDS was not an influential loading! Interesting. When I created my NMDS, all samples for the same method were plotted on top of each other for M. capitata. For P. acuta, my WGBS and RRBS samples plotted on top of each other. However, my MBD-BS samples were separated, and sample 8 looked like an outlier within that group.

Related to my lack of sufficient data, I was able to conduct a global ANOSIM for each species, but I got errors suggesting I had insufficient data when I investigated the significant global ANOSIM results with pairwise tests. During the pairwise tests, all possible permutations were conducted (“complete enumeration”). My pairwise tests were not significant, but since my global ANOSIM was, I have a feeling that I do not have enough statistical power to investigate that difference using this multivariate method.

I knit the script into this document and shared it with Evan to get his take on my insufficient data and transformation issues. Most importantly, I wonder if his original suggestion to pair multivariate and generalized linear model approaches is still the best option considering I am data limited.

Going forward

  1. Conduct glm analysis and revise multivariate analysis
  2. Locate TE tracks
  3. [Characterize intersections between data and TE, and create summary tables]
  4. Look into program for mCpG identification


from the responsible grad student https://ift.tt/2zE4zvL

Shelly’s Notebook:

from Shelly A. Trigg, Ph.D. https://ift.tt/2BaWVte

Grace’s Notebook: June 2020 Goals

Writing this halfway through the month of June, but it’s ok.


Successfully defended on June 2nd!!

To dos:

  • write an intro for my thesis tying two projects together (why is temperature important; why are these two species important)
  • submit thesis to Graduate School by June 25th (last day is Friday, June 26th) (electronic thesis)
  • submit master’s thesis approval form (committee) (form) by June 26th (I don’t think this has been submitted yet…?)

Crab stuff

  • DESeq2 on libraries 8-11 -> compare based on infection status to get effect of temperature (in progress: here)
  • Time series of individual crab RNAseq data (in progress: here)

Resources that I’ve been using to figure out the above tasks:

Things that might come in handy as I continue to try to figure these tasks out:

I’ll create a more detailed post after I find out what works and why.

Oyster stuff

I think just go through the comments from Chelsea and Pam and clean it up.

Goal timeline:

By/on June 18th (this Thursday):

  • have new results for crab project
  • have meeting with Pam (and Steven and Sam) to go over crab project

By Jun 23rd (next Tuesday):

  • have thesis intro written
  • have oyster comments addressed
  • have progress made on new crab results discussion, etc

June 24th (next Wednesday):

  • make sure I know how I’ll submit thesis electronically (info linked in thesis section above)

Submit thesis to graduate school by June 25th!! (June 26th if needed)
Make sure committee signs and submits the form approving my submittal of an electronic thesis to the Graduate School (get signatures) by June 26th at the latest.

from Grace’s Lab Notebook https://ift.tt/2LWlJX1

Yaamini’s Notebook: MethCompare Part 18

Formatting individual sample information

In this issue we’ve discussed how to revise our CpG methylation status and genomic location statistical analyses. We know we want to compare proportions while investigating if sequencing method affects the proportion of CpGs in different methylation statuses or genomic locations. I posted some suggestions, but in the meantime I thought I could obtain individual-level proportion data.

Thankfully for me, most of the pipeline was already set up! In this Jupyter notebook I counted CpGs for each methylation status and in various genomic features. The only things I needed to modify were ensuring I used bedtools intersect -u, adding code for upstream and downstream flank overlaps, and adding the path to the explicit intragenic region tracks. I took the output files (line counts) and used them in this R Markdown script to create summary tables:

M. capitata:

P. acuta:

Going forward

  1. Conduct statistical analysis
  2. Locate TE tracks
  3. [Characterize intersections between data and TE, and create summary tables]
  4. Look into program for mCpG identification


from the responsible grad student https://ift.tt/2XU7I3V

Shelly’s Notebook:

from Shelly A. Trigg, Ph.D. https://ift.tt/2XHUM0V