Update to project-virginica-oa Repository Contents

Found out that there were several large files in the project-virginica-oa repository (ex. bam files from bismark runs) that maxed out hummingbird’s hard drive. The solution was to stop running bismark on hummingbird and use Mox instead. To continue to use hummingbird for any other analysis, I needed to move the large files from hummingbird to gannet. I created checksums for all of my large file folders, than moved them to this directory on gannet. Once on gannet, I deleted the files off hummingbird and synced my Github repository. Lab notebook links may be broken. In terms of workflows, any time I generate a large file in a repository that cannot be synced to Github, I should move it to gannet.

DML Analysis: Analysis Methods

Read The epigenetic landscape of transgenerational acclimation to ocean warming to get an idea for potential analysis methods. The authors used different functions in methylKit to obtain DMRs:

"Briefly, the ‘methRead’ function of methylKit reads the mapping results with 10 reads per cytosine as a minimum coverage threshold. High coverage bases (99.9%) were filtered to exclude potential PCR bias and then normalized using ‘filterByCoverage’ and ‘normalizeCoverage’ functions, respectively. Genomic regions were categorized as CpG island, CpG shore, promoter, 5′ untranslated region (UTR), exon, introns, 3′ UTR and repeats. Methylated or unmethylated cytosines in each genomic region were summed for each sample by the ‘regionCounts’ function of methylKit. The P values of methylation differences for each region between two samples were calculated using a chi-squared test in the ‘calculateDiffMeth’ function."

They also generated heatmaps with DMR data, which would be useful in my case as well.

In BEDtools, they used the closest function to pair DMRs and genes:

"The closest gene to a DMR on the same scaffold was identified using ‘closest’ from BEDTools v. 2.2339. This resulted in 1,563 genes from 2,078 CpG DMRs, while 115 DMRs were on scaffolds without annotated genes (Supplementary Table 4)."

Because they paired gene expression data with epigenetic data, they did not do any gene enrichment. I’ll need to refer to Emma’s geoduck paper for those methods.

DML Analysis: GOterm Update

Sent blastx results and C. virginica genome to Mike Riffle in Genome Sciences. He is going to build me a portal that will do a GOterm enrichment on any set of genes I provide. This is similar to the portal he built for Emma’s geoduck paper.

Linear Mixed Effect Models in R

Good tutorial from Bodo Winter at UC Merced can be found here. He encourages using Likelihood Ratio Tests to obtain p-values.

DML Analysis: How to get GOterms

Gene Set Enrichment Analysis Workflow:

  • Get Entrez Gene IDs
  • Match IDs with GOterms
  • Use both topGO and DAVID for enrichment


  • The gene IDs found in the C. virginica GFF files are not official, NCBI Entrez Gene IDs. Not sure what LOC{} is, but XM_{} are Genbank IDs. Genbank IDs from the GFF were not recognized by DAVID


  • blastx to get Uniprot accession codes and GOterms
  • Use Uniprot and GOterms in DAVID
  • Convert Uniprot accession codes to Entrez IDs
  • Use Entrez IDs and GOterms in DAVID

DML Analysis: Possible Gene Enrichment Methods

topGO and GOstats look promising. This blog has some code for how to use these packages, but it’s a little unclear. Overall consensus on this forum is that DAVID is heavily-used, but should not be used anymore since it hasn’t been updated in several years. Would also be nice to have something in a script instead of a website.

DML Analysis: Notes

The mRNA track includes exon and introns. The exon track just has exons. See the issue.

Next steps in analysis (from this issue):

  • Proportion CGs in exons, introns, and mRNA
  • Find DMLs and CG motifs within 1000 bp of mRNA, possibly using -iobuf with intersectBed
  • Enrichment analysis of mRNA genomic regions. Need to find specific package?