Grace’s Notebook: Update on Crab Pooling and Plan for next week-ish

Today I met with Steven to talk briefly about the crab sample pooling (need more info on how to proceed- will wait for Sam’s return in 1.5 weeks), and about things I can do in the interim while the samples are being sequenced: practicing Trinity with Geoduck transcriptome data; DecaPod; 2015 Oysterseed project.

Crab Pooling

So I misunderstood what was needed. The CORE facility needs at LEAST 20ng/ul of RNA in a 50ul sample. Meaning, the sample needs at 1000ng of RNA. However, when the Qubit results (in ng/ul) are multiplied by 50ul (the final volume of the isoated RNA samples), the vast majority of them are well below 250 ng of RNA. For a pool (4 crab samples per pool) to be successful, each sample needs to contribute at least 250ng of RNA to the pooled sample.

Here’s link to GitHub issue: here

My epiphany notes from today:

Practicing with Trinity

Steven mentioned that while the samples are being sequenced (it can take about 1.5 months) it would be cool if I practiced/ learn how to use Trinity on some Geoduck transcriptome data. I will be using Trinity to assemble the transcriptome from my samples later.

Here is a link to a wiki on Trinity that I have been reading about: here


When I was in Juneau last November with Pam, I met Genevieve Johnson who is a current intern at NOAA and she is working on Tanner crab hybrid population structure and genomics. I met up with her again this past weekend when I was in Juneau visiting my sister and we spoke more about our projects. She is in Seattle this week and if she has time, I will meet up with her and record a little about what she’s up to and her experience with Tanner crabs for a new episode of DecaPod.

If she doesn’t have time while she’s here, we can try and do it remotely. Should be interesting to see how the sound quality works with that.

Some other episode ideas are:

  • Crab pooling plan and sequencing info (wait for Sam to get back to get better idea of what is involved)
  • Some summaries of literature on Tanner crabs and Bitter crab disease and Hematodinium that I read

2015 Oysterseed project

Work on understanding what I’m doing and making sure I choosing the rights settings because my peaks are looking pretty bad. I had a really high error rate last week, so I need to go through the protocols and Yaamini and Laura’s notebooks to see what they did.

from Grace’s Lab Notebook

Yaamini’s Notebook: DML Analysis Part 4

flankBed 101

It’s not enough to understand where DMLs or CG motifs intersect with exons, introns and mRNAs! Regions up or downstream from mRNA regions may contain promoters or other transcription factors. These areas could also be regulated by methylation. To explore this, I identified flank in `bedtools in this issue as a method to add 1000 bp to the beginning and end of an mRNA region.

Before I could do this successfully, I had to create a “genome file”. For flankBed, the genome just needed to be a tab-delimited file with the chromosome name, start and stop positions. I used NCBI information to create the “genome file” in TextWrangler.

I used the following code in my Jupyter notebook to generate the flanks:

 ! /Users/Shared/bioinformatics/bedtools2/bin/flankBed -i C_virginica-3.0_Gnomon_mRNA.gff3 -g 2018-06-15-bedtools-Chromosome-Lengths.txt -b 1000 \ > 2018-06-15-mRNA-1000bp-Flanks.bed  

I had an issue running the code over the weekend, since I couldn’t get any output. Once I restarted Jupyter and reran the code, it worked!

I took my flankBed output and ran it through intersectBed to find overlaps between the output, DMLs, and CG motifs.

 ! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \ -wb \ -a 2018-06-15-mRNA-1000bp-Flanks.bed \ -b ../2018-05-29-MethylKit-Full-Samples/2018-05-30-DML-Locations.bed \ > 2018-06-19-mRNA-100bp-Flanks-DML.txt  
 ! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \ -wb \ -a 2018-06-15-mRNA-1000bp-Flanks.bed \ -b C_virginica-3.0_CG-motif.bed \ > 2018-06-19-mRNA-100bp-Flanks-CGmotif.txt  

You can find the flank-DML output on Github, but the CG motif one was too big. I’ll upload that to OWL shortly.

That’s a wrap on the gonad methylation stuff until I’m back in town!

// Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student

Yaamini’s Notebook: DML Analysis Part 3

Quantifying proportion of overlaps

Now that I know the DMLs and CGs intersect with various feature tracks, I need to quantify the overlap proportions. I answsered the following questions in this Jupyter notebook:

  1. What proportion of total overlaps does a certain feature track represent?
  2. Out the total number of CG motifs, how many overlaped with a feature track?
  3. Out the total number of DMLs, how many overlaped with a feature track?

It was simple enough to conceptualize and carry out! Just needed to count the number of DMLs or CG motifs, then count the number overlaps before summing and dividing in various ways. See my previous notebook post for code that counts the lines in BEDfiles and overlaps.

Here’s the answer to the first question:

  • Proportion exon overlap with total CG overlaps: 67.55% (0.6754709569888477)
  • Proportion intron overlap with total CG overlaps: 26.06% (0.2606253947864305)
  • Proportion mRNA overlap with total CG overlaps: 6.39% (0.06390364822472172)

Most of our CG overlaps come from exons, which code for proteins. Interesting!

Here are the answsers to the second and third questions:

Feature % CG Overlaps % DML Overlaps
Exon 4.40 66.05
Intron 1.70 27.82
mRNA 0.42 92.04

Those numbers are different, so there’s definitely some sort of enrichment going on! Maybe when I figure out that gene set enrichment everything will make sense.

// Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student

Yaamini’s Notebook: DML Analysis Part 2

Understanding gene enrichment

TL;DR: I got confused a lot but I think I’m on the right track, but this will take a while.

I switched over to an R Markdown file and tried working with topGO for gene enrichment. I wanted to use an R package over DAVID to improve the reproducibility of my work.

I set up some code and reformatted by DML-mRNA overlap file in the script. topGO requires GOterms as inputs as well, so I tried using to match gene IDs to GOterms. However, I ran into my main problem: I don’t have Entrez Gene IDs (official NCBI IDs) for my C. virginica genes. Without these, I cannot use any sort of gene enrichment R package, let anlone convert gene IDs to GOterms! Steven suggested I blastx the C. virginica genome against the UNIPROT database. This would give me UNIPROT accession codes and GOterms. I can find a way to convert UNIPROT accession codes to Entrez Gene IDs to use topGO. He also suggested I use DAVID for gene enrichment and compare the results.

I rearraged my R Markdown file and started blastx. We’ll see how this goes…

Want to know more about my thought process for this analysis? I started using WordPress to document intermediate thoughts! Click on the “Feed” link in the top right corner of this webpage, or navigate to these links:

// Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student

[code]#!/bin/bash ## Job Name #SBATCH...

## Job Name
#SBATCH --job-name=bismark
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-100:00:00
## Memory per node
#SBATCH --mem=100G
#SBATCH --mail-type=ALL
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/srlab/sr320/analyses/0618b

source /gscratch/srlab/programs/scripts/

/gscratch/srlab/programs/Bismark-0.19.0/bismark_genome_preparation \
--verbose \

find /gscratch/srlab/sr320/data/olurida-bs/[1-8]_*.gz \
| xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /gscratch/srlab/programs/Bismark-0.19.0/bismark \
--path_to_bowtie /gscratch/srlab/programs/bowtie2-2.1.0 \
--score_min L,0,-0.9 \
-genome /gscratch/srlab/sr320/data/olurida-genomes/v081 \
-p 28 \

/gscratch/srlab/programs/Bismark-0.19.0/deduplicate_bismark \
--bam -s \

/gscratch/srlab/programs/Bismark-0.19.0/bismark_methylation_extractor \
--bedGraph --counts --scaffolds \
--multicore 14 \


[code][sr320@mox2 0617]$ tail slurm-191208.out ->"AllSamples_Maf1_association2.arg"...

[sr320@mox2 0617]$ tail slurm-191208.out 
	-> Tue Jun 19 06:24:05 2018
	-> Arguments and parameters for all analysis are located in .arg file
	-> Total number of sites analyzed: 786914862
	-> Number of sites retained after filetering: 1 
	[ALL done] cpu-time used =  124766.39 sec
	[ALL done] walltime used =  153744.00 sec