Yaamini’s Notebook: DML Analysis Part 46

Epigenetics reading group feedback!

Although the whole gang wasn’t present Mac and Steven gave me some great feedback on the C. virginica edits I started on! Here are some of the highlights:

Minor improvements

  • Clearly defining what I consider a DML, since a locus doesn’t have to be one base pair
  • Rephrasing GO-MWU input creation methods
  • Not using signed p-values for GO-MWU. Mac made a good point that understanding how up- or downregulated genes influence enrichment patterns makes more sense for transcription than seeing how hyper- and hypomethylated DML influence enrichment. We don’t know what methylation is doing in a gene, and using signed p-values doesn’t shed any light on that function. Additionally, hypermethylation could lead to increased or decreased gene activity. If that gene is responsible for repressing another gene function, assigning a positive or negative value based solely on methylation status doesn’t take into account gene function.
  • Removing DMG analysis since the actual function of methylation in these genes is unknown (no paired transcription data) and it didn’t add anything to the paper.
  • I moved the percent methylation across genome feature figure to Figure 1 instead of keeping it with Figure 2.
  • Changed Figure 8 so it was colored by gene group, not biological process
  • Mac suggested I convert the multiple barplots in Figure 5 (panels c-f) to stacked barplots. I created a new versions of this plot after doing some standard dataframe manipulation.

Calculating “genome methylation”

In the paper, I state that 22% of the C. virginica genome was methylated. Mac and Steven were hesitant to make this claim since it wasn’t clear how I got this number, and I couldn’t find it in the methods anywhere. When I looked at the 3,181,904 methylated CpGs (> 50% methylated) versus the 14,458,703 total CpGs in the C. virginca gneome, I got 22%. This doesn’t saying that 22% of the genome is methylated. Instead, 22% of CpGs were found to be methylated. I clarified this statement in the abstract, results, and discussion.

Methylation islands

I removed the methylation island bar in Figure 2, since the methylated CpGs and methylation islands are correlated. Mac described methylation islands as a smoothing function, so it’s not really interesting to see where they are located with respect to the other genome features beyond just knowing what’s genic and intergenic. I also removed Figure 2b (individual feature locations in methylation islands). After calculating median and ranges for methylation island length and number of mCpGs, I tried creating a histogram of island lengths in this R Markdown file. When I created a preliminary plot, it was clear that I would need a gapped axis! I previously created a barplot with a gapped axis, so I thought I could recycle code. I kept getting the following error when I tried plotting my data:

Error in rect(xtics[littleones] - halfwidth, botgap, xtics[littleones] + : cannot mix zero-length and non-zero-length coordinates 

In the meantime, I looked at the histogram bins and how many islands were in each bin. I wrote up a quick description for the results section.

Things to improve in the discussion

  • Look at list of genes with DML and see if any of them have been reported in ocean acidification and gene expression studies
  • Add ocean acidification context to final two discussion paragraphs, along with caveat about experimental design confounding responses to ocean acidification with potential differences in gamete maturation

Going forward

  1. Address remaining comments about discussion text
  2. Update manuscript text
  3. Update response to reviewers
  4. Consolidate any co-author feedback
  5. Submit comment responses and reviesed manuscript
  6. Post revised paper on bioRXiv
  7. Update paper repository with new code and figures

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3ceuclt

Sam’s Notebook: qPCR – C.bairdi RNA Check for Residual gDNA

After deciding on a primer set to use for gDNA detection on 20200225, went ahead and ran a qPCR on most of the RNA samples described in Grace’s Google Sheet. Some samples were not run, as they had not yet been located at the time I began the qPCR.

Used 2ng of RNA (1uL) in each reaction. A 5uL dilution of each sample was made to a concentration of 2ng/uL. The decision for this quantity was based on the amount of RNA we might use in a reverse transcription reaction (50ng/25uL) and the volume of the resulting cDNA we’d run in a qPCR reaction (1uL). Calculations and the qPCR results (Cq values) are below (Google Sheet). It’s a modified version of Grace’s spreadsheet (linked above).

All reactions were run with 2x SsoFast EVA Green qPCR Master Mix (BioRad) on the Roberts Lab CFX Connect qPCR machine.

All samples were run in duplicate. See the qPCR Report (in Results section below) for plate layout, cycling params, etc.

Master mix calcs are here (Google Sheet):

NOTE: Sample 238 has almost nothing left; maybe 1uL?

Sam’s Notebook: qPCR – C.bairdi Primer Tests on gDNA

We received the primers I ordered on 20200220 and now need to test them to see if they detect gDNA. If yes, then they’re good candidates to assess the presence of residual gDNA in our RNA samples before we proceed with reverse transcription.

The three primer pairs to test are:

  • SRIDs 1774/1775 (40s rRNA S30)
  • SRIDS 1776/1777 (allantoicase)
  • SRIDs 1778/1779 (ubiquitin thioesterase)

Used 1uL of a 1:10 dilution of the 6129_403_26 gDNA from 20200210 as template (equates to ~4ng/uL).

All samples were run in triplicate.

qPCR Master Mix calcs (Google Sheet):

Sam’s Notebook: Primer Design – C.bairdi Primers for Checking RNA for Residual gDNA

Getting ready to run some qPCRs and first we need to confirm that our RNA is actually DNA-free. Before we can do that, we need some primers to use, so I decided to semi-arbitrarily select three different gene targets from our MEGAN6 taxonomic-specific Trinity assembly from 20200122.

I used our recent differential gene expression analysis to identify those genes which were highly differentially expressed in infected vs. uninfected samples.

Overall, the process went something like this:

  1. Sort upregulated genes in infected group by logFC (fold change) to find Trinity transcript IDs of highly expressed genes:
awk 'NR>1' salmon.gene.counts.matrix.infected_vs_uninfected.edgeR.DE_results.P0.05_C1.infected-UP.subset \ | sort -n -k4,4 
  1. Search for some of the highly expressed Trinity IDs in the Trinotate annotations to find SwissProt IDs:
grep "TRINITY_DN6549_c0_g1" \ 20200126.cbai.trinotate_annotation_report.txt 
  1. Copy SwissProt ID (if available) and see what it is on the UniProtKB website.
  2. If interesting (somewhat), search Trinity de novo assembly for transcript sequence.
  3. Use sequence to generate primers on the Primer3 website.

Yaamini’s Notebook: Gigas and Virginica Comparison Part 4

Reframing my questions and making a poster

I have (almost) all analyses completed, so it’s time to make the poster! I showed Steven the analyses I did thus far, which helped me frame the sub-questions they correspond to.

Where in the genome are MI?

I calculated how much individual features overlapped with MI, but I don’t have any visual representation of the genomic location of MI. Steven suggested I create a stacked barplot similar to those I’ve made in the past.

C. virginica

To start, I returned to my Jupyter notebook. Since intersectBed output can change depending on which file you list first, I characterized two different kinds of overlaps:

  • MI first: Where are MI in the genome?
  • Features first: How much of a single feature overlaps with MI (see below)?

All output is saved in this folder. The naming conventions match which file was argument -a and which was -b (ex. MI-Exon means that MI was listed as -a). For two different features, it didn’t make sense to look at both kinds of overlaps. Understanding which MI are in intergenic regions is important, but I don’t know what it would tell me if there is an intergenic region inside my MI. Also, bedtools wouldn’t let me switch the inputs to even answer that question. Knowing which DML are in MI is important, but since they are single loci doing the opposite command will lead to the same output.

Table 1. Characterizing two different kinds of overlaps between MI and various genome features in C. virginica

Feature MI Location in Feature Individual Feature Overlaps with MI
Exons 22705 240133
Intron 28730 92472
Genes 30773 15009
mRNA 29805 29483
Transposable Elements 25085 107926
Putative Promoters 4217 8846
Other 1154 N/A
DML N/A 537

C. gigas

I rinsed and repeated with C. gigas MI in this Jupyter notebook.

Table 2. Characterizing two different kinds of overlaps between MI and various genome features in C. gigas

Feature MI Location in Feature Individual Feature Overlaps with MI
Exons 16135 49607
Intron 18322 51980
Genes 19278 8761
Transposable Elements 2988 5868
Putative Promoters 2498 2746
Other 3084 N/A
DML N/A 384

Stacked barplot

I updated this file with overlap information to include MI locations. I imported the revised file into this R Markdown document to analyze the distributions. The MI locations in various genomic features are significantly different between the two species (χ2 = 21.026, df = 4, p-value = 0.0003129).

Screen Shot 2020-02-13 at 10 04 36 PM

Figure 1. Location of MI in C. virginica and C. gigas genomes.

Are genes in MI conserved?

Which genes are in MI for both species? I started working on that question in this Jupyter notebook. My plan:

  1. Annotate Gene-MI overlaps for each species with Uniprot Accession codes
  2. Match Uniprot Accesion codes!

Annotating the C. gigas overlaps with Uniprot Accession codes was easy, but C. virginica was a bit more complicated. I identified overlaps using a BEDfile that didn’t have gene information! Additionally, I have several files with C. virginica Uniprot Accession codes and matching Genbank IDs, but those are found in mRNA.

To fix the first issue, I went back to this Jupyter notebook and used the gff gene track to find the overlaps instead. Easy!

To fix the second issue, I matched gene IDs in the gene-MI overlap file with Genbank IDs in the mRNA track to then match to Uniprot Accesion codes (confusing I know). When it came down to merging Uniprot Accession codes, I found that there were no common genes in MI between the two species! Wild.

How much of a single feature overlaps with MI?

Fixing bedtools output

Earlier, I tried finding overlaps between C. gigas MI and genome feature tracks in this Jupyter notebook. My output wasn’t correctly tab-delimited, so I posted this issue. After trying various slight code modifications, Sam noticed I wasn’t using the most updated bedtools version. I followed the instructuions here to download version 2.29.1, then reran my code. Updating the version was enough to fix my output issue! Turns out the only thing it needed to fix was the actual output, so the overlap counts I have don’t need to be revised.

Revising the figure

I returned to this R Markdown file to see if the distributions were different and fix my MI overlap boxplots. I found that the distribution of individual genome features in MI is significantly different between species (χ2 = 23.599, df = 2, p-value = 7.507e-06). I modified the y-axis label in the figure since it wasn’t accurate and added the revised C. gigas intron data.

Screen Shot 2020-02-13 at 4 11 27 PM

Figure 2. Final boxplots showing genome feature overlaps with methylation islands.

What biological processes are represented in DML compared to the full genome?

C. virginica

Since I already made a back-to-back GOSlim plot for the paper submission, all I needed to do was recolor it to match the poster color scheme. I returned to this R Markdown file and did just that. Instead of grouping GOSlim terms by colors, I thought I’d just use different colors for the full genome and DML.

Screen Shot 2020-02-14 at 10 52 57 AM

Figure 3. GOSlim terms in C. virginica genome and DML

C. gigas

For WSN, I examined GOSlim terms for genes with DML…OR SO I THOUGHT. When I went back to this Jupyter notebook, I found that I annotated genes containing DML with Uniprot Accession codes, but I didn’t actually subset my GOSlim list so that it only included those genes when making my figure in this R Markdown file. Since my goal is to make a back-to-back GOSlim plot for genes and genes with DML, I did that subsetting.

Screen Shot 2020-02-14 at 10 46 41 AM

Figure 4. GOSlim terms in C. gigas genome and DML

Looking at both figures, there are two things that stand out:

  1. Both genomes have remarkably similar distributions of GOSlim terms
  2. DML processes are also similar, but there are potentially more developmental process GOSlim terms associated with C. gigas DML-gene overlaps.

While these figures are good, they take up too much space on the poster! Steven suggested I divde the percent of genes with DML with the percent of genes for each GOSlim term to get relative proportions. I could then plot the C. virginica and C. gigas data back-to-back.

I made the plot using the same code I used for the other plots, but for some reason it adds “0” to the x-axis label 3 times…unsure why that’s happening but I can fix it later.

Screen Shot 2020-02-15 at 10 45 28 AM

Figure 5. Relative proportion of GOSlim terms for genes with DML vs. all genes for C. virginica and C. gigas

Final poster

I tried my hand ata the #betterposter format! It was difficult to figure out what the main takeaway should be since there were a lot of interesting components, but we settled on some summary statistics from MI and DML as well as the GOSlim plot. Here’s hoping people will be interested at Ocean sciences next week!

Screen Shot 2020-02-15 at 1 37 30 PM

Going forward

  1. Send for printing!

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3bHTbgw

Yaamini’s Notebook: Gigas and Virginica Comparison Part 3

C. gigas vs. C. virginica: DML

Now the good stuff: looking at DML in response to ocean acidification in Crassostrea spp. gonad tissue!

DML quantity and locations

When I was first analyzing my C. gigas DML, I was pretty sure the numbers weren’t identical. I decided to look at the number of DML and DML-genome feature track overlaps.

Table 1. Comparison of DML number and overlaps with various genome feature files. C. gigas DML were created with 10x data and 2 pooled samples, and C. virginica DML were created with 5x data and 10 individual samples.

Feature ** C. virginica ** ** C. gigas **
Quantity 598 628
Overlaps with Exons 368 (61.5%) 157 (25.0%)
Overlaps with Introns 192 (32.1%) 285 (45.4%)
Overlaps with Genes 560 (93.6%) 442 (70.4%)
Overlaps with Transposable Elements 57 (9.5%) 8 (1.3%)
Overlaps with Putative Promoters 42 (7.0%) 24 (3.8%)
Overlaps with Other Regions 21 (3.5%) 165 (26.3%)

At first glance, there aren’t that many more C. gigas DML than there are C. virginica DML. However, they’re distributed differently! In C. gigas, only 70.4% of DML are in genes as opposed to 93.6% in C. virginica. Within genes, C. gigas DML are more heavily weighted towards introns, while C. virginica DML are primarily in exons. In this R Markdown script, I conducted a proportion test to see if the DML distribution bewteen species truly were different. Based on a chi-squared test, the distributions of DML across genomic features are significantly between speceis (χ2 = 38.516, df = 4, P-value = 8.767e-08).

Screen Shot 2020-02-12 at 11 24 06 PM

Figure 1. Distribution of C. gigas and C. virginica DML in various genome features.

Comparing GOSlim functions of common genes with DML

Even though there are differences in where DML are located in their respective gneomes, there may be some genes in both species that have DML in response to ocean acidification. The first thing I needed to do was identify if there were any overlaps. I decided to try matching the annotated gene products associated with each DML list in this Jupyter notebook. Late-night work delirium coupled with caffeine lead me to believe the best way to do this would be to look for partial matches in the gene product columns. I tried come awk code I found online, but I couldn’t get it to work consistently. Depending on which list I used as my base, I got two different numbers of overlapping gene products! Annoyed, I posted this issue. Sam rightly pointed out that the best way to go about this would be to look at Uniprot Accession instead since they have one-to-one matches with proteins. While I briefly toyed with the idea of trying to use diff, I went back to Old Faithful: join:

#Join the 1st column in the first file with the 1st column in the second file #The files are tab delimited, and the output should also be tab delimited (-t $'\t') #Check head of output #Count number of overlaps !join -1 1 -2 1 -t $'\t' \ 2019-06-20-Cv-DML-Gene-Annotation-sorted-reduced.tab \ DML-Gene-annot-reduced.tab \ > 2020-02-11-Common-DML-Cv-Cg.txt 

The output, found here, has Uniprot Accession, Genbank ID, and CGI ID. The inputs were reduced versions of annotated C virginica and annotated C. gigas I made with awk. I created an expanded version of this common DML here that has product annotations from both DML sets. Even though I figured the products would be identical since they are associated with Uniprot Accession, I still wanted one column from each dataset.

Once I had the list of overlapping genes with DML, I eliminated duplicate entries, going from 117 to 81 overlaps. I then used CGI ID to match these entries with GOSlim terms. I extracted unique biological process entries that did not include “other biological processes” distinctions and saved the output here. In total I had 53 unique GOSlim terms representing 81 common genes with DML between C. gigas and C. virginica! I don’t know what I was expecting, but that really doesn’t seem like much. Interesting.

In this R Markdown file, I made some figures to characterize the GOSlim terms in common genes and in the original datasets. I’m really glad I settled on this figure format during WSN becuase it has come in handy.

Screen Shot 2020-02-13 at 12 52 53 AM

Figure 2. GOSlim terms associated with common genes containing DML in C. gigas and C. virginica.

Screen Shot 2020-02-13 at 1 31 15 AM

Figure 3. GOSlim terms associated with C. virginica genes with DML

Screen Shot 2020-02-13 at 1 31 42 AM

Figure 4. GOSlim terms associated with C. gigas genes with DML

It’s interesting that signal transduction, death, cell adhesion, and cell-cell signaling — processes represented in the C. gigas and C. virginica DML datasets — are not present in GOterms of common genes. Maybe this is due to differences in the experimental set-up or of family and/or species-specific stress tolerances.

Going forward

  1. Draft poster for ASLO and get feedback
  2. Finalize ASLO poster and send for printing by Thursday to get the ASLO discount!

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/2OSyYei

Yaamini’s Notebook: Gigas and Virginica Comparison Part 2

C. gigas vs. C. virginica: general methylation landscapes

Now that I’ve tested methylation island analysis, I want to finalize my C. virginica and C. gigas MI tracks and look at the location of MI in the genome and in relation to DML.

C. virginica MI

Finalizing track parameters

Based on this issue, I tested out 500 bp windows and a 0.02 mCpG fraction with either a 25 bp or 50 bp step size.

Table 1. Testing 500 bp window sizes and different step sizes

Initial Window Size (bp) mCpG Fraction **Step Size (bp) ** Number of MI Max mCpG in MI Min mCpG in MI
500 0.02 25 64795 24777 10
500 0.02 50 63483 24777 10

I then visualized the tracks in IGV! The two 500 bp tracks are pretty comparable, but there are definitely places without MI in the 500 bp tracks that were in the 200 and 300 bp tracks.

Screen Shot 2020-02-08 at 3 19 06 PM

Screen Shot 2020-02-08 at 3 19 27 PM

Screen Shot 2020-02-08 at 3 20 08 PM

Screen Shot 2020-02-08 at 3 20 54 PM

Screen Shot 2020-02-08 at 3 24 32 PM

Figures 1-5. MI tracks with 200, 300, and 500 bp windows and 0.02 mCpG fractions against all 5x CpGs, methylated CpGs, and mRNA.

When I asked for feedback on the tracks in this issue, Steven pointed out that some of the MI were in fact less than 500 bp.

Screen Shot 2020-02-08 at 4 16 08 PM

Figure 6. MI that is at least 500 bp.

Screen Shot 2020-02-08 at 4 19 57 PM

Figure 7. MI that is less than 500 bp.

I don’t undertand why the script still created MI that were less than 500 bp! I decided to manually filter by length using this code:

!awk '{if ($3-$2 >= 500) { print $1"\t"$2"\t"$3"\t"$4}}' 2020-02-06-Methylation-Islands-500_0.02_25.tab \ > 2020-02-06-Methylation-Islands-500_0.02_25-filtered.tab 

Almost 30,000 of the MI I identified did not meet this size threshold!

Tables 2. MI characteristics after filtering by size. No MI should be less than 500 bp.

Initial Window Size (bp) mCpG Fraction Step Size (bp) Number of MI Number of MI after Filtering Max mCpG in MI Min mCpG in MI
500 0.02 25 64795 36060 24777 11
500 0.02 50 63483 37063 24777 11

Based on the feedback, I decided to move forward with the filtered 500 bp windows, 0.02 mCpG fraction, and 50 bp window sizes.

Location in the genome

In this Jupyter notebook, I characterized the location of genome feature tracks and DML with respect to MI.

Table 3. MI overlaps with genome feature tracks and DML.

Feature Number of Overlaps with MI
Exons 240133
Introns 92472
Genes 15009
mRNA 29483
Transposable Elements 107926
Putative Promoters 8846
DML 537
Hypermethylated DML 283
Hypomethylated DML 254

Most of my DML (89.7%) were in my MI! Additionally, 32.8% of exons, 29.2% of introns, 49.0% of mRNA, and 38.6% of genes overlapped with MI. In contrast, only 15.6% of transposable elements and 14.7% of promoters overlapped with MI.

To visualize the MI in the C. virginica genome, I decided to create boxplots similar to Jeong et al.. I started by importing MI-feature track overlaps in this R Markdown script. For each file, I calculated the length of the MI, then divided the base piars that overlapped by the MI length. This gave me an overlap percentage that I used in boxplots! I created a base plot, but decided not to spend too much time on it now and return to it for my reviewer comments in a few days. The code I used for calculations and to make the plot was important than actually making the plot pretty and saving it.

C. gigas MI

Now that I generated C. virginic MI, it’s time to move onto C. gigas MI.

General gonad methylation landscape

To identify methylated CpGs in the C. gigas genome, I need to characterize the general methylation landscape. Although Claire and Mac did this previously with ctendia, I wanted to go through this process with my gonad samples in the event that somatic and gonad tissue had variations in methylation landscapes. Additionally, Claire and Mac identified mCpGs using 5x data, and all of my analyses were done with 10x data.

Rerunning bismark

When I went to access my coverage files on ganent, I found that I had overwritten these files! the folder with bismark output no longer existed…which meant I had to rerun bismark. :sob:

Based on previous lab notebook posts detailing my struggle running mox last time, I made a quick change to this script so my deduplication would run only on my sample files, and not proceed iteratively through a bunch of other files. I used the following code to transfer my files over to mox:

rsync --archive --progress --verbose yaamini@*fastq.gz . #Transfer sample files to scrubbed/yaamini/data/2019-09-03-Trimmed-Files rsync --archive --progress --verbose yaamini@ . #Transfer bisulfite converted genome to scrubbed/yaamini/data/2019-09-03-Bismark-Inputs rsync --archive --progress --verbose yaamini@ . #Transfer script to srlab/yaaminiv 

It took me a few failed runs to realize I transfered files to the incorrect directories or didn’t transfer the files at all. I definitely opted for a more complicated file structure in my script but if I would have perused the script more carefully beforehand it wouldn’t have been a problem. A good reminder for me to really read the script! Once every file was where I said it would be, I ran my script and hoped for the best. I used the following code to check up on progress:

cat slurm-1786754.out 

Even though I set up the script to run through all of bismark, I thankfully only need the coverage files to proceed with my analyses.

One day and six hours later, my bismark run completed! I used this code to transfer the files off mox:

rsync --archive --progress --verbose /gscratch/scrubbed/yaamini/analyses/Gigas-WGBS/2019-09-11-Gigas-Bismark/* yaamini@ #Transfer files to gannet 
Characterizing the methylation landscape

While my files were processing, I set up this Jupyter notebook to create a master file and characterize methylation. Using the head of two C. virginica coverage files, I used a series of commands to merge the information in both coverage files. I then used these commands on the coverage files I got from bismark. Yay efficiency!

Before I created a master coverage file, I quickly created 5x and 10x sample bedgraphs in this Jupyter notebook since the output wasn’t on gannet or in my repository…RIP.

Alright, BACK TO THE GOOD STUFF. First, I created a new column in each coverage file that combined chromosome and start bp information. I will need to use this information to sort files and join them.

for f in ../2019-09-13-IGV-Verification/*cov do awk '{print $1"-"$2"\t"$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' ${f} \ | sort -k1,1 \ > ${f}.sorted done 

Next, I joined the two files by the chr-start column. In order to simulate an outer join, I specified that unmatched lines in each file should be included in the output using the -a argument. The only time a locus would be in one file and not another is when there are no reads for that locus in one sample. To clean up the file, I replaced any empty fields (methylation percentage, count methylated, count unmethylated) with 0 using -e 0. I then specified which columns I wanted printed in the output using -o. Before saving the output, I used tr to translate the “-“ in the first column to a tab delimiter. This way, I did not have to deal with any redundancy in the locus start positions from each files or figure out how to merge the columns while dealing with zeroes.

#Join the first column in the first file with the first column in the second file #The files are tab delimited, and the output should also be tab delimited (-t $'\t') #Print unpairable lines in file 1 (-a1) and 2 (-a2) to simulate an outer join. Replace empty fields with 0 (-e) and only print the following fields (-o) #Convert - to \t to uncouple the chromosome and start position !join -1 1 -2 1 \ -t $'\t' \ -a1 -a2 -e 0 -o '0,1.5,1.6,1.7,2.5,2.6,2.7' \ FILE \ FILE \ | tr '-' "\t" \ > cgigas_gonad-10x_raw.cov 

Finally, I consolidated the methylated and unmethylated read counts from each file, summed the total number of reads at each locus, and calculated the revised methylation percentage.

#Calculate total count methylated and unmethylated for each locus #Sum number of reads at each locus #Calculating revised methylation percentage for each locus #Multiply percentages by 100 !awk '{ print $1"\t"$2"\t"$2"\t"$4+$7"\t"$5+$8}' cgigas_gonad-10x_raw.cov \ | awk '{ print $1"\t"$2"\t"$3"\t"$4+$5"\t"$4"\t"$5}' \ | awk '{ print $1"\t"$2"\t"$3"\t"$5/$4"\t"$5"\t"$6}' \ | awk '{ print $1"\t"$2"\t"$3"\t"$4*100"\t"$5"\t"$6}' \ > cgigas_gonad-10x_concat.cov 

I have data for 12,728,174 CpGs at 10x read depth. After reducing the dataset to loci with at least 10x coverage, I identified methylated (≥ 50%), sparsely methylated (≥ 10%, < 50%), and unmethylated (< 10%) CpG loci in this Jupyter notebook. I also characterized the location of these loci in relation to various genome features.

Table 1. General C. gigas gonad methylation landscape.

Feature All 10x Loci Methylated Loci Sparsely Methylated Loci Unmethylated Loci
Number of CpGs 12728174 1677041 2267700 8783433
Exons 1803226 559020 199748 1044458
Introns 3834047 706876 768464 2358707
Genes 5637273 1265896 968212 3403165
Transposable Elements 612466 41024 161419 410023
Putative Promoters 803342 59704 118610 625028
Other 5998040 354355 1093226 4550459

Creating the MI track

In this Jupyter notebook I generated the methylation island track using the same parameters I used for C. virginica.

Tables 2. MI characteristics after filtering by size. No MI should be less than 500 bp.

Initial Window Size (bp) mCpG Fraction *Step Size** Number of MI Number of MI after Filtering Max mCpG in MI Min mCpG in MI
500 0.02 50 38443 23173 3298 11

I then visualized the MI in this IGV session. The MI in C. gigas are much more sparse, which is different than C. virginica.

Screen Shot 2020-02-13 at 2 46 36 AM

Screen Shot 2020-02-13 at 2 46 59 AM

Screen Shot 2020-02-13 at 2 48 38 AM

Screen Shot 2020-02-13 at 2 49 12 AM

Figures 1-4. Methylation islands in the C. gigas gonad

Location in the genome

In this Jupyter notebook, I used bedtools to characterize the overlaps between C. gigas MI and genome feature tracks! I was able to run the code and get counts, but the output format for the Intron-MI and Promoter-MI files looked wonky:

Screen Shot 2020-02-13 at 9 20 36 AM

Figure 5. Weird file formatting

I posted this issue to figure out why my output looked this way, since I needed proper output to correctly count overlaps and create MI overlap figures.

Table 3. Overlaps between methylation islands and various genome features.

Feature Number of Overlaps
Exons 49607 (25.2%)
Introns 51980 (29.5%)
Genes 8761 (31.3%)
Transposable Elements 5868 (4.9%)
Putative Promoters 2746 (9.8%)
DML 384 (61.1%)

The sparse MI definitely translates to fewer overlaps between DML, but similar overlaps between MI, exons, introns, and genes in both species. Since I don’t trust the intron output, I’m unsure if the actual number here is correct, but in the offchance it is, I included it. Transposable elements and promoters in the C. virginica genome had more overlaps with MI than those same features in the C. gigas genome.

In this R Markdown file I created a standalone C. gigas MI overlap boxplot as well as a grouped barplot with C. virginica and C. gigas data. Again, I didn’t pay too much attention to the standalone plot since I figured I’d come back and use it again when I had more data. Even though my intron overlap data formatting issue prevented me from creating a boxplot with C. gigas intron information, I created a grouped barplot for C. virginica and C. gigas data. I learned you could use at to set where on the x-axis to place individual boxplots to artificially group them!

Screen Shot 2020-02-13 at 9 43 38 AM

Figure 6. Exon, intron, and transposable element overlaps. C. gigas intron-MI overlap data is missing due to output formatting issues.

I think some of the differences are a little clearer in this boxplot. Time to move on from this analysis!

Going forward

  1. Fix C. gigas intron-MI and promoter-MI formatting issues
  2. Update C. gigas MI overlap counts and grouped barplot
  3. Compare C. gigas and C. virginica DML and GOSlim terms
  4. Draft poster for ASLO and get feedback
  5. Finalize ASLO poster and send for printing by Thursday to get the ASLO discount!

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/2wahkMs

Grace’s Notebook: Sent samples to Genewiz

Today I shipped samples to Genewiz for individual RNAseq. Details in post.

GitHub Issue #832

Picking samples

Sam, Steven, and I picked samples from this spreadsheet that had Hematodinium qPCR starting quantities that looked good, and we picked samples such that we followed individuals across the different sampling time points.

All samples we picked are in this spreadsheet, and then we further reduced it to 24 samples (samples with a “yes” in column H).

Prepping samples

Sam has been in contact with Genewiz and did all the paperwork. They only need at least 20ng total of RNA, with no minimum requirement for volume.

I aliquoted out different volumes from each of the 24 samples such that we had enough left for qPCR down the line, and such that there was at least 20ng of total RNA. Samples were aliquoted into 0.5ul tubes (requirement of Genewiz).

Samples that were sent: spreadsheet here

They were sent on overnight priority shipping on dry ice.

from Grace’s Lab Notebook https://ift.tt/2tU2skL

Kaitlyn’s notebook: testing new primers on geoduck hemolymph RNA

I made cDNA from previously isolated RNA. I included a pooled sample that I used to test the primers I recently designed via qPCR.

Reverse Transcriptase

RT was performed using 100ng of each sample with M-MLV Reverse Transcriptase from Promega according to the Roberts Lab SOP. All reagents were doubled to produce twice the amount of cDNA. The pooled sample was quadrupled. A 1:1000 dilution of primer was made to increase pipetting volume, and a 1:10 dilution was made for the pooled sample. Calculations can be found here.

  • Pooled sample contains 61, 55F, 43B, and 19

Samples are stored in the -20C fridge in 209 in 20190107 Kaitlyn’s Box.


Primers were tested based on the Roberts lab qPCR SOP. The primers can be found here or they can also be found in the PrimerDatabase.

There was more pooled sample loss than I expected from RT. I originally intended to use 4ul of pooled sample per primer. Instead I could only use 1ul per sample, and I spiked in 1ul of cDNA from pre-existing cDNA so that I had enough master mix avaliable for each sample.



Well Fluor Target Content Sample Cq
A01 SYBR TIF3s6B Unkn-01 Pooled 18.2612
A02 SYBR TIF3s6B Unkn-01 Pooled 18.17076
A03 SYBR NFIP Unkn-02 Pooled 27.42391
A04 SYBR NFIP Unkn-02 Pooled 26.68092
A05 SYBR APLP Unkn-03 Pooled 27.29234
A06 SYBR APLP Unkn-03 Pooled 27.19591
A07 SYBR TIF3s10 Unkn-04 Pooled 37.34317
A08 SYBR TIF3s10 Unkn-04 Pooled 33.8457
B01 SYBR ECHD3 Unkn-05 Pooled 32.67729
B02 SYBR ECHD3 Unkn-05 Pooled 33.01202
B03 SYBR TIF3sF Unkn-06 Pooled 21.97772
B04 SYBR TIF3sF Unkn-06 Pooled 22.08964
B05 SYBR TIF3s12 Unkn-07 Pooled 32.0917
B06 SYBR TIF3s12 Unkn-07 Pooled 32.74342
B07 SYBR FEN1 Unkn-08 Pooled 34.8731
B08 SYBR FEN1 Unkn-08 Pooled NaN
C01 SYBR SPTN1 Unkn-09 Pooled 27.68292
C02 SYBR SPTN1 Unkn-09 Pooled 27.16297
C03 SYBR NSF Unkn-10 Pooled 34.57384
C04 SYBR NSF Unkn-10 Pooled 37.76697
C05 SYBR TIF3s7 Unkn-11 Pooled 34.7186
C06 SYBR TIF3s7 Unkn-11 Pooled 35.08897
C07 SYBR TIF3s8-2 Unkn-12 Pooled 22.06597
C08 SYBR TIF3s8-2 Unkn-12 Pooled 21.92805
D01 SYBR TIF3s8-1 Unkn-13 Pooled 37.13184
D02 SYBR TIF3s8-1 Unkn-13 Pooled 35.01336
D03 SYBR TIF3s4a Unkn-14 Pooled NaN
D04 SYBR TIF3s4a Unkn-14 Pooled NaN
D05 SYBR GLYG Unkn-15 Pooled 37.49771
D06 SYBR GLYG Unkn-15 Pooled 35.38684
D07 SYBR RPL5 Unkn-16 Pooled 17.46911
D08 SYBR RPL5 Unkn-16 Pooled 19.57397
E01 SYBR GSK3B Unkn-17 Pooled 19.45339
E02 SYBR GSK3B Unkn-17 Pooled 18.07122

One target primer, FEN1, only worked on one sample which may have been a technical error so it may be worth re-running. TIF3s4a, one of 9 regulatory primers, did not work on either sample, but all the others worked great!


  • Choose 1-2 regulatory primers
  • Retest FEN1, target, primer
  • Assess known samples

Sam’s Notebook: DNA Isolation, Quantification, and Gel – C.bairdi gDNA Sample 6129_403_26

In order to do some genome sequencing on C.bairid and Hematodinium, we need hihg molecular weight gDNA. I attempted this twice before, using two different methods (Quick DNA/RNA Microprep Kit (ZymoResearch) on 20200122 and the E.Z.N.A Mollusc DNA Kit (Omega) on 20200108) using ~10yr old ethanol-preserved tissue provided by Pam Jensen. Both methods yielded highly degrade gDNA. So, I’m now attempting to get higher quality gDNA from the RNAlater-preserved hemolymph pellets from this experiment.

Used the Quick DNA/RNA Microprep Kit (ZymoResearch) to perform four parallel isolations using four separate aliquots of sample 6129_403_26 with the following changes/notes:

  • Used 75uL of each hemolyph pellet slurry from each tube
  • Eluted all columns with a single 30uL elution (i.e. used 30uL for first column, then used that same elution to elute the next column, etc.) to try to concentrate the sample.

DNA was quantified using the Roberts Lab Qubit 3.0, using 1uL of sample, with the ds DNA BR Assay (Invitrogen).

Used 4uL of sample (~200ng) to run on 1x low-TAE 0.8% agarose gel.