Sam’s Notebook: MBD Enrichment – DNA Quantification of C.virginica MBD Samples from 20190312

Finished MBD enrichment on 20190312 using the MethylMiner kit. Since we were out of Qubit assay tubes, I could not quantify these samples when I initially completed the ethanol precipitation. Tubes are in, so I went forward with quantification using the Roberts’ Lab Qubit 3.0 and the 1x High Sensitivity dsDNA assay.

Used 1uL of each sample.

Yaamini’s Notebook: DML Analysis Part 29

Describing general methylation trends

A.K.A. that time I used a lot of bash for loops and awk commands while understanding what I was doing.

Total DNA recovered

I noticed Mac’s 2013 paper included how much of the original input DNA was recovered from MBD procedures. I figured it was a good thing to include in my paper too! In this spreadsheet, I documented the µg of DNA input I used from each sample. In total, I used 37.28 µg. After MBD, the samples were resuspended in 25 µL of buffer. The total yield was 1.42515 µg. This meant that only 3.82% of DNA was recovered after MBD.

Counting reads

One important part of characterizing general methylation trends is to explain how many reads were used for analysis. In this Jupyter notebook, I downloaded FastQC and bismark reports. I originally thought I should download the full trimmed and untrimmed files, but in this issue Steven pointed out that all of that information would be stored in summary reports.

I used a similar series of commands for each report to obtain read counts:

  1. Determine the for loop selects the files I want
 %%bash for f in *zip do echo ${f} done  
  1. Unzip files if needed
 %%bash for f in *zip do unzip ${f} done  
  1. Isolate the number of reads from each report and concatenate in a new file
 %%bash for f in *fastqc do grep "Total Sequences *" ${f}/fastqc_data.txt \ >> 2019-03-17-Untrimmed-Read-Counts.txt done  

I counted untrimmed reads, trimmed reads, and reads that were not mapped to the genome. For the unmapped reads, I subtracted the number of unmapped reads from the number of trimmed reads to obtain the number of reads mapped to the genome. There are 279,681,264 untrimmed reads sequence reads. Of 275,914,272 trimmed paired-end reads, 190,675,298 reads were mapped.

Identify methylated CpG loci

I want to describe general methylation trends, irrespective of pCO2 treatment in my C. virginica gonad data. Claire and Mac both had sections in their papers where determined if a CpG locus was methylated or not. From Mac’s 2013 PeerJ paper:

A CpG locus was considered methylated if at least half of the reads remained unconverted after bisulfite treatment.

They characterized this using in BSMAP, but Steven pointed out I could use .cov files in this issue. In this Jupyter notebook, I first identified loci with 5x coverage for each sample by using this awk command in a loop:

 awk '{print $1, $2-1, $2, $4, $5+$6}' ${f} | awk '{if ($5 >= 5) { print $1, $2-1, $2, $4 }}'  

The coverage file has the following columns: chromosome, start, end, methylation percentage, count methylated, and count unmethylated. In the above command, I correct the start and stop positions ($2-1, $2), and add the count methylated and unmethylated ($5+$6) in each file ${f}. If the new fifth column, total count methylated and unmethylated was greater than 5 ($5 >= 5), it meant that I had 5x coverage for that locus. I could then redirect that information to a new file.

The next step was to concatenate all loci with 5x coverage across my five control samples. In this step, I’m essentially mimicing methylKit unite. I used a series of join commands for this task, then used awk to clean up the columns.

Screen Shot 2019-03-18 at 9 15 05 PM

Screen Shot 2019-03-18 at 9 15 20 PM

Screen Shot 2019-03-18 at 9 15 31 PM

The final file with 5x loci across all samples can be found here. By summing the percent methylation information from each sample for each locus, I could characterize its methylation status and separate loci and save these loci in a new file:

Using wc -l to count the number of loci in each file, I had 63,827 total loci across all samples with at least 5x coverage. This is very different from the 14,458,703 CpG motifs across the C. virginica genome. Of the loci with at least 5x coverage, 60,552 were methylated, 2,796 were sparsely methylated, and 479 were unmethylated.

Characterize location of methylated CpG

The last step to describe methylated CpG is to figure out where the 60,552 loci are in the genome! I set up intersectBed in my Jupyter notebook to find overlaps between my methylated loci and exons, introns, mRNA coding regions, transposable elements, and putative promoter regions. When creating my BEDfile, I needed to specify tab delimiters (\t) to get intersectBed to recognize the file:

 awk '{print $1"\t"$2"\t"$3}' 2019-03-18-Control-5x-CpG-Loci-Methylated.bedgraph \ > 2019-03-18-Control-5x-CpG-Loci-Methylated.bed  

Similar to the DML, methylated CpG were primarily located in genic regions (56,055 CpG; 92.6%), with 3,083 unique genes represented. More CpG (40,127; 66.27%) overlaped with exons than with introns (17,510; 28.92%). I found 4,687 CpG (7.74%) in transposable elements and 1,221 CpG (2.02%) in putative promoter regions 1 kb upstream of transcription start sites. There were 2278 methylated loci (3.76%) that did not overlap with exons, introns, transposable elements, or putative promoters. All lists of overlapping loci can be found in this folder.


I had lots of awk and bash coding issues today BUT I FIGURED MOST OF THEM OUT MYSELF! Just in time to go on vacation and lose all of my hard-earned coding prowess.

Going forward

  1. Conduct a proportion test for DML locations
  2. Update paper methods and results
  3. Work through gene-level analysis
  4. Figure out what’s going on with DMR
  5. Figure out what’s going on with the gene background

// Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student