Sam’s Notebook: Metagenomics – P.generosa Water Sample Assembly Comparisons with Quast

Continuing work on the metagenomics project, Emma shared her “co-assembly”, so I figured it would be quick and easy to compare hers with mine and get a feel for how different/similar they might be. I did a similar comparison last week where I compared each of our individual water sample assemblies. Those results showed my assemblies generated:

  • significantly larger “largest contigs” (10 – 50x larger than Emma’s)
  • larger N50 values (~2x larger than Emma’s)
  • greater total length in bp (~1.5x more than Emma’s)
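For reference, the N50 reported by Quast can be computed from contig lengths alone; a minimal Python sketch (example lengths made up):

```python
def n50(contig_lengths):
    """Return N50: the length L such that contigs of length >= L
    cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

# Half the total (45) is 22.5; cumulative sums are 20, then 35 -> N50 = 15
print(n50([20, 15, 5, 3, 2]))  # -> 15
```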

So, I ran Quast on my computer (swoose – Ubuntu 16.04 LTS) with the following command and input FastAs:

  python \
  /home/sam/programs/quast-5.0.2/ \
  --threads=20 \
  --min-contig=100 \
  --labels=ets,sjw \
  /home/sam/data/metagenomics/P_generosa/emma_assemblies/contigs.fa \
  /home/sam/data/metagenomics/P_generosa/final.contigs.fa

Yaamini’s Notebook: DML Analysis Part 29

Refining general methylation trends

Based on some comments on my draft paper, there are some things I need to modify. I’m still a little slow from vacation, so I’ve been working on this stuff for the past few days and still have a lot more to do, but figured I should write a quick update.

To deduplicate or not to deduplicate?

Sam read the Bismark documentation and found the following statement:

This step is recommended for whole-genome bisulfite samples, but should not be used for reduced representation libraries such as RRBS, amplicon or target enrichment libraries.

Deduplication removes duplicate alignments, and based on the manual it looked like I should have deduplicated my data. I posted this issue to clarify the problem and figure out if I should deduplicate. I was confused until Mac mentioned this:

It doesn’t have to do with bisulfite sequencing or not, it has to do with type of enrichment – specifically with a restriction enzyme that cuts the same spot. For RRBS, my sequences will always start with the restriction site, then when I sequence 100bp out from there I am definitely going to get duplicated sequences because they will all line up to those cut sites. I have the same issue when I do RAD-Seq and only sequence from 1 end. I can’t deduplicate because all my reads will line up at the same place. If you’re doing whole genome or an enrichment protocol where you’ve randomly sheared across the genome (or if you’ve done paired end sequencing for RAD), then you wouldn’t expect to get sequences that are EXACTLY the same – unless they are PCR duplicates (at which point if you include the duplicates you are biasing your methylation call to the same fragment that just happened to get amplified lots of times, which is bad).

Since my data is randomly fragmented, I’m sticking with deduplication! Glad I won’t have to redo that.
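For context, deduplication tools like deduplicate_bismark keep one alignment per chromosomal position and strand, which is exactly why it's safe for randomly sheared libraries but destructive for RRBS. A minimal Python sketch of that logic (alignment records invented for illustration):

```python
# Deduplicate alignments by (chromosome, start, strand), as WGBS
# deduplication does. The alignment dictionaries below are made up.
def deduplicate(alignments):
    """Keep one alignment per (chrom, start, strand) key."""
    seen = set()
    kept = []
    for aln in alignments:
        key = (aln["chrom"], aln["start"], aln["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept

alignments = [
    {"chrom": "NC_035780.1", "start": 1000, "strand": "+"},  # original
    {"chrom": "NC_035780.1", "start": 1000, "strand": "+"},  # PCR duplicate
    {"chrom": "NC_035780.1", "start": 2500, "strand": "-"},
]
print(len(deduplicate(alignments)))  # -> 2
```

In an RRBS library, nearly every read would share a start coordinate with others from the same cut site, so this rule would throw away real data.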

Identify methylated CpG loci

Speaking of redoing things…

When I previously identified methylated loci, I used CpGs common between all five control samples. Instead, I needed to just use all loci with 5x coverage, even if that locus was only present in one sample. I modified my code in this Jupyter notebook to grab unique loci with 5x coverage from control samples:

 !cat *5x.bedgraph | sort | uniq -u > 2019-03-18-Control-5x-CpG-Loci.bedgraph  

There are 5,194,571 CpG loci with 5x coverage in my control samples. The majority of loci, 4,530,650 (87.2%), are methylated, with 470,711 (9.06%) sparsely methylated loci and 193,210 (3.72%) unmethylated loci. There are 2,255,472 (49.8%) methylated loci that overlap with exons, 1,646,352 (36.3%) with introns, 756,905 (16.7%) with transposable elements (all species), and 588,685 (13.0%) with transposable elements (C. gigas only). Putative promoter regions overlapped with 156,356 (13.5%) loci. I decided to include putative promoters in my “no overlaps” category. Additionally, Steven mentioned that the TE-all list was the canonical list for C. virginica transposable elements, because species-specific elements will be pulled from that list. I found 345,205 (6.6%) methylated loci did not overlap with exons, introns, transposable elements (all species), or putative promoters. Of the 3,853,885 (85%) loci in mRNA coding regions, there are 41,921 unique genes represented. I remade my distribution figure to reflect these changes.


Figure 1. Frequency distribution of methylation ratios for CpG loci in C. virginica gonad tissue. A total of 5,194,571 CpGs with at least 5x coverage common across all five control samples were characterized. Loci were considered methylated if they were at least 50% methylated, sparsely methylated loci were 10-50% methylated, and unmethylated loci were 0-10% methylated.
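The binning in the Figure 1 caption reduces to two thresholds; a quick Python sketch (example percentages invented):

```python
def classify_locus(percent_methylation):
    """Bin a CpG locus by methylation ratio using the Figure 1 thresholds:
    >= 50% methylated, 10-50% sparsely methylated, < 10% unmethylated."""
    if percent_methylation >= 50:
        return "methylated"
    elif percent_methylation >= 10:
        return "sparsely methylated"
    return "unmethylated"

print(classify_locus(87.5))  # -> methylated
print(classify_locus(25.0))  # -> sparsely methylated
print(classify_locus(5.0))   # -> unmethylated
```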

Revising DML code

Steven got 633 DML, while I initially got 630. I posted this issue and found that we were using different code to set coverage. I used mincov in processBismarkAln to get loci with 5x coverage right off the bat:

 processBismarkAln(location = analysisFiles, sample.id = sample.IDs, assembly = "v3", read.context = "CpG", mincov = 5, treatment = treatmentSpecification)  

Steven set mincov = 2 in processBismarkAln, then used filterByCoverage to set the final 5x coverage minimum:

 processBismarkAln(location = file.list_10,
                   sample.id = list("1","2","3","4","5","6","7","8","9","10"),
                   assembly = "v001",
                   read.context = "CpG",
                   mincov = 2,
                   treatment = c(0,0,0,0,0,1,1,1,1,1))
 filtered.myobj = filterByCoverage(myobj_10,
                                   lo.count = 5, lo.perc = NULL,
                                   hi.count = 100, hi.perc = NULL)

He suggested I follow this method, since I could then set a high coverage threshold. I also noticed that we had different assembly arguments in processBismarkAln. Turns out that’s just a naming convention so it doesn’t affect DML.

I added code to my methylKit script that used processBismarkAln with mincov = 2, then used filterByCoverage:

 #Process files for CpG methylation. Use 2x coverage for faster processing; coverage is adjusted later.
 #First five files are from ambient conditions, the second five from high pCO2 conditions.
 processedFiles <- processBismarkAln(location = analysisFiles,
                                     sample.id = sample.IDs,
                                     assembly = "v3",
                                     read.context = "CpG",
                                     mincov = 2,
                                     treatment = treatmentSpecification)
 #Filter processed files using lo.count and hi.count coverage thresholds:
 #coverage should be no less than 5 and should not exceed 100.
 processedFilteredFilesCov5 <- filterByCoverage(processedFiles,
                                                lo.count = 5, lo.perc = NULL,
                                                hi.count = 100, hi.perc = NULL)

I peeked at the %CpG methylation and CpG coverage plots to compare them to those from the 5x destranded samples I previously created. These plots have different values, so clearly my dataset has changed.



Figures 2-3. Sample 9 %CpG and CpG coverage plots.

I also evaluated clustering for the filtered 5x coverage data.





Figures 4-7. Correlation plot, full sample clustering, full sample PCA, and scree plot for filtered 5x coverage data.

Finally, I identified DML. I got 598 DML, which still doesn’t match Steven’s result. It’s also fewer DML than I had before. I exported the data to this file. I thought it may have been a versioning issue, so I updated R and methylKit. I still ended up with 598 DML, so there must be something else going on.

Recharacterizing DML locations

I characterized genomic locations for my new DML set. As expected, the majority of overlaps, 368 (61.5%), were with exons. There were also 191 DML (31.9%) that overlapped with introns. Transposable elements overlapped with 57 DML (9.5%) when all species were used, and 39 DML (6.5%) when only C. gigas was used. Putative promoter regions overlapped with 67 DML (11.2%). There were also 20 DML (3.3%) with no overlaps with exons, introns, transposable elements (all), and putative promoter regions.

Checking read calculations

I initially listed 190 million reads mapping back to the C. virginica genome from my samples, but that implied a much higher mapping rate than any individual sample’s mapping efficiency. In this Jupyter notebook, I recalculated the number of mapped reads. I pulled out the mapping efficiency and the number of sequence pairs analyzed for each sample, then used a series of bash commands to isolate the mapping efficiency percentages and multiply them by pairs analyzed:

 #Isolate mapping efficiency percentages (remove % from the end)
 #Save new document
 !cut -f2 2019-03-17-Mapping-Efficiency.txt | cut -c1-4 \
 | paste 2019-03-17-Mapping-Efficiency.txt - > 2019-03-17-Mapping-Efficiency-Percents-Included.txt
 #Divide percentages by 100
 #Save new document
 !awk -v c=100 '{ print $3/c}' 2019-03-17-Mapping-Efficiency-Percents-Included.txt \
 | paste 2019-03-17-Mapping-Efficiency-Percents-Included.txt - > 2019-03-17-Mapping-Efficiency-Divided-Percents.txt
 #Isolate pairs analyzed
 #Save in new document with divided percents
 !cut -f2 2019-03-17-Pairs-Analyzed.txt \
 | paste 2019-03-17-Mapping-Efficiency-Divided-Percents.txt - > 2019-03-17-Pairs-Analyzed-and-Mapping-Efficiency.txt
 #Multiply percentage mapped and pairs analyzed columns to obtain mapped reads
 !awk '{ print $3, $6, $5*$6}' 2019-03-17-Pairs-Analyzed-and-Mapping-Efficiency.txt \
 > 2019-03-17-Mapped-Reads.txt
 #Sum the contents of the third column ($3) to obtain the total number of paired-end reads that mapped to the genome
 !cat 2019-03-17-Mapped-Reads.txt | awk '{ sum+=$3} END {print sum}'

I found 136,186,380 paired-end reads had mapped to the genome, or 49.4%. This makes much more sense with my mapping efficiencies.
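The awk arithmetic above boils down to multiplying each sample's mapping efficiency by its pairs analyzed and summing; a Python sketch with invented sample values:

```python
# Recompute total mapped paired-end reads from per-sample mapping
# efficiency (%) and sequence pairs analyzed, mirroring the awk pipeline.
# The numbers below are made up for illustration.
samples = [
    {"pairs_analyzed": 30_000_000, "mapping_efficiency_pct": 50.0},
    {"pairs_analyzed": 25_000_000, "mapping_efficiency_pct": 48.0},
]

mapped_per_sample = [
    s["pairs_analyzed"] * s["mapping_efficiency_pct"] / 100 for s in samples
]
total_mapped = sum(mapped_per_sample)
print(int(total_mapped))  # -> 27000000
```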

Additionally, I revised my paper to include a statement about the number of loci with 5x coverage, instead of 1x coverage. In this Jupyter notebook, I set a 5x coverage threshold and counted CpGs. I have data for 911,159 CpGs, which is 62.4% of CpGs in the genome.

Updating gannet

Since I generated a lot of analytical output, I decided to upload the current version of my repository to gannet. It can be found at this link. As I continue to finalize analyses, I’ll probably keep updating the same repository link so I don’t end up with 13 different repository versions over the course of a week.

New paper repository

Steven also created a new paper repository in the NSF E20 organization. I’ll update the repository with polished scripts to ensure Katie and Alan can reproduce my results. I cross-referenced my scripts, R Markdown file, and Jupyter notebooks with my current methods and the draft paper. I also organized the files in the sequence they should be used and added short descriptions in the README.

Going forward

  1. Check that I identified methylated CpGs correctly
  2. Chi-squared tests for DML distribution
  3. Describe (somehow) genes with DML in them
  4. Figure out what’s going on with the gene background
  5. Figure out what’s going on with DMR
  6. Work through gene-level analysis
  7. Update paper repository
  8. Start writing the discussion
  9. Draft timeline to finish the paper by the end of the month


from the responsible grad student

Shelly’s Notebook: Fri. April 5, 2019 Oyster Seed Proteomics

Re-do analysis with updated selected proteins from clustering without day 0.

Kaitlin’s previous clustering analysis included day 0. She made a euclidean distance matrix and cut the dendrogram at 250 to ID 33 proteins in 35 different clusters.

In the new analysis, she made a euclidean distance matrix and cut the dendrogram at 150 to ID 135 proteins in 61 different clusters.
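The cut-height procedure above (euclidean distance matrix, dendrogram cut at a fixed height) can be sketched with scipy; this is not Kaitlin's actual script, and the toy data below are random, with a much smaller cut height than the real 150 because the scales differ:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Toy protein-abundance matrix: 20 "proteins" x 6 time/temperature points
abundances = rng.normal(size=(20, 6))

# Hierarchical clustering on euclidean distances, then cut the dendrogram
# at a fixed height: every protein pair merged below the cut ends up in
# the same cluster.
tree = linkage(abundances, method="complete", metric="euclidean")
clusters = fcluster(tree, t=3.0, criterion="distance")
print(len(set(clusters)))  # number of clusters below the cut height
```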

There are now 135 proteins ID’d by clustering, 143 proteins ID’d by ASCA, and 163 proteins ID’d by the proportions test as significantly influenced by temperature.

I made a new proportional venn diagram in R to show proteins commonly and uniquely identified by different statistical methods.
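The overlap counts behind a three-way venn diagram are just set operations; a Python sketch with invented protein IDs standing in for the real clustering/ASCA/proportions-test lists:

```python
# Invented protein IDs; the real sets come from the three methods above.
clustering = {"P1", "P2", "P3", "P4"}
asca = {"P3", "P4", "P5"}
proportions = {"P4", "P5", "P6"}

all_three = clustering & asca & proportions   # found by every method
any_method = clustering | asca | proportions  # found by at least one

print(sorted(all_three))  # -> ['P4']
print(len(any_method))    # -> 6
```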

I also made a new heatmap of these proteins. img

These figures will be the new figure 3 in the manuscript.

I also plotted heatmaps and abundance line plots of proteins identified by each individual method as a way to verify the methods are selecting proteins that have abundances affected by temperature. To see these, CHECK OUT THIS MARKDOWN FILE FOR NEW FIGURES

The markdown file was generated by this R Markdown file, so check it out if you’re interested in the R code.

Still need to:

  1. Re-run the GO analysis/cytoscape with updated protein list generated today
  2. dig into the last figure in terms of what the biological significance actually is
  3. work on the discussion

from shellytrigg

Grace’s Notebook: Annotated Protein List from MS Stats with set thresholds… in progress

Yesterday and today I got some more 2015 DIA paper-related tasks done, including making a table of the proteins detected above a log2FC of 3.00 or below -3.00, and an annotated version of that table…

Tasks completed

  1. Uploaded the skyline daily .zip file to the Roberts Lab PanoramaWeb in a new folder entitled “2015-DIA-Cgigas-seed”.
  2. Updated links in the GitHub paper-pacific.oyster-larvae/protocols

Threshold protein lists

R script

Yesterday I made one without annotation: 20190403-2015Cgseed-protcomp.csv (26 proteins).

Today I made one using Steven’s annotated protein list (20 proteins).

We cross-referenced the 20190403-2015Cgseed-protcomp.csv with Steven’s pivot table created from 0403-DIA-Cgseed.csv and determined that if the log2FC number is negative, it’s 23C, and if the log2FC is positive, it’s 29C.
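That sign convention is simple to encode; a small Python sketch (log2FC values invented for illustration):

```python
# Label a protein's direction by the sign of its log2 fold change:
# negative -> higher at 23C, positive -> higher at 29C.
def temperature_label(log2fc):
    return "23C" if log2fc < 0 else "29C"

print(temperature_label(-3.4))  # -> 23C
print(temperature_label(3.1))   # -> 29C
```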

Weekend and next week tasks

  1. Keep working on paper (intro, methods, discussion)
  2. Get results for the paper figured out- do I include the phenotype data even though the 29C data is pooled from 7 silos…?
  3. Get repository to the point where it only has the important information. I still think I have too much stuff in there. But once everything is finished, it’ll be easier to ID what needs to stay and what needs to go.

from Grace’s Lab Notebook


Shelly’s Notebook: Apr 3-4, 2019 Oyster Proteomics

General proteome characterization

For preliminary figures generally describing the types of proteins detected in the proteomes from different temperatures and time points, I wanted to do a functional classification of all proteins detected in the different samples.

In brief:

  1. I merged the average NSAF data with the uniprot mapping data and used an e-value cutoff of 10^-10. This kept about half of all the proteins detected (~3500).
  2. I extracted all the GO IDs annotated in the Uniprot database and reformatted them so that each protein is listed with only one GO ID per line, repeating the protein on a new line for each additional GO ID it has.
  3. I mapped the GO IDs to the Roberts lab GO slim terms and plotted pie charts for:
    • each proteome from a different time point and temperature
      • there were no differences between the pie charts, meaning they mostly have the same GO slim terms. This makes sense because most proteins were overlapping between time points and temperatures. See markdown file linked below for these plots.
    • comprehensive proteomes for each temperature
      • again, no difference in pie charts
    • the comprehensive proteome of all samples
      • this is what I’m thinking of reporting in the paper
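Steps 1-2 above can be sketched as a filter plus a long-format expansion; the records below are invented (CHOYP-style IDs and GO terms chosen for illustration only):

```python
# Filter Uniprot hits at e-value <= 1e-10, then expand semicolon-separated
# GO IDs so each (protein, GO ID) pair gets its own row.
records = [
    {"protein": "CHOYP_A", "evalue": 1e-30, "go_ids": "GO:0003824;GO:0008152"},
    {"protein": "CHOYP_B", "evalue": 1e-5,  "go_ids": "GO:0005515"},  # dropped
]

long_rows = [
    (rec["protein"], go_id)
    for rec in records
    if rec["evalue"] <= 1e-10
    for go_id in rec["go_ids"].split(";")
]
print(long_rows)  # -> [('CHOYP_A', 'GO:0003824'), ('CHOYP_A', 'GO:0008152')]
```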

For more details, refer to the markdown file.

Redo the PCA of the technical replicates

I removed day 0 from clustering technical replicates and generated the plots below in this markdown file and corresponding Rmarkdown file. I think the PCA of log ADJ NSAF values shows the tightest clustering between technical replicates, so I’ll replace the current PCA in Supplementary Figure 1 with this one.






Still need to work on the following:

  1. Make heat map plot of hierarchical clustered proteins, now clustered without day 0.
  2. Re-run the GO analysis/cytoscape with the updated significantly changed protein list
  3. dig into the last figures
  4. discussion

from shellytrigg


Sam’s Notebook: Metagenomics – Geoduck Water Sample Assembly Comparisons with MetaQuast

As part of addressing this GitHub Issue on assessing taxonomic diversity in our metagenomics samples, I decided to compare the individual sample assemblies I made on 20190327 using Megahit and the assemblies that Emma made. Emma utilized NGless on her cluster in Genome Sciences with the following scripts:

  #!/bin/bash
  #$ -cwd
  #$ -S /bin/bash
  #$ -o ./output
  #$ -e ./output
  #$ -l mfree=3G
  #$ -pe serial 16
  #$ -R y

  # Send email when job starts, ends or runs into an error
  #$ -m beas
  #$ -M

  #cause a failing process to trigger a job failure to make errors easier to catch
  set -eo pipefail

  #Initialize and load modules
  . /etc/profile.d/
  module purge
  module load modules{,-{init,gs/prod}} ngless/0.10.0

  #Script
  ngless \
  /net/nunn/vol1/emmats/sequencing/geo_metaG/ngl.singleLibs \
  --trace \
  --keep-temporary-files \
  -j $NSLOTS \
  -t /data/scratch/ssd

Grace’s Notebook: Re-do MS Stats; DIA 2015 Cgseed Updates

Today I got some more input from Emma regarding the reports to export from Skyline to analyze. I exported a new MS Stats report (the built-in one from Skyline) and also made the bioreplicates in the file accurate (details in post). Additionally, I exported a new report that includes information that should help us get to the point of comparing 23C and 29C proteins, and identifying the proteins expressed in total.

New reports

New MS Stats report
New protein list

The new protein list integrates the transitions for each peptide. However, we are still trying to figure out how to integrate the peptides for each protein (GitHub #666).

I also spoke with Yaamini, and she told me that we should only look at proteins that have 3 or more peptides (ones with 2 or 1 should be discarded). For proteins with 3 or more peptides, the three most abundant peptides should be used.
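That filtering rule could be sketched like this in Python (peptide abundance values invented):

```python
# Discard proteins with fewer than 3 peptides; for the rest, keep only
# the 3 most abundant peptides.
def top_peptides(peptide_abundances, keep=3):
    if len(peptide_abundances) < keep:
        return []  # protein discarded
    return sorted(peptide_abundances, reverse=True)[:keep]

print(top_peptides([120.0, 80.0, 300.0, 50.0]))  # -> [300.0, 120.0, 80.0]
print(top_peptides([10.0, 5.0]))                 # -> []
```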

Re-do MS Stats with new report

New MS Stats script

Products from new MS Stats:

Thoughts on new MS Stats

I haven’t delved too much into the plots, except for the Volcano Plot.

It is very different from the one I did with the earlier MS Stats report (the one I created by hand).

The one from before had the bioreplicates wrong. I had:
| Temp trtmnt | sample day | sample | bioreplicate |
|-------------|------------|--------|--------------|
| 23C         | 9/11/15    | 1      | 1            |
| 29C         | 9/11/15    | 2      | 2            |
| 23C         | 9/14/15    | 13     | 1            |
| 29C         | 9/14/15    | 14     | 2            |

And it showed that there was 1 significantly downregulated protein (old volcano plot).

Today, Steven said that bioreplicates should actually be within treatment. So I fixed the bioreplicates for today’s MS Stats report and MS Stats analysis in R. The bioreplicates are now:
| Temp trtmnt | sample day | sample | bioreplicate |
|-------------|------------|--------|--------------|
| 23C         | 9/11/15    | 1      | 1            |
| 29C         | 9/11/15    | 2      | 1            |
| 23C         | 9/14/15    | 13     | 2            |
| 29C         | 9/14/15    | 14     | 2            |

Maybe that’s why the new volcano plot has no significant proteins while the last plot did.

Next steps:


  • Get protein lists: total proteins, and (if temp trtmnts are different) proteins for each temp trtmnt
  • Clean up repo: readme files; only include important stuff
  • Finish paper by April 17th

from Grace’s Lab Notebook

Yaamini’s Notebook: April 2019 Goals


It’s cherry blossom season! Which also means I’m back from vacation and it’s time to set some new goals.

March Goals Recap:


Gigas Broodstock:

  • Putting this on hold until I can figure out how to deal with low-volume samples


April Goals


  • Revise methods and results for general methylation and DML descriptions
  • Modify paper repo so analysis is easily reproducible
  • Conduct gene-level analysis
  • Sort out issues with DMR and methylKit
  • Draft full paper

Virginica Sperm:

  • Identify samples for RNA-Seq
  • Extract DNA from mantle samples
  • Do MBD for sperm and mantle DNA


  • Find GSR
  • Schedule new committee meeting


from the responsible grad student