Kaitlyn’s notebook: Options for cluster analysis


Now that I have unique proteins, I need to map them back to biological processes. Gene enrichment analysis identifies which genes are over-represented in a gene list relative to a background set. A single biological process maps to multiple genes, so a biological process (or cellular component, or molecular function) is considered enriched based on the combined changes in its genes in response to a treatment or phenotype. Therefore, the background list of genes you use when identifying enriched processes is very important.

My current list of unique proteins examines two of my three silos. I have a couple options with this data for a gene list:

  • split up silos and examine enrichment based on unique proteins per silo
  • look at enriched processes of all unique proteins

For my background I can include:

  • all detected proteins in the experiment (all silos)
  • detected proteins in silo 3 and 9 only
  • all unique proteins in silo 3 and 9.

I’m not sure which background would be most useful for determining the differences between temperatures.
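To see why the background choice matters, here is a minimal sketch of an over-representation test using a one-sided hypergeometric p-value. All counts are hypothetical and the function name `enrichment_p_value` is made up for illustration; enrichment tools may use a modified statistic, so treat this as the general idea rather than any specific tool's method.

```python
from math import comb


def enrichment_p_value(background_size, term_in_background, list_size, term_in_list):
    """One-sided hypergeometric p-value: P(X >= term_in_list).

    X is the number of term-annotated proteins seen when list_size proteins
    are drawn without replacement from the background.
    """
    upper = min(term_in_background, list_size)
    favorable = sum(
        comb(term_in_background, k)
        * comb(background_size - term_in_background, list_size - k)
        for k in range(term_in_list, upper + 1)
    )
    return favorable / comb(background_size, list_size)


# Hypothetical counts: a list of 200 proteins, 40 of them annotated to one process.
# Background A: 2000 detected proteins, 150 annotated to the process.
# Background B: 1000 detected proteins, 120 annotated to the process.
p_large_background = enrichment_p_value(2000, 150, 200, 40)
p_small_background = enrichment_p_value(1000, 120, 200, 40)
print(p_large_background, p_small_background)
```

With these made-up numbers the same 40 hits look much more surprising against the larger background, because the process is rarer there. The same list can therefore gain or lose enriched terms depending only on which background is supplied.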

I want to visualize the differences that are occurring between the silos in addition to identifying the changes between biological processes. Heatmaps are a great way to view that information. Considering I have 1532 unique proteins between the silos, a heatmap of all of them would be difficult to interpret. It would be nice if I could incorporate biological process into the heatmap, but I am not sure how to do that.

When Shelly and I discussed this, I thought that I could map the parent GO term back to the protein using the UniProt accession code and then plot the abundances of the proteins underneath the parent term in a heatmap. I don’t know if protein abundances can be representative of enriched processes, though, since the group of genes is determined independently of any quantitative value.
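A minimal sketch of that mapping idea, assuming a lookup from accession to parent GO term already exists. The accessions, terms, silo labels, and abundance values below are all hypothetical, and `summarize_by_term` is an illustrative helper. It only averages abundances per term and silo, so it summarizes abundance and says nothing about enrichment.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lookup: protein accession -> parent GO term.
parent_term = {
    "ACC1": "response to heat",
    "ACC2": "response to heat",
    "ACC3": "translation",
}

# Hypothetical abundances per protein and silo.
abundance = {
    "ACC1": {"silo3": 10.0, "silo9": 30.0},
    "ACC2": {"silo3": 20.0, "silo9": 50.0},
    "ACC3": {"silo3": 5.0, "silo9": 5.0},
}


def summarize_by_term(parent_term, abundance):
    """Return {term: {silo: mean abundance of the proteins under that term}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for accession, term in parent_term.items():
        for silo, value in abundance.get(accession, {}).items():
            grouped[term][silo].append(value)
    return {
        term: {silo: mean(values) for silo, values in silos.items()}
        for term, silos in grouped.items()
    }


term_matrix = summarize_by_term(parent_term, abundance)
for term, silos in term_matrix.items():
    print(term, silos)
```

The resulting term-by-silo table has one row per parent term instead of one row per protein, which is the kind of reduced matrix a heatmap could display.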

Terms to refer back to for DAVID…

  • RT: related term (genes that share similar sets of annotation terms are in the same biological mechanism) by Kappa value (closer to 1 = better)
  • COG ontology: NCBI’s COG (Clusters of Orthologous Groups of proteins) for genome scale analysis of proteins functions and evolution
  • SP_PIR_KEYWORDS: SwissProt/Uniprot and PIR keywords
  • UP_SEQ_FEATURE: annotation category, Uniprot Sequence Feature at the Uniprot site
  • SCOP_..: Protein structure
  • GO FAT: filters out broad GO terms based on a “measured specificity” of each term (not by level)
    • GO_ALL is all levels
    • GO_GP_1 is level 1
      • Information typically increases with levels as nodes annotate deeper, however information can decrease if a node does not annotate any deeper
    • GO_Direct: directly annotated by source database without any parent terms (but low specificity)
  • Functional Annotation Clustering report: groups and displays similar annotations together, which makes the biology easier to understand than in the traditional Functional Annotation Chart
    • Initial Group Members- min gene number in seeding group (lower = more genes/functional group/many small groups)
    • Final Group Members- min gene number after cleanup (number of genes cluster must have to be presented)
    • Multi-Linkage Threshold- how seeding groups are merged (higher % = sharper separation)
    • The clustering report typically contains more than the chart b/c of significant neighbors
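The Kappa value mentioned under RT measures how much two genes agree in their annotation terms beyond what chance would give. A minimal sketch, assuming each gene is represented as a binary vector over the same ordered list of terms (1 = annotated, 0 = not). This is plain Cohen's kappa; the vectors are hypothetical and the exact procedure DAVID uses may differ in detail.

```python
def kappa(profile_a, profile_b):
    """Cohen's kappa for two equal-length binary annotation profiles."""
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and the same length")
    n = len(profile_a)
    # Observed agreement: terms where both genes are annotated or both are not.
    agree = sum(1 for a, b in zip(profile_a, profile_b) if bool(a) == bool(b))
    observed = agree / n
    # Agreement expected by chance from each gene's annotation frequency.
    freq_a = sum(1 for a in profile_a if a) / n
    freq_b = sum(1 for b in profile_b if b) / n
    expected = freq_a * freq_b + (1 - freq_a) * (1 - freq_b)
    if expected == 1.0:
        return 1.0  # both profiles are constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical annotation profiles over six terms.
gene_1 = [1, 1, 1, 0, 0, 0]
gene_2 = [1, 1, 0, 0, 0, 1]
print(kappa(gene_1, gene_2))
```

Identical profiles give a value of 1 and unrelated profiles give values near 0, which matches the "closer to 1 = better" note above.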

Sam’s Notebook: qPCRs – Ronit’s C. gigas ploidy/desiccation/heat stress cDNA (1:5 dilution)


IMPORTANT: The cDNA used for the qPCRs described below was a 1:5 dilution of Ronit’s cDNA made 20181017 with the following primers! Diluted cDNA was stored in his -20°C box with his original cDNA.

The following primers were used:

18s (18S ribosomal RNA)

  • Cg_18s_F (SR ID: 1408)
  • Cg_18s_R (SR ID: 1409)

EF1 (elongation factor 1)

  • EF1_qPCR_5′ (SR ID: 309)
  • EF1_qPCR_3′ (SR ID: 308)

HSC70 (heat shock cognate 70)

  • Cg_hsc70_F (SR ID: 1396)
  • Cg_hsc70_R2 (SR ID: 1416)

HSP90 (heat shock protein 90)

  • Cg_Hsp90_F (SR ID: 1532)
  • Cg_Hsp90_R (SR ID: 1533)

DNMT1 (DNA methyltransferase 1)

  • Cg_DNMT1_F (SR ID: 1511)
  • Cg_DNMT1_R (SR ID: 1510)

Prx6 (peroxiredoxin 6)

  • Cg_Prx6_F (SR ID: 1381)
  • Cg_Prx6_R (SR ID: 1382)

Samples were run on Roberts Lab CFX Connect (BioRad). All samples were run in duplicate. See qPCR Report (Results section) for plate layout, cycling params, etc.

qPCR master mix calcs (Google Sheet):

Yaamini’s Notebook: DML Analysis Part 13

IN PROGRESS: Different mincov values in methylKit

Using this R Markdown file, I tested the effect of different mincov values on sample clustering and DMLs produced. After discussing methods in this issue, I went through this process with both Steven’s samples and my own samples.

Steven’s samples

All of my output from this analysis can be found here. Below are some highlights:

Figures 1-3. Percent CpG coverage for all samples using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

Figures 4-6. Percent CpG methylation for all samples using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

Figures 7-9. Full sample CpG methylation clustering using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

Figures 10-12. PCA of full sample methylation using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

I also wrote out differentially methylated loci that were at least 50% different between my treatment and control for mincov = 1, mincov = 3, and mincov = 5. I haven’t dug into what the exact differences are between these files, but there are at least differences in the number of DMLs produced.
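To illustrate how the coverage cutoff interacts with the 50% difference threshold, here is a minimal plain-Python sketch. It is not the methylKit code: the locus IDs and read counts are hypothetical, `candidate_dmls` is an illustrative helper, and it leaves out the statistical test that a real DML call relies on.

```python
def percent_methylation(methylated_reads, total_reads):
    """Percent of reads at a locus that are methylated."""
    return 100.0 * methylated_reads / total_reads


def candidate_dmls(loci, mincov, min_difference=50.0):
    """Keep loci covered by at least mincov reads in both groups whose
    methylation differs by at least min_difference percentage points."""
    kept = []
    for locus in loci:
        t_meth, t_cov = locus["treatment"]
        c_meth, c_cov = locus["control"]
        if t_cov < mincov or c_cov < mincov:
            continue  # not enough reads to trust this locus at this cutoff
        diff = percent_methylation(t_meth, t_cov) - percent_methylation(c_meth, c_cov)
        if abs(diff) >= min_difference:
            kept.append((locus["id"], diff))
    return kept


# Hypothetical loci: (methylated reads, total reads) per group.
loci = [
    {"id": "locus_a", "treatment": (1, 1), "control": (0, 2)},  # low coverage
    {"id": "locus_b", "treatment": (4, 5), "control": (1, 5)},  # 80% vs 20%
    {"id": "locus_c", "treatment": (3, 6), "control": (2, 6)},  # small difference
]

for mincov in (1, 3, 5):
    print(mincov, candidate_dmls(loci, mincov))
```

In this toy data the low-coverage locus shows a large difference from only a few reads, so it passes at mincov = 1 but drops out at the stricter cutoffs.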

Table 1. The mincov metric, total number of loci produced, and the number of DMLs that were at least 50% different between treatment and control samples. More restrictive mincov metrics produced fewer significantly different DMLs.

mincov    Total Loci    Number of Significantly Different DMLs
1         1112085       4904
3         670301        1398
5         503780        816
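One quick way to compare the rows of Table 1 is to express DMLs as a share of the loci retained at each mincov. A short sketch using the counts from the table (the variable and function names are only illustrative):

```python
# mincov -> (total loci, DMLs at least 50% different), copied from Table 1.
table_1 = {
    1: (1112085, 4904),
    3: (670301, 1398),
    5: (503780, 816),
}


def dml_percent(total_loci, dml_count):
    """DMLs as a percentage of all loci retained at a given mincov."""
    return 100.0 * dml_count / total_loci


dml_share = {mincov: dml_percent(*counts) for mincov, counts in table_1.items()}
for mincov, share in dml_share.items():
    print(f"mincov = {mincov}: {share:.2f}% of loci are DMLs")
```

The share of loci called as DMLs shrinks as mincov increases, so the stricter cutoffs remove DMLs faster than they remove loci overall.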

One thing that was concerning about the pipeline is that I kept getting this warning:

 glm.fit: fitted probabilities numerically 0 or 1 occurred

My samples

I went through the bismark pipeline in my Jupyter notebook to get my deduplicated and sorted files. Initially I tried using bismark_methylation_extractor, but I was unable to extract methylation data for all files before genefish ran out of space (again…RIP). I moved all my large files to gannet and decided it probably wasn’t worth extracting the methylation data from genefish since I already have the pipeline running on Mox. If I have some downtime, I can always change the code so I’m running bismark_methylation_extractor from gannet.

All output from methylKit testing for my samples can be found here.

Figures 13-15. Percent CpG coverage for all samples using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

Figures 16-18. Percent CpG methylation for all samples using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

Figures 19-21. Full sample CpG methylation clustering using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

Figures 22-24. PCA of full sample methylation using a) mincov = 1 b) mincov = 3 or c) mincov = 5.




from the responsible grad student https://ift.tt/2PBSgCW

Ronit’s Notebook: Desiccation/Elevated Temperature Samples qPCR (HSP90, EF1)

Today, I ran a qPCR assay with the cDNA from the desiccation + elevated temperature samples. I examined heat shock protein 90 (HSP90) and elongation factor 1 (EF1, normalization gene) and ran each sample in duplicate for each primer pair. I created a mastermix with 20 µL of forward primer, 20 µL of reverse primer, 400 µL of 2x qPCR master mix, and 320 µL of DEPC-treated water. 19 µL of the mastermix was put into each well, followed by 1 µL of cDNA for each sample (water for the negative controls and gDNA for the positive controls).
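The volumes above add up to 760 µL, which at 19 µL per well corresponds to 40 reactions. A minimal sketch of that arithmetic, assuming the per-reaction volumes are simply the stated totals divided by 40. The per-reaction values and the `master_mix` helper are inferred for illustration and are not taken from the original calculations.

```python
# Per-reaction volumes in µL, inferred by dividing the stated totals by 40.
PER_REACTION_UL = {
    "forward primer": 0.5,
    "reverse primer": 0.5,
    "2x qPCR master mix": 10.0,
    "DEPC-treated water": 8.0,
}


def master_mix(n_reactions):
    """Scale each per-reaction volume to the requested number of reactions."""
    return {reagent: volume * n_reactions for reagent, volume in PER_REACTION_UL.items()}


mix = master_mix(40)
total_ul = sum(mix.values())
per_well_ul = total_ul / 40
for reagent, volume in mix.items():
    print(f"{reagent}: {volume} µL")
print(f"total: {total_ul} µL, {per_well_ul} µL per well")
```

Scaling to 40 reactions reproduces the 20/20/400/320 µL recipe, and the 19 µL of mix plus 1 µL of template gives a 20 µL reaction.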

Sam confirmed that the elongation factor primer works, so we should hopefully see amplification for the EF1 primer. I’ll be in next week to check up on the data/run some more RNA extractions.

Plate map
The wells are organized by rows and are in the order of: D01, D02, D09, D10, D11, D12, D19, D20, T01, T02, T09, T10, T11, T12, T19, T20 (samples cover 2 rows).

A1-A12 and B1-B4 are the EF1 replicate 1 samples.
C1-C12 and D1-D4 are the EF1 replicate 2 samples.
E1-E12 and F1-F4 are the HSP90 replicate 1 samples.
G1-G12 and H1-H4 are the HSP90 replicate 2 samples.
H5 is the no template control for EF1. H6 is the control with gDNA for EF1. H7 is the replicate no template control for EF1. H8 is the replicate control with gDNA for EF1.
H9 is the no template control for HSP90. H10 is the control with gDNA for HSP90. H11 is the replicate no template control for HSP90. H12 is the replicate control with gDNA for HSP90.
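The layout described above can be written out as a well-by-well lookup, which may help when matching Cq values back to samples. A minimal sketch, assuming samples fill each block in the listed order (12 wells in the first row, then 4 in the next). The labels "NTC" and "gDNA" and the function name are shorthand introduced here.

```python
SAMPLES = [
    "D01", "D02", "D09", "D10", "D11", "D12", "D19", "D20",
    "T01", "T02", "T09", "T10", "T11", "T12", "T19", "T20",
]

# (target, replicate, row holding wells 1-12, row holding wells 1-4)
SAMPLE_BLOCKS = [
    ("EF1", 1, "A", "B"),
    ("EF1", 2, "C", "D"),
    ("HSP90", 1, "E", "F"),
    ("HSP90", 2, "G", "H"),
]

# Controls occupy H5-H12: no template control, then gDNA control, per replicate.
CONTROLS = [
    ("EF1", 1, "NTC"), ("EF1", 1, "gDNA"), ("EF1", 2, "NTC"), ("EF1", 2, "gDNA"),
    ("HSP90", 1, "NTC"), ("HSP90", 1, "gDNA"), ("HSP90", 2, "NTC"), ("HSP90", 2, "gDNA"),
]


def build_plate_map():
    """Return {well: (target, replicate, sample or control label)}."""
    plate = {}
    for target, replicate, first_row, second_row in SAMPLE_BLOCKS:
        wells = [f"{first_row}{col}" for col in range(1, 13)]
        wells += [f"{second_row}{col}" for col in range(1, 5)]
        for well, sample in zip(wells, SAMPLES):
            plate[well] = (target, replicate, sample)
    for col, control in zip(range(5, 13), CONTROLS):
        plate[f"H{col}"] = control
    return plate


plate_map = build_plate_map()
print(len(plate_map), "wells in use")
print("A1:", plate_map["A1"], "| H12:", plate_map["H12"])
```

Wells B5-B12, D5-D12, and F5-F12 do not appear in the map because the description leaves them unused.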