Ronit’s Notebook: Generating Plots for qPCR Data

Today, I made a GitHub account and generated box plots for all my qPCR data using R. Attached below are pictures of the box plots:


Shelly’s Notebook: Tues. Jan 15, 2019, Oyster Seed Proteomics, filter inconsistently detected proteins and cluster replicates

This entry refers to the R Markdown file here.

I categorized proteins as being inconsistently detected between technical replicates if their adjusted normalized spectral abundance factor (ADJNSAF) values showed a standard deviation > 5 and a percent error > 20. For instance, this would remove a protein that was detected at 100 in one replicate and at 0 in the other.

In total, this filter removed 3,396 proteins, leaving 4,596 proteins.
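The filter described above can be sketched in Python as follows. The two cutoffs come from this entry; everything else is an assumption for illustration, since the real calculation lives in the R Markdown file. In particular, "percent error" is taken here to mean the absolute difference between the two replicates as a percentage of their mean, and the function names and the three example proteins are hypothetical.

```python
from statistics import mean, stdev

SD_CUTOFF = 5.0    # from the entry: standard deviation > 5
PCT_CUTOFF = 20.0  # from the entry: percent error > 20


def percent_difference(a, b):
    """Assumed definition: |a - b| as a percentage of the replicate mean."""
    avg = mean([a, b])
    if avg == 0:
        return 0.0
    return abs(a - b) / avg * 100.0


def is_inconsistent(a, b):
    """A protein is flagged only if BOTH cutoffs are exceeded."""
    # stdev() is the sample standard deviation (n - 1 denominator).
    return stdev([a, b]) > SD_CUTOFF and percent_difference(a, b) > PCT_CUTOFF


# Hypothetical ADJNSAF values for two technical replicates per protein.
replicates = {
    "protA": (100.0, 0.0),   # detected in only one replicate
    "protB": (10.0, 10.5),   # close agreement
    "protC": (50.0, 58.0),   # SD above 5, but percent difference below 20
}

# Keep consistent proteins and average their technical replicates.
kept = {
    name: mean(values)
    for name, values in replicates.items()
    if not is_inconsistent(*values)
}
```

Under these assumptions only protA is removed: protC exceeds the standard deviation cutoff but not the percent cutoff, so it stays.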

I then redid the PCA on these 4,596 proteins:

| filtered proteins | not filtered |
| --- | --- |
| unnamed-chunk-4-1.png | unnamed-chunk-9-1.png |

| filtered proteins | not filtered log vals |
| --- | --- |
| unnamed-chunk-4-1.png | unnamed-chunk-10-1.png |

And redid the NMDS analyses:

| filtered proteins | not filtered |
| --- | --- |
| unnamed-chunk-5-1.png | unnamed-chunk-11-1.png |

| filtered proteins | not filtered log vals |
| --- | --- |
| unnamed-chunk-5-1.png | unnamed-chunk-12-1.png |

I think the NMDS on the filtered proteins shows the most agreement between technical replicates, but I’m interested to see what others think.

For now, I went ahead with this filtered data set, averaged the NSAF values for technical replicates, and will use those averages in downstream analyses. The filtered dataset with averaged technical replicate NSAF values is called ‘silo3and9_nozerovals_noincnstprot.csv’ and can be found here.

from shellytrigg

Yaamini’s Notebook: DML Analysis Part 21

Examining sample clustering

Something Shelly brought up at the end of last quarter is how odd my sample clustering is. Previously, I compared dendrograms and PCA plots for my samples using different mincov settings for methylKit. Of the settings I used, mincov = 3 produced the best clustering and PCA output:



Figures 1-2. Dendrogram and PCA plots for C. virginica gonad sequence data using mincov = 3.

She suggested I revisit these plots to see if I could improve clustering by changing my alignment stringency in bismark. HJ mentioned that looking at SNP data may also help explain my poor clustering. Looking at these plots again, I see that O1 is farther from the other treatment samples in the PCA, and very separated in the dendrogram. This sample also had the lowest mapping efficiency. I decided to see what happened to clustering if I removed that sample before looking into different alignments or SNPs.



Figures 3-4. Dendrogram and PCA plots for sequence data, omitting sample 1.

Without sample 1, the clustering in the PCA looked a bit better. The red samples are from the control treatment, while the blue samples are the high pCO2 treatment. It could be that there’s no coordinated methylation response to ocean acidification, or that alignment stringency or SNPs are affecting clustering. I have to do some more digging.
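One rough way to put a number on "this sample sits far from the others" before dropping it is to compare each sample's mean distance to all other samples. The Python sketch below is only an illustration of that idea, not the methylKit workflow used here. The sample names and percent methylation values are made up, and Euclidean distance is an assumed choice of metric.

```python
import math

# Hypothetical percent methylation at four loci for four samples.
profiles = {
    "S1": [80.0, 10.0, 55.0, 30.0],
    "S2": [78.0, 12.0, 57.0, 28.0],
    "S3": [82.0, 9.0, 53.0, 33.0],
    "S4": [60.0, 35.0, 20.0, 70.0],
}


def mean_distance_to_others(samples):
    """Mean Euclidean distance from each sample to every other sample."""
    scores = {}
    for name, values in samples.items():
        distances = [
            math.dist(values, other_values)
            for other_name, other_values in samples.items()
            if other_name != name
        ]
        scores[name] = sum(distances) / len(distances)
    return scores


scores = mean_distance_to_others(profiles)

# The sample with the largest mean distance is the outlier candidate.
outlier = max(scores, key=scores.get)

# Drop it before redoing the clustering and PCA.
reduced = {name: vals for name, vals in profiles.items() if name != outlier}
```

A distance score like this only flags a candidate. Whether to drop the sample still depends on independent evidence such as mapping efficiency.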

Going forward

  1. See how alignment stringency or SNPs affect clustering
  2. Determine if a formal gene enrichment is necessary
  3. If necessary, select the most appropriate gene enrichment method
  4. Describe functions of most interesting genes with DML and DMR


from the responsible grad student

Yaamini’s Notebook: DML Analysis Part 20

Proportion test results

So far, I’ve used bedtools to find overlaps between DML, DMR, the gene background, and various genome features (exons, introns, mRNA coding regions, and transposable elements). I calculated proportions between DML, DMR, and genome features in this Jupyter notebook, and overlap proportions between the gene background and genome features in this Jupyter notebook. The gene background refers to the output from unite in methylKit (see methylKit script for more information).

My next step was to see if these proportions were significantly different from each other using prop.test in this R Markdown file. I pulled the number of overlaps from my Jupyter notebooks and used that as the number of successes. The line counts for each genome feature file were used as totals. I compared all three proportions, but also did pairwise comparisons between the gene background and either DML or DMR. My prop.test output can be found in this file and in Table 1 below.

Table 1. Results from prop.test in R. Test results are organized first by the genomic feature overlaps being tested (ex. exons), then by the comparisons included. “All” refers to DML, DMR, and gene background proportions, “DML-GB” for only DML and gene background proportions, and “DMR-GB” for only DMR and gene background proportions. Significant p-values at the 0.05 level are bolded.

| Feature | Test | Chi-Squared Statistic | df | P-Value |
| --- | --- | --- | --- | --- |
| Exon | All | 63.44 | 2 | **1.67e-14** |
| Exon | DML-GB | 25.58 | 1 | **4.24e-07** |
| Exon | DMR-GB | 36.64 | 1 | **1.42e-09** |
| Intron | All | 136.63 | 2 | **2.15e-30** |
| Intron | DML-GB | 19.59 | 1 | **9.62e-06** |
| Intron | DMR-GB | 115.04 | 1 | **7.73e-27** |
| mRNA | All | 10.18 | 2 | **0.006** |
| mRNA | DML-GB | 2.85 | 1 | 0.09 |
| mRNA | DMR-GB | 6.43 | 1 | **0.01** |
| Transposable Elements (All) | All | 26.13 | 2 | **2.12e-06** |
| Transposable Elements (All) | DML-GB | 0.67 | 1 | 0.41 |
| Transposable Elements (All) | DMR-GB | 24.15 | 1 | **8.90e-07** |
| Transposable Elements (Cg) | All | 14.62 | 2 | **0.0007** |
| Transposable Elements (Cg) | DML-GB | 8.18 | 1 | **0.004** |
| Transposable Elements (Cg) | DMR-GB | 5.48 | 1 | **0.02** |
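For readers who want to see what a test of equal proportions computes, here is a minimal Python sketch of the Pearson chi-squared version. It is not the R code used for Table 1. The function names are hypothetical, and the counts in the example call are made up because the real overlap counts live in the linked notebooks. One known difference: R's prop.test applies a continuity correction by default when comparing two groups, and this sketch does not, so two-group statistics would differ slightly.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail chi-squared probability, closed forms for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def prop_test(successes, totals):
    """Chi-squared test that all groups share one underlying proportion."""
    pooled = sum(successes) / sum(totals)
    stat = 0.0
    for hits, n in zip(successes, totals):
        expected_hits = n * pooled
        expected_misses = n * (1.0 - pooled)
        stat += (hits - expected_hits) ** 2 / expected_hits
        stat += ((n - hits) - expected_misses) ** 2 / expected_misses
    df = len(successes) - 1
    return stat, df, chi2_sf(stat, df)


# Hypothetical counts: overlaps (successes) out of feature-file lines (totals).
stat, df, p_value = prop_test([30, 50], [100, 100])
```

As a sanity check on the tail formula, `chi2_sf(63.44, 2)` evaluates to about 1.67e-14, which agrees with the Exon "All" row of Table 1.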

When comparing all three proportions, the proportions were significantly different from each other for every genome feature. For the DML-GB tests, all comparisons were significant except for the mRNA and transposable elements (all) overlaps. It was interesting that the overlap proportions were significantly different for exons and introns, but not mRNA. All DMR-GB comparisons were significant as well. The differences in significance between DML-GB and DMR-GB could be attributed to the way I calculated overlaps. Each overlapping region is listed as one line entry by bedtools. DMR overlapping regions can be multiple base pairs long because each DMR is 100 bp. However, DML and gene background overlapping regions can only be one base pair because DML and the gene background are each listed locus by locus. It will be interesting to calculate the actual length of each DMR overlap, then use that in a proportion test.
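Counting overlapped base pairs instead of overlap lines could look something like the Python sketch below. It is an illustration only: the coordinates are made up, the function names are hypothetical, intervals are assumed to be half-open (start, end) pairs on a single chromosome, and the feature intervals are assumed not to overlap each other (otherwise bases would be counted twice).

```python
def overlap_length(a, b):
    """Number of bases shared by two half-open (start, end) intervals."""
    (a_start, a_end), (b_start, b_end) = a, b
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def total_overlap(regions, features):
    """Total bases of `regions` covered by `features` (same chromosome)."""
    return sum(overlap_length(r, f) for r in regions for f in features)


# Hypothetical 100 bp DMRs and exon intervals.
dmrs = [(1000, 1100), (5000, 5100)]
exons = [(1050, 1300), (4900, 5020)]

overlap_bases = total_overlap(dmrs, exons)           # bases of DMR inside exons
dmr_bases = sum(end - start for start, end in dmrs)  # total bases in DMRs
```

With base-pair counts, the number of overlapping bases would serve as the successes in the proportion test, which puts DMR on the same per-base footing as DML and the gene background.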

For now, I can conclude that DML and DMR locations are different from the gene background’s location. That will be interesting to interpret in my paper!

Going forward

  1. See how min_cov, alignment stringency, or SNPs affect clustering
  2. Determine if a formal gene enrichment is necessary
  3. If necessary, select the most appropriate gene enrichment method
  4. Describe functions of most interesting genes with DML and DMR


from the responsible grad student

Grace’s Notebook: First Attempts - Trying new BLAST+ 2.8.1 with new taxid options

Today I’m trying to get started on the new BLAST+ 2.8.1, which has new databases and improved performance. This is pretty exciting because once I figure out how this works, I’ll be able to easily get taxonomy information like Order, Class, etc. I’m attempting what I believe is the first step in this process: trying to get the taxid for “Decapoda”. Details of the resources I used and what I did are below.

New BLAST nt taxonomy: Step 1 - trying to get Decapoda taxid

BLAST+ 2.8.1 information

New BLAST taxonomy options

Sam installed the new version on Mox:

And the databases: ```/gscratch/srlab/blastdbs/ncbi-nr-nt-v5```

Sam recommends that each user install the eDirect utilities in order to use the full taxid functionality. Instructions on installing eDirect: here.

My eDirect directory (after installation) lives in my home directory on Mox:

![img](../notebook-images/edirect-blast-directory.png)

Contents of eDirect:

![img](../notebook-images/edirect-contents.png)

The new BLAST taxonomy method starts with (this example was taken from the pdf):

-n Enterobacterales

Taxid: 91347
rank: order
division: enterobacteria
scientific name: Enterobacterales
common name:

1 matches found

-t 91347 > 91347.txids

blastn -db nt -query QUERY -taxidlist 91347.txids -outfmt 7 -out

#### So, if I'm understanding this correctly, I am going to:

1. Run the first command: ``` -n Decapoda``` That will give me a taxid number.
2. Use that taxid number to create a file with the extension ```.txids``` that will contain the parts of the overall nucleotide taxonomy database that are included in that taxid number: ``` -t ##### > #####.txids```
3. Perform a ```blastn``` with my query.fa (the assembled _C. bairdi_ transcriptome) against the taxid list (file with extension ```.txids```) to find all of the sequences that are associated with Decapoda, and put it into an output file (extension ```.tab```).

#### Here's what I have currently in queue on Mox:


```
## Job Name
#SBATCH --job-name=get_species_taxid
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=4-00:00:00
## Memory per node
#SBATCH --mem=100
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/srlab/graceac9/analyses/0115-get-species-taxids

# Load Python Mox module for Python module availability
module load intel-python3_2017

/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin \
-n Decapoda -out 0115-get_species_taxid_decapod.txt
```

Not sure if this will actually do anything… but we’ll see once Sam’s job finishes.
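Once the taxid lookup returns something, the hand-off between the lookup and the blastn step could be scripted along these lines. This is a hedged Python sketch, not something that has been run on Mox. It only parses text shaped like the Enterobacterales example above and assembles a blastn argument list using the flags from that same example. The function names and the `blast_output.tab` file name are hypothetical, and nothing here actually calls BLAST.

```python
import re


def parse_taxid(lookup_output):
    """Pull the numeric taxid out of text containing a 'Taxid: NNNN' field."""
    match = re.search(r"Taxid:\s*(\d+)", lookup_output)
    if match is None:
        raise ValueError("no 'Taxid:' field found in lookup output")
    return int(match.group(1))


def blastn_args(query, taxid, out_path, db="nt"):
    """Build a blastn argument list limited to the taxids in <taxid>.txids."""
    return [
        "blastn",
        "-db", db,
        "-query", query,
        "-taxidlist", f"{taxid}.txids",
        "-outfmt", "7",
        "-out", out_path,
    ]


# Example lookup text copied from the Enterobacterales example in this entry.
example_output = (
    "Taxid: 91347 rank: order division: enterobacteria "
    "scientific name: Enterobacterales common name:"
)

taxid = parse_taxid(example_output)
args = blastn_args("query.fa", taxid, "blast_output.tab")
```

Building the argument list as a Python list, instead of one long string, would make it straightforward to pass to `subprocess.run` later without worrying about shell quoting.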

In the meantime…


  • Extract RNA using Trizol LS Reagent tomorrow, run on Qubit, Bioanalyze with Qiagen kit-extracted samples

2015 Oysterseed:

  • Work on figures and paper (Emma is coming to lab meeting Thursday – I will only be able to be there 9:45-10:15)

from Grace’s Lab Notebook