Sam’s Notebook: qPCR – C.gigas primer and gDNA tests with 18s and EF1 primers

0000-0002-2747-368X

The qPCR I ran earlier today to check for residual gDNA in Ronit’s DNased RNA turned out terribly, due to a combination of bad primers and, possibly, bad gDNA.

I tracked down some different primers for testing:

  • Cg_18s_1644_F (SRID 1168)
  • Cg_18s_1750_R (SRID 1169)
  • EF1_qPCR_5′ (SRID 309)
  • EF1_qPCR_3′ (SRID 310)

In addition to BB15 from 20090519, I decided to test out BB16 from 20090519 as a positive control.

Samples were run on Roberts Lab CFX Connect (BioRad). All samples were run in duplicate. See qPCR Report (Results section) for plate layout, cycling params, etc.

qPCR master mix calcs (Google Sheet):

Sam’s Notebook: qPCR – Ronit’s DNased C.gigas Ploidy/Desiccation RNA with 18s primers


After DNasing Ronit’s RNA earlier today, I needed to check for any residual gDNA.

Identified some old, old C.gigas 18s primers that should amplify gDNA:

  • gigas18s_fw (SRID 157)
  • gigas18s_rv (SRID 156)

Used some old C.gigas gDNA (BB15 from 20090519) as a positive control.

Samples were run on Roberts Lab CFX Connect (BioRad). All samples were run in duplicate. See qPCR Report (Results section) for plate layout, cycling params, etc.

qPCR master mix calcs (Google Sheet):

Sam’s Notebook: VCF Splitting with bcftools


Steven asked for some help splitting a VCF into individual per-sample VCF files.

VCF file (15GB): SNP.TRSdp5g95FnDNAmaf05.vcf.gz

Skip to the Results section if you don’t want to read through the trials and tribulations of getting this to work.

Here’s an overview of how I managed to get this to work and what didn’t work.

Figured out the VCF file needed to be sorted, bgzipped (bgzip is part of htslib), and indexed with tabix, due to the following error when initially trying to process the VCF file using bcftools:

[W::vcf_parse] contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
Undefined tags in the header, cannot proceed in the sample subset mode.

So, I did that:

  • Sort and bgzip:
      cat SNP.TRSdp5g95FnDNAmaf05.vcf | \
        awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' | \
        bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz
  • Index with tabix:
      tabix --preset vcf SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz

This produced a separate file:

  • SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz.tbi.

It seems as though this file must exist in the same directory as the source VCF for it to be utilized, although no commands work directly with this index file.
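
To see what the sort step is meant to do without touching the 15GB file, here is a minimal sketch on a made-up three-record VCF (the contig names and positions are invented for illustration). It keeps the same sort keys as above (contig, then numeric position), but splits header and records with grep rather than piping to sort from inside awk, which should give the same ordering. bgzip and tabix are left out so it runs with standard tools only.

```shell
#!/bin/bash
# Made-up toy VCF: two header lines plus three records that are out of order.
toy_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n' > "$toy_vcf"
printf 'chr2\t50\t.\tA\tG\nchr1\t100\t.\tC\tT\nchr1\t5\t.\tG\tA\n' >> "$toy_vcf"

# Header lines (starting with #) go first, untouched; the remaining records are
# sorted by contig (field 1) and then numerically by position (field 2).
sorted_vcf=$(
  grep '^#' "$toy_vcf"
  grep -v '^#' "$toy_vcf" | sort -k1,1 -k2,2n
)

printf '%s\n' "$sorted_vcf"
rm -f "$toy_vcf"
```

On the real file, the output of this step would still need to be piped through bgzip and indexed with tabix, as in the commands above.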

Then, I tried the Biostars solution, but it produced an error:

  #!/bin/bash
  for file in *.vcf.gz; do
    for sample in `bcftools query -l $file`; do
      bcftools view -c1 -Oz -s $sample -o ${file/.vcf*/.$sample.vcf.gz} $file
    done
  done

Resulting error:

 [E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174 

And empty split VCF files…

Running tabix on the unsorted bgzipped file yields this error:

 [E::hts_idx_push] chromosome blocks not continuous 

Tried modified sort:

  cat SNP.TRSdp5g95FnDNAmaf05.vcf | \
    awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1V -k2,2n"}' | \
    bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz

Produces this error:

 [E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174 

And empty split VCF files…

Changed to new version of “view” – trying “call” instead (it seems that bcftools view is deprecated?):

  #!/bin/bash
  for file in *.vcf.gz; do
    for sample in `bcftools query -l $file`; do
      bcftools call \
        --consensus-caller \
        --output-type z \
        --threads 18 \
        --samples $sample \
        --output-file ${file/.vcf.gz/.$sample.vcf.gz} \
        $file
    done
  done

Still results in empty output files.

Based off of the repeated error about AN/AC counts, tried to fill AN/AC values…

  bcftools plugin fill-AN-AC SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz \
    --output-type z \
    --threads 18 \
    --output SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz

And, ran this code:

  #!/bin/bash
  for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do
    for sample in `bcftools query -l $file`; do
      bcftools call \
        --consensus-caller \
        --output-type z \
        --threads 18 \
        --samples $sample \
        --output-file ${file/.vcf.gz/.$sample.vcf.gz} \
        $file
    done
  done

Still results in empty files…

Tried the original code again (expanded the shortened arguments to improve readability):

  #!/bin/bash
  for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do
    for sample in `bcftools query -l $file`; do
      bcftools view \
        --min-ac 1 \
        --output-type z \
        --samples $sample \
        --output-file ${file/.vcf*/.$sample.vcf.gz} \
        --threads 18 \
        $file
    done
  done

P.S. I realize the outermost for loop is not necessary, but it was faster/easier to just quickly modify the code from that Biostars solution.
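
As the P.S. says, the outer loop is not needed. Here is a minimal single-loop sketch, assuming bcftools is on the PATH and the input is already sorted, bgzipped, and indexed. The function names sample_vcf_name and split_vcf_by_sample are made up for illustration, and this has not been run against the real file, so it may well hit the same AN/AC error described above. The output name is built with ${file%%.vcf*}, which should give the same result as the ${file/.vcf*/...} substitution for these file names.

```shell
#!/bin/bash
# Hypothetical helper: build the per-sample output name by replacing everything
# from the first ".vcf" onward with ".<sample>.vcf.gz".
sample_vcf_name() {
  printf '%s\n' "${1%%.vcf*}.$2.vcf.gz"
}

# Hypothetical helper: one bcftools view call per sample listed in the VCF.
# Uses the short flags from the Biostars version (-c1 -Oz -s -o).
split_vcf_by_sample() {
  bcftools query -l "$1" | while IFS= read -r sample; do
    bcftools view -c1 -Oz -s "$sample" -o "$(sample_vcf_name "$1" "$sample")" "$1"
  done
}

# Only attempt the split when bcftools and an input file are actually present.
vcf=${1:-}
if command -v bcftools >/dev/null 2>&1 && [ -n "$vcf" ] && [ -f "$vcf" ]; then
  split_vcf_by_sample "$vcf"
fi
```

Saved under a hypothetical name such as split_vcf.sh, it would be called as: bash split_vcf.sh SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz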

Sam’s Notebook: DNase Treatment – Ronit’s C.gigas Ploidy/Desiccation Ctenidia RNA


After quantifying Ronit’s RNA earlier today, I DNased them using the Turbo DNA-free Kit (Ambion), according to the manufacturer’s standard protocol.

Used 1000ng of RNA in a 50uL reaction in a 0.5mL thin-walled snap cap tube. Samples were mixed by finger flicking and then incubated 30mins @ 37°C in a PTC-200 thermal cycler (MJ Research), without a heated lid.

DNase inactivation was performed (0.1 volumes of inactivation reagent; 5uL), then samples were pelleted and the supe was transferred to a new 1.7mL snap cap tube.
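
The arithmetic behind those volumes can be sketched as follows. The 1000ng input and 50uL reaction volume come from this entry; the stock concentration of 125 ng/uL is a made-up value for illustration only, not one of the measured samples.

```shell
#!/bin/bash
rna_ng=1000          # RNA mass per reaction (from this entry)
reaction_ul=50       # reaction volume in uL (from this entry)
stock_ng_per_ul=125  # HYPOTHETICAL stock concentration, for illustration only

# Volume of RNA stock needed to deliver the target mass.
rna_ul=$(awk -v ng="$rna_ng" -v conc="$stock_ng_per_ul" 'BEGIN { printf "%.1f", ng / conc }')

# Inactivation reagent is 0.1 volumes of the reaction.
inactivation_ul=$(awk -v vol="$reaction_ul" 'BEGIN { printf "%.1f", 0.1 * vol }')

echo "RNA stock: ${rna_ul} uL; inactivation reagent: ${inactivation_ul} uL"
```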

Samples were stored on ice in preparation for qPCR to test for residual gDNA.

DNase calculations are here:

Samples will be permanently stored here (Google Sheet):

from Sam’s Notebook https://ift.tt/2J3c0Nl

Kaitlyn’s notebook: Unique proteins

 

Parsing out unique proteins from hierarchical clusters:

I updated the method to average rather than ward.D2 (Ward 1963) because it changed the cophenetic correlation from 0.6299225 to 0.9433488. The typical accepted value is at least 0.75. There is now a total of 41 clusters (note the agglomerative coefficient = 0.9959262).

Most proteins are in cluster 1 (14381/14510 = 98.68%). However, because the cluster hierarchy represents the original object-by-object dissimilarity matrix so well, I am not so worried about this.

[Image: freq-table-3_9]

Now I want to remove proteins that are in the same cluster. I’m guessing most of cluster 1 will be removed, but this will help remove many proteins (and thus noise) in my data and highlight the proteins that cluster differently based on abundance. My attempts to solve this problem are in this issue.

This spreadsheet contains a list of proteins that were uniquely abundant between silos.

Now I want to move on to gene enrichment of these proteins, but I need to determine the most appropriate background for DAVID or what the background is for CompGO.

The importance of a genetic background. I’m going to use all of the proteins detected in silos 3 and 9, which were all used for the cluster analysis. I chose not to use the proteins that may be unique to silo 2 since it was not included in this analysis. However, it will be worthwhile in the future to apply the same methods to silos 2 and 3, since there were mortality differences despite the silos being at the same temperature. I will only use proteins detected in silos 2 and 3 at that point, and depending on the results, I could redo the same method with all silos and use silos 2 and 3 as replicates. The proteins would then be distinguished by temperature only rather than by group.


Quick note on small things I learned so I won’t forget down the road…

  • I was having some problems with git finding removed files and GitHub wanting to track them. This command fixes that in a user-friendly way:
    git clean -i -fd
      • -i for interactive
      • -f for force
      • -d for directories
      • Note: Add -n or --dry-run to just check what it will do.
  • R uses 1-based indexing.
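
A minimal sketch of the dry-run versus the real thing, assuming git is installed. It works in a throwaway temporary repository (the file and directory names are made up) so nothing real gets deleted.

```shell
#!/bin/bash
files_kept=unknown
files_removed=unknown

if command -v git >/dev/null 2>&1; then
  start_dir=$PWD
  repo=$(mktemp -d)            # throwaway repo, removed at the end
  cd "$repo" || exit 1
  git init --quiet .
  mkdir scratch_dir
  touch untracked.txt scratch_dir/notes.txt

  # -n (--dry-run) only reports what would go; -d includes untracked directories.
  git clean -n -d
  if [ -f untracked.txt ]; then files_kept=yes; else files_kept=no; fi

  # -f (--force) actually deletes the untracked files and directories.
  git clean -f -d --quiet
  if [ ! -e untracked.txt ] && [ ! -e scratch_dir ]; then files_removed=yes; else files_removed=no; fi

  cd "$start_dir" || exit 1
  rm -rf "$repo"
fi
```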

I also installed Shutter on Roadrunner using the Ubuntu Software (App store) for image editing to post my issue.

Sam’s Notebook: RNA Quantification – Ronit’s C.gigas Ploidy/Desiccation RNA


Last Friday, Ronit quantified 1:10 dilutions of the RNA I isolated on 20181003 and the RNA he finished isolating on 20181011, but two of the samples (D11-C, T10-C) were still too concentrated.

I made 1:20 dilutions (1uL RNA in 19uL 0.1% DEPC-treated H2O) and quantified them using the Roberts Lab Qubit 3.0, with the RNA HS assay. Used 1uL of the diluted RNA.
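
For the record, back-calculating the stock concentration from a 1:20 dilution is just the diluted reading times the dilution factor. A tiny sketch, where the 30 ng/uL reading is a made-up number for illustration and not a measured result:

```shell
#!/bin/bash
rna_ul=1               # uL of RNA in the dilution (from this entry)
water_ul=19            # uL of DEPC-treated H2O (from this entry)
diluted_ng_per_ul=30   # HYPOTHETICAL reading for the diluted sample

# Dilution factor = total volume / RNA volume; stock = reading x factor.
dilution_factor=$(awk -v r="$rna_ul" -v w="$water_ul" 'BEGIN { printf "%d", (r + w) / r }')
stock_ng_per_ul=$(awk -v c="$diluted_ng_per_ul" -v f="$dilution_factor" 'BEGIN { printf "%d", c * f }')

echo "Dilution factor: ${dilution_factor}; stock: ${stock_ng_per_ul} ng/uL"
```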

Kaitlyn’s notebook: Bioanalyzer and Clustering

Bioanalyzer Results

I ran the broodstock DNA extractions on the Bioanalyzer today (10/10). Here are the results. For the first chip, the samples are:

1 – 10-3
2 – 3-T1
3 – UK-05
4 – 12-T6
5 – UK-06
6 – 8-T2
7 – 11-T4
8 – UK-02
9 – 7-T2
10 – UK-08
11 – 5-T3

The second chip is already labelled, and I reran UK-08 on it as well. Unfortunately, during preparation of the chip with the gel dye matrix, the plunger clip released to a higher position. I think this disrupted the gel matrix in the chip and caused some errors on the gel matrix although the electropherogram.

Clustering

Previously, I was attempting to use a Bray-Curtis distance matrix, which I used in my cluster analysis for each individual silo. However, Bray-Curtis is an asymmetrical coefficient, so any double-zero values are ignored. My data contains several double zeros (where abundances weren’t detected for multiple days or in multiple silos for a day), and that is relevant information when examining the pattern of abundances. I chose to use a Euclidean distance matrix instead because it is the most commonly used symmetrical distance measure. I also changed the linkage method I was using. Average is still a good choice, but I think Ward’s method (1963) will be better since it is less sensitive to outliers. It minimizes the error sum of squares by minimizing the increase in the sum of squared distances at each step. Average gave the best cophenetic correlation (0.9433488). This should be done on my previous clusters for each individual silo if we choose to utilize those plots. This is not k-means clustering; it is hierarchical clustering.

The table can be found here (along with a frequency table, scree plot, dendrogram, and line plots, which are pasted below for convenience).

In the table, there is an entry for each protein in Silo 3 and another in Silo 9, so each protein appears twice. Clusters are represented in a separate column. I want to determine which proteins were sorted into the same cluster in both silos. Obviously, I could do this manually, but that would take a while. Next, I need code that removes a protein if the cluster is the same for both of its entries.
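
A minimal sketch of that filter using awk, assuming a tab-separated table with three columns (protein, silo, cluster) and no header row. The column layout, the protein names, and the cluster values below are all made up for illustration and are not the actual table.

```shell
#!/bin/bash
# Made-up example table: protein <TAB> silo <TAB> cluster (no header row).
cluster_table=$(mktemp)
printf 'P1\t3\t1\nP1\t9\t1\nP2\t3\t1\nP2\t9\t4\nP3\t3\t2\nP3\t9\t2\n' > "$cluster_table"

# Remember the first cluster seen for each protein and flag the protein if a
# later row puts it in a different cluster. Only rows of flagged proteins are
# printed, so proteins with the same cluster in both silos are dropped.
differing=$(awk -F '\t' '
  !($1 in first)  { first[$1] = $3; order[++n] = $1 }
  $3 != first[$1] { differs[$1] = 1 }
                  { rows[$1] = rows[$1] $0 "\n" }
  END {
    for (i = 1; i <= n; i++)
      if (order[i] in differs) printf "%s", rows[order[i]]
  }
' "$cluster_table")

printf '%s\n' "$differing"
rm -f "$cluster_table"
```

In this toy table, P1 and P3 sit in the same cluster in both silos and are removed, while P2 (cluster 1 in silo 3, cluster 4 in silo 9) is kept.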