Kaitlyn’s notebook: NSAF vs NUMSPEC and working draft methods

Shelly and I are working on putting together a paper focusing on proteomics for the 2016 oyster seed data. Here is the working draft!

We also found out that the data file I have been working with is not NSAF (normalized spectral abundance factor) values. Instead they are spectral peak values (NUMSPEC) which do not really correlate to protein abundance values.. All downstream analyses need to be done on NSAF values, but we are now sure that these are the correct values and that a redone file will export in the same format.

I went through Emma’s notebook today and found some information on technical replicates, MS preparation and experimental justification to name a few things. Important/relevant posts are linked in the draft that can be discussed and used to help discern the parts of the method we are unsure about.

We are focusing on the methods section. I added in information about hierarchical clustering and a fold change analysis. We are unsure if fold change analysis will stay in the paper. Previously we had down this with the NUMSPEC data, so in order to know if it will stay in the methods section, I need to redo it with the new NSAF data. Here is what I’ve done so far:

First I needed to organize the .tsv file. It contains several other NSAF values and spectral values. I know from the above issue that adjusted NSAF values should be used so I extracted all columns containing ‘ADJNSAF’. Next, I had to remove everything but the sample name which was tricky since the sample names are not the same length. Another tough aspect of rearranging this data sheet is that the sample names do not intuitively correspond to the silo, temperature or day as seen here, but another benefit to taking the time to reformat the data, is that it can be put through the other scripts much easier now to generate a new clustering and ASCA heatmap.

All silo 3 and 9 samples from day 0 (competent larvae) to day 13:

  • 1- s0d0
  • 4- s3d3
  • 8- s9d3
  • 12- s3d5
  • 16- s9d5
  • 20- s3d7
  • 24- s9d7
  • 28- s3d9
  • 32- s9d9
  • 36- s3d11
  • 40- s9d11
  • 44- s3d13
  • 48- s9d13

Row means were taken followed by a foldchange analysis for each day. Originally we removed fold changes less than 2 (as written in the methods), but that leaves many NA values for proteins that can’t be visualized. I need to find a way to remove NAs and infinity values without setting the cutoff value for this reason.

Sam’s Notebook: Annotation – Geoduck Genome with MAKER Submitted to Mox

Well, here we go! Initiating full-blown annotation of the Pgenerosa_v070 (FastA; 2.1GB), using MAKER (v.2.31.10) on Mox. This will perform the following:

  • one round of MAKER gene model predictions
  • two rounds of SNAP gene model training/predictions
  • renaming of gene models to NCBI-standardized convention
  • functional characterization of protein models (via BLASTp)
  • functional characterization of protein domains (via InterProScan5)

I’ve submitted the job to Mox and now we wait. When I last ran this process on the Olympia oyster genome, the job took about two weeks to complete. I’ve made some changes that will allow it to run a bit faster (I think/hope) – primarily by using gene model GFFs for the SNAP gene model trainings, instead of relying on the FastA file file each time, since the FastA file needs to get BLASTed on each round and BLASTing is a time consuming process.

To finish off this post, I’ll provide the SBATCH script used to submit this job to Mox. Be prepared, it’s a doozy and it took me many hours to fully put this together (the script has >350 lines in it!). When the job summary is typed out like above, it seems so easy! Yeesh!

  #!/bin/bash ## Job Name #SBATCH --job-name=maker ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=2 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=25-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190115_geoduck_maker_genome_annotation # Load Python Mox module for Python module availability module load intel-python3_2017 # Load Open MPI module for parallel, multi-node processing module load icc_19-ompi_3.1.2 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo ${PATH} | tr : \\n >> system_path.log # Add BLAST to system PATH export PATH=$PATH:/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin export BLASTDB=/gscratch/srlab/blastdbs/UniProtKB_20181008/ ## Establish variables for more readable code wd=$(pwd) maker_dir=/gscratch/srlab/programs/maker-2.31.10/bin snap_dir=/gscratch/srlab/programs/maker-2.31.10/exe/snap ### Paths to Maker binaries maker=${maker_dir}/maker gff3_merge=${maker_dir}/gff3_merge maker2zff=${maker_dir}/maker2zff fathom=${snap_dir}/fathom forge=${snap_dir}/forge hmmassembler=${snap_dir}/hmm-assembler.pl fasta_merge=${maker_dir}/fasta_merge map_ids=${maker_dir}/maker_map_ids map_gff_ids=${maker_dir}/map_gff_ids map_fasta_ids=${maker_dir}/map_fasta_ids functional_fasta=${maker_dir}/maker_functional_fasta functional_gff=${maker_dir}/maker_functional_gff ipr_update_gff=${maker_dir}/ipr_update_gff iprscan2gff3=${maker_dir}/iprscan2gff3 blastp_dir=${wd}/blastp_annotation maker_blastp=${wd}/blastp_annotation/20190108_blastp.outfmt6 maker_prot_fasta=${wd}/snap02/Pgenerosa_v70_snap02.all.maker.proteins.fasta maker_prot_fasta_renamed=${wd}/snap02/Pgenerosa_v70_snap02.all.maker.proteins.renamed.fasta maker_transcripts_fasta=${wd}/snap02/Pgenerosa_v70_snap02.all.maker.transcripts.fasta maker_transcripts_fasta_renamed=${wd}/snap02/Pgenerosa_v70_snap02.all.maker.transcripts.renamed.fasta snap02_est_gff=${wd}/snap02/Pgenerosa_v70_snap01.maker.all.noseqs.est2genome.gff snap02_gff=${wd}/snap02/Pgenerosa_v70_snap02.all.gff snap02_gff_renamed=${wd}/snap02/Pgenerosa_v70_snap02.all.renamed.gff snap02_protein_gff=${wd}/snap02/Pgenerosa_v70_snap01.maker.all.noseqs.protein2genome.gff snap02_rm_gff=${wd}/snap02/Pgenerosa_v70_snap01.maker.all.noseqs.repeats.gff put_func_gff=Pgenerosa_v70_genome_snap02.all.renamed.putative_function.gff put_func_prot=Pgenerosa_v70_genome_snap02.all.maker.proteins.renamed.putative_function.fasta put_func_trans=Pgenerosa_v70_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta put_domain_gff=Pgenerosa_v70_genome_snap02.all.renamed.putative_function.domain_added.gff ips_dir=${wd}/interproscan_annotation ips_base=Pgenerosa_v70_maker_proteins_ips ips_name=Pgenerosa_v70_maker_proteins_ips.tsv id_map=${wd}/snap02/Pgenerosa_v70_genome.map ips_domains=Pgenerosa_v70_genome_snap02.all.renamed.visible_ips_domains.gff ## Path to blastp blastp=/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp ## Path to InterProScan5 interproscan=/gscratch/srlab/programs/interproscan-5.31-70.0/interproscan.sh ## Store path to options control file maker_opts_file=./maker_opts.ctl ### Path to genome FastA file genome=/gscratch/srlab/sam/data/P_generosa/generosa_genomes/Pgenerosa_v070.fa ### Path to transcriptome FastA file transcriptome=/gscratch/srlab/sam/data/P_generosa/generosa_transcriptomes/20180827_trinity_geoduck.fasta ### Path to Crassotrea gigas NCBI protein FastA gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa ### Path to Crassostrea virginica NCBI protein FastA virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa ### Path to Panopea generosa TransDecoder protein FastA panopea_td_proteome=/gscratch/srlab/sam/data/P_generosa/generosa_proteomes/20180827_trinity_geoduck.fasta.transdecoder.pep ### Path to concatenated NCBI prteins FastA combined_proteomes=${wd}/combined_proteomes.fasta ### Path to P.generosa-specific repeat library repeat_library=/gscratch/srlab/sam/data/P_generosa/generosa_repeats/Pgenerosa_v070-families.fa ### Path to SwissProt database for BLASTp sp_db_blastp=/gscratch/srlab/blastdbs/UniProtKB_20190109/uniprot_sprot.fasta ## Make directories mkdir blastp_annotation mkdir interproscan_annotation ## Create Maker control files needed for running Maker, only if it doesn't already exist and then edit it. ### Edit options file ### Set paths to P.generosa genome and transcriptome. ### Set path to combined C. gigas, C.virginica, P.generosa proteomes. ### The use of the % symbol sets the delimiter sed uses for arguments. ### Normally, the delimiter that most examples use is a slash "/". ### But, we need to expand the variables into a full path with slashes, which screws up sed. ### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%"). if [ ! -e maker_opts.ctl ]; then $maker -CTL sed -i "/^genome=/ s% %$genome %" "$maker_opts_file" sed -i "/^est=/ s% %$transcriptome %" "$maker_opts_file" sed -i "/^protein=/ s% %$combined_proteomes %" "$maker_opts_file" sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file" sed -i "/^est2genome=0/ s/est2genome=0/est2genome=1/" "$maker_opts_file" sed -i "/^protein2genome=0/ s/protein2genome=0/protein2genome=1/" "$maker_opts_file" fi ## Create combined proteome FastA file, only if it doesn't already exist. if [ ! -e combined_proteomes.fasta ]; then touch combined_proteomes.fasta cat "$gigas_proteome" >> combined_proteomes.fasta cat "$virginica_proteome" >> combined_proteomes.fasta cat "$panopea_td_proteome" >> combined_proteomes.fasta fi ## Run Maker ### Specify number of nodes to use. mpiexec -n 56 $maker ## Merge gffs ${gff3_merge} -d Pgenerosa_v70.maker.output/Pgenerosa_v70_master_datastore_index.log ## GFF with no FastA in footer ${gff3_merge} -n -s -d Pgenerosa_v70.maker.output/Pgenerosa_v70_master_datastore_index.log > Pgenerosa_v70.maker.all.noseqs.gff ## Merge all FastAs ${fasta_merge} -d Pgenerosa_v70.maker.output/Pgenerosa_v70_master_datastore_index.log ## Extract GFF alignments for use in subsequent MAKER rounds ### Transcript alignments awk '{ if ($2 == "est2genome") print $0 }' Pgenerosa_v70.maker.all.noseqs.gff > Pgenerosa_v70.maker.all.noseqs.est2genome.gff ### Protein alignments awk '{ if ($2 == "protein2genome") print $0 }' Pgenerosa_v70.maker.all.noseqs.gff > Pgenerosa_v70.maker.all.noseqs.protein2genome.gff ### Repeat alignments awk '{ if ($2 ~ "repeat") print $0 }' Pgenerosa_v70.maker.all.noseqs.gff > Pgenerosa_v70.maker.all.noseqs.repeats.gff ## Run SNAP training, round 1 mkdir snap01 && cd snap01 ${maker2zff} ../Pgenerosa_v70.all.gff ${fathom} -categorize 1000 genome.ann genome.dna ${fathom} -export 1000 -plus uni.ann uni.dna ${forge} export.ann export.dna ${hmmassembler} genome . > Pgenerosa_v70_snap01.hmm ## Initiate second Maker run. ### Copy initial maker control files and ### - change gene prediction settings to 0 (i.e. don't generate Maker gene predictions) ### - use GFF subsets generated in first round of MAKER ### - set location of snaphmm file to use for gene prediction ### Percent symbols used below are the sed delimiters, instead of the default "/", ### due to the need to use file paths. if [ ! -e maker_opts.ctl ]; then $maker -CTL sed -i "/^genome=/ s% %$genome %" maker_opts.ctl sed -i "/^est2genome=1/ s/est2genome=1/est2genome=0/" maker_opts.ctl sed -i "/^protein2genome=1/ s/protein2genome=1/protein2genome=0/" maker_opts.ctl sed -i "/^est_gff=/ s% %../Pgenerosa_v70.maker.all.noseqs.est2genome.gff %" maker_opts.ctl sed -i "/^protein_gff=/ s% %../Pgenerosa_v70.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl sed -i "/^rm_gff=/ s% %../Pgenerosa_v70.maker.all.noseqs.repeats.gff %" maker_opts.ctl sed -i "/^snaphmm=/ s% %Pgenerosa_v70_snap01.hmm %" maker_opts.ctl fi ## Run Maker ### Set basename of files and specify number of CPUs to use mpiexec -n 56 $maker \ -base Pgenerosa_v70_snap01 ## Merge gffs ${gff3_merge} -d Pgenerosa_v70_snap01.maker.output/Pgenerosa_v70_snap01_master_datastore_index.log ## GFF with no FastA in footer ${gff3_merge} -n -s -d Pgenerosa_v70_snap01.maker.output/Pgenerosa_v70_snap01_master_datastore_index.log > Pgenerosa_v70_snap01.maker.all.noseqs.gff ## Extract GFF alignments for use in subsequent MAKER rounds ### Transcript alignments awk '{ if ($2 == "est2genome") print $0 }' Pgenerosa_v70_snap01.maker.all.noseqs.gff > Pgenerosa_v70_snap01.maker.all.noseqs.est2genome.gff ### Protein alignments awk '{ if ($2 == "protein2genome") print $0 }' Pgenerosa_v70_snap01.maker.all.noseqs.gff > Pgenerosa_v70_snap01.maker.all.noseqs.protein2genome.gff ### Repeat alignments awk '{ if ($2 ~ "repeat") print $0 }' Pgenerosa_v70_snap01.maker.all.noseqs.gff > Pgenerosa_v70_snap01.maker.all.noseqs.repeats.gff ## Run SNAP training, round 2 cd .. mkdir snap02 && cd snap02 ${maker2zff} ../snap01/Pgenerosa_v70_snap01.all.gff ${fathom} -categorize 1000 genome.ann genome.dna ${fathom} -export 1000 -plus uni.ann uni.dna ${forge} export.ann export.dna ${hmmassembler} genome . > Pgenerosa_v70_snap02.hmm ## Initiate third and final Maker run. ### Copy initial maker control files and: ### - change gene prediction settings to 0 (i.e. don't generate Maker gene predictions) ### - use GFF subsets generated in first round of SNAP ### - set location of snaphmm file to use for gene prediction. ### Percent symbols used below are the sed delimiters, instead of the default "/", ### due to the need to use file paths. if [ ! -e maker_opts.ctl ]; then $maker -CTL sed -i "/^genome=/ s% %$genome %" maker_opts.ctl sed -i "/^est2genome=1/ s/est2genome=1/est2genome=0/" maker_opts.ctl sed -i "/^protein2genome=1/ s/protein2genome=1/protein2genome=0/" maker_opts.ctl sed -i "/^est_gff=/ s% %${snap02_est_gff} %" maker_opts.ctl sed -i "/^protein_gff=/ s% %${snap02_protein_gff} %" maker_opts.ctl sed -i "/^rm_gff=/ s% %${snap02_rm_gff} %" maker_opts.ctl sed -i "/^snaphmm=/ s% %Pgenerosa_v70_snap02.hmm %" maker_opts.ctl fi ## Run Maker ### Set basename of files and specify number of CPUs to use mpiexec -n 56 $maker \ -base Pgenerosa_v70_snap02 ## Merge gffs ${gff3_merge} \ -d Pgenerosa_v70_snap02.maker.output/Pgenerosa_v70_snap02_master_datastore_index.log ## GFF with no FastA in footer ${gff3_merge} -n -s -d Pgenerosa_v70_snap02.maker.output/Pgenerosa_v70_snap02_master_datastore_index.log > Pgenerosa_v70_snap02.maker.all.noseqs.gff ## Merge FastAs ${fasta_merge} \ -d Pgenerosa_v70_snap02.maker.output/Pgenerosa_v70_snap02_master_datastore_index.log # Create copies of files for mapping cp ${maker_prot_fasta} ${maker_prot_fasta_renamed} cp ${maker_transcripts_fasta} ${maker_transcripts_fasta_renamed} cp ${snap02_gff} ${snap02_gff_renamed} # Map IDs ## Change gene names ${maker_map_ids} \ --prefix PGEN_ \ --justify 8 \ ${snap02_gff} \ > ${id_map} ## Map GFF IDs ${map_gff_ids} \ ${id_map} \ ${snap02_gff_renamed} ## Map FastAs ### Proteins ${map_fasta_ids} \ ${id_map} \ ${maker_prot_fasta_renamed} ### Transcripts ${map_fasta_ids} \ ${id_map} \ ${maker_transcripts_fasta_renamed} # Run InterProScan 5 ## disable-precalc since this requires external database access (which Mox does not allow) cd ${ips_dir} ${interproscan} \ --input ${maker_prot_fasta_renamed} \ --goterms \ --output-file-base ${ips_base} \ --disable-precalc # Run BLASTp cd ${blastp_annotation} ${blastp} \ -query ${maker_prot_fasta_renamed} \ -db ${sp_db_blastp} \ -out ${maker_blastp} \ -max_target_seqs 1 \ -evalue 1e-6 \ -outfmt 6 \ -num_threads 28 # Functional annotations cd ${wd} ## Add putative gene functions ### GFF ${functional_gff} \ ${sp_db_annotations} \ ${maker_blastp} \ ${snap02_gff_renamed} \ > ${put_func_gff} ### Proteins ${maker_functional_fasta} \ ${sp_db_annotations} \ ${maker_blastp} \ ${maker_prot_fasta_renamed} \ > ${put_func_prot} ### Transcripts ${maker_functional_fasta} \ ${sp_db_annotations} \ ${maker_blastp} \ ${maker_transcripts_fasta_renamed} \ > ${put_func_trans} ## Add InterProScan domain info ### Add searchable tags ${ipr_update_gff} \ ${put_func_gff} \ ${ips_dir}/${ips_name} \ > ${put_domain_gff} ### Add viewable features for genome browsers (JBrowse, Gbrowse, Web Apollo) ${iprscan2gff3} \ ${ips_dir}/${ips_name} \ ${snap02_gff_renamed} \ > ${ips_domains}  

from Sam’s Notebook http://bit.ly/2FtTVJh

Kaitlyn’s notebook: new height for dendrogram

Goals for today:

New Heatmap

Quick refresher: Hierarchical clustering compares the pattern of abundance between each protein. It does not factor in time as a dependent value. Instead it considers each time point as an independent variable.

I wanted to better evaluate the cutoff value for my hierarchical clustering dendrogram. The dendrogram is inserted below (ignore the current red cutoff line).

[A dendrogram was created because the scree plot was not detailed enough to find an appropriate cutoff value (at the elbow).]



Clusters directly correlate to the nodes therefore the higher the cutoff value, or height, the fewer clusters are created (because less nodes are included).

I choose 3 cutoff values, at 90, 110 and 180 because they were either highly inclusive of branches with fewer nodes or highly exclusive of branches with many nodes.

  • 180 includes all small branches.
  • 80 includes only the dense nodes and intricate branching at the bottom.
  • 110 sits between high exclusive and highly inclusive.

Next, I compared the line plots at each cutoff value to determine which level of inclusivity best examined the relationship between proteins. I was looking for something that grouped proteins together well. This value couldn’t be so inclusive of branching that all proteins grouped together into so few clusters that individuality was lost, or so exclusive of branching that proteins were so often clustered separately that patterns were lost.

Here are line plots of the abundance values of each cluster:


Cutoff height at 90 producing 43 clusters. Highly exclusive of branching which means there are several nodes and thus clusters. The similarity between proteins is most strict at this cutoff value. We can see this most in clusters 1 and 2 which visually look pretty similar but that this cutoff value determines as distinct. There are two dense clusters: Cluster 15 has proteins that stayed relatively constant between about 25 and 100.  Cluster 23 has proteins whose value stayed consistently below 50 abundance. 28 clusters contain only 1 protein.


Cutoff height at 110 producing 31 clusters. This value is moderately exclusive/inclusive. You can see a major change between cutoff value 90 and 110 in cluster 1. Previously, at 90, those proteins were distinct, but here, at 110, they are clustered together. You can also see that there are 3 dense clusters here: Cluster 8 has proteins whose abundance stayed between 80 and 160. Cluster 11 has proteins that stayed between 25 and 100, and cluster 16 has proteins who generally stayed below 50 abundance. At 90, there were only two dense clusters. Here, 20 clusters contain 1 protein as opposed to 28 at the 90 cutoff value.


Cutoff height at 180 with 11 clusters. It looks like all the dense clusters at previous cutoff values were grouped into cluster 6 here. This results in the lowest number of clusters with only 1 protein (2 total) but it looks like this might be a bit too inclusive to parse out abundance patterns based on clusters 10, 7 and 1 which look like the similarity between proteins is not as good as the previous cutoff value.

I choose cutoff 110 because of the balance between grouping similar proteins and not being so exclusive that mostly single protein clusters are produced.


Grace’s Notebook: Error 3 on Centrifuge- Couldn’t finish Extraction with TriZol LS Reagent

Today I tried out the Trizol LS Reagent extraction, but the centrifuge in the 4˚C in FTR 213 had an “Err 3” message, which happens when there’s an error in the rotational speed measurement system (GitHub Issue #542). I’ll put a different centrifuge in the 4C and re-try the protocol on Wednesday. This post includes my protocol for the Trizol LS Reagent extraction.

Trizol LS Reagent extraction

As stated in the blurb above, the centrifuge in the 4C fridge in FTR 213 has an error that won’t allow it to run. I’ll go through the protocol again on Wednesday (I have no classes on Wednesdays), run the samples on the Qubit, and run on the Bioanalyzer with the 8 Qiagen RNeasy Kit-extracted samples (Post from Nov 21, 2018).

The samples I started working with today (and which have been dumped due to the centrifuge error) are from Day 26. They are each the second out of the three samples that were taken for each crab. In other words, there is only one sample tube remaining for each of these three crabs. The tubes used today were: 427, 430, 433.

Trizol LS Reagent Extraction Protocol (I’ll be doing this on Wednesday):

Lyse samples and separate phases:

  1. Add 750uL Trizol LS Reagent to the samples (pelleted C. bairdi hemolymph)
    (Need to maintain a 3:1 ratio of Trizol to sample. I estimated that the samples are about 250ul)
  2. Homogenize sample by pipetting (5x per sample)
  3. Centrifuge 5 mins at 12000g at 4C
  4. Transfer clear supernatant to a new, labeled tube
  5. Incubate 5 mins to permit complete dissociation of the nucleoproteins complex.
  6. Add 200ul of chloroform and close lid
  7. Incubate at room T 2-3 mins
  8. Centrifuge 15mins at 12000g at 4C
    (The mixture separates into a lower phenol-chloroform, and interphase, and a colorless upper aqueous phase)
  9. Transfer aqueous phase containing RNA to a new tube by angling at 45˚ and pipetting out.
    (DO NOT transfer any of inerphase or organic layer. ONLY the aqueous phase)

Isolate RNA

A. Precipitate RNA

  1. Add 500ul of ispropanol to the aqueous phase
  2. Incubate at room T 10 mins
  3. Centrifuge 10 mins at 12000g at 4C
    (Total RNA becomes a white gel-like pellet at the bottom of the tube)
  4. Discard supernatant

B. Wash the RNA

  1. Resuspend pellet in 1mL of 75% ethanol
  2. Vortex briefly. Centrifuge for 5 mins at 7500g at 4C
  3. Discard supernatant
  4. Air dry pellet 5-10 mins

C. Solubilize the RNA

  1. Resuspend pellet in 20ul of 0.1% DEPC-treated H20
  2. Incubate in a heat block set to 55˚C for 15mins

Run 1ul of each sample on Qubit

Run 1ul of each sample on Bioanalyzer

from Grace’s Lab Notebook http://bit.ly/2CliQuE