Transcriptome Annotation – Trinotate Hematodinium v1.6 on Mox

To continue annotation of our Hematodinium v1.6 transcriptome assembly, I wanted to run Trinotate.

Info for each transcriptome version (library composition, assembly dates, BUSCO, etc) can be found in this table:

This was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210309_hemat_trinotate_transcriptome-v1.6 ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=7-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210309_hemat_trinotate_transcriptome-v1.6 # Script to run Trinotate on Hematodinium transcriptome: # v1.6 ################################################################################### # These variables need to be set by user # Input files ## BLASTx blastx_out="/gscratch/scrubbed/samwhite/outputs/20210308_hemat_diamond_blastx_v1.6_v1.7/hemat_transcriptome_v1.6.fasta.blastx.outfmt6" ## TransDecoder transdecoder_dir="/gscratch/scrubbed/samwhite/outputs/20210309_hemat_transdecoder_transcriptomes_v1.6_v1.7/20210309_hemat_transcriptome_v1.6.fasta.transdecoder" blastp_out="${transdecoder_dir}/20210309_hemat_transcriptome_v1.6.fasta.blastp_out/20210309_hemat_transcriptome_v1.6.fasta.blastp.outfmt6" pfam_out="${transdecoder_dir}/20210309_hemat_transcriptome_v1.6.fasta.pfam_out/20210309_hemat_transcriptome_v1.6.fasta.pfam.domtblout" lORFs_pep="${transdecoder_dir}/hemat_transcriptome_v1.6.fasta.transdecoder_dir/longest_orfs.pep" ## Transcriptomics transcriptomes_dir="/gscratch/srlab/sam/data/Hematodinium/transcriptomes" trinity_fasta="${transcriptomes_dir}/hemat_transcriptome_v1.6.fasta" trinity_gene_map="${transcriptomes_dir}/hemat_transcriptome_v1.6.fasta.gene_trans_map" ################################################################################### wd="$(pwd)" timestamp=$(date +%Y%m%d) ## Paths to input/output files ## New folders for working directory rnammer_out_dir="${wd}/RNAmmer_out" signalp_out_dir="${wd}/signalp_out" tmhmm_out_dir="${wd}/tmhmm_out" rnammer_prefix=${trinity_fasta##*/} prefix="${timestamp}.${rnammer_prefix}.trinotate" # Output files rnammer_out="${rnammer_out_dir}/${rnammer_prefix}.rnammer.gff" signalp_out="${signalp_out_dir}/${prefix}.signalp.out" tmhmm_out="${tmhmm_out_dir}/${prefix}.tmhmm.out" trinotate_report="${wd}/${prefix}_annotation_report.txt" # Paths to programs rnammer_dir="/gscratch/srlab/programs/RNAMMER-1.2" rnammer="${rnammer_dir}/rnammer" signalp_dir="/gscratch/srlab/programs/signalp-4.1" signalp="${signalp_dir}/signalp" tmhmm_dir="/gscratch/srlab/programs/tmhmm-2.0c/bin" tmhmm="${tmhmm_dir}/tmhmm" trinotate_dir="/gscratch/srlab/programs/Trinotate-v3.1.1" trinotate="${trinotate_dir}/Trinotate" trinotate_rnammer="${trinotate_dir}/util/rnammer_support/RnammerTranscriptome.pl" trinotate_GO="${trinotate_dir}/util/extract_GO_assignments_from_Trinotate_xls.pl" trinotate_features="${trinotate_dir}/util/Trinotate_get_feature_name_encoding_attributes.pl" trinotate_sqlite_db="Trinotate.sqlite" # Generate FastA checksum, for reference if needed. md5sum ${trinity_fasta} > fasta.checksum.md5 # Make output directories mkdir "${rnammer_out_dir}" "${signalp_out_dir}" "${tmhmm_out_dir}" # Copy sqlite database template cp ${trinotate_dir}/admin/Trinotate.sqlite . # Run signalp ${signalp} \ -f short \ -n "${signalp_out}" \ ${lORFs_pep} # Run tmHMM ${tmhmm} \ --short \ < ${lORFs_pep} \ > "${tmhmm_out}" # Run RNAmmer cd "${rnammer_out_dir}" || exit ${trinotate_rnammer} \ --transcriptome ${trinity_fasta} \ --path_to_rnammer ${rnammer} cd "${wd}" || exit # Run Trinotate ## Load transcripts and coding regions into database ${trinotate} \ ${trinotate_sqlite_db} \ init \ --gene_trans_map "${trinity_gene_map}" \ --transcript_fasta "${trinity_fasta}" \ --transdecoder_pep "${lORFs_pep}" ## Load BLAST homologies "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_swissprot_blastp \ "${blastp_out}" "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_swissprot_blastx \ "${blastx_out}" ## Load Pfam "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_pfam \ "${pfam_out}" ## Load transmembrane domains "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_tmhmm \ "${tmhmm_out}" ## Load signal peptides "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_signalp \ "${signalp_out}" ## Load RNAmmer "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_rnammer \ "${rnammer_out}" ## Creat annotation report "${trinotate}" \ "${trinotate_sqlite_db}" \ report \ > "${trinotate_report}" # Extract GO terms from annotation report "${trinotate_GO}" \ --Trinotate_xls "${trinotate_report}" \ -G \ --include_ancestral_terms \ > "${prefix}".go_annotations.txt # Make transcript features annotation map "${trinotate_features}" \ "${trinotate_report}" \ > "${prefix}".annotation_feature_map.txt # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log 

Transcriptome Annotation – Hematodinium Transcriptomes v1.6 v1.7 v2.1 v3.1 with DIAMOND BLASTx on Mox

Needed to annotate the Hematodinium sp. transcriptomes that I’ve assembled using DIAMOND BLASTx. This will also be used for additional downstream annotation (TransDecoder, Trinotate):

All of the above transcriptomes were assembled with different combinations of the crab RNAseq data we generated. Here’s a link to an overview of the various assemblies:

DIAMOND BLASTx was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=hemat_diamond_blastx_v1.6_v1.7_v2.1_v3.1 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=0-08:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200814_hemat_diamond_blastx_v1.6_v1.7_v2.1_v3.1 ## Script for running BLASTx (using DIAMOND) to annotate ## Hematodinium transcriptomes v1.6, v1.7, v2.1 and v3.1 against SwissProt database. ## Output will be in standard BLAST output format 6. ################################################################################### # These variables need to be set by user # Programs array declare -A programs_array programs_array=( [diamond]="/gscratch/srlab/programs/diamond-0.9.29/diamond" ) # Establish variables for more readable code transcriptomes_dir=/gscratch/srlab/sam/data/Hematodinium/transcriptomes # Array of the various comparisons to evaluate # Each condition in each comparison should be separated by a "-" transcriptomes_array=( "${transcriptomes_dir}"/hemat_transcriptome_v1.6.fasta \ "${transcriptomes_dir}"/hemat_transcriptome_v1.7.fasta \ "${transcriptomes_dir}"/hemat_transcriptome_v2.1.fasta \ "${transcriptomes_dir}"/hemat_transcriptome_v3.1.fasta ) # DIAMOND UniProt database dmnd=/gscratch/srlab/blastdbs/uniprot_sprot_20200123/uniprot_sprot.dmnd ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 for fasta in "${!transcriptomes_array[@]}" do # Remove path from transcriptome using parameter substitution transcriptome_name="${transcriptomes_array[$fasta]##*/}" # Generate checksums for reference md5sum "${transcriptomes_array[$fasta]}">> fasta.checksums.md5 # Run DIAMOND with blastx # Output format 6 produces a standard BLAST tab-delimited file ${programs_array[diamond]} blastx \ --db ${dmnd} \ --query "${transcriptomes_array[$fasta]}" \ --out "${transcriptome_name}".blastx.outfmt6 \ --outfmt 6 \ --evalue 1e-4 \ --max-target-seqs 1 \ --block-size 15.0 \ --index-chunks 4 done ################################################################################### # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : n } >> system_path.log # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" ${programs_array[$program]} --help echo "" echo "" echo "

Transcriptome Assembly – Hematodinium Transcriptomes v1.6 and v1.7 with Trinity on Mox

I’d previously assembled hemat_transcriptome_v1.0.fasta on 20200122, hemat_transcriptome_v1.5.fasta on 20200408, extracted hemat_transcriptome_v2.1.fasta from an existing FastA on 20200605, as well as extracted hemat_transcriptome_v3.1.fasta on 20200605.

All of the above transcriptomes were assembled with different combinations of the crab RNAseq data we generated. Here’s a link to an overview of the various assemblies (including the two generated today):

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=hemat_trinity_v1.6_v1.7 ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=15-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210308_hemat_trinity_v1.6_v1.7 # Script to generate Hematodinium Trinity transcriptome assemblies: # v1.6 libaries: 2018 2019 2020-GW 2020-UW # v1.7 libraries: 2018 2019 2020-UW # See corresponding FastQ list for each assembly to see FastQ used in each assembly. ################################################################################### # These variables need to be set by user # Assign Variables script_path=/gscratch/scrubbed/samwhite/outputs/20210308_hemat_trinity_v1.6_v1.7/20210308_hemat_trinity_v1.6_v1.7.sh reads_dir=/gscratch/srlab/sam/data/Hematodinium/RNAseq transcriptomes_dir=/gscratch/srlab/sam/data/Hematodinium/transcriptomes threads=28 # Carrot needed to limit grep to line starting with #SBATCH # Avoids grep-ing the command below. max_mem=$(grep "^#SBATCH --mem=" ${script_path} | awk -F [=] '{print $2}') # Paths to programs trinity_dir="/gscratch/srlab/programs/trinityrnaseq-v2.9.0" samtools="/gscratch/srlab/programs/samtools-1.10/samtools" # Array of the various comparisons to evaluate # Each condition in each comparison should be separated by a "-" transcriptomes_array=( "${transcriptomes_dir}"/hemat_transcriptome_v1.6.fasta \ "${transcriptomes_dir}"/hemat_transcriptome_v1.7.fasta ) # Programs array declare -A programs_array programs_array=( [samtools_faidx]="${samtools} faidx" \ [trinity]="${trinity_dir}/Trinity" \ [trinity_stats]="${trinity_dir}/util/TrinityStats.pl" \ [trinity_gene_trans_map]="${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl" \ [trinity_fasta_seq_length]="${trinity_dir}/util/misc/fasta_seq_length.pl" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Set working directory wd=$(pwd) # Loop through each transcriptome for transcriptome in "${!transcriptomes_array[@]}" do ## Inititalize arrays R1_array=() R2_array=() reads_array=() # Variables R1_list="" R2_list="" trinity_out_dir="" transcriptome_name="${transcriptomes_array[$transcriptome]##*/}" assembly_stats="${transcriptome_name}_assembly_stats.txt" trinity_out_dir="${transcriptome_name}_trinity_out_dir" # v1.6 libraries: 2018 2019 2020-GW 2020-UW if [[ "${transcriptome_name}" == "hemat_transcriptome_v1.6.fasta" ]]; then reads_array=("${reads_dir}"/*megan*.fq) # Create array of fastq R1 files R1_array=("${reads_dir}"/*megan*R1.fq) # Create array of fastq R2 files R2_array=("${reads_dir}"/*megan*R2.fq) # v.17 libraries: 2018 2019 2020-UW elif [[ "${transcriptome_name}" == "hemat_transcriptome_v1.7.fasta" ]]; then reads_array=("${reads_dir}"/20200[145][13][189]*megan*.fq) # Create array of fastq R1 files R1_array=("${reads_dir}"/20200[145][13][189]*megan*R1.fq) # Create array of fastq R2 files R2_array=("${reads_dir}"/20200[145][13][189]*megan*R2.fq) fi # Create checksum list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename md5sum "${reads_array[@]}" >> "${transcriptome_name}".fastq-checksums.md5 # Create comma-separated lists of FastQ reads R1_list=$(echo "${R1_array[@]}" | tr " " ",") R2_list=$(echo "${R2_array[@]}" | tr " " ",") if [[ "${transcriptome_name}" == "hemat_transcriptome_v1.6.fasta" ]]; then # Run Trinity without stranded RNAseq option ${programs_array[trinity]} \ --seqType fq \ --max_memory ${max_mem} \ --CPU ${threads} \ --output ${trinity_out_dir} \ --left "${R1_list}" \ --right "${R2_list}" else # Run Trinity with stranded RNAseq option ${programs_array[trinity]} \ --seqType fq \ --max_memory ${max_mem} \ --CPU ${threads} \ --output ${trinity_out_dir} \ --SS_lib_type RF \ --left "${R1_list}" \ --right "${R2_list}" fi # Rename generic assembly FastA mv "${trinity_out_dir}"/Trinity.fasta "${trinity_out_dir}"/"${transcriptome_name}" # Assembly stats ${programs_array[trinity_stats]} "${trinity_out_dir}"/"${transcriptome_name}" \ > "${assembly_stats}" # Create gene map files ${programs_array[trinity_gene_trans_map]} \ "${trinity_out_dir}"/"${transcriptome_name}" \ > "${trinity_out_dir}"/"${transcriptome_name}".gene_trans_map # Create sequence lengths file (used for differential gene expression) ${programs_array[trinity_fasta_seq_length]} \ "${trinity_out_dir}"/"${transcriptome_name}" \ > "${trinity_out_dir}"/"${transcriptome_name}".seq_lens # Create FastA index ${programs_array[samtools_faidx]} \ "${trinity_out_dir}"/"${transcriptome_name}" # Copy files to transcriptomes directory rsync -av \ "${trinity_out_dir}"/"${transcriptome_name}"* \ ${transcriptomes_dir} # Capture FastA checksums for verification cd "${trinity_out_dir}"/ echo "Generating checksum for ${transcriptome_name}" md5sum "${transcriptome_name}" > "${transcriptome_name}".checksum.md5 echo "Finished generating checksum for ${transcriptome_name}" echo "" cd ${wd} done # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "