TWIP – Episode 8 – A whole new world

This week we go deep on several topics including how to best consider Oly genetics, how to analyze gene expression in a “cluster” format, and Laura gives us some initial insights into her new gene expression data.

TWIP – Episode 7: Like Hotel California

This week we waste no time diving into goals and the week ahead, along with a bit of well-being preamble.

Transcriptome Annotation – Trinotate on C.bairdi Transcriptome v4.0 on Mox

Continued annotation of cbai_transcriptome_v4.0.fasta [Trinity de novo assembly from 20210317(https://ift.tt/2NziJW6] using Trinotate on Mox. This will provide a thorough annotation, including genoe ontology (GO) term assignments to each contig.

One thing to note is that upon initial run, RNAmmer caused the script to exit with an error due to not having produced any results. The developer responded to the GitHub issue I posted and indicated the lack of results was a bit unexpected, but suggested I add the “or” bash notation (||) to the end of the RNammer command to allow the Trinotate pipeline to proceed without any RNAmmer info. It’s still surprising that there weren’t any matches…

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210318_cbai_trinotate_transcriptome-v4.0 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=7-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210318_cbai_trinotate_transcriptome-v4.0 # Script to run Trinotate on C.bairdi transcriptome v4.0 # NOTE: RNAMMER appears to not find any matches, so have added "||" at end of RNAMMER # command to allow annotation to proceed. ################################################################################### # These variables need to be set by user # Input files ## BLASTx blastx_out="/gscratch/scrubbed/samwhite/outputs/20210318_cbai_diamond_blastx_transcriptome-v4.0/cbai_transcriptome_v4.0.blastx.outfmt6" ## TransDecoder transdecoder_dir="/gscratch/scrubbed/samwhite/outputs/20210317_cbai_transdecoder_transcriptome_v4.0" blastp_out="${transdecoder_dir}/blastp_out/cbai_transcriptome_v4.0.fasta.blastp.outfmt6" pfam_out="${transdecoder_dir}/pfam_out/cbai_transcriptome_v4.0.fasta.pfam.domtblout" lORFs_pep="${transdecoder_dir}/cbai_transcriptome_v4.0.fasta.transdecoder_dir/longest_orfs.pep" ## Transcriptomics transcriptomes_dir="/gscratch/srlab/sam/data/C_bairdi/transcriptomes" trinity_fasta="${transcriptomes_dir}/cbai_transcriptome_v4.0.fasta" trinity_gene_map="${transcriptomes_dir}/cbai_transcriptome_v4.0.fasta.gene_trans_map" ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # SegFault fix? export THREADS_DAEMON_MODEL=1 wd="$(pwd)" timestamp=$(date +%Y%m%d) ## Paths to input/output files ## New folders for working directory rnammer_out_dir="${wd}/RNAmmer_out" signalp_out_dir="${wd}/signalp_out" tmhmm_out_dir="${wd}/tmhmm_out" rnammer_prefix=${trinity_fasta##*/} prefix="${timestamp}.${rnammer_prefix}.trinotate" # Output files rnammer_out="${rnammer_out_dir}/${rnammer_prefix}.rnammer.gff" signalp_out="${signalp_out_dir}/${prefix}.signalp.out" tmhmm_out="${tmhmm_out_dir}/${prefix}.tmhmm.out" trinotate_report="${wd}/${prefix}_annotation_report.txt" # Paths to programs rnammer_dir="/gscratch/srlab/programs/RNAMMER-1.2" rnammer="${rnammer_dir}/rnammer" signalp_dir="/gscratch/srlab/programs/signalp-4.1" signalp="${signalp_dir}/signalp" tmhmm_dir="/gscratch/srlab/programs/tmhmm-2.0c/bin" tmhmm="${tmhmm_dir}/tmhmm" trinotate_dir="/gscratch/srlab/programs/Trinotate-v3.1.1" trinotate="${trinotate_dir}/Trinotate" trinotate_rnammer="${trinotate_dir}/util/rnammer_support/RnammerTranscriptome.pl" trinotate_GO="${trinotate_dir}/util/extract_GO_assignments_from_Trinotate_xls.pl" trinotate_features="${trinotate_dir}/util/Trinotate_get_feature_name_encoding_attributes.pl" trinotate_sqlite_db="Trinotate.sqlite" # Generate FastA checksum, for reference if needed. md5sum ${trinity_fasta} > fasta.checksum.md5 # Make output directories mkdir "${rnammer_out_dir}" "${signalp_out_dir}" "${tmhmm_out_dir}" # Copy sqlite database template cp ${trinotate_dir}/admin/Trinotate.sqlite . # Run signalp ${signalp} \ -f short \ -n "${signalp_out}" \ ${lORFs_pep} # Run tmHMM ${tmhmm} \ --short \ < ${lORFs_pep} \ > "${tmhmm_out}" # Run RNAmmer # Has "||" operator due to previous lack of matches # Need "||" to continue with annotation. cd "${rnammer_out_dir}" || exit ${trinotate_rnammer} \ --transcriptome ${trinity_fasta} \ --path_to_rnammer ${rnammer} \ || cd "${wd}" || exit # Run Trinotate ## Load transcripts and coding regions into database ${trinotate} \ ${trinotate_sqlite_db} \ init \ --gene_trans_map "${trinity_gene_map}" \ --transcript_fasta "${trinity_fasta}" \ --transdecoder_pep "${lORFs_pep}" ## Load BLAST homologies "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_swissprot_blastp \ "${blastp_out}" "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_swissprot_blastx \ "${blastx_out}" ## Load Pfam "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_pfam \ "${pfam_out}" ## Load transmembrane domains "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_tmhmm \ "${tmhmm_out}" ## Load signal peptides "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_signalp \ "${signalp_out}" ## Load RNAmmer "${trinotate}" \ "${trinotate_sqlite_db}" \ LOAD_rnammer \ "${rnammer_out}" ## Creat annotation report "${trinotate}" \ "${trinotate_sqlite_db}" \ report \ > "${trinotate_report}" # Extract GO terms from annotation report "${trinotate_GO}" \ --Trinotate_xls "${trinotate_report}" \ -G \ --include_ancestral_terms \ > "${prefix}".go_annotations.txt # Make transcript features annotation map "${trinotate_features}" \ "${trinotate_report}" \ > "${prefix}".annotation_feature_map.txt # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log 

Transcriptome Annotation – DIAMOND BLASTx on C.bairdi Transcriptome v4.0 on Mox

Continued annotation of cbai_transcriptome_v4.0.fasta [Trinity de novo assembly from 20210317(https://ift.tt/2NziJW6] using DIAMOND BLASTx on Mox. This will be used as a component of Trinotate annotation downstream.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210318_cbai_diamond_blastx_transcriptome-v4.0 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210318_cbai_diamond_blastx_transcriptome-v4.0 ### BLASTx of Trinity de novo assembly of all C.bairdi RNAseq with BLASTx matches to C.opilio genome. ### cbai_transcriptome_v4.0.fasta ### Includes RNAseq short-hand of: 2020-GW, 2020-UW, 2019, 2018. ################################################################################### # These variables need to be set by user # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Programs array declare -A programs_array programs_array=( [diamond]="/gscratch/srlab/programs/diamond-0.9.29/diamond" ) # DIAMOND UniProt database dmnd=/gscratch/srlab/blastdbs/uniprot_sprot_20200123/uniprot_sprot.dmnd # Trinity assembly (FastA) fasta=/gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v4.0.fasta ################################################################################### # Strip leading path and extensions no_path=$(echo "${fasta##*/}") no_ext=$(echo "${no_path%.*}") # Run DIAMOND with blastx # Output format 6 produces a standard BLAST tab-delimited file ${programs_array[diamond]} blastx \ --db ${dmnd} \ --query "${fasta}" \ --out "${no_ext}".blastx.outfmt6 \ --outfmt 6 \ --evalue 1e-4 \ --max-target-seqs 1 \ --block-size 15.0 \ --index-chunks 4 # Generate checksums for future reference echo "" echo "Generating checksum for ${fasta}." md5sum "${fasta}">> fastq.checksums.md5 echo "Completed checksum for ${fasta}." echo "" ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "

TransDecoder – C.bairdi Transcriptome v4.0 on Mox

Began annotation of cbai_transcriptome_v4.0.fasta [Trinity de novo assembly from 20210317(https://ift.tt/2NziJW6] using TransDecoder on Mox. This will be used as a component of Trinotate annotation downstream.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210317_cbai_transdecoder_transcriptome_v4.0 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=8-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210317_cbai_transdecoder_transcriptome_v4.0 # Exit script if a command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Set workind directory as current directory wd="$(pwd)" # Set input file locations trinity_fasta="/gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v4.0.fasta" trinity_gene_map="/gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v4.0.fasta.gene_trans_map" # Capture trinity file name trinity_fasta_name=${trinity_fasta##*/} # Paths to input/output files blastp_out_dir="${wd}/blastp_out" transdecoder_out_dir="${wd}/${trinity_fasta_name}.transdecoder_dir" pfam_out_dir="${wd}/pfam_out" blastp_out="${blastp_out_dir}/${trinity_fasta_name}.blastp.outfmt6" pfam_out="${pfam_out_dir}/${trinity_fasta_name}.pfam.domtblout" lORFs_pep="${transdecoder_out_dir}/longest_orfs.pep" pfam_db="/gscratch/srlab/programs/Trinotate-v3.1.1/admin/Pfam-A.hmm" sp_db="/gscratch/srlab/programs/Trinotate-v3.1.1/admin/uniprot_sprot.pep" # Paths to programs blast_dir="/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin" blastp="${blast_dir}/blastp" hmmer_dir="/gscratch/srlab/programs/hmmer-3.2.1/src" hmmscan="${hmmer_dir}/hmmscan" transdecoder_dir="/gscratch/srlab/programs/TransDecoder-v5.5.0" transdecoder_lORFs="${transdecoder_dir}/TransDecoder.LongOrfs" transdecoder_predict="${transdecoder_dir}/TransDecoder.Predict" # Capture FastA MD5 checksum for future reference md5sum "${trinity_fasta}" >> "${trinity_fasta_name}".checksum.md5 # Make output directories mkdir "${blastp_out_dir}" mkdir "${pfam_out_dir}" # Extract long open reading frames "${transdecoder_lORFs}" \ --gene_trans_map "${trinity_gene_map}" \ -t "${trinity_fasta}" # Run blastp on long ORFs "${blastp}" \ -query "${lORFs_pep}" \ -db "${sp_db}" \ -max_target_seqs 1 \ -outfmt 6 \ -evalue 1e-5 \ -num_threads 28 \ > "${blastp_out}" # Run pfam search "${hmmscan}" \ --cpu 28 \ --domtblout "${pfam_out}" \ "${pfam_db}" \ "${lORFs_pep}" # Run Transdecoder with blastp and Pfam results "${transdecoder_predict}" \ -t "${trinity_fasta}" \ --retain_pfam_hits "${pfam_out}" \ --retain_blastp_hits "${blastp_out}" # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log 

Transcriptome Assessment – BUSCO Metazoa on C.bairdi Transcriptome v4.0 on Mox

I previously created a C.bairdi de novo transcriptome assembly v4.0 with Trinity from all our C.bairdi RNAseq reads which had BLASTx matches to the C.opilio genome and decided to assess its “completeness” using BUSCO and the metazoa_odb9 database.

BUSCO was run with the --mode transcriptome option on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210317_cbai_busco_transcriptome_v4.0 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=1-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210317_cbai_busco_transcriptome_v4.0 ### C.bairdi transcriptome assembly completeness assessment using BUSCO. ### This is checking cbai_transcriptome_v4.0 fasta # Load Python Mox module for Python module availability module load intel-python3_2017 # Load Open MPI module for parallel, multi-node processing module load icc_19-ompi_3.1.2 # SegFault fix? export THREADS_DAEMON_MODEL=1 ## Input files and settings busco_db=/gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9 transcriptome_fasta=/gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v4.0.fasta augustus_species=fly threads=40 ## Save working directory wd=$(pwd) # Extract FastA filename fasta_name=${transcriptome_fasta##*/} ## Set program paths augustus_bin=/gscratch/srlab/programs/Augustus-3.3.2/bin augustus_scripts=/gscratch/srlab/programs/Augustus-3.3.2/scripts blast_dir=/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/ busco=/gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py hmm_dir=/gscratch/srlab/programs/hmmer-3.2.1/src/ ## Augustus configs augustus_dir=${wd}/augustus augustus_config_dir=${augustus_dir}/config augustus_orig_config_dir=/gscratch/srlab/programs/Augustus-3.3.2/config ## BUSCO configs busco_config_default=/gscratch/srlab/programs/busco-v3/config/config.ini.default busco_config_ini=${wd}/config.ini # Export BUSCO config file location export BUSCO_CONFIG_FILE="${busco_config_ini}" # Export Augustus variable export PATH="${augustus_bin}:$PATH" export PATH="${augustus_scripts}:$PATH" export AUGUSTUS_CONFIG_PATH="${augustus_config_dir}" # Copy BUSCO config file cp ${busco_config_default} "${busco_config_ini}" # Make Augustus directory if it doesn't exist if [ ! -d "${augustus_dir}" ]; then mkdir --parents "${augustus_dir}" fi # Copy Augustus config directory cp --preserve -r ${augustus_orig_config_dir} "${augustus_dir}" # Edit BUSCO config file ## Set paths to various programs ### The use of the % symbol sets the delimiter sed uses for arguments. ### Normally, the delimiter that most examples use is a slash "/". ### But, we need to expand the variables into a full path with slashes, which screws up sed. ### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%"). sed -i "/^;cpu/ s/1/${threads}/" "${busco_config_ini}" sed -i "/^tblastn_path/ s%tblastn_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}" sed -i "/^makeblastdb_path/ s%makeblastdb_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}" sed -i "/^augustus_path/ s%augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}" sed -i "/^etraining_path/ s%etraining_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}" sed -i "/^gff2gbSmallDNA_path/ s%gff2gbSmallDNA_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^new_species_path/ s%new_species_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^optimize_augustus_path/ s%optimize_augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^hmmsearch_path/ s%hmmsearch_path = /home/osboxes/BUSCOVM/hmmer/hmmer-3.1b2-linux-intel-ia32/binaries/%path = ${hmm_dir}%" "${busco_config_ini}" # Run BUSCO/Augustus training ${busco} \ --in ${transcriptome_fasta} \ --out ${fasta_name} \ --lineage_path ${busco_db} \ --mode transcriptome \ --cpu ${threads} \ --long \ --species ${augustus_species} \ --tarzip \ --augustus_parameters='--progress=true' # Create checksum for potential verification md5sum "${transcriptome_fasta}" >> "${fasta_name}".checksum.md5 

Transcriptome Assembly – C.bairdi Transcriptome v4.0 Using Trinity on Mox

Continuing to addressing this GitHub issue, to generate an additional C.bairdi transcriptome, I finally got to the point of actually running the assembly using Trinity using the extracted reads from 20210316. Those reads were identified via BLASTx agianst the C.opilio genome proteins on 20210312. Trinty was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210317_cbai_trinity_RNAseq_transcriptome-v4.0 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=20-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210317_cbai_trinity_RNAseq_transcriptome-v4.0 ## Trinity assembly of C.bairdi RNAseq reads ## with BLASTx matches to NCBI C.opilio genome proteins (GCA_016584305.1) ## Assembly will be referred to as cbai_transcriptome_v4.0 ################################################################################### # These variables need to be set by user # Path to this script script_path=/gscratch/scrubbed/samwhite/outputs/20210317_cbai_trinity_RNAseq_transcriptome-v4.0/20210317_cbai_trinity_RNAseq_transcriptome-v4.0.sh # RNAseq FastQs directory reads_dir=/gscratch/scrubbed/samwhite/outputs/20210316_cbai-vs-copi_reads_extractions # Transcriptomes directory transcriptomes_dir=/gscratch/srlab/sam/data/C_bairdi/transcriptomes # CPU threads threads=40 # Capture specified RAM from this script # Carrot needed to limit grep to line starting with #SBATCH # Avoids grep-ing the command below. max_mem=$(grep "^#SBATCH --mem=" ${script_path} | awk -F [=] '{print $2}') # Paths to programs trinity_dir="/gscratch/srlab/programs/trinityrnaseq-v2.12.0" samtools="/gscratch/srlab/programs/samtools-1.10/samtools" # Set transcriptome name transcriptome_name="cbai_transcriptome_v4.0.fasta" # Programs array declare -A programs_array programs_array=( [samtools_faidx]="${samtools} faidx" \ [trinity]="${trinity_dir}/Trinity" \ [trinity_stats]="${trinity_dir}/util/TrinityStats.pl" \ [trinity_gene_trans_map]="${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl" \ [trinity_fasta_seq_length]="${trinity_dir}/util/misc/fasta_seq_length.pl" ) # FastQ array fastq_array=(${reads_dir}/*copi-BLASTx-match.fq.gz) # Variables R1_list="" R2_list="" ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Set variables trinity_out_dir="" assembly_stats="${transcriptome_name}_assembly_stats.txt" trinity_out_dir="${transcriptome_name}_trinity_out_dir" # Adds empty line between checksums and next info logged to SLURM output. echo "" # Create comma-separated lists of FastQ reads # Loop through read pairs # Increment by 2 to process next pair of FastQ files for (( i=0; i<${#fastq_array[@]} ; i+=2 )) do # Handle "fence post" problem # associated with comma placement if [[ ${i} -eq 0 ]]; then R1_list="${fastq_array[${i}]}," R2_list="${fastq_array[${i}+1]}," elif [[ ${i} -eq $(( ${#fastq_array[@]} - 1 )) ]]; then R1_list="${R1_list}${fastq_array[${i}]}" R2_list="${R2_list}${fastq_array[${i}+1]}" else R1_list="${R1_list}${fastq_array[${i}]}," R2_list="${R2_list}${fastq_array[${i}+1]}," fi done # Run Trinity without stranded RNAseq option ${programs_array[trinity]} \ --seqType fq \ --max_memory ${max_mem} \ --CPU ${threads} \ --output ${trinity_out_dir} \ --left "${R1_list}" \ --right "${R2_list}" # Rename generic assembly FastA mv "${trinity_out_dir}"/Trinity.fasta "${trinity_out_dir}"/"${transcriptome_name}" # Assembly stats ${programs_array[trinity_stats]} "${trinity_out_dir}"/"${transcriptome_name}" \ > "${assembly_stats}" # Create gene map files ${programs_array[trinity_gene_trans_map]} \ "${trinity_out_dir}"/"${transcriptome_name}" \ > "${trinity_out_dir}"/"${transcriptome_name}".gene_trans_map # Create sequence lengths file (used for differential gene expression) ${programs_array[trinity_fasta_seq_length]} \ "${trinity_out_dir}"/"${transcriptome_name}" \ > "${trinity_out_dir}"/"${transcriptome_name}".seq_lens # Create FastA index ${programs_array[samtools_faidx]} \ "${trinity_out_dir}"/"${transcriptome_name}" # Copy files to transcriptomes directory rsync -av \ "${trinity_out_dir}"/"${transcriptome_name}"* \ ${transcriptomes_dir} # Capture FastA checksums for verification cd "${trinity_out_dir}"/ echo "" echo "Generating checksum for ${transcriptome_name}" md5sum "${transcriptome_name}" > "${transcriptome_name}".checksum.md5 echo "Finished generating checksum for ${transcriptome_name}" echo "" # Generate input FastQ checksums for fastq in "${!fastq_array[@]}" do echo "" echo "Generating checksum for ${fastq_array[$fastq]}" md5sum "${fastq_array[$fastq]}" >> fastq_checksums.md5 echo "Checksum for ${fastq_array[$fastq]} complete." done ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "

Ditching full genomes, using hematodinium transcriptomev1.6

Using NCBI Genomes Here were my planned steps last time: BLASTx of transcriptome 2.0 against the protein sequences from the C. opilio genome downloaded from NCBI BLASTn of transcriptome 2.0 against the Amoebophrya sp. genome downloaded from NCBI Use those two BLAST results to determine which genes are Hematodinium in origin and which are C. bairdi for all of Transcriptome 2.0 I completed all steps (plus a few tBLASTx of transcriptome 2.0 against Amoebophrya genome), but ran into a major problem. Specifically, not enough of our sequences are finding matches – after combining all matches for the C. opilio and Amoebophrya sp. genomes, only around half were matched, even after dropping the e-value bar all the way down to 10^-2 (and that’s with double-counting all sequences that matched to both genomes!). Link to all scripts and the specific numbers of matches available here So what now? Well, we’re still trying…

from Aidan F. Coyle https://ift.tt/3lmdWDG
via IFTTT

Read Extractions – C.bairdi RNAseq Reads from C.opilio BLASTx Matches with seqkit on Mox

As part of addressing this GitHub issue, to generate an additional C.bairdi transcriptome, I needed to extract the reads ID’ed via BLASTX against the C.opilio genome on 20210312. Read extractions were performed using SeqKit on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210316_cbai-vs-copi_reads_extractions ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210316_cbai-vs-copi_reads_extractions ## Script for extracting C.bairdi RNAseq reads that matched to C.opilio ## genome via BLASTx on 20210312. ## Inputs are lists of RNAseq IDs ## Outputs are gzipped FastQ files of extracted reads ################################################################################### # These variables need to be set by user threads=40 # FastQ directory reads_dir=/gscratch/srlab/sam/data/C_bairdi/RNAseq # BLASTx results directory blastx_dir=/gscratch/scrubbed/samwhite/outputs/20210312_cbai-vs-copi_diamond_blastx # Programs array declare -A programs_array programs_array=( [seqkit]="/gscratch/srlab/programs/seqkit-0.15.0" \ [seqkit_grep]="/gscratch/srlab/programs/seqkit-0.15.0 grep" ) # FastQ array fastq_array=(${reads_dir}/*fastp-trim*.fq.gz) # BLASTx results files with query (FastQ read ID) only blastx_array=(${blastx_dir}/*.blastx.outfmt6-query) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # BLASTx FastQ files for file in "${!fastq_array[@]}" do # Remove path from transcriptome using parameter substitution no_path="${fastq_array[$file]##*/}" # Remove extensions from filename no_ext=${no_path%.fq.gz} # Set output filename out_file="${no_ext}_copi-BLASTx-match.fq.gz" # Generate checksums for reference echo "" echo "Generating checksum for ${fastq_array[$file]}." md5sum "${fastq_array[$file]}" >> fastq.checksums.md5 echo "Completed checksum for ${fastq_array[$file]}." echo "" # Extract reads using results from BLASTx files ${programs_array[seqkit_grep]} \ --pattern-file ${blastx_array[$file]} \ ${fastq_array[$file]} \ --out-file ${out_file} \ --threads ${threads} # Generate checksums of extracted reads reference echo "" echo "Generating checksum for ${out_file}." md5sum "${out_file}" >> fastq.checksums.md5 echo "Completed checksum for ${out_file}." echo "" done ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle blank argument for help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] \ || [[ "${program}" == "seqkit" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "

DIAMOND BLASTx – C.bairdi RNAseq vs C.opilio Genome Proteins on Mox

We want to generate an additional Tanner crab (Chionoecetes bairdi) transcriptome, per this GitHub issue, to generate an additional C.bairdi transcriptome. This has come about due to the release of the genome of a very closely related crab species, Chionoecetes opilio (Snow crab).

I used DIAMOND BLASTx along with the Snow crab genome protein FastA (8.7MB) from NCBI (Acc: GCA_016584305.1).

NOTE: Since this is geared toward just identifying matching reads, the BLASTx output format will only contain the query ID. There will be one BLASTx output file for each corresponding input FastQ file.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210312_cbai-vs-copi_diamond_blastx ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=20-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210312_cbai-vs-copi_diamond_blastx ## Script for running BLASTx (using DIAMOND) with all of our C.bairdi RNAseq data to-date. ## BLASTx against C.opilio _(snow crab) NCBI protein FastA ## Output will be in standard BLAST output format 6, but only query ID. ## Output will be used to extract just reads with matches to to C.opilio genome, ## for downstream transcriptome assembly ################################################################################### # These variables need to be set by user # FastQ directory reads_dir=/gscratch/srlab/sam/data/C_bairdi/RNAseq # DIAMOND database dmnd=/gscratch/srlab/sam/data/C_opilio/blastdbs/GCA_016584305.1_ASM1658430v1_protein.dmnd # Programs array declare -A programs_array programs_array=( [diamond]="/gscratch/srlab/programs/diamond-0.9.29/diamond" ) # FastQ array fastq_array=(${reads_dir}/*fastp-trim*.fq.gz) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # BLASTx FastQ files for fastq in "${!fastq_array[@]}" do # Remove path from transcriptome using parameter substitution fastq_name="${fastq_array[$fastq]##*/}" # Generate checksums for reference echo "" echo "Generating checksum for ${fastq_array[$fastq]}." md5sum "${fastq_array[$fastq]}">> fastq.checksums.md5 echo "Completed checksum for ${fastq_array[$fastq]}." echo "" # Run DIAMOND with blastx # Output format 6 query only returns a single query ID per match # block-size and index-chunks are computing resource optimatization paraeters ${programs_array[diamond]} blastx \ --db ${dmnd} \ --query "${fastq_array[$fastq]}" \ --out "${fastq_name}".blastx.outfmt6-query \ --outfmt 6 qseqid \ --evalue 1e-4 \ --max-target-seqs 1 \ --max-hsps 1 \ --block-size 15.0 \ --index-chunks 4 done ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "