Genome Annotation – P.generosa v1.0 Assembly Using DIAMOND BLASTx for BlobToolKit on Mox

To continue towards getting our Panopea generosa (Pacific geoduck) genome assembly (v1.0) analyzed with BlobToolKit, per this GitHub Issue, I’ve decided to run each aspect of the pipeline manually, as I continue to have issues utilizing the automatic pipeline. As such, I’ve run DIAMOND BLASTx according to the BlobToolKit “Getting Started” guide on Mox.

IMPORTANT: This was BLASTed against a customized UniProt database, per the BlobToolKit instructions here. For posterity, here are the instructions provided on the website:

mkdir -p uniprot
wget -q -O uniprot/reference_proteomes.tar.gz \
 ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/$(curl \
     -vs ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/ 2>&1 | \
     awk '/tar.gz/ {print $9}')
cd uniprot
tar xf reference_proteomes.tar.gz

touch reference_proteomes.fasta.gz
find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz

echo "accession\taccession.version\ttaxid\tgi" > reference_proteomes.taxid_map
zcat */*/*.idmapping.gz | grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> reference_proteomes.taxid_map

diamond makedb -p 16 --in reference_proteomes.fasta.gz --taxonmap reference_proteomes.taxid_map --taxonnodes ../taxdump/nodes.dmp -d reference_proteomes.dmnd
cd -
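
One thing worth double-checking in the taxid map step: whether `echo` turns `\t` into real tabs depends on the shell (bash's builtin `echo` does not, unless `-e` is used), so the header line can end up with literal backslash-t sequences. Here's a minimal sketch, not part of the BlobToolKit instructions, that writes the header with `printf` instead and then counts lines that do not have exactly four tab-separated fields. The scratch directory, the file name `example.taxid_map`, and the three idmapping-style records (including the accessions and taxids) are all made up for illustration.

```shell
#!/bin/bash
set -eu

# Hypothetical scratch location and file name, for illustration only
tmp_dir=$(mktemp -d)
map_file="${tmp_dir}/example.taxid_map"

# printf expands \t into a real tab regardless of shell echo behavior
printf 'accession\taccession.version\ttaxid\tgi\n' > "${map_file}"

# Three made-up idmapping-style records: accession, ID type, value
printf 'ACC0001\tNCBI_TaxID\t1111\nACC0001\tGeneID\t12345\nACC0002\tNCBI_TaxID\t2222\n' \
  | grep "NCBI_TaxID" \
  | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> "${map_file}"

# Count lines that do not have exactly four tab-separated fields
bad_lines=$(awk -F'\t' 'NF != 4 {n++} END {print n+0}' "${map_file}")
total_lines=$(awk 'END {print NR}' "${map_file}")

echo "Total lines: ${total_lines}"
echo "Malformed lines: ${bad_lines}"
```

Running the same `awk -F'\t' 'NF != 4'` check against the real `reference_proteomes.taxid_map` before `diamond makedb` would be a cheap way to catch a malformed header.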

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20210415_pgen_diamond_blastx_Panopea-generosa-v1.0
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210415_pgen_diamond_blastx_Panopea-generosa-v1.0

### DIAMOND BLASTx of Panopea-generosa-v1.0 against customized UniProt database
### for import into BlobToolKit.

### Output is customized for input into BlobToolKit

###################################################################################
# These variables need to be set by user

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Programs array
declare -A programs_array
programs_array=(
[diamond]="/gscratch/srlab/programs/diamond-0.9.29/diamond"
)

# DIAMOND UniProt database
dmnd=/gscratch/srlab/blastdbs/20210401_uniprot_btk/reference_proteomes.dmnd

# Genome (FastA)
fasta=/gscratch/srlab/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.fa

###################################################################################

# Strip leading path and extensions
no_path=$(echo "${fasta##*/}")
no_ext=$(echo "${no_path%.*}")

# Run DIAMOND with blastx
# Customized output format for import into BlobToolKit
${programs_array[diamond]} blastx \
--db ${dmnd} \
--query "${fasta}" \
--out "${no_ext}".blastx.btk.outfmt6 \
--outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore \
--sensitive \
--evalue 1e-25 \
--max-target-seqs 1 \
--block-size 15.0 \
--index-chunks 4

# Generate checksums for future reference
echo ""
echo "Generating checksum for ${fasta}."
md5sum "${fasta}">> fastq.checksums.md5
echo "Completed checksum for ${fasta}."
echo ""

###################################################################################

# Capture program options
echo "Logging program options..."
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${program}: "
  echo ""
  # Handle samtools help menus
  if [[ "${program}" == "samtools_index" ]] \
  || [[ "${program}" == "samtools_sort" ]] \
  || [[ "${program}" == "samtools_view" ]]
  then
    ${programs_array[$program]}

  # Handle DIAMOND BLAST menu
  elif [[ "${program}" == "diamond" ]]; then
    ${programs_array[$program]} help

  # Handle NCBI BLASTx menu
  elif [[ "${program}" == "blastx" ]]; then
    ${programs_array[$program]} -help
  fi
  ${programs_array[$program]} -h
  echo ""
  echo ""
  echo "
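
For anyone wanting to sanity check the DIAMOND output before handing it to BlobToolKit, here's a rough sketch of two things the script relies on: how the output file name is derived from the FastA path via parameter expansion, and the fact that the custom `--outfmt 6` field list produces 15 tab-separated columns. The FastA path prefix and the single hit line (scaffold name, accession, taxid, and scores) are invented for illustration and are not real results.

```shell
#!/bin/bash
set -eu

# Hypothetical path; only the file name matches the one used in the script
fasta="/hypothetical/path/Panopea-generosa-v1.0.fa"

# Same stripping as the script, using parameter expansion directly
no_path="${fasta##*/}"   # drop everything up to the last "/"
no_ext="${no_path%.*}"   # drop the shortest trailing ".something"
out_name="${no_ext}.blastx.btk.outfmt6"
echo "Output file name: ${out_name}"

# One made-up hit line laid out in the script's field order:
# qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
hit_line=$(printf 'scaffold_x\t1111\t250.0\tscaffold_x\tACC0001\t85.5\t120\t10\t0\t100\t459\t1\t120\t1e-60\t250.0')

# Count the tab-separated fields and pull out the first three columns
n_fields=$(printf '%s\n' "${hit_line}" | awk -F'\t' '{print NF}')
first_three=$(printf '%s\n' "${hit_line}" | awk -F'\t' '{print $1 "," $2 "," $3}')

echo "Fields per line: ${n_fields}"
echo "qseqid,staxids,bitscore: ${first_three}"
```

The same `awk -F'\t' '{print NF}'` piped through `sort -u` on the real output file should report a single value if every line is intact.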

Genome Annotation – P.generosa v1.0 Assembly Using BLASTn for BlobToolKit on Mox

To continue towards getting our Panopea generosa (Pacific geoduck) genome assembly (v1.0) analyzed with BlobToolKit, per this GitHub Issue, I’ve decided to run each aspect of the pipeline manually, as I continue to have issues utilizing the automatic pipeline. As such, I’ve run BLASTn according to the BlobToolKit “Getting Started” guide on Mox.

SBATCH script (GitHub):


#!/bin/bash
## Job Name
#SBATCH --job-name=20210415_pgen_blastn-nt_Panopea-generosa-v1.0
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210415_pgen_blastn-nt_Panopea-generosa-v1.0

### BLASTn of P.generosa genome assembly Panopea-generosa-v1.0.fa
### against NCBI nt database.

### In preparation for use in BlobTools2

###################################################################################
# These variables need to be set by user

# Set number of CPUs to use
threads=40

# Input/output files
fasta="/gscratch/srlab/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.fa"
blast_db="/gscratch/srlab/blastdbs/20210401_ncbi_nt/nt"

# Programs
blastn="/gscratch/srlab/programs/ncbi-blast-2.10.1+/bin/blastn"

# Programs associative array
declare -A programs_array
programs_array=(
[blastn]="${blastn}"
)

###################################################################################

# Exit script if any command fails
set -e

# Run BLASTn with custom format/settings for use in blobtools2
${programs_array[blastn]} \
-db ${blast_db} \
-query ${fasta} \
-outfmt "6 qseqid staxids bitscore std" \
-max_target_seqs 10 \
-max_hsps 1 \
-evalue 1e-25 \
-num_threads ${threads} \
-out Panopea-generosa-v1.0_blobtools2_blast.out

###################################################################################

# Capture program options
echo "Logging program options..."
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${program}: "
  echo ""
  # Handle samtools help menus
  if [[ "${program}" == "samtools_index" ]] \
  || [[ "${program}" == "samtools_sort" ]] \
  || [[ "${program}" == "samtools_view" ]]
  then
    ${programs_array[$program]}

  # Handle DIAMOND BLAST menu
  elif [[ "${program}" == "diamond" ]]; then
    ${programs_array[$program]} help

  # Handle NCBI BLASTx menu
  elif [[ "${program}" == "blastx" ]] \
  || [[ "${program}" == "blastn" ]]; then
    ${programs_array[$program]} -help
  fi
  ${programs_array[$program]} -h
  echo ""
  echo ""
  echo "
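
As a stripped-down illustration of the program-options logging pattern used in these scripts, here's a sketch that loops over an associative array and appends each program's output to a log. It is not the full loop from the script: `bash --version` stands in for the BLAST help menu, the log goes to a temporary file, and the `|| true` guard is my assumption about one way to keep a non-zero exit from a help menu from stopping a script that runs under `set -e`. It needs bash 4 or newer for `declare -A`.

```shell
#!/bin/bash
set -euo pipefail

# Temporary log file, for illustration only
log_file=$(mktemp)

# Hypothetical programs array; bash itself stands in for diamond or blastn
declare -A programs_array
programs_array=(
[shell]="$(command -v bash)"
)

for program in "${!programs_array[@]}"
do
  {
    echo "Program options for ${program}: "
    echo ""
    # Assumed guard: a non-zero exit here would otherwise abort under set -e
    "${programs_array[$program]}" --version || true
    echo ""
  } >> "${log_file}" 2>&1
done

# Count how many programs were logged
logged=$(grep -c "Program options for" "${log_file}")
echo "Logged ${logged} program(s) to ${log_file}"
```

Grouping the commands in braces means a single redirection captures both stdout and stderr for everything inside the loop body.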

TWIP 11 – Easter Pistachios

This week we welcome Ariana to the mix and get Olivia booted up.