Ditching full genomes, using hematodinium transcriptomev1.6

Using NCBI Genomes Here were my planned steps last time: BLASTx of transcriptome 2.0 against the protein sequences from the C. opilio genome downloaded from NCBI BLASTn of transcriptome 2.0 against the Amoebophrya sp. genome downloaded from NCBI Use those two BLAST results to determine which genes are Hematodinium in origin and which are C. bairdi for all of Transcriptome 2.0 I completed all steps (plus a few tBLASTx of transcriptome 2.0 against Amoebophrya genome), but ran into a major problem. Specifically, not enough of our sequences are finding matches – after combining all matches for the C. opilio and Amoebophrya sp. genomes, only around half were matched, even after dropping the e-value bar all the way down to 10^-2 (and that’s with double-counting all sequences that matched to both genomes!). Link to all scripts and the specific numbers of matches available here So what now? Well, we’re still trying…

from Aidan F. Coyle https://ift.tt/3lmdWDG
via IFTTT

Read Extractions – C.bairdi RNAseq Reads from C.opilio BLASTx Matches with seqkit on Mox

As part of addressing this GitHub issue, to generate an additional C.bairdi transcriptome, I needed to extract the reads ID’ed via BLASTX against the C.opilio genome on 20210312. Read extractions were performed using SeqKit on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210316_cbai-vs-copi_reads_extractions ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210316_cbai-vs-copi_reads_extractions ## Script for extracting C.bairdi RNAseq reads that matched to C.opilio ## genome via BLASTx on 20210312. ## Inputs are lists of RNAseq IDs ## Outputs are gzipped FastQ files of extracted reads ################################################################################### # These variables need to be set by user threads=40 # FastQ directory reads_dir=/gscratch/srlab/sam/data/C_bairdi/RNAseq # BLASTx results directory blastx_dir=/gscratch/scrubbed/samwhite/outputs/20210312_cbai-vs-copi_diamond_blastx # Programs array declare -A programs_array programs_array=( [seqkit]="/gscratch/srlab/programs/seqkit-0.15.0" \ [seqkit_grep]="/gscratch/srlab/programs/seqkit-0.15.0 grep" ) # FastQ array fastq_array=(${reads_dir}/*fastp-trim*.fq.gz) # BLASTx results files with query (FastQ read ID) only blastx_array=(${blastx_dir}/*.blastx.outfmt6-query) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # BLASTx FastQ files for file in "${!fastq_array[@]}" do # Remove path from transcriptome using parameter substitution no_path="${fastq_array[$file]##*/}" # Remove extensions from filename no_ext=${no_path%.fq.gz} # Set output filename out_file="${no_ext}_copi-BLASTx-match.fq.gz" # Generate checksums for reference echo "" echo "Generating checksum for ${fastq_array[$file]}." md5sum "${fastq_array[$file]}" >> fastq.checksums.md5 echo "Completed checksum for ${fastq_array[$file]}." echo "" # Extract reads using results from BLASTx files ${programs_array[seqkit_grep]} \ --pattern-file ${blastx_array[$file]} \ ${fastq_array[$file]} \ --out-file ${out_file} \ --threads ${threads} # Generate checksums of extracted reads reference echo "" echo "Generating checksum for ${out_file}." md5sum "${out_file}" >> fastq.checksums.md5 echo "Completed checksum for ${out_file}." echo "" done ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle blank argument for help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] \ || [[ "${program}" == "seqkit" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "

DIAMOND BLASTx – C.bairdi RNAseq vs C.opilio Genome Proteins on Mox

We want to generate an additional Tanner crab (Chionoecetes bairdi) transcriptome, per this GitHub issue, to generate an additional C.bairdi transcriptome. This has come about due to the release of the genome of a very closely related crab species, Chionoecetes opilio (Snow crab).

I used DIAMOND BLASTx along with the Snow crab genome protein FastA (8.7MB) from NCBI (Acc: GCA_016584305.1).

NOTE: Since this is geared toward just identifying matching reads, the BLASTx output format will only contain the query ID. There will be one BLASTx output file for each corresponding input FastQ file.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210312_cbai-vs-copi_diamond_blastx ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=20-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210312_cbai-vs-copi_diamond_blastx ## Script for running BLASTx (using DIAMOND) with all of our C.bairdi RNAseq data to-date. ## BLASTx against C.opilio _(snow crab) NCBI protein FastA ## Output will be in standard BLAST output format 6, but only query ID. ## Output will be used to extract just reads with matches to to C.opilio genome, ## for downstream transcriptome assembly ################################################################################### # These variables need to be set by user # FastQ directory reads_dir=/gscratch/srlab/sam/data/C_bairdi/RNAseq # DIAMOND database dmnd=/gscratch/srlab/sam/data/C_opilio/blastdbs/GCA_016584305.1_ASM1658430v1_protein.dmnd # Programs array declare -A programs_array programs_array=( [diamond]="/gscratch/srlab/programs/diamond-0.9.29/diamond" ) # FastQ array fastq_array=(${reads_dir}/*fastp-trim*.fq.gz) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # BLASTx FastQ files for fastq in "${!fastq_array[@]}" do # Remove path from transcriptome using parameter substitution fastq_name="${fastq_array[$fastq]##*/}" # Generate checksums for reference echo "" echo "Generating checksum for ${fastq_array[$fastq]}." md5sum "${fastq_array[$fastq]}">> fastq.checksums.md5 echo "Completed checksum for ${fastq_array[$fastq]}." echo "" # Run DIAMOND with blastx # Output format 6 query only returns a single query ID per match # block-size and index-chunks are computing resource optimatization paraeters ${programs_array[diamond]} blastx \ --db ${dmnd} \ --query "${fastq_array[$fastq]}" \ --out "${fastq_name}".blastx.outfmt6-query \ --outfmt 6 qseqid \ --evalue 1e-4 \ --max-target-seqs 1 \ --max-hsps 1 \ --block-size 15.0 \ --index-chunks 4 done ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "