Sam’s Notebook: Transcriptome Assembly – Hematodinium with MEGAN6 Taxonomy-specific Reads with Trinity on Mox

Ran a de novo assembly using the extracted reads classified under Alveolata from:

The assembly was performed with Trinity on Mox. Note that this assembly was not run with Trinity’s “stranded” option, whereas the previous Trinity assembly from 20200122 was. The reason for the difference is that the most recent RNAseq libraries, from 20200318, were not stranded libraries. Since the read set now includes unstranded libraries, I think it is best to use the “lowest common denominator” approach and assemble everything as unstranded.
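To make the stranded/unstranded difference concrete, here is a minimal sketch of how the two invocations differ on the command line. The helper `trinity_args` and the read file names are hypothetical and only print the arguments rather than running Trinity. `--SS_lib_type` is Trinity's strand-specific option; the `RF` value is an assumed orientation for illustration, not something recorded in this post.

```shell
#!/bin/bash
# Hypothetical helper: print the Trinity arguments for a paired-end assembly.
# Usage: trinity_args <left_reads> <right_reads> [ss_lib_type]
trinity_args() {
  local left="$1" right="$2" ss_lib_type="${3:-}"
  local args=(--seqType fq --left "${left}" --right "${right}")
  # Add --SS_lib_type only when a strand orientation is supplied,
  # i.e. when all libraries in the assembly are stranded.
  if [[ -n "${ss_lib_type}" ]]; then
    args+=(--SS_lib_type "${ss_lib_type}")
  fi
  printf '%s\n' "${args[*]}"
}

# Unstranded run (the approach used for this assembly): no --SS_lib_type
unstranded=$(trinity_args reads_R1.fq reads_R2.fq)

# Stranded run; "RF" is an assumed orientation, for illustration only
stranded=$(trinity_args reads_R1.fq reads_R2.fq RF)

echo "Trinity ${unstranded}"
echo "Trinity ${stranded}"
```

Leaving the option out entirely, rather than passing some "unstranded" value, is what makes Trinity treat the reads as unstranded.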

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=trinity_hemat
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200330_hemat_trinity_megan_RNAseq

# Exit script if a command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)
{
  date
  echo ""
  echo "System PATH for $SLURM_JOB_ID"
  echo ""
  printf "%0.s-" {1..10}
  echo "${PATH}" | tr : \\n
} >> system_path.log

# User-defined variables
reads_dir=/gscratch/srlab/sam/data/Hematodinium/RNAseq
threads=27
assembly_stats=assembly_stats.txt
timestamp=$(date +%Y%m%d)
fasta_name="${timestamp}.hemat.megan.Trinity.fasta"

# Paths to programs
trinity_dir="/gscratch/srlab/programs/trinityrnaseq-v2.9.0"
samtools="/gscratch/srlab/programs/samtools-1.10/samtools"

## Initialize arrays
R1_array=()
R2_array=()

# Variables for R1/R2 lists
R1_list=""
R2_list=""

# Create array of fastq R1 files
R1_array=(${reads_dir}/*_R1.fq)

# Create array of fastq R2 files
R2_array=(${reads_dir}/*_R2.fq)

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in ${reads_dir}/*.fq
do
  echo "${fastq##*/}" >> fastq.list.txt
done

# Create comma-separated lists of FastQ reads
R1_list=$(echo "${R1_array[@]}" | tr " " ",")
R2_list=$(echo "${R2_array[@]}" | tr " " ",")

# Run Trinity without the "stranded" setting (--SS_lib_type omitted)
${trinity_dir}/Trinity \
  --seqType fq \
  --max_memory 120G \
  --CPU ${threads} \
  --left "${R1_list}" \
  --right "${R2_list}"

# Rename generic assembly FastA
mv trinity_out_dir/Trinity.fasta trinity_out_dir/${fasta_name}

# Assembly stats
${trinity_dir}/util/TrinityStats.pl trinity_out_dir/${fasta_name} \
  > ${assembly_stats}

# Create gene map files
${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
  trinity_out_dir/${fasta_name} \
  > trinity_out_dir/${fasta_name}.gene_trans_map

# Create FastA index
${samtools} faidx \
  trinity_out_dir/${fasta_name}
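The read-list handling in the script can be tried on its own. The sketch below builds the R1/R2 arrays from globs, joins them into the comma-separated lists Trinity expects for `--left` and `--right`, and strips the leading path from a filename. The scratch directory and `sampleA`/`sampleB` file names are hypothetical placeholders. The join uses `IFS` instead of the script's `echo ... | tr " " ","`. That is a swapped-in alternative, not what the script above does, and it would also cope with paths containing spaces.

```shell
#!/bin/bash
set -e

# Hypothetical scratch directory with empty placeholder FastQ files
demo_dir=$(mktemp -d)
touch "${demo_dir}/sampleA_R1.fq" "${demo_dir}/sampleA_R2.fq" \
      "${demo_dir}/sampleB_R1.fq" "${demo_dir}/sampleB_R2.fq"

# Globs expand in sorted order, so R1 and R2 files stay paired by position
R1_array=("${demo_dir}"/*_R1.fq)
R2_array=("${demo_dir}"/*_R2.fq)

# Join array elements with commas by setting IFS inside the subshell;
# unlike piping through tr, this leaves any spaces within a path untouched
R1_list=$(IFS=,; echo "${R1_array[*]}")
R2_list=$(IFS=,; echo "${R2_array[*]}")

# Strip the leading path, using the same parameter substitution as the script
first_R1="${R1_array[0]##*/}"

echo "${R1_list}"
echo "${R2_list}"
echo "${first_R1}"

# Clean up the placeholder files
rm -r "${demo_dir}"
```

Because both lists come from sorted globs over the same sample prefixes, the Nth entry of the R1 list pairs with the Nth entry of the R2 list, provided every sample has both files.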