RNAseq Alignments – S.salar HISAT2 BAMs to GCF_000233375.1_ICSASG_v2_genomic.gtf Transcriptome Using StringTie on Mox

This is a continuation of addressing Shelly Trigg’s (regarding some Salmo salar RNAseq data) request (GitHub Issue) to trim (completed 20201029), perform genome alignment (completed on 20201103), and transcriptome alignment.

To finish this up, I used StringTie to align to a subset of the GCF_000233375.1_ICSASG_v2_genomic.gtf provided by Shelly. I created a subset (see SBATCH script below for deets) that only included the same sequence IDs in Shelly’s subsetted genome Fasta (see genome alignment completed on 20201103 for more info on that.).

Shelly indicated she really just needed the FPKM and/or TPM values, so I ran StringTie with the -A option which spits out a tab-delimited table with both of those values in columns 8 and 9.

This was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20201104_ssal_RNAseq_stringtie_alignment ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201104_ssal_RNAseq_stringtie_alignment ### S.salar RNAseq StringTie alignment. ### Uses BAM alignment files from 20201103 and transcriptome ### GTF provided by Shelly (presumably GTF is from NCBI). ################################################################################### # These variables need to be set by user ## Assign Variables wd=$(pwd) # Set number of CPUs to use threads=27 # Input/output files bam_dir="/gscratch/scrubbed/samwhite/outputs/20201103_ssal_RNAseq_hisat2_alignment/" bam_md5s=bam_checksums.md5 genome="/gscratch/srlab/sam/data/S_salar/genomes/GCF_000233375.1_ICSASG_v2_genomic.fa" gtf_md5=gtf_checksums.md5 ncbi_transcriptome="/gscratch/srlab/sam/data/S_salar/transcriptomes/GCF_000233375.1_ICSASG_v2_genomic.gtf" chr_only_transriptome="GCF_000233375.1_ICSASG_v2_genomic_NC-chr-only.gtf" # Paths to programs stringtie="/gscratch/srlab/programs/stringtie-2.1.4.Linux_x86_64/stringtie" # Programs associative array declare -A programs_array programs_array=( [stringtie]="${stringtie}" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Subset transcriptome with NC_ only entries to match Shelly's genome # Only lines beginning with "NC_" grep "^NC_" "${ncbi_transcriptome}" >> "${chr_only_transriptome}" # Run StringTie for bam in "${bam_dir}"*.bam do # Parse out sample name by removing all text up to and including the last period. sample_name_no_path=${bam##*/} sample_name=${sample_name_no_path%%.*} # Exectute StringTie # Use list of of chromosome IDs (ref_list) # Output an abundance file with TPM and FPKM data in dedicated columns ${programs_array[stringtie]} \ ${bam} \ -G ${chr_only_transriptome} \ -A ${sample_name}_gene-abund.tab \ -p ${threads} # Generate BAM MD5 checksums md5sum "${bam}" >> "${bam_md5s}" done # Generate GTF MD5 checksum md5sum "${ncbi_transcriptome}" >> "${gtf_md5}" md5sum "${chr_only_transriptome}" >> "${gtf_md5}" # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} fi ${programs_array[$program]} -h echo "" echo "" echo "