Data Wrangling – MultiQC on S.salar RNAseq from fastp and HISAT2 on Mox

In Shelly’s GitHub Issue for this S.salar project, she also requested a MultiQC report for the trimming (completed on 20201029) and the genome alignments (completed on 20201103).

I ran MultiQC on Mox using a build node and no script, since the command is so simple (e.g. multiqc .) and so quick.

RNAseq Alignments – S.salar HISAT2 BAMs to GCF_000233375.1_ICSASG_v2_genomic.gtf Transcriptome Using StringTie on Mox

This is a continuation of addressing Shelly Trigg’s (regarding some Salmo salar RNAseq data) request (GitHub Issue) to trim (completed 20201029), perform genome alignment (completed on 20201103), and transcriptome alignment.

To finish this up, I used StringTie to align to a subset of the GCF_000233375.1_ICSASG_v2_genomic.gtf provided by Shelly. I created a subset (see SBATCH script below for deets) that only included the same sequence IDs in Shelly’s subsetted genome Fasta (see genome alignment completed on 20201103 for more info on that.).

Shelly indicated she really just needed the FPKM and/or TPM values, so I ran StringTie with the -A option which spits out a tab-delimited table with both of those values in columns 8 and 9.

This was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20201104_ssal_RNAseq_stringtie_alignment ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201104_ssal_RNAseq_stringtie_alignment ### S.salar RNAseq StringTie alignment. ### Uses BAM alignment files from 20201103 and transcriptome ### GTF provided by Shelly (presumably GTF is from NCBI). ################################################################################### # These variables need to be set by user ## Assign Variables wd=$(pwd) # Set number of CPUs to use threads=27 # Input/output files bam_dir="/gscratch/scrubbed/samwhite/outputs/20201103_ssal_RNAseq_hisat2_alignment/" bam_md5s=bam_checksums.md5 genome="/gscratch/srlab/sam/data/S_salar/genomes/GCF_000233375.1_ICSASG_v2_genomic.fa" gtf_md5=gtf_checksums.md5 ncbi_transcriptome="/gscratch/srlab/sam/data/S_salar/transcriptomes/GCF_000233375.1_ICSASG_v2_genomic.gtf" chr_only_transriptome="GCF_000233375.1_ICSASG_v2_genomic_NC-chr-only.gtf" # Paths to programs stringtie="/gscratch/srlab/programs/stringtie-2.1.4.Linux_x86_64/stringtie" # Programs associative array declare -A programs_array programs_array=( [stringtie]="${stringtie}" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Subset transcriptome with NC_ only entries to match Shelly's genome # Only lines beginning with "NC_" grep "^NC_" "${ncbi_transcriptome}" >> "${chr_only_transriptome}" # Run StringTie for bam in "${bam_dir}"*.bam do # Parse out sample name by removing all text up to and including the last period. sample_name_no_path=${bam##*/} sample_name=${sample_name_no_path%%.*} # Exectute StringTie # Use list of of chromosome IDs (ref_list) # Output an abundance file with TPM and FPKM data in dedicated columns ${programs_array[stringtie]} \ ${bam} \ -G ${chr_only_transriptome} \ -A ${sample_name}_gene-abund.tab \ -p ${threads} # Generate BAM MD5 checksums md5sum "${bam}" >> "${bam_md5s}" done # Generate GTF MD5 checksum md5sum "${ncbi_transcriptome}" >> "${gtf_md5}" md5sum "${chr_only_transriptome}" >> "${gtf_md5}" # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} fi ${programs_array[$program]} -h echo "" echo "" echo "

RNAseq Alignments – Trimmed S.salar RNAseq to GCF_000233375.1_ICSASG_v2_genomic.fa Using Hisat2 on Mox

This is a continuation of addressing Shelly Trigg’s (regarding some Salmo salar RNAseq data) request (GitHub Issue) to trim (completed 20201029), perform genome alignment, and transcriptome alignment.

Ran HISAT2 with the trimmed FastQ files from 20201029 with the following options:

  • --dta: This stands for “downstream transcriptome alignment”. Since we’ll be performing a subsequent alignment using the transcriptome using StringTie, I decided to add this option.
  • --new-summary: This creates a summary file that can be easily read by MultiQC.

This was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20201103_ssal_RNAseq_hisat2_alignment ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201103_ssal_RNAseq_hisat2_alignment ### S.salar RNAseq Hisat2 alignment. ### Uses fastp-trimmed FastQ files from 20201029. ### Uses GCF_000233375.1_ICSASG_v2_genomic.fa as reference, ### created by Shelly Trigg. ### This is a subset of the NCBI RefSeq GCF_000233375.1_ICSASG_v2_genomic.fna. ### Includes only "chromosome" sequence entries. ################################################################################### # These variables need to be set by user ## Assign Variables # Set number of CPUs to use threads=27 # Input/output files fastq_checksums=fastq_checksums.md5 fastq_dir="/gscratch/srlab/sam/data/S_salar/RNAseq/" genome_fasta="/gscratch/srlab/sam/data/S_salar/genomes/GCF_000233375.1_ICSASG_v2_genomic.fa" genome_index_name="GCF_000233375.1_ICSASG_v2" # Paths to programs hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0" hisat2="${hisat2_dir}/hisat2" hisat2_build="${hisat2_dir}/hisat2-build" samtools="/gscratch/srlab/programs/samtools-1.10/samtools" ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() names_array=() # Programs associative array declare -A programs_array programs_array=( [hisat2]="${hisat2}" \ [hisat2-build]="${hisat2_build}" [samtools_index]="${samtools} index" \ [samtools_sort]="${samtools} sort" \ [samtools_view]="${samtools} view" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Capture date timestamp=$(date +%Y%m%d) # Create array of fastq R1 files for fastq in "${fastq_dir}"*_1.fastp-trim.20201029.fq.gz do fastq_array_R1+=("${fastq}") # Create array of sample names ## Uses parameter substitution to strip leading path from filename ## Uses awk to parse out sample name from filename names_array+=($(echo "${fastq#${fastq_dir}}" | awk -F"[_]" '{print $1 "_" $2}')) done # Create array of fastq R2 files for fastq in "${fastq_dir}"*_2.fastp-trim.20201029.fq.gz do fastq_array_R2+=("${fastq}") done # Build Hisat2 reference index "${programs_array[hisat2-build]}" \ "${genome_fasta}" \ "${genome_index_name}" \ -p "${threads}" \ 2> hisat2_build.err # Hisat2 alignments for index in "${!fastq_array_R1[@]}" do # Get current sample name sample_name=$(echo "${names_array[index]}") # Run Hisat2 # Sets --dta which tailors output for downstream transcriptome assemblers (e.g. Stringtie) # Sets --new-summary option for use with MultiQC "${programs_array[hisat2]}" \ -x "${genome_index_name}" \ --dta \ --new-summary \ -1 "${fastq_array_R1[index]}" \ -2 "${fastq_array_R2[index]}" \ -S "${sample_name}".sam \ 2> "${sample_name}"_hisat2.err # Sort SAM files, convert to BAM ${programs_array[samtools_view]} \ -@ "${threads}" \ -Su "${sample_name}".sam \ | ${programs_array[samtools_sort]} - \ -@ "${threads}" \ -o "${sample_name}".sorted.bam # Index sorted BAM file ${programs_array[samtools_index]} "${sample_name}".sorted.bam done # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in "${fastq_dir}"*fastp-trim.20201029.fq.gz do echo "${fastq#${fastq_dir}}" >> fastq.list.txt md5sum "${fastq}" >> ${fastq_checksums} done # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} fi ${programs_array[$program]} -h echo "" echo "" echo "

MBD Selection – M.magister Sheared Gill gDNA 16 of 24 Samples Set 3 of 3

Click here for notebook on the first eight samples processed. Click here for the second set of eight samples processed. M.magister (Dungeness crab) gill gDNA provided by Mackenzie Gavery was previously sheared on 20201026 and three samples were subjected to additional rounds of shearing on 20201027, in preparation for methyl bidning domain (MBD) selection using the MethylMiner Kit (Invitrogen).

Followed the manufacturer’s protocol for using <= 1ug of DNA (I’m using 1ug) with the following notes/changes:

  • Prepared beads for all 8 samples in single prep. Combined the amount of beads/protein needed for 8, 1ug reactions. Protein calculations and wash volumes were based off of 8ug of input DNA. Prepared beads were resuspended in 80uL (instead of 100uL) and 10uL were distributed to each of 8 tubes. Volume was then brought up to 100uL with 1x binding/wash buffer.
  • DNA capture incubation was performed overnight (~20hrs).
  • Non-captured DNA and wash volumes were combined in a single tube for each sample. These were stored at 4oC, but were not precipitated.
  • Ethanol precipitations were incubated at -80oC overnight (~20hrs).
  • Precipitated DNA was resuspended in 21uL of H2O (this allows the usage of 1uL for Qubit and leave 20uL as the maximum input volume for the subsequent PicoMethylSeq Kit (ZymoResearch)).

Samples were quantified using the Roberts Lab Qubit 3.0 with the Qubit 1x dsDNA HS Assay (Invitrogen), using 1uL of sample.

All samples were stored temporarily at 4oC.

For reference, all sample info for this project is here (Google Sheet):

MBD Selection – M.magister Sheared Gill gDNA 8 of 24 Samples Set 2 of 3

Click here for notebook on the first eight samples processed. M.magister (Dungeness crab) gill gDNA provided by Mackenzie Gavery was previously sheared on 20201026 and three samples were subjected to additional rounds of shearing on 20201027, in preparation for methyl bidning domain (MBD) selection using the MethylMiner Kit (Invitrogen).

Followed the manufacturer’s protocol for using <= 1ug of DNA (I’m using 1ug) with the following notes/changes:

  • Prepared beads for all 8 samples in single prep. Combined the amount of beads/protein needed for 8, 1ug reactions. Protein calculations and wash volumes were based off of 8ug of input DNA. Prepared beads were resuspended in 80uL (instead of 100uL) and 10uL were distributed to each of 8 tubes. Volume was then brought up to 100uL with 1x binding/wash buffer.
  • DNA capture incubation was performed overnight (~20hrs).
  • Non-captured DNA and wash volumes were combined in a single tube for each sample. These were stored at 4oC, but were not precipitated.
  • Ethanol precipitations were incubated at -80oC overnight (~20hrs).
  • Precipitated DNA was resuspended in 21uL of H2O (this allows the usage of 1uL for Qubit and leave 20uL as the maximum input volume for the subsequent PicoMethylSeq Kit (ZymoResearch)).

Samples were quantified using the Roberts Lab Qubit 3.0 with the Qubit 1x dsDNA HS Assay (Invitrogen), using 1uL of sample.

All samples were stored temporarily at 4oC.

For reference, all sample info for this project is here (Google Sheet):