MBD BSseq Library Prep – M.magister MBD-selected DNA Using Pico Methyl-Seq Kit

After finishing the final set of eight MBD selections on 20201103, I’m finally ready to make the BSseq libraries using the Pico Methyl-Seq Library Prep Kit (ZymoResearch) (PDF). I followed the manufacturer’s protocols with the following notes/changes (organized by each section in the protocol):

GENERAL
  • Protocol was followed for using input DNA range 1ng – 50ng.
  • All thermalcycling was performed on the Roberts Lab PTC-200 (MJ Research).
  • All thermalcycling used a heated lid temp of 104oC, unless a different temp was specified in the protocol.
  • All elution steps were performed with heated elution buffer (55oC).
  • All index primers not included with the kit were a mix of the Illumina TruSeq P5 primer (SRID: 1733) and an Illumina TruSeq P7 index primer (see table at bottom of page). The mix consisted of 10uM each of P5 and P7 primers. See the Roberts Lab Primer Database (Google Sheet) for info on the primers.
SECTION 2
  • Used 0.5mL PCR tubes, since 0.2uL tubes were not specified and the 0.5mL tubes are easier to handle/work with.
  • PrepAmp Mix was prepared as a master mix and then distributed to samples as required
PrepAmp_component single_rxn_vol(uL) num_rxns total_vol(uL)
PrepAmp Buffer (5x) 1 26 26
PrepAmp Pre-mix 3.75 26 97.5
PrepAmp Polymerase 0.3 26 7.8
SECTION 3
SECTION 4
  • Recovery from SECTION 3 elution was only 10.5uL (expected 11.5uL based on protocol), so added 1.5uL H2O to each sample.
  • Based on input DNA range (1ng – 50ng), number of cycles was set to 8.
SECTION 5
  • Anticipating the loss in elution volume, samples were eluted with 13.5uL in the preceding cleanup step and yielded 12uL (the target input volume for this section).

Next step, run the samples on the Bioanalyzer for QC to see how they look.

Sample – Sequencing Primer Index Table
Sample Illumina_TruSeq_index_num Illumina_TruSeq_Index_seq SRID/ZymoID
CH01-06 1 CGATGT 1732
CH01-14 2 TGACCA A
CH01-22 3 ACAGTG 1731
CH01-38 4 GCCAAT B
CH03-04 5 CAGATC C
CH03-15 6 CTTGTA D
CH03-33 7 CGTGAT E
CH05-01 8 GCCTAA 1730
CH05-06 9 TCAAGT 1729
CH05-21 10 CTGATC 1728
CH05-24 11 AAGCTA 1727
CH05-26 12 GTAGCC F
CH07-06 13 TTGACT 1726
CH07-11 14 GGAACT 1725
CH07-24 15 TGACAT 1724
CH09-02 16 GGACGG 1723
CH09-11 17 CTCTAC 1722
CH09-13 18 GCGGAC 1721
CH09-28 19 TTTCAC 1720
CH09-29 20 GGCCAC 1719
CH10-01 21 CGAAAC 1718
CH10-08 22 CGTACG 1717
CH10-11 23 CCACTC 1805
CH10-19 25 ATCAGT 1804

All sample processing info/history can currently be found here (Google Sheet):

Any additional project info will end up in this GitHub repo:

from Sam’s Notebook https://ift.tt/37bTxdS
via IFTTT

RNA Isolation and Quantification – P.generosa Hemocytes from Shelly

Shelly asked me to isolate RNA from some P.generosa hemocytes (GitHub Issue) that she had.

RNA was isolated using the Quick DNA-RNA Microprep Plus Kit (ZymoResearch), according to the “Cells” protocol with the following notes/changes:

  • All samples were lysed with 400uL of DNA/RNA Lysis buffer
  • All samples were DNased using DNase aliquots from 20200127 (prepped by Kaitlyn Mitchell).

Samples were quantified using 1uL of sample with the Qubit RNA HS Assay (Invitrogen) on the Roberts Lab Qubit 3.0.

Two samples (022H and 044H) were too concentrated, so a 1:10 dilution was prepared and the two samples were re-quantified.

See the RESULTS section below for Qubit data and summary table of sample concentrations/yields.

SRA Submission – Ronits C.gigas Ploidy WGBS

A few days ago, we received the whole genome bisulfite sequencing (WBGS) data (20201110) from Ronit’s C.gigas diploid/triploid dessication/heat stress experiment (GitHub repo).

Now, I need to get these submitted to the NCBI short read archive (SRA).

For all publications, we should reference the SRA BioProject:

A summary table is provided below which includes individual SRA accessions for each sample:

Library_Name SeqID Tissue Ploidy Dessication Heat_Stress Library_kit SRA_BioProject SRA_accession
D11-C zr3534_1 ctenidia diploid yes no Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508698
D12-C zr3534_2 ctenidia diploid yes no Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508699
D13-C zr3534_3 ctenidia diploid yes no Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508700
D19-C zr3534_4 ctenidia diploid yes yes Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508701
D20-C zr3534_5 ctenidia diploid yes yes Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508702
T11-C zr3534_6 ctenidia triploid yes no Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508703
T12-C zr3534_7 ctenidia triploid yes no Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508704
T13-C zr3534_8 ctenidia triploid yes no Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508705
T19-C zr3534_9 ctenidia triploid yes yes Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508706
T20-C zr3534_10 ctenidia triploid yes yes Zymo-Seq WGBS Library Kit (Cat#: D5465) PRJNA678408 SRX9508707

from Sam’s Notebook https://ift.tt/35wMOvA
via IFTTT

FastQC-MultiQc – C.gigas Ploidy WGBS Raw Sequence Data from Ronits Project on Mox

Earlier today, we received the C.gigas ploidy WGBS data that we submitted to ZymoResearch on 20200820.

As part of our usual work flow, I needed to run FastQC.

Ran FastQC on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20201110_cgig_fastqc_ronit-ploidy-wgbs ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201110_cgig_fastqc_ronit-ploidy-wgbs ### FastQC assessment of raw sequencing from Ronit's ploidy WGBS. ################################################################################### # These variables need to be set by user # FastQC output directory output_dir=$(pwd) # Set number of CPUs to use threads=28 # Input/output files checksums=fastq_checksums.md5 fastq_list=fastq_list.txt raw_reads_dir=/gscratch/srlab/sam/data/C_gigas/wgbs/ # Paths to programs fastqc=/gscratch/srlab/programs/fastqc_v0.11.9/fastqc multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc # Programs associative array declare -A programs_array programs_array=( [fastqc]="${fastqc}" \ [multiqc]="${multiqc}" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Sync raw FastQ files to working directory rsync --archive --verbose \ "${raw_reads_dir}"zr3534*.fq.gz . # Populate array with FastQ files fastq_array=(*.fq.gz) # Pass array contents to new variable fastqc_list=$(echo "${fastq_array[*]}") # Run FastQC # NOTE: Do NOT quote ${fastqc_list} ${programs_array[fastqc]} \ --threads ${threads} \ --outdir ${output_dir} \ ${fastqc_list} # Create list of fastq files used in analysis echo "${fastqc_list}" | tr " " "\n" >> ${fastq_list} # Generate checksums for reference while read -r line do # Generate MD5 checksums for each input FastQ file echo "Generating MD5 checksum for ${line}." md5sum "${line}" >> "${checksums}" echo "Completed: MD5 checksum for ${line}." echo "" # Remove fastq files from working directory echo "Removing ${line} from directory" rm "${line}" echo "Removed ${line} from directory" echo "" done < ${fastq_list} # Run MultiQC ${programs_array[multiqc]} . # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} fi ${programs_array[$program]} -h echo "" echo "" echo "

Data Received – C.gigas Ploidy WGBS from Ronits Project via ZymoResearch

We received the data from our whole genome bisulfite sequencing (WGBS) submission to ZymoResearch on 2020820 for Ronit’s C.gigas diploid/triploid dessication/heat stress project.

Samples were sequenced using 150bp paired-end, on the Illumina NovaSeq.

Files have been added to the C.gigas folder in nightingales on Owl (Synology server).

I’ve updated the nightingales Google Sheet database as well.

Next up:

  • Run FatQC
  • Submit to NCBI sequence read archive (SRA).
SeqID Library_Name Tissue Ploidy Dessication Heat_Stress
zr3534_1 D11-C ctenidia diploid yes no
zr3534_2 D12-C ctenidia diploid yes no
zr3534_3 D13-C ctenidia diploid yes no
zr3534_4 D19-C ctenidia diploid yes yes
zr3534_5 D20-C ctenidia diploid yes yes
zr3534_6 T11-C ctenidia triploid yes no
zr3534_7 T12-C ctenidia triploid yes no
zr3534_8 T13-C ctenidia triploid yes no
zr3534_9 T19-C ctenidia triploid yes yes
zr3534_10 T20-C ctenidia triploid yes yes

from Sam’s Notebook https://ift.tt/3f4nT5T
via IFTTT

Hard Drive Upgrade – Gannet Synology Server

Completed upgrading the 12 x 8TB HDDs in our server, Gannet (Synology RS3618XS), to 12 x 16TB HDDs. The process was simple, but the repair process took ~20hrs for each new drive. So, the entire process required 12 separate days of pulling out one old HDD, replacing with a new HDD, and initiating the repair process in the Synology web interface.

But, it’s done and we now have a total of 145 TB (!!!), with ~72TB remaining to use up!

Screencap of Gannet storage space after upgrading all 12 HDDs.

from Sam’s Notebook https://ift.tt/32zM0nU
via IFTTT

Transcriptome Assessment – Crustacean Transcripome Completeness Evaluation Using BUSCO on Mox

Grace was recently working on writing up a manuscript which did a basic comparison of our C.bairdi transcriptome (cbai_transcriptome_v3.1) (see the Genomic Resources wiki for more deets) to two other species’ transcriptome assemblies. We wanted BUSCO evaluations as part of this comparison, but the two other species did not have BUSCO scores in their respective publications. As such, I decided to generate them myself, as BUSCO runs very quickly. The job was run on Mox.

Info on the other two species’ transcriptomes:

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20201110_crustacean-transcriptomes_busco ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=3-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201110_crustacean-transcriptomes_busco ################################################################################### # These variables need to be set by user ## Save working directory wd=$(pwd) # Transcriptomes array transcriptomes_array=( /gscratch/srlab/sam/data/C_maenas/transcriptomes/GBXE01.1.fsa_nt \ /gscratch/srlab/sam/data/L_vannamei/transcriptomes/Trinity_Trimmed_Whiteleg_Shrimp_Transcrimptome_Assmbled_Supplemental_Data_1.fasta ) ## Input files and settings busco_db=/gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9 augustus_species=fly threads=28 # Programs associative array declare -A programs_array programs_array=( [busco]="/gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py" ) ## Set program paths augustus_bin=/gscratch/srlab/programs/Augustus-3.3.2/bin augustus_orig_config_dir=/gscratch/srlab/programs/Augustus-3.3.2/config augustus_scripts=/gscratch/srlab/programs/Augustus-3.3.2/scripts blast_dir=/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/ hmm_dir=/gscratch/srlab/programs/hmmer-3.2.1/src/ # Export Augustus variable export PATH="${augustus_bin}:$PATH" export PATH="${augustus_scripts}:$PATH" ## BUSCO configs busco_config_default=/gscratch/srlab/programs/busco-v3/config/config.ini.default busco_config_ini=${wd}/config.ini # Export BUSCO config file location export BUSCO_CONFIG_FILE="${busco_config_ini}" # Copy BUSCO config file cp ${busco_config_default} "${busco_config_ini}" # Edit BUSCO config file ## Set paths to various programs ### The use of the % symbol sets the delimiter sed uses for arguments. ### Normally, the delimiter that most examples use is a slash "/". ### But, we need to expand the variables into a full path with slashes, which screws up sed. ### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%"). sed -i "/^;cpu/ s/1/${threads}/" "${busco_config_ini}" sed -i "/^tblastn_path/ s%tblastn_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}" sed -i "/^makeblastdb_path/ s%makeblastdb_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}" sed -i "/^augustus_path/ s%augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}" sed -i "/^etraining_path/ s%etraining_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}" sed -i "/^gff2gbSmallDNA_path/ s%gff2gbSmallDNA_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^new_species_path/ s%new_species_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^optimize_augustus_path/ s%optimize_augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^hmmsearch_path/ s%hmmsearch_path = /home/osboxes/BUSCOVM/hmmer/hmmer-3.1b2-linux-intel-ia32/binaries/%path = ${hmm_dir}%" "${busco_config_ini}" ################################################################################### # Load Python Mox module for Python module availability module load intel-python3_2017 # Load Open MPI module for parallel, multi-node processing module load icc_19-ompi_3.1.2 # SegFault fix? export THREADS_DAEMON_MODEL=1 for transcriptome in "${!transcriptomes_array[@]}" do # Remove path from transcriptome using parameter substitution transcriptome_name="${transcriptomes_array[$transcriptome]##*/}" ## Augustus config directories augustus_dir=${wd}/${transcriptome_name}_augustus augustus_config_dir=${augustus_dir}/config export AUGUSTUS_CONFIG_PATH="${augustus_config_dir}" # Make Augustus directory if it doesn't exist if [ ! -d "${augustus_dir}" ]; then mkdir --parents "${augustus_dir}" fi # Copy Augustus config directory cp --preserve -r ${augustus_orig_config_dir} "${augustus_dir}" # Run BUSCO/Augustus training ${programs_array[busco]} \ --in ${transcriptomes_array[$transcriptome]} \ --out ${transcriptome_name} \ --lineage_path ${busco_db} \ --mode transcriptome \ --cpu ${threads} \ --long \ --species ${augustus_species} \ --tarzip \ --augustus_parameters='--progress=true' # Capture FastA checksums for verification echo "" echo "Generating checksum for ${transcriptome_name}" md5sum "${transcriptomes_array[$transcriptome]}" > "${transcriptome_name}".checksum.md5 echo "Finished generating checksum for ${transcriptome_name}" echo "" done # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" ${programs_array[$program]} --help echo "" echo "" echo "

Data Wrangling – MultiQC on S.salar RNAseq from fastp and HISAT2 on Mox

In Shelly’s GitHub Issue for this S.salar project, she also requested a MultiQC report for the trimming (completed on 20201029) and the genome alignments (completed on 20201103).

I ran MultiQC on Mox using a build node and no script, since the command is so simple (e.g. multiqc .) and so quick.

RNAseq Alignments – S.salar HISAT2 BAMs to GCF_000233375.1_ICSASG_v2_genomic.gtf Transcriptome Using StringTie on Mox

This is a continuation of addressing Shelly Trigg’s (regarding some Salmo salar RNAseq data) request (GitHub Issue) to trim (completed 20201029), perform genome alignment (completed on 20201103), and transcriptome alignment.

To finish this up, I used StringTie to align to a subset of the GCF_000233375.1_ICSASG_v2_genomic.gtf provided by Shelly. I created a subset (see SBATCH script below for deets) that only included the same sequence IDs in Shelly’s subsetted genome Fasta (see genome alignment completed on 20201103 for more info on that.).

Shelly indicated she really just needed the FPKM and/or TPM values, so I ran StringTie with the -A option which spits out a tab-delimited table with both of those values in columns 8 and 9.

This was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20201104_ssal_RNAseq_stringtie_alignment ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201104_ssal_RNAseq_stringtie_alignment ### S.salar RNAseq StringTie alignment. ### Uses BAM alignment files from 20201103 and transcriptome ### GTF provided by Shelly (presumably GTF is from NCBI). ################################################################################### # These variables need to be set by user ## Assign Variables wd=$(pwd) # Set number of CPUs to use threads=27 # Input/output files bam_dir="/gscratch/scrubbed/samwhite/outputs/20201103_ssal_RNAseq_hisat2_alignment/" bam_md5s=bam_checksums.md5 genome="/gscratch/srlab/sam/data/S_salar/genomes/GCF_000233375.1_ICSASG_v2_genomic.fa" gtf_md5=gtf_checksums.md5 ncbi_transcriptome="/gscratch/srlab/sam/data/S_salar/transcriptomes/GCF_000233375.1_ICSASG_v2_genomic.gtf" chr_only_transriptome="GCF_000233375.1_ICSASG_v2_genomic_NC-chr-only.gtf" # Paths to programs stringtie="/gscratch/srlab/programs/stringtie-2.1.4.Linux_x86_64/stringtie" # Programs associative array declare -A programs_array programs_array=( [stringtie]="${stringtie}" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Subset transcriptome with NC_ only entries to match Shelly's genome # Only lines beginning with "NC_" grep "^NC_" "${ncbi_transcriptome}" >> "${chr_only_transriptome}" # Run StringTie for bam in "${bam_dir}"*.bam do # Parse out sample name by removing all text up to and including the last period. sample_name_no_path=${bam##*/} sample_name=${sample_name_no_path%%.*} # Exectute StringTie # Use list of of chromosome IDs (ref_list) # Output an abundance file with TPM and FPKM data in dedicated columns ${programs_array[stringtie]} \ ${bam} \ -G ${chr_only_transriptome} \ -A ${sample_name}_gene-abund.tab \ -p ${threads} # Generate BAM MD5 checksums md5sum "${bam}" >> "${bam_md5s}" done # Generate GTF MD5 checksum md5sum "${ncbi_transcriptome}" >> "${gtf_md5}" md5sum "${chr_only_transriptome}" >> "${gtf_md5}" # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} fi ${programs_array[$program]} -h echo "" echo "" echo "

RNAseq Alignments – Trimmed S.salar RNAseq to GCF_000233375.1_ICSASG_v2_genomic.fa Using Hisat2 on Mox

This is a continuation of addressing Shelly Trigg’s (regarding some Salmo salar RNAseq data) request (GitHub Issue) to trim (completed 20201029), perform genome alignment, and transcriptome alignment.

Ran HISAT2 with the trimmed FastQ files from 20201029 with the following options:

  • --dta: This stands for “downstream transcriptome alignment”. Since we’ll be performing a subsequent alignment using the transcriptome using StringTie, I decided to add this option.
  • --new-summary: This creates a summary file that can be easily read by MultiQC.

This was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20201103_ssal_RNAseq_hisat2_alignment ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201103_ssal_RNAseq_hisat2_alignment ### S.salar RNAseq Hisat2 alignment. ### Uses fastp-trimmed FastQ files from 20201029. ### Uses GCF_000233375.1_ICSASG_v2_genomic.fa as reference, ### created by Shelly Trigg. ### This is a subset of the NCBI RefSeq GCF_000233375.1_ICSASG_v2_genomic.fna. ### Includes only "chromosome" sequence entries. ################################################################################### # These variables need to be set by user ## Assign Variables # Set number of CPUs to use threads=27 # Input/output files fastq_checksums=fastq_checksums.md5 fastq_dir="/gscratch/srlab/sam/data/S_salar/RNAseq/" genome_fasta="/gscratch/srlab/sam/data/S_salar/genomes/GCF_000233375.1_ICSASG_v2_genomic.fa" genome_index_name="GCF_000233375.1_ICSASG_v2" # Paths to programs hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0" hisat2="${hisat2_dir}/hisat2" hisat2_build="${hisat2_dir}/hisat2-build" samtools="/gscratch/srlab/programs/samtools-1.10/samtools" ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() names_array=() # Programs associative array declare -A programs_array programs_array=( [hisat2]="${hisat2}" \ [hisat2-build]="${hisat2_build}" [samtools_index]="${samtools} index" \ [samtools_sort]="${samtools} sort" \ [samtools_view]="${samtools} view" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Capture date timestamp=$(date +%Y%m%d) # Create array of fastq R1 files for fastq in "${fastq_dir}"*_1.fastp-trim.20201029.fq.gz do fastq_array_R1+=("${fastq}") # Create array of sample names ## Uses parameter substitution to strip leading path from filename ## Uses awk to parse out sample name from filename names_array+=($(echo "${fastq#${fastq_dir}}" | awk -F"[_]" '{print $1 "_" $2}')) done # Create array of fastq R2 files for fastq in "${fastq_dir}"*_2.fastp-trim.20201029.fq.gz do fastq_array_R2+=("${fastq}") done # Build Hisat2 reference index "${programs_array[hisat2-build]}" \ "${genome_fasta}" \ "${genome_index_name}" \ -p "${threads}" \ 2> hisat2_build.err # Hisat2 alignments for index in "${!fastq_array_R1[@]}" do # Get current sample name sample_name=$(echo "${names_array[index]}") # Run Hisat2 # Sets --dta which tailors output for downstream transcriptome assemblers (e.g. Stringtie) # Sets --new-summary option for use with MultiQC "${programs_array[hisat2]}" \ -x "${genome_index_name}" \ --dta \ --new-summary \ -1 "${fastq_array_R1[index]}" \ -2 "${fastq_array_R2[index]}" \ -S "${sample_name}".sam \ 2> "${sample_name}"_hisat2.err # Sort SAM files, convert to BAM ${programs_array[samtools_view]} \ -@ "${threads}" \ -Su "${sample_name}".sam \ | ${programs_array[samtools_sort]} - \ -@ "${threads}" \ -o "${sample_name}".sorted.bam # Index sorted BAM file ${programs_array[samtools_index]} "${sample_name}".sorted.bam done # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in "${fastq_dir}"*fastp-trim.20201029.fq.gz do echo "${fastq#${fastq_dir}}" >> fastq.list.txt md5sum "${fastq}" >> ${fastq_checksums} done # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} fi ${programs_array[$program]} -h echo "" echo "" echo "