Read Mapping – C.bairdi 201002558-2729-Q7 and 6129-403-26-Q7 Taxa-Specific NanoPore Reads to cbai_genome_v1.01.fasta Using Minimap2 on Mox

After extracting FastQ reads using seqtk on 20201013 for the various taxa I was interested in, the next step was to map those reads to the cbai_genome_v1.01 “genome” assembly from 20200917. Minimap2 maps long reads (e.g. NanoPore) in addition to short reads, so I decided to give that a rip.

Minimap2 was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20201014__cbai_minimap_nanopore-megan6-taxa-reads
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=15-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201014_cbai_minimap_nanopore-megan6-taxa-reads

###################################################################################
# These variables need to be set by user

## Assign Variables

# CPU threads to use
threads=27

# Genome FastA path
genome_fasta=/gscratch/srlab/sam/data/C_bairdi/genomes/cbai_genome_v1.01.fasta

# Paths to programs
minimap2="/gscratch/srlab/programs/minimap2-2.17_x64-linux/minimap2"
samtools="/gscratch/srlab/programs/samtools-1.10/samtools"

# Programs array
declare -A programs_array
programs_array=(
[minimap2]="${minimap2}" \
[samtools_sort]="${samtools} sort" \
[samtools_view]="${samtools} view"
)

###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

# Loop through each FastQ
for fastq in *.fq
do

  # Parse out sample name
  sample=$(echo "${fastq}" | awk -F"_" '{print $2}')

  # Capture taxa
  taxa=$(echo "${fastq}" | awk -F"_" '{print $3}')

  # Capture filename prefix
  prefix="${timestamp}_${sample}_${taxa}"

  # Run Minimap2 with Oxford NanoPore Technologies (ONT) option
  # Using SAM output format (-a option)
  ${programs_array[minimap2]} \
  -ax map-ont \
  "${genome_fasta}" \
  "${fastq}" \
  | ${programs_array[samtools_sort]} --threads ${threads} \
  -O sam \
  > "${prefix}".sorted.sam

  # Capture FastQ checksums for verification
  # Append (>>) so each FastQ's checksum is kept, not just the last one
  echo "Generating checksum for ${fastq}"
  md5sum "${fastq}" >> fastq_checksums.md5
  echo "Finished generating checksum for ${fastq}"
  echo ""

done

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# Capture program options
## Note: Trinity util/support scripts don't have options/help menus
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${program}: "
  echo ""
  ${programs_array[$program]} --help
  echo ""
  echo ""
  echo "
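
For a quick sanity check of the loop's filename parsing, here is a minimal sketch that runs in a scratch directory without Minimap2 or samtools. The two FastQ filenames (including the taxa names and the trailing "_reads" field) are hypothetical stand-ins; the real naming scheme only needs the sample in the second underscore-delimited field and the taxa in the third. Checksums are appended so that every file ends up in the checksum file.

```shell
set -e

# Work in a scratch directory so nothing real is touched
workdir=$(mktemp -d)
cd "${workdir}"

# Two tiny stand-in FastQ files; these names are hypothetical
printf '@read1\nACGT\n+\nIIII\n' > 20201013_201002558-2729-Q7_Alveolata_reads.fq
printf '@read2\nTTGA\n+\nIIII\n' > 20201013_6129-403-26-Q7_Arthropoda_reads.fq

timestamp=$(date +%Y%m%d)

for fastq in *.fq
do
  # Fields 2 and 3 of the underscore-delimited filename
  sample=$(echo "${fastq}" | awk -F"_" '{print $2}')
  taxa=$(echo "${fastq}" | awk -F"_" '{print $3}')
  prefix="${timestamp}_${sample}_${taxa}"

  # Record the output name the mapping step would write to
  echo "${prefix}.sorted.sam" >> expected_outputs.txt

  # Append so each file's checksum is kept
  md5sum "${fastq}" >> fastq_checksums.md5
done

checksum_count=$(wc -l < fastq_checksums.md5 | tr -d ' ')
cat expected_outputs.txt
```

Note that if the taxa were the last field of the filename, the third field would still carry the ".fq" extension, so it is worth checking the printed names before submitting a long job.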

Data Wrangling – C.bairdi NanoPore Reads Extractions With Seqtk on Mephisto

In my pursuit to identify which contigs/scaffolds of our C.bairdi genome assembly from 20200917 correspond to interesting taxa, based on taxonomic assignments produced by MEGAN6 on 20200928, I used MEGAN6 to extract taxa-specific reads from cbai_genome_v1.01 on 20201007. That output is only available in FastA format. Since I want the original reads in FastQ format, I will take the FastA sequence IDs (from the FastA index file) and provide them to seqtk to extract the FastQ reads for each sample and corresponding taxa.
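
The general idea can be sketched as follows. This is not the notebook's actual code: the read IDs and filenames (Alveolata.fasta.fai, all_reads.fastq, and so on) are hypothetical, and the index is a hand-made stand-in. It relies on the first tab-delimited column of a FastA index (.fai) being the sequence name, and on seqtk subseq accepting a file of names, one per line. The seqtk step is skipped if seqtk or the input FastQ is not present.

```shell
set -e

# Work in a scratch directory so nothing real is touched
workdir=$(mktemp -d)
cd "${workdir}"

# Stand-in FastA index: name, length, offset, line bases, line width
# (read IDs and filenames here are hypothetical)
printf 'read-0001\t1500\t11\t60\t61\n' > Alveolata.fasta.fai
printf 'read-0002\t2300\t1547\t60\t61\n' >> Alveolata.fasta.fai

# Column 1 of the index is the list of sequence IDs
cut -f 1 Alveolata.fasta.fai > Alveolata.read_ids.txt

# Pull the listed records out of the original FastQ.
# Only attempted when seqtk and the (hypothetical) FastQ exist.
if command -v seqtk > /dev/null 2>&1 && [ -f all_reads.fastq ]
then
  seqtk subseq all_reads.fastq Alveolata.read_ids.txt > Alveolata_reads.fq
fi

id_count=$(wc -l < Alveolata.read_ids.txt | tr -d ' ')
echo "Extracted ${id_count} read IDs"
```

Comparing the number of IDs in the list against the number of records in the resulting FastQ is a cheap way to confirm nothing was dropped.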

This was run on my personal computer (mephisto) and documented in a Jupyter Notebook:

Jupyter Notebook (GitHub):