Data Wrangling – C.bairdi NanoPore 6129-403-26 Quality Filtering Using NanoFilt on Mox

Last week, I ran all of our Q7-filtered C.bairdi NanoPore reads through MEGAN6 to evaluate the taxonomic breakdown (on 20200917) and noticed that a large quantity of bases was assigned to E.canceri (a known microsporidian agent of infection in crabs) and Aquifex sp. (a genus of thermophilic bacteria), in addition to the expected Arthropoda assignments. Notably, Alveolata assignments were remarkably low.

Since our NanoPore data was a combination of an uninfected sample and a Hematodinium-infected sample, I decided to find out which of the two samples was contributing to the E.canceri and Aquifex sp. assignments. To do so, I first need to generate a single Q7-filtered FastQ for each sample using NanoFilt.

This set was the Hematodinium-infected hemolymph sample:

Job was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_nanofilt_Q7_6129_403_26_nanopore-data
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200928_cbai_nanofilt_Q7_6129_403_26_nanopore-data

###################################################################################
# These variables need to be set by user

# Load Anaconda
# Unknown why this is needed, but Anaconda will not run if this line is not included.
. "/gscratch/srlab/programs/anaconda3/etc/profile.d/conda.sh"

# Activate the NanoFilt Anaconda environment
conda activate nanofilt_2.6.0_env

# Declare array
raw_reads_dir_array=()

# Paths to reads
raw_reads_dir_array=(
"/gscratch/srlab/sam/data/C_bairdi/DNAseq/ont_FAL86873_d8db260e_cbai_6129_403_26"
)

# FastQ concatenation filename
fastq_cat=20200928_cbai_nanopore_6129_403_26.fastq
fastq_filtered=20200928_cbai_nanopore_6129_403_26_quality-7.fastq

# Paths to programs
nanofilt=NanoFilt

# Set mean quality filter (integer)
quality=7

###################################################################################

# Exit script if any command fails
set -e

# Initialize array
programs_array=()

# Programs array
programs_array=("${nanofilt}")

# Loop through NanoPore data directories
# to concatenate FastQ files and generate checksums
for directory in "${raw_reads_dir_array[@]}"
do
  # Find all FastQ files and concatenate into single file
  while IFS= read -r -d '' filename
  do
    # Concatenate all FastQ files into single file
    # for NanoFilt and generate MD5 checksums
    echo "Now concatenating ${filename} to ${fastq_cat}..."
    cat "${filename}" >> ${fastq_cat}
    echo "Concatenation of ${filename} to ${fastq_cat} complete."

    # Create checksums file
    echo "Now generating checksum for ${filename}..."
    echo ""
    md5sum "${filename}" >> fastq_checksums.md5
    echo "Checksum for ${filename} complete."
    echo ""
  done < <(find "${directory}" -name "*.fastq" -type f -print0)
done

# Generate MD5 checksum for concatenated FastQ file
echo "Now generating checksum for ${fastq_cat}..."
echo ""
md5sum "${fastq_cat}" >> fastq_checksums.md5
echo "Checksum for ${fastq_cat} complete."
echo ""

# Run NanoFilt
## Sets readtype to 1D (default)
## Filters on mean quality >= 7 (ONT "standard")
## FYI: seems to require piping stdin (i.e. cat fastq |) to NanoFilt...
echo "Running ${programs_array[nanofilt]}"
echo ""
cat ${fastq_cat} \
| ${programs_array[nanofilt]} \
--readtype 1D \
--quality ${quality} \
> ${fastq_filtered}
echo "${programs_array[nanofilt]} complete."
echo ""

# Generate MD5 checksum for filtered FastQ file
echo "Now generating checksum for ${fastq_filtered}..."
echo ""
md5sum "${fastq_filtered}" >> fastq_checksums.md5
echo "Checksum for ${fastq_filtered} complete."
echo ""

# Capture program options
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${programs_array[program]}: "
  echo ""
  ${programs_array[program]} -h
  echo ""
  echo ""
  }
done
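A nice side effect of writing a fastq_checksums.md5 file is that md5sum -c can verify file integrity later, e.g. after copying results off of Mox. A toy sketch with a made-up filename:

```shell
# Hypothetical example file standing in for a real FastQ
printf '@read1\nACGT\n+\n!!!!\n' > example.fastq

# Record the checksum, as the job script does
md5sum example.fastq > fastq_checksums.md5

# Later, verify the file against the recorded checksum
md5sum -c fastq_checksums.md5    # prints "example.fastq: OK"
```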

Data Wrangling – C.bairdi NanoPore 20102558-2729 Quality Filtering Using NanoFilt on Mox

Last week, I ran all of our Q7-filtered C.bairdi NanoPore reads through MEGAN6 to evaluate the taxonomic breakdown (on 20200917) and noticed that a large quantity of bases was assigned to E.canceri (a known microsporidian agent of infection in crabs) and Aquifex sp. (a genus of thermophilic bacteria), in addition to the expected Arthropoda assignments. Notably, Alveolata assignments were remarkably low.

Since our NanoPore data was a combination of an uninfected sample and a Hematodinium-infected sample, I decided to find out which of the two samples was contributing to the E.canceri and Aquifex sp. assignments. To do so, I first need to generate a single Q7-filtered FastQ for each sample using NanoFilt.

This set was the uninfected muscle sample:

Job was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_nanofilt_Q7_20102558-2729_nanopore-data
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200928_cbai_nanofilt_Q7_20102558-2729_nanopore-data

###################################################################################
# These variables need to be set by user

# Load Anaconda
# Unknown why this is needed, but Anaconda will not run if this line is not included.
. "/gscratch/srlab/programs/anaconda3/etc/profile.d/conda.sh"

# Activate the NanoFilt Anaconda environment
conda activate nanofilt_2.6.0_env

# Declare array
raw_reads_dir_array=()

# Paths to reads
raw_reads_dir_array=(
"/gscratch/srlab/sam/data/C_bairdi/DNAseq/ont_FAL58500_04bb4d86_20102558-2729" \
"/gscratch/srlab/sam/data/C_bairdi/DNAseq/ont_FAL58500_94244ffd_20102558-2729"
)

# FastQ concatenation filename
fastq_cat=20200928_cbai_nanopore_20102558-2729.fastq
fastq_filtered=20200928_cbai_nanopore_20102558-2729_quality-7.fastq

# Paths to programs
nanofilt=NanoFilt

# Set mean quality filter (integer)
quality=7

###################################################################################

# Exit script if any command fails
set -e

# Initialize array
programs_array=()

# Programs array
programs_array=("${nanofilt}")

# Loop through NanoPore data directories
# to concatenate FastQ files and generate checksums
for directory in "${raw_reads_dir_array[@]}"
do
  # Find all FastQ files and concatenate into single file
  while IFS= read -r -d '' filename
  do
    # Concatenate all FastQ files into single file
    # for NanoFilt and generate MD5 checksums
    echo "Now concatenating ${filename} to ${fastq_cat}..."
    cat "${filename}" >> ${fastq_cat}
    echo "Concatenation of ${filename} to ${fastq_cat} complete."

    # Create checksums file
    echo "Now generating checksum for ${filename}..."
    echo ""
    md5sum "${filename}" >> fastq_checksums.md5
    echo "Checksum for ${filename} complete."
    echo ""
  done < <(find "${directory}" -name "*.fastq" -type f -print0)
done

# Generate MD5 checksum for concatenated FastQ file
echo "Now generating checksum for ${fastq_cat}..."
echo ""
md5sum "${fastq_cat}" >> fastq_checksums.md5
echo "Checksum for ${fastq_cat} complete."
echo ""

# Run NanoFilt
## Sets readtype to 1D (default)
## Filters on mean quality >= 7 (ONT "standard")
## FYI: seems to require piping stdin (i.e. cat fastq |) to NanoFilt...
echo "Running ${programs_array[nanofilt]}"
echo ""
cat ${fastq_cat} \
| ${programs_array[nanofilt]} \
--readtype 1D \
--quality ${quality} \
> ${fastq_filtered}
echo "${programs_array[nanofilt]} complete."
echo ""

# Generate MD5 checksum for filtered FastQ file
echo "Now generating checksum for ${fastq_filtered}..."
echo ""
md5sum "${fastq_filtered}" >> fastq_checksums.md5
echo "Checksum for ${fastq_filtered} complete."
echo ""

# Capture program options
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${programs_array[program]}: "
  echo ""
  ${programs_array[program]} -h
  echo ""
  echo ""
  }
done

Orientation 2, This Time It’s CoE-ier

Continuing to Analyze Hematodinium BLAST results

Again, this all refers to my project to locally BLAST Hematodinium transcriptome data against the Swiss-Prot database

Made major progress! Ran the hemat transcriptome through DIAMOND BLASTx and analyzed my results! Everything went quite smoothly, which was great – didn't have too much time, as I've got the College of the Environment orientation for most of this afternoon. No major issues to report – all went well

[Unmatched queries for both DIAMOND and NCBI BLASTx and matches that were only made by DIAMOND and NCBI BLASTx](https://ift.tt/2RZMded)

[Code for the analyses](https://ift.tt/3cuF6UH)

Overall, DIAMOND BLASTx had fewer matched queries than NCBI BLASTx (2501 vs 2862). The two sets of matches were largely the same – DIAMOND matched 26 queries that NCBI didn't.

Speed-wise, DIAMOND was way, way faster. NCBI took several hours to run, whereas DIAMOND ran in a minute or two. It was astoundingly quick!

Anyway, that’s about all I’ve got for this day – excited to start classes next week!

from Oceanic Observations https://ift.tt/2ECiXas
via IFTTT

SAFS Orientation

Continuing to Analyze Hematodinium BLAST results

Again, this all refers to my project to locally BLAST Hematodinium transcriptome data against the Swiss-Prot database

Spent most of the morning at SAFS orientation! Chatted with the new grad students, learned some info about the program and schedule – typical orientation stuff

Given the limited time, made a surprising amount of progress today! Hit a substantial roadblock late last night – couldn’t figure out a way to scrub the two files (queries and matches) of whitespace and properly run a comparison function. Realized that it’d be optimal for both the query file and match file to be the same filetype, so made new .txt files. That completely fixed the problems and let me use diff to compare the two!

Diff eliminated all lines that were identical in both files, but still left our data a bit messy. Cleaned up the existing data, and got our results – 3486 of our 6348 queries had no match!
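For this kind of query-vs-match comparison, comm is a possible alternative to cleaning up diff output: with both files sorted, comm -23 prints only the lines unique to the first file, i.e. the queries with no match. A toy sketch with made-up Trinity IDs:

```shell
# Made-up IDs: three queries, of which only one found a match
printf 'TRINITY_DN1_c0_g1\nTRINITY_DN2_c0_g1\nTRINITY_DN3_c0_g1\n' | sort > query_list.txt
printf 'TRINITY_DN2_c0_g1\n' | sort > match_list.txt

# comm -23 suppresses lines unique to file 2 and lines common to both,
# leaving only queries that produced no match
comm -23 query_list.txt match_list.txt > unmatched.txt

wc -l < unmatched.txt    # prints 2
```

comm requires sorted input, but in exchange it skips the messy-output step entirely.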

Next task: rerun the transcriptome through DIAMOND BLASTx! Got started on it, but having some problems getting it to run. Ah well, good project for tomorrow

from Oceanic Observations https://ift.tt/32YyHha
via IFTTT

Assembly Assessment – BUSCO C.bairdi Genome v1.01 on Mox

After creating a subset of cbai_genome_v1.0 containing only contigs >1000bp yesterday (subset named cbai_genome_v1.01), I wanted to generate BUSCO scores for cbai_genome_v1.01. This is primarily just to keep info consistent on our Genomic Resources wiki, as I don’t expect these scores to differ at all from the cbai_genome_v1.0 BUSCO scores.

Analysis was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_genome_v1.01_busco
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=3-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200924_cbai_genome_v1.01_busco

###################################################################################
# Script to generate BUSCO score(s) for genome(s).

# These variables need to be set by user

## Save working directory
wd=$(pwd)

# Genomes directory
genomes_dir=/gscratch/srlab/sam/data/C_bairdi/genomes

# Genomes array
genomes_array=(
"${genomes_dir}"/cbai_genome_v1.01.fasta
)

## Input files and settings
busco_db=/gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9
augustus_species=fly
threads=28

# Programs associative array
declare -A programs_array
programs_array=(
[busco]="/gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py"
)

## Set program paths
augustus_bin=/gscratch/srlab/programs/Augustus-3.3.2/bin
augustus_orig_config_dir=/gscratch/srlab/programs/Augustus-3.3.2/config
augustus_scripts=/gscratch/srlab/programs/Augustus-3.3.2/scripts
blast_dir=/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/
hmm_dir=/gscratch/srlab/programs/hmmer-3.2.1/src/

# Export Augustus variables
export PATH="${augustus_bin}:$PATH"
export PATH="${augustus_scripts}:$PATH"

## BUSCO configs
busco_config_default=/gscratch/srlab/programs/busco-v3/config/config.ini.default
busco_config_ini=${wd}/config.ini

# Export BUSCO config file location
export BUSCO_CONFIG_FILE="${busco_config_ini}"

# Copy BUSCO config file
cp ${busco_config_default} "${busco_config_ini}"

# Edit BUSCO config file
## Set paths to various programs
### The use of the % symbol sets the delimiter sed uses for arguments.
### Normally, the delimiter that most examples use is a slash "/".
### But, we need to expand the variables into a full path with slashes, which screws up sed.
### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^;cpu/ s/1/${threads}/" "${busco_config_ini}"
sed -i "/^tblastn_path/ s%tblastn_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^makeblastdb_path/ s%makeblastdb_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^augustus_path/ s%augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^etraining_path/ s%etraining_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^gff2gbSmallDNA_path/ s%gff2gbSmallDNA_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^new_species_path/ s%new_species_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^optimize_augustus_path/ s%optimize_augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^hmmsearch_path/ s%hmmsearch_path = /home/osboxes/BUSCOVM/hmmer/hmmer-3.1b2-linux-intel-ia32/binaries/%path = ${hmm_dir}%" "${busco_config_ini}"

###################################################################################

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2

# SegFault fix?
export THREADS_DAEMON_MODEL=1

for genome in "${!genomes_array[@]}"
do
  # Remove path from genome using parameter substitution
  genome_name="${genomes_array[$genome]##*/}"

  ## Augustus config directories
  augustus_dir=${wd}/${genome_name}_augustus
  augustus_config_dir=${augustus_dir}/config
  export AUGUSTUS_CONFIG_PATH="${augustus_config_dir}"

  # Make Augustus directory if it doesn't exist
  if [ ! -d "${augustus_dir}" ]; then
    mkdir --parents "${augustus_dir}"
  fi

  # Copy Augustus config directory
  cp --preserve -r ${augustus_orig_config_dir} "${augustus_dir}"

  # Run BUSCO/Augustus training
  ${programs_array[busco]} \
  --in ${genomes_array[$genome]} \
  --out ${genome_name} \
  --lineage_path ${busco_db} \
  --mode genome \
  --cpu ${threads} \
  --long \
  --species ${augustus_species} \
  --tarzip \
  --augustus_parameters='--progress=true'

  # Capture FastA checksums for verification
  echo ""
  echo "Generating checksum for ${genome_name}"
  md5sum "${genomes_array[$genome]}" > "${genome_name}".checksum.md5
  echo "Finished generating checksum for ${genome_name}"
  echo ""
done

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# Capture program options
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${program}: "
  echo ""
  ${programs_array[$program]} --help
  echo ""
  echo ""
  }
done

Getting A Grep On Things

Analyzing Hematodinium BLAST results

Again, this all refers to my project to locally BLAST Hematodinium transcriptome data against the Swiss-Prot database

Spent the morning at the SAFS Coffee Hour! Met most of the new grad students, learned about all the best bagel places, and got some info on how the semester would go. Overall, a great morning.

After that, I started work on analyzing my blast results. Here are the goals:

- Compare no_max_hsps to max_hsps blasts
- Count lines of output files
- Determine which queries didn't produce alignments
- Re-run using DIAMOND BLASTx
- Compare results of DIAMOND BLASTx and NCBI BLASTx: speed, number of matches, SPID matches, anything else

It was fairly straightforward to count the output file lines, as I could just use wc -l. However, determining which queries didn't produce alignments proved more difficult. After quite a bit of trial and error (and more error), I figured out how to use grep to select just the query name. All queries begin with "TRINITY_", so I used the following command:

grep -o 'TRINITY[^ ]*' filename > query_list

If I understand it correctly, the [^ ] matches any character that is not a space, and the * repeats it, so the match extends from "TRINITY" up to (but not including) the first space on the line.
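As a toy check of that grep (using a made-up query name and an arbitrary rest-of-line):

```shell
# Hypothetical BLAST output line; only the TRINITY_ query name should be extracted
echo '>TRINITY_DN21_c0_g1_i1 len=300 some other fields' > blast_line.txt

# -o prints only the matched portion, not the whole line
grep -o 'TRINITY[^ ]*' blast_line.txt    # prints "TRINITY_DN21_c0_g1_i1"
```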

So at this point, I have two files – a list of queries and a list of subjects. Now I just need to scrub them both of whitespace (apparently necessary – the results differ when I run grep with -w) and cross-reference the two to isolate differences! Much easier said than done, but feeling fairly good. However, I'm unlikely to make too much progress tomorrow, as I have orientation for much of the day
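A minimal sketch of the whitespace scrubbing, with hypothetical filenames – a single trailing space is enough to make two otherwise-identical lines compare unequal:

```shell
# Two toy files whose lines differ only by a trailing space
printf 'TRINITY_DN1_c0_g1 \n' > queries_raw.txt   # note trailing space
printf 'TRINITY_DN1_c0_g1\n' > matches_raw.txt

# Strip trailing whitespace from every line of both files
sed 's/[[:space:]]*$//' queries_raw.txt > queries.txt
sed 's/[[:space:]]*$//' matches_raw.txt > matches.txt

# With the whitespace gone, the files compare as identical
diff queries.txt matches.txt && echo "files now identical"
```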

from Oceanic Observations https://ift.tt/3co24Na
via IFTTT

Data Wrangling – Subsetting cbai_genome_v1.0 Assembly with faidx

Previously assembled cbai_genome_v1.0.fasta with our NanoPore Q7 reads on 20200917 and noticed that there were numerous sequences well short of the 500bp minimum length that the assembler (Flye) was supposed to spit out. I created an Issue on the Flye GitHub page to find out why. The developer responded and determined it was an issue with the assembly polisher, and that sequences <500bp could be safely ignored.

So, I’ve decided to subset the cbai_genome_v1.0.fasta to exclude all sequences <1000bp, as that seems like a more reasonable minimum length for potential genes. I did not run this in a Jupyter Notebook, due to the brevity of the commands. Here are the commands, using faidx:

>1kbp subsetting

faidx --size-range 1000,1000000000 cbai_genome_v1.0.fasta > cbai_genome_v1.01.fasta 

Index new FastA

faidx cbai_genome_v1.01.fasta
samb@mephisto:~/data/C_bairdi/genomes$ sort -nk2,2 cbai_genome_v1.01.fasta.fai | head
contig_4272  1000  15642836  60  61
contig_4503  1000  16422183  60  61
contig_4429  1001  16145927  60  61
contig_1038  1002  230201    60  61
contig_1691  1005  1716551   60  61
contig_2992  1005  7322005   60  61
contig_3284  1006  9674445   60  61
contig_1810  1008  2050977   60  61
contig_408   1008  15069716  60  61
contig_1616  1009  1549839   60  61

Subsetting looks like it worked.
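The second column of a .fai index is the sequence length, so the cutoff can also be double-checked directly. A toy sketch with a hypothetical two-record .fai (columns: name, length, offset, linebases, linewidth):

```shell
# Made-up .fai records mimicking the real index above
printf 'contig_4272\t1000\t15642836\t60\t61\ncontig_4429\t1001\t16145927\t60\t61\n' > toy.fai

# Count records below the 1000 bp cutoff; after subsetting this should be 0
awk '$2 < 1000' toy.fai | wc -l    # prints 0
```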

Looking at sequence counts in FastAs:

samb@mephisto:~/data/C_bairdi/genomes$ for file in *.fasta; do grep --with-filename -c ">" $file; done
cbai_genome_v1.01.fasta:2431
cbai_genome_v1.0.fasta:3294

Probing Jupyter

Final Results (I Think) For Hematodinium BLAST

Again, this all refers to my project to locally BLAST Hematodinium transcriptome data against the Swiss-Prot database

I got some help in the morning lab meeting, and learned how to get Jupyter Notebooks to accept Unix commands! Just gotta start each cell with ! or %%bash. Once I learned that, things got a lot simpler. I spent a while screwing things up in Jupyter (kept accidentally running cells by pressing shift+enter), but eventually got all the code for the hematodinium blast written out. Then I ran the blast, linked my new GitHub repo, and boom – final results produced!

As before, I ran two separate blasts – one with a value of 1 for max_hsps, the other with no specified max_hsps value.

Now that I know how to use it, wow, Jupyter is pretty fantastic. Love how easy it is to go back and edit the code! Very impressed, excited to learn more.

from Oceanic Observations https://ift.tt/3mMckni
via IFTTT

He-Mat and the Masters of the Database

Success!

For some context, this post is a continuation of my entries from the past few days – I’ve been working to take Hematodinium transcriptome data and locally BLAST it against the Swiss-Prot protein database

Well, I figured out a way to bring my files from WSL to Windows and link em in my notebook! My initial results can be seen here

Important caveat: The results above were produced following the same outline as the tutorial. Matches were made using max_target_seqs (set to 1). However, while investigating how to use blastx, I found info indicating that max_target_seqs just grabs the first sequence that meets the e-value threshold (set to 1E-20 here), not necessarily the best match. To me, that seemed sub-optimal, so I reran blastx with a line added setting max_hsps to 1. Those results are linked here

I’m posting this mid-day just to get the data in. I’ll be spending the rest of the afternoon writing a biography for the Roberts Lab website!

from Oceanic Observations https://ift.tt/35X3X26
via IFTTT