Sam’s Notebook: Data Analysis – C.virginica MBD Sequencing Coverage

A while ago, in this GitHub issue, Steven tasked me with assessing some questions related to the sequencing coverage we get from MBD-BSseq. At the heart of it all was trying to get an idea of how much usable data we actually get when we do an MBD-BSseq project. Yaamini happened to have done an MBD-BSseq project relatively recently, and it’s one she’s actively analyzing, so we went with that data set.

Data was initially trimmed:

Subsequently, the data was concatenated, subset, and aligned using Bismark:

Today, I finally found the time to evaluate the alignment coverage of each of the Bismark sequence subsets. It was all done in a Jupyter Notebook, solely with Bash/Python! I used this project as an excuse to dive into Python a bit more, instead of using R. For what I needed to accomplish, this approach just felt simpler (instead of creating an R project and all of that).
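The notebook itself isn’t reproduced here, but the core of a per-base coverage check can be sketched with plain Bash/awk. This is a minimal sketch, not the actual notebook code; the input format assumed is the three-column (chrom, pos, depth) output of `samtools depth`, and the file contents are invented:

```shell
# Minimal sketch: mean per-base coverage from `samtools depth`-style
# output. The example data below stands in for real alignment depths.
cat > depth.example.txt <<'EOF'
NC_035780.1	1	10
NC_035780.1	2	20
NC_035780.1	3	30
EOF

# Mean coverage = total depth / number of covered bases
awk '{ sum += $3; bases++ } END { if (bases) printf "%.2f\n", sum / bases }' depth.example.txt
# prints 20.00
```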

Sam’s Notebook: Metagenomics Annotation – P.generosa Water Samples Using BLASTn on Mox and KronaTools Visualization to Compare pH Treatments

Nearing the end of this quick metagenomics comparison of taxonomic differences between the two pH treatments (pH=7.1 and pH=8.2). Previously ran:

After this completes, I’ll run KronaTools to get a rundown of the taxonomic makeup of these two different pH treatments. I don’t expect BLASTn to take terribly long (based on previous metagenomics runs with this data set); I’d guess around 6hrs.
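KronaTools reads the NCBI taxonomy ID from a column of the BLAST tabular output. Since the BLASTn jobs use `-outfmt "6 std staxids"`, the taxids land in column 13 (12 standard fields plus `staxids`), which is the column `ktImportTaxonomy` would be pointed at. A quick sketch using a fabricated outfmt6 line (all values made up):

```shell
# Sketch: confirm that `-outfmt "6 std staxids"` puts the taxid in
# column 13. This single line of output is entirely fabricated.
printf 'query1\tgi|123\t98.5\t100\t2\t0\t1\t100\t200\t299\t1e-50\t180\t6565\n' > example.outfmt6

# Pull the staxids column (what a taxonomy-aware tool would read)
awk -F'\t' '{ print $13 }' example.outfmt6
# prints 6565
```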

SBATCH script (GitHub):

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=blastn_metagenomics
    ## Allocation Definition
    #SBATCH --account=coenv
    #SBATCH --partition=coenv
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=25-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ## Turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190416_metagenomics_pgen_blastn

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo ${PATH} | tr : \\n >> system_path.log

    wd="$(pwd)"
    threads=28

    # Paths to programs
    blast_dir="/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin"
    blastn="${blast_dir}/blastn"

    # Paths to blastdbs
    blastdb_dir="/gscratch/srlab/blastdbs/ncbi-nr-nt-v5"
    blast_db="${blastdb_dir}/nt"

    # Directory with MetaGeneMark FastAs
    fasta_dir="/gscratch/scrubbed/samwhite/outputs/20190416_metagenomics_pgen_metagenemark"

    # Export BLAST database directory
    export BLASTDB=${blastdb_dir}

    # Loop through MetaGeneMark nucleotide FastAs:
    # create a list of those FastAs for reference,
    # parse out sample names,
    # and run BLASTn on each FastA.
    for fasta in ${fasta_dir}/*nucleotides.fasta
    do
      echo ${fasta} >> input.fasta.list.txt
      no_ext=${fasta%%.*}
      sample_name=$(echo ${no_ext##*/})
      # Run BLASTn on each MetaGeneMark FastA
      ${blastn} \
      -query ${fasta} \
      -db ${blast_db} \
      -max_target_seqs 1 \
      -outfmt "6 std staxids" \
      -evalue 1e-10 \
      -num_threads ${threads} \
      > ${wd}/${sample_name}.blastn.outfmt6
    done
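The sample-name parsing in the loop above leans on bash parameter substitution: `%%.*` strips everything from the first dot onward, and `##*/` strips everything up to the last slash. A standalone sketch (the path below is made up; only the pattern logic matters):

```shell
# Sketch of the parameter-substitution sample-name parsing.
# Note: %%.* matches the FIRST dot in the whole string, so a dot
# anywhere in the directory path would eat too much.
fasta="/some/path/sampleA.mgm-nucleotides.fasta"

no_ext=${fasta%%.*}        # strip from the first "." onward
sample_name=${no_ext##*/}  # strip everything up to the last "/"

echo ${sample_name}
# prints sampleA
```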

Sam’s Notebook: Metagenomics Gene Prediction – P.generosa Water Samples Using MetaGeneMark on Mox to Compare pH Treatments

Continuing with a relatively quick comparison of pH treatments (pH=7.1 vs. pH=8.2), I wanted to run gene prediction on the MEGAHIT assemblies I made yesterday. I ran MetaGeneMark on the two pH-specific assemblies on Mox. This should be a very fast process (I’m talking, like a couple of minutes fast), so it enhances the annotation with very little effort and time.

This will output a nucleotides FastA, a proteins FastA, and a GFF for each of the two assemblies (i.e. the two pH treatments).

SBATCH script (GitHub):

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=mgm
    ## Allocation Definition
    #SBATCH --account=coenv
    #SBATCH --partition=coenv
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=20-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ## Turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190416_metagenomics_pgen_metagenemark

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Load Open MPI module for parallel, multi-node processing
    module load icc_19-ompi_3.1.2

    # SegFault fix?
    export THREADS_DAEMON_MODEL=1

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo ${PATH} | tr : \\n >> system_path.log

    # Programs
    gmhmmp="/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/gmhmmp"
    mgm_mod="/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/MetaGeneMark_v1.mod"
    samtools="/gscratch/srlab/programs/samtools-1.9/samtools"

    # Variables
    assemblies_dir=/gscratch/scrubbed/samwhite/outputs/20190415_metagenomics_pgen_megahit

    ## Initialize array
    assemblies_array=()

    # Populate array with MegaHit FastAs
    # (wrapped in parentheses so this is an actual array, not a single string)
    assemblies_array=($(find ${assemblies_dir} -maxdepth 3 -name "*.contigs.fa"))

    # List of files in array
    printf "%s\n" "${assemblies_array[@]}" >> fastas.list.txt

    # Loop through array and run MetaGeneMark
    # Parse out sample name by removing .contigs.fa from filename
    # and remove path
    for sample in ${assemblies_array[@]}
    do
      no_ext=${sample%%.*}
      sample_name=$(echo ${no_ext##*/})
      # Run MetaGeneMark
      ## Specifying the following:
      ### -a : output predicted proteins
      ### -A : write predicted proteins to designated file
      ### -d : output predicted nucleotides
      ### -D : write predicted nucleotides to designated file
      ### -f 3 : Output format in GFF3
      ### -m : Model file (supplied with software)
      ### -o : write GFF3 to designated file
      ${gmhmmp} \
      -a \
      -A ${sample_name}.mgm-proteins.fasta \
      -d \
      -D ${sample_name}.mgm-nucleotides.fasta \
      -f 3 \
      -m ${mgm_mod} \
      ${sample} \
      -o ${sample_name}.mgm.gff
    done

    # Index FastAs
    for fasta in *.fasta
    do
      ${samtools} faidx ${fasta}
    done

Sam’s Notebook: Metagenome Assemblies – P.generosa Water Samples Trimmed HiSeqX Data Using Megahit on Mox to Compare pH Treatments

A report on our geoduck water metagenomics work is due later this week, and our in-depth analysis for this project using Anvi’o is likely to require at least another week to complete. Even though we have a broad overview of the metagenomic taxa present in these water samples, we don’t have the data in a format suitable for comparing across samples/treatments. So, I initiated our simplified pipeline (MEGAHIT > MetaGeneMark > BLASTn > KronaTools) for comparing the metagenomic data between the two treatments:

  • pH=7.1
  • pH=8.2

I ran MEGAHIT on the trimmed HiSeqX data, first concatenating the corresponding pH treatment FastQ files to create a single assembly for each pH treatment.
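Concatenating the gzipped FastQs directly with `cat` (as the script below does) works because the gzip format allows multiple compressed members in one file, which decompress as a single stream. A small self-contained sketch with invented file names and contents:

```shell
# Sketch: concatenated gzip files decompress as one continuous stream.
printf 'read1\n' > a.txt
printf 'read2\n' > b.txt
gzip a.txt b.txt                 # produces a.txt.gz and b.txt.gz

# Plain byte-level concatenation of the compressed files
cat a.txt.gz b.txt.gz > combined.fq.gz

gunzip -c combined.fq.gz         # prints read1 then read2
```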

SBATCH script (GitHub):

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=megahit
    ## Allocation Definition
    #SBATCH --account=coenv
    #SBATCH --partition=coenv
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=25-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ## Turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190415_metagenomics_pgen_megahit

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Load Open MPI module for parallel, multi-node processing
    module load icc_19-ompi_3.1.2

    # SegFault fix?
    export THREADS_DAEMON_MODEL=1

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo ${PATH} | tr : \\n >> system_path.log

    # Variables
    wd=$(pwd)
    fastq_dir=/gscratch/srlab/sam/data/metagenomics/P_generosa
    megahit=/gscratch/srlab/programs/megahit_v1.1.4_LINUX_CPUONLY_x86_64-bin/megahit
    bbmap_dir=/gscratch/srlab/programs/bbmap_38.34
    samtools=/gscratch/srlab/programs/samtools-1.9/samtools
    cpus=28

    ## Initialize arrays
    fastq_array_R1=()
    fastq_array_R2=()
    names_array=(pH71 pH82)

    # Create list of input FastQs used for concatenation
    # Concatenate all pH7.1 R1 FastQs
    # Uses parameter substitution to strip path
    for fastq in ${fastq_dir}/Library_Geoduck_MG_[367]*_R1_*.gz
    do
      echo ${fastq#${fastq_dir}} >> fastq.list.pH71.txt
      cat ${fastq} >> pH71.all.R1.fq.gz
    done

    # Create list of input FastQs used for concatenation
    # Concatenate all pH7.1 R2 FastQs
    # Uses parameter substitution to strip path
    for fastq in ${fastq_dir}/Library_Geoduck_MG_[367]*_R2_*.gz
    do
      echo ${fastq#${fastq_dir}} >> fastq.list.pH71.txt
      cat ${fastq} >> pH71.all.R2.fq.gz
    done

    # Create list of input FastQs used for concatenation
    # Concatenate all pH8.2 R1 FastQs
    # Uses parameter substitution to strip path
    for fastq in ${fastq_dir}/Library_Geoduck_MG_[125]*_R1_*.gz
    do
      echo ${fastq#${fastq_dir}} >> fastq.list.pH82.txt
      cat ${fastq} >> pH82.all.R1.fq.gz
    done

    # Create list of input FastQs used for concatenation
    # Concatenate all pH8.2 R2 FastQs
    # Uses parameter substitution to strip path
    for fastq in ${fastq_dir}/Library_Geoduck_MG_[125]*_R2_*.gz
    do
      echo ${fastq#${fastq_dir}} >> fastq.list.pH82.txt
      cat ${fastq} >> pH82.all.R2.fq.gz
    done

    # Populate R1 array with concatenated R1 FastQs
    for fastq in *R1*.fq.gz
    do
      fastq_array_R1+=(${fastq})
    done

    # Populate R2 array with concatenated R2 FastQs
    for fastq in *R2*.gz
    do
      fastq_array_R2+=(${fastq})
    done

    # Loop through samples
    for sample in ${!names_array[@]}
    do
      sample_name=$(echo ${names_array[sample]})
      mkdir ${sample_name} && cd ${sample_name}

      # Run Megahit using paired-end reads
      ${megahit} \
      -1 ${wd}/${fastq_array_R1[sample]} \
      -2 ${wd}/${fastq_array_R2[sample]} \
      --num-cpu-threads ${cpus} \
      --out-prefix ${sample_name}

      # Create FastA index file
      ${samtools} faidx megahit_out/${sample_name}.contigs.fa

      # Determine coverage
      ## Align reads with BBMap's bbwrap
      ${bbmap_dir}/bbwrap.sh \
      ref=megahit_out/${sample_name}.contigs.fa \
      in1=${fastq_array_R1[sample]} \
      in2=${fastq_array_R2[sample]} \
      out=${sample_name}.aln.sam.gz

      ## Output contig coverage
      ${bbmap_dir}/pileup.sh \
      in=${sample_name}.aln.sam.gz \
      out=${sample_name}.coverage.txt

      # Return to working directory
      cd ${wd}
    done

Sam’s Notebook: FastQC – WGBS Sequencing Data from Genewiz Received 20190408

We received whole genome bisulfite sequencing (WGBS) data from Genewiz last week on 20190408, so I ran FastQC on the files on my computer (swoose). FastQC results will be added to the Nightingales Google Sheet.

Each set of FastQs was processed with a bash script. The script (ending in .sh) can be found in each corresponding output folder (see below).

Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Larvae Day5 EPI99 with HiSeq and NovaSeq Data on Mox

I previously assembled and annotated a P.generosa larval Day 5 transcriptome (20190318 – mislabeled as Juvenile Day 5 in my previous notebook entries) using just our HiSeq data from our Illumina collaboration. This was an oversight, as I didn’t realize that we also had NovaSeq RNAseq data. So, I’ve initiated another de novo assembly using Trinity that incorporates both sets of data.

Ran a de novo assembly on our HiSeq and NovaSeq data from Hollie’s larval Day 5 EPI 99 sample. This was done for Christian to use in some long non-coding RNA (lncRNA) analysis.

NovaSeq data had been previously trimmed.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=trin_epi99
    ## Allocation Definition
    #SBATCH --account=coenv
    #SBATCH --partition=coenv
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=30-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ## Turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_EPI99_RNAseq

    # Exit script if a command fails
    set -e

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo ${PATH} | tr : \\n >> system_path.log

    # User-defined variables
    reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/epi_115
    threads=28
    assembly_stats=assembly_stats.txt

    # Paths to programs
    trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3"
    samtools="/gscratch/srlab/programs/samtools-1.9/samtools"

    ## Initialize arrays
    R1_array=()
    R2_array=()

    # Variables for R1/R2 lists
    R1_list=""
    R2_list=""

    # Create array of fastq R1 files
    R1_array=(${reads_dir}/*_R1_*.gz)

    # Create array of fastq R2 files
    R2_array=(${reads_dir}/*_R2_*.gz)

    # Create list of fastq files used in analysis
    ## Uses parameter substitution to strip leading path from filename
    for fastq in ${reads_dir}/*.gz
    do
      echo ${fastq##*/} >> fastq.list.txt
    done

    # Create comma-separated lists of FastQ reads
    R1_list=$(echo ${R1_array[@]} | tr " " ",")
    R2_list=$(echo ${R2_array[@]} | tr " " ",")

    # Run Trinity
    ${trinity_dir}/Trinity \
    --trimmomatic \
    --seqType fq \
    --max_memory 120G \
    --CPU ${threads} \
    --left \
    ${R1_list} \
    --right \
    ${R2_list}

    # Assembly stats
    ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \
    > ${assembly_stats}

    # Create gene map files
    ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
    trinity_out_dir/Trinity.fasta \
    > trinity_out_dir/Trinity.fasta.gene_trans_map

    # Create FastA index
    ${samtools} faidx \
    trinity_out_dir/Trinity.fasta
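Trinity’s `--left`/`--right` options take comma-separated lists of FastQ files, which the script builds from bash arrays with `tr`. A standalone sketch of that step, with invented file names:

```shell
# Sketch: turn a bash array of FastQ names into the comma-separated
# list Trinity expects for --left/--right. File names are invented.
R1_array=(sampleA_R1_001.fastq.gz sampleB_R1_001.fastq.gz)

# Expand the array space-separated, then swap spaces for commas
R1_list=$(echo ${R1_array[@]} | tr " " ",")

echo ${R1_list}    # sampleA_R1_001.fastq.gz,sampleB_R1_001.fastq.gz
```

This is fine for Mox paths, which contain no spaces; paths with embedded spaces would need an `IFS`-based join instead of `tr`.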

Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Gonad HiSeq and NovaSeq Data on Mox

I previously assembled and annotated a P.generosa gonad transcriptome (20190318) using just our HiSeq data from our Illumina collaboration. This was an oversight, as I didn’t realize that we also had NovaSeq RNAseq data. So, I’ve initiated another de novo assembly using Trinity that incorporates both sets of data.

NovaSeq data had been previously trimmed.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=trin_gonad
    ## Allocation Definition
    #SBATCH --account=coenv
    #SBATCH --partition=coenv
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=30-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ## Turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_gonad_RNAseq

    # Exit script if a command fails
    set -e

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo ${PATH} | tr : \\n >> system_path.log

    # User-defined variables
    reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/gonad
    threads=28
    assembly_stats=assembly_stats.txt

    # Paths to programs
    trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3"
    samtools="/gscratch/srlab/programs/samtools-1.9/samtools"

    ## Initialize arrays
    R1_array=()
    R2_array=()

    # Variables for R1/R2 lists
    R1_list=""
    R2_list=""

    # Create array of fastq R1 files
    R1_array=(${reads_dir}/*_R1_*.gz)

    # Create array of fastq R2 files
    R2_array=(${reads_dir}/*_R2_*.gz)

    # Create list of fastq files used in analysis
    ## Uses parameter substitution to strip leading path from filename
    for fastq in ${reads_dir}/*.gz
    do
      echo ${fastq##*/} >> fastq.list.txt
    done

    # Create comma-separated lists of FastQ reads
    R1_list=$(echo ${R1_array[@]} | tr " " ",")
    R2_list=$(echo ${R2_array[@]} | tr " " ",")

    # Run Trinity
    ${trinity_dir}/Trinity \
    --trimmomatic \
    --seqType fq \
    --max_memory 120G \
    --CPU ${threads} \
    --left \
    ${R1_list} \
    --right \
    ${R2_list}

    # Assembly stats
    ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \
    > ${assembly_stats}

    # Create gene map files
    ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
    trinity_out_dir/Trinity.fasta \
    > trinity_out_dir/Trinity.fasta.gene_trans_map

    # Create FastA index
    ${samtools} faidx \
    trinity_out_dir/Trinity.fasta