MBD Selection – M.magister Sheared Gill gDNA 8 of 24 Samples Set 1 of 3

M.magister (Dungeness crab) gill gDNA provided by Mackenzie Gavery was previously sheared on 20201026 and three samples were subjected to additional rounds of shearing on 20201027, in preparation for methyl bidning domain (MBD) selection using the MethylMiner Kit (Invitrogen).

Followed the manufacturer’s protocol for using <= 1ug of DNA (I’m using 1ug) with the following notes/changes:

  • Prepared beads for all 8 samples in single prep. Combined the amount of beads/protein needed for 8, 1ug reactions. Protein calculations and wash volumes were based off of 8ug of input DNA. Prepared beads were resuspended in 80uL (instead of 100uL) and 10uL were distributed to each of 8 tubes. Volume was then brought up to 100uL with 1x binding/wash buffer.
  • DNA capture incubation was performed overnight (~20hrs).
  • Non-captured DNA and wash volumes were combined in a single tube for each sample. These were stored at 4oC, but were not precipitated.
  • Ethanol precipitations were incubated at -80oC overnight (~20hrs).
  • Precipitated DNA was resuspended in 21uL of H2O (this allows the usage of 1uL for Qubit and leave 20uL as the maximum input volume for the subsequent PicoMethylSeq Kit (ZymoResearch)).

Samples were quantified using the Roberts Lab Qubit 3.0 with the Qubit 1x dsDNA HS Assay (Invitrogen), using 1uL of sample.

All samples were stored temporarily at 4oC.

For reference, all sample info for this project is here (Google Sheet):

DNA Shearing – M.magister gDNA Additional Shearing CH05-01_21 CH07-11 and Bioanalyzer

After shearing all of the M.magister gill gDNA on 20201026, there were still three samples that still had average fragment lengths that were a bit longer than desired (~750bp, but want ~250 – 550bp):

  • CH05-01
  • CH05-21
  • CH07-11

I initially chose to run them on the sonicator (Bioruptor 300; Diagenode) for two successive rounds of 5 cycles (30s ON, 30s OFF; low intensity). This proved to be insufficient (see RESULTS section below), so I ran three additional successive rounds of 5 cycles (30s ON, 30s OFF; low intensity).

The original samples used 1ug of DNA in a volume of 51uL (this volume was selected to simplify downstream calculations, after using 1uL for the Bioanalyzer after shearing), using 0.65mL prelubricated snap cap tubes (Costar; Cat# 3206). Have ended up using a total of 3uL (includes two rounds of Bioanalyzer from today’s assays), leaving these three samples with a volume of 48uL.

Post-sonication/shearing, samples were run on High Sensitivity DNA Assay chips in the Bioanalyzer 2100 (Agilent).

All samples and volumes used are listed in the following Google Sheet (originally provided by Mackenzie Gavery):

DNA Shearing – M.magister gDNA Shearing All Samples and Bioanalyzer

I previously ran some shearing tests on 20201022 to determine how many cycles to run on the sonicator (Bioruptor 300; Diagenode) to achieve an average fragment length of ~350 – 500bp in preparation for MBD-BSseq. The determination was 70 cycles (30s ON, 30s OFF; low intensity), sonicating for 35 cycles, followed by successive rounds of 5 cycles each.

I used 1ug of DNA in a volume of 51uL (this volume was selected to simplify downstream calculations, after using 1uL for the Bioanalyzer after shearing), using 0.65mL prelubricated snap cap tubes (Costar; Cat# 3206).

Post-sonication/shearing, samples were run on High Sensitivity DNA Assay chips in the Bioanalyzer 2100 (Agilent).

All samples and volumes used are listed in the following Google Sheet (originally provided by Mackenzie Gavery):

DNA Shearing – M.magister CH05-21 gDNA Full Shearing Test and Bioanalyzer

Yesterday, I did some shearing of Metacarcinus magister gill gDNA on a test sample (CH05-21) to determine how many cycles to run on the sonicator (Bioruptor 300; Diagenode) to achieve an average fragment length of ~350 – 500bp in preparation for MBD-BSseq. The determination from yesterday was 70 cycles (30s ON, 30s OFF; low intensity). That determination was made by first sonicating for 35 cycles, followed by successive rounds of 5 cycles each. I decided to repeat this, except by doing it in a single round of sonication.

I used 1ug of DNA in a volume of 50uL, using 0.65mL prelubricated snap cap tubes (Costar; Cat# 3206).

It turns out the Bioruptor has a maximum cycle setting of 60 cycles, so I decided to do 35 cycles, immediately followed by another 35 cycles.

Post-sonication/shearing, samples were run on High Sensitivity DNA Assay chips in the Bioanalyzer 2100 (Agilent).

DNA Shearing – M.magister gDNA Shear Testing and Bioanalyzer

Steven assigned me to do some MBD-BSseq library prep (GitHub Issue) for some Dungeness crab (Metacarcinus magister) DNA samples provided by Mackenzie Gavery. The DNA was isolated from juvenile (J6/J7 developmental stages) gill tissue. One of the first steps in MBD-BSseq is to fragment DNA to a desired size (~350 – 500bp in our case). However, we haven’t worked with Metacarcinus magister DNA previously, so I need to empirically determine sonicator (Bioruptor 300; Diagenode) settings for these samples.

I used 1ug of DNA in a volume of 50uL, using 0.65mL prelubricated snap cap tubes (Costar; Cat# 3206).

Initially, I did a 35 cycle (30s ON, 30s OFF; low intensity setting) run and determined it was insufficient, so ran increments of 5 cycles and pulled out 1.5uL after each one to run on the Bioanalyzer.

Post-sonication/shearing, samples were run on High Sensitivity DNA Assay chips in the Bioanalyzer 2100 (Agilent).

Read Mapping – C.bairdi 201002558-2729-Q7 and 6129-403-26-Q7 Taxa-Specific NanoPore Reads to cbai_genome_v1.01.fasta Using Minimap2 on Mox

After extracting FastQ reads using seqtk on 20201013 from the various taxa I had been interested in, the next thing needed doing was mapping reads to the cbai_genome_v1.01 “genome” assembly from 20200917. I found that Minimap2 will map long reads (e.g. NanoPore), in addition to short reads, so I decided to give that a rip.

Minimap2 was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20201014__cbai_minimap_nanopore-megan6-taxa-reads ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=15-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201014_cbai_minimap_nanopore-megan6-taxa-reads ################################################################################### # These variables need to be set by user ## Assign Variables # CPU threads to use threads=27 # Genome FastA path genome_fasta=/gscratch/srlab/sam/data/C_bairdi/genomes/cbai_genome_v1.01.fasta # Paths to programs minimap2="/gscratch/srlab/programs/minimap2-2.17_x64-linux/minimap2" samtools="/gscratch/srlab/programs/samtools-1.10/samtools" # Programs array declare -A programs_array programs_array=( [minimap2]="${minimap2}" \ [samtools_sort]="${samtools} sort" \ [samtools_view]="${samtools} view" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Capture date timestamp=$(date +%Y%m%d) # Loop through each FastQ for fastq in *.fq do # Parse out sample name sample=$(echo "${fastq}" | awk -F"_" '{print $2}') # Caputure taxa taxa=$(echo "${fastq}" | awk -F"_" '{print $3}') # Capture filename prefix prefix="${timestamp}_${sample}_${taxa}" # Run Minimap2 with Oxford NanoPore Technologies (ONT) option # Using SAM output format (-a option) ${programs_array[minimap2]} \ -ax map-ont \ ${genome_fasta} \ ${fastq} \ | ${programs_array[samtools_sort]} --threads ${threads} \ -O sam \ > "${prefix}".sorted.sam # Capture FastA checksums for verification () echo "Generating checksum for ${fastq}" md5sum "${fastq}" > fastq_checksums.md5 echo "Finished generating checksum for ${fastq}" echo "" done # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log # Capture program options ## Note: Trinity util/support scripts don't have options/help menus for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" ${programs_array[$program]} --help echo "" echo "" echo "

Data Wrangling – C.bairdi NanoPore Reads Extractions With Seqtk on Mephisto

In my pursuit to identify which contigs/scaffolds of our C.bairdi” genome assembly from 20200917 correspond to interesting taxa, based on taxonomic assignments produced by MEGAN6 on 20200928, I used MEGAN6 to extract taxa-specific reads from cbai_genome_v1.01 on 20201007 – the output is only available in FastA format. Since I want the original reads in FastQ format, I will use the FastA sequence IDs (from the FastA index file) and provide that to seqtk to extract the FastQ reads for each sample and corresponding taxa.

This was run on my personal computer (mephisto) and documented in a Jupyter Notebook:

Jupyter Notebook (GitHub):

NanoPore Reads Extractions – C.bairdi Taxonomic Reads Extractions with MEGAN6 on 201002558-2729-Q7 and 6129-403-26-Q7

After completing the taxonomic comparisons of 201002558-2729-Q7 and 6129-403-26-Q7 on 20201002, I decided to extract reads assigned to the following taxa for further exploration (primarily to identify contigs/scaffolds in our cbai_genome_v1.0.fasta (19MB).

Used MEGAN6 to extract reads from the MEGAN6 RMA6 files from 201002558-2729-Q7 taxonomic assignments on 20200928 and from 6129-403-26-Q7 on 20200928.

Comparison – C.bairdi 20102558-2729 vs. 6129-403-26 NanoPore Taxonomic Assignments Using MEGAN6

After noticing that the initial MEGAN6 taxonomic assignments for our combined C.bairdi NanoPore data from 20200917 revealed a high number of bases assigned to E.canceri and Aquifex sp., I decided to explore the taxonomic breakdown of just the individual samples to see which of the samples was contributing to these taxonomic assignments most.

After completing the individual taxonomic assignments, I compared the two sets of assignments using MEGAN6 and generated this bar plot showing percentage of normalized base counts assigned to the following groups within each sample:

  • Aquifex sp.
  • Arthropoda
  • E.canceri
  • SAR (Supergroup within which Alveolata/Hematodinium sp. falls)
IMPORTANT!!!
  • The taxonomic makeup shown in these comparisons is only a comparison of bases assigned amongst the four taxa selected above. It is not a comparison of the full taxonomic makeup of the two samples. I will discuss the data shown here in that context.

20201002_cbai_nanopore_20102558-2729-Q7-vs-6129-403-26-Q7_megan-taxonomic-comparison-bar-plot

Comparison table:

Taxa 20102558-2729-Q7_base-counts 20102558-2729-Q7_base-counts(%) 6129-403-26-Q7_base-counts 6129-403-26-Q7_base-counts(%)
Aquifex sp. 221,823.00 10.25 199,287.06 10.43
Arthropoda 1,046,619.00 48.38 1,134,731.00 59.40
Enterospora canceri 889,082.00 41.10 561,754.19 29.41
Sar 5,855.00 0.27 14,582.56 0.76
TOTAL 2,163,379.00 1,910,354.81

Some observations:

  • Aquifex sp. account for nearly the same percentage of assignments in both samples.
  • Arthropoda makes up ~50% of assigned bases in the uninfected muscle sample (20102558-2729), but ~60% in the Hematodinium-infected hemolymph sample. (6129-403-26).
  • E.canceri makes up ~41% of assigned bases in the uninfected muscle sample (20102558-2729), but only ~30% in the Hematodinium-infected hemolymph sample.
  • SAR contributes a very small percentage to each of the two samples, but has ~2.8x the number of assigned bases. Additionally, as noted in the taxonomic assignment analysis of 20102558-2729-Q7 on 20200928, no bases are assigned to descendants of this Supergroup, whereas in the taxonomic analysis of 6129-403-26-Q7 on 20200928, there are bases assigned within the descendants of this Supergroup, down the level of Hematodinium sp. Genus.

Pretty interesting stuff!

I also briefly looked at the taxonomic assignments from all of our hemolymph RNAseq samples to see if if Aquifex sp. and/or E.canceri appear:

Interestingly, a high number of reads are assigned to E.canceri in all samples, but no reads are assigned to Aquifex sp.. Another observation is that a fair number of reads get assigned to Vibrio parahemolyticus, but very few number of NanoPore DNA bases get assigned to V.parahemolyticus.

Next up I think I might try to identify which contigs/scaffolds from the cbai_genome_v1.0 Flye assembly correspond to these taxa. The approach would be to create a BLAST database (DB) from the cbai_genome_v1.0.fasta (19MB). Then extract the NanoPore reads assigned to each of the taxa above, then BLAST them against the cbai_genome_v1.0 BLAST DB.

from Sam’s Notebook https://ift.tt/2GDrpa2
via IFTTT

Taxonomic Assignments – C.bairdi 20102558-2729-Q7 NanoPore Reads Using DIAMOND BLASTx on Mox and MEGAN6 daa2rma on emu

After noticing that the initial MEGAN6 taxonomic assignments for our combined C.bairdi NanoPore data from 20200917 revealed a high number of bases assigned to E.canceri and Aquifex sp., I decided to explore the taxonomic breakdown of just the individual samples to see which of the samples was contributing to these taxonomic assignments most.

Ran the muscle (no Hematodinium infection; (20102558-2729-Q7) NanoPore sequencing data through DIAMOND BLASTx (on Mox) and subsequent output conversion to the MEGAN6 RMA6 format (on swoose due to program Java X11 requirement which is not functional on Mox) to obtain taxonomic assignments.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=cbai_blastx_DIAMOND_nanopore_20102558-2729_Q7 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200928_cbai_diamond_blastx_nanopore_20102558-2729_Q7 # Script to run DIAMOND BLASTx on 20102558-2729 quality filtered (Q7) C.bairdi NanoPore reads # from 20200928 using the --long-reads option # for subsequent import into MEGAN6 to evaluate reads taxonomically. ################################################################################### # These variables need to be set by user # Input FastQ file fastq=/gscratch/srlab/sam/data/C_bairdi/DNAseq/20200928_cbai_nanopore_20102558-2729_quality-7.fastq # DIAMOND Output filename prefix prefix=20200928_cbai_nanopore_20102558-2729_Q7 # Set number of CPUs to use threads=28 # Program paths diamond=/gscratch/srlab/programs/diamond-2.0.4/diamond # DIAMOND NCBI nr database with taxonomy dumps dmnd_db=/gscratch/srlab/blastdbs/ncbi-nr-20190925/nr.dmnd ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Inititalize arrays programs_array=() # Programs array programs_array=("${diamond}") md5sum "${fastq}" > fastq_checksums.md5 # Run DIAMOND with blastx # Output format 6 produces a standard BLAST tab-delimited file # Run DIAMOND with blastx # Output format 100 produces a DAA binary file for use with MEGAN ${diamond} blastx \ --long-reads \ --db ${dmnd_db} \ --query "${fastq}" \ --out "${prefix}".blastx.daa \ --outfmt 100 \ --top 5 \ --block-size 8.0 \ --index-chunks 1 \ --threads ${threads} # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${programs_array[program]}: " echo "" ${programs_array[program]} help echo "" echo "" echo "