Yaamini’s Notebook: DML Analysis Part 32

Revised finalized general methylation trends

Based on feedback from Katie, Steven, and Alan, I went back to my code and revised C. virginica gonad methylation trends.

Revised timeline

But first…let’s revisit my grand proclamation: “I’m going to finish this paper by the end of the month.” (lol) I think I can finish my analyses by the end of this week and have some sort of draft discussion. Before the next E2O meeting, I’ll definitely have a draft paper ready.

Characterizing loci locations

In this Jupyter notebook, I characterized the locations of all 5x CpGs I had data for, as well as methylated, sparsely methylated, and unmethylated loci. Katie suggested I make a table with this information for statistical analyses, so I did.

Table 1. Locations of 5x CpGs enriched by MBD treatment, methylated loci, sparsely methylated loci, and unmethylated loci. “Other” refers to any loci that did not overlap with exons, introns, transposable elements (all), or putative promoters.

| Category | All 5x CpGs | Methylated | Sparsely methylated | Unmethylated |
| --- | --- | --- | --- | --- |
| Total | 4,304,257 | 3,181,904 | 481,788 | 640,565 |
| Unique genes | 54,619 | 44,505 | 47,243 | 47,584 |
| mRNA coding regions | 3,140,744 | 2,437,901 | 303,890 | 398,953 |
| Exons | 1,366,779 | 1,013,691 | 105,871 | 247,217 |
| Introns | 1,811,271 | 1,448,786 | 201,553 | 160,932 |
| Transposable elements (all) | 1,011,883 | 755,222 | 155,293 | 101,368 |
| Transposable elements (Cg) | 767,604 | 610,208 | 108,858 | 48,538 |
| Putative promoters | 203,376 | 134,534 | 27,443 | 41,399 |
| Other | 627,257 | 386,003 | 86,923 | 154,331 |
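Since the comparisons below are about proportional distributions rather than raw counts, here is a small Python sketch (illustration only, not my analysis code) that turns the Table 1 counts into within-column proportions for the main feature classes. Note that these classes overlap, so the proportions are relative to the sum of the listed features, not to the Total row.

  # Within-column proportions for the genome features in Table 1
  # (counts copied from the table above; Total, Unique genes, mRNA,
  # and TE (Cg) rows omitted because they nest within other classes).
  features = ["Exons", "Introns", "TE (all)", "Putative promoters", "Other"]

  table = {
      "All 5x CpGs":         [1366779, 1811271, 1011883, 203376, 627257],
      "Methylated":          [1013691, 1448786, 755222, 134534, 386003],
      "Sparsely methylated": [105871, 201553, 155293, 27443, 86923],
      "Unmethylated":        [247217, 160932, 101368, 41399, 154331],
  }

  for category, counts in table.items():
      total = sum(counts)
      print(category)
      for feature, count in zip(features, counts):
          print(f"  {feature}: {count / total:.3f}")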

Chi-squared tests

For this round of chi-squared tests, I updated my overlap proportions file. I used this code to conduct chi-squared tests for various groupings. Every single time, I found that my distributions were significantly different. However, I’m now doubting how effective chi-squared tests are as a statistical approach. Looking at the observed, expected, and residual values from chisq.test for my comparison of all CG motifs with those enriched by MBD, I noticed that the expected values are not what I want. In comparing these distributions, I want the background proportions (in this case, all CG motifs) to serve as the expected values, but that’s not the case:

Screen Shot 2019-04-29 at 7 14 20 PM

I posted this issue to get some clarification on my methods.
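For reference, here is a rough sketch (Python/scipy rather than R’s chisq.test, and not my actual code) of the goodness-of-fit framing I have in mind, where the background proportions are supplied explicitly as the expected distribution. The observed counts come from the All 5x CpGs column of Table 1; the background CG-motif counts are placeholders.

  from scipy.stats import chisquare

  # Observed: distribution of MBD-enriched (All 5x) CpGs across features,
  # taken from Table 1 (exons, introns, TE (all), putative promoters, other).
  observed = [1366779, 1811271, 1011883, 203376, 627257]

  # Background: all CG motifs in the genome across the same features.
  # These numbers are placeholders for illustration only.
  background = [7000000, 9000000, 5500000, 1200000, 3500000]

  # Scale the background to the observed total so it can serve as the
  # expected distribution in a goodness-of-fit test.
  expected = [b / sum(background) * sum(observed) for b in background]

  statistic, pvalue = chisquare(f_obs=observed, f_exp=expected)
  print(statistic, pvalue)

The equivalent in R would be passing the background proportions to chisq.test via its p argument (e.g., chisq.test(observed, p = background/sum(background))), which makes the background the expected distribution instead of letting the test derive expected values from a contingency table.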

Revised figures

In the meantime, I revised my figures!

1-totalenriched

Figure 1. Distribution of CpGs enriched by MBD versus all CpGs in the C. virginica genome.

2-enriched

Figure 2. Distribution of methylated CpGs versus those enriched by MBD.

3-allcat

Figure 3. Distribution of methylated, sparsely methylated, and unmethylated CpGs versus CpGs enriched by MBD.

4-dml

Figure 4. Distribution of DML versus methylated CpGs.

Once I sort out my statistical tests, I can add significance levels to my figures.

Going forward

  1. Sort out statistical tests
  2. Describe (somehow) genes with DML in them
  3. Figure out what’s going on with the gene background
  4. Figure out what’s going on with DMR
  5. Work through gene-level analysis
  6. Update paper repository
  7. Start writing the discussion


from the responsible grad student http://bit.ly/2UK6Wlj
via IFTTT

Kaitlyn’s notebook: 20190419 Pt. Whitney geoducks

Broodstock

Tank 6:

Tank 6

Tank 5:

20190419_090053.jpg

Tank 4:

20190419_090105-e1556035508146.jpg

Juveniles overwintered in heath stacks:

20190419_093604

Some signs of buildup/infestation:

  • The first geoduck is alive.
  • The second image contains two shells and a dead geoduck.

20190419_093750

20190419_093952.jpg

KCl priming experiment:

Additionally, Sam worked on an experiment to determine the optimum time for KCl egg priming. These are some important points about the experiment (more info is on Slack):

  • 6 x 10 experiment
  • Screened fertilized eggs (~1500) and took a sample 1 hour post-fertilization from the screen (concentrated sample).
  • Added eggs into glass bottles (0.3 µm filtered seawater)
  • Bottles stored in wine cooler

Sam screening bottles:

20190419_110109.jpg

Bottles were discarded because we did not see many D-hinge larvae:

  • Sam looked and believed he saw some evidence of fertilization (formation of polar bodies) in samples taken 1 hour post-fertilization and will look through those samples instead.

 

Kaitlyn’s notebook: Geoduck heath tray setup

Elevated header tank:

IMG_20190419_085330

Dripper and outflow pipe to heath trays:

IMG_20190419_090149.jpg

Heath stack 1:

  • Red are elevated CO2 lines, blue is ambient.
  • Nozzles on the line control flow into the tray.
  • Boards are covering the top trays.

20190419_090944

Elbow pipes on heath trays:

IMG_20190419_091714

Accessing the water lines:

  • It is easy to accidentally pull on the water line system when pulling out individual trays. There is access to the back of the trays so the water line can be removed before a tray is pulled out.
  • However, this does make the trays stick out into the walkway, so we zip-tied poles onto the edge of the stack in front of the elbow pipes to prevent people from bumping into anything.
  • Also, the elevated (CO2) treatment generates some bubbles in the line.

Line from algal header tank:

  • Sam added a thinner line (connected below the green zip tie) to improve flow.
  • The line runs to an algal header tank in the breezeway that is filled in the mornings and feeds everything in this room. The tank runs out in the evening.

20190419_091615

Juvenile geoduck in H1T1:

20190419_141734.jpg

For comparison, here are 5-week-old industry geoduck:

20190419_100128.jpg

Interestingly, juveniles were crawling up the side in one industry tank:

20190419_100231.jpg

 

 

Sam’s Notebook: Data Analysis – C.virginica MBD Sequencing Coverage

A while ago, in this GitHub issue, Steven tasked me with assessing some questions about the sequencing coverage we get from MBD-BSseq. At the heart of it all was getting an idea of how much usable data we actually end up with when we do an MBD-BSseq project. Yaamini happened to have done an MBD-BSseq project relatively recently, and it’s one she’s actively analyzing, so we went with that data set.

Data was initially trimmed:

Subsequently, the data was concatenated, subset, and aligned using Bismark:

Today, I finally found the time to evaluate the alignment coverage of each of the Bismark sequence subsets. It was done in a Jupyter Notebook and solely with Bash/Python! I used this project as an excuse to dive into Python a bit more, instead of using R. For what I needed to accomplish, this approach just felt simpler (instead of creating an R project and all of that).
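The notebook has the full details, but for the record, here is a minimal standalone sketch of the kind of per-subset summary involved (not the actual notebook code), assuming the standard six-column Bismark coverage output (chrom, start, end, % methylation, count methylated, count unmethylated); the file name is hypothetical.

  import gzip

  # Summarize per-CpG read depth from a Bismark coverage file.
  # Assumes the standard .bismark.cov.gz columns:
  # chrom, start, end, percent_methylation, count_methylated, count_unmethylated
  coverage_file = "sample_subset.bismark.cov.gz"  # hypothetical file name

  total_cpgs = 0
  cpgs_5x = 0
  total_depth = 0

  with gzip.open(coverage_file, "rt") as handle:
      for line in handle:
          fields = line.strip().split("\t")
          depth = int(fields[4]) + int(fields[5])  # methylated + unmethylated reads
          total_cpgs += 1
          total_depth += depth
          if depth >= 5:
              cpgs_5x += 1

  print(f"CpGs with data: {total_cpgs}")
  print(f"CpGs with >=5x coverage: {cpgs_5x}")
  print(f"Mean coverage: {total_depth / total_cpgs:.1f}x")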

Sam’s Notebook: Metagenomics Annotation – P.generosa Water Samples Using BLASTn on Mox and KronaTools Visualization to Compare pH Treatments

I’m nearing the end of this quick metagenomics comparison of taxonomic differences between the two pH treatments (pH=7.1 and pH=8.2). Previously ran:

After this completes, I’ll run KronaTools to get a rundown of the taxonomic makeup of these two pH treatments. I don’t expect BLASTn to take terribly long (based on previous metagenomics runs with this data set); I’d guess around 6 hrs.

SBATCH script (GitHub):

  #!/bin/bash
  ## Job Name
  #SBATCH --job-name=blastn_metagenomics
  ## Allocation Definition
  #SBATCH --account=coenv
  #SBATCH --partition=coenv
  ## Resources
  ## Nodes
  #SBATCH --nodes=1
  ## Walltime (days-hours:minutes:seconds format)
  #SBATCH --time=25-00:00:00
  ## Memory per node
  #SBATCH --mem=120G
  ## Turn on e-mail notification
  #SBATCH --mail-type=ALL
  #SBATCH --mail-user=samwhite@uw.edu
  ## Specify the working directory for this job
  #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190416_metagenomics_pgen_blastn

  # Load Python Mox module for Python module availability
  module load intel-python3_2017

  # Document programs in PATH (primarily for program version ID)
  date >> system_path.log
  echo "" >> system_path.log
  echo "System PATH for $SLURM_JOB_ID" >> system_path.log
  echo "" >> system_path.log
  printf "%0.s-" {1..10} >> system_path.log
  echo ${PATH} | tr : \\n >> system_path.log

  wd="$(pwd)"
  threads=28

  # Paths to programs
  blast_dir="/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin"
  blastn="${blast_dir}/blastn"

  # Paths to blastdbs
  blastdb_dir="/gscratch/srlab/blastdbs/ncbi-nr-nt-v5"
  blast_db="${blastdb_dir}/nt"

  # Directory with MetaGeneMark FastAs
  fasta_dir="/gscratch/scrubbed/samwhite/outputs/20190416_metagenomics_pgen_metagenemark"

  # Export BLAST database directory
  export BLASTDB=${blastdb_dir}

  # Loop through MetaGeneMark nucleotide FastAs:
  # create a list of those FastAs for reference,
  # parse out sample names,
  # and run BLASTn on each FastA.
  for fasta in ${fasta_dir}/*nucleotides.fasta
  do
    echo ${fasta} >> input.fasta.list.txt
    no_ext=${fasta%%.*}
    sample_name=$(echo ${no_ext##*/})

    # Run BLASTn on each MetaGeneMark nucleotide FastA
    ${blastn} \
    -query ${fasta} \
    -db ${blast_db} \
    -max_target_seqs 1 \
    -outfmt "6 std staxids" \
    -evalue 1e-10 \
    -num_threads ${threads} \
    > ${wd}/${sample_name}.blastn.outfmt6
  done
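Once the BLASTn jobs complete, the -outfmt "6 std staxids" output will have the subject taxonomy ID(s) appended as a 13th column, which is what the KronaTools taxonomy import will use. As a quick sanity check ahead of the Krona step (not part of the pipeline itself; the file name is hypothetical), hits per taxid can be tallied like this:

  from collections import Counter

  # Tally NCBI taxonomy IDs from a BLASTn output generated with
  # -outfmt "6 std staxids" (12 standard columns + staxids in column 13).
  blast_output = "pH71.blastn.outfmt6"  # hypothetical file name

  taxid_counts = Counter()
  with open(blast_output) as handle:
      for line in handle:
          fields = line.rstrip("\n").split("\t")
          staxids = fields[12]          # may contain multiple IDs separated by ';'
          for taxid in staxids.split(";"):
              taxid_counts[taxid] += 1

  # Top 10 taxa by number of best hits
  for taxid, count in taxid_counts.most_common(10):
      print(taxid, count)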

Sam’s Notebook: Metagenomics Gene Prediction – P.generosa Water Samples Using MetaGeneMark on Mox to Compare pH Treatments

Continuing with a relatively quick comparison of pH treatments (pH=7.1 vs. pH=8.2), I wanted to run gene prediction on the MEGAHIT assemblies I made yesterday. I ran MetaGeneMark on the two pH-specific assemblies on Mox. This should be a very fast process (I’m talking, like a couple of minutes fast), so it enhances the annotation with very little effort and time.

This will output a nucleotide FastA, a protein FastA, and a GFF for each of the two assemblies (i.e. pH treatments).

SBATCH script (GitHub):

  #!/bin/bash
  ## Job Name
  #SBATCH --job-name=mgm
  ## Allocation Definition
  #SBATCH --account=coenv
  #SBATCH --partition=coenv
  ## Resources
  ## Nodes
  #SBATCH --nodes=1
  ## Walltime (days-hours:minutes:seconds format)
  #SBATCH --time=20-00:00:00
  ## Memory per node
  #SBATCH --mem=120G
  ## Turn on e-mail notification
  #SBATCH --mail-type=ALL
  #SBATCH --mail-user=samwhite@uw.edu
  ## Specify the working directory for this job
  #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190416_metagenomics_pgen_metagenemark

  # Load Python Mox module for Python module availability
  module load intel-python3_2017

  # Load Open MPI module for parallel, multi-node processing
  module load icc_19-ompi_3.1.2

  # SegFault fix?
  export THREADS_DAEMON_MODEL=1

  # Document programs in PATH (primarily for program version ID)
  date >> system_path.log
  echo "" >> system_path.log
  echo "System PATH for $SLURM_JOB_ID" >> system_path.log
  echo "" >> system_path.log
  printf "%0.s-" {1..10} >> system_path.log
  echo ${PATH} | tr : \\n >> system_path.log

  # Programs
  gmhmmp="/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/gmhmmp"
  mgm_mod="/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/MetaGeneMark_v1.mod"
  samtools="/gscratch/srlab/programs/samtools-1.9/samtools"

  # Variables
  assemblies_dir=/gscratch/scrubbed/samwhite/outputs/20190415_metagenomics_pgen_megahit

  ## Initialize array
  assemblies_array=()

  # Populate array with MegaHit FastAs
  assemblies_array=$(find ${assemblies_dir} -maxdepth 3 -name "*.contigs.fa")

  # List of files in array
  printf "%s\n" "${assemblies_array[@]}" >> fastas.list.txt

  # Loop through array and run MetaGeneMark.
  # Parse out sample name by removing .contigs.fa from filename
  # and remove path.
  for sample in ${assemblies_array[@]}
  do
    no_ext=${sample%%.*}
    sample_name=$(echo ${no_ext##*/})

    # Run MetaGeneMark
    ## Specifying the following:
    ### -a : output predicted proteins
    ### -A : write predicted proteins to designated file
    ### -d : output predicted nucleotides
    ### -D : write predicted nucleotides to designated file
    ### -f 3 : Output format in GFF3
    ### -m : Model file (supplied with software)
    ### -o : write GFF3 to designated file
    ${gmhmmp} \
    -a \
    -A ${sample_name}.mgm-proteins.fasta \
    -d \
    -D ${sample_name}.mgm-nucleotides.fasta \
    -f 3 \
    -m ${mgm_mod} \
    ${sample} \
    -o ${sample_name}.mgm.gff
  done

  # Index FastAs
  for fasta in *.fasta
  do
    ${samtools} faidx ${fasta}
  done
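As a quick post-run check (not part of the script above), the number of predicted features per assembly can be pulled from each GFF3 with a few lines of Python; the file names here are hypothetical, and the count assumes one feature row per prediction.

  # Count MetaGeneMark feature rows per treatment from the GFF3 output.
  # File names are hypothetical; MetaGeneMark writes one GFF3 per assembly.
  gff_files = ["pH71.mgm.gff", "pH82.mgm.gff"]

  for gff in gff_files:
      with open(gff) as handle:
          # Count non-comment feature rows; adjust the filter if MetaGeneMark
          # emits more than one feature type per predicted gene.
          n_features = sum(1 for line in handle
                           if line.strip() and not line.startswith("#"))
      print(f"{gff}: {n_features} feature rows")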

Sam’s Notebook: Metagenome Assemblies – P.generosa Water Samples Trimmed HiSeqX Data Using Megahit on Mox to Compare pH Treatments

A report involving our work on the geoduck water metagenomics is due later this week, and our in-depth analysis for this project using Anvi’o is likely to require at least another week to complete. Even though we have a broad overview of the metagenomic taxa present in these water samples, we don’t have the data in a format for comparing across samples/treatments. So, I initiated our simplified pipeline (MEGAHIT > MetaGeneMark > BLASTn > KronaTools) for examining the metagenomic data from the two treatments:

  • pH=7.1
  • pH=8.2

I ran MEGAHIT on the trimmed HiSeqX data, but concatenated the corresponding pH treatment FastQ files to create a single assembly for each pH treatment.

SBATCH script (GitHub):

  #!/bin/bash
  ## Job Name
  #SBATCH --job-name=megahit
  ## Allocation Definition
  #SBATCH --account=coenv
  #SBATCH --partition=coenv
  ## Resources
  ## Nodes
  #SBATCH --nodes=1
  ## Walltime (days-hours:minutes:seconds format)
  #SBATCH --time=25-00:00:00
  ## Memory per node
  #SBATCH --mem=120G
  ## Turn on e-mail notification
  #SBATCH --mail-type=ALL
  #SBATCH --mail-user=samwhite@uw.edu
  ## Specify the working directory for this job
  #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190415_metagenomics_pgen_megahit

  # Load Python Mox module for Python module availability
  module load intel-python3_2017

  # Load Open MPI module for parallel, multi-node processing
  module load icc_19-ompi_3.1.2

  # SegFault fix?
  export THREADS_DAEMON_MODEL=1

  # Document programs in PATH (primarily for program version ID)
  date >> system_path.log
  echo "" >> system_path.log
  echo "System PATH for $SLURM_JOB_ID" >> system_path.log
  echo "" >> system_path.log
  printf "%0.s-" {1..10} >> system_path.log
  echo ${PATH} | tr : \\n >> system_path.log

  # Variables
  wd=$(pwd)
  fastq_dir=/gscratch/srlab/sam/data/metagenomics/P_generosa
  megahit=/gscratch/srlab/programs/megahit_v1.1.4_LINUX_CPUONLY_x86_64-bin/megahit
  bbmap_dir=/gscratch/srlab/programs/bbmap_38.34
  samtools=/gscratch/srlab/programs/samtools-1.9/samtools
  cpus=28

  ## Initialize arrays
  fastq_array_R1=()
  fastq_array_R2=()
  names_array=(pH71 pH82)

  # Create list of input FastQs used for concatenation
  # Concatenate all pH7.1 R1 FastQs
  # Uses parameter substitution to strip path
  for fastq in ${fastq_dir}/Library_Geoduck_MG_[367]*_R1_*.gz
  do
    echo ${fastq#${fastq_dir}} >> fastq.list.pH71.txt
    cat ${fastq} >> pH71.all.R1.fq.gz
  done

  # Create list of input FastQs used for concatenation
  # Concatenate all pH7.1 R2 FastQs
  # Uses parameter substitution to strip path
  for fastq in ${fastq_dir}/Library_Geoduck_MG_[367]*_R2_*.gz
  do
    echo ${fastq#${fastq_dir}} >> fastq.list.pH71.txt
    cat ${fastq} >> pH71.all.R2.fq.gz
  done

  # Create list of input FastQs used for concatenation
  # Concatenate all pH8.2 R1 FastQs
  # Uses parameter substitution to strip path
  for fastq in ${fastq_dir}/Library_Geoduck_MG_[125]*_R1_*.gz
  do
    echo ${fastq#${fastq_dir}} >> fastq.list.pH82.txt
    cat ${fastq} >> pH82.all.R1.fq.gz
  done

  # Create list of input FastQs used for concatenation
  # Concatenate all pH8.2 R2 FastQs
  # Uses parameter substitution to strip path
  for fastq in ${fastq_dir}/Library_Geoduck_MG_[125]*_R2_*.gz
  do
    echo ${fastq#${fastq_dir}} >> fastq.list.pH82.txt
    cat ${fastq} >> pH82.all.R2.fq.gz
  done

  # Populate R1 array with concatenated R1 FastQs
  for fastq in *R1*.fq.gz
  do
    fastq_array_R1+=(${fastq})
  done

  # Populate R2 array with concatenated R2 FastQs
  for fastq in *R2*.gz
  do
    fastq_array_R2+=(${fastq})
  done

  # Loop through samples
  for sample in ${!names_array[@]}
  do
    sample_name=$(echo ${names_array[sample]})
    mkdir ${sample_name} && cd ${sample_name}

    # Run Megahit using paired-end reads
    ${megahit} \
    -1 ${wd}/${fastq_array_R1[sample]} \
    -2 ${wd}/${fastq_array_R2[sample]} \
    --num-cpu-threads ${cpus} \
    --out-prefix ${sample_name}

    # Create FastA index file
    ${samtools} faidx megahit_out/${sample_name}.contigs.fa

    # Determine coverage
    ## Align reads with BBmap BBwrap
    ${bbmap_dir}/bbwrap.sh \
    ref=megahit_out/${sample_name}.contigs.fa \
    in1=${fastq_array_R1[sample]} \
    in2=${fastq_array_R2[sample]} \
    out=${sample_name}.aln.sam.gz

    ## Output contig coverage
    ${bbmap_dir}/pileup.sh \
    in=${sample_name}.aln.sam.gz \
    out=${sample_name}.coverage.txt

    # Return to working directory
    cd ${wd}
  done

Sam’s Notebook: FastQC – WGBS Sequencing Data from Genewiz Received 20190408

We received whole genome bisulfite sequencing (WGBS) data from Genewiz last week on 20190408, so I ran FastQC on the files on my computer (swoose). FastQC results will be added to the Nightingales Google Sheet.

Each set of FastQs was processed with a bash script. The script (ends with .sh) can be found in each corresponding output folder (see below).
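For illustration only (the real runs used the per-set bash scripts mentioned above), here is a rough Python equivalent that loops FastQC over a directory of gzipped FastQs; the paths are placeholders.

  import glob
  import os
  import subprocess

  # Run FastQC on every gzipped FastQ in a directory.
  # Paths are placeholders; the real runs used per-set bash scripts.
  fastq_dir = "/path/to/wgbs_fastqs"      # placeholder
  output_dir = "/path/to/fastqc_output"   # placeholder

  os.makedirs(output_dir, exist_ok=True)
  fastqs = sorted(glob.glob(f"{fastq_dir}/*.fastq.gz"))

  subprocess.run(
      ["fastqc", "--outdir", output_dir, "--threads", "4", *fastqs],
      check=True,
  )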

Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Larvae Day5 EPI99 with HiSeq and NovaSeq Data on Mox

I previously assembled and annotated a P.generosa larval Day 5 transcriptome (20190318 – mislabeled as Juvenile Day 5 in my previous notebook entries) using just the HiSeq data from our Illumina collaboration. This was an oversight, as I didn’t realize that we also had NovaSeq RNAseq data. So, I’ve initiated another de novo assembly using Trinity, incorporating both sets of data.

I ran a de novo assembly on our HiSeq and NovaSeq data from Hollie’s larval Day 5 EPI 99 sample. This was done for Christian to use in some long non-coding RNA (lncRNA) analysis.

NovaSeq data had been previously trimmed.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

  #!/bin/bash
  ## Job Name
  #SBATCH --job-name=trin_epi99
  ## Allocation Definition
  #SBATCH --account=coenv
  #SBATCH --partition=coenv
  ## Resources
  ## Nodes
  #SBATCH --nodes=1
  ## Walltime (days-hours:minutes:seconds format)
  #SBATCH --time=30-00:00:00
  ## Memory per node
  #SBATCH --mem=120G
  ## Turn on e-mail notification
  #SBATCH --mail-type=ALL
  #SBATCH --mail-user=samwhite@uw.edu
  ## Specify the working directory for this job
  #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_EPI99_RNAseq

  # Exit script if a command fails
  set -e

  # Load Python Mox module for Python module availability
  module load intel-python3_2017

  # Document programs in PATH (primarily for program version ID)
  date >> system_path.log
  echo "" >> system_path.log
  echo "System PATH for $SLURM_JOB_ID" >> system_path.log
  echo "" >> system_path.log
  printf "%0.s-" {1..10} >> system_path.log
  echo ${PATH} | tr : \\n >> system_path.log

  # User-defined variables
  reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/epi_115
  threads=28
  assembly_stats=assembly_stats.txt

  # Paths to programs
  trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3"
  samtools="/gscratch/srlab/programs/samtools-1.9/samtools"

  ## Initialize arrays
  R1_array=()
  R2_array=()

  # Variables for R1/R2 lists
  R1_list=""
  R2_list=""

  # Create array of fastq R1 files
  R1_array=(${reads_dir}/*_R1_*.gz)

  # Create array of fastq R2 files
  R2_array=(${reads_dir}/*_R2_*.gz)

  # Create list of fastq files used in analysis
  ## Uses parameter substitution to strip leading path from filename
  for fastq in ${reads_dir}/*.gz
  do
    echo ${fastq##*/} >> fastq.list.txt
  done

  # Create comma-separated lists of FastQ reads
  R1_list=$(echo ${R1_array[@]} | tr " " ",")
  R2_list=$(echo ${R2_array[@]} | tr " " ",")

  # Run Trinity
  ${trinity_dir}/Trinity \
  --trimmomatic \
  --seqType fq \
  --max_memory 120G \
  --CPU ${threads} \
  --left \
  ${R1_list} \
  --right \
  ${R2_list}

  # Assembly stats
  ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \
  > ${assembly_stats}

  # Create gene map files
  ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
  trinity_out_dir/Trinity.fasta \
  > trinity_out_dir/Trinity.fasta.gene_trans_map

  # Create FastA index
  ${samtools} faidx \
  trinity_out_dir/Trinity.fasta

Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Gonad HiSeq and NovaSeq Data on Mox

I previously assembled and annotated a P.generosa gonad transcriptome (20190318) using just the HiSeq data from our Illumina collaboration. This was an oversight, as I didn’t realize that we also had NovaSeq RNAseq data. So, I’ve initiated another de novo assembly using Trinity, incorporating both sets of data.

NovaSeq data had been previously trimmed.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

  #!/bin/bash
  ## Job Name
  #SBATCH --job-name=trin_gonad
  ## Allocation Definition
  #SBATCH --account=coenv
  #SBATCH --partition=coenv
  ## Resources
  ## Nodes
  #SBATCH --nodes=1
  ## Walltime (days-hours:minutes:seconds format)
  #SBATCH --time=30-00:00:00
  ## Memory per node
  #SBATCH --mem=120G
  ## Turn on e-mail notification
  #SBATCH --mail-type=ALL
  #SBATCH --mail-user=samwhite@uw.edu
  ## Specify the working directory for this job
  #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_gonad_RNAseq

  # Exit script if a command fails
  set -e

  # Load Python Mox module for Python module availability
  module load intel-python3_2017

  # Document programs in PATH (primarily for program version ID)
  date >> system_path.log
  echo "" >> system_path.log
  echo "System PATH for $SLURM_JOB_ID" >> system_path.log
  echo "" >> system_path.log
  printf "%0.s-" {1..10} >> system_path.log
  echo ${PATH} | tr : \\n >> system_path.log

  # User-defined variables
  reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/gonad
  threads=28
  assembly_stats=assembly_stats.txt

  # Paths to programs
  trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3"
  samtools="/gscratch/srlab/programs/samtools-1.9/samtools"

  ## Initialize arrays
  R1_array=()
  R2_array=()

  # Variables for R1/R2 lists
  R1_list=""
  R2_list=""

  # Create array of fastq R1 files
  R1_array=(${reads_dir}/*_R1_*.gz)

  # Create array of fastq R2 files
  R2_array=(${reads_dir}/*_R2_*.gz)

  # Create list of fastq files used in analysis
  ## Uses parameter substitution to strip leading path from filename
  for fastq in ${reads_dir}/*.gz
  do
    echo ${fastq##*/} >> fastq.list.txt
  done

  # Create comma-separated lists of FastQ reads
  R1_list=$(echo ${R1_array[@]} | tr " " ",")
  R2_list=$(echo ${R2_array[@]} | tr " " ",")

  # Run Trinity
  ${trinity_dir}/Trinity \
  --trimmomatic \
  --seqType fq \
  --max_memory 120G \
  --CPU ${threads} \
  --left \
  ${R1_list} \
  --right \
  ${R2_list}

  # Assembly stats
  ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \
  > ${assembly_stats}

  # Create gene map files
  ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
  trinity_out_dir/Trinity.fasta \
  > trinity_out_dir/Trinity.fasta.gene_trans_map

  # Create FastA index
  ${samtools} faidx \
  trinity_out_dir/Trinity.fasta