Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Juvenile Ambient OA EPI124 with HiSeq and NovaSeq Data on Mox

Ran a de novo assembly on our HiSeq and NovaSeq data from Hollie’s juvenile EPI 124 sample “ambient OA”. This was done for Christian to use in some long, non-coding RNA (lncRNA) analysis.

NovaSeq data had been previously trimmed.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

  #!/bin/bash ## Job Name #SBATCH --job-name=trin_epi124 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=30-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_EPI124_RNAseq # Exit script if a command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo ${PATH} | tr : \\n >> system_path.log # User-defined variables reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/epi_124 threads=28 assembly_stats=assembly_stats.txt # Paths to programs trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" ## Inititalize arrays R1_array=() R2_array=() # Variables for R1/R2 lists R1_list="" R2_list="" # Create array of fastq R1 files R1_array=(${reads_dir}/*_R1_*.gz) # Create array of fastq R2 files R2_array=(${reads_dir}/*_R2_*.gz) # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in ${reads_dir}/*.gz do echo ${fastq##*/} >> fastq.list.txt done # Create comma-separated lists of FastQ reads R1_list=$(echo ${R1_array[@]} | tr " " ",") R2_list=$(echo ${R2_array[@]} | tr " " ",") # Run Trinity ${trinity_dir}/Trinity \ --trimmomatic \ --seqType fq \ --max_memory 120G \ --CPU ${threads} \ --left \ ${R1_list} \ --right \ ${R2_list} # Assembly stats ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \ > ${assembly_stats} # Create gene map files ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \ trinity_out_dir/Trinity.fasta \ > trinity_out_dir/Trinity.fasta.gene_trans_map # Create FastA index ${samtools} faidx \ trinity_out_dir/Trinity.fasta 

Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Juvenile Ambient OA EPI123 with HiSeq Data on Mox

Ran a de novo assembly on our HiSeq data from Hollie’s juvenile EPI 123 sample “ambient OA”. This was done for Christian to use in some long, non-coding RNA (lncRNA) analysis.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

  #!/bin/bash ## Job Name #SBATCH --job-name=trin_epi123 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=30-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_EPI123_RNAseq # Exit script if a command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo ${PATH} | tr : \\n >> system_path.log # User-defined variables reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/epi_123 threads=28 assembly_stats=assembly_stats.txt # Paths to programs trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" ## Inititalize arrays R1_array=() R2_array=() # Variables for R1/R2 lists R1_list="" R2_list="" # Create array of fastq R1 files R1_array=(${reads_dir}/*_R1_*.gz) # Create array of fastq R2 files R2_array=(${reads_dir}/*_R2_*.gz) # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in ${reads_dir}/*.gz do echo ${fastq##*/} >> fastq.list.txt done # Create comma-separated lists of FastQ reads R1_list=$(echo ${R1_array[@]} | tr " " ",") R2_list=$(echo ${R2_array[@]} | tr " " ",") # Run Trinity ${trinity_dir}/Trinity \ --trimmomatic \ --seqType fq \ --max_memory 120G \ --CPU ${threads} \ --left \ ${R1_list} \ --right \ ${R2_list} # Assembly stats ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \ > ${assembly_stats} # Create gene map files ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \ trinity_out_dir/Trinity.fasta \ > trinity_out_dir/Trinity.fasta.gene_trans_map # Create FastA index ${samtools} faidx \ trinity_out_dir/Trinity.fasta 

Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Juvenile Super Low OA EPI116 with HiSeq and NovaSeq Data on Mox

Ran a de novo assembly on our HiSeq and NovaSeq data from Hollie’s juvenile EPI 116 sample “super low OA”. This was done for Christian to use in some long, non-coding RNA (lncRNA) analysis.

NovaSeq data had been previously trimmed.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

  #!/bin/bash ## Job Name #SBATCH --job-name=trin_epi116 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=30-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_EPI116_RNAseq # Exit script if a command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo ${PATH} | tr : \\n >> system_path.log # User-defined variables reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/epi_116 threads=28 assembly_stats=assembly_stats.txt # Paths to programs trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" ## Inititalize arrays R1_array=() R2_array=() # Variables for R1/R2 lists R1_list="" R2_list="" # Create array of fastq R1 files R1_array=(${reads_dir}/*_R1_*.gz) # Create array of fastq R2 files R2_array=(${reads_dir}/*_R2_*.gz) # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in ${reads_dir}/*.gz do echo ${fastq##*/} >> fastq.list.txt done # Create comma-separated lists of FastQ reads R1_list=$(echo ${R1_array[@]} | tr " " ",") R2_list=$(echo ${R2_array[@]} | tr " " ",") # Run Trinity ${trinity_dir}/Trinity \ --trimmomatic \ --seqType fq \ --max_memory 120G \ --CPU ${threads} \ --left \ ${R1_list} \ --right \ ${R2_list} # Assembly stats ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \ > ${assembly_stats} # Create gene map files ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \ trinity_out_dir/Trinity.fasta \ > trinity_out_dir/Trinity.fasta.gene_trans_map # Create FastA index ${samtools} faidx \ trinity_out_dir/Trinity.fasta 

Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Juvenile Super Low OA EPI115 with HiSeq Data on Mox

Ran a de novo assembly on our HiSeq data from Hollie’s juvenile EPI 115 sample “super low OA”. This was done for Christian to use in some long, non-coding RNA (lncRNA) analysis.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

  #!/bin/bash ## Job Name #SBATCH --job-name=trin_epi115 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=30-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_EPI115_RNAseq # Exit script if a command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo ${PATH} | tr : \\n >> system_path.log # User-defined variables reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/epi_115 threads=28 assembly_stats=assembly_stats.txt # Paths to programs trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" ## Inititalize arrays R1_array=() R2_array=() # Variables for R1/R2 lists R1_list="" R2_list="" # Create array of fastq R1 files R1_array=(${reads_dir}/*_R1_*.gz) # Create array of fastq R2 files R2_array=(${reads_dir}/*_R2_*.gz) # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in ${reads_dir}/*.gz do echo ${fastq##*/} >> fastq.list.txt done # Create comma-separated lists of FastQ reads R1_list=$(echo ${R1_array[@]} | tr " " ",") R2_list=$(echo ${R2_array[@]} | tr " " ",") # Run Trinity ${trinity_dir}/Trinity \ --trimmomatic \ --seqType fq \ --max_memory 120G \ --CPU ${threads} \ --left \ ${R1_list} \ --right \ ${R2_list} # Assembly stats ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \ > ${assembly_stats} # Create gene map files ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \ trinity_out_dir/Trinity.fasta \ > trinity_out_dir/Trinity.fasta.gene_trans_map # Create FastA index ${samtools} faidx \ trinity_out_dir/Trinity.fasta 

Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Ctenidia with HiSeq and NovaSeq Data on Mox

I previously assembled and annotated P.generosa ctenidia transcriptome (20190318) using just our HiSeq data from our Illumina collaboration. This was a an oversight, as I didn’t realize that we also had NovaSeq RNAseq data. So, I’ve initiated another de novo assembly using Trinity incorporating both sets of data.

NovaSeq data had been previously trimmed.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

  #!/bin/bash ## Job Name #SBATCH --job-name=trin_ctenidia ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=30-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_ctenidia_RNAseq # Exit script if a command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo ${PATH} | tr : \\n >> system_path.log # User-defined variables reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/ctenidia threads=28 assembly_stats=assembly_stats.txt # Paths to programs trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" ## Inititalize arrays R1_array=() R2_array=() # Variables for R1/R2 lists R1_list="" R2_list="" # Create array of fastq R1 files R1_array=(${reads_dir}/*_R1_*.gz) # Create array of fastq R2 files R2_array=(${reads_dir}/*_R2_*.gz) # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in ${reads_dir}/*.gz do echo ${fastq##*/} >> fastq.list.txt done # Create comma-separated lists of FastQ reads R1_list=$(echo ${R1_array[@]} | tr " " ",") R2_list=$(echo ${R2_array[@]} | tr " " ",") # Run Trinity ${trinity_dir}/Trinity \ --trimmomatic \ --seqType fq \ --max_memory 120G \ --CPU ${threads} \ --left \ ${R1_list} \ --right \ ${R2_list} # Assembly stats ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \ > ${assembly_stats} # Create gene map files ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \ trinity_out_dir/Trinity.fasta \ > trinity_out_dir/Trinity.fasta.gene_trans_map # Create FastA index ${samtools} faidx \ trinity_out_dir/Trinity.fasta 

Sam’s Notebook: RNA Isolation and Quantification – Crab Hemolypmh Using Quick-DNA-RNA Microprep Plus Kit

In the continuing struggle to isolate RNA from the Chionoecetes bairdi hemolymph preserved in RNAlater, Pam managed to find the Quick-DNA-RNA Microprep Plus Kit (ZymoResearch) as a potential option. We received a free sample of the kit and so I gave it a shot. Grace pulled 10 samples she’d previously used to isolate RNA (and was unable to get anything out of virtually all of them using the RNeasy Plus Micro Kit (Qiagen)) for me to try out this new kit:

  • 31
  • 142
  • 285
  • 291
  • 317
  • 326
  • 419
  • 455
  • 503
  • 506

I’ve included an image of the sample tubes, due to the fact that they are colored, and I can’t remember the significance (insignificance?) of this. Want to make sure that is documented in case it’s important.

10 samples used for RNA isolation in ice bucket

Yaamini’s Notebook: DML Analysis Part 31

Finalized general methylation trends

After taking many steps backwards and only some steps forward in this issue, Sam, Steven, and I chatted about the best way to characterize methylation trends in C. virginica. We decided it would be best to use this file Sam made, which is a concatenation of all CpGs in my data, each with one coverage and percent methylation metric.

New method

In this Jupyter notebook, I downloaded Sam’s file, counted lines, and filtered out all loci with a minimum of 5x coverage.

 !awk '{if ($5+$6 >= 5) { print $1, $2-1, $3, $4, $5+$6}}' cvir_bsseq_all_pe_R1_bismark_bt2_pe.bismark.cov \ > 2019-04-09-All-5x-CpGs.bedgraph  

I then identified methylated, partially methylated, and unmethylated loci.

 awk '{if ($4 >= 50) { print $1, $2, $3, $4 }}' 2019-04-09-All-5x-CpGs.bedgraph \ > 2019-04-09-All-5x-CpG-Loci-Methylated.bedgraph awk '{if ($4 < 50) { print $1, $2, $3, $4}}' 2019-04-09-All-5x-CpGs.bedgraph \ | awk '{if ($4 > 10) { print $1, $2, $3, $4 }}' \ > 2019-04-09-All-5x-CpG-Loci-Sparsely-Methylated.bedgraph awk '{if ($4 <= 10) { print $1, $2, $3, $4 }}' 2019-04-09-All-5x-CpGs.bedgraph \ > 2019-04-09-All-5x-CpG-Loci-Unmethylated.bedgraph  

Here’s the breakdown:

  • 14,458,703 CG motifs in the C. virginica genome
  • 14,026,131 CpGs with data in the concatenation file (97.0%)
  • 4,304,257 loci with 5x coverage (30.7%)
  • 3,181,904 methylated loci (73.9%)
  • 481,788 sparsely methylated loci (11.2%)
  • 640,565 unmethylated loci (14.9%)

Seeing a roughly 75% methylation rate seemed weird, given how methylation rates in C. gigas were low. Sam suggested I look at my data in IGV to confirm “it’s legit.”

Screen Shot 2019-04-10 at 10 37 14 AM

Screen Shot 2019-04-10 at 10 33 47 AM

Screen Shot 2019-04-10 at 10 34 09 AM

Figures 1-3. All CpGs, methylated, sparsely methylated, and unmethylated loci in all C. virginica gonad samples.

Looking at the screenshots, it does seem like most of my data is methylated. It’s possible that gonad tissue has a higher methylation rate than ctenidia (what Mac used to describe C. gigas methylation trends), or that there really is a species difference. It may also be beneficial to set a 75% cutoff to define a locus as methylated. These are things I need to dig into in my discussion. In the meantime, remade my frequency distribution with this code.

Screen Shot 2019-04-10 at 11 52 07 AM

Figure 4. Frequency distribution of methylated CpGs in C. virginica gonad data.

Using the new list of methylated loci, I characterized the location of the loci in the C. virginica genome. Methylated CpGs were found primarily in genic regions, with 2,437,901 (76.6%) loci in 44,505 unique genes. Methylated loci were also concentrated in exons, with 1,013,691 CpGs (31.9%) in exons versus 1,448,786 (45.5%) loci in introns. Transposable elements contained (TE-all) 755,222 methylated CpGs (23.7%), with 610,208 loci (19.2%) overlapping with TE-Cg. Putative promoter regions 1 kb upstream of transcription start sites overlapped with 134,534 loci (4.2%). There were 386,003 methylated loci (12.1%) that did not overlap with either exons, introns, transposable elements, or promoter regions.

Chi-squared tests and figures

I intially thought I would make one figure comparing total CpGs, methylated CpGs, and DML, but Steven said I should only compare a list of interest to its background. This means I need to make two separate figures: one to compare total and methylated CpGs, and one to compare methylated CpGs and DML.

I quickly went into this Jupyter notebook to count how many CG motifs did not overlap with either exons, introns, transposable elements (all), or putative promoters. I found 4760,788 CpGs did not overlap with any characterized region. Then, I created this file in Excel (there’s probably a way to automate this but I can’t think of it right now) with genomic locations for all CpGs, methylated CpGs, and DML. From my chi-squared test of homogeneity in this R Markdown file, I found that the distribution of methylated CpGs in the C. virginica genome was different than the distribution of all CpGs (chi-squared = 5,028,800, df = 4, p-value < 2.2e-16). Similarly, the distribution of DML was different than the distribution of methylated CpGs (chi-squared = 347.01, df = 4, p-value < 2.2e-16). I couldn’t figure out the best way to conduct a post-hoc test, so instead I made figures.

Screen Shot 2019-04-10 at 11 41 02 PM

Figure 5. Distribution of total CpG and methylated CpG in the C. virginica genome

Screen Shot 2019-04-10 at 11 40 52 PM

Figure 6. Distribution of methylated CpG and DML

Going forward

  1. Figure out post-hoc test for chi-squared test
  2. Describe (somehow) genes with DML in them
  3. Figure out what’s going on with the gene background
  4. Figure out what’s going on with DMR
  5. Work through gene-level analysis
  6. Update paper repository
  7. Start writing the discussion
  8. Draft timeline to finish the paper by the end of the month

// Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student http://bit.ly/2KsFwAA
via IFTTT

Shelly’s Notebook: Mon. April 8, 2019 Salmon Sea lice Project

Salmon sea lice project

  • met with Steven and Christian today about directions for epigenetic exploration of salmon infected with sea lice
  • need to determine best method for epigenetic analysis for these samples (e.g. ATAC-seq, amplicon BS-seq, ChIP-seq, etc.)
    • it’s hard to say how well the chromatin structure is maintained in ethanol fixed samples (see ATAC-seq references on Epigenome methods slide. This could contribute to variation between individuals.
    • From the references I found so far (see Epigenome methods slide), it seems like RRBS or amplicon BS sequencing would be the better way to go.

from shellytrigg http://bit.ly/2uX3bhX
via IFTTT

Grace’s Notebook: Bairdi RNA extractions- other methods to try; Plan for Week

Between trying to add some edits and clean things up for the 2015 Oysterseed project paper, I’ve been reading protocols for other RNA extraction kits. Zymo Quick-DNA/RNA Mircoprep Plus Kit, Zymo Directzol RNA MiniPrep… Details in post

Zymo Research Quick-DNA/RNA Microprep Plus Kit

Cat. D7005
Manual pdf

This was recommended to Sam and I from Pam. Sam requested a free sample, which should be arriving some time this week.

Zymo Direct-zol RNA MiniPrep

Cat. R2050
Manual pdf

We already have this from when Hollie was here in 2016. There’s one box that’s opened, and one unopened 50rxns kit box.

This week

Crab

I’ll be able to try out both kits (Microprep should arrive this week; we already have Direct-zol)

Oysterseed

paper
Intro: update it and tailor it more to the project
Methods: get them in a solid state – address all comments and get details right
Results: identify all results and include in paper; are we using phenotypic data even though I can’t get it for just the control silos?
Discussion: write it up after results are finalized

Skyline document: finish publishing it to Panorama (need project title, abstract in order to finalize publication)
Have this all done by next Wednesday
I think once the results are finalized, I’ll feel able to finish up Intro and Discussion

from Grace’s Lab Notebook http://bit.ly/2I5mmPb
via IFTTT

Sam’s Notebook: Data Management – Whole Genome Bisulfite Sequencing Data from Genewiz Received

Received the WGBS data from Genewiz that were sent to Genewiz for whole genome bisulfite sequencing on 20190207. These were from:

Transferred data to nightingales/C_gigas or nightingales/P_generosa and verified MD5 checksums. Have added list of files to nightingales spreadsheet(Google Sheet).

from Sam’s Notebook http://bit.ly/2I81EOq
via IFTTT