Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Juvenile Super Low OA EPI115 with HiSeq Data on Mox

Ran a de novo assembly on our HiSeq data from Hollie’s juvenile EPI 115 sample “super low OA”. This was done for Christian to use in some long, non-coding RNA (lncRNA) analysis.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

  #!/bin/bash ## Job Name #SBATCH --job-name=trin_epi115 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=30-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_EPI115_RNAseq # Exit script if a command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo ${PATH} | tr : \\n >> system_path.log # User-defined variables reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/epi_115 threads=28 assembly_stats=assembly_stats.txt # Paths to programs trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" ## Inititalize arrays R1_array=() R2_array=() # Variables for R1/R2 lists R1_list="" R2_list="" # Create array of fastq R1 files R1_array=(${reads_dir}/*_R1_*.gz) # Create array of fastq R2 files R2_array=(${reads_dir}/*_R2_*.gz) # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in ${reads_dir}/*.gz do echo ${fastq##*/} >> fastq.list.txt done # Create comma-separated lists of FastQ reads R1_list=$(echo ${R1_array[@]} | tr " " ",") R2_list=$(echo ${R2_array[@]} | tr " " ",") # Run Trinity ${trinity_dir}/Trinity \ --trimmomatic \ --seqType fq \ --max_memory 120G \ --CPU ${threads} \ --left \ ${R1_list} \ --right \ ${R2_list} # Assembly stats ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \ > ${assembly_stats} # Create gene map files ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \ trinity_out_dir/Trinity.fasta \ > trinity_out_dir/Trinity.fasta.gene_trans_map # Create FastA index ${samtools} faidx \ trinity_out_dir/Trinity.fasta 

Sam’s Notebook: Transcriptome Assembly – Geoduck Tissue-specific Assembly Ctenidia with HiSeq and NovaSeq Data on Mox

I previously assembled and annotated P.generosa ctenidia transcriptome (20190318) using just our HiSeq data from our Illumina collaboration. This was a an oversight, as I didn’t realize that we also had NovaSeq RNAseq data. So, I’ve initiated another de novo assembly using Trinity incorporating both sets of data.

NovaSeq data had been previously trimmed.

Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.

SBATCH script (GitHub):

  #!/bin/bash ## Job Name #SBATCH --job-name=trin_ctenidia ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=30-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_ctenidia_RNAseq # Exit script if a command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo ${PATH} | tr : \\n >> system_path.log # User-defined variables reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/ctenidia threads=28 assembly_stats=assembly_stats.txt # Paths to programs trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" ## Inititalize arrays R1_array=() R2_array=() # Variables for R1/R2 lists R1_list="" R2_list="" # Create array of fastq R1 files R1_array=(${reads_dir}/*_R1_*.gz) # Create array of fastq R2 files R2_array=(${reads_dir}/*_R2_*.gz) # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in ${reads_dir}/*.gz do echo ${fastq##*/} >> fastq.list.txt done # Create comma-separated lists of FastQ reads R1_list=$(echo ${R1_array[@]} | tr " " ",") R2_list=$(echo ${R2_array[@]} | tr " " ",") # Run Trinity ${trinity_dir}/Trinity \ --trimmomatic \ --seqType fq \ --max_memory 120G \ --CPU ${threads} \ --left \ ${R1_list} \ --right \ ${R2_list} # Assembly stats ${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \ > ${assembly_stats} # Create gene map files ${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \ trinity_out_dir/Trinity.fasta \ > trinity_out_dir/Trinity.fasta.gene_trans_map # Create FastA index ${samtools} faidx \ trinity_out_dir/Trinity.fasta 

Sam’s Notebook: RNA Isolation and Quantification – Crab Hemolypmh Using Quick-DNA-RNA Microprep Plus Kit

In the continuing struggle to isolate RNA from the Chionoecetes bairdi hemolymph preserved in RNAlater, Pam managed to find the Quick-DNA-RNA Microprep Plus Kit (ZymoResearch) as a potential option. We received a free sample of the kit and so I gave it a shot. Grace pulled 10 samples she’d previously used to isolate RNA (and was unable to get anything out of virtually all of them using the RNeasy Plus Micro Kit (Qiagen)) for me to try out this new kit:

  • 31
  • 142
  • 285
  • 291
  • 317
  • 326
  • 419
  • 455
  • 503
  • 506

I’ve included an image of the sample tubes, due to the fact that they are colored, and I can’t remember the significance (insignificance?) of this. Want to make sure that is documented in case it’s important.

10 samples used for RNA isolation in ice bucket

Yaamini’s Notebook: DML Analysis Part 31

Finalized general methylation trends

After taking many steps backwards and only some steps forward in this issue, Sam, Steven, and I chatted about the best way to characterize methylation trends in C. virginica. We decided it would be best to use this file Sam made, which is a concatenation of all CpGs in my data, each with one coverage and percent methylation metric.

New method

In this Jupyter notebook, I downloaded Sam’s file, counted lines, and filtered out all loci with a minimum of 5x coverage.

 !awk '{if ($5+$6 >= 5) { print $1, $2-1, $3, $4, $5+$6}}' cvir_bsseq_all_pe_R1_bismark_bt2_pe.bismark.cov \ > 2019-04-09-All-5x-CpGs.bedgraph  

I then identified methylated, partially methylated, and unmethylated loci.

 awk '{if ($4 >= 50) { print $1, $2, $3, $4 }}' 2019-04-09-All-5x-CpGs.bedgraph \ > 2019-04-09-All-5x-CpG-Loci-Methylated.bedgraph awk '{if ($4 < 50) { print $1, $2, $3, $4}}' 2019-04-09-All-5x-CpGs.bedgraph \ | awk '{if ($4 > 10) { print $1, $2, $3, $4 }}' \ > 2019-04-09-All-5x-CpG-Loci-Sparsely-Methylated.bedgraph awk '{if ($4 <= 10) { print $1, $2, $3, $4 }}' 2019-04-09-All-5x-CpGs.bedgraph \ > 2019-04-09-All-5x-CpG-Loci-Unmethylated.bedgraph  

Here’s the breakdown:

  • 14,458,703 CG motifs in the C. virginica genome
  • 14,026,131 CpGs with data in the concatenation file (97.0%)
  • 4,304,257 loci with 5x coverage (30.7%)
  • 3,181,904 methylated loci (73.9%)
  • 481,788 sparsely methylated loci (11.2%)
  • 640,565 unmethylated loci (14.9%)

Seeing a roughly 75% methylation rate seemed weird, given how methylation rates in C. gigas were low. Sam suggested I look at my data in IGV to confirm “it’s legit.”

Screen Shot 2019-04-10 at 10 37 14 AM

Screen Shot 2019-04-10 at 10 33 47 AM

Screen Shot 2019-04-10 at 10 34 09 AM

Figures 1-3. All CpGs, methylated, sparsely methylated, and unmethylated loci in all C. virginica gonad samples.

Looking at the screenshots, it does seem like most of my data is methylated. It’s possible that gonad tissue has a higher methylation rate than ctenidia (what Mac used to describe C. gigas methylation trends), or that there really is a species difference. It may also be beneficial to set a 75% cutoff to define a locus as methylated. These are things I need to dig into in my discussion. In the meantime, remade my frequency distribution with this code.

Screen Shot 2019-04-10 at 11 52 07 AM

Figure 4. Frequency distribution of methylated CpGs in C. virginica gonad data.

Using the new list of methylated loci, I characterized the location of the loci in the C. virginica genome. Methylated CpGs were found primarily in genic regions, with 2,437,901 (76.6%) loci in 44,505 unique genes. Methylated loci were also concentrated in exons, with 1,013,691 CpGs (31.9%) in exons versus 1,448,786 (45.5%) loci in introns. Transposable elements contained (TE-all) 755,222 methylated CpGs (23.7%), with 610,208 loci (19.2%) overlapping with TE-Cg. Putative promoter regions 1 kb upstream of transcription start sites overlapped with 134,534 loci (4.2%). There were 386,003 methylated loci (12.1%) that did not overlap with either exons, introns, transposable elements, or promoter regions.

Chi-squared tests and figures

I intially thought I would make one figure comparing total CpGs, methylated CpGs, and DML, but Steven said I should only compare a list of interest to its background. This means I need to make two separate figures: one to compare total and methylated CpGs, and one to compare methylated CpGs and DML.

I quickly went into this Jupyter notebook to count how many CG motifs did not overlap with either exons, introns, transposable elements (all), or putative promoters. I found 4760,788 CpGs did not overlap with any characterized region. Then, I created this file in Excel (there’s probably a way to automate this but I can’t think of it right now) with genomic locations for all CpGs, methylated CpGs, and DML. From my chi-squared test of homogeneity in this R Markdown file, I found that the distribution of methylated CpGs in the C. virginica genome was different than the distribution of all CpGs (chi-squared = 5,028,800, df = 4, p-value < 2.2e-16). Similarly, the distribution of DML was different than the distribution of methylated CpGs (chi-squared = 347.01, df = 4, p-value < 2.2e-16). I couldn’t figure out the best way to conduct a post-hoc test, so instead I made figures.

Screen Shot 2019-04-10 at 11 41 02 PM

Figure 5. Distribution of total CpG and methylated CpG in the C. virginica genome

Screen Shot 2019-04-10 at 11 40 52 PM

Figure 6. Distribution of methylated CpG and DML

Going forward

  1. Figure out post-hoc test for chi-squared test
  2. Describe (somehow) genes with DML in them
  3. Figure out what’s going on with the gene background
  4. Figure out what’s going on with DMR
  5. Work through gene-level analysis
  6. Update paper repository
  7. Start writing the discussion
  8. Draft timeline to finish the paper by the end of the month

// Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student http://bit.ly/2KsFwAA
via IFTTT