Sam’s Notebook: Trimming/FastQC/MultiQC – C.bairdi RNAseq FastQ with fastp on Mox

After receiving our RNAseq data from Genewiz earlier today, needed to run FastQC, trim, check trimmed reads with FastQC.

FastQC on raw reads was run locally and files were kept on owl/nightingales/C_bairdi.

fastp trimming was run on Mox, followed by MultiQC.

FastQC on trimmed reads were run locally, followed by MultiQC.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=cbai_fastp_trimming_RNAseq ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200318_cbai_RNAseq_fastp_trimming ### C.bairdi RNAseq trimming using fastp. # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log # Set number of CPUs to use threads=27 # Input/output files trimmed_checksums=trimmed_fastq_checksums.md5 raw_reads_dir=/gscratch/scrubbed/samwhite/data/C_bairdi/RNAseq/ # Paths to programs fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() programs_array=() R1_names_array=() R2_names_array=() # Programs array programs_array=("${fastp}" "${multiqc}") # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${programs_array[program]}: " echo "" ${programs_array[program]} -h echo "" echo "" echo "

Sam’s Notebook: Data Received – C.bairdi RNAseq Data from Genewiz

We received the RNAseq data from the RNA that was sent out by Grace on 20200212.

Sequencing is 150bp PE.

Grace has a Google Sheet that describes what the samples constitute (e.g. ambient/cold/warm, infected/uninfect, day, etc.)

Genewiz report:

Project Sample ID Barcode Sequence # Reads Yield (Mbases) Mean Quality Score % Bases >= 30
30-343338329 72 ACTCGCTA+TCGACTAG 27,249,335 8,175 34.16 85.82
30-343338329 73 ACTCGCTA+TTCTAGCT 25,856,008 7,757 33.87 84.36
30-343338329 113 ACTCGCTA+CCTAGAGT 31,638,462 9,492 32.38 77.77
30-343338329 118 ACTCGCTA+GCGTAAGA 29,253,455 8,776 33.50 82.62
30-343338329 127 ACTCGCTA+CTATTAAG 27,552,329 8,266 33.14 81.13
30-343338329 132 ACTCGCTA+AAGGCTAT 27,518,702 8,256 34.86 88.87
30-343338329 151 ACTCGCTA+GAGCCTTA 33,430,314 10,029 35.01 89.35
30-343338329 173 ACTCGCTA+TTATGCGA 33,262,459 9,979 34.45 87.06
30-343338329 178 GGAGCTAC+TCGACTAG 29,495,389 8,849 35.01 89.62
30-343338329 221 GGAGCTAC+TTCTAGCT 25,902,415 7,771 34.76 88.40
30-343338329 222 GGAGCTAC+CCTAGAGT 53,808,137 16,142 30.90 71.11
30-343338329 254 GGAGCTAC+GCGTAAGA 16,771,613 5,031 35.14 90.03
30-343338329 272 GGAGCTAC+CTATTAAG 27,818,893 8,346 33.30 81.70
30-343338329 280 GGAGCTAC+AAGGCTAT 61,008,799 18,303 30.85 70.86
30-343338329 294 GGAGCTAC+GAGCCTTA 28,539,233 8,562 35.12 90.04
30-343338329 334 GGAGCTAC+TTATGCGA 25,916,895 7,775 34.98 89.39
30-343338329 349 GCGTAGTA+TCGACTAG 32,868,756 9,861 33.53 82.69
30-343338329 359 GCGTAGTA+TTCTAGCT 27,274,149 8,182 34.96 89.20
30-343338329 425 GCGTAGTA+CCTAGAGT 66,224,932 19,867 29.54 65.13
30-343338329 427 GCGTAGTA+GCGTAAGA 18,918,640 5,676 33.31 80.87
30-343338329 445 GCGTAGTA+CTATTAAG 30,745,388 9,224 33.07 80.83
30-343338329 463 GCGTAGTA+AAGGCTAT 19,531,145 5,859 34.27 86.08
30-343338329 481 GCGTAGTA+GAGCCTTA 50,592,084 15,178 31.92 75.59
30-343338329 485 GCGTAGTA+TTATGCGA 26,010,208 7,803 34.63 87.48

Confirmed that SFTP transfer from Genewiz to owl/nightingales/C_bairdi/ was successful:

screencap of md5sum output

Shelly’s Notebook: Mon. Mar. 16, 2020 Trimming Geoduck RRBS data

This entry is about trimming for the 2016 juvenile geoduck RRBS data Hollie generated.

Multi-core TrimGalore!

TrimGalore! can be run with multi-core settings if you use version 0.6.1 or newer. Reference to the multi-core TrimGalore! update is here: Using 8 cores reduced the run time for 100M reads from ~2hr10min to ~30min.

Trimming history of RRBS data

Illumina Recommended Trimming:

  • I spoke with Dina from Illumina tech support today and she found trimming recommendations for the Illumina Truseq Methylation Kit here on page 45: Illumina adapter sequences reference
      • The first 13 bases (bolded above) correspond to the universal illumina adapter sequence you can specify in TrimGalore (AGATCGGAAGAGC)
      • There is an additional 12bp added on by the Illumina Truseq Methylation Kit that are recommeneded to be trimmed off
  • Bismark User Guide sections TruSeq DNA-Methylation Kit (formerly EpiGnome) and Random priming and 3’ Trimming in general
  • Illumina Truseq Methylation Kit workflow
  • Adaptor-tagged%20TruSeq%20DNA%20Methylation%20LIbrary%20Kit%20Workflow.png

Testing Recommended Trimming Parameters

TrimGalore! with new parameters

I performed a test on just one sample: EPI-167

(base) [strigg@mox2 raw]$ wget --no-check-certificate --2020-03-16 20:29:13-- Resolving ( Connecting to (||:443... connected. WARNING: cannot verify's certificate, issued by ‘/C=US/ST=MI/L=Ann Arbor/O=Internet2/OU=InCommon/CN=InCommon RSA Server CA’: Unable to locally verify the issuer's authority. HTTP request sent, awaiting response... 200 OK Length: 1451174652 (1.4G) [application/x-gzip] Saving to: ‘EPI-167_S10_L002_R1_001.fastq.gz’ 100%[=============================================================================================>] 1,451,174,652 27.6MB/s in 51s 2020-03-16 20:30:04 (26.9 MB/s) - ‘EPI-167_S10_L002_R1_001.fastq.gz’ saved [1451174652/1451174652] (base) [strigg@mox2 raw]$ wget --no-check-certificate --2020-03-16 20:30:08-- Resolving ( Connecting to (||:443... connected. WARNING: cannot verify's certificate, issued by ‘/C=US/ST=MI/L=Ann Arbor/O=Internet2/OU=InCommon/CN=InCommon RSA Server CA’: Unable to locally verify the issuer's authority. HTTP request sent, awaiting response... 200 OK Length: 1496018906 (1.4G) [application/x-gzip] Saving to: ‘EPI-167_S10_L002_R2_001.fastq.gz’ 100%[=============================================================================================>] 1,496,018,906 27.2MB/s in 53s 2020-03-16 20:31:01 (27.1 MB/s) - ‘EPI-167_S10_L002_R2_001.fastq.gz’ saved [1496018906/1496018906] 

Alignments with new trimming

  • ran this script
    • check Mbias plots in report
    • check percent methylation
      • previous
      • new trimming
        Trim date 03/16/2020 05/16/2018
        Read pairs analyzed 23436512 24481250
        mapping efficiency (%) 40.9 42.6
        ambiguously mapped read pairs (%) 11.8 8.2
        unaligned read pairs 47.3 49.2
        mC in CpG (%) 25.3 27.9
        mC in CHG (%) 1.7 2.9
        mC in CHH (%) 2.7 3
        mC in CN or CHN (%) 4.9 8.5
    • determine if deduplicating should be done
      • previous report showed 26.85% duplicate alignments were removed – NOTE: previous alignments were done using genome v074. Although there shouldn’t be a difference between this genome and the one on OFS (Panopea-generosa-v1.0.fa), I am currently performing alignment of the 5/16/19 trimmed reads and the 9/23/19 trimmed reads for EPI-167/

from shellytrigg

Sam’s Notebook: DNA Isolation and Quantification – C.bairdi Hemocyte Pellets in RNAlater

Isolated DNA from 22 samples (see Qubit spreadsheet in “Results” below for sample IDs) using the Quick DNA/RNA Microprep Kit (ZymoResearch; PDF) according to the manufacturer’s protocol for liquids/cells in RNAlater.

These samples were from an RNA isolation on the following date:

Brief rundown of method:

  • Used 35uL from each RNAlater/hemocyte slurry.
  • Mixed with equal volume of H2O (35uL).
  • Retained DNA on the Zymo-Spin IC-XM columns at 4oC for isolation after RNA isolation.
  • DNA was eluted in 15uL H2O

DNA was quantified on the Roberts Lab Qubit 3.0 using the 1x DNA High Sensitivity Assay (Invitrogen), using 1uL of each sample.

Sam’s Notebook: qPCR – C.bairdi RNA Check for Residual gDNA

Previuosly checked existing crab RNA for residual gDNA on 20200226 and identified samples with yields that were likely too low, as well as samples with residual gDNA. For those samples, was faster/easier to just isolate more RNA and perform the in-column DNase treatment in the ZymoResearch Quick DNA/RNA Microprep Plus Kit; this keeps samples concentrated. So, I isolated more RNA on 20200306 and now need to check for residual gDNA.

Used 2ng of RNA (1uL) in each reaction. A 5uL dilution of each sample was made to a concentration of 2ng/uL. The decision for this quantity was based on the amount of RNA we might use in a reverse transcription reaction (50ng/25uL) and the volume of the resulting cDNA we’d run in a qPCR reaction (1uL). Calculations and the qPCR results (Cq values) are below (Google Sheet).

All reactions were run with 2x SsoFast EVA Green qPCR Master Mix (BioRad) on the Roberts Lab CFX Connect qPCR machine.

All samples were run in duplicate. See the qPCR Report (in Results section below) for plate layout, cycling params, etc.

Master mix calcs are here (Google Sheet):

Shelly’s Notebook: Mon. Mar. 9, 2020 Geoduck Sig. DMR heatmaps with group means

Plotting % methylation of DMRs ID’d with ANOVA

I added lines 312-365 to this R markdown script MCmax30_asinT_groupStats.Rmd to plot group means of % methylation for each of the 4 comparisons (all ambient samples, all day 10 samples, all day 135 samples, and all day 145 samples).

I changed the heat colors to match more closely with the heatmaps Hollie has been generating. We determined that row scaling in the heatmap is the best way to visualize group differences in methylation for each comparison and that pheatmap defaults to no scaling

Significant DMRs from all ambient samples: amb_MCmax30DMR_Taov0.1_GROUPmean_heatmap2.jpg

Significant DMRs from all Day 10 samples: d10_MCmax30DMR_Taov0.1_GROUPmean_heatmap2.jpg

Significant DMRs from all Day 135 samples: d135_MCmax30DMR_Taov0.1_GROUPmean_heatmap2.jpg

Significant DMRs from all Day 145 samples: d145_MCmax30DMR_Taov0.1_GROUPmean_heatmap2.jpg

Finally I created a figure for the manuscript with these heatmaps: asinT_groupStats_heatmaps.jpg

Next steps are to:

  • determine from GO enrichment which DMRs in which genes are mainly contributing to enriched terms
  • compare DMRs with Hollie’s DMGs and determine overlap (will do this at meeting on Friday)
  • compare GO enrichment results for DMRs with Hollie’s results for DMGs
  • There is defintely a difference in the number of DMGs identified by the binomial glm that Hollie ran (>1000 gene significant at FDR adj. p value < 0.05) and the number of DMRs identified by the ANOVA I ran (38 DMRs signficant at uncorrected ANOVA p value < 0.1). Determine if a different method (e.g. binomial glm) should be used for identifying significant DMRs.

from shellytrigg

Shelly’s Notebook: Sat. Mar. 7, 2020 Geoduck DMR GO analysis

GO Analysis of DMRs from 11/02/2019 analysis

DMRs were identified as described here:

Created Rmarkdown script (20200306_anno.Rmd) to perform GOseq on DMRs based on Hollie’s script (GM.Rmd).

from shellytrigg