Trimming – Additional 20bp from C.virginica Gonad RNAseq with fastp on Mox

When I previously aligned trimmed RNAseq reads to the NCBI C.virginica genome (GCF_002022765.2) on 20210720, I specifically noted that alignment rates were consistently lower for males than females. However, I let that discrepancy distract me from a the larger issue: low alignment rates. Period! This should have thrown some red flags and it eventually did after Steven asked about overall alignment rate for an alignment of this data that I performed on 20220131 in preparation for genome-guided transcriptome assembly. The overall alignment rate (in which I actually used the trimmed reads from 20210714) was ~67.6%. Realizing this was a on the low side of what one would expect, it prompted me to look into things more and I came across a few things which led me to make the decision to redo the trimming:

  1. As mentioned, I used untrimmed reads for original set of RNAseq alignments. Although, as as Steven has pointed out in this GitHub Issue, trimming might not actually have any real impact on alignments as described in this paper: Read trimming is not required for mapping and quantification of RNA-seq reads at the gene level. Liao and Shi, 2020!
  2. I contacted ZymoResearch to see if they had Bioanalyzer electropherograms post-rRNA removal (thinking that a large amount of contaminating rRNA would easily explain the low mapping rates). They did not perform this analysis, but they did point me to a section of the RiboFree Kit they used when preparing the libraries explaining that the R2 reads should have an additional 10bp removed after adapter removal. Here’s the section from the manual:
Trimming Reads

The Zymo-Seq RiboFree® Total RNA Library Kit employs a low-
complexity bridge to ligate the Illumina® P7 adapter sequence to the

library inserts (See the library structure below). This sequence can
extend up to 10 nucleotides. QC analysis software (e.g., FastQC1
) may
raise flags such as “Per base sequence content” at the beginning of
Read 2 due to this low complexity bridge sequence.

If desired, these 10 nucleotides can be removed in addition to adapter
trimming. An example using Trim Galore!2

for such trimming is as below:

trim_galore --paired --clip_R2 10 \
-a NNNNNNNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
-a2 AGATCGGAAGAGCGTCGTGTAGGGAAAGA \
sample.R1.fastq.gz
sample.R2.fastq.gz

Considering that we wanted to use these for a transcriptome assembly (which I already performed on 20220207), this residual “low complexity bridge” could lead to some spurious results.

  1. Looking at some of the samples from the initial trimming suggested that additional trimming would improve things a bit. Here’s an example from the initial fastp trimming. View the sections with “After filtering: read1: base contents” and “After filtering: read2: base contents” to see that the first ~20bp ratios are a bit rough.

NOTE: Words are tough to read on dark background – sorry! Also, you can click on any section heading to collapse it – This reduces the need to scroll so much!

So, with that I decided to run an additional round of trimming to remove 20bp from the 5’ ends of R1 and R2 reads.

Job was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20220224_cvir_gonad_RNAseq_fastp_trimming
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20220224_cvir_gonad_RNAseq_fastp_trimming


### 2nd round of trimming Yaamini's C.virginica gonad RNAseq trimming using fastp, and MultiQC.
### Removes additional 20bp from 5' ends of R1 and R2 reads.

### 1st round of trimming performed on 20210714 by Sam.

### This additional round was prompted by low mapping rates to genome and feedback from ZymoResearch
### indicating that additional bases should be removed from 5' end of R2, after adapter removal.
### Reviewing intial trimming reports, it appears that an additional 20bp should be removed from
### 5' ends of both R1 and R2 reads.

### Expects input FastQ files to be in format: *_R[12].fastp-trim.20210714.fq.gz



###################################################################################
# These variables need to be set by user

## Assign Variables

# Set number of CPUs to use
threads=40

# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
reads_dir=/gscratch/srlab/sam/data/C_virginica/RNAseq/
fastq_checksums=input_fastq_checksums.md5

# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
R1_names_array=()
R2_names_array=()


# Programs associative array
declare -A programs_array
programs_array=(
[fastp]="${fastp}" \
[multiqc]="${multiqc}"
)


###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${reads_dir}"*fastp-trim.20210714.fq.gz .

# Create arrays of fastq R1 files and sample names
for fastq in *_R1.fastp-trim.20210714.fq.gz
do
  fastq_array_R1+=("${fastq}")
  R1_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[._]";OFS = "_"} {print $1, $2}')")
done

# Create array of fastq R2 files
for fastq in *_R2.fastp-trim.20210714.fq.gz
do
  fastq_array_R2+=("${fastq}")
  R2_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[._]";OFS = "_"} {print $1, $2}')")
done

# Create list of fastq files used in analysis
# Create MD5 checksum for reference
for fastq in *fastp-trim.20210714.fq.gz
do
  echo "${fastq}" >> input.fastq.list.txt
  md5sum ${fastq} >> ${fastq_checksums}
done

# Run fastp on files
# Adds JSON report output for downstream usage by MultiQC
for index in "${!fastq_array_R1[@]}"
do
  R1_sample_name=$(echo "${R1_names_array[index]}")
  R2_sample_name=$(echo "${R2_names_array[index]}")
  ${fastp} \
  --in1 ${fastq_array_R1[index]} \
  --in2 ${fastq_array_R2[index]} \
  --disable_adapter_trimming \
  --trim_front1 20 \
  --trim_front2 20 \
  --thread ${threads} \
  --html "${R1_sample_name}".fastp-trim.20bp-5prime."${timestamp}".report.html \
  --json "${R1_sample_name}".fastp-trim.20bp-5prime."${timestamp}".report.json \
  --out1 "${R1_sample_name}".fastp-trim.20bp-5prime."${timestamp}".fq.gz \
  --out2 "${R2_sample_name}".fastp-trim.20bp-5prime."${timestamp}".fq.gz

  # Generate md5 checksums for newly trimmed files
  {
      md5sum "${R1_sample_name}".fastp-trim.20bp-5prime."${timestamp}".fq.gz
      md5sum "${R2_sample_name}".fastp-trim.20bp-5prime."${timestamp}".fq.gz
  } >> "${trimmed_checksums}"

  # Remove original FastQ files
  echo ""
  echo " Removing ${fastq_array_R1[index]} and ${fastq_array_R2[index]}."
  rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done

# Run MultiQC
${multiqc} .

####################################################################

# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help
    fi
    ${programs_array[$program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
  } &>> program_options.log || true

    # If MultiQC is in programs_array, copy the config file to this directory.
    if [[ "${program}" == "multiqc" ]]; then
      cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
    fi
  done
fi


# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

RESULTS

Unsurprisingly, this was pretty fast since no adapter trimming was needed, at just under 2hrs:

fastp trimming runtime screencap on Mox

Output folder:

In my opinion, things look better. Here’s the same sample shown above, now with the additional 20bp trimmed from the 5’ ends of R1 and R2 reads:

Links to MultiQC report and trimmed FastQ files are below.

MultiQC Report (HTML):

- NOTE: Visit the output folder linked above to view each individual samples `fastp` HTML report; has additional info not shown in [`MultiQC`](https://multiqc.info/) summary report.

- [20220224_cvir_gonad_RNAseq_fastp_trimming/multiqc_report.html](https://gannet.fish.washington.edu/Atumefaciens/20220224_cvir_gonad_RNAseq_fastp_trimming/multiqc_report.html)

Trimmed FastQ files:

from Sam’s Notebook https://ift.tt/EOwHfrZ
via IFTTT

#ifttt, #sams-notebook

Fri. Feb. 18, 2022

NOPP-gigas-ploidy-temp analysis – Part 1 –
Project name: NOPP-gigas-ploidy-temp Funding source: National Oceanographic Partnership Program Species: Crassostrea gigas variable: ploidy, elevated seawater temperature, desiccation previous notebook entry next notebook entry Background We sent 72 samples for RNA sequencing (TagSeq) to the Genomic Sequencing and Analysis Facility at the University of Texas at Austin. Here is a…

from Matthew N. George, Ph.D. https://ift.tt/CRns7S2
via IFTTT

#ifttt, #matthew-n-george, #ph-d

Data Wrangling – C.virginica lncRNA Extractions from NCBI GCF_002022765.2 Using GffRead

Continuing to work on our Crassostrea virginica (Eastern oyster) project examining the effects of OA on female and male gonads (GitHub repo), Steven tasked me with parsing out long, non-coding RNAs (GitHub Issue). To do so, I relied on the NCBI genome and associated files/annotations. I used GffRead, GFFutils, and samtools. The process was documented in the followng Jupyter Notebook:


RESULTS

Output folder:

from Sam’s Notebook https://ift.tt/XR3ASs6
via IFTTT

#ifttt, #sams-notebook