Genome Annotations – Splice Site and Exon Extractions for C.virginica GCF_002022765.2 Genome Using Hisat2 on Mox

I previously performed quality trimming on the Crassostrea virginica (Eastern oyster) gonad/sperm RNAseq data on 20210714. Next, I needed to identify exons and splice sites and generate a genome index with HISAT2, which will be used downstream with StringTie to identify potential alternative transcripts. This used the following NCBI genome files:

  • FastA: GCF_002022765.2_C_virginica-3.0_genomic.fna
  • GFF: GCF_002022765.2_C_virginica-3.0_genomic.gff
  • GTF: GCF_002022765.2_C_virginica-3.0_genomic.gtf

Metadata for this project is here:

https://ift.tt/3eAxT7y

This was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices

## Script using HISAT2 to build a genome index and identify exons and splice sites in the NCBI C.virginica genome assembly.

###################################################################################
# These variables need to be set by user

## Assign Variables

# Set number of CPUs to use
threads=40

genome_index_name="cvir_GCF_002022765.2"

# Paths to programs
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2_build="${hisat2_dir}/hisat2-build"
hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

# Input/output files
exons="cvir_GCF_002022765.2_hisat2_exons.tab"
genome_dir="/gscratch/srlab/sam/data/C_virginica/genomes"
genome_gff="${genome_dir}/GCF_002022765.2_C_virginica-3.0_genomic.gff"
genome_fasta="${genome_dir}/GCF_002022765.2_C_virginica-3.0_genomic.fna"
splice_sites="cvir_GCF_002022765.2_hisat2_splice_sites.tab"
transcripts_gtf="${genome_dir}/GCF_002022765.2_C_virginica-3.0_genomic.gtf"

# Programs associative array
declare -A programs_array
programs_array=(
  [hisat2_build]="${hisat2_build}" \
  [hisat2_exons]="${hisat2_exons}" \
  [hisat2_splice_sites]="${hisat2_splice_sites}"
)

###################################################################################################
# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Create Hisat2 exons tab file
"${programs_array[hisat2_exons]}" \
"${transcripts_gtf}" \
> "${exons}"

# Create Hisat2 splice sites tab file
"${programs_array[hisat2_splice_sites]}" \
"${transcripts_gtf}" \
> "${splice_sites}"

# Build Hisat2 reference index using splice sites and exons
"${programs_array[hisat2_build]}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2_build.err

# Generate checksums for all files
md5sum * >> checksums.md5

# Copy Hisat2 index files to my data directory for later use with StringTie
rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"

#######################################################################################################
# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help
    fi

    ${programs_array[$program]} -h
    echo ""
    echo ""
    }
  done
fi
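For context, the index built here feeds a spliced-alignment step whose sorted BAMs then go to StringTie for transcript assembly. Below is a minimal sketch of that downstream step for a single library; the sample file names, thread count, and output names are placeholders rather than part of the actual job script.

# Sketch only: spliced alignment of one trimmed library against the index built above.
# Sample names, thread count, and output names are placeholders.
hisat2 \
-x cvir_GCF_002022765.2 \
-1 sample-01_R1.fastp-trim.fq.gz \
-2 sample-01_R2.fastp-trim.fq.gz \
--dta \
-p 40 \
2> sample-01_hisat2.err \
| samtools sort -@ 40 -O bam -o sample-01.sorted.bam -

samtools index sample-01.sorted.bam

# Reference-guided assembly of potential alternative transcripts with StringTie
stringtie sample-01.sorted.bam \
-G GCF_002022765.2_C_virginica-3.0_genomic.gff \
-p 40 \
-o sample-01.gtf

The --dta flag asks HISAT2 to report alignments tailored for downstream transcriptome assemblers like StringTie.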

Trimming – C.virginica Gonad RNAseq with FastP on Mox

Needed to trim the Crassostrea virginica (Eastern oyster) gonad RNAseq data we received on 20210528.

All the metadata associated with these samples is available here:

https://ift.tt/3eAxT7y

Decided to run fastp, followed by MultiQC, on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20210714_cvir_gonad_RNAseq_fastp_trimming
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210714_cvir_gonad_RNAseq_fastp_trimming

### Yaamini's C.virginica gonad RNAseq trimming using fastp, and MultiQC.
### Expects input FastQ files to be in format: *_R1.fastq.gz

###################################################################################
# These variables need to be set by user

## Assign Variables

# Set number of CPUs to use
threads=40

# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
raw_reads_dir=/gscratch/scrubbed/samwhite/data/C_virginica/RNAseq/
fastq_checksums=raw_fastq_checksums.md5

# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

## Initialize arrays
fastq_array_R1=()
fastq_array_R2=()
R1_names_array=()
R2_names_array=()

# Programs associative array
declare -A programs_array
programs_array=(
  [fastp]="${fastp}" \
  [multiqc]="${multiqc}"
)

###################################################################################
# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${raw_reads_dir}"*.gz .

# Create arrays of fastq R1 files and sample names
for fastq in *_R1.fastq.gz
do
  fastq_array_R1+=("${fastq}")
  R1_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[._]";OFS = "_"} {print $1, $2}')")
done

# Create array of fastq R2 files
for fastq in *_R2.fastq.gz
do
  fastq_array_R2+=("${fastq}")
  R2_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[._]";OFS = "_"} {print $1, $2}')")
done

# Create list of fastq files used in analysis
# Create MD5 checksum for reference
for fastq in *.gz
do
  echo "${fastq}" >> input.fastq.list.txt
  md5sum "${fastq}" >> "${fastq_checksums}"
done

# Run fastp on files
# Adds JSON report output for downstream usage by MultiQC
for index in "${!fastq_array_R1[@]}"
do
  R1_sample_name=$(echo "${R1_names_array[index]}")
  R2_sample_name=$(echo "${R2_names_array[index]}")
  ${fastp} \
  --in1 ${fastq_array_R1[index]} \
  --in2 ${fastq_array_R2[index]} \
  --detect_adapter_for_pe \
  --thread ${threads} \
  --html "${R1_sample_name}".fastp-trim."${timestamp}".report.html \
  --json "${R1_sample_name}".fastp-trim."${timestamp}".report.json \
  --out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
  --out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz

  # Generate md5 checksums for newly trimmed files
  {
    md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz
    md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
  } >> "${trimmed_checksums}"

  # Remove original FastQ files
  echo ""
  echo " Removing ${fastq_array_R1[index]} and ${fastq_array_R2[index]}."
  rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done

# Run MultiQC
${multiqc} .

####################################################################
# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help
    fi

    ${programs_array[$program]} -h
    echo ""
    echo ""
    }
  done
fi
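As an aside, the sample names used for the trimmed output files come from the awk call in the loops above, which splits each FastQ filename on periods and underscores and rejoins the first two fields. A quick illustration with a made-up filename (not one of the actual libraries):

# Example only (made-up filename): how sample names are derived from FastQ filenames
echo "S12M_R1.fastq.gz" \
| awk 'BEGIN {FS = "[._]"; OFS = "_"} {print $1, $2}'
# Prints: S12M_R1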

Creating PCAs

Alright, as of last time, our general goal was to do the following (along with a few other priorities):

  1. Track down data on infection level of crabs over time, and plot
  2. For each of our three transcriptomes, make a PCA with one point for each library. See if points cluster together by day, temperature, infection level (from qPCR data), and crab. This will be done using DESeq2
  3. Align crabs D and F to hemat_transcriptomev1.6 and cbai_transcriptomev2.0. This should just check whether our alignment works well – since these are uninfected crab, we shouldn’t get any alignments to hemat_v1.6, and the alignment to cbai_v2.0 (which is unfiltered) should give an idea of the accuracy of our filtering for cbai_v4.0 (which is filtered to exclude all non-bairdi sequences).
  4. Write a blog post on the bittercrab website

We can check off number 4 – take a peek here! The rest of the entry…

from Aidan F. Coyle https://ift.tt/3hQ6UqR
via IFTTT

Manual Modules 3

This hasn’t exactly been the most productive few weeks I’ve had, but for some extremely happy reasons! Matthew and I took a trip up to Alaska and got engaged!! As a result, I haven’t exactly been too focused on crabs for the past week or two, but you know what – I think it’s worth it. Alright, we’re back in Seattle now, so it’s back to business.

A quick and broad recap of the goal (primary): analyze gene expression in Hematodinium-infected Chionoecetes. At this point, rather than examining differentially-expressed genes, I’m clustering genes into modules and then examining the composition of each module over time/temperature treatment. There are two ways I’m trying to accomplish this:

  1. WGCNA. This takes a complete set of expression data and clusters genes into modules based on similar expression patterns. It then examines modules to see if any show a significant correlation with either time or temperature…

from Aidan F. Coyle https://ift.tt/3kaVwY5
via IFTTT

Summary – Geoduck RNAseq Data

Per this GitHub Issue, I’ve compiled a summary table, with links, of all of our Panopea generosa (Pacific geoduck) RNAseq data as it exists in NCBI. This will be a “dynamic” notebook entry, whereby I will update this post continually as we acquire new data and/or change the information we’d like to have here.

Run          BioSample     Experiment  Library_ID            Sample_Name                Tissue/Treatment
SRR12218868  SAMN15524261  SRX8729323  Geoduck-heart-RNA     geoduck-20150811-heart     heart
SRR12218869  SAMN15524260  SRX8729322  Geoduck-gonad-RNA     geoduck-20150811-gonad     gonad
SRR12218870  SAMN15524251  SRX8729321  Geoduck-ctenidia-RNA  geoduck-20150811-ctenidia  ctenidia
SRR12226692  SAMN15524260  SRX8735427  NR006                 geoduck-20150811-gonad     gonad
SRR12226693  SAMN15524251  SRX8735426  NR012                 geoduck-20150811-ctenidia  ctenidia
SRR12207404  SAMN15517239  SRX8718216  EPI_124               EPI_124                    juvenile (ambient)
SRR12207405  SAMN15517238  SRX8718215  EPI_123               EPI_123                    juvenile (ambient)
SRR12207406  SAMN15517237  SRX8718214  EPI_116               EPI_116                    juvenile (super-low)
SRR12207407  SAMN15517236  SRX8718213  EPI_115               EPI_115                    juvenile (super-low)
SRR12227930  SAMN15517239  SRX8736520  NR019                 EPI_124                    juvenile (ambient)
SRR12227931  SAMN15517237  SRX8736519  NR005                 EPI_116                    juvenile (super-low)
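If any of these runs need to be pulled down locally, something like the following should work. This is just a sketch that assumes sra-tools is installed and uses the first run accession from the table as an example.

# Sketch, assuming sra-tools is installed: download and convert one run from the table above
prefetch SRR12218868
fasterq-dump SRR12218868 --split-files --threads 8 --outdir .
gzip SRR12218868_*.fastq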

from Sam’s Notebook https://ift.tt/2Vq4t5o
via IFTTT

WGBS Analysis Part 32

Miscellaneous analyses

After updating my methods and results sections with information from my 4 vs. 4 analysis, I realized there were a few things I needed to complete before sending my draft to my Reading Committee.

Modifying GOterm annotations

The first thing I wanted to do was add methylation difference information from methylKit to the GOterm annotations. This way, I could see if certain genes or biological processes were more hyper- or hypomethylated in my low pH samples. I returned to this Jupyter notebook and modified my paste | awk code to include the column with methylation difference information:

#Column 5 = meth.diff
!paste DML-pH-50-Cov5-All.GeneIDs ../../output/10_DML-characterization/DML-pH-50-Cov5-All-Gene-wb.bed \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $3, $4, $5}' \
| sort \
> DML-pH-50-Cov5-All.GeneIDs.geneOverlap

Since all of the files build off of each other in my code, my final annotation file now included methylation difference information.

I then wanted to count the number of genes and transcripts associated with the enriched GOterms. I opened this R Markdown script, then created a dataframe with only the enriched GOterms (P-value < 0.01). I merged the enriched GOterms with the annotation information, then counted the number of unique genes or transcripts:

sigRes.allBP <- allRes.allBP[1:35, c(1, 6)] #Filter significantly enriched GOterms, only keep GOID and p-value
colnames(sigRes.allBP) <- c("GOID", "p.value") #Change column names
head(sigRes.allBP)

sigRes.allBPAnnot <- merge(sigRes.allBP, allDMLGOtermsBP, by = "GOID") #Additional annotations
sigRes.allBPAnnot <- unique(sigRes.allBPAnnot[,-11]) #Drop GOcat column and retain only unique lines
length(unique(sigRes.allBPAnnot$geneID)) #20 unique genes
length(unique(sigRes.allBPAnnot$transcript)) #42 unique transcripts
head(sigRes.allBPAnnot) #Confirm formatting

Interestingly, there were 20 genes associated with enriched biological process GOterms, but 42 transcripts! This tells me that the DML associated with enriched GOterms are found in genes that produce multiple transcripts, which could be relevant to alternative splicing as a way to regulate gene expression. I repeated this process with the molecular function GOterms, and found that they were associated with 12 unique genes and 50 unique transcripts.

Visualizing enriched terms with simplifyEnrichment

In the same R Markdown script, I decided to create heatmaps similar to Chille et al. (2021) to visualize my enriched GOterms. When I tried installing the simplifyEnrichment package, I realized it was only available for the latest version of R! I couldn’t update R on genefish, so instead I updated R on my own computer. Once I had the latest R version, I was able to install simplifyEnrichment with BiocManager.

To make the heatmap, I needed to create a semantic similarity matrix of my enriched GOterms. Following the instruction manual, I calculated the semantic similarity matrix with GO_similarity, using the “Relevance” measure (the default). Then, I used simplifyGO to create the plot! I had to dig into the manual to find out how to change aesthetics related to the word cloud and text sizing:

matBP <- GO_similarity(go_id = sigRes.allBP$GOID, ont = "BP") #Calculate the semantic similarity matrix using the Rel method (default)

#pdf("figures/simplifyEnrichment-BP.pdf", width = 11, height = 8.5) #Save figure
simplifyGO(matBP,
           column_title = "",
           col = rev(plotColors),
           fontsize_range = c(10, 40),
           word_cloud_grob_param = list(col = "black", max_width = 25)) #Plot GOterms based on semantic similarity. Do not include a column title. Set colors to plot colors, and set font size to range from 10 to 40. Pass arguments to word_cloud_grob_param to dictate word color and maximum width
#dev.off()

Figures 1-2. Biological process and molecular function simplifyEnrichment figures

I used the biological process figure in the main text, and kept the molecular function one in the supplement.

Sequencing summary table

The last thing I wanted to do was concatenate all the information related to sequence trimming and alignment. I looked through the FastQC and Bismark alignment, deduplication, and methylation extraction MultiQC reports to create this monster table for the supplement.
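For what it’s worth, if those per-tool reports ever need to be collapsed into a single document rather than read separately, MultiQC can be pointed at several output directories at once. A rough sketch, with placeholder directory names:

# Rough sketch (directory names are placeholders): build one combined MultiQC report
# spanning the trimming and alignment outputs
multiqc \
fastqc-reports/ \
bismark-alignments/ \
bismark-deduplication/ \
bismark-methylation-extraction/ \
--filename wgbs-trimming-alignment-summary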

Going forward

  1. Report mc.cores issue to methylKit
  2. Perform randomization test
  3. Update mox handbook with R information
  4. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/3xyXVQ9
via IFTTT