Correlating Crab Variables

A pretty basic question that I haven’t really answered yet is how different variables are actually correlated with each other! So to figure this out, I decided to make a correlation matrix. Each variable applies to one individual crab. So we’re leaving the time dimension out of it, and are just examining covariance between crabs. I made this table a while back. Each row is a value for each individual crab. As a reminder, the three transcriptomes used for alignment in this experiment are: cbai_transcriptomev2.0: unfiltered by taxa cbai_transcriptomev4.0: filtered, presumably only C. bairdi seqs hemat_transcriptomev1.6: filtered, presumably only Hematodinium seqs The following columns are present: Crab ID (A-I) Uniq_ID: This links the notation used in Grace’s analysis (and mine) with some of the earlier Jensen tables Treatment: The temperature treatment the crab was exposed to Timepoints:The number of timepoints (3 for ambient- and lowered-temperature treatment groups, 2 for elevated)…

from Aidan F. Coyle https://ift.tt/3EnumoF
via IFTTT

Adaptive Immune Gene Dive

Previously, I identified several genes with GO terms associated with an adaptive immune response. A writeup to each of these is described in the link, but here’s the suspects: Q92956 (TNR14_HUMAN): Tumor necrosis factor receptor, interacts with CD160 O95971 (BY55_HUMAN): CD160 antigen. Likely plays role in both anti-viral innate immune response along with adaptive immune response P60033 (CD81_HUMAN): CD81 antigen. Whole bunch of roles, inc. structural component, enables receptor complex assembly upon encounter with pathogens, and more. NOTE: P35762 (CD81_MOUSE) was also present in the crab transcriptome. Alright, so we’ve got genes linked to adaptive immunity! This raises two key questions. Are these genes actually related to adaptive immunity? Are these genes correctly identified? Are these genes at a significant expression level? 1 and 2 are the big ones – the presence alone of adaptive immune-related genes would be a neat little finding. But 3 is also important to this…

from Aidan F. Coyle https://ift.tt/3E2mfO0
via IFTTT

Linear Modeling, for real

Longtime readers \ will remember my post from June on using linear modeling to model Hematodinium infection. In case you don’t, I’ll give you a quick recap As part of the QERM 514 class I took, I created a linear model from the survey data that Pam Jensen sent over. These data were Southeast AK Tanner crab surveys from 2007 to 2012, and I focused on modeling Hematodinium infection status. My model found the following linkages to Hematodinium infection. Shell condition: strong linkage. Low for fresh-shell crabs, high for new-shell, then decline as shell age increases Depth: Infection rates higher for crabs in shallower waters Carapace width: Larger crabs were more likely to be infected Julian day: Included in final model, but likely unimportant (p-val = 0.85) Black Mat model: Included in final model, not judged to be significant (p-val = 0.54). However, despite sampling ~800 crabs with Black Mat,…

from Aidan F. Coyle https://ift.tt/3zUI93j
via IFTTT

Jensen Samples

Alright, at this point, I’ve got a few different projects going. First, I’m examining gene expression within crabs infected with Hematodinium (using Grace Crandall’s samples). Second, I’m working on modeling a few decades of survey data on Hematodinium infection rates from the Alaska Department of Fish and Game (I’ve been meaning to write up a lab notebook post on that, will do it soon). However, recently a third project got added to the pile! During Hack Week, we went into the lab to tidy things up. While there, I found a HUGE number of archived samples that were sent over by Pam Jensen after her retirement. They generally consist of hemolymph samples in deep 96-well plates, and appear to contain samples from both infected and uninfected crabs. Samples date back to the mid-2000s, and there’s literally four large totes full of them. This is an absolute goldmine of data, and…

from Aidan F. Coyle https://ift.tt/38tXkEB
via IFTTT

Manually Clustering Immune Genes

Previously, we created files of transcripts that matched GO terms related to immune response. However, to make our steps more logically consistent, we recently changed those files to only include transcripts that matched the specific GO term (GO:0006955) for Immune Response. We created files for each of our three transcriptomes – cbai_transcriptomev2.0 (unfiltered), cbai_transcriptomev4.0 (bairdi only), and hemat_transcriptomev1.6 (hematodinium only). That was done using this script Once we had those files, we pulled both their raw counts and TPM counts for all immune-related transcripts for the two cbai transcriptomes. The hemat transcriptome was skipped, as there were only 5 transcripts that fit the immune response GO term. The raw counts were used to create PCAs of immune gene expression by each of our five possible variables – crab, day, hematodinium level, infeciton level, and temperature. Those PCAs are available for cbai_v2.0 and cbai_v4.0. Both PCAs were created with this script…

from Aidan F. Coyle https://ift.tt/2XTa8lR
via IFTTT

Immune Genes

A week or so ago, we decided to investigate a possible alternative path. Until now, we have generally been examining all genes. However, we decided that it may be interesting to look specifically at a subset in detail. Precisely, we decided to examine expression of immune genes among all individual libraries to see if we could spot any patterns. To do this, we took a file of accession IDs and their corresponding GO terms, and filtered it to only include the accession IDs with a GO term with “immune” in the term name. The list of GO terms linked is as follows. Since multiple transcripts can correspond to a single gene, the gene list is, of course, shorter. And since there is some overlap in the GO terms assigned to genes (for instance, a gene may have the GO terms for both “immune response” and “humoral immune response”), the totals…

from Aidan F. Coyle https://ift.tt/3iXYWfM
via IFTTT

Manual Modules 4

Back in mid-July, I manually created modules to cluster genes by expression patterns. Again, “manual” oversells it a bit – I really just provided a cut height, looked at the resulting clusters, and then named them by overall expression patterns. A recap of that naming system can be found in my lab notebook entry on 2021-07-12 (named Manual Modules 3) Anyway, clusters were created for each individual crab. However, after some discussion, we concluded that the previous bar for genes was a bit low. Previously, our bar was taking the libraries for the individual crab (say, the 3 libraries for Crab A), and only including genes with 5+ counts among all libraries. So for instance, a gene with 4 counts on Day 0, 0 on Day 2, and 1 on Day 17 would be included. Of course, the downside of this is that a) that’s a really low bar, so…

from Aidan F. Coyle https://ift.tt/3lNNntj
via IFTTT

Crab Plasticity

In our previous post, we created PCAs for each transcriptome, with each point as one library. This raised some interesting questions, notably that we didn’t necessarily see libraries from the same crab clustered together (though for some crab, they did clearly cluster). This could be an indication of crab plasticity, as they adjust to both tank effects and the temperature treatment. We initially considered looking at plasticity qualitatively, but figured it’d be better to go the extra mile and take a quantitative approach. Therefore, we modified our DESeq2 analysis scripts for our three transcriptomes – cbai_v2.0, cbai_v4.0, and hemat_v1.6. Yet another reminder – cbai_v2.0 is unfiltered, cbai_v4.0 is presumably Chionoecetes genes only, and hemat_v1.6 is presumably Hematodinium genes only. These modifications created a PCA for each, extracted the coordinates for PC1 and PC2 for each library, used that to get the centroid for each crab, and then calculated the mean…

from Aidan F. Coyle https://ift.tt/3luLkKL
via IFTTT

Genome Annotations – Splice Site and Exon Extractions for C.virginica GCF_002022765.2 Genome Using Hisat2 on Mox

Previously performed quality trimming on the Crassostrea virginica (Eastern oyster) gonad/sperm RNAseq data on 20210714. Next, I needed to identify exons and splice sites, as well as generate a genome index using HISAT2 to be used with StringTie downstream to identify potential alternative transcripts. This utilized the following NCBI genome files:

  • FastA: GCF_002022765.2_C_virginica-3.0_genomic.fna
  • GFF: GCF_002022765.2_C_virginica-3.0_genomic.gff
  • GTF: GCF_002022765.2_C_virginica-3.0_genomic.gtf

Metadata for this project is here:

https://ift.tt/3eAxT7y

This was run on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=5-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices ## Script using HiSat2 to build a genome index, identify exons, and splice sites in NCBI C.virginica genome assemlby using Hisat2. ################################################################################### # These variables need to be set by user ## Assign Variables # Set number of CPUs to use threads=40 genome_index_name="cvir_GCF_002022765.2" # Paths to programs hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0" hisat2_build="${hisat2_dir}/hisat2-build" hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py" hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py" # Input/output files exons="cvir_GCF_002022765.2_hisat2_exons.tab" genome_dir="/gscratch/srlab/sam/data/C_virginica/genomes" genome_gff="${genome_dir}/GCF_002022765.2_C_virginica-3.0_genomic.gff" genome_fasta="${genome_dir}/GCF_002022765.2_C_virginica-3.0_genomic.fna" splice_sites="cvir_GCF_002022765.2_hisat2_splice_sites.tab" transcripts_gtf="${genome_dir}/GCF_002022765.2_C_virginica-3.0_genomic.gtf" # Programs associative array declare -A programs_array programs_array=( [hisat2_build]="${hisat2_build}" \ [hisat2_exons]="${hisat2_exons}" \ [hisat2_splice_sites]="${hisat2_splice_sites}" ) ################################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Create Hisat2 exons tab file "${programs_array[hisat2_exons]}" \ "${transcripts_gtf}" \ > "${exons}" # Create Hisat2 splice sites tab file "${programs_array[hisat2_splice_sites]}" \ "${transcripts_gtf}" \ > "${splice_sites}" # Build Hisat2 reference index using splice sites and exons "${programs_array[hisat2_build]}" \ "${genome_fasta}" \ "${genome_index_name}" \ --exon "${exons}" \ --ss "${splice_sites}" \ -p "${threads}" \ 2> hisat2_build.err # Generate checksums for all files md5sum * >> checksums.md5 # Copy Hisat2 index files to my data directory for later use with StringTie rsync -av "${genome_index_name}"*.ht2 "${genome_dir}" ####################################################################################################### # Capture program options if [[ "${#programs_array[@]}" -gt 0 ]]; then echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "

Trimming – C.virginica Gonad RNAseq with FastP on Mox

Needed to trim the Crassostrea virginica (Eastern oyster) gonad RNAseq data we received on 20210528.

All the metadata associated with these samples are available here:

https://ift.tt/3eAxT7y

Decided to run fastp, followed with MultiQC, on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210714_cvir_gonad_RNAseq_fastp_trimming ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210714_cvir_gonad_RNAseq_fastp_trimming ### Yaamini's C.virginica gonad RNAseq trimming using fastp, and MultiQC. ### Expects input FastQ files to be in format: *_R1.fastq.gz ################################################################################### # These variables need to be set by user ## Assign Variables # Set number of CPUs to use threads=40 # Input/output files trimmed_checksums=trimmed_fastq_checksums.md5 raw_reads_dir=/gscratch/scrubbed/samwhite/data/C_virginica/RNAseq/ fastq_checksums=raw_fastq_checksums.md5 # Paths to programs fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() R1_names_array=() R2_names_array=() # Programs associative array declare -A programs_array programs_array=( [fastp]="${fastp}" \ [multiqc]="${multiqc}" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Capture date timestamp=$(date +%Y%m%d) # Sync raw FastQ files to working directory rsync --archive --verbose \ "${raw_reads_dir}"*.gz . # Create arrays of fastq R1 files and sample names for fastq in *_R1.fastq.gz do fastq_array_R1+=("${fastq}") R1_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[._]";OFS = "_"} {print $1, $2}')") done # Create array of fastq R2 files for fastq in *_R2.fastq.gz do fastq_array_R2+=("${fastq}") R2_names_array+=("$(echo "${fastq}" |awk 'BEGIN {FS = "[._]";OFS = "_"} {print $1, $2}')") done # Create list of fastq files used in analysis # Create MD5 checksum for reference for fastq in *.gz do echo "${fastq}" >> input.fastq.list.txt md5sum ${fastq} >> ${fastq_checksums} done # Run fastp on files # Adds JSON report output for downstream usage by MultiQC for index in "${!fastq_array_R1[@]}" do R1_sample_name=$(echo "${R1_names_array[index]}") R2_sample_name=$(echo "${R2_names_array[index]}") ${fastp} \ --in1 ${fastq_array_R1[index]} \ --in2 ${fastq_array_R2[index]} \ --detect_adapter_for_pe \ --thread ${threads} \ --html "${R1_sample_name}".fastp-trim."${timestamp}".report.html \ --json "${R1_sample_name}".fastp-trim."${timestamp}".report.json \ --out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \ --out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz # Generate md5 checksums for newly trimmed files { md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz } >> "${trimmed_checksums}" # Remove original FastQ files echo "" echo " Removing ${fastq_array_R1[index]} and ${fastq_array_R2[index]}." rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}" done # Run MultiQC ${multiqc} . #################################################################### # Capture program options if [[ "${#programs_array[@]}" -gt 0 ]]; then echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "