Summary of the re-annotation against a salmonid database, the differential expression analysis, stress gene identification, and gene ontology enrichment analysis. R code and files are available upon request.
I wanted to rework the annotation and differential expression analysis of my RNAseq data before identifying stress genes for future analysis. The original annotation was against all species in the reviewed Swiss-Prot database. That annotation is hard to compare with other salmonid studies because gene descriptions are not easily matched across species, and Swiss-Prot includes only ~200 genes from the closely related rainbow trout (Oncorhynchus mykiss). To obtain a more uniform annotation for comparison with other studies of salmonid species, the transcriptome was blasted again, this time against a database of salmonid-only proteins. To create this database, all entries in the UniProt TrEMBL (unreviewed) and Swiss-Prot (reviewed) databases belonging to Oncorhynchus species were downloaded in FASTA format and a searchable BLAST database was created using BLAST+. Including the TrEMBL entries increased the number of possible salmonid matches from ~200 to >60,000. Blasting the transcriptome against this custom database resulted in 132,370 contigs with a match, representing 34,653 unique matches. This list was used for stress gene identification and for GO enrichment analysis.
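The contig and unique-match counts above can be derived from the BLAST results by keeping one best hit per contig. The following is a minimal Python sketch of that step, not the R code used for this analysis. It assumes the BLAST output was written in the standard 12-column tabular format (-outfmt 6) and that "best" means highest bit score; the contig names and UniProt-style IDs in the example are made up for illustration.

```python
import csv
import io

# Hypothetical BLAST tabular output (-outfmt 6); all IDs are invented.
# Columns: qseqid sseqid pident length mismatch gapopen
#          qstart qend sstart send evalue bitscore
EXAMPLE_HITS = (
    "contig_1\ttr|A0A001|A0A001_ONCMY\t98.0\t200\t4\t0\t1\t600\t1\t200\t1e-100\t380\n"
    "contig_1\ttr|A0A002|A0A002_ONCMY\t90.0\t180\t18\t0\t1\t540\t1\t180\t1e-80\t300\n"
    "contig_2\ttr|A0A001|A0A001_ONCMY\t95.0\t150\t7\t0\t1\t450\t1\t150\t1e-70\t260\n"
    "contig_3\tsp|P00003|HSP70_ONCMY\t99.0\t300\t3\t0\t1\t900\t1\t300\t0.0\t600\n"
)


def best_hits(handle):
    """Return {contig: (subject_id, bitscore)}, keeping the top-scoring hit per contig."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        contig, subject, bitscore = row[0], row[1], float(row[11])
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (subject, bitscore)
    return best


hits = best_hits(io.StringIO(EXAMPLE_HITS))
n_contigs_with_hit = len(hits)
n_unique_subjects = len({subject for subject, _ in hits.values()})
print(n_contigs_with_hit, "contigs with a match;", n_unique_subjects, "unique matches")
```

In this toy input two contigs share the same best hit, which is why the number of unique matches is smaller than the number of matched contigs, as in the real data.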
Differential Expression (DE) Analysis
To simplify the differential expression (DE) analysis, pairwise comparisons were recalculated using the simpler exact test (rather than the generalized linear model test used earlier), since only one variable (site location) differed among the samples. Raw counts were used to construct a count matrix covering all 24 individuals, with contigs identified by TrinityID. Non-metric multidimensional scaling was used to explore the structure of these raw counts:
This count matrix was analyzed with the R package edgeR. Counts were normalized by individual library size and lowly expressed contigs were removed from the analysis. Pairwise comparisons were conducted on all possible combinations of sites. Each pairwise comparison yields a DE table reporting the log-fold change and p-value for every contig. These DE tables were used for the stress gene and gene ontology enrichment analyses described below.
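To make the normalization and filtering step concrete, here is a minimal Python sketch of scaling counts to counts per million (CPM) by library size and dropping lowly expressed contigs. It only illustrates the idea and is not the edgeR code that was run. The thresholds (at least 1 CPM in at least 2 samples) and the toy counts are assumptions, since the cutoffs actually used are not stated here.

```python
def counts_per_million(counts):
    """Scale raw counts by each sample's library size.

    counts: {contig: [raw count per sample]} -> {contig: [CPM per sample]}
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        contig: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for contig, row in counts.items()
    }


def filter_low_expression(cpm, min_cpm=1.0, min_samples=2):
    """Keep contigs reaching min_cpm in at least min_samples samples (assumed cutoffs)."""
    return {
        contig: row
        for contig, row in cpm.items()
        if sum(value >= min_cpm for value in row) >= min_samples
    }


# Toy count matrix: three samples with different sequencing depths.
raw_counts = {
    "contig_1": [500, 600, 550],
    "contig_2": [0, 1, 0],
    "contig_3": [999500, 1199399, 1099450],
}

cpm = counts_per_million(raw_counts)
kept = filter_low_expression(cpm)
print(sorted(kept))
```

Here contig_1 has different raw counts in each sample but the same CPM once library size is accounted for, while contig_2 never reaches the cutoff and is dropped.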
Identification of Stress Genes
To identify known stress genes in the RNAseq data, the DE tables were combined without regard to site comparison. The interest here is purely in which genes from the list of potential stress genes are differentially expressed at any site compared to any other. This combined list was filtered to retain only contigs with a p-value < 0.05 and a fold change of at least 2, and was then annotated using the salmonid-derived BLAST data set. Redundant genes were removed for the purpose of gene identification; that is, where several contigs mapped to the same UniProt entry ID, only one was kept. The list was then compared, by gene description, to the list of rainbow trout stress genes identified in Wiseman et al. (2007). Gene descriptions were used rather than gene IDs because genes often have more than one entry number in the UniProt database, due either to multiple entries or to different isoforms. From this comparison, 51 stress genes were identified (table below) as differentially expressed between the sampling sites and are suggested for further analysis at the remaining sites using NanoString. Concern remains over how to correctly choose sequences for the NanoString "primer" creation process.
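The filtering, de-duplication, and description matching described in this section can be sketched as follows. This is an illustrative Python version under stated assumptions, not the R code used: it assumes the intent is to keep contigs with p < 0.05 and at least a two-fold change in either direction, that the reported log-fold change is log2, and that descriptions are matched exactly after lower-casing. All contig names, IDs, and values are invented.

```python
import math


def normalise(description):
    """Lower-case and collapse whitespace so descriptions compare reliably."""
    return " ".join(description.lower().split())


def stress_candidates(de_rows, annotation, stress_descriptions,
                      max_p=0.05, min_fold=2.0):
    """Return [(uniprot_id, description)] for DE contigs matching known stress genes.

    de_rows: iterable of (comparison, contig, log2_fold_change, p_value),
             pooled from all pairwise comparisons.
    annotation: {contig: (uniprot_id, description)} from the BLAST annotation.
    """
    min_log2fc = math.log2(min_fold)
    wanted = {normalise(d) for d in stress_descriptions}
    seen_ids = set()
    found = []
    for _comparison, contig, log2fc, p_value in de_rows:
        if p_value >= max_p or abs(log2fc) < min_log2fc:
            continue  # not significant, or change smaller than min_fold
        if contig not in annotation:
            continue  # contig had no match in the salmonid database
        uniprot_id, description = annotation[contig]
        if uniprot_id in seen_ids:
            continue  # keep only one contig per UniProt entry
        seen_ids.add(uniprot_id)
        if normalise(description) in wanted:
            found.append((uniprot_id, description))
    return found


# Hypothetical pooled DE table, annotation, and stress gene list.
de_rows = [
    ("A_vs_B", "contig_1", 2.5, 0.001),
    ("A_vs_C", "contig_1", 1.8, 0.010),
    ("A_vs_B", "contig_2", 0.4, 0.001),
    ("B_vs_C", "contig_3", -1.6, 0.200),
    ("B_vs_C", "contig_4", -3.0, 0.002),
    ("A_vs_C", "contig_5", 1.2, 0.030),
]
annotation = {
    "contig_1": ("Q0001", "Heat shock protein 70"),
    "contig_2": ("Q0002", "Glucokinase"),
    "contig_3": ("Q0003", "Metallothionein A"),
    "contig_4": ("Q0001", "Heat shock protein 70"),
    "contig_5": ("Q0004", "Transferrin"),
}
stress_descriptions = ["heat shock protein 70", "metallothionein A"]

candidates = stress_candidates(de_rows, annotation, stress_descriptions)
print(candidates)
```

Exact matching on normalized descriptions is the simplest choice; it will miss entries whose descriptions differ in wording (for example isoform suffixes), so a manual check of near matches is still likely to be needed.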
Gene Ontology Enrichment Analysis
Gene ontology (GO) terms associated with the annotated genes were retrieved by querying the UniProt database at uniprot.org. These GO terms were then joined with the differential expression tables produced above. Using the R package topGO, each pairwise comparison was tested to determine which terms were significantly over-represented (p-value < 0.05). This was completed for 3 pairwise comparisons (treatment conditions versus reference). The top 15 over-represented GO terms for molecular function are shown below.
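For readers unfamiliar with over-representation testing, the following Python sketch shows a one-sided hypergeometric (Fisher-type) test per GO term. It is a simplified stand-in for the idea, not a reproduction of topGO, which offers several algorithms that also account for the GO graph structure; which of those was used is not stated here. The gene names and the gene-to-term assignments are invented for illustration.

```python
from math import comb


def enrichment_p(n_universe, n_term, n_de, n_de_term):
    """P(X >= n_de_term) when drawing n_de genes from n_universe,
    of which n_term carry the GO term (hypergeometric upper tail)."""
    upper = min(n_term, n_de)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_de - k)
        for k in range(n_de_term, upper + 1)
    )
    return tail / total


def rank_terms(gene2go, de_genes):
    """Return [(p_value, term)] sorted from most to least over-represented."""
    universe = set(gene2go)
    de = set(de_genes) & universe
    terms = set().union(*gene2go.values())
    results = []
    for term in terms:
        annotated = {gene for gene, go in gene2go.items() if term in go}
        p = enrichment_p(len(universe), len(annotated), len(de), len(annotated & de))
        results.append((p, term))
    return sorted(results)


# Hypothetical universe of 20 annotated genes with invented term assignments.
gene2go = {f"g{i}": set() for i in range(1, 21)}
for i in range(1, 6):
    gene2go[f"g{i}"].add("GO:0005524")   # term carried by 5 genes
for i in range(6, 16):
    gene2go[f"g{i}"].add("GO:0003677")   # term carried by 10 genes
de_genes = {"g1", "g2", "g3", "g4", "g6", "g7"}

ranked = rank_terms(gene2go, de_genes)
for p, term in ranked:
    print(term, round(p, 4))
```

In this toy example four of the six DE genes carry the first term although only a quarter of the universe does, so it ranks first with a small p-value, while the second term is not over-represented.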
Wiseman, S., et al. 2007. Gene expression pattern in the liver during recovery from an acute stressor in rainbow trout. Comparative Biochemistry and Physiology Part D: Genomics and Proteomics, 234-244.