Update on RNAseq analysis

Re-annotation based on salmonid database, differential expression analysis, stress gene identification, and gene ontology enrichment analysis summarized. R-code and files available upon request.

Annotation:
I wanted to rework the annotation and differential expression analysis of my RNAseq data before identifying stress genes for future analysis. The original annotation was against all species in the reviewed swissprot database, which is hard to compare to other studies of salmonids as gene descriptions are not easily compared this database only include ~200 genes belonging to the closely related rainbow trout species (Onchoryncus my kiss). In an attempt to have a more uniform annotation to compare to other studies of salmonid species, the transcriptome was again blasted against a database of salmonid only proteins. To create this database, all entries in the uniprot TrEMBL (unreviewed) and Swiss-Prot (reviewed) databases belonging to all Onchoryncus species were downloaded in fasta format and a searchable blast database created using blast+. By including the RrEMBL annotations in a searchable database I increased possible salmonid matches from ~200 to >60,000. The transcriptome was blasted against custom database resulting in 132370 contigs mapping to a gene match with 34,653 unique matches. This list was used for gene stress identification and for GO enrichment analysis.

Differential Expression (DE) Analysis:

In an attempt to simplify the differential expression (DE) analysis pairwise comparisons were recalculated using the simplified exact-test (versus the generalized linear model test used early), since only one variable (site location) was different for the samples. Raw counts were used to construct a counts matrix consisting of all 24 individuals with contigs identified by TrinityID. Non dimensional scaling was used to explore the structure of these raw counts:

Screen Shot 2016-08-22 at 1.35.44 PM

This count matrix was used in the R-package Edger. Counts were normalized by individual library size and lowly expressed contigs removed from analysis. Pairwise comparisons were conducted on all possible combinations of sites. These pairwise comparisons result in a DE data set for each comparison, reporting log-fold change and p-value. These DE tables were used for stress gene and gene ontology enrichment analysis described below.

Identification of Stress Genes

For the identification of known stress genes in the RNAseq data, the DE matrices were combined without regard to site comparison. The interest here is purely in which genes from the potential list of stress genes are being differentially expressed at any of the sites compared to any other. This combined list was then filtered to remove contigs with a p-value < .05 and a fold change <2. This list was then annotated using the salmonid derived blast data set. Redundant genes from this list were removed for the purpose of gene identification, that is contigs mapping to the same Uniprot enrtry ID were removed so only one remained. This list was then compared to the list of rainbow trout stress genes identified in Wiseman et al. (2007) by gene descriptions. Gene descriptions were used over gene IDs because often genes have more than one entry number in the Uniprot database either due to multiple entries or different isoforms. From this comparison 51 stress genes were identified (table below) as differentially expressed between the different sampling sites and are suggested for further analysis at the remaining sites via a Nanostring investigation method. Concern remains over how to correctly chose sequences for the Nanostring "primer" creation process.

DE_stress

Gene Ontology Enrichment Analysis

Related gene ontology (GO) terms that were associated with annotated genes were retrieved by querying the uniprot database at uniprot.org. These GO terms were then joined with the differential expression matrices produced above. Using the R-package topGO, pairwise comparisons were investigated to determine which terms were overrepresented based on significance (p-value >.05). This was completed for 3 pairwise comparisons (treatment conditions versus reference). Results representing the top 15 over-represented GO terms for molecular function function are below.

Screen Shot 2016-08-22 at 3.02.57 PM

Screen Shot 2016-08-22 at 3.05.20 PM

Screen Shot 2016-08-22 at 3.07.08 PM

Wiseman, S. 2007. Gene expression pattern in the liver during recovery from an acute stressor in rainbow trout. Comparative Biochemistry and Physiology Part D Genomics Proteomics. 234-44.

Spanjer- Summary of RNAseq analysis to-date

De Novo Transcriptome Assembly

Files were retrieved from the UW genomics facility in fastq format, with each individual lane of sequencing and sample (fish) given its own file resulting in 4 files for each individual. Files were first concatenated by individual so that I had 48 fastq files, representing 6 individuals from each of four sampling locations, with 2 files of paired end reads for each individual. The largest file from each site, meaning the sample from each site with the largest number of reads, was identified, concatenated together in the correct order to form 2 files (one for each read direction) and uploaded through genomespace/google drive to usegalaxy.org. These individuals were:

10ACO (J.fastq), 1AIS (I.fastq), 6AJE (V.fastq), 17ASW (F.fastq)

Fastqc was run on each file to determine if adapter contamination existed, and to identify if timing of low quality reads was needed. No major issues were found, though base quality degraded in the 120-150 bp range. No trimming was performed.

After the QC checks these files were transferred to the Indiana University galaxy instance for transcriptome construction using Trinity. This was tried in two ways. The first was with the paired end data directly with the Trinity pipeline, and second with the paired-end data first merged using PEAR, and than ran through the Trinity pipeline:

Screen Shot 2016-06-30 at 11.40.15 AM

The merged data resulted in a greater number of reads mapping back to the transcriptome in downstream processing, so this was used for the remainder of this analysis. That is, although the PE data resulted in more assembled contigs (172,930 vs 138,481), the mapping rate using sailfish was around 60% using PE data vs 85-90% for merged data, resulting in more mapping reads using the transcritome from merged data.

Individual sample file prep and read counting

Individual sample paired end fastq files were uploaded to the usegalaxy.org server, and the same PEAR merging script that was used for the transctiome files was used to make a single fastq file of reads for each individual:

Screen Shot 2016-06-30 at 11.42.20 AM

Read for each individual were used to map back to the merged_transcriptome.fastq to make a table of read counts per contig using the Sailfish script (a K-mer and index counting method) :

Screen Shot 2016-06-30 at 11.43.17 AM

This quantification counted the number of reads aligning to assembled transcripts (contigs), but further work should be done to count reads at the gene level. (I’m not sure how to do this)

The resulting sailfish table for each individual was in the this format:

Screen Shot 2016-06-30 at 11.43.48 AM

Length= the length of the each contig
TPM= Total per million reads
NumReads= count of reads mapping to each transcript

These tabular files were downloaded, and pulled into excel to make a matrix for each individual with the name of the transcript and number of reads. These matrices (one for each individual) were used for differential expression analysis.

Differential expression analysis

Read count matrices for each individual were imported into R and the package Edger used for differential expression analysis. For this initial analysis pairwise comparisons were made against the reference site, resulting in three comparisons. The r-code is as follows:

Import files and assign to one of four factors:
Screen Shot 2016-06-30 at 11.44.37 AM
factors:

1= Coulter Creek (reference)
2= Issaquah Creek (tier 1-low urban)
3= Jenkins Creek (tier 4- moderate urban)
4= Swamp Creek (tier 5- high urban)

Removed lowly expressed transcripts (fewer that 2 reads per million amongst all individuals):
Screen Shot 2016-06-30 at 11.45.11 AM

Set experimental design based on factors and fit glm:
Screen Shot 2016-06-30 at 11.45.45 AM

Run test for significance for each pairwise comparison and output results table:
Screen Shot 2016-06-30 at 11.46.28 AM

The resulting differential expression tables look like this:
Screen Shot 2016-06-30 at 11.46.57 AM

For each site comparison, transcripts were extracted that showed significant differentiation, (p-value less than .05). This list was further divided into those transcripts showing up-regulation (positive logFC) and those showing down-regulation (negative logFC). This resulted in 6 files, two for each comparison of significantly differentiated gene expression, one for up-regulated genes and one for down-regulated genes. These were used for gene identification (blast).

Transcript annotation

For this analysis both the six files and the larger merged trancriptome were “blasted” against the swissprot database to identify potential genes associated with assembled and differentially expressed transcripts. The following code was used in terminal for each file:
Screen Shot 2016-06-30 at 11.48.19 AM

The resulting .xml output was paired with the original .fasta reads in Blast2Go and mapped to gene ontology function:

Screen Shot 2016-06-30 at 11.48.50 AM

This allows for a searchable database of differentially expressed gene and their respective function. I have have roughly 2000-3000 differentially expressed transcripts at each site (roughly half up- and half down-regulated). At this point I’m not sure how to best visualize and present this data…

Some observations and concerns:

-The annotation of the full transciptome resulted in only about 50,000 of the 138,481 identified in the swissprot database or ~36%, whereas the annotation of the differential expression lists is routinely resulting in over 90% identification. It seems differentially expressed transcripts for some reason are more likely those that are also identifiable.

-I would like to figure out how to conduct the sailfish analysis at the gene level instead of the transcript level so that the differential expression analysis is cleaner, I’m not sure how to make the needed GTF file from the blast results that matches identified genes to transcript name. This would allow me to get aggregated gene-level abundance estimates using sailfish instead of transcript abundance estimates.

-The next step will be to identify a list of gene targets that are of interest to analyze in the rest of my samples. What are some good ways to approach the list of differentially expressed genes to identify genes that are related to a stress or immune response?