De Novo Transcriptome Assembly
Files were retrieved from the UW genomics facility in fastq format, with each individual lane of sequencing and sample (fish) given its own file resulting in 4 files for each individual. Files were first concatenated by individual so that I had 48 fastq files, representing 6 individuals from each of four sampling locations, with 2 files of paired end reads for each individual. The largest file from each site, meaning the sample from each site with the largest number of reads, was identified, concatenated together in the correct order to form 2 files (one for each read direction) and uploaded through genomespace/google drive to usegalaxy.org. These individuals were:
10ACO (J.fastq), 1AIS (I.fastq), 6AJE (V.fastq), 17ASW (F.fastq)
Fastqc was run on each file to determine if adapter contamination existed, and to identify if timing of low quality reads was needed. No major issues were found, though base quality degraded in the 120-150 bp range. No trimming was performed.
After the QC checks these files were transferred to the Indiana University galaxy instance for transcriptome construction using Trinity. This was tried in two ways. The first was with the paired end data directly with the Trinity pipeline, and second with the paired-end data first merged using PEAR, and than ran through the Trinity pipeline:
The merged data resulted in a greater number of reads mapping back to the transcriptome in downstream processing, so this was used for the remainder of this analysis. That is, although the PE data resulted in more assembled contigs (172,930 vs 138,481), the mapping rate using sailfish was around 60% using PE data vs 85-90% for merged data, resulting in more mapping reads using the transcritome from merged data.
Individual sample file prep and read counting
Individual sample paired end fastq files were uploaded to the usegalaxy.org server, and the same PEAR merging script that was used for the transctiome files was used to make a single fastq file of reads for each individual:
Read for each individual were used to map back to the merged_transcriptome.fastq to make a table of read counts per contig using the Sailfish script (a K-mer and index counting method) :
This quantification counted the number of reads aligning to assembled transcripts (contigs), but further work should be done to count reads at the gene level. (I’m not sure how to do this)
The resulting sailfish table for each individual was in the this format:
Length= the length of the each contig
TPM= Total per million reads
NumReads= count of reads mapping to each transcript
These tabular files were downloaded, and pulled into excel to make a matrix for each individual with the name of the transcript and number of reads. These matrices (one for each individual) were used for differential expression analysis.
Differential expression analysis
Read count matrices for each individual were imported into R and the package Edger used for differential expression analysis. For this initial analysis pairwise comparisons were made against the reference site, resulting in three comparisons. The r-code is as follows:
Import files and assign to one of four factors:
1= Coulter Creek (reference)
2= Issaquah Creek (tier 1-low urban)
3= Jenkins Creek (tier 4- moderate urban)
4= Swamp Creek (tier 5- high urban)
Removed lowly expressed transcripts (fewer that 2 reads per million amongst all individuals):
Set experimental design based on factors and fit glm:
Run test for significance for each pairwise comparison and output results table:
The resulting differential expression tables look like this:
For each site comparison, transcripts were extracted that showed significant differentiation, (p-value less than .05). This list was further divided into those transcripts showing up-regulation (positive logFC) and those showing down-regulation (negative logFC). This resulted in 6 files, two for each comparison of significantly differentiated gene expression, one for up-regulated genes and one for down-regulated genes. These were used for gene identification (blast).
For this analysis both the six files and the larger merged trancriptome were “blasted” against the swissprot database to identify potential genes associated with assembled and differentially expressed transcripts. The following code was used in terminal for each file:
The resulting .xml output was paired with the original .fasta reads in Blast2Go and mapped to gene ontology function:
This allows for a searchable database of differentially expressed gene and their respective function. I have have roughly 2000-3000 differentially expressed transcripts at each site (roughly half up- and half down-regulated). At this point I’m not sure how to best visualize and present this data…
Some observations and concerns:
-The annotation of the full transciptome resulted in only about 50,000 of the 138,481 identified in the swissprot database or ~36%, whereas the annotation of the differential expression lists is routinely resulting in over 90% identification. It seems differentially expressed transcripts for some reason are more likely those that are also identifiable.
-I would like to figure out how to conduct the sailfish analysis at the gene level instead of the transcript level so that the differential expression analysis is cleaner, I’m not sure how to make the needed GTF file from the blast results that matches identified genes to transcript name. This would allow me to get aggregated gene-level abundance estimates using sailfish instead of transcript abundance estimates.
-The next step will be to identify a list of gene targets that are of interest to analyze in the rest of my samples. What are some good ways to approach the list of differentially expressed genes to identify genes that are related to a stress or immune response?