My current cluster eliminates proteins that were never detected because I combined the data sets from silo 3 and silo 9 that contained only the corresponding silo’s abundant proteins. This means that when my cluster analysis is finished, I have uneven amounts of proteins for each silo. I want to create an even number of proteins per silo at the end of the cluster. This means I will need to edit the original raw data containing all of the silos rather than working off of the separate silo data sets.
After I do this, I will rerun the cluster analysis to get a new list of ‘unique’ proteins (unique proteins are defined as those that were in separate cluster groups based on temperature [ie. silo]). This final unique-proteins dataframe will be used for gene enrichment and to create heatmaps. I think a unique possible heatmap would be of parent terms based on the abundance of the proteins whose genes are annotated to that parent term.
I got my computer back so I made sure that all my files are up to date on all systems. I heavily modified my cluster code to create a more accurate unique-proteins dataframe since previously it had redundant and incorrect columns mixed in with correct columns. Scales for heatmaps are normalized abundance values.
Note that in this heatmap, the data has been:
- normalized (mean centered)
- mSet<-Normalization(mSet, "NULL", "NULL", "MeanCenter", ratio=FALSE, ratioNum=20)
- mSet<-PlotNormSummary(mSet, "norm_0_", "png", 72, width=NA)
- mSet<-PlotSampleNormSummary(mSet, "snorm_0_", "png", 72, width=NA)
If needed later: displaying two heatmaps by each other.
Now that I have unique proteins, I need to map them back to biological processes. Gene enrichment examines what genes are over-represented relative to a background genome. A single biological process maps to multiple genes. A biological process (cellular component, or molecular function) is enriched based on the additive changes in genes that occurred based on a treatment or phenotype. Therefore, the background of list of genes you analyze when identifying enriched processes is very important.
My current list of unique proteins examines two of my three silos. I have a couple options with this data for a gene list:
- split up silos and examine enrichment based on unique proteins per silo
- look at enriched processes of all unique proteins.
For my background I can include:
- all detected proteins in the experiment (all silos)
- detected proteins in silo 3 and 9 only
- all unique proteins in silo 3 and 9.
I’m not sure which background would be most useful for determining the differences between temperature.
I want to visualize the differences that are occurring between the silos in addition to identifying the changes between biological processes. Heatmaps are a great way to view that information. Considering I have 1532 unique proteins between the silos, a heatmap with them would be difficult to interpret. It would be nice if I could incorporate biological process into the heatmap, however I am not sure how to do that.
When Shelly and I discussed this, I thought that I could map the parent GO term back to the protein using the Uniprot accession code and then plot the abundances of the proteins underneath the parent term in a heat map. I don’t know if protein abundances can be representative of enriched processes though since the group of genes is determined independent of any quantitative value.
Terms to refer back to for DAVID…
- RT: related term (genes that share similar sets of annotation terms are in the same biological mechanism) by Kappa value (closer to 1 = better)
- COG ontology: NCBI’s COG (Clusters of Orthologous Groups of proteins) for genome scale analysis of proteins functions and evolution
- SP_PIR_KEYWORDS: SwissProt/Uniprot and PIR keywords
- UP_SEQ_FEATURE: annotation category, Uniprot Sequence Feature at the Uniprot site
- SCOP_..: Protein structure
- GO FAT: filters out broad GO terms based on a “measured specificity” of each term (not by level)
- GO_ALL is all levels
- GO_GP_1 is level 1
- Information typically increases with levels as nodes annotate deeper, however information can decrease if a node does not annotate any deeper
- GO_Direct: directly annotated by source database without any parent terms (but low specificity)
- Functional Annotation Clustering report groups/displays similar annotations together which improves the ability to understand the biology easier compared to the traditional Functional annotation chart
- Initial Group Members- min gene number in seeding group (lower = more genes/functional group/many small groups)
- Final Group Members- min gene number after cleanup (number of genes cluster must have to be presented)
- Multi-Linkage Threshold- how seeding groups are merged (higher % = sharper separation)
- Clustering typically contains more than chart b/c of significant neighbors
Parsing out unique proteins from hierarchical clusters:
I updated the method to average rather than ward.D2 (Ward 1963) because it changed the cophenetic correlation from 0.6299225 to 0.9433488. The typical accepted value is at least 0.75. There is now a total of 41 clusters (note the agglomerative coefficient = 0.9959262).
Most proteins are in cluster 1 (14381/14510 = 98.68%). However, because the cluster hierarchy represents the original object-by-object dissimilarity matrix so well, I am not so worried about this.
Now I want to remove proteins that are in the same cluster. I’m guessing most of cluster 1 will be removed, but this will help remove many proteins (and thus noise) in my data and highlight the proteins that cluster differently based on abundance. My attempts to solve this problem are in this issue.
This spreadsheet contains a list of proteins that were uniquely abundant between silos.
Now I want to move on to gene enrichment of these proteins, but I need to determine the most appropriate background for DAVID or what the background is for CompGO.
The importance of a genetic background. I’m going to use all of the proteins detected in silo 3 and 9 which were all used for the cluster analysis. I chose not to use the proteins that may be unique in silo 2 since it was not included in this analysis, however it will be worthwhile in the future to do the same methods done here between silos 2 and 3 since there were mortality differences despite the silos being the same temperature. I will only use detected proteins in silos 2 and 3 at that point, and depending on the results, I could redo the same method with all silos and use silos 2 and 3 as a replicate. The proteins would be distinguished by temperature only at this point rather than by group.
Quick note on small things I learned so I won’t forget down the road…
- I was having some problems with git finding removed files and Github wanting to track them. This code fixes that in a user friendly way:
git clean -i -fd
- -i for interactive
-f for file
-d for directory
- Note: Add -n or –dry-run to just check what it will do.
- -i for interactive
- R is a 1 based index.
I also installed Shutter on Roadrunner using the Ubuntu Software (App store) for image editing to post my issue.
I ran the broodstock DNA extractions on the Bioanalzyer today (10/10). Here are the results. For the first chip, the samples are:
1 – 10-3
The second chip is already labelled, and I reran UK-08 on it as well. Unfortunately, during preparation of the chip with the gel dye matrix, the plunger clip released to a higher position. I think this disrupted the gel matrix in the chip and caused some errors on the gel matrix although the electrophregram.
Previously, I was attempting to use a Bray-Curtis distance matrix which was used on my cluster analysis for each individual silo. However, bray-curtis is an asymmetrical analysis such that any double zero values will be removed. My data contains several double zero values (where abundances weren’t detected for multiple days or in multiple silos for a day), but that is relevant information when examining the pattern of abundances. I choose to use a euclidean distance matrix instead because it is the most commonly used symmetrical distance matrix.
I also changed the method I was using. Average is still a good choice but I think Ward’s method (1963) will be better since it is less sensitive to outliers. It minimizes the error of sum of squares by minimizing the increase in the sum of squares distances at each step. Average gave the best cophentic correlation (0.9433488). This should be done on my previous clusters for each individual silo if we choose to utilize those plots. This is not kmeans clustering- it is hierarchical clustering.
The table can be found here (along with a frequency table, scree plot, dendrogram and line plots which are pasted below for convenience.)
In the table, there is a protein for Silo 3 and a protein for Silo 9 so there are 2 of each proteins. Clusters are represented in a separate column. I want to determine what proteins were sorted into the same cluster for each silo. Obviously, I could do this manually but that would take a while. Next, I need a code that removes a protein if the cluster is the same for the duplicated protein.
First, loading values in ASCA…
The goal: I want to get the loading values (eigen values) after I reformatted the data and ran the ASCA in R to look at what proteins had the highest values since those proteins are the most influential in my dataset. I will plot these proteins’ abundances over time.
Shelly and I parsed out that the 1 represents PC1, and that the V values in svd are the loadings based on the MetStaT github code that explains the “behind the scenes” code that happens when using the package MetStaT. So we can examine the laoding values by running
loadings <- as.data.frame(ASCA$1
$svd$v[,1]). These were the lines in github that deciphered which table the loading values were:
pr.object <- asca[[ee]]$svd
pcs[tuple] # 1st element of tuple contains index of PC1 within PCs
However, the proteins are not labelled in the loadings values:
There are the correct number of loadings, 7988, and I assume they coordinate with the same column row of the protein, but because I can’t be sure, I’m going to revisit Metboanalyst where I believe I can get a loading values table with the protein name. If the results look promising there, I will try to create the corresponding row names in the above dataframe and I will compare it to the Metboanalyst results.
Originally I was having errors running my data through Metboanalyst. I had my datasheet organized with my samples in columns in the same order Metboanalyst shows, however it seems that the actual software needs the order as ‘Sample’ followed by ‘Temperature’ and finally ‘Time’ in my case because it reads ‘Time’ as the number of groups in your data, and states that you need three replicates per group. Essentially, make sure that the second row is the group that contains at least three replicates.
I reorganized my data for the ASCA by modifying the code Shelly created. This time the samples are in rows with ‘Temperature’ as the first factor so Metboanalyst understands that there are replicates for this factor. I also made sure to select the Time-Series/Two-factor module rather than the Statistical Analysis Module.
Options and Results:
1) Data processing information:
2) I choose the standard deviation for data filtering.
3) Normalization –> Data scaling –> Mean centering:
4) Analysis paths:
Multivariate- ASCA (The time is not in order because it is ordered by the first character.):
These results are not significant: implying that this data needs to be parsed out, possibly by the loadings- if I only run the high loading valued proteins, or by eliminating similarly abundant proteins via kmeans clustering.
SPE is the squared prediction error which measures the expected squared distance between the predicted value and the true value (ie. it measures the quality of the predictor).
Leverage measures the influence of each observation for a principal component. Score plots will identify observations with high leverages, ie. observations that tend to pull the PCA towards them.
Outliers are proteins that do not follow the general trend of the data, but don’t necessarily have a high leverage on the PCA. A high leverage value is not necessarily an outlier because it can leverage the PCA but still maintain the trend of the data.
As you can see the plots don’t decipher much information because of the amount of data, but you can download the associated tables of outliers and significant features.
Leverage threshold was 0.9 and alpha threshold was 0.05.
Highest leverage value for outliers in consideration of temperature.
There was no significant features for time or interactions, only for temperature.
ANOVA2 (two way anova) gave no significant results and thus did not plot anything.
MEBA sounds like something I would be interested in but it gives the error, “Please make sure data are balanced for time series analysis, In particular, for each time points, all experiments must exist and cannot be missing.”
Today I finished extracting the broodstock DNA You can see the details here!
I also helped Grace run her samples on the bioanalyzer at the end of the day.