Kaitlyn’s notebook: Unique proteins


Parsing out unique proteins from hierarchical clusters:

I updated the method to average rather than ward.D2 (Ward 1963) because it changed the cophenetic correlation from 0.6299225 to 0.9433488. The typical accepted value is at least 0.75. There is now a total of 41 clusters (note the agglomerative coefficient = 0.9959262).

Most proteins are in cluster 1 (14381/14510 = 98.68%). However, because the cluster hierarchy represents the original object-by-object dissimilarity matrix so well, I am not so worried about this.


Now I want to remove proteins that are in the same cluster. I’m guessing most of cluster 1 will be removed, but this will help remove many proteins (and thus noise) in my data and highlight the proteins that cluster differently based on abundance. My attempts to solve this problem are in this issue.

This spreadsheet contains a list of proteins that were uniquely abundant between silos.

Now I want to move on to gene enrichment of these proteins, but I need to determine the most appropriate background for DAVID or what the background is for CompGO.

The importance of a genetic background. I’m going to use all of the proteins detected in silo 3 and 9 which were all used for the cluster analysis. I chose not to use the proteins that may be unique in silo 2 since it was not included in this analysis, however it will be worthwhile in the future to do the same methods done here between silos 2 and 3 since there were mortality differences despite the silos being the same temperature. I will only use detected proteins in silos 2 and 3 at that point, and depending on the results, I could redo the same method with all silos and use silos 2 and 3 as a replicate. The proteins would be distinguished by temperature only at this point rather than by group.

Quick note on small things I learned so I won’t forget down the road…

  • I was having some problems with git finding removed files and Github wanting to track them. This code fixes that in a user friendly way:
git clean -i -fd
    • -i for interactive
      -f for file
      -d for directory
      • Note: Add -n or –dry-run to just check what it will do.
  • R is a 1 based index.

I also installed Shutter on Roadrunner using the Ubuntu Software (App store) for image editing to post my issue.

Kaitlyn’s notebook: Bioanalyzer and Clustering

Bioanalyzer Results

I ran the broodstock DNA extractions on the Bioanalzyer today (10/10). Here are the results. For the first chip, the samples are:

1 – 10-3
2- 3-T1
3- UK-05
4- 12-T6
5- UK-06
6- 8-T2
7- 11-T4
8- UK-02
9- 7-T2
10- UK-08
11- 5-T3

The second chip is already labelled, and I reran UK-08  on it as well. Unfortunately, during preparation of the chip with the gel dye matrix, the plunger clip released to a higher position. I think this disrupted the gel matrix in the chip and caused some errors on the gel matrix although the electrophregram.


Previously, I was attempting to use a Bray-Curtis distance matrix which was used on my  cluster analysis for each individual silo. However, bray-curtis is an asymmetrical analysis such that any double zero values will be removed. My data contains several double zero values (where abundances weren’t detected for multiple days or in multiple silos for a day), but that is relevant information when examining the pattern of abundances. I choose to use a euclidean distance matrix instead because it is the most commonly used symmetrical distance matrix. I also changed the method I was using. Average is still a good choice but I think Ward’s method (1963) will be better since it is less sensitive to outliers. It minimizes the error of sum of squares by minimizing the increase in the sum of squares distances at each step. Average gave the best cophentic correlation (0.9433488). This should be done on my previous clusters for each individual silo if we choose to utilize those plots. This is not kmeans clustering- it is hierarchical clustering.

The table can be found here (along with a frequency table, scree plot, dendrogram and line plots which are pasted below for convenience.)

In the table, there is a protein for Silo 3 and a protein for Silo 9 so there are 2 of each proteins. Clusters are represented in a separate column. I want to determine what proteins were sorted into the same cluster for each silo. Obviously, I could do this manually but that would take a while. Next, I need a code that removes a protein if the cluster is the same for the duplicated protein.

Kaitlyn’s notebook: Metboanalyst round 2

First, loading values in ASCA…

The goal: I want to get the loading values (eigen values) after I reformatted the data and ran the ASCA in R to look at what proteins had the highest values since those proteins are the most influential in my dataset. I will plot these proteins’ abundances over time.

Shelly and I parsed out that the 1 represents PC1, and that the V values in svd are the loadings based on the MetStaT github code that explains the “behind the scenes” code that happens when using the package MetStaT. So we can examine the laoding values by running loadings <- as.data.frame(ASCA$1$svd$v[,1]). These were the lines in github that deciphered which table the loading values were:

pr.object <- asca[[ee]]$svd

pcs[tuple[1]] # 1st element of tuple contains index of PC1 within PCs


However, the proteins are not labelled in the loadings values: loading-values-ASCA

There are the correct number of loadings, 7988, and I assume they coordinate with the same column row of the protein, but because I can’t be sure, I’m going to revisit Metboanalyst where I believe I can get a loading values table with the protein name. If the results look promising there, I will try to create the corresponding row names in the above dataframe and I will compare it to the Metboanalyst results.

Now, Metboanalyst-

Originally I was having errors running my data through Metboanalyst. I had my datasheet organized with my samples in columns in the same order Metboanalyst shows, however it seems that the actual software needs the order as ‘Sample’ followed by ‘Temperature’ and finally ‘Time’ in my case because it reads ‘Time’ as the number of groups in your data, and states that you need three replicates per group. Essentially, make sure that the second row is the group that contains at least three replicates.

I reorganized my data for the ASCA by modifying the code Shelly created. This time the samples are in rows with ‘Temperature’ as the first factor so Metboanalyst understands that there are replicates for this factor. I also made sure to select the Time-Series/Two-factor module rather than the Statistical Analysis Module.

Options and Results:

1) Data processing information: data-processing-info

2) I choose the standard deviation for data filtering.

3) Normalization –> Data scaling –> Mean centering:


4) Analysis paths: analyses-pathes

Multivariate- ASCA (The time is not in order because it is ordered by the first character.):


These results are not significant: implying that this data needs to be parsed out, possibly by the loadings- if I only run the high loading valued proteins, or by eliminating similarly abundant proteins via kmeans clustering.




SPE is the squared prediction error which measures the expected squared distance between the predicted value and the true value (ie. it measures the quality of the predictor).

Leverage measures the influence of each observation for a principal component. Score plots will identify observations with high leverages, ie. observations that tend to pull the PCA towards them.

Outliers are proteins that do not follow the general trend of the data, but don’t necessarily have a high leverage on the PCA. A high leverage value is not necessarily an outlier because it can leverage the PCA but still maintain the trend of the data.

As you can see the plots don’t decipher much information because of the amount of data, but you can download the associated tables of outliers and significant features.



temp-featuresLeverage threshold was 0.9 and alpha threshold was 0.05.


Highest leverage value for outliers in consideration of temperature.

There was no significant features for time or interactions, only for temperature.

ANOVA2 (two way anova) gave no significant results and thus did not plot anything.

MEBA sounds like something I would be interested in but it gives the error, “Please make sure data are balanced for time series analysis, In particular, for each time points, all experiments must exist and cannot be missing.”

Kaitlyn’s notebook: DNA extractions and Bioanalyzer samples

Today I finished extracting the broodstock DNA You can see the details here!

I also helped Grace run her samples on the bioanalyzer at the end of the day.

Kaitlyn’s notebook: ASCA results

I used the new silo3_9 table I made for kmeans clustering and modified Shelly’s code to do the ASCA. Factor 1 is temperature and factor 2 is time.

Helpful tips:


A combination of variables can explain the total variation based on temperature (PC1 = 100% for temperature) while 52.15% of the variance is explained by a single principal component for time. The interaction of time and temp produce the least amount of explained variance for PC1 (42.20%).

The sum of squares model describes how well the data fits a linear regression. Centering the data is important because PCA is a regression model without intercept. Non-centered data can be misleading since the eigen vector may point in the appropriate direction. Time contributes to the variance the most at 63.84% if the data is centered.

Temperature loadings plot of PC1 and PC2:

Time loadings plot of PC1 and PC2:

Interaction loadings plot of PC1 and PC2:

Single Score plot for Temp:

Single score plot for Time:

Single score plot for interaction:

Score plot with projected data for Temp:

Score plot with projected data for Time:

Score plot with projected data for Interaction:

ASCA.DoPermutationTest() shows that no factors have a significant effect on protein abundance, but time has the greatest effect.
1(Temp): 0.119
2(Time): 0.015
12(Interaction): 0.191

Kaitlyn’s notebook: Bioanalzyer results

I ran Yaamini’s samples on the bioanalyzer and got the results from the procedure tests for broodstock C. gigas DNA extraction from blocks.


Kaitlyn’s notebook: ASCA in MetStaT

I downloaded the R package MetStat to preform an ASCA. I’m currently working through how best to organize the data because you input two dataframes. This is the example I’m working with.

I believe the first dataframe or the ‘data’ will only contain a list of the proteins and abundances (“variables are represented by columns, observations by rows”). Proteins will be in rows columns and abundances will be in columns rows  however, I don’t think order will be considered. Unless proteins should be in columns with abundance in rows… I’m going to look more into this tomorrow or Thursday since the ASCA is supposed to take into consideration time (that was the whole point) and I’ll update this. because observations are measured values (eg. abundance) and variables are what is observed/measured (eg. proteins).

The second dataframe or ‘levels’ will contain the temperature data for each protein/observation (“numeric matrix describing the experimental design. Each factor is represented by a column. The elements of the columns give the treatment level the row belongs to”). There will be one column representing a signal factor (temperature) and the elements will be 23 or 29, but I’m not sure how/if I can make time a factor?

I think the column would have to be “3, 5, 7, 9, 11, 13, 15” repeating, and I need to make sure that the order of proteins is the same for both dataframes so that the elements (23 or 29C) correctly match the factor (temperature). The data will only be described by temperature.

This will be okay since I will only use Silo 3 and 9. If I decide to do an ASCA between the 23C silos then I will make the elements “silo 2” or “silo 3″ for the factor.

Equation elements are specified as a string that indicates the factor to use in the ASCA. Factors are specified by the column (eg. =”1″) or interacting factors can be considered (=”123″). Multiple factors can also be entered (=”1,2,12”).

ASCA.Calculate(data, levels, equation.elements ="")

I wrote up an issue for help in our github.