Kaitlyn’s notebook: clustering on real NSAF values

These plots were made using silo3and9_nozerovals_noincnstprot.csv.

Workflow for hierarchical clustering:

  1. Make a dissimilarity matrix (Euclidean distance)
  2. Cluster with hclust (note: hclust ships in R's stats package; the cluster package provides agnes and the agglomerative coefficient)
  3. Agglomerative coefficient (0.9991113) and cophenetic correlation (0.9635755)
  4. Scree plot
  5. Dendrogram: find the height to cut at based on branching (cut at 50)
  6. 33 clusters with protein abundances
  7. Heatmaps: combined silos and individual silos
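To make steps 1, 2 and 5 concrete, here is a minimal plain-Python sketch on toy data. It is an illustration only, not the R code used for these plots. The linkage method is not stated above, so average linkage is an assumption, and the protein names, values and cut height are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length abundance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_at_height(profiles, cut_height):
    """Naive agglomerative clustering, stopped at cut_height.

    profiles: dict mapping a row name to a list of abundance values.
    Average linkage is an assumption; the notebook does not name a method.
    Returns a sorted list of clusters, each a sorted list of row names.
    """
    names = list(profiles)
    # Step 1: dissimilarity matrix (Euclidean), stored per ordered pair.
    dist = {
        (a, b): euclidean(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between members.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    # Step 2: repeatedly merge the two closest clusters.
    while len(clusters) > 1:
        height, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Step 5: stop once the next merge would sit above the cut height.
        if height > cut_height:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Toy abundance profiles (hypothetical values, three time points each).
toy = {
    "protA": [1.0, 2.0, 3.0],
    "protB": [1.5, 2.0, 3.5],
    "protC": [40.0, 42.0, 41.0],
    "protD": [41.0, 42.0, 40.0],
}
print(cluster_at_height(toy, cut_height=5.0))
```

Stopping the merges at the cut height gives the same groups as cutting the finished dendrogram at that height, because average-linkage merge heights never decrease.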

4. Scree plot (scree-plot.jpeg), made from the clustered data. The elbow is difficult to find, so we make a dendrogram to find the best place to ‘cut’ the data.

5. Dendrogram (dendrogram.jpeg). The red line marks where the dendrogram was cut; the height was chosen to sit just before heavy branching begins.

  6. A total of 33 clusters were produced. The first image shows all proteins; the second shows only the proteins that clustered differentially. The proteins in the second image were used for the heatmaps.
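As a rough sketch of how the differentially clustered proteins could be pulled out, the snippet below assumes each protein has one cluster label per silo and keeps the proteins whose labels differ. That data layout, the function name and the example labels are all hypothetical, and this is not the code that produced the images.

```python
def differentially_clustered(silo3_labels, silo9_labels):
    """Return proteins whose cluster label differs between the two silos.

    Both arguments map a protein ID to its cluster number (assumed layout).
    Proteins missing from either silo are skipped.
    """
    shared = silo3_labels.keys() & silo9_labels.keys()
    return sorted(p for p in shared if silo3_labels[p] != silo9_labels[p])

# Hypothetical cluster assignments for three proteins.
silo3 = {"protA": 1, "protB": 2, "protC": 7}
silo9 = {"protA": 1, "protB": 5, "protC": 3}
print(differentially_clustered(silo3, silo9))  # ['protB', 'protC']
```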


  7. Heatmap of differentially clustering proteins.

These protein abundances differ between silos. It makes sense to plot both silos together on one heatmap to see this difference, but does it make sense to cluster the rows of that heatmap when the proteins were selected because they clustered differently? Clustering groups proteins with similar patterns together, and I want to show similar patterns within silo 3 and within silo 9 separately, not across both.

It looks like there are three main groups of abundance profiles found by clustering:

  1. The bottom group has fairly consistent abundance values in both silos, but it looks like day 3 in silo 3 separated them. Abundance levels were very low for these proteins initially.
  2. The second group, in the middle, appears to have higher abundance at the end of the experiment in silo 9 than in silo 3.
  3. The third group is the reverse of the second: abundance is greater at the beginning of the experiment in silo 3 than in silo 9.

Additionally, would the abundance-by-cluster line plots complement this figure by showing abundance for each group, or would that be redundant?
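For reference, the values behind an abundance-by-cluster line plot can be as simple as the mean abundance per time point within each cluster. The sketch below uses made-up proteins and labels and a hypothetical function name, and averaging is only one possible summary (individual protein lines per cluster would be another).

```python
from collections import defaultdict
from statistics import mean

def cluster_mean_profiles(profiles, labels):
    """Average abundance at each time point for every cluster.

    profiles: protein ID -> list of abundances (same time points for all).
    labels:   protein ID -> cluster number.
    """
    by_cluster = defaultdict(list)
    for protein, values in profiles.items():
        by_cluster[labels[protein]].append(values)
    # zip(*rows) transposes, so each tuple holds one time point's values.
    return {c: [mean(col) for col in zip(*rows)] for c, rows in by_cluster.items()}

# Hypothetical abundances at two time points.
profiles = {"protA": [1.0, 3.0], "protB": [3.0, 5.0], "protC": [10.0, 10.0]}
labels = {"protA": 1, "protB": 1, "protC": 2}
print(cluster_mean_profiles(profiles, labels))  # {1: [2.0, 4.0], 2: [10.0, 10.0]}
```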

Silo 3


Silo 9


Does showing the silos individually, i.e., clustering the proteins within each silo separately, show the differences in abundance between silos better?