Goals for today:
Quick refresher: Hierarchical clustering compares abundance patterns between proteins. It does not model abundance as a function of time; instead, it treats each time point as an independent variable.
I wanted to better evaluate the cutoff value for my hierarchical clustering dendrogram. The dendrogram is inserted below (ignore the current red cutoff line).
[A dendrogram was created because the scree plot was not detailed enough to find an appropriate cutoff value (at the elbow).]
Clusters correspond directly to the nodes, so the higher the cutoff value (height), the fewer clusters are created, because fewer nodes are included.
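The relationship between cutoff height and cluster count can be sketched in a few lines. This is a minimal illustration, assuming the dendrogram is summarized by its merge heights (n - 1 merges for n proteins) and that heights never decrease as merges proceed. The function name and the heights are hypothetical, not values from this dataset.

```python
def clusters_at_cutoff(merge_heights, cutoff):
    """Number of clusters left when the dendrogram is cut at `cutoff`.

    Every merge at or below the cutoff has already joined two groups,
    so each one reduces the cluster count by one.
    """
    n_items = len(merge_heights) + 1  # n items produce n - 1 merges
    merges_done = sum(1 for h in merge_heights if h <= cutoff)
    return n_items - merges_done


# Hypothetical merge heights for a 7-item dendrogram (illustration only).
example_heights = [12.0, 30.0, 55.0, 95.0, 120.0, 200.0]

for cutoff in (90, 110, 180):
    print(cutoff, clusters_at_cutoff(example_heights, cutoff))
```

With these made-up heights, raising the cutoff from 90 to 110 to 180 drops the count from 4 to 3 to 2 clusters.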
I chose three cutoff values (90, 110 and 180) because they range from highly exclusive of branches with many nodes to highly inclusive of branches with fewer nodes.
- 180 includes all small branches.
- 90 includes only the dense nodes and intricate branching at the bottom.
- 110 sits between highly exclusive and highly inclusive.
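Cutting one tree at several heights can be sketched with SciPy. This is only a sketch under stated assumptions: the abundance matrix is synthetic, and the Euclidean distance and complete linkage are assumed settings, since the distance metric and linkage method are not specified here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the abundance matrix: rows are proteins,
# columns are time points. These numbers are invented for illustration.
rng = np.random.default_rng(0)
abundance = rng.uniform(0, 300, size=(60, 6))

# Assumed settings: Euclidean distance with complete linkage.
tree = linkage(abundance, method="complete", metric="euclidean")

cluster_counts = {}
for cutoff in (90, 110, 180):
    # criterion="distance" cuts the tree at the given height.
    labels = fcluster(tree, t=cutoff, criterion="distance")
    cluster_counts[cutoff] = len(np.unique(labels))
    print(f"cutoff {cutoff}: {cluster_counts[cutoff]} clusters")
```

The cluster counts printed for synthetic data will not match the counts reported for the real dendrogram. The point is only that one tree is built once and then cut at each candidate height.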
Next, I compared the line plots at each cutoff value to determine which level of inclusivity best captured the relationships between proteins. I was looking for a cutoff that grouped similar proteins together. It could not be so inclusive of branching that the proteins collapsed into a few clusters and individuality was lost, nor so exclusive of branching that proteins were clustered separately so often that shared patterns were lost.
Here are line plots of the abundance values of each cluster:
Cutoff height at 90, producing 43 clusters. This cutoff is highly exclusive of branching, which means there are many nodes and thus many clusters. The similarity required between proteins is strictest at this cutoff value. This is most visible in clusters 1 and 2, which look visually similar but are kept distinct at this cutoff. There are two dense clusters: cluster 15 has proteins whose abundance stayed relatively constant between about 25 and 100, and cluster 23 has proteins whose abundance stayed consistently below 50. 28 clusters contain only 1 protein.
Cutoff height at 110, producing 31 clusters. This value is moderately exclusive/inclusive. A major change between cutoff values 90 and 110 is visible in cluster 1: at 90 those proteins were distinct, but at 110 they are clustered together. There are three dense clusters here, compared with only two at 90: cluster 8 has proteins whose abundance stayed between 80 and 160, cluster 11 has proteins whose abundance stayed between 25 and 100, and cluster 16 has proteins whose abundance generally stayed below 50. Here, 20 clusters contain only 1 protein, as opposed to 28 at the 90 cutoff value.
Cutoff height at 180, producing 11 clusters. It looks like all the dense clusters from the previous cutoff values were grouped into cluster 6 here. This cutoff yields the fewest single-protein clusters (2 total), but it looks a bit too inclusive to parse out abundance patterns: in clusters 10, 7 and 1, the proteins look less similar to one another than they did at the previous cutoff value.
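The cluster totals and single-protein counts quoted above are the kind of summary that can be computed directly from a list of cluster labels. A minimal sketch follows, assuming one cluster label per protein; the function name and the example labels are hypothetical.

```python
from collections import Counter


def summarize_clusters(labels):
    """Return (number of clusters, number of single-protein clusters)."""
    sizes = Counter(labels)  # cluster label -> number of proteins
    singletons = sum(1 for size in sizes.values() if size == 1)
    return len(sizes), singletons


# Hypothetical cluster labels for eight proteins (illustration only).
example_labels = [1, 1, 2, 3, 3, 3, 4, 5]
n_clusters, n_singletons = summarize_clusters(example_labels)
print(n_clusters, n_singletons)  # 5 clusters, 3 of them singletons
```

Running this on the labels from each cutoff gives the two numbers compared above: how many clusters there are and how many of them hold a single protein.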
I chose a cutoff of 110 because it balances grouping similar proteins against being so exclusive that mostly single-protein clusters are produced.