Kaitlyn’s notebook: dendrogram cutoffs and Euclidean vs Bray-Curtis

Shelly did some amazing work creating a network using proteins identified by ASCA and p-values (calculated using a proportion test with spectral counts). The background colors on the network represent the log fold change. We want to see whether hierarchical clustering (HC) identified proteins with large fold changes and whether those proteins would be worth adding to this network. We also want to look at the differences between the proteins identified by HC and those identified by ASCA in the network.

Here I am comparing different cutoff values for HC (i.e., the height at which I cut the dendrogram) and how they affect the protein list and thus the network visualization. One question we have always had is whether a Bray-Curtis or a Euclidean dissimilarity matrix is more appropriate. To answer this question, I tried a range of cutoff values with both dissimilarity matrices and made a protein list for each one to insert into the network.
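As a rough illustration of how the two dissimilarities differ (this is not the code used for the analysis, and the three abundance profiles are made up), here is a minimal Python sketch that computes both measures by hand:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two abundance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|x - y| / sum(x + y), for non-negative data."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0

# Hypothetical spectral-count profiles over four time points (not real data).
protein_a = [10, 20, 30, 40]
protein_b = [100, 200, 300, 400]  # same trend as protein_a, 10x the abundance
protein_c = [40, 30, 20, 10]      # same total as protein_a, reversed trend

pairs = {"a vs b": (protein_a, protein_b), "a vs c": (protein_a, protein_c)}
for name, (p, q) in pairs.items():
    print(name,
          "euclidean =", round(euclidean(p, q), 3),
          "bray-curtis =", round(bray_curtis(p, q), 3))
```

For non-negative abundances, Bray-Curtis always falls between 0 and 1, while Euclidean distance grows with the size of the abundance values. That is consistent with the two sets of cutoffs in the table below being on such different scales (0.5 to 0.7 versus 150 to 300).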

#agglomerative coefficient (strength of the clustering structure; values closer to 1 indicate stronger structure)
#bray=0.9349286; euclidean=0.9969951

#cophenetic correlation (how well the dendrogram preserves the pairwise dissimilarities in the dissimilarity matrix)
#bray=0.7690317; euclidean=0.9460521 (both values exceed 0.75)
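To make the cophenetic correlation concrete, here is a small Python sketch on a made-up 4x4 dissimilarity matrix. It assumes single linkage purely because that keeps the example short; the linkage method actually used for these dendrograms is not stated here, so this only illustrates the idea and does not reproduce the values above. The function names are my own.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def single_linkage_cophenetic(dist):
    """Cophenetic distances under single linkage (an assumption of this sketch).

    Two items first join at the smallest possible 'largest step' along any
    path between them, which a Floyd-Warshall style min/max pass computes.
    """
    n = len(dist)
    coph = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                coph[i][j] = min(coph[i][j], max(coph[i][k], coph[k][j]))
    return coph

# Hypothetical dissimilarities between four proteins (not real data).
dist = [
    [0.0, 0.2, 0.6, 0.7],
    [0.2, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.3],
    [0.7, 0.8, 0.3, 0.0],
]

coph = single_linkage_cophenetic(dist)
pairs = list(combinations(range(len(dist)), 2))
original = [dist[i][j] for i, j in pairs]
from_tree = [coph[i][j] for i, j in pairs]

# Correlation between the original dissimilarities and the tree heights.
cophenetic_correlation = pearson(original, from_tree)
print(round(cophenetic_correlation, 3))
```

The closer this correlation is to 1, the more faithfully the tree heights reproduce the original pairwise dissimilarities.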

Dissimilarity   Cutoff value   Clusters
Bray-Curtis     0.7            9
Bray-Curtis     0.65           17
Bray-Curtis     0.6            23
Bray-Curtis     0.5            54
Euclidean       300            25
Euclidean       250            35
Euclidean       150            79
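The pattern in the table (a lower cutoff gives more clusters) can be sketched in a few lines of Python. This is a toy illustration only: it assumes single linkage, where cutting at a height is the same as grouping every pair of items whose dissimilarity is at or below the cutoff, and the matrix is made up. The function name `count_clusters` is hypothetical.

```python
def count_clusters(dist, cutoff):
    """Number of single-linkage clusters when the dendrogram is cut at `cutoff`."""
    n = len(dist)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)  # merge the two clusters
    return len({find(i) for i in range(n)})

# Hypothetical Bray-Curtis-like dissimilarities between five proteins.
dist = [
    [0.00, 0.20, 0.62, 0.70, 0.90],
    [0.20, 0.00, 0.55, 0.80, 0.85],
    [0.62, 0.55, 0.00, 0.30, 0.75],
    [0.70, 0.80, 0.30, 0.00, 0.68],
    [0.90, 0.85, 0.75, 0.68, 0.00],
]

for cutoff in (0.7, 0.65, 0.6, 0.5):
    print(cutoff, count_clusters(dist, cutoff))
```

Other linkage methods (average, complete, Ward) need the full merge history and will give different counts, but the direction of the trend is the same.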

Red lines mark the cutoff values shown in the table above.

Euclidean dendrogram:

Bray-Curtis dendrogram:

Protein lists include the abundance value for each protein, so each protein appears twice: once for 23C and once for 29C. If you want just the list of unique proteins regardless of silo, subset the rows for one silo; that subset contains every uniquely clustered protein exactly once.
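For example, the subsetting could look like this in Python. The column names (`protein`, `silo`, `cluster`, `abundance`) and the rows are hypothetical stand-ins, so adjust them to match the actual protein list files:

```python
# Hypothetical rows from a protein list; each protein appears once per silo.
rows = [
    {"protein": "P1", "silo": "23C", "cluster": 1, "abundance": 12.0},
    {"protein": "P1", "silo": "29C", "cluster": 1, "abundance": 30.5},
    {"protein": "P2", "silo": "23C", "cluster": 2, "abundance": 4.0},
    {"protein": "P2", "silo": "29C", "cluster": 2, "abundance": 3.2},
    {"protein": "P3", "silo": "23C", "cluster": 1, "abundance": 8.1},
    {"protein": "P3", "silo": "29C", "cluster": 1, "abundance": 19.7},
]

# Keep only one silo; every protein then appears exactly once.
silo_23 = [row for row in rows if row["silo"] == "23C"]
unique_proteins = [row["protein"] for row in silo_23]

# Sanity check: the subset covers the same proteins as the full list.
all_proteins = {row["protein"] for row in rows}
print(unique_proteins, set(unique_proteins) == all_proteins)
```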

The other files in the folders:

  • XXXfaceted-abund.jpeg
    • faceted abundance plots of all proteins
  • XXXfaceted-unq.jpeg
    • faceted abundance plots of only the unique proteins
  • XXXfreq.csv
    • frequency tables for the total number of proteins in each cluster
  • Scree plot
  • Dendrogram