Kaitlyn’s notebook: dendrogram cutoffs and euclidean vs bray-curtis

Shelly did some amazing work creating a network using proteins identified by ASCA and p-values (calculated using a proportion test with spectral counts). The backround colors on the network represent the log fold change. We want to see if hierarchical clustering (HC) identified proteins with large fold changes and if that might be good to add into this network; We also want to look at the differences in proteins identified in HC and ASCA in the network.

Here I am comparing different cutoff values for HC (e.g. where I decide to cut the dendrogram) and how that affects the protein list and thus network visualization. One question we have always had is whether bray-curtis or euclidean dissimilarity matrices are appropriate. To answer this question, I looked at varying cutoff values and made multiple protein lists to insert into the network with both dissimilarity matrices.

#agglomerate coefficent (how likely a protein is to be placed in a new cluster)
#bray=0.9349286; euclidean=0.9969951

#cophenetic correlation (how well the dissimilarity matrix represents the original data)
#bray=0.7690317; euclidean=0.9460521 (both values reach 0.75 value)

Cutoff value 0.7 0.65 0.6 0.5 300 250 150
Clusters 9 17 23 54 25 35 79

Red lines represent cutoff values shown above in table.

Euclidean dendrogram:

Bray-curtis dendrogram:

Protein lists include the abundance value for the protein so each protein is duplicated- one for 23C and one for 29C. If you want just the list of unique proteins regardless of the silo, just subset one of the silos and that will include all of the uniquely clustered proteins.

The other files in the folders:

  • XXXfaceted-abund.jpeg
    • faceted abundance plots of all proteins
  • XXXfaceted-unq.jpeg
    • faceted abundance plots of only the unique proteins
  • XXXfreq.csv
    • frequency tables for the total number of proteins in each cluster
  • Scree plot
  • Dendrogram

Kaitlyn’s notebook: 20190123 Geoduck histology

We sampled geoduck at the end of January and took some histology samples. I believe only geoduck from tank 4 were strip spawned and we noticed that the treated tanks had poorly developed gonads. I examined the histology slides that we just received from sampling.

I reviewed how Grace staged her geoduck as well as the criteria in the paper Brent sent Shelly and I (Ropes 1968). I talked with Grace a little more about what she did and showed her some slides and we were both having trouble placing the geoduck at different stages. It was hard to decide whether they were producing less, reabsorbing, or just developing gonad more slowly than the other tanks.

I came up with some scoring criteria to get a better idea of how tanks and treatments compared and entered it into this spreadsheet.

Overall, it appears that the ambient tanks were further along in development than the treated tanks. Tank 4 was furthest in development compared to tank 1 which was dominated by early active geoduck.

All treatment group geoducks were early active:

Least developed male 4 with only spermatognium present in tiny acini-

Early active male 1 with dense connective tissue and very low amount of spermatids-

Early active male 018 with less connective tissue previously but is still early active (granted further along than 1) because of the number of acini with less than 30-40% spermatids; also the acini are smaller than seen below in the late active male –

Early active female 006 with dense connective tissue, low volume eggs, and most eggs are on the follicular wall

Early active female 034 with many elongated eggs on the wall of a small follicles however the eggs do have some good volume and roundness. This is a good example of difficulties staging-

While Ambient geoducks looked more like this:

Late active male 41 with visible spermatids covering at least 50% of large acini and all acini contain spermatids; connective tissue is less dense than 018-

Ripe female 045 with little connective tissue and high volume eggs; this female has the most eggs/follicle but it still isn’t a lot compared to a heavily ripe female-

Images are located on the FFAR-Geoduck drive under /Images/20190123-histo and on OWL. I also updated the histology-databank with the slide locations.

Ropes, J. W. 1968. Reproductive Cycle of the Surf Clam, Spisula Solidissima, in Offshore New Jersey. Biol Bull 135:349-365

Kaitlyn’s notebook: Extracting clusters from heatmap (and a Venn diagram)

Extracting Clusters

I’ve been working on extracting the clusters out from my heatmap and finally succeeded with this R script!

Cluster 1 

Cluster 2


29C 23C 29C
Mean Protein Abundance 181.025 168.728 284.94 301.742

Important note:

  • distfun = function(x) as.dist(1 - cor(t(x), use = "pa"))
    • heatmap3 states that distfun=dist is default but that is incorrect and the above is the appropriate default;
    • gplots does have the correct distfun=dist default so you  must change it if heatmap3 gives a better dendrogram as in this case.

Question: Should I get error rates for the mean protein abundance of each cluster and silo?

I’m currently working on doing the same thing with the ASCA temperature influenced proteins (the ASCA table requires some different reformatting).

Venn Diagram

I had a lot of trouble trying to make a venn diagram in R (here is my attempt…). I did use the previous script to sort out a list of the proteins in each temperature, but ultimately I decided to refer back to Emma’s paper (2) and use Venny (2) the same way she did.

(1) Oliveros, J. C. Venny. An interactive tool for comparing lists with Venn’s diagrams.
https://bioinfogp.cnb.csic.es/tools/venny/index.html (accessed February 13, 2019).

(2) Timmins-Schiffman, E. B., Crandall, G. A., Vadopalas, B., Riffle, M. E., Nunn, B. L., Roberts, S. B. (2017). Integrating Discovery-driven Proteomics and Selected Reaction Monitoring To Develop a Noninvasive Assay for Geoduck Reproductive Maturation. Journal of Proteome Research, 16(9), 3298–3309.doi:10.1021/acs.jproteome.7b00288.

Kaitlyn’s notebook: oyster seed proteomics comparisons

Comparing ASCA and clustering identified proteins

I modified and reran the clustering code using the technical replicate averages. The code and other plots can be found here. I used the proteins ASCA identified as being affected by temperature.

This is the code I used to compare the analyses. A total of 25 proteins were common between clustering and ASCA analysis. ASCA identified 113 different proteins while clustering identified 8 other proteins. It would be interesting to see how manipulating the dendrogram height cutoff value during clustering affects the latter number (I used a cutoff of 250).

Here is a heatmap of the similar proteins (note the first column is Day 0 followed by 23C days 3-13 and finally 29C days 3-13):

Venn Diagram of 23C and 29C

I’m working on making a better venn diagram in R with just the silo 3 and 9 proteins. Here is a quick and dirty venn digram, but I’m trying to usethe VennDiagram package to improve it.

Note that all 0 values were changed to 0.100 in the technical replicate averages. I removed all values less than 0.600 as these would be undetected in the silo because of the change (there are 6 days of data since were are excluding day 15 because of the temperature malfunction).

This is obviously still a work in progress to create a better venn diagram.

Kaitlyn’s notebook: finalizing the methods and cluster heatmap

Shelly and I have developed nearly all of the methods. We need to develop and decide what we will use to either quantify the differences in the differentially abundant proteins we identify (eg. pairwise comparisons) or describe the differences (eg. gene enrichment). I also am working on identifying useful papers for the introduction and background on larval/seed mortality and temperature.
I really liked Emma’s presentation and take on the geoduck project. Since we have a very similar dataset I decided to run the deferentially identified proteins in DAVID again since we have found out that the values I was previously working with were not NSAF values. My idea was that different proteins could have been identified and led to different and possibly more significant results here that may be worth doing more digging in to. I choose to run only the differentially clustered proteins as a quick test to see if this was a method we should pursue (granted the drawback is there are only 32 proteins in this list, of which 18 do not have accessions and only 10 mapped to a DAVID ID). Here are the results: Nothing is signifigant by a Benjamini corrected p-value but translation and organonitrogen translation was. Translation is broad but is an important process for growth. Organonitrogen biosynthesis is too broad for me to make any sort of guess.
Overall I would say these results were not really useful. Emma went through her data by cluster. I can look into that after the pairwise comparison decisions and analysis is completed.
Note that 18 of the 32 proteins identified by differential hierarchical clustering do not have associated Uniprot accessions/annotations.
This is interesting because it seems like the proteins that have differential protein abundance between the two temperature treatments may be oyster specific, or at the very least unidentified. If we decide to pursue gene enrichment then we should blast the .fasta file against Uniprot or another database again to be sure of this.

In lab meeting we decided to keep all proteins and not filter. I reran the clustering code to see if this significantly changed the heatmap, and there are some changes but three groups of proteins are still identified by the dendrogram. I edited the new heatmap and added it to our document with a comment for interpretation. I also added the protein abundance plot to compare the differences showing protein abundance over time in line plots vs the heatmap.

A pairwise comparison will help to better quantify the differences between the treatments after the ASCA and hierarchical cluster analysis identify deferentially abundant proteins. I have a script and previously added in fold change analysis to the methods. I think if we do fold change, we may need to consider comparing the day before rather than every day compared to day 0. However, I also want to look into other pairwise comparisons such as a T-test.

I also looked into the differences in the way my line plots look compared to Emma’s line plots for the geoduck project. It appears this difference is based on the dissimilarity matrix. If I use bray-curtis to create my dissimilarity matrix, this is what my deferentially clustered proteins look like. There is about 3800 proteins here.

I will also upload my new cluster and heatmap code since they are highly edited now (the workflow is mostly the same).


Kaitlyn’s notebook: clustering on real NSAF values

These plots were made using silo3and9_nozerovals_noincnstprot.csv.

Workflow for hierarchical clustering:

  1. Make dissimilarity matrix (used euclidean distance)
  2. hclust from cluster package to cluster
  3. Agglomeration coefficient (0.9991113) and cophentic correlation (0.9635755)
  4. Scree plot
  5. Dendrogram- find height to cut based on branching (50)
  6. 33 clustered with protein abundances
  7. Heatmaps- combined silos and indvidual silos.

scree-plot.jpeg4. Scree plot made from clustered data. The elbow is difficult to find therefore we make a dendrogram to find the best place to ‘cut’ the data.

dendrogram.jpeg5. Red line represents where the dendrogram was cut, and it was based on the area right before heavy branching occurred.

This slideshow requires JavaScript.

  1. A total of 33 clusters were produced. The first image shows all proteins and the second is only the proteins that deferentially clustered. The proteins in the second image were used for the heatmaps.


  1. Heatmap of deferentially clustering proteins.

These protein values are different between silos. It makes sense to plot them together on the heatmap to see this difference, but does clustering the proteins (that are different based on clustering) on the heatmap make sense? Because clustering is trying to group proteins in similar patterns together. I want to show similar patterns in Silo 3 and silo 9 separately, not together.

It looks like there are three main groups of abundance profiles found by clustering:

  1. The bottom group has fairly consistent abundance values between both silos but it looks like day 3 in silo 3 separated them. Abundance levels were very low for these proteins initially.
  2. The second group, in the middle, appears to have heavier expression at the end of the experiment in silo 9 compared to silo 3.
  3. The third group is the reverse of the second group where expression is greater in the beginning of the experiment in silo 3 compared to silo 9.

Additionally, would the abundance by cluster line plots complement this figure by showing the abundance based on group? Or is that redundant?

Silo 3


Silo 9


Does showing them individually, i.e., clustering the proteins individually, show the differences in abundance between silos better?

Kaitlyn’s notebook: NSAF vs NUMSPEC and working draft methods

Shelly and I are working on putting together a paper focusing on proteomics for the 2016 oyster seed data. Here is the working draft!

We also found out that the data file I have been working with is not NSAF (normalized spectral abundance factor) values. Instead they are spectral peak values (NUMSPEC) which do not really correlate to protein abundance values.. All downstream analyses need to be done on NSAF values, but we are now sure that these are the correct values and that a redone file will export in the same format.

I went through Emma’s notebook today and found some information on technical replicates, MS preparation and experimental justification to name a few things. Important/relevant posts are linked in the draft that can be discussed and used to help discern the parts of the method we are unsure about.

We are focusing on the methods section. I added in information about hierarchical clustering and a fold change analysis. We are unsure if fold change analysis will stay in the paper. Previously we had down this with the NUMSPEC data, so in order to know if it will stay in the methods section, I need to redo it with the new NSAF data. Here is what I’ve done so far:

First I needed to organize the .tsv file. It contains several other NSAF values and spectral values. I know from the above issue that adjusted NSAF values should be used so I extracted all columns containing ‘ADJNSAF’. Next, I had to remove everything but the sample name which was tricky since the sample names are not the same length. Another tough aspect of rearranging this data sheet is that the sample names do not intuitively correspond to the silo, temperature or day as seen here, but another benefit to taking the time to reformat the data, is that it can be put through the other scripts much easier now to generate a new clustering and ASCA heatmap.

All silo 3 and 9 samples from day 0 (competent larvae) to day 13:

  • 1- s0d0
  • 4- s3d3
  • 8- s9d3
  • 12- s3d5
  • 16- s9d5
  • 20- s3d7
  • 24- s9d7
  • 28- s3d9
  • 32- s9d9
  • 36- s3d11
  • 40- s9d11
  • 44- s3d13
  • 48- s9d13

Row means were taken followed by a foldchange analysis for each day. Originally we removed fold changes less than 2 (as written in the methods), but that leaves many NA values for proteins that can’t be visualized. I need to find a way to remove NAs and infinity values without setting the cutoff value for this reason.

Kaitlyn’s notebook: new height for dendrogram

Goals for today:

New Heatmap

Quick refresher: Hierarchical clustering compares the pattern of abundance between each protein. It does not factor in time as a dependent value. Instead it considers each time point as an independent variable.

I wanted to better evaluate the cutoff value for my hierarchical clustering dendrogram. The dendrogram is inserted below (ignore the current red cutoff line).

[A dendrogram was created because the scree plot was not detailed enough to find an appropriate cutoff value (at the elbow).]



Clusters directly correlate to the nodes therefore the higher the cutoff value, or height, the fewer clusters are created (because less nodes are included).

I choose 3 cutoff values, at 90, 110 and 180 because they were either highly inclusive of branches with fewer nodes or highly exclusive of branches with many nodes.

  • 180 includes all small branches.
  • 80 includes only the dense nodes and intricate branching at the bottom.
  • 110 sits between high exclusive and highly inclusive.

Next, I compared the line plots at each cutoff value to determine which level of inclusivity best examined the relationship between proteins. I was looking for something that grouped proteins together well. This value couldn’t be so inclusive of branching that all proteins grouped together into so few clusters that individuality was lost, or so exclusive of branching that proteins were so often clustered separately that patterns were lost.

Here are line plots of the abundance values of each cluster:


Cutoff height at 90 producing 43 clusters. Highly exclusive of branching which means there are several nodes and thus clusters. The similarity between proteins is most strict at this cutoff value. We can see this most in clusters 1 and 2 which visually look pretty similar but that this cutoff value determines as distinct. There are two dense clusters: Cluster 15 has proteins that stayed relatively constant between about 25 and 100.  Cluster 23 has proteins whose value stayed consistently below 50 abundance. 28 clusters contain only 1 protein.


Cutoff height at 110 producing 31 clusters. This value is moderately exclusive/inclusive. You can see a major change between cutoff value 90 and 110 in cluster 1. Previously, at 90, those proteins were distinct, but here, at 110, they are clustered together. You can also see that there are 3 dense clusters here: Cluster 8 has proteins whose abundance stayed between 80 and 160. Cluster 11 has proteins that stayed between 25 and 100, and cluster 16 has proteins who generally stayed below 50 abundance. At 90, there were only two dense clusters. Here, 20 clusters contain 1 protein as opposed to 28 at the 90 cutoff value.


Cutoff height at 180 with 11 clusters. It looks like all the dense clusters at previous cutoff values were grouped into cluster 6 here. This results in the lowest number of clusters with only 1 protein (2 total) but it looks like this might be a bit too inclusive to parse out abundance patterns based on clusters 10, 7 and 1 which look like the similarity between proteins is not as good as the previous cutoff value.

I choose cutoff 110 because of the balance between grouping similar proteins and not being so exclusive that mostly single protein clusters are produced.


Kaitlyn’s notebook: Proteomics paper


Referenced from Shelly’s post

  • Compare ASCA proteins (high loadings) with hierarchical cluster (differentially clustered) proteins
    • make raw abundance line plots facetted by protein
    • examine if GO enrichment changes when ASCA and cluster proteins are combined
  • Determine how time is factored into the cluster
  • Determine if the permutation test with ASCA tests needs to be improved for the high loadings proteins to be considered highly influential
  • Redo BLAST of CHOYP proteins to 2018 Uniprot database
  • Begin identifying and locating data files that we need to deposit in public protein repository (i.e. ProteomeXchange and PeptideAtlas)
    • need to regenerate ‘table_blastout_gigatonpep-uniprot’
    • need fasta file of the peptides ID’d by mass spec
    • make a simplified supplementary table containing CHOYP IDs, UniProt Accessions, e.val, Protein names, Gene names
      • Can modify current datasheet to get this info as well

Completed tasks

Paper questions

Question: Does temperature influence the proteome of larval C. gigas, and if so, how?

Referenced from Shelly’s post

  • Do we need to explain we did 2 x 4 treatments, or just say we did 1 x 2 treatments?
  • Do we have survival data for other silos to compare to silo 2, 3, and 9?
    • Can we rule out silo 2 as an anomaly or should we include it?

Kaitlyn’s notebook: clustering without day 0

I redid the hierarchical clustering of the combined silo 3 and 9 datasheet without day 0.

Include Day 0 Remove Day 0
Agglomerate coefficient 0.9964979 0.9976217
Cophenetic correlation 0.9477519 0.9630485
Clusters 41 84
Diferentially clustered proteins 33 213

31 “diferentially clustered proteins” remained differentially clustered whether day 0 was included or excluded. So, removing day 0 causes more proteins to cluster separately (the agglomerate coefficient is slightly increased).

  • i.e., removing day 0 causes more proteins to be identified as having different abundances.
    • All abundances are the same for both silos on day 0 since no treatment had been administered yet.
      • Removing day 0 means that only days following treatment are analyzed which makes more sense since we are attempting to identify proteins that have different abundances during treatment only.


This slideshow requires JavaScript.

This slideshow requires JavaScript.