Kaitlyn’s notebook: Proteomics paper


Referenced from Shelly’s post

  • Compare ASCA proteins (high loadings) with hierarchical cluster (differentially clustered) proteins
    • make raw abundance line plots facetted by protein
    • examine if GO enrichment changes when ASCA and cluster proteins are combined
  • Determine how time is factored into the cluster
  • Determine whether the ASCA permutation test needs to be improved before the high-loading proteins can be considered highly influential
  • Redo BLAST of CHOYP proteins to 2018 Uniprot database
  • Begin identifying and locating the data files that we need to deposit in a public proteomics repository (e.g. ProteomeXchange and PeptideAtlas)
    • need to regenerate ‘table_blastout_gigatonpep-uniprot’
    • need fasta file of the peptides ID’d by mass spec
    • make a simplified supplementary table containing CHOYP IDs, UniProt accessions, e-values, protein names, and gene names
      • Can modify current datasheet to get this info as well
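One way the simplified supplementary table could be assembled is sketched below. This is only an illustration under stated assumptions: it assumes the BLAST output is in the standard 12-column tabular format (query ID first, subject ID second, e-value in the eleventh column) and that subject IDs look like `sp|ACCESSION|ENTRY_NAME`. The CHOYP IDs, the BLAST rows, and the `ANNOTATIONS` lookup are made-up placeholders, not values from the actual datasheet.

```python
import csv
import io

# Hypothetical BLAST tabular rows (12 columns; e-value is column 11).
# The CHOYP IDs and all numbers are placeholders, not real results.
BLAST_TSV = (
    "CHOYP_EXAMPLE.1.1\tsp|P21333|FLNA_HUMAN\t45.2\t500\t250\t10\t1\t500\t20\t510\t1e-120\t400\n"
    "CHOYP_EXAMPLE.1.1\tsp|A8TX70|CO6A5_HUMAN\t30.1\t200\t130\t5\t1\t200\t5\t205\t2e-10\t60.5\n"
    "CHOYP_EXAMPLE.2.2\tsp|A8TX70|CO6A5_HUMAN\t38.0\t300\t180\t6\t1\t300\t10\t310\t3e-50\t180\n"
)

# Hypothetical annotation lookup: accession -> (protein name, gene name).
ANNOTATIONS = {
    "P21333": ("filamin A", "FLNA"),
    "A8TX70": ("collagen type VI alpha 5 chain", "COL6A5"),
}


def best_hits(blast_text):
    """Keep the hit with the lowest e-value for each query sequence."""
    best = {}
    for row in csv.reader(io.StringIO(blast_text), delimiter="\t"):
        if not row:
            continue
        query, subject, evalue = row[0], row[1], float(row[10])
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def build_table(blast_text, annotations):
    """Return one row per CHOYP ID with accession, e-value, and names."""
    rows = []
    for query, (subject, evalue) in sorted(best_hits(blast_text).items()):
        parts = subject.split("|")
        # Fall back to the raw subject ID if it is not in sp|ACC|NAME form.
        accession = parts[1] if len(parts) >= 3 else subject
        protein_name, gene_name = annotations.get(accession, ("", ""))
        rows.append({
            "choyp_id": query,
            "uniprot_accession": accession,
            "evalue": evalue,
            "protein_name": protein_name,
            "gene_name": gene_name,
        })
    return rows


SUPP_TABLE = build_table(BLAST_TSV, ANNOTATIONS)
for row in SUPP_TABLE:
    print(row)
```

Keeping only the lowest e-value hit per CHOYP ID is a design choice made for the sketch; the real table may need a different rule (for example, an e-value cutoff).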

Completed tasks

Paper questions

Question: Does temperature influence the proteome of larval C. gigas, and if so, how?

Referenced from Shelly’s post

  • Do we need to explain we did 2 x 4 treatments, or just say we did 1 x 2 treatments?
  • Do we have survival data for other silos to compare to silo 2, 3, and 9?
    • Can we rule out silo 2 as an anomaly or should we include it?

Kaitlyn’s notebook: clustering without day 0

I redid the hierarchical clustering of the combined silo 3 and 9 datasheet without day 0.

                                     Include Day 0   Remove Day 0
Agglomerative coefficient            0.9964979       0.9976217
Cophenetic correlation               0.9477519       0.9630485
Clusters                             41              84
Differentially clustered proteins    33              213

31 “differentially clustered proteins” remained differentially clustered whether day 0 was included or excluded. So, removing day 0 causes more proteins to cluster separately (the agglomerative coefficient also increases slightly).

  • i.e., removing day 0 causes more proteins to be identified as having different abundances.
    • All abundances are the same for both silos on day 0 since no treatment had been administered yet.
      • Removing day 0 means that only days following treatment are analyzed which makes more sense since we are attempting to identify proteins that have different abundances during treatment only.
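The comparison between the two clustering runs can be sketched as a set operation. This is a minimal sketch, assuming a protein counts as "differentially clustered" when its silo 3 and silo 9 abundance profiles land in different clusters; the protein names and cluster IDs are hypothetical. With the counts reported above, 33 - 31 = 2 proteins are differentially clustered only when day 0 is included, and 213 - 31 = 182 only when it is removed.

```python
def differentially_clustered(assignments):
    """Return proteins whose silo 3 and silo 9 profiles fall in different clusters.

    `assignments` maps protein -> {"silo3": cluster_id, "silo9": cluster_id}.
    This definition is an assumption made for the sketch.
    """
    return {
        protein
        for protein, clusters in assignments.items()
        if clusters["silo3"] != clusters["silo9"]
    }


# Hypothetical cluster assignments from the two runs.
WITH_DAY0 = {
    "protA": {"silo3": 1, "silo9": 1},
    "protB": {"silo3": 2, "silo9": 5},
    "protC": {"silo3": 3, "silo9": 4},
}
WITHOUT_DAY0 = {
    "protA": {"silo3": 1, "silo9": 7},
    "protB": {"silo3": 2, "silo9": 6},
    "protC": {"silo3": 3, "silo9": 3},
}

with_set = differentially_clustered(WITH_DAY0)
without_set = differentially_clustered(WITHOUT_DAY0)

shared = with_set & without_set        # differentially clustered in both runs
only_with = with_set - without_set     # only when day 0 is included
only_without = without_set - with_set  # only when day 0 is removed

print(sorted(shared), sorted(only_with), sorted(only_without))
```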



Kaitlyn’s notebook: Silo 3 and 9 NMDS with color


I can change the colors, symbols, or legends as well.

Kaitlyn’s notebook: Total protein abundances and unique proteins


I’m not sure how best to put this in a table. The third table is the best way I could think of right now. The first three tables are just to show differences between two silos only.


Kaitlyn’s notebook: Enrichment and Protein abundance heatmaps

These heatmaps were created by using the BP-FAT file created by DAVID when I entered my uniquely clustered proteins with a background of all detected proteins (in silos 3 and 9).

  • Enriched processes merged back to a protein based on Uniprot Accession IDs
  • Protein abundances are values in heatmap
  • merged-test-table

These heatmaps show how the protein abundances change over time for proteins whose Uniprot accession codes are associated with enriched BPs represented by parent terms which were given by DAVID during enrichment analysis.
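The merge described above could look roughly like the following. It is a sketch under assumptions, not the code used for the heatmaps: it assumes the DAVID chart file is tab-delimited with `Term` and `Genes` columns, that `Genes` holds a comma-separated list of UniProt accessions, and that terms are written as `GO:ID~name`. The chart rows and abundance values are invented for illustration.

```python
import csv
import io

# Hypothetical excerpt of a DAVID chart file (only the columns used here).
DAVID_CHART = (
    "Category\tTerm\tPValue\tGenes\n"
    "GOTERM_BP_FAT\tGO:0002576~platelet degranulation\t0.01\tP21333, A8TX70\n"
    "GOTERM_BP_FAT\tGO:0090066~regulation of anatomical structure size\t0.04\tP21333\n"
)

# Hypothetical abundances: accession -> {day: abundance}.
ABUNDANCE = {
    "P21333": {0: 10.0, 3: 12.0, 5: 15.0},
    "A8TX70": {0: 4.0, 3: 3.5, 5: 6.0},
}


def term_to_accessions(chart_text):
    """Map each enriched term name to the accessions listed in its Genes column."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(chart_text), delimiter="\t"):
        term = row["Term"].split("~", 1)[-1]  # drop the "GO:...~" prefix if present
        mapping[term] = [g.strip() for g in row["Genes"].split(",") if g.strip()]
    return mapping


def heatmap_rows(chart_text, abundance):
    """Return (term, accession, abundances ordered by day) rows for plotting."""
    rows = []
    for term, accessions in term_to_accessions(chart_text).items():
        for accession in accessions:
            if accession in abundance:
                by_day = abundance[accession]
                rows.append((term, accession, [by_day[d] for d in sorted(by_day)]))
    return rows


HEATMAP_ROWS = heatmap_rows(DAVID_CHART, ABUNDANCE)
for row in HEATMAP_ROWS:
    print(row)
```

A protein linked to several enriched terms appears once per term, so the heatmap rows are term-protein pairs rather than unique proteins.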

I also included the heatmaps clustered by time below the first pair of plots.

The patterns of abundance seem to be very similar between the two silos, but silo 3 tends to have higher abundances than silo 9 except with platelet degranulation.




The days are clustered in the plots below. If we cut the dendrogram at the second node, we see three main groups for Silo 3:

  1. Day 13
  2. Days 7, 5, 11, 9 
  3. Days 15, 0, 3

and two main groups for Silo 9:

  1. Days 0, 3, 9   
  2. Days 15, 13, 11, 7, 5

Based on this, we can see that protein abundance patterns are different between the silos, but we knew this already since these proteins were selected based on differential clustering before. The new information we can see is how the processes the proteins are linked to shift based on time.



This was done with the BP-FAT file because it had the fewest enriched processes and would be the easiest to work with and view as a test attempt. Here is the code I made.

Kaitlyn’s notebook: reviewing material for possible next steps or gene enrichment visualization

Now that I have a list of terms that show some significance, I want to figure out how to visualize the data in an informative way. I’m doing some research on gene enrichment visualization tools. Here are some that I’ve come across:

  • REVIGO (REVIGO-results, REVIGO-treemap)
    • The only two processes that weren’t significant were
      • negative regulation of biological process
      • regulation of anatomical structure size
    • I also saved the results in the table REVIGO produces.
  • WebGivi (webgivi-test)
    • This is a quick example of a visualization that WebGivi does. It might be interesting to reorganize the data such that proteins are drawn to parent terms (rather than GO IDs being drawn to the GO term).
  • InterMineR (GitHub here and Bioconductor)
    • Both an enrichment and visualization package in R
  • PANTHER and GOrilla seem to be for model organisms; otherwise you need protein sequences to analyze against the database.
  • This website has a list of gene enrichment tools that I want to go through. I’ve come across and mentioned some already, but there are quite a few on here.
  • Some good info on gene enrichment interpretation and presentation.

Additionally, I’m reviewing literature on how gene enrichment has been visualized before, and if there are other methods that might be suitable for my data set.

Kaitlyn’s notebook: Gene enrichment

14 IDs could be mapped (out of 28) using DAVID.

Focal adhesion was enriched in the KEGG-PATHWAY with a p-value of 5.4E-2 and Benjamini 6.1E-1.

A8TX70   collagen type VI alpha 5 chain (COL6A5)   Homo sapiens
P21333   filamin A (FLNA)                          Homo sapiens
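To see why a raw p-value near 0.05 can end up with a much larger Benjamini value, here is a minimal sketch of the Benjamini-Hochberg adjustment. The input p-values are hypothetical and will not reproduce the 6.1E-1 reported by DAVID, which depends on every pathway tested and on DAVID's own implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value
    # never exceeds the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four tested terms.
RAW = [0.01, 0.04, 0.03, 0.5]
ADJ = benjamini_hochberg(RAW)
print(ADJ)
```

The more terms that are tested, the larger the multiplier `m / rank` becomes for the top-ranked terms, which is why a borderline raw p-value rarely survives correction.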

I downloaded:

  • BP_FAT
  • BP_ALL
  • the functional clustering chart for BP_FAT, BP_ALL, and BP_DIRECT

BP_DIRECT had the fewest enriched processes (and they all fit in one screenshot unlike the others that could only be accurately visualized if they were downloaded):


BP_DIRECT are the annotations from the source (which I believe would be considered Uniprot) without any parent terms included.

The number of enriched processes has increased quite a bit since I added in the 0 abundance proteins to even out the protein list between silos after cluster analysis. Creating a heat map with the processes doesn’t seem like it will visualize the data correctly or easily. I’m going to see what other visualization tools downstream of gene enrichment analysis exist and if any are feasible for my data that I can try.
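The step of adding zero-abundance proteins to even out the protein lists can be sketched as follows. This is only an assumed reading of that step: any protein present in one silo but missing from the other is given an abundance of 0 on every day. The protein names, days, and values are hypothetical.

```python
def pad_with_zeros(silo3, silo9, days):
    """Give both silos the same protein list, filling missing proteins with zeros.

    Each input maps protein -> {day: abundance}.
    """
    all_proteins = sorted(set(silo3) | set(silo9))
    zero_profile = {day: 0.0 for day in days}
    padded3 = {p: dict(silo3.get(p, zero_profile)) for p in all_proteins}
    padded9 = {p: dict(silo9.get(p, zero_profile)) for p in all_proteins}
    return padded3, padded9


# Hypothetical abundances for two sampling days.
DAYS = [3, 5]
SILO3 = {"protA": {3: 2.0, 5: 4.0}, "protB": {3: 1.0, 5: 1.5}}
SILO9 = {"protA": {3: 2.5, 5: 3.0}, "protC": {3: 7.0, 5: 8.0}}

PADDED3, PADDED9 = pad_with_zeros(SILO3, SILO9, DAYS)
print(PADDED3)
print(PADDED9)
```

Padding this way lengthens the protein list submitted for enrichment, which is consistent with the increase in enriched processes noted above.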