Kaitlyn’s notebook: Options for cluster analysis


Now that I have unique proteins, I need to map them back to biological processes. Gene enrichment analysis asks which annotation terms are over-represented in a gene list relative to a background (often the whole genome). A single biological process maps to multiple genes, so a biological process (or cellular component, or molecular function) is called enriched based on the combined contribution of the genes that changed with a treatment or phenotype. Therefore, the background list of genes used when identifying enriched processes is very important.
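To make the background point concrete, here is a minimal sketch of a one-sided hypergeometric (Fisher-style) over-representation test. All counts are made up for illustration and the function name is my own. DAVID reports an EASE score, a more conservative variant of Fisher's exact test, so its p-values will not match this exactly.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): probability of seeing at least k term-annotated proteins
    in a list of n, drawn from a background of N of which K carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Hypothetical counts (not from my data): 8 of 50 list proteins carry the term,
# and 40 background proteins carry it in both scenarios.
k, n, K = 8, 50, 40
p_small_bg = hypergeom_enrichment_p(k, n, K, N=1000)  # narrower background
p_large_bg = hypergeom_enrichment_p(k, n, K, N=4000)  # broader background

print(f"background of 1000: p = {p_small_bg:.3g}")
print(f"background of 4000: p = {p_large_bg:.3g}")
```

The same list and the same term give a smaller p-value against the broader background, because the term is rarer there. That is why the background choice can change which processes come out as enriched.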

My current list of unique proteins covers two of my three silos. I have a couple of options for turning this data into a gene list:

  • split up silos and examine enrichment based on unique proteins per silo
  • look at enriched processes of all unique proteins

For my background I can include:

  • all detected proteins in the experiment (all silos)
  • detected proteins in silo 3 and 9 only
  • all unique proteins in silo 3 and 9.

I’m not sure which background would be most useful for determining the differences between temperatures.
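One way to compare the options is to build each gene list and each background as a set and check which pairings can be tested at all. This is only a sketch: the accessions and the "other_silo" name are placeholders, and it assumes "unique" means detected in silo 3 or silo 9 but not both.

```python
# Placeholder detection sets; accessions are invented for illustration.
detected = {
    "other_silo": {"P1", "P2", "P3", "P7"},
    "silo3": {"P1", "P2", "P4", "P5"},
    "silo9": {"P1", "P3", "P5", "P6"},
}

# Assumption: "unique" = detected in one of silo 3 / silo 9 but not the other.
unique_silo3 = detected["silo3"] - detected["silo9"]
unique_silo9 = detected["silo9"] - detected["silo3"]
all_unique = unique_silo3 | unique_silo9

gene_lists = {
    "unique_silo3": unique_silo3,
    "unique_silo9": unique_silo9,
    "all_unique": all_unique,
}
backgrounds = {
    "all_detected": set().union(*detected.values()),
    "silo3_and_9_detected": detected["silo3"] | detected["silo9"],
    "silo3_and_9_unique": all_unique,
}

def is_informative(gene_list, background):
    """A list can only show enrichment if it is a proper subset of the background."""
    return gene_list < background

for bg_name, bg in backgrounds.items():
    for gl_name, gl in gene_lists.items():
        print(f"{gl_name:>13} vs {bg_name:<21} testable: {is_informative(gl, bg)}")
```

Under that assumption, using all unique proteins as both the gene list and the background leaves nothing to compare against, so the third background only makes sense with the per-silo split.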

I want to visualize the differences between the silos in addition to identifying which biological processes change. Heatmaps are a great way to view that information, but with 1532 unique proteins between the silos, a protein-level heatmap would be difficult to interpret. It would be nice to incorporate biological process into the heatmap; however, I am not sure how to do that.

When Shelly and I discussed this, I thought that I could map the parent GO term back to each protein using the UniProt accession code and then plot the abundances of the proteins underneath the parent term in a heatmap. I don’t know if protein abundances can be representative of enriched processes, though, since the group of genes is determined independently of any quantitative value.
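A rough sketch of that idea: group proteins under their parent term and collapse abundances into one row per term, which gives a much smaller matrix to plot. Everything here is a placeholder (accessions, term names, abundance values), and taking the mean is just one possible summary. It only summarizes abundance; it does not test enrichment.

```python
from collections import defaultdict
from statistics import mean

# Placeholder mapping from accession to parent GO terms (not real annotations).
protein_to_parent_terms = {
    "ACC_A": ["term_X"],
    "ACC_B": ["term_X", "term_Y"],  # a protein can sit under several terms
    "ACC_C": ["term_Y"],
}

# Placeholder abundances per silo (not real measurements).
abundance = {
    "ACC_A": {"silo3": 10.0, "silo9": 2.0},
    "ACC_B": {"silo3": 4.0, "silo9": 6.0},
    "ACC_C": {"silo3": 0.0, "silo9": 8.0},
}

def term_by_silo_matrix(mapping, abundance, silos):
    """Return {term: [mean abundance in each silo]}, one row per parent term."""
    grouped = defaultdict(list)
    for accession, terms in mapping.items():
        for term in terms:
            grouped[term].append(accession)
    return {
        term: [mean(abundance[acc][silo] for acc in accessions) for silo in silos]
        for term, accessions in sorted(grouped.items())
    }

silos = ["silo3", "silo9"]
matrix = term_by_silo_matrix(protein_to_parent_terms, abundance, silos)
for term, row in matrix.items():
    print(term, row)
```

Each row could then be passed to whatever heatmap function I end up using, with terms as rows and silos as columns.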

Terms to refer back to for DAVID…

  • RT: related terms (genes that share similar sets of annotation terms are likely in the same biological mechanism), scored by Kappa value (closer to 1 = better)
  • COG ontology: NCBI’s COG (Clusters of Orthologous Groups of proteins) for genome-scale analysis of protein functions and evolution
  • SP_PIR_KEYWORDS: SwissProt/UniProt and PIR keywords
  • UP_SEQ_FEATURE: annotation category, UniProt Sequence Feature at the UniProt site
  • SCOP_..: Protein structure
  • GO_FAT: filters out broad GO terms based on a “measured specificity” of each term (not by level)
    • GO_ALL is all levels
    • GO_BP_1 is level 1
      • Information typically increases with level as nodes annotate deeper; however, information can decrease if a node does not annotate any deeper
    • GO_Direct: directly annotated by source database without any parent terms (but low specificity)
  • Functional Annotation Clustering report: groups and displays similar annotations together, which makes the biology easier to understand than in the traditional Functional Annotation Chart
    • Initial Group Members: min gene number in a seeding group (lower = more genes included in functional groups, but many small groups)
    • Final Group Members: min gene number after cleanup (number of genes a cluster must have to be presented)
    • Multi-Linkage Threshold: how seeding groups are merged (higher % = sharper separation)
    • Clustering typically contains more terms than the chart b/c of significant neighbors
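For the Kappa value mentioned under RT, here is a minimal sketch of Cohen's kappa between two genes' annotation profiles, where each profile is a 0/1 vector over the same ordered list of terms. The vectors are invented and the function name is my own; this is the textbook statistic, not necessarily DAVID's exact implementation.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length 0/1 annotation vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and the same length")
    n = len(a)
    # Observed agreement: both annotated, or both not annotated, for a term.
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Agreement expected by chance from each gene's own annotation rate.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1:
        return 0.0  # kappa is undefined here; this sketch returns 0 by choice
    return (observed - expected) / (1 - expected)

# Invented annotation profiles over eight hypothetical terms.
gene1 = [1, 1, 1, 0, 0, 0, 0, 0]
gene2 = [1, 1, 0, 0, 0, 0, 0, 0]  # shares two terms with gene1
gene3 = [0, 0, 0, 1, 1, 0, 0, 0]  # shares no terms with gene1

print(round(cohen_kappa(gene1, gene2), 3))
print(round(cohen_kappa(gene1, gene3), 3))
```

Values near 1 mean the two profiles agree far more than chance would predict, and values at or below 0 mean they agree no better than chance.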