Kaitlyn’s Notebook: New table with annotations and Kmeans run time…

I merged the UniProt annotated table with each silo table that had the quantitative and qualitative tags I previously made: new table.
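A merge like this is a keyed join on the protein ID. Here is a minimal sketch in Python (the actual work was done in R and Excel); the column names `protein`, `annotation` and `day_0` are hypothetical placeholders, not the real table headers.

```python
def merge_annotations(silo_rows, annotation_rows, key="protein"):
    """Left-join annotation columns onto silo rows by a shared protein ID."""
    annotations = {row[key]: row for row in annotation_rows}
    merged = []
    for row in silo_rows:
        extra = annotations.get(row[key], {})
        # Silo values win on clashing column names; unannotated proteins are kept.
        merged.append({**extra, **row})
    return merged


# Made-up rows with hypothetical column names.
silo = [{"protein": "P1", "day_0": 4.0}, {"protein": "P2", "day_0": 0.0}]
annot = [{"protein": "P1", "annotation": "kinase"}]
merged = merge_annotations(silo, annot)
```

Keeping unannotated proteins (a left join) matches the choice to keep everything in the table for now and remove rows later.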

I want to note that this table includes proteins that are not abundant in each silo. I chose to include them for now since they are easily removable. I was thinking that some Revigo plots with the 0 abundance proteins might reveal some differences between the silos… There are ~1000 to 1500 proteins not expressed in each table (out of about 8400 proteins).

I’m making a new scree plot since my last scree plots weren’t made with the right code (nhclus.scree(x, max.k=#)). It has not been successful yet because of how long it takes: I’ve let it run for over 4 hours with no results. I am trying one more time and plan to let it run overnight, but if it takes that long it may not be feasible, since I need to do it 3 times and then run kmeans, which takes a few hours itself…
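For reference, a scree curve for k-means is just the within-cluster sum of squares (WSS) recomputed for each k from 1 to max.k, which is why the run time grows with both the number of proteins and max.k. The Python sketch below illustrates that idea with a plain Lloyd's k-means; it is not the implementation behind nhclus.scree, and the function names and toy points are made up for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wss(points, k, iters=50, seed=0):
    """Run Lloyd's k-means once and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random starting centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            [sum(col) / len(c) for col in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def scree(points, max_k):
    """WSS for k = 1..max_k; the 'elbow' in these values suggests a cluster count."""
    return {k: kmeans_wss(points, k) for k in range(1, max_k + 1)}


# Toy data: two obvious groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
curve = scree(points, 3)
```

Because every k needs its own full k-means run, one possible way to check feasibility before an overnight run is to compute the curve on a random subset of proteins first.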

Also, this came up in lab meeting: changing max.print options in R.

Kaitlyn’s Notebook: Kmeans Clustering

I used an R script provided by Emma to cluster each silo.

The scree plots for each silo are:

I chose four clusters for each silo so that they could be more easily compared with one another. It was somewhat difficult to determine cluster numbers because the first component was so much larger than any of the others…

The plots are comparing the first measured day of the experiment with the last.

Initially I chose only 3 clusters for Silo 2. The comparison between 3 and 4 clusters for Silo 2:

Here is Silo 3 followed by Silo 9 with 4 clusters:


I’m working next on eigenvectors (I’m having a problem getting R to read my data as numeric) and on incorporating the cluster assignment into the Excel spreadsheet as a tag.

Kaitlyn’s Notebook: Basic Statistical ‘Tags’

I updated an Excel spreadsheet so that it has multiple stats that I thought might be useful for seeing patterns in expression. There are multiple sheets in the file: combined data with a few tags, followed by Silo 2, 3 and 9 with all tags. The tags, and why they may be helpful in seeing protein expression patterns, are listed below.

  1. Average- is this protein typically highly or lowly expressed?
  2. Standard Deviation- how much does abundance on each day deviate from the average?
  3. Coefficient of Variation – standard deviation normalized by the average; how dispersed the protein expression is relative to its mean
  4. Variance- less useful than (3) however another representation of the dispersion of protein expression
  5. Median- valuable if compared to the average protein abundance to understand if protein expression is consistent
  6. Slope- linear regression to understand overall trend of protein expression (decreasing vs. increasing)
  7. Kurtosis- understand if the protein has a sharp peak in protein expression
  8. Skewness- informs us if the protein is being expressed more in a certain half of the experiment
  9. Max- is the protein expressed a lot at any point in the experiment?
  10. Min- is there a time when the protein is not expressed?
  11. Range- the overall change in protein expression (does not inform us whether it is increasing or decreasing)
  12. 1st quartile- What is the cutoff for 25% expression over the course of the experiment?
  13. 3rd quartile- What is the cutoff for 75% expression over the course of the experiment?
  14. Sum- determines if the protein was highly abundant over the course of the experiment (relative to the sums of other proteins)
  15. Day0:Day15- a ratio of the day before treatment to the final day of the experiment; informs us if the protein significantly changed after treatment
  16. Day3:Day15- a ratio of the first measured day of treatment to the final day of treatment
  17. Average for Days 0-7- valuable when compared to the second average to see if there was a change in protein expression halfway through the larvae’s lives
  18. Average for Days 9-15- a complement to the above tag
  19. Range for Days 0-7- valuable when compared to the range for days 9-15; further elucidates changes in expression between the first half of the experiment and the second half
  20. Sum:Total Proteins Identified- what fraction of the total protein abundance in the experiment comes from this protein?
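Several of the tags above can be computed in a few lines per protein. The Python sketch below is only an illustration (the real tags were made in Excel): the sampling days and abundance values are made up, the function name `protein_tags` is hypothetical, and the quartiles use the "inclusive" interpolation method, which may differ slightly from whatever the spreadsheet used.

```python
import statistics


def protein_tags(days, abundance):
    """Compute a subset of the summary 'tags' for one protein's time series."""
    mean = statistics.mean(abundance)
    sd = statistics.stdev(abundance)
    q1, _, q3 = statistics.quantiles(abundance, n=4, method="inclusive")
    # Least-squares slope of abundance against day.
    day_mean = statistics.mean(days)
    slope = (
        sum((d - day_mean) * (a - mean) for d, a in zip(days, abundance))
        / sum((d - day_mean) ** 2 for d in days)
    )
    early = [a for d, a in zip(days, abundance) if d <= 7]
    late = [a for d, a in zip(days, abundance) if d >= 9]
    return {
        "average": mean,
        "sd": sd,
        "cv": sd / mean if mean else None,  # undefined when the average is 0
        "median": statistics.median(abundance),
        "slope": slope,
        "max": max(abundance),
        "min": min(abundance),
        "range": max(abundance) - min(abundance),
        "q1": q1,
        "q3": q3,
        "sum": sum(abundance),
        "avg_days_0_7": statistics.mean(early),
        "avg_days_9_15": statistics.mean(late),
    }


days = [0, 3, 5, 7, 9, 11, 13, 15]        # hypothetical sampling days
abundance = [2, 4, 6, 8, 10, 12, 14, 16]  # made-up values
tags = protein_tags(days, abundance)
```

Comparing `avg_days_0_7` with `avg_days_9_15`, or `median` with `average`, is the kind of side-by-side check the tag descriptions have in mind.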

I’m not sure how else I should proceed with this data. I could potentially look at gene enrichment, but I believe that a significant portion of the proteins should be eliminated beforehand. Knowing which proteins to eliminate can be difficult because each ‘tag’ can highlight a different trait of a protein. Therefore, eliminating proteins will mostly depend on future interests for this data set.

Kaitlyn’s Notebook: NMDS and Protein Quant

Just a refresher, I’ve been working with Rhonda’s 2016 oyster larvae data.

This experiment looked at oyster mortality at two temperatures: 23C and 29C. Proteomic work was done on 3 silos, two at 23C and one at 29C. When Silo 2 and Silo 3 are compared, we can see that Silo 2 had higher mortality. Silo 3 and Silo 9, which were at 23C and 29C respectively, had the same mortality rate of 10%. Therefore we decided to make an NMDS that looked at Silo 3 and 9 only. In the plot labels, the X is an artifact of the code, the first number is the silo number, and the number following the underscore is the day of the experiment.

It looks like Silo 3 had more days that were dissimilar from most days of the experiment. Silo 3 was at 23C. Rhonda previously reported that hatcheries are growing larvae at 29C because of the higher mortality rate that occurs at 23C. This NMDS plot shows that at the beginning of the experiment (day 3) and the end of the experiment (day 15), Silo 3 stood out. Silo 9 seems more consistently similar across the days of the experiment.

For comparison, here is the previous NMDS plot containing all of the silos:

In addition to the NMDS, I did a quick quantification of the different proteins expressed each day (under “Protein”) and the total protein abundance (under “Protein Abundance”).
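The two per-day numbers are simple column summaries: a count of proteins with non-zero abundance and a sum of all abundances. Here is a rough Python sketch with made-up values; the table layout (protein ID mapped to a day-to-abundance dictionary) is an assumption for illustration only.

```python
def daily_quantification(table):
    """Return {day: (number of proteins with abundance > 0, total abundance)}."""
    days = sorted({day for series in table.values() for day in series})
    summary = {}
    for day in days:
        values = [series.get(day, 0) for series in table.values()]
        summary[day] = (sum(1 for v in values if v > 0), sum(values))
    return summary


# Made-up abundances keyed by protein, then by day of the experiment.
table = {
    "P1": {3: 5.0, 15: 0.0},
    "P2": {3: 1.0, 15: 2.0},
    "P3": {3: 0.0, 15: 7.5},
}
summary = daily_quantification(table)
```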

Kaitlyn’s Notebook: Gene enrichment of unique proteins

I grouped proteins that had 0 abundance on day 1 by the number of days they were abundant and ran each group through CompGO for biological processes at a cutoff of 0.1. To see if there was a difference in the number of proteins expressed in each group, I made a small table. All values were similar. Highlighted values had p-values of at least 0.1.

I think it’s interesting that all silos had enrichment among proteins that were abundant for only 1 day. All silos had peptidyl-tyrosine dephosphorylation, or “the removal of phosphoric residues from peptidyl-O-phospho-tyrosine to form peptidyl-tyrosine”, at p-values greater than 1E-1.

Silo 2- 1 day of protein abundance:

Silo 3- 1 day of protein abundance:

Silo 9- 1 day of protein abundance:

Other enriched processes are:

Silo 2- cellular response to retinoic acid (6 days),

Silo 3- intracellular protein transport (4 days) and maturation of SSU-rRNA from tricistronic rRNA transcript (5 days),

and Silo 9: negative regulation of endopeptidase activity (7 days). This can be viewed below in respective order.

Silo 2- 6 days of protein abundance:

Silo 3- 4 and 5 days of protein abundance respectively:

Silo 9- 7 days of protein abundance:

Kaitlyn’s Notebook: Unique Expression

I have continued working with Rhonda’s data and did some gene enrichment analysis on any proteins that had abundance on any day of the experiment after 0 abundance on day 1. I used Animal Genome for GO terms and produced a graph based on those GO terms.

I also made a graph of GO terms for proteins that had 0 abundance on any day of the experiment after some abundance on day 1.

I also thought it would be worthwhile to examine which processes seemed to change overall. Therefore I combined the data and produced the following graph:

Although “biological process” isn’t descriptive, for many proteins it was the only GO term, which can be seen in the screenshots and charts here. Furthermore, many of the proteins I identified were not enriched, which is why I chose to analyze gene enrichment for proteins that appeared or disappeared at any point in the experiment.

However, I have now produced Excel sheets that can identify proteins that were expressed on only 1, 2, 3, 4 or all 5 days after no abundance on the first day. In other words, we can now look at proteins based on the number of days they appeared after 0 abundance. This is also separated by silo, as the earlier analysis was.

You can see that the original data was converted to presence/absence (0/1) values using R; based on the sum of those columns, we can then identify proteins that were abundant for 1, 2, 3, 4 or all 5 days. I included the original data so that we could see whether any of those proteins were abundant at very high levels, such as the protein in the second photo that was expressed on only 1 day but at an abundance of almost 40. I also included some annotations that we can look through as well.
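The conversion and grouping can be sketched as follows. This is a Python illustration of the idea only (the actual conversion was done in R), with made-up abundances and a hypothetical function name; each list holds one protein's abundance in day order, starting with the first day.

```python
from collections import defaultdict


def group_by_days_present(table):
    """Group proteins absent on the first day by how many later days they appear."""
    groups = defaultdict(list)
    for protein, series in table.items():
        first, *later = series
        if first != 0:
            continue  # only proteins with 0 abundance on the first day
        presence = [1 if v > 0 else 0 for v in later]  # presence/absence coding
        if sum(presence) > 0:
            groups[sum(presence)].append(protein)
    return dict(groups)


# Made-up abundances; the first value is the first measured day.
table = {
    "P1": [0, 0, 38.5, 0],  # appears on a single day, at high abundance
    "P2": [0, 1, 2, 3],     # appears on every later day
    "P3": [4, 0, 0, 1],     # present on the first day, so skipped
    "P4": [0, 0, 0, 0],     # never detected, so skipped
}
groups = group_by_days_present(table)
```

Keeping the original abundances next to the 0/1 values, as in the spreadsheet, is what lets a case like `P1` stand out despite appearing on only one day.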

Silo 2- unique expression based on days abundance

Here are the links for the other silos:

Silo 3- unique expression based on days abundance

Silo 9- unique expression based on days abundance

Kaitlyn’s Notebook: Unique Proteins that Appeared

I parsed out proteins from Rhonda’s data that had 0 abundance on day 1 but later had some measurable abundance on at least 1 day of the experiment. I ran the list of proteins I identified for each silo through CompGO. Many of the proteins did not have associated GO terms, which was disappointing since some of those proteins were very uniquely abundant in the experiment.

I recorded this in a Jupyter notebook entry.