Laura’s Notebook: SRM Analysis, revisiting correlations between abundance & environment

Taking a simpler approach to identifying correlations between environmental data and protein abundances. Today's goals were to:

  1. Subselect salinity data and generate summary stats so they can be included in the analysis
  2. Generate correlation plots with R^2 and P-values between differentially abundant proteins and environmental summary stats
  3. From plots, select env. summary stats to use in 1) multiple linear regression model & 2) structural equation model
  4. Generate correlation plots with selected env. variables to identify potential interactions
  5. Run multiple linear regression model
  6. Run structural equation model
  7. Interpret results from both
  8. Revise paper to include results from these tasks

1. Subselected salinity data, generated summary stats, included in analysis

In one of our team meetings we decided to ignore salinity data for the time being, since several probes malfunctioned. Until now I have run all tasks/stats on salinity data alongside the other parameters, but ignored salinity when analyzing correlations. Today I generated correlation plots and found that salinity may play a role, so I decided to clean up the salinity data as much as I could and include it in my analysis. To do so, I reviewed the raw salinity data via this plotly plot and determined that the following salinity data needed to be removed due to probe malfunction: all CIE; FBB > 7/3 @ 09:50; WBB > 6/25 @ 05:30. I then removed all outliers from the updated salinity time-series data and re-plotted. The plots before and after cleaning follow.

Raw salinity data:

image

Bad data & outliers removed:

image
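For reference, the cleaning steps above can be sketched in plain Python. This is only an illustration, not the code used for the notebook. Assumptions: the ">" in the cutoffs is read as "after"; the year (2016) is a placeholder, since none is given; outliers are flagged per site with Tukey's 1.5 x IQR rule, since the outlier method is not stated; and the function names and tuple layout are hypothetical.

```python
from datetime import datetime
from statistics import quantiles

# ASSUMPTION: the notebook gives no year for the cutoffs; 2016 is a placeholder.
YEAR = 2016

# Site -> cutoff timestamp. Readings AFTER the cutoff are dropped
# (reading ">" as "after"). None means the whole site is dropped.
CUTOFFS = {
    "CIE": None,
    "FBB": datetime(YEAR, 7, 3, 9, 50),
    "WBB": datetime(YEAR, 6, 25, 5, 30),
}


def drop_bad_probe_data(readings):
    """Keep (site, timestamp, salinity) tuples not flagged as probe malfunction."""
    kept = []
    for site, ts, salinity in readings:
        if site in CUTOFFS:
            cutoff = CUTOFFS[site]
            if cutoff is None or ts > cutoff:
                continue  # malfunctioning probe: skip this reading
        kept.append((site, ts, salinity))
    return kept


def drop_outliers(readings, k=1.5):
    """Drop readings outside [Q1 - k*IQR, Q3 + k*IQR], computed per site.

    ASSUMPTION: Tukey's rule stands in for whatever outlier method was used.
    """
    by_site = {}
    for row in readings:
        by_site.setdefault(row[0], []).append(row)

    kept = []
    for rows in by_site.values():
        values = [row[2] for row in rows]
        if len(values) < 4:
            kept.extend(rows)  # too few points to estimate quartiles
            continue
        q1, _, q3 = quantiles(values, n=4)
        low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
        kept.extend(row for row in rows if low <= row[2] <= high)
    return kept
```

Calling `drop_outliers(drop_bad_probe_data(readings))` applies the two steps in the same order as described above: known-bad probe data first, then outliers.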

2. Generated correlation plots with R^2 and P-values between differentially abundant proteins and environmental summary stats

Discovered the excellent ggscatter function in the ggpubr library to generate correlation plots. I generated correlation plots for the 3 differentially abundant proteins (HSP90, Puromycin-sensitive aminopeptidase, Trifunctional Enzyme), using Pep1 for each. Reminder: Pep1 is the peptide in that protein with the highest overall abundance across all samples.
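To make the numbers on these plots concrete, here is a rough Python sketch of a Pearson correlation, its R^2, and a p-value for one peptide against one environmental summary stat. It is not the ggpubr code behind the plots: the p-value here comes from a permutation test, a swapped-in technique chosen so the example needs only the standard library, and the function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either input is constant.
    return sxy / math.sqrt(sxx * syy)


def correlation_summary(x, y, n_perm=5000, seed=0):
    """Return r, R^2, and a two-sided permutation p-value.

    ASSUMPTION: a permutation test replaces the analytic test; it asks how
    often shuffling y gives a correlation at least as strong as the observed one.
    """
    r = pearson_r(x, y)
    rng = random.Random(seed)  # fixed seed so results are repeatable
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= abs(r) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (hits + 1) / (n_perm + 1)
    return {"r": r, "r2": r * r, "p": p_value}
```

With small sample sizes the permutation p-value is bounded below by `1 / (n_perm + 1)`, so `n_perm` should be large relative to the alpha being tested.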

3. From plots, selected env. summary stats to use in models

Selected the following environmental parameters per the respective proteins for model construction, using an approximate cutoff of alpha=0.01: % Growth, DO.sd, DO.var, pH.sd.1, salinity.mean, salinity.sd.1. Here are the correlation plots for these parameters, using HSP90 for the purposes of this example:

image image image image image image
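The selection step amounts to filtering variables by p-value. A minimal Python sketch, assuming the p-values are held in a dict keyed by variable name; the names and values below are made-up placeholders, not results from this analysis, and the function name is hypothetical. Because the cutoff was applied approximately, a strict threshold like this is only a first pass.

```python
ALPHA = 0.01  # cutoff from the text, applied here as a strict threshold


def select_variables(p_values, alpha=ALPHA):
    """Return the names whose correlation p-value is at or below alpha."""
    return sorted(name for name, p in p_values.items() if p <= alpha)


# Placeholder values for illustration only (NOT results from the notebook).
example = {"env_a": 0.004, "env_b": 0.2, "env_c": 0.009}
print(select_variables(example))  # ['env_a', 'env_c']
```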

4. Generate correlation plots with selected env. variables to identify potential interactions
