Grace’s Notebook: July 5, 2017

(1) Today I finished the oyster seed morphology measurements. Link to google doc spreadsheet: here (the separate day measurements are under “Cryotubes” tab). Link to google folder containing measured images: here.

Note*: After two separate attempts at imaging all the seed in Silo 9 Day 13, only 65 were able to measured and counted.

(2) I tried again to image the OA oyster larvae. Goal is to image 100 countable and measurable larvae. This is what it looks like using the dissecting scope:


I am unsure of the feasibility of getting 100 measurable larvae in one image.

(3) Am starting to look at Geoduck data.

Laura’s Notebook: Post Skyline Data Processing

Please see my Geoduck-DNR Repo, June Analyses folder for most up-to-date version of this process.

In this workflow I process Skyline data to assess proteins expressed in Geoduck during outplant in 5 sites throughout Puget Sound in 2 treatments: eelgrass, and bare (no eelgrass).

Pertinent files for ths process: R code, Biostats.R

Step 1: Export report from Skyline

screen shots TBD

Step 2: Add site/treatment to columns, remove #N/A, and normalize based on Total Ion Current

The raw file from Skyline, opened in Excel, looks like this:

I used Excel to make these small modifications. Yes, I could have dones this in R, but I didn’t (next time I will). I referenced my Mass Spec Sequence File to match file names to sample numbers, and then my Geoduck Organization file to match sample names to site and treatment. To edit all #N/A I used the Find & Replace tool. Skyline data ID'd and #N/A removed

Finally, I opened a new tab and normalized the data by the Total Ion Current (TIC) for each sample. TIC data was given to me from Emma. Skyline data normalized by TIC

I saved this tab as a new file, 2017-06-02_SKYLINE-Total-Protein-Area-NORM.csv2017-06-02_SKYLINE-Total-Protein-Area-NORM.csv, cleaned up the header with simplified sample codes: e.g. FB-E-1 & CI-B-2 for Fidalgo Bay Eelgrass technical rep 1, Case Inlet Bare technical replicate 2, respectively. Also notice that all the n/a spaces are showing up as 0. I’ll take care of this in R

Normalized Skyline Data for R

Step 3: Read data into R and play with it. I borrowed a lot of this script from Yaamini, with some modifications – thanks Yaamini!

 ### Import Dataset ### ### Note: this dataset has already been massaged in Excel: sum area by protein, normalized by TIC, all n/a's removed (replaced with zeros), and columns renamed setwd("~/Documents/Roberts Lab/Geoduck-DNR/Analyses/2017-June_Analyses") NormProtArea <- read.csv("2017-06-02_SKYLINE-Total-Protein-Area-NORM.csv", header=TRUE, na.strings = "0") # import local file head(NormProtArea)  

imported data

Step 4: Aggregate transition area by protein. Data currently represents the area under the curve for each transition. To find the total area per protein I use the aggregate() function to sum all transition areas by protein.

 ### SUM TRANSITION AREA BY PROTEIN ### # The "Area" values (which have already been normalized by TIC) are peak area for each transition. Sum them to determine total area for each protein NormProtAreaAgg <- aggregate(cbind(FB.E.1, CI.E.1, PG.B.1, SK.E.1, FB.B.1, WB.B.1, SK.B.1, CI.B.1, PG.E.1, WB.B.2, PG.E.2, FB.E.2, FB.B.2, CI.B.2, SK.E.2, PG.B.2, SK.B.2, CI.E.2) ~ Protein.Name, FUN = sum, data = NormProtArea, na.rm = TRUE, na.action = NULL) head(NormProtAreaAgg) #confirming that proteins are now summed  

agg data

Step 5: Average technical replicates. Each sample ran through the MS/MS twice; here I calculate the average for each treatment/site combo:

 #### AVERAGE REPLICATES #### # FYI No eelgrass sample for Willapa Bay bareCaseInlet <- ave(NormProtAreaAgg$CI.B.1, NormProtAreaAgg$CI.B.2) # create vector for each site/treatment that averages the replicates (of which there are 2) bareFidalgoBay <- ave(NormProtAreaAgg$FB.B.1, NormProtAreaAgg$FB.B.2) bareWillapaBay <- ave(NormProtAreaAgg$WB.B.1, NormProtAreaAgg$WB.B.2) bareSkokomishRiver <- ave(NormProtAreaAgg$SK.B.1, NormProtAreaAgg$SK.B.2) barePortGamble <- ave(NormProtAreaAgg$PG.B.1, NormProtAreaAgg$PG.B.2) eelgrassCaseInlet <- ave(NormProtAreaAgg$CI.E.1, NormProtAreaAgg$CI.E.2) eelgrassFidalgoBay <- ave(NormProtAreaAgg$FB.E.1, NormProtAreaAgg$FB.E.2) eelgrassSkokomishRiver <- ave(NormProtAreaAgg$SK.E.1, NormProtAreaAgg$SK.E.2) eelgrassPortGamble <- ave(NormProtAreaAgg$PG.E.1, NormProtAreaAgg$PG.E.2) NormProtAreaAggAveraged <- data.frame(NormProtAreaAgg$Protein.Name, bareCaseInlet, bareFidalgoBay, barePortGamble, bareSkokomishRiver, bareWillapaBay, eelgrassCaseInlet, eelgrassFidalgoBay, eelgrassPortGamble, eelgrassSkokomishRiver) # combine site/treatment vectors into new dataframe head(NormProtAreaAggAveraged)  

ave data

Step 6: Prep file to merge with annotations. The Protein ID in my Skyline files are formatted differently from the GeoID in my annotation file. Here I edit out the extra string (in bold), cds.comp100047_c0_seq1|m.5980

 ### EDITING GEOID NAME TO MATCH ANNOTATED FILE ### NormProtAreaAggAveragedGeoID <- NormProtAreaAggAveraged # copy database, create new one to be used to join to annotated file GeoID.a <- (gsub("cds.", "", NormProtAreaAggAveragedGeoID$NormProtAreaAgg.Protein.Name, fixed=TRUE)) # remove cds. from protein ID GeoID.b <- (gsub("\\|m.*", "", GeoID.a)) # remove |m.#### from protein ID noquote(GeoID.b) # remove quotes from resulting protein ID NormProtAreaAggAveragedGeoID[1] <- GeoID.b # replace the newly created vector of protein ID's in the full database head(NormProtAreaAggAveragedGeoID) # confirm changes nrow(NormProtAreaAggAveragedGeoID) # confirm # rows still same write.table(NormProtAreaAggAveragedGeoID, "", quote=F, row.names = F, sep="\t") # Save to file  

edit id

Step 7: Upload annotations. This file is the annotated proteins found via geoduck gonad transcriptome.

 ### UPLOAD ANNOTATED GEODUCK PROTEOME FROM URL ### GeoduckAnnotations <- read.table( "", sep="\t", header=TRUE, fill=TRUE, stringsAsFactors = FALSE, quote="") #fill=TRUE for empty spaces nrow(GeoduckAnnotations) # check that the # rows matches the source (I know this from uploading to Galaxy; not sure how else to do it) ncol(GeoduckAnnotations) # check that the # columns matches the source head(GeoduckAnnotations) # inspect; although it's easier to "view" in RStudio  

geo annotations file

Step 8: Merge the 8076 total proteins found in my samples with the annotations. After merge, there are 5,690 annotated proteins in my data. Annotated file is

 # MERGE ANNOTATIONS WITH ALL MY PROTEINS ### AnnotatedProteins <- merge(x = NormProtAreaAggAveragedGeoID, y = GeoduckAnnotations, by.x="NormProtAreaAgg.Protein.Name", by.y="GeoID") head(AnnotatedProteins) #confirm merge nrow(AnnotatedProteins) #count number of proteins that were matched with the list of annotations write.table(AnnotatedProteins, "", quote=F, row.names = F, sep="\t") #write out .tab file  

merged annotation

Step 9: Make NMDS plot to assess overall site relatedness. Again, thanks to Yaamini for providing me the code! Note: I saved the biostats.R code as an R script in my repo/working directory

  #### CREATE NMDS PLOT #### #Load the source file for the biostats package source("biostats.R") #Either load the source R script or copy paste. Must run this code before NMDS. library(vegan) #Make sure first column of protein names is recognized as row names instead of values - this takes the first column containin protein names, and assigns it to the row "header" names area.protID <- NormProtAreaAggAveraged[-1] rownames(area.protID) <- NormProtAreaAggAveraged[,1] head(area.protID) write.csv(area.protID, file="2017-07-04_NormProtAreaAgg&Ave.csv") #write this file out for safe keeping  


  #Transpose the file so that rows and columns are switched area.protID.t <- t(area.protID[,1:9]) nrow(NormProtAreaAggAveraged) # Compare row and column lengths before and after transposing ncol(NormProtAreaAggAveraged) nrow(area.protID.t) ncol(area.protID.t) #Make MDS dissimilarity matrix proc.nmds <- metaMDS(area.protID.t, distance = 'bray', k = 2, trymax = 100, autotransform = FALSE)  

draw nmds box

  #Make figure fig.nmds <- ordiplot(proc.nmds, choices=c(1,2), type='none', display='sites', xlab='Axis 1', ylab='Axis 2', cex=1) #bare=circle #eelgrass=triangle #case=red #fidalgo=blue #willapa=black #skokomish=green #gamble=magenta  

draw nmds box

  points(fig.nmds, 'sites', col=c('red', 'blue', 'black', 'green', 'magenta','red', 'blue', 'black', 'green', 'magenta'), pch=c(rep(16,5), rep(17,5))) legend(0.185,0.1, pch=c(rep(16,5), 1, 2), legend=c('Case Inlet', "Fidalgo Bay", "Willapa Bay", "Skokomish", "Port Gamble", "Bare", "Eelgrass"), col=c('red', 'blue', 'black', 'green', 'magenta', 'black', 'black'))  

apply pointsNMDS plot


  #Install package install.packages("pheatmap") library(pheatmap) #Data should be log(x) or log(x+1) transformed for this analysis. I thus invert my data (since I had normalized by TIC so values were <1), then log transform. #Invert data then log transformation for heat map area.protID.t.inv <- (1/area.protID.t) <- sapply(area.protID.t.inv, is.infinite) # change "Inf" to "NA" area.protID.t.inv.log <- data.trans(area.protID.t.inv, method = 'log', plot = FALSE) ncol(area.protID.t.inv.log) nrow(area.protID.t.inv.log) #Create heatmap pheatmap(area.protID.t.inv.log, cluster_rows = T, cluster_cols = T, clustering_distance_rows = 'euclidean', clustering_distance_cols = 'euclidean', clustering_method = 'median', show_rownames = T, show_colnames = F)  


  #Export preliminary heatmap as a .png png(filename = "2017-07-04_Heatmap-by-median.png") pheatmap(area2.tra2, cluster_rows = T, cluster_cols = T, clustering_distance_rows = 'euclidean', clustering_distance_cols = 'euclidean', clustering_method = 'median', show_rownames = T, show_colnames = F)  

heatmap image

Step 11: Calculate coefficients of variances & means, and plot. I did this to visualize how variable the proteins are- plot TBD

  ### Find COEFFICIENT OF VARIANCES & MEANS ### ProtVariance <- c(var(area2.tra2[-1], area2.tra2[1], na.rm=TRUE)) ProtMeans <- c(colMeans(area2.tra2[-1], na.rm=TRUE)) ProtVar.Means <- data.frame(,y) ProtVar.Means <- data.frame(NormProtAreaAggAveraged[1], ProtMeans, ProtVariance) nrow(NormProtAreaAggAveraged[1]) length(ProtMeans) length(ProtVariance) head(ProtMeans) ncol(ProtMeans)  

from The Shell Game

Laura’s Notebook: July Goals

Happy July! file_000 5

Goals for the month:

  • “Sync” online lab notebook with hard copy lab notebook from the last week, and be more diligent about updating daily
  • ID final list of proteins for next round on Mass Spec
  • Complete final steps geoduck protein samples: speed vac, desalt, add standards
  • Execute next round of MS/MS
  • Dig into lit review for geoduck project
  • Use ImageJ to measure size for SN larval growth experiment
  • Revisit my draft O. lurida paper to update, particularly the methods
  • Begin to analyze O. lurida spawning data – are there significant differences between: populations? treatments?
  • Complete the larval stage of my experiment! Anticipated completion July 17th
  • Keep juvenile oysters alive 🙂
  • Visit Big Beef Creek to assess location/gear needed for shellfish farm
  • ID potential hosts for the GROW program, research their current projects, and brainstorm project ideas
  • Nail down the dive class schedule with Will & Lindsay

from The Shell Game