Kaitlyn’s notebook: ASCA in MetStaT

I downloaded the R package MetStaT to perform an ASCA. I’m currently working through how best to organize the data because you input two dataframes. This is the example I’m working with.

I believe the first dataframe or the ‘data’ will only contain a list of the proteins and abundances (“variables are represented by columns, observations by rows”). Proteins will be in columns and abundances will be in rows, because observations are measured values (e.g. abundance) and variables are what is observed/measured (e.g. proteins). I originally had this the other way around and wasn’t sure whether order would be considered; I planned to look into it more since the ASCA is supposed to take time into consideration (that was the whole point).

The second dataframe or ‘levels’ will contain the temperature data for each protein/observation (“numeric matrix describing the experimental design. Each factor is represented by a column. The elements of the columns give the treatment level the row belongs to”). There will be one column representing a single factor (temperature) and the elements will be 23 or 29, but I’m not sure how/if I can make time a factor.

I think the column would have to be “3, 5, 7, 9, 11, 13, 15” repeating, and I need to make sure that the order of proteins is the same for both dataframes so that the elements (23 or 29C) correctly match the factor (temperature). The data will only be described by temperature.

This will be okay since I will only use Silo 3 and 9. If I decide to do an ASCA between the 23C silos then I will make the elements “silo 2” or “silo 3” for the factor.
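One way to picture the two inputs is sketched below in Python (used only for illustration; MetStaT itself is an R package). The sketch assumes time can be added as a second factor column, which these notes have not confirmed, and the protein names and zero abundances are placeholders rather than real data.

```python
from itertools import product

# Design values taken from the notes above.
temperatures = [23, 29]
days = [3, 5, 7, 9, 11, 13, 15]

# Hypothetical protein names; the real ones would come from the datasheet.
proteins = ["protein_A", "protein_B", "protein_C"]

# 'levels': one row per observation, one column per factor.
# Column 1 = temperature; column 2 = day (assumption: time is usable as a factor).
levels = [[temp, day] for temp, day in product(temperatures, days)]

# 'data': one row per observation in the SAME order as 'levels',
# one column per protein. Zeros stand in for the abundances.
data = [[0.0 for _ in proteins] for _ in levels]

# The two tables must line up row for row.
assert len(data) == len(levels)

for row in levels[:3]:
    print(row)
```

Building both tables from the same list of (temperature, day) pairs is one way to guarantee the row order matches, which is the concern raised above.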

Equation elements are specified as a string that indicates which factors to use in the ASCA. A factor is specified by its column number (e.g. ="1"), interacting factors by combining column numbers (e.g. ="123"), and multiple terms can be entered separated by commas (e.g. ="1,2,12").

ASCA.Calculate(data, levels, equation.elements ="")
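To make the string notation concrete, here is a small Python sketch that splits an equation.elements value into its terms. It only illustrates the notation as described above; it is not MetStaT's own parser, the function name is hypothetical, and it assumes each digit names one column (so it would not handle ten or more factors).

```python
def parse_equation_elements(spec):
    """Split a string such as "1,2,12" into tuples of factor column numbers."""
    terms = []
    for term in spec.split(","):
        term = term.strip()
        if term:  # skip empty pieces, e.g. from an empty string
            # Assumption: each digit is one column, so "12" means
            # the interaction of column 1 and column 2.
            terms.append(tuple(int(digit) for digit in term))
    return terms


print(parse_equation_elements("1,2,12"))  # [(1,), (2,), (1, 2)]
```

Read this way, "1,2,12" asks for factor 1, factor 2, and their interaction as three separate terms.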

I wrote up an issue for help in our github.

Kaitlyn’s notebook: MetaboAnalyst “results” (not feasible with my data)

Univariate analyses

I’m going to go through the results I looked at based on the order you can choose them: 1) ANOVA, 2) Correlation Analysis, 3) Pattern Searching.

After you select ANOVA, this is the first plot Metbo shows you. I believe it is only useful for MS data. However, I chose Tukey’s test since it uses a pooled estimate for the variance and controls the family-wise error rate, whereas Fisher’s LSD test is a series of pairwise t-tests and is not good for more than 3 groups.
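A rough bit of arithmetic shows why uncorrected pairwise tests get worse as groups are added. The Python sketch below counts the pairwise comparisons for k groups and estimates the chance of at least one false positive at alpha = 0.05. It assumes the comparisons are independent, which pairwise tests sharing groups are not, so treat the numbers as an illustration rather than an exact rate. The function names are hypothetical.

```python
def pairwise_comparisons(groups):
    """Number of distinct pairs among `groups` groups."""
    return groups * (groups - 1) // 2


def familywise_error(groups, alpha=0.05):
    """Chance of at least one false positive across all pairs,
    assuming (as a simplification) independent tests."""
    m = pairwise_comparisons(groups)
    return 1 - (1 - alpha) ** m


# 3 groups versus 7 groups (e.g. seven sampling days treated as groups).
for k in (3, 7):
    print(k, pairwise_comparisons(k), round(familywise_error(k), 3))
```

With 7 groups there are 21 pairs, and under this simplification the chance of at least one spurious difference is roughly two in three, which is the kind of inflation Tukey’s test is designed to control.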

The second plot gives me p-values for each protein. These p-values would tell me how different the proteins are from each other, but we have to remember that our silos are considered replicates here. So one protein is the pooled abundance from all three silos. This means that if two proteins were shown as different, they had significantly varied abundance when averaged across all silos.

This isn’t really answering my question. I would have to reorganize the data such that temperature is the group; however, I do not have three replicates for each temperature, and I could only include 1 silo since the proteins are duplicated for each silo.

After chatting with Roberto a little bit, I can see that none of these stats are going to be feasible with my data because of the lack of biological replicates. I’m going to post the results I got for future reference if I ever get the mass spec files or if someone else wants to know what Metbo can do (and because it took a lot of time to do this and interpret/understand the plots).

I’m going to move on to writing a script for ASCA.

[Image: ANOVA-1-Tukey]
[Image: ANOVA-1-protein_by_protein]

Here is the table this spits out.

  • This is the correlation heat map. It doesn’t make much sense to me or look very useful. [Image: correlation-1-nopatt]
  • These are the plots for pattern analysis. This plot is supposed to measure the correlation between the temperature and proteins. The first plot has Pearson’s correlation coefficient on the X axis and each protein on the Y axis. Temperature is a perfect 1 since it is what we are correlating the proteins against. You can see the rest of the proteins listed correlate very well with temperature except for the last protein, which has very little correlation with temperature.
  • The second plot shows the “concentration” (or in this case abundance) of each protein, and a table of the values for each protein is here. [Image: pattern-1-options] [Image: pattern-1-table-protein]
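The idea behind the pattern plot can be sketched in a few lines of Python: compute Pearson’s r between the temperature pattern and each protein’s abundance, then rank by strength. This assumes the pattern search is a plain Pearson correlation, as the plot axis suggests. The numbers and protein labels are made up for illustration and are not values from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up values, for illustration only.
temperature = [23, 23, 23, 29, 29, 29]
abundance = {
    "tracks_temperature": [10, 11, 9, 20, 21, 19],
    "ignores_temperature": [5, 9, 7, 7, 5, 9],
}

# Rank proteins by the strength of their correlation with temperature.
ranked = sorted(
    ((name, pearson(temperature, values)) for name, values in abundance.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(name, round(r, 3))
```

Correlating temperature with itself gives exactly 1, which is why it sits at the top of the plot as the reference pattern.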

PCA analysis:

  • These are the PCA plots.




Cluster Analysis




Random Forest


Kaitlyn’s notebook: MetaboAnalyst procedure

My datasheet needed to be manipulated before using MetaboAnalyst (I assume I will need it organized this way for some other stats I will run in the future since MetaboAnalyst uses R). They have a helpful link for data formats.

I had to include all 3 silos because MetaboAnalyst requires that each group (in my case each day) have 3 replicates. In this case my “replicates” will be each silo, but the replicates are considered different samples and thus they will be analyzed separately and not grouped.

Excel tips:

=RIGHT(A1,LEN(A1)-3) <– removes first three characters in a cell
=LEFT(A1,LEN(A1)-3) <– removes last three characters in a cell
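For anyone doing the same cleanup outside Excel, the two formulas map onto string slicing. The Python sketch below is only an equivalent illustration; the helper names and the sample label are hypothetical.

```python
def drop_first(text, n=3):
    """Like =RIGHT(A1,LEN(A1)-n): remove the first n characters."""
    return text[n:]


def drop_last(text, n=3):
    """Like =LEFT(A1,LEN(A1)-n): remove the last n characters."""
    return text[:-n] if n > 0 else text


label = "S3_protein_D5"  # hypothetical sample label
print(drop_first(label))  # protein_D5
print(drop_last(label))   # S3_protein
```

One difference to keep in mind: for a cell shorter than three characters the Excel formulas return an error, whereas the slices return an empty string.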

  • Time-series + one experimental factor (samples in columns):



  • Upload your data.
    • Data type = spectral bins (no data type will match my data since this is made for mass spec files; I am only using it for statistical analysis)
    • Format = Samples in rows (unpaired)
  • Next: [Image: Unpaired-Data-Integrity-Check-Metbo]
  • Step 2: I accepted the default settings for missing value estimation.


  • Step 3: SD because I’m most interested in the trends of abundance over time.
  • Step 4-Normalization: My data is already normalized so I didn’t select any options. [Image: 4-no-normalization]

    I don’t think these graphs matter since they are again looking for MS peaks, but here they are anyway. [Image: 4-normalization-feature_view] [Image: 4-normalization-sample_view]

  • Step 5: analyze. Here are the types of analyses you can do now! [Image: 5-analyses]

Sam’s Notebook: Bedgraph – Olympia oyster transcriptome (FAIL)


Progress on generating bedgraphs from our Olympia oyster transcriptome continues.

Transcriptome assembly with Trinity completed 20180919.

Then, aligned the assembled transcriptome to our genome using Bowtie2.

Finally, I used BEDTools to convert the BAM to BED to bedgraph.

This required an initial indexing of our Olympia oyster genome FastA using the samtools faidx tool.

SBATCH script file:

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=20180924_oly_bedgraphs
    ## Allocation Definition
    #SBATCH --account=srlab
    #SBATCH --partition=srlab
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=5-00:00:00
    ## Memory per node
    #SBATCH --mem=500G
    ##turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/20180924_oly_RNAseq_bedgraphs

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo ${PATH} | tr : \\n >> system_path.log

    # Set genome assembly FastA
    oly_genome_fasta=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa

    # Set indexed genome assembly file
    oly_genome_indexed=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa.fai

    # Set sorted transcriptome assembly bam file
    oly_transcriptome=/gscratch/scrubbed/samwhite/20180919_oly_transcriptome_bowtie2/20180919_Olurida_v081.sorted.bam

    # Set program paths
    bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin
    samtools=/gscratch/srlab/programs/samtools-1.9/samtools

    # Index genome FastA
    ${samtools} faidx ${oly_genome_fasta}

    # Format indexed genome for bedtools
    ## Requires only two columns: name, length
    awk -v OFS='\t' {'print $1,$2'} ${oly_genome_indexed} > Olurida_v081.fa.fai.genome

    # Create bed file
    ${bedtools}/bamToBed \
    -i ${oly_transcriptome} \
    > 20180924_oly_RNAseq.bam.bed

    # Create bedgraph
    ## Reports depth at each position (-bg in bedgraph format) and report regions with zero coverage (-a).
    ## Screens for portions of reads coming from exons (-split).
    ## Add genome browser track line to header of bedgraph file.
    ## NOTE: the -i input below (20180924_oly_RNAseq.bed) does not match the file
    ## created above (20180924_oly_RNAseq.bam.bed), and the output redirect reuses that same name.
    ${bedtools}/genomeCoverageBed \
    -i ${PWD}/20180924_oly_RNAseq.bed \
    -g Olurida_v081.fa.fai.genome \
    -bga \
    -split \
    -trackline \
    > 20180924_oly_RNAseq.bed
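The awk step in the script keeps only the first two columns of the samtools .fai index (sequence name and length), which is the two-column genome file format that genomeCoverageBed reads through -g. A minimal Python sketch of that same reshaping is below; the function name and the contig names and lengths are hypothetical, made up for illustration.

```python
def fai_to_genome_lines(fai_lines):
    """Yield 'name<TAB>length' for each line of a samtools .fai index.

    A .fai line is tab-separated and starts with the sequence name and
    its length; the remaining columns are not needed here.
    """
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        yield f"{fields[0]}\t{fields[1]}"


# Made-up index lines, for illustration only.
example_fai = [
    "contig1\t1500\t9\t60\t61\n",
    "contig2\t820\t1544\t60\t61\n",
]

for genome_line in fai_to_genome_lines(example_fai):
    print(genome_line)
```

Working on an iterable of lines keeps the sketch independent of any particular file path; an open file handle can be passed in directly.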

Alignment was done using the following version of the Olympia oyster genome assembly:

Sam’s Notebook: Transcriptome Alignment – Olympia oyster Trinity transcriptome aligned to genome with Bowtie2


Progress on generating bedgraphs from our Olympia oyster transcriptome continues.

Transcriptome assembly with Trinity completed 20180919.

Next up, align transcriptome to Olympia oyster genome.

Alignment and creation of BAM files was done using Bowtie2 on our HPC Mox node.

SBATCH script file:

Alignment was done using the following version of the Olympia oyster genome assembly:

Kaitlyn’s notebook: Cheese grater and kmeans

Cheese Grater

I’m using a DVR disk to download Ubuntu on Cheese Grater with these instructions (pretty vague). Note that I cannot use Boot Camp Assistant because I don’t have a Windows installation disk, which it prompts you for here: [Image: 20180924_162253.jpg]

So I read on this forum that you can just manually partition a disk as MS-DOS FAT32. When I tried to install Boot Camp, it wouldn’t let me select a partition size less than 20GB, so I chose to partition the SSD2 disk and create a 25GB partition called W-MS-DOSFAT. The W is for Windows and the rest is for the MS-DOS (FAT) format. This is some info on partitions.


I tried holding down C while rebooting and the CD drive opened.

Next, I held alt and only the drives opened as a choice for booting. I selected the SSD2 on the off chance I could get to the new partition, but that didn’t work, and the CD drive opened at the login screen.

I repeated these steps and got the same result. I’m not really sure what to do next. I’m going to try reading more on reformatting the USB to see if there’s another way to fix that so it is readable unless anyone else has any suggestions…

Kmeans clustering

What seemed like the easiest stats to run on the 2016 oyster data is not turning out to be so. I wrote down the issues I’m having here.


I also reorganized my GitHub today and uploaded all the line plot data so all the code and images can be easily found.