More proteomics software fun.

So we finally have Pecan up and running on Emu! It took some doing, however. Here’s a short list of the things I had to do to make it work.

First, Grid Engine. What a pain. It’s very picky about host IPs and resolved names matching up. I eventually found an installation and configuration guide that worked.

Second, Pecan requires pymzml, a Python package. However, the version distributed through pip does not include the psi-ms-4.0.1.obo file that Pecan needs to run. The file, and the associated issue filed with the pymzml developers, can be found here.

Then came the pecanpie oddities. Pecanpie is an accessory program that sets up a series of job files for Grid Engine. It adds an argument to each Grid Engine call, -l mem_requested=xxG. I originally thought this was passed on to the Pecan or Percolator programs, but after talking to the developer of Pecan, I learned it is an argument for Grid Engine itself, and depending on the version it can also be mem_free. There is a place to change this in the config file for Pecan, but the change requires a re-compile of Pecan to take effect.

Emu is currently running full tilt on Laura’s proteomic data, using 14 of 16 cores at the moment and every scrap of free memory. Pecan originally wanted 92 GB of memory to run, so hopefully it will behave with only half of that. Hyak will definitely be exciting for future proteomics work.

Total Alkalinity and Proteomics.

Yesterday I met with Micah at DNR in Olympia, got my badge to access the lab for measuring TA, and got acclimated to the process. I wrote a walkthrough for the process, which can be found here.

Aside from writing the walkthrough this morning, I helped Laura get Pecan working for DIA proteomic analysis. Interesting program, that Pecan. It requires, among other things, a text file that lists the .mzML files that need to be processed. It appears to prepend the text file’s directory to each .mzML path, which creates some odd situations.

For example: if your text file has ~/Documents/Proteomics/Converted/Sample1.mzML in it, and the text file is located in ~/Documents/Proteomics/, then Pecan looks in ~/Documents/Proteomics/~/Documents/Proteomics/Converted/ for the sample file. That doesn’t work.

We eventually got it to work by copying all of the .mzML files into the same directory as the text file and having the text file consist of just a list of file names, with no directory attached.
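The workaround above can be scripted. Here’s a minimal Python sketch, assuming all of the .mzML files already sit in one directory. The function name write_sample_list and the list file name samples.txt are made up for illustration and are not anything Pecan itself defines.

```python
from pathlib import Path


def write_sample_list(mzml_dir, list_name="samples.txt"):
    """Write the bare .mzML file names into a text file in the same directory.

    list_name is a hypothetical default, not a name Pecan requires.
    """
    mzml_dir = Path(mzml_dir).expanduser()
    # Keep only the file names, with no directory part, one per line.
    names = sorted(p.name for p in mzml_dir.glob("*.mzML"))
    list_path = mzml_dir / list_name
    list_path.write_text("".join(name + "\n" for name in names))
    return list_path
```

Calling write_sample_list("~/Documents/Proteomics") would leave a samples.txt next to the .mzML files, so whatever directory gets prepended to each entry still points at a real file.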

Next hurdle: Pecan estimates it needs 92 GB of RAM to run. Emu only has 48 GB, so we’ll have to see if that’s enough to appease Pecan.

Big news!

We heard from Micah at DNR today. He’s got my temporary badge ready, and we’re scheduled to meet next Wednesday to go over TA sample processing again. After that I’ll be able to run samples independently!

Also, I processed the combined Day 10 and Day 135 analysis. I ended up with 401 DMRs that had significant p-values, but none passed the q-value/FDR test, which is a bummer. I left Emu churning on a gap statistic analysis for potential clustering. I’ll see what that looks like when it’s done!


So, I was running our Day 145 Geoduck data when I realized that I had applied the coverage cutoff a little too early in my data curation. As a result, I removed potential hypo-methylated sites. Not good. So I re-ran the Day 10, 135, and 145 data, and the result? More DMLs! Up to 267 from 41 for the Day 10 data. Maybe a good thing in the end? We’ll see.

While I was re-running, I looked into potential clustering in the revised Day 10 data. I tried to determine a proper number of clusters through a couple of different methods, but the results didn’t look very convincing to me, and the gap statistic method never converged for any values of K between 0 and 100.

notebook entry here:

Finally, I reorganized my scaphapoda directory on Owl, because it was a nightmare. Geoduck-related stuff can now be found in ``. Readme files explaining the directory and what everything means will be forthcoming!

Proteomics software.

Unfortunately, I have little to show for today’s work. Most of my day was spent trying to get MSConvert, a native Windows program, to work in Linux through Wine. It did not go well. Getting Wine to interact nicely with MSConvert is nightmarishly difficult, and after reading some stuff on WineHQ, this is apparently the nature of Wine: every piece of software requires its own individual tweaks and work-arounds to get installed.

I did get a semi-working clustering graph for Day 10 DMLs for Hollie’s Geoduck stuff. I just have to convince ggplot to keep the factors in the order given rather than re-ordering them.

How to convert .raw files once more

or: why you never use sudo with wine.

So, this morning has been a learning experience for me. Steven wanted to do some proteomics stuff on Emu which brought up all kinds of exciting shortcomings of Linux, Wine, and my system administration skills!

It did have a couple of bonus upsides however! We learned how to skip the .raw to .mzXML step, which is nice. That’s done via

WINEPREFIX=~/.wine32 wine /home/shared/comet/comet.2016012.win32.exe -Pcomet.params.file -Dproteome.file sample.raw


But the developer for comet suggested converting from .raw to .mzXML prior to running comet, as wine requires enough system overhead to slow down the process, so we went back to the old plan of using ReAdW to convert. It’s here that I learned that wine is… picky to say the least.

When you install it, Wine creates a user-specific environment that only you have access to and control over. I should have realized this from the WINEPREFIX=~/.wine32 alone, but apparently not. When I had been working on this previously, I was testing on both the srlab account and my own, so both environments got built organically as I worked through problems. I either forgot that this was the issue or never realized it, but now I know! And here’s how to fix it.

Pretend we have a new user, bob. He logs in to his account and wants to convert some .raw files.

First, we need to create his 32-bit Wine environment and run winetricks to install a bunch of needed libraries (.NET 2.0, .NET 2.0 SP1 and SP2, and the Visual C++ runtime environments from 2008 and 2010) via

WINEARCH=win32 WINEPREFIX=~/.wine32 winetricks dotnet20 dotnet20sp1 dotnet20sp2 vcrun2008 vcrun2010

After clicking Yes a bunch of times, we now have a basic Windows environment with the parts needed to run the subsequent programs.

Next we need to install the Thermo DLLs needed for converting .raw files to .mzXML by running MSFileReader.exe, found on Emu in /home/shared/MSFileReader/.

That’s done via WINEPREFIX=~/.wine32 wine /home/shared/MSFileReader/MSFileReader.exe

Click through like you would installing any program, and we’re almost there!

You should now be able to run ReAdW and convert happily away via

WINEPREFIX=~/.wine32 wine ReAdW.2016010.msfilereader.exe sample.raw sample.mzXML
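With more than a handful of .raw files, the ReAdW call above gets tedious to type. Here’s a rough Python sketch of a batch wrapper. The helper names readw_command and convert_all are hypothetical, and the sketch assumes the ReAdW executable can be reached from the directory holding the .raw files (pass a full path as readw_exe if it lives elsewhere). It only mirrors the command shown above and has not been run against real data.

```python
import os
import subprocess
from pathlib import Path

# Executable name taken from the command above; where it lives is an assumption.
READW_EXE = "ReAdW.2016010.msfilereader.exe"


def readw_command(raw_file, prefix="~/.wine32", readw_exe=READW_EXE):
    """Build the environment and argument list for one ReAdW conversion."""
    raw_name = Path(raw_file).name
    mzxml_name = Path(raw_name).with_suffix(".mzXML").name
    # Same per-user prefix as in the manual command.
    env = dict(os.environ, WINEPREFIX=os.path.expanduser(prefix))
    argv = ["wine", readw_exe, raw_name, mzxml_name]
    return env, argv


def convert_all(raw_dir, dry_run=True):
    """Convert every .raw file in raw_dir, running from inside that directory."""
    raw_dir = Path(raw_dir).expanduser()
    for raw_file in sorted(raw_dir.glob("*.raw")):
        env, argv = readw_command(raw_file)
        if dry_run:
            print(" ".join(argv))  # show the command without running wine
        else:
            subprocess.run(argv, env=env, cwd=raw_dir, check=True)
```

The dry_run default prints the commands first, which is a cheap way to check the file names before committing to a long conversion. And of course it should be run as the user who owns the prefix, never through sudo.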


This took a while because someone, probably me, ran wine as root via sudo wine ..., which is highly, highly inadvisable. When wine creates files under sudo, root owns those files, even though they’re located in your personal directory. Wine will then be unable to change them in the future when run as you, leading to all kinds of unintended consequences. Don’t do it. No sudo wine.
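One way to check whether a prefix has been hit by this is to look for files that the current user doesn’t own. Here’s a small Python sketch of that check. The function name files_not_owned_by is made up, and it only reports the files; it doesn’t try to fix anything.

```python
import os


def files_not_owned_by(root, uid=None):
    """Return paths at or under root whose owner is not uid (default: current user)."""
    if uid is None:
        uid = os.getuid()
    mismatched = []
    # Check the top-level directory itself, then everything beneath it.
    if os.lstat(root).st_uid != uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects symlinks themselves instead of following them.
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return mismatched
```

Running files_not_owned_by(os.path.expanduser("~/.wine32")) should return an empty list for a healthy prefix. Anything it does return is a candidate for handing back to the right user with chown.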

Making bedgraphs and q-value results.

So I had a couple of projects to tackle today. First, Steven wanted some new bedgraphs for the Day 10 samples he’s working on annotating. The new ones can be found here, and the notebook with the code that generates them, along with an explanation of how they are made, can be found here.

Next was playing with the qvalue package in R. The q-value is a way to estimate the false discovery rate in an analysis with multiple comparisons. Given that we look at a minimum of ~2000 loci, FDR is an important consideration. In my notebook here, I work through the sample data supplied with the package to get an idea of what I should expect, and then work through our 10x and 5x coverage data.
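To get a feel for what an FDR correction does to a list of p-values, here’s a minimal Python sketch of the Benjamini-Hochberg adjustment. This is not what the qvalue package computes: qvalue also estimates the proportion of true nulls, so its q-values can be smaller than these. It is a simpler relative, shown only for intuition, and the function name bh_adjust is made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each p-value gets scaled by the number of tests divided by its rank, so with thousands of loci a raw p-value just under 0.05 can easily end up well above 0.05 after adjustment.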

The end result: our results look nothing like the sample results.

Their results:


Our Results:


My next plan, on Brent’s advice, is to simulate a range of data, starting with what the sample data looks like and working my way towards what my data looks like (% significant, number of comparisons, distribution of p-values), and see if I can find a trend that explains why we’re seeing what we do.
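A simulation along those lines could start from something like the Python sketch below. It is only a starting point under loud assumptions: null p-values are drawn as uniform, "real" effects are drawn from a Beta distribution piled up near zero, and the function names and the Beta parameters (0.5 and 20) are arbitrary choices for illustration, not anything fitted to the sample data or to ours.

```python
import random


def simulate_pvalues(n_tests, frac_signal, seed=0, alpha=0.5, beta=20.0):
    """Simulate p-values as a mix of uniform nulls and near-zero 'signal' values."""
    rng = random.Random(seed)
    n_signal = round(n_tests * frac_signal)
    # Signal p-values: Beta(alpha, beta) with these defaults sits close to 0.
    pvals = [rng.betavariate(alpha, beta) for _ in range(n_signal)]
    # Null p-values: uniform on [0, 1).
    pvals += [rng.random() for _ in range(n_tests - n_signal)]
    rng.shuffle(pvals)
    return pvals


def fraction_below(pvals, cutoff=0.05):
    """Fraction of p-values under the cutoff."""
    return sum(p < cutoff for p in pvals) / len(pvals)
```

Stepping frac_signal and n_tests from values resembling the sample data towards values resembling ours, and feeding each set through the same FDR procedure, would show where the behavior starts to change.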

Also, I plan on re-reading the paper around the q-value development, which you can find here if you’re feeling particularly masochistic.