Running Trinity and Trinotate on…

Running Trinity and Trinotate on a sample dataset:

Looks like I got Trinity and Trinotate to run, mostly, on a small sample data set provided by the developers. RNAMMER isn’t working, so I still have to fix that, but it’s progress!

R notebook: here

Results file: results.txt

Installing Trinity and Trinotate on…

Installing Trinity and Trinotate on Emu:

Looks like we’re getting some flounder RNA-seq data sometime soon, so Steven asked me to install and play with the Trinity and Trinotate pipeline. Install order/issues will be documented below. Some programs were already installed at the time of this writing, but I documented as though it was a complete install.

1. Install Trinity from here.
2. Install Jellyfish from here.
3. Install Trimmomatic from here.
4. Install Trinotate from here.
5. Install HMMER from here. Note: This is a different site than listed on the Trinotate installation instructions, but their site was down at the writing of this. Hopefully this one will work.
6. Install TransDecoder from here.
7. Install SQLite via sudo apt-get install SQLite.
8. Install NCBI-BLAST+ from here.
9. Install optional programs signalP, tmhmm, and RNAMMER found at this site. Note: Each program requires email permission by the developer which is only good for 4 hours. Kind of a pain, but oh well. RNAMMER also requires some hacking, which is described in detail on the Trinotate website.

Building the Trinotate protein database:

sudo ./admin/ Trinotate is how you’re supposed to be able to build the custom SwissProt and PFam database required by Trinotate, but there were a few missing perl modules. So prior to running this, sudo perl -MCPAN -e 'install DBD::SQLite', sudo perl -MCPAN -e 'install Bundle::DBI'

Everything looks like it installed ok, and after adding everything to the $PATH in /etc/environment, rebooting, and realizing I’d forgotten to save this midway through the reboot (thankfully it saved). I believe everything is done!

Next step, testing.

More RADMeth strangeness: I finished…

More RADMeth strangeness:

I finished the Day 135 samples in methcounts and ran them through the RADMeth program a few different ways. First (not really, it just happened to be faster) I looked at DMRs in Low vs Super Low pH treatments. The results were… even stranger than yesterday’s for Day 10 stuff.


That’s the one DMR between the two samples. With… 0 methylation in all samples? I read a ton of messages on the methpipe boards and no one has encountered a similar experience, but it looked like most of them were doing mouse/human stuff with super robust genomes to reference, so maybe that’s our issue? Something like RADMeth assuming chromosomes in the genome and maybe large swaths of methylation to compare against. Just an idea? Still trying to figure out how to look at/display the pre-merged DML data from RADMeth.

I also looked at hypomethylation in the Day135 samples. Bed and IGV XML files are found here.

Might be some interesting results here, I’m working on a good way to compare Day10 and Day135 samples, Maybe just by scaffold, since in all likelihood there won’t be overlap reliable.


Will update this with the R notebook once the Day 135 Ambient/Non-Ambient samples are finished running in RADMeth.

Kaitlyn’s Notebook: Moving on to Jupyter

I’ve been working with Bash the last couple of days. I’m running Bash on my Windows (which is capable of running a Linux based shell with the latest Windows update). There are definitely still some challenges. I found out you must change your home directory due to Windows file organization, but I am able to navigate my PC through Bash fairly effectively. I can now create directories and delete files, however I am still confused about utilizing pipes and filters in Bash so I’m still attempting to understand/work through this on Bash as well as opening files through the terminal. Most functions transfer over well from Linux to Windows except opening files. There is a patch from Windows as well as a few workarounds including running cbwin (another terminal) which runs through Bash. While it is a lot of work, running linux through Windows seems like the better option once I figure out these bugs because of the ability to use linux based commands which permits more actions.

While continuing to manage Bash, I’ve also downloaded Anaconda3 which includes Python 3.6 and Jupyter. I’m familiarizing myself with Jupyter (which I open through the GUI system until I figure out Bash). Once I am more comfortable with Jupyter, I will move onto running Blast so that I can run the oyster proteomics data through Blast and hopefully identify some proteins!

RADMeth DMR in IGV: Looks…


Looks like I may have spoken too soon. I loaded up the RADMeth produced DMRs in IGV and the DMRs didn’t even remotely line up to sites with methylation in the Day 10 samples. This may be an artifact of the merging process, or something being wonky with the model. Off the top of my head, potential oddities for the model may include a lack of nearby neighbors for the p-value adjusting step, or some oddity in trying to calculate DMRs from a very sparse set of loci.

I’m going to try to figure out how to load up either the regression or the adjustment steps in IGV and see if that shines any light on the problem.


IGV session XML file

RADMeth output for Day 10:…

RADMeth output for Day 10:

Over the last weekend I ran RADMeth on the day 10 samples I had converted and got some results! This was a little surprising time wise, as the RADMeth manual suggests you run the program on a cluster with “a few hundred” nodes available. Maybe that was just wishful thinking on their part, or something else? I’m not honestly sure.

The program doesn’t really have any screen output, so I’ll just paste the little bit of code and the output here for brevity’s sake. The merging and whatnot was done in the notebook found here

First, we run the regression portion of the program with

radmeth regression -factor trt designmatrix6.txt Day10-merged.meth > cpgs.bed

That outputs a cpgs.bed file that looks like:


The first four columns are location, strand, and context information, and the 6th-9th columns are total coverage and methylated counts at that location. The 5th column is the interesting column, representing the value obtained from a log-likelihood ratio test. The MethPipe developers suggest not using this p-value directly, and instead adjusting based upon neighboring values using the adjust argument to radmeth below.

radmeth adjust -o cpgs.adj cpgs.bed

which results in:


The first 5 and last 4 columns are the same as above, with columns 6 and 7 representing modified p-values based on neighboring sites, and an FDR corrected value (I need to look in to what method they use for FDR correction.

Next, the developers suggest we merge the loci in to regions, I went ahead and did this, though we may be more interested in the loci themselves, so both datasets will be avaiable. Merging is done via

radmeth merge -o cpgs.output cpgs.adj

which gives us:


Hey, that’s not an empty file, which is surprising given that MACAU + q-value FDR correction resulted in no significant DMRs. Interesting. I’ll throw those in to IGV and update this shortly. Columns are scaffold, start/stop location, context, number of methylated cytosines in the region, and log-likelihood ratio value.

Output files will be located in owl once they’re finished uploading.

RADMeth uses a beta-binomial based regression model so we preserve the ability to directly model count data, but lose the ability to control for some biological covariates and model genetic relatedness based covariance which will be something to consider.

Kaitlyn’s Notebook: Oysters and Excel/Continuing Work

Using excel, I was asked to identify proteins that were consistently high or varied across samples in the Pacific oyster proteomic data.

First I had to figure out how to open up a .tsv file from Github which I had never done before. I saved the .tsv file by right clicking on the RAW link then I followed these instructions which were very straightforward.

Once I had the file open in excel, I decided used the average(n…) and median(n…) function on all rows.I then selected conditional formatting and choose a color gradient in order to better visualize protein values for each rows. The average would show those that had higher protein counts while the median could provide insight to potential outliers.

I also wanted to provide a range for each row, however I could not find a command for this action. Instead I used the min(n…) and max(n…) functions in separate columns. I created a subsequent column subtracting the minimum value from the maximum value in each row to provide one value representing a range. This time I chose data bars for conditional formatting, mostly to mix it up from the previous selection.

After posting my progress on the file, there was discussion of possible error in technical replicates. In an attempt to show where differences in the sample and technical replicates(denoted by …#A) may be substantial, I calculated averages and medians for the samples and technical replicates. Next, I subtracted the replicate protein values from the original sample values. Then I assigned new rule under conditional formatting to mark values with a difference greater than 10- which I arbitrarily choose but can easily be changed.

Rules in conditional formatting

It seems there are continued problems identifying why replicates had significantly different values, but I will work on using blast to identify these proteins next.

I have also been trying to familiarize myself with bash. Fortunately I am running a 64 bit version of windows which enables me to use bash rather than Git. I enabled developer mode which allowed me to run Linux based programs including bash. I am going to start working through the bash tutorial for FISH546. I will also start looking into running blast with large files (to identify the Pacific Oyster proteins) in addition to familiarizing myself with Jupyter.

This is all pretty new to me (Github and WordPress included) but I’m really enjoying learning more about bioinformatics and working with something new!

Some basic stats on our Geoduck methylation data.

Was talking to Steven about some visualization options, and he asked about just some basic information regarding # of Loci, coverage, and other stuff, so I figured that out real quick!

Total number of unique methylated loci sequenced – 5,016,316
Total number of methylated loci with reads in every sample – 611,458
Mean coverage of those 611,000 loci -7.70

Total number of loci with reads in every sample, and at least 10x coverage – 17823
Mean coverage of those 17,000 loci – 48.24

I saved the different files for future use here and the R notebook is here

Also, was playing around with methpipe, which has the ability to identify hypo-methlylated regions (and hyper-methylated regions, with a bit of trickery), so I ran those on the Day10 samples I had completed.

IGV xml cam be found here and the .bed files for the individual samples are located here


Yaamini’s Notebook: NSA Poster

Here’s the evolution of my poster for NSA!

I tend to have the same evolultion of posters everytime I make them: not enough information for my results in my first draft. Hopefully having this reference will remind me to try and be as specific as possible in my results section the first time!

My first draft:


Steven’s feedback:


Second draft:


Third draft:


Final draft:


from the responsible grad student


So I finally got the Day 10 files processed through methpipe and have been trying to get RADMeth to work all day. After combining the different samples in to a single file and building the design matrix, it keeps giving me a sample out of order/not matching the design matrix error. I’m not sure what’s going on, but I’ve applied to join the methpipe Google group, and hopefully someone there can help me.

notebook entry: