Sean’s Notebook: C. virginica MACAU results.

So I finished running MACAU on the C. virginica oil spill samples a couple of different ways. First, I ran all 6 samples (3 control, 3 oil-exposed); second, I ran 2 control and 3 oil-exposed. The difference is that one of the samples had really low read counts from the sequencing facility, which brought the number of available loci with at least 10x coverage down from ~70,000 to ~17,000.

Neither run produced many loci with meaningful priors: the 6-sample run had 4, and the 5-sample run had 16. None had meaningful predictor parameters (beta), meaning that oil exposure was not a significant predictor of methylation status. After talking to Steven, this isn't surprising, as other recent studies have shown a similar lack of differentially methylated regions/loci.

Also, I’ve started revamping the Hyak wiki, trying to chunk it out to make it a little more readable by the average person. It’s on my personal GitHub wiki at the moment, but I will copy it over once I get some input on formatting and a few other things.

Notebook for Methylation stuff: here
Hyak Wiki stuff: here

Sean’s Notebook: My FALCON won’t fly.

I’ve been trying to install FALCON, a PacBio-based assembler, and it’s been a huge pain. Mostly typical Hyak permission issues, but also lots of errors with no way to figure out what they mean. LFS error code 32512?

That’s super useful, especially when it doesn’t specify whether that’s Git LFS, or the Lustre file system, or who knows what else. (It’s likely Git LFS, but they don’t seem to have their error codes documented anywhere accessible.)

Going to look through the FALCON repo’s issues and see if there’s anything of use there, then post an issue of my own if not.

In other, more successful notes.

Uploaded more TA data to GitHub. The titrator seems to be behaving better, but is still showing a ~40 point swing between beginning- and end-of-day samples.

Finally finished the methylation count files for the C. virginica data and will most likely set MACAU chewing on them tonight. I played around with SQLShare way too long, but on Tuesday I got a response from their helpdesk and learned that SQLShare converts all input to lowercase, so if you’re trying to load the Loc column from your dataset, it sees dataset.Loc as dataset.loc, and then gets angry. You have to use dataset."Loc". Even after figuring that out, it was still a giant pain, so I just subsetted my sample count files and welded them back together at the end.

I started with 10x coverage, and will pare that back to probably 5x for a second run.
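The subset-and-weld workaround is simple enough to sketch in plain Python. This is a hypothetical sketch, not my actual script: the file layout, column names (Loc, count), and the intersection-style join are all assumptions.

```python
import csv

def load_counts(path, locus_col="Loc", count_col="count"):
    """Read one sample's methylation count file into a {locus: count} dict."""
    with open(path, newline="") as handle:
        return {row[locus_col]: int(row[count_col]) for row in csv.DictReader(handle)}

def weld(samples):
    """Join per-sample {locus: count} dicts on loci shared by every sample."""
    shared = set.intersection(*(set(sample) for sample in samples))
    return {locus: [sample[locus] for sample in samples] for locus in sorted(shared)}

# Loci present in all samples come through; the rest are dropped.
merged = weld([{"chr1:100": 8, "chr1:250": 12}, {"chr1:100": 15, "chr1:410": 3}])
print(merged)  # {'chr1:100': [8, 15]}
```

Doing the join in memory like this sidesteps SQLShare's case-folding entirely, since the column names never leave Python.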

Notebook

Count Files

Sean’s Notebook: Oly Genome re-assembly, try 2.

Finished the Redundans run on the Oly genome with a smaller k-mer length, and things look much better!

Full Scaffold stats:

D-173-250-161-130:NewOlyAssembly Sean$ assembly-stats scaffolds.fa
stats for scaffolds.fa
sum = 567422454, n = 337054, ave = 1683.48, largest = 74226
N50 = 3492, n = 44086
N60 = 2683, n = 62641
N70 = 1989, n = 87196
N80 = 1355, n = 121603
N90 = 737, n = 177120
N100 = 200, n = 337054
N_count = 30184511
Gaps = 286008
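For reference, the Nxx numbers assembly-stats reports can be reproduced in a few lines: NXX is the length of the contig at which, walking down the length-sorted contigs, the running total first reaches XX% of the assembly, and n is how many contigs that takes. A minimal sketch with made-up toy lengths (not the real assembly):

```python
def nxx(lengths, fraction):
    """Return (Nxx length, contig count n) for a cumulative fraction, e.g. 0.5 for N50."""
    target = sum(lengths) * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count

# Toy assembly: five contigs totalling 250 bp.
lengths = [40, 80, 10, 70, 50]
print(nxx(lengths, 0.5))  # (70, 2): the two longest contigs cover half the total
print(nxx(lengths, 0.9))  # (40, 4)
```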

Reduced Scaffold stats:

D-173-250-161-130:NewOlyAssembly Sean$ assembly-stats scaffolds.reduced.fa
stats for scaffolds.reduced.fa
sum = 546928670, n = 300593, ave = 1819.50, largest = 78788
N50 = 3852, n = 36149
N60 = 2917, n = 52500
N70 = 2137, n = 74420
N80 = 1455, n = 105359
N90 = 810, n = 155023
N100 = 200, n = 300593
N_count = 5238187
Gaps = 209203

Full Contig stats:

D-173-250-161-130:NewOlyAssembly Sean$ assembly-stats contigs.fa
stats for contigs.fa
sum = 1633553496, n = 12395459, ave = 131.79, largest = 23341
N50 = 141, n = 2124062
N60 = 115, n = 3463771
N70 = 92, n = 5003872
N80 = 69, n = 7087338
N90 = 61, n = 9624083
N100 = 58, n = 12395459
N_count = 0
Gaps = 0

Reduced Contig stats:

D-173-250-161-130:NewOlyAssembly Sean$ assembly-stats contigs.reduced.fa
stats for contigs.reduced.fa
sum = 532736649, n = 791403, ave = 673.15, largest = 23341
N50 = 940, n = 151868
N60 = 726, n = 216469
N70 = 553, n = 300627
N80 = 408, n = 412850
N90 = 289, n = 568264
N100 = 200, n = 791403
N_count = 0
Gaps = 0

Complete Contig fasta: here

Reduced Contig fasta: here

Complete scaffold fasta: here

Reduced scaffold fasta: here

SLURM execution script: here

Redundans output: here

All data: here

Sean’s Notebook: Calibrating the DNR titration pH probe:

Update: 4 CRMs run; the last three were within +/- 10, but still ~30 points higher than the reference. It’s at least consistent in how far off it is, and I believe it can be accounted for with an offset. Hooray?

So Micah and I decided last week that there was a fair amount of drift in the pH probe week to week, potentially being one of the causes of the reference sample variation we see. To combat this, starting today we’ll recalibrate the probe at the beginning of each sample day.

Steps to recalibrate probe:

1. Obtain 4/7/10 buffer solutions. These are in stock bottles on the left hand side of the upper shelf in the lab.
2. Fill a sample cup ~1/3rd full with each solution. These don’t need to be weighed, they just need to cover the pH probe bulb completely.
3. Run the Probe Calibration Method on the Titrator. You’ll load the 4.0 buffer, hit OK, then it will run for a while, then load the 7.0 and then the 10.0.

IMG_2111

IMG_2112

IMG_2113

4. When the calibration has finished it will show a summary screen like:

IMG_2110

and the number we’re after is the second SLOPECAL value, -58.18 mV/pH. This is the number that lets us equate the millivolt readings from the pH probe to the pH of the sample. For reference, last week the SLOPECAL value was -58.90 mV/pH, so a little change, but not much.

5. After we have our slope value, we need to calculate mV values for pH 3.5, where the initial off-gas spin starts, and pH 3.0, where the titration ends. To do this, we figure out how many pH units we move from 7 (which reads 0 mV) to 3.5, i.e. 3.5 units, multiply that by the SLOPECAL value, and get 203.63 mV. Then we calculate our end reading, a change of 4 units, and get 232.72 mV.

6. These numbers we take to the laptop: go into the Analysis tab, then double-click on the New TA Titration method. This brings up a graphical flow chart of the entire process, and we want the “Titration (EQP) [1]” method. There we want to set the Predispense Potential and the Termination Potential values. According to Micah, we choose the nearest multiple of 10 to our targets; due to granularity in the dispensing protocol, we won’t actually hit exactly 203 or 232. I chose 200 and 230.

IMG_2114

IMG_2115

IMG_2116

IMG_2117
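The arithmetic in step 5 is simple enough to script. A minimal sketch (the function name and the 0 mV-at-pH-7 reference are my own framing; the slope value is from the calibration above):

```python
SLOPECAL = -58.18  # this week's calibration, in mV/pH; pH 7.0 reads 0 mV

def target_mv(target_ph, slope=SLOPECAL, reference_ph=7.0):
    """Expected probe reading in mV at target_ph, given the calibrated slope."""
    return (target_ph - reference_ph) * slope

print(round(target_mv(3.5), 2))  # 203.63 mV, where the off-gas spin starts
print(round(target_mv(3.0), 2))  # 232.72 mV, where the titration ends
```

Re-running this with each week's fresh SLOPECAL value gives the numbers to round to the nearest 10 for the Predispense and Termination Potentials.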

Time to run some CRM’s and see if this helps at all!

Oly Assembly try 2: It’s yuuuuuuge!

So after hearing from Katherine about some ideas for using Platanus, I tried running the Illumina assembly once more, this time using a k-mer length of 20. This seems to have had an effect, as the output contig file went from 21 MB to 1.9 GB. I’m shuffling files around and haven’t had a chance to actually look at the contents of the file, but I’m at least heartened by the massive size increase. Size = data, right?

Assembly-stats output:

D-69-91-148-111:RobertsLab Sean$ ~/Documents/RobertsLab/assembly-stats/build/assembly-stats ~/Documents/RobertsLab/Oly_Out_Contig.fa 
stats for /Users/Sean/Documents/RobertsLab/Oly_Out_Contig.fa
sum = 1633553496, n = 12395459, ave = 131.79, largest = 23341
N50 = 141, n = 2124062
N60 = 115, n = 3463771
N70 = 92, n = 5003872
N80 = 69, n = 7087338
N90 = 61, n = 9624083
N100 = 58, n = 12395459
N_count = 0
Gaps = 0

Sean’s Notebook: Things went from not working, to working. It’s a miracle!

So I’d been having problems with Platanus throwing a segmentation fault during graph construction with a smaller k-mer length argument, which was a pain because it’s such a generic error. Turns out it just needed more memory, because 400 GB isn’t enough? Hopefully it will be done by tomorrow and we can try Redundans again, with better results!

Also, I got MethylExtract working on the C. virginica methylation data. Looks like when you have PBAT data, you have to switch the flags to flagW=0 and flagC=16, even though you’re using paired-end data.

MethylExtract notebook: here
C. virginica CG and SNP data: here, pending renaming/reorganization of files.

Sean’s Notebook: All TA, all the time.

Spent another day at DNR in Olympia running water samples. Hollie suggested I try drawing acid directly from the stock container to remove the chance of changing acid concentration by mixing old acid and new acid in the titrator reservoir. I did this, as well as mixing the acid bottle before each sample run and still found a frustrating amount of variation in CRM samples. Micah was available to watch my technique and we decided it did not appear to be caused by operator error.

We re-checked the probe calibration and found that it had drifted 5mV at 3.5pH in the week since it was last checked, so we adjusted the titrator script and ran a set of CRMs in the middle of the day and found better consistency with results, but they were still reading higher than intended. We decided that in the future checking the probe calibration at the beginning of each processing day would be a good move.

After thinking of other potential reasons for consistently high CRM readings, we checked the calibration of the balance used to weigh out samples, thinking a higher-than-expected mass reading could lead to higher-than-expected TA readings. But after finding a reference weight in the nearby geology lab, we found that the scale was accurate to within 0.01 g, so that wasn’t the culprit. The only other thing we could think of was that the acid burette might be dispensing an incorrect volume of acid. Micah said that he would contact Mettler-Toledo about ways to test and calibrate it, and we’ll go from there!

TA values: here
Raw data: here

Back to the C. virginica DMR analysis, plus trying to improve Platanus/Redundans running.

I started a Bismark run for the C. virginica BS-Seq data using the --non_directional argument for Bismark, after hard-trimming the first 16 bases off of reads as suggested by Mackenzie. This improved mapping rates significantly, from ~8% on average to the high 20s. I uploaded the BAM files to owl here, and my notebook is here.

After finishing that, I started running the data through some methylation extraction options, including bismark_methylation_extractor and MethylExtract. Having two tools should let me test a few different approaches for DMR calling. As of writing, the Bismark extraction has finished, with data available here and notebook here.

On the subject of Platanus and Redundans, I’ve been trying to get Redundans to do gap closing with the Illumina short reads, but for some reason it isn’t able to extract reads from the 50 bp libraries. Not sure why this is, but I’ve been experimenting with a few different Redundans arguments to see if that helps. Will update when I’ve found a solution.

Oly Genome: Redundans run finished.

Redundans finished over the weekend, but the results were a little… odd.

Stats for the Illumina-only Platanus assembly and the completed Redundans run are below. We increase the N50 from 105 to 251, but in doing so lose 75% of the total bases and 80% of the contigs from the initial assembly. Something’s wonky in the pipeline.

I did some more research and realized I may have made a faulty assumption that Redundans would take the final output of the Platanus pipeline, a gap-closed scaffold assembly. It looks like it actually wants the initial assembled contigs.

I’ve started a run again, supplying the contig assembly, and we’ll see if that yields better results! I’m a little skeptical, though, because the contig assembly only has ~16 million base pairs, which still seems awfully low.

Redundans output can be found here.

The final assembly is: scaffolds.reduced.fa

I’ve also installed Falcon on Hyak. Falcon is a de novo assembler for PacBio-only reads.

Some notes on the assembly of Falcon.

  • Don’t load the anaconda2_4.3.1 module. Falcon’s installer is a set of scripts that download and install things inline, so you don’t have the ability to specify where things are installed (user directory vs. the anaconda module directory).
  • Everything has to be done on a build node, rather than an interactive node, since the scripts both download and install packages
  • Install networkx v1.10 prior to starting, using easy_install and specifying an install directory, e.g. easy_install --install-dir /gscratch/srlab/programs/networkx networkx==1.10

Sean’s Notebook: Olympia Oyster genome assembly in Platanus.

I’ve been trying to incorporate the new PacBio data we recently got and we’ve settled on a combination Platanus/Redundans approach for assembling a new genome from scratch.

I’ve finished the first portion surprisingly quickly. Hooray for Hyak I guess?

Some stats on the final gap closed Platanus assembly:

stats for /Users/Sean/Documents/RobertsLab/Oly__gapClosed.fa
sum = 4121175, n = 28721, ave = 143.49, largest = 1971
N50 = 133, n = 10380
N60 = 122, n = 13618
N70 = 114, n = 17127
N80 = 109, n = 20829
N90 = 105, n = 24686
N100 = 100, n = 28721
N_count = 32133
Gaps = 975

Still a lot of gaps. Hopefully the Redundans run with the PacBio stuff will help with that.

Data: here

Gap filled Assembly:

Now on to Redundans.