Sean’s Notebook: MUMmer/nucmer error.

So I set up Metassembler to run on the BGI and Platanus assemblies to try and merge them in to a single assembly. It ran for a while, and then returned this error:

 


---------- Run bash command ----------

nucmer BGI_Plat_MetAssemb/BGI/BGI.fa BGI_Plat_MetAssemb/Platanus/Platanus.fa:
nucmer --maxmatch -l 50 -c 300 -p BGI_Plat_MetAssemb/Metassembly/QPlatanus.BGI/Q Platanus.BGI BGI_Plat_MetAssemb/BGI/BGI.fa BGI_Plat_MetAssemb/Platanus/Platanus. fa ...

1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "BGI_Plat_MetAssemb/Metassembly/QPlatanus.BGI/QPlatanus.BGI .ntref" of length 777071945
# construct suffix tree for sequence of length 777071945
# (maximum reference length is 536870908)
# (maximum query length is 4294967295)
# process 7770719 characters per dot
/gscratch/srlab/programs/MUMmer3.23/mummer: suffix tree construction failed: tex tlen=777071945 larger than maximal textlen=536870908
ERROR: mummer and/or mgaps returned non-zero
----------------------------------------

After some googling it turns out that MUMmer naturally compiles as a 32 bit program, and since we have so much data (it ***is*** possible to have too much data) we have to recompile it in 64 bit mode to have larger intergers available to us.

Who knew.

Oh well, going to recompile and start over!

 

 

Sean’s Notebook: BWA-Meth Output for…

Sean’s Notebook: BWA-Meth Output for EPI-135.

Found a different methylation aligner, BWA-meth, that’s based on a Burrows Wheeler aligner that’s supposed to deal better with gap alignment than Bowtie2. Fired it up with the EPI-135 and 10k Geoduck genome. It gave an answer, but I *really* don’t believe it. The bamtools stats output I believe claimed 80% mapping rate. Compared to the 6% from Bismark.

sean@emu:~/Documents/Geoduck_Rerun/bwatest$ bamtools stats -in bwa-meth.bam

**********************************************
Stats for BAM file(s): 
**********************************************

Total reads:       55253237
Mapped reads:      48098666	(87.0513%)
Forward strand:    30884874	(55.8969%)
Reverse strand:    24368363	(44.1031%)
Failed QC:         16970759	(30.7145%)
Duplicates:        0	(0%)
Paired-end reads:  55253237	(100%)
'Proper-pairs':    30054258	(54.3937%)
Both pairs mapped: 47463686	(85.9021%)
Read 1:            27626448
Read 2:            27626789
Singletons:        634980	(1.14922%)

.bam file is available: here

Sean’s Notebook: Starting GARM meta-assembly…

Sean’s Notebook: Starting GARM meta-assembly of PacBio and BGI assemblies for Olys.

The Pilon polishing for the CANU assembly finished last night, and it seems pretty small, but hopefully between that, the BGI assembly, and the Platanus assembly, we can assemble ourselves up one decent genome. Fingers crossed at least.

Assembly stats on the Polished assembly:

D-69-91-159-59:BGI_Oly_Genome Sean$ assembly-stats oly_polished_.fasta 
stats for oly_polished_.fasta
sum = 46364927, n = 3388, ave = 13685.04, largest = 61211
N50 = 14126, n = 1230
N60 = 12962, n = 1573
N70 = 11906, n = 1947
N80 = 10932, n = 2352
N90 = 9590, n = 2803
N100 = 2074, n = 3388
N_count = 1
Gaps = 1

1 gap is interesting, but with the assembly size being at least 1, of not two orders of magnitude smaller than the expected genome size, I think we’re short on coverage to allow for conservative error correction levels. Will have to reassemble with looser standards and see if we can bump it up.

Polished CANU assembly found: here

Pilon output file: here

Next step is GARM, to see what that gives us. I think I’ll also re-assemble the PacBio stuff with much less stringent error correction to see if that gives any measurable difference in outputs.

Edit: Also, I finished the –non-directional runs for Bismark, no change in mapping rates and less than 1% complementary mapping, so it looks like the regular arguments are correct. Output .bam files are found here with the NonDir tag.

Seans Notebook: The internet lies.

Part of the Pilon prep for polishing the PacBio Oly assembly is feeding it a bunch of Illumina data aligned to the PacBio assembly using your favorite aligner, in my case Bowtie2. I initially got a bunch of `.sam` files from Bowtie2 and wanted to convert them, so I turned to Google like any good person does these days. After looking at a bunch of different options, all answers pointed to `samtools view -sB file.sam > file.bam` as the preferred way to do this.

Thinking I knew what I was doing, I whipped up a quick slurm job script to convert everything and file up Pilon. It completed in less than 5 minutes, with a bunch of `.bam` files ~40kb in size. This was suspect, as their original `.sam` files were ~40gb.

After a reading the samtools manual, which corroborated the above conversion syntax, re-aligning some of the Illumina data files thinking I’d done something wrong there, and a whole bunch more googling, it turns out the `> file.bam` syntax is actually supposed to be `-o file.bam`.

Apparently the internet and documentation does not keep up with program changes as well as we’d like.

Sean’s Notebook: Pilon prep finished.

Finished the Pilon prep (except for .sam -> .bam conversion) yesterday afternoon. Found a 32-35% mapping rate for Illumina sequencing files to the PacBio backbone. Not great, but hopefully the whole meta-assembly route will manage to combine everything together in to one super assembly.

For fun, I mapped an Illumina file to a couple of the BGI assemblies, and got ~75% mapping. Wish we could have gotten that out of the PacBio thing, but ah well.

Bowtie Output

Im transferring the .sam files over to Owl now, but they’re ~40gb a piece, so that’ll take a bit. Will update here when finished!

Sean’s Notebook: Trimming and Quality…

Sean’s Notebook: Trimming and Quality Checking EPI-135 and EPI-135WG.

We’re back to playing with Hollie’s Geoduck methylation data, and noticed that there was some wonky nucleotide sequence at the beginning of reads potentially hampering the mapping efficiency downstream. Steven asked me to trim the first 18 nucleotides and last nucleotide from each sequence, and I did that here.

EPI-135 Pre Trimming:
Read 1:
EPI-135_1-untrimmed

Read 2:
EPI-135_2-untrimmed

EPI-135 Post Trimming
Read 1:
EPI-135_1-Untrimmed

Read 2:
EPI-135_2-Untrimmed

EPI-135WG Pre Trimming:
Read 1:
EPI-135WG_1-UnTrimmed

Read2:
EPI-135WG_2-UnTrimmed

EPI-135WG Post Trimming:
Read 1:
EPI-135WG_1-Trimmed

Read 2:
EPI-135WG_2-Trimmed

It looks like trimming cleaned most everything up, though maybe it could have stood to have an additional base pair or two trimmed off the tail.

I’ve started running Bismark on the outputs to see if this trimming improves mapping rates, hopefully it will. I’m going to try tinkering with the number of allowed mismatches in Bowtie, to see if that helps also.

On the Oly Genome front. I talked with Katherine today, and have a pretty decent idea of a game plan for going forward. It looks like I was under the wrong impression that all the Illumina data was paired end, it looks like the longest insert stuff was actually mate pair, which the Platanus devs say is not ideal for assembling. Oops.

The plan:

1. Finish polishing the Canu assembly with Pilon. Still mapping reads back to the Canu assembly with bowtie.
2. Re-run Platanus using only the PE150 data from Illumina, then scaffold with PE150 and MP50 data. Then throw it in to Redundans **without** the reduction step. Katherine thinks that Redundans may be throwing away too much data due to high heterozygosity, and turning that step off may prevent the loss.
3. Throw the BGI assembly, the Platanus/Redundans Assembly, and the Canu/Pilon assembly in to a meta assembler such as GARM.
4. Have the best assembly ever. Or at least a better assembly.

Sean’s Notebook: Canu run finished

Looks like Hyak finished it’s Canu assembly for the Oly PacBio data over the weekend, and here are the statistics for it.

[seanb80@mox2 oly-test]$ /gscratch/srlab/programs/canu/Linux-amd64/bin/tgStoreDump -sizes -s 50000000 -T /gscratch/srlab/data/CanuTest/oly-test/unitigging/oly_pacbio_.ctgStore 2 -G /gscratch/srlab/data/CanuTest/oly-test/unitigging/oly_pacbio_.gkpStore
lenContig ng10       23155 bp   lg10     175   sum    5001399 bp
lenContig ng20       19201 bp   lg20     416   sum   10012857 bp
lenContig ng30       16610 bp   lg30     697   sum   15014659 bp
lenContig ng40       15073 bp   lg40    1014   sum   20013348 bp
lenContig ng50       13606 bp   lg50    1364   sum   25008188 bp
lenContig ng60       12421 bp   lg60    1749   sum   30009994 bp
lenContig ng70       11361 bp   lg70    2170   sum   35007399 bp
lenContig ng80       10115 bp   lg80    2634   sum   40002506 bp
lenContig ng90        8041 bp   lg90    3178   sum   45002283 bp
lenContig sum   46288879 (genomeSize 50000000)
lenContig num       3388
lenContig ave      13662

13.6 kbp NG50 is an improvement! Next, polishing with Pilon.

Sean’s Notebook: Restarting Canu

Looks like they took Mox down for a bit last night/this morning for maintenance so our Canu run got bumped. Thankfully it can fail gracefully and restart midway so hopefully it will be done soon.

Ive got my long read samples concatenated and ready for consensus calling, next I need to organize the short read stuff by insert sizes for polishing, and see what additional prep needs to be done for them.

Sean’s Notebook: Canu running, figuring…

Sean’s Notebook: Canu running, figuring out Racon and what it needs next.

So I got Racon, a consensus caller installed, and it looks like it requires mapping position information. The Racon developers suggested Minimap for that purpose. I installed Minimap on Emu so I could run it on the filtered reads while Canu is running on Hyak.

It looks like the general process will be as follows

  • Canu to make preliminary contigs
  • Map Raw reads to contigs with Minimap
  • Consensus call with Racon
  • Repeat the above steps ~ 3 times (that seems to be a common number)
  • Map the raw Illumina reads to the final Racon output with samtools
  • Use Pilon to polish with the mapped .sam files.
  • End with an assembly based on PacBio reads, and finished with Illumina, the opposite of the Platanus/Redundans workflow.

We’re currently on step 1, so this might take a little while…

Sean’s Notebook: Falcon out, Canu in.

I’ve been trying to get Falcon installed on Hyak, or Emu, or even my laptop for a couple days now with no success. There wasn’t much help on the Falcon GitHub, so after doing some reading, it looks like Canu may be an option that is as good, or better than Falcon, so lets give that a whirl!

Canu’s GitHub is here and documentation is here

To install was super simple, just cloned the GitHub repo on to Hyak via git clone https://github.com/marbl/canu.git in to /gscratch/srlab/programs/canu/, changed directory in to /gscratch/srlab/programs/canu/src/ and ran make.

The Canu developers supply a sample assembly data set which can be downloaded via

curl -L -o p6.25x.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq

which I downloaded in to /gscratch/srlab/data/CanuTest.

To run the assembly, I spool up a 4 hour Interactive session (hopefully this is long enough) and run /gscratch/srlab/programs/canu/Linux-amd64/bin/canu -p ecoli -d ecoli-auto genomeSize=4.8m -pacbio-raw p6.25x.fastq .

This did not work, as Canu is built to run on a scheduler system, so it needs the --useGrid=FALSE argument added to the command. After changing that, everything looks like it’s working fine. After I make sure this works, I’ll get to work on the PacBio only assembly for the Oly genome.

Edit: It finished, and looks like it works with the sample data. Now to try it with our Oly PacBio stuff.