Back to the C. virginica DMR analysis, plus trying to improve Platanus/Redundans running.

I started a Bismark run for the C. virginica BS-Seq data using Bismark’s --non_directional argument, after hard trimming the first 16 bases off of the reads as suggested by Mackenzie. This improved mapping rates significantly, from an average of ~8% to the high 20s. I uploaded the BAM files to Owl here and the notebook entry is here.
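
For reference, a rough sketch of the two steps (file names and paths are placeholders; I’m assuming Trim Galore for the hard clip, and the alignment options mirror the earlier single-file test):

# Hard trim the first 16 bases off the 5' end of each read
trim_galore --clip_r1 16 sample_R1.fastq

# Map in non-directional mode against the bisulfite-converted genome
# (genome folder prepared beforehand with bismark_genome_preparation)
bismark --non_directional -N 1 --score_min L,0,-0.2 /path/to/C_virginica_genome/ sample_R1_trimmed.fq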

After finishing that, I started running the data through some methylation extraction options, including bismark_methylation_extractor and MethylExtract. Having two extractors should let me test a few different approaches to DMR calling. As of this writing the Bismark extraction has finished, with data available here and the notebook found here.
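
The Bismark side was run along these lines (a sketch, not the exact call; the BAM name is a placeholder and the reporting flags are just one reasonable set for single-end data):

# Pull per-cytosine methylation calls out of a single-end Bismark BAM
# -s: single-end; --comprehensive: merge strand-specific output;
# --bedGraph/--cytosine_report: coverage-style files handy for DMR calling
bismark_methylation_extractor -s --comprehensive --bedGraph --cytosine_report \
  --genome_folder /path/to/C_virginica_genome/ sample_bismark_bt2.bam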

On the subject of Platanus and Redundans, I’ve been trying to get Redundans to do gap closing with the Illumina short reads, but for some reason it isn’t able to extract reads from the 50 bp libraries. Not sure why this is, but I’ve been experimenting with a few different arguments in Redundans to see if that helps at all. I’ll update when I’ve found a solution.
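
For reference, the general shape of the Redundans call I’ve been experimenting with (a sketch; file names are placeholders and the exact arguments are what I’m still fiddling with):

# Reduction, scaffolding, and gap closing in one pass
# -f: input assembly, -i: paired-end FastQ files, -l: PacBio long reads, -o: output directory
redundans.py -f platanus_assembly.fa \
  -i illumina_R1.fastq illumina_R2.fastq \
  -l pacbio_reads.fastq \
  -o redundans_out -t 28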

Oly Genome: Redundans run finished.

Redundans finished over the weekend, but the results were a little… odd.

Stats for the Illumina only Platanus and the completed Redundans run are below. We increase the N50 from 105 to 251, but in doing so lose 75% of the total number of bases, and 80% of the contigs from the initial assembly. Something’s wonky in the pipeline.

I did some more research and I realized I may have made a faulty assumption that Redundans would take the final output of the Platanus pipeline, a gap closed scaffold assembly. It looks like it actually wants the initial assembled contigs.

I’ve started a run again supplying the contig assembly, and we’ll see if that yields better results! I’m a little skeptical, though, because the contig assembly only has 16 million base pairs, which still seems awfully low.

Redundans output can be found here.

The final assembly is: scaffolds.reduced.fa

I’ve also installed Falcon on Hyak. Falcon is a de novo assembler for PacBio-only reads.

Some notes on getting Falcon installed:

  • Don’t load the anaconda2_4.3.1 module. Falcon’s install scripts download and install things inline, so you don’t have the ability to specify where things get installed (user directory vs. anaconda module directory).
  • Everything has to be done on a build node, rather than an interactive node, since the scripts both download and install packages.
  • Install networkx v1.10 prior to starting, using easy_install and specifying an install directory via something like easy_install --install-dir /gscratch/srlab/programs/networkx networkx==1.10 (see the sketch below).
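
A minimal sketch of that networkx step, assuming the /gscratch/srlab/programs path and that Falcon picks the module up via PYTHONPATH:

# Install networkx 1.10 into a directory we control (not the anaconda module)
mkdir -p /gscratch/srlab/programs/networkx
easy_install --install-dir /gscratch/srlab/programs/networkx networkx==1.10

# Point Python at it so Falcon's build scripts can import it
export PYTHONPATH=/gscratch/srlab/programs/networkx:$PYTHONPATH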

Sean’s Notebook: Olympia Oyster genome assembly in Platanus.

I’ve been trying to incorporate the new PacBio data we recently got and we’ve settled on a combination Platanus/Redundans approach for assembling a new genome from scratch.
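
For reference, the Platanus half of that is the standard three-step assemble/scaffold/gap_close pipeline; roughly this (a sketch with placeholder file names and thread/memory settings, not the actual submission script):

# 1. Contig assembly from the Illumina reads
platanus assemble -o Oly -f illumina_R1.fastq illumina_R2.fastq -t 28 -m 350

# 2. Scaffolding using the paired-end information
platanus scaffold -o Oly -c Oly_contig.fa -b Oly_contigBubble.fa \
  -IP1 illumina_R1.fastq illumina_R2.fastq -t 28

# 3. Gap closing, which yields the gap-closed assembly whose stats are below
platanus gap_close -o Oly -c Oly_scaffold.fa \
  -IP1 illumina_R1.fastq illumina_R2.fastq -t 28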

I’ve finished the first portion surprisingly quickly. Hooray for Hyak I guess?

Some stats on the final gap closed Platanus assembly:

stats for /Users/Sean/Documents/RobertsLab/Oly__gapClosed.fa
sum = 4121175, n = 28721, ave = 143.49, largest = 1971
N50 = 133, n = 10380
N60 = 122, n = 13618
N70 = 114, n = 17127
N80 = 109, n = 20829
N90 = 105, n = 24686
N100 = 100, n = 28721
N_count = 32133
Gaps = 975
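
(Those numbers look like assembly-stats output; if that’s the tool, reproducing them on any FASTA is a one-liner.)

# Summary stats (sum, N50, N_count, gaps) for an assembly
assembly-stats Oly__gapClosed.fa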

Still a lot of gaps. Hopefully the Redundans run with the PacBio stuff will help with that.

Data: here

Gap filled Assembly:

Now on to Redundans.

Sean’s Notebook: Bismark mapping efficiency with Hard trimmed C. virginica sample.

Yesterday Mackenzie Gavery came by and offered some suggestions to increase mapping rates for our C. virginica BS-Seq data using Bismark. Her two suggestions were using the --non_directional flag to account for the PBAT-ness of the data, which had a huge effect, and hard trimming the first 16 bases of our samples, because they look weird.

I tried everything on a single sample for speed and finished it this morning.

The weirdness

Untrimmed: (plot)

Default trimmed: (plot)

Hard trimmed, first 16 bases removed: (plot)

That cleans stuff up for sure. Unfortunately it didn’t have much of an effect on mapping rate, bringing us up from 28% to 28.3%. Was worth a shot though!

Final Alignment report
======================
Sequences analysed in total:	12197930
Number of alignments with a unique best hit from the different alignments:	3456338
Mapping efficiency:	28.3%
Sequences with no alignments under any condition:	5760842
Sequences did not map uniquely:	2980750
Sequences which were discarded because genomic sequence could not be extracted:	0

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:	181719	((converted) top strand)
CT/GA:	166362	((converted) bottom strand)
GA/CT:	1588675	(complementary to (converted) top strand)
GA/GA:	1519582	(complementary to (converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed:	61813350

Total methylated C's in CpG context:	12572131
Total methylated C's in CHG context:	4005979
Total methylated C's in CHH context:	12350257
Total methylated C's in Unknown context:	0

Total unmethylated C's in CpG context:	2271987
Total unmethylated C's in CHG context:	12442077
Total unmethylated C's in CHH context:	18170919
Total unmethylated C's in Unknown context:	5

C methylated in CpG context:	84.7%
C methylated in CHG context:	24.4%
C methylated in CHH context:	40.5%
C methylated in Unknown context (CN or CHN):	0.0%

Bismark output files located: here

Now time to run the rest of them!

Sean’s Notebook: EpiTEome and Bismark revisited.

Steven found a new program for simultaneously finding transposable element insertion points as well as their methylation levels called epiTEome and asked me to see how applicable it was to our data. I read through some papers, and it looks like it should work, and give us some interesting information regarding CHH context methylation at transposable element insertion points. I’ve been trying to get it installed on my laptop, but some of the perl modules won’t install correctly. I’ll update this with a guide once it’s behaving properly.

Also, Mackenzie Gavery came by and offered some suggestions about our mapping efficiency issue in Bismark. We apparently have PBAT libraries, which are the complement of the strand of interest. This is obvious once you know what to look for: in the FastQC report the sequence files are G-depleted, rather than C-depleted as one would expect in bisulfite-treated samples. I’m trying Bismark again on a single file with the --non_directional flag, which should output all four strand possibilities, so we can see if anything changes.

On the Hyak front, the Platanus assemble step finished after running only overnight, which is a little disturbing since the last time I ran it, it ran for a week before finally crashing due to lack of memory. I guess that’s the power of 500 GB of RAM and a pair of Skylake Xeons? Next we move on to the scaffolding step, which will hopefully be as fast!

Update:

Finished the Bismark run on that one file, and went from 8% mapping to 28% mapping. Pretty huge increase!

Bismark report for: trimmed-2112_lane1_ACAGTG_L001_R1.fastq (version: v0.16.3)
Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)
Bismark was run with Bowtie 2 against the bisulfite genome of /home/srlab/Documents/C-virginica-BSSeq/genome/ with the specified options: -q -N 1 --score-min L,0,-0.2 --ignore-quals

Final Alignment report
======================
Sequences analysed in total:	12260444
Number of alignments with a unique best hit from the different alignments:	3439005
Mapping efficiency:	28.0%
Sequences with no alignments under any condition:	6508309
Sequences did not map uniquely:	2313130
Sequences which were discarded because genomic sequence could not be extracted:	4

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:	194297	((converted) top strand)
CT/GA:	177441	((converted) bottom strand)
GA/CT:	1574633	(complementary to (converted) top strand)
GA/GA:	1492630	(complementary to (converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed:	72855963

Total methylated C's in CpG context:	15487306
Total methylated C's in CHG context:	5008319
Total methylated C's in CHH context:	15081081
Total methylated C's in Unknown context:	5

Total unmethylated C's in CpG context:	2557484
Total unmethylated C's in CHG context:	14183192
Total unmethylated C's in CHH context:	20538581
Total unmethylated C's in Unknown context:	78

C methylated in CpG context:	85.8%
C methylated in CHG context:	26.1%
C methylated in CHH context:	42.3%
C methylated in Unknown context (CN or CHN):	6.0%

Sean’s Notebook: Water Samples at DNR in Olympia

After many trials and tribulations, I’ve finally started running real, honest-to-god TA samples from Hollie’s geoduck experiment. It took a few CRM runs for the new probe on the titrator to settle down, but I got three reps within 10 TA units of each other (they were reading ~20 units higher than expected, so I’ll likely just use an offset to account for that). Progress!

Now only a few hundred more samples to go…

On the Hyak front, after finishing Andrew’s salmon run, I’ve started running the Platanus assembler on the Oly Illumina data. After Platanus finishes, we’ll try to integrate the long-read PacBio stuff via Redundans. Hopefully PBJelly will finish the gap filling process so we’ll have two assemblies to compare.

Salmonid Trinity Run

Since file transfers to/from Hyak are being wonky and I already have the files uploaded, I’m going to start Andrew Spanjer’s salmon trinity assembly.

First, I need to make a new BlastX database using files provided by Andrew.

/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/makeblastdb  -in SalmonUni.fasta -parse_seqids -dbtype prot

Building a new DB, current time: 05/12/2017 09:26:38
New DB name:   /gscratch/srlab/data/andrew-trinity/SalmonUni.fasta
New DB title:  SalmonUni.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 141125 sequences in 5.81965 seconds.
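
The database is presumably there so the finished Trinity assembly can be annotated against Andrew’s protein set with blastx, something along these lines (a sketch for later; the query path is Trinity’s default output, not a file that exists yet):

# Annotate assembled transcripts against the SalmonUni protein database
/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastx \
  -query trinity_out_dir/Trinity.fasta \
  -db SalmonUni.fasta \
  -outfmt 6 -evalue 1e-5 -max_target_seqs 1 -num_threads 28 \
  -out Trinity_vs_SalmonUni.blastx.outfmt6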

Andrew supplied me with his two data files, all_val_1.fq.gz and all_val_2.fq.gz, so I threw those into a SLURM batch file and fired up Trinity via sbatch.

[seanb80@mox1 andrew-trinity]$ cat TrinRun.sh 
#!/bin/bash
## Job Name
#SBATCH --job-name=Salmon_Trinity
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (20 days)
#SBATCH --time=480:00:00
## Memory per node
#SBATCH --mem=350G
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/srlab/data/andrew-trinity/

source /gscratch/srlab/programs/scripts/paths.sh

Trinity --seqType fq --left all_val_1.fq.gz --right all_val_2.fq.gz  --CPU 50 --trimmomatic --max_memory 350G

[seanb80@mox1 andrew-trinity]$ sbatch -p srlab -A srlab TrinRun.sh
Submitted batch job 13312
[seanb80@mox1 andrew-trinity]$ scontrol show job 13312
JobId=13312 JobName=Salmon_Trinity
   UserId=seanb80(557445) GroupId=hyak-srlab(415510) MCS_label=N/A
   Priority=276 Nice=0 Account=srlab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:04 TimeLimit=20-00:00:00 TimeMin=N/A
   SubmitTime=2017-05-12T09:56:15 EligibleTime=2017-05-12T09:56:15
   StartTime=2017-05-12T09:56:17 EndTime=2017-06-01T09:56:17 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=srlab AllocNode:Sid=mox1:38936
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n2203
   BatchHost=n2203
   NumNodes=1 NumCPUs=28 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=28,mem=350G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=350G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/gscratch/srlab/data/andrew-trinity/TrinRun.sh
   WorkDir=/gscratch/srlab/data/andrew-trinity/
   StdErr=/gscratch/srlab/data/andrew-trinity//slurm-13312.out
   StdIn=/dev/null
   StdOut=/gscratch/srlab/data/andrew-trinity//slurm-13312.out
   Power=

Watching top in another window, everything seems to be running OK, but Trimmomatic seems to only be using ~8 cores; hopefully Trinity will be better.

#sbatch