The whole enchilada (but better this time)
Now that I’ve tested the entire
methylKit pipeline on a 10,000 read subset for each C. virginica sequencing sample, I can begin validating the full pipeline. Steven already ran the full samples on this pipeline in this notebook. My job is to reproduce his workflow.
I opened my full pipeline Jupyter notebook. Since I’m using trimmed files from this directory, the first thing I wanted to do was test my
xargs commands to ensure I isolated the right files.
Figure 1. Testing
xargs on a new set of filenames.
Then, I piped the
xargs output into
bismark. Unlike Steven, I did not set a parameter for
-score_min. In our conversation today, we decided that I should run the alignment with the default parameter so we can compare our results. The default is L,0,-0.2 (instead of L,0,-1.2 used by Steven).
Figure 2. Description of each line of code, along with the actual command.
It took me a bit of finnagling to get that code to work because of some typos! I’m going to jot down a few reminders for myself:
- Always check to see if each line has a “”
- Use tab complete to ensure lack of typos and the correct path to files
- Verify you’re using the correct comand
- Test small bits of code before putting everything together
I’ll check on the alignment progress tomorrow, but my guess is that this will take 2-3 days to complete. Once the alignment is finished, I will deduplicate, sort, and index the resulting .bam files.
from the responsible grad student https://ift.tt/2I7zZcR
My goal was to run my samples through Steven’s methylKit R code. The first thing I did was copy his code into this R script. I then added more descriptive names for the objects, and commented out each line of code so I understood what was happening.
I was able to run
processBismarkAln on my deduplicated.sorted.bam files! However, I could not
unite these files. That’s when I posted a Github issue. After some back-and-forth, Steven suggested I download some of his dedup.sorted.bam and run those files through the R script. By using his files, I could figure out if the
unite issue stemmed from my files, or from software differences on the Mac mini.
I created a new R script and downloaded dedup.sorted.bam files 1 and 10 (i.e. one from each treatment). I was able to successfully
unite these files and go through the entire script! While I couldn’t identify any DMRs between the two files, I did come up with two reasons why Steven’s files worked and mine did not:
- Steven’s files are trimmed, but mine are not
- Steven’s files used the first 1,000,000 as a subset, but mine used the first 10,000
Either way, Steven said I should move on to the full pipeline and validate everything. Sam and Steven both pointed me to this directory, which has the trimmed files. He also said that I should remove the
-score_min argument. This will make
bismark run the default setting for alignment mismatches, and allow us to compare the alignment results. Time to start a computer process that will take several days to run!
from the responsible grad student https://ift.tt/2Im6dVp
We have an upcoming meeting with Illumina to discuss how the geoduck genome project is coming along and to decide how we want to proceed.
So, we wanted to get a quick idea of how well our geoduck assemblies are by performing some quick alignments using Bowtie2.
Used the following assemblies as references:
- sn_ph_01 : SuperNova assembly of 10x Genomics data
- sparse_03 : SparseAssembler assembly of BGI and Illumina project data
- pga_02 : Hi-C assembly of Phase Genomics data
The analysis is documented in a Jupyter Notebook.
Jupyter Notebook (GitHub):
NOTE: Due to large amount of stdout from first genome index command, the notebook does not render well on GitHub. I recommend downloading and opening notebook on a locally install version of Jupyter.
Here’s a brief overview of the process:
- Generate Bowtie2 indexes for each of the genome assemblies.
- Map 1,000,000 reads from the following Illumina NovaSeq FastQ files: