Yaamini’s Notebook: Gonad Methylation Analysis Part 15

A revised enchilada

I tried rerunning bismark_methylation_extractor before the weekend and still got an error. Steven suggested I run the analysis on these files instead. The files were created by Steven when he tried (and failed) to replicate my error in this notebook.

The command worked! I’m not sure what the difference is between Steven’s files and my files. Understanding that difference will be important for reproducibility.


from the responsible grad student https://ift.tt/2rX44pI

Yaamini’s Notebook: Remaining Analyses Part 22

Revising environmental data

Before I begin, a quick shoutout to myself for making the comments on my R scripts detailed enough that I can return to a script I created in December and rerun it!

Micah and Alex both pointed out that the Pacific oysters were only outplanted on June 19, 2016. This means all of my environmental data calculations used data points before the experiment even started! To fix this, I made changes to this R script. I used two commands to subset the data I needed:

temperatureData$Date <- as.Date(temperatureData$Date, format = "%m/%d/%y")

This command makes R recognize all of my dates as actual dates instead of character strings.

temperatureData <- temperatureData[temperatureData$Date >= "2016-06-19", ]

This command allows me to subset rows with dates on or after 2016-06-19, the start of the outplant experiment.

I repeated those commands for my temperature, pH, dissolved oxygen, and salinity data. With the subset, I calculated variables of interest:

In separate R scripts, I used the same commands to subset data and remake my boxplots and line graphs that show fluctuation over time. You can find the scripts below:

A quick note on salinity:

I had this .csv file with calculated salinity output from Micah. In my paper draft, Micah pointed out that those are conductivity values, and not salinity. However, conductivity values would be much smaller! I emailed Micah to confirm what the values were, and he sent me this response:

I did indeed send you ‘calculated salinity’ values – I took the conductivity output in mS/cm and converted it to salinity in PSU at the in situ temperature using the swSCTp function in the R package oce. BUT, after making a call to the manufacturer, I’ve found out that the conductivity output is standardized by the sensor software to ‘what it would be at 25°C’ rather than outputting the raw value at the in situ temperature. So, roughly speaking, salinity values should be ~7 units lower than they are in this spreadsheet.

TL;DR They are salinity values but the wrong ones. Either way, Willapa Bay still has the lowest salinity values. My code will be able to handle revised data once I get it, so I’m all set on that front.


from the responsible grad student https://ift.tt/2IvgR91

Yaamini’s Notebook: Gonad Methylation Analysis Part 14

A spoiled enchilada

I aligned all of my files, and proceeded to deduplication, sorting, and indexing in this Jupyter notebook. All of that was successful!

I was ready to use bismark_methylation_extractor but I ran into an error instead. To run bismark_methylation_extractor, I need to provide the path to my deduplicated — but unsorted — .bam files. The error file says that I need to use unsorted files, but I’m already directing the program to my unsorted files! I posted this issue to figure it out.
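For reference, the shape of the call I was attempting is roughly the following. The directory and file names here are hypothetical stand-ins, the flags shown are common choices rather than necessarily the exact ones I used, and the command is only echoed as a dry run, not executed:

```shell
# Dry run: build a stand-in layout and echo the extractor call instead of
# running it. The input must be the deduplicated but *unsorted* BAM files.
BAM_DIR="dedup_bams_demo"                  # hypothetical directory name
mkdir -p "$BAM_DIR" methylation_out
touch "$BAM_DIR/sample1.deduplicated.bam"  # stand-in for a real BAM
echo bismark_methylation_extractor --comprehensive --bedGraph \
  --output methylation_out "$BAM_DIR"/*.deduplicated.bam
```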

Since I’m replicating Steven’s code, the only difference between my work and his is that I used the default --score_min option. However, he replicated my work in this notebook and this notebook, and did not get an error! Thinking everything was peachy, I reran my code and still got the same error!


I am officially stumped and unable to continue these analyses for now. Back to the DNR paper!


from the responsible grad student https://ift.tt/2KyTCvv

Sam’s Notebook: FastQC – RRBS Geoduck BS-seq FASTQ data


Earlier today I finished trimming Hollie’s RRBS BS-seq FastQ data.

However, the original files were never analyzed with FastQC, so I ran it on the original files.

These libraries were originally created by Hollie Putnam using the TruSeq DNA Methylation Kit (Illumina):

FastQC was run, followed by MultiQC. Analysis was run on Roadrunner.
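The QC pass amounts to something like the following sketch. Directory names are hypothetical stand-ins, and the tool invocations are echoed as a dry run rather than executed:

```shell
# Dry run: FastQC on each raw FASTQ, then MultiQC to aggregate the reports.
FASTQ_DIR="raw_fastq_demo"          # hypothetical stand-in directory
mkdir -p "$FASTQ_DIR" fastqc_out
touch "$FASTQ_DIR/demo.fastq.gz"    # stand-in for an original FASTQ file
echo fastqc --threads 4 --outdir fastqc_out "$FASTQ_DIR"/*.fastq.gz
echo multiqc fastqc_out
```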

All analysis is documented in a Jupyter Notebook; see link below.

Jupyter Notebook:

Sam’s Notebook: TrimGalore/FastQC/MultiQC – TrimGalore! RRBS Geoduck BS-seq FASTQ data


Steven requested that I trim the Geoduck RRBS libraries that we have, in preparation to run them through Bismark.

These libraries were originally created by Hollie Putnam using the TruSeq DNA Methylation Kit (Illumina):

All analysis is documented in a Jupyter Notebook; see link below.

Overview of process:

  1. Copy EPI* FastQ files from owl/P_generosa to roadrunner.
  2. Confirm data integrity via MD5 checksums.
  3. Run TrimGalore! with --paired, --rrbs, and --non-directional settings.
  4. Run FastQC and MultiQC on trimmed files.
  5. Copy all data to owl (see Results below for link).
  6. Confirm data integrity via MD5 checksums.
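Step 3 above can be sketched like this. The read-pair filenames are hypothetical, and the command is echoed as a dry run (and saved for the record) rather than executed:

```shell
# Dry run of the TrimGalore! step: RRBS-aware trimming of one paired-end
# library, using the settings listed above. tee saves the preview to a file.
R1="EPI-demo_R1.fastq.gz"   # hypothetical read-pair filenames
R2="EPI-demo_R2.fastq.gz"
echo trim_galore --paired --rrbs --non-directional "$R1" "$R2" \
  | tee trimgalore_cmd.txt
```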

Jupyter Notebook:

Sam’s Notebook: Data Management – Illumina NovaSeq Geoduck Genome Sequencing


As part of the collaborative geoduck genome sequencing project, Illumina’s end goal has always been to sequence the genome in a single run.

They’ve finally attempted this by running 10x Genomics, Hi-C, Nextera, and TruSeq libraries in a single run of the NovaSeq.

I downloaded the data using the BaseSpace downloader with Chrome on a Windows 7 computer (the downloader is not available on Ubuntu, and the command-line tools Illumina provides are too confusing to be worth figuring out just to download the data).

Data was saved here:

Generated MD5 checksums (using md5sum on Ubuntu) and appended to the checksums file:
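The append step amounts to something like this (the file below is a stand-in, not the actual NovaSeq data):

```shell
# Create a stand-in file, append its checksum to the running list, then
# verify the whole list with md5sum -c.
printf 'demo reads\n' > demo_S1_L001_R1_001.fastq.gz  # stand-in file
md5sum demo_S1_L001_R1_001.fastq.gz >> checksums.md5
md5sum -c checksums.md5   # prints one "<filename>: OK" line per verified file
```

Keeping one running checksum file means any future copy of the data can be verified in a single `md5sum -c` call.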

Illumina was unable to provide MD5 checksums on their end, so I was unable to confirm data integrity post-download.

Illumina sample info is here:

Will add info to:

List of files received:

 10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L001_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L001_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L002_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L002_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L001_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L001_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L002_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L002_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L001_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L001_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L002_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L002_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L001_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L001_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L002_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L002_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L001_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L001_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L002_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L002_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L001_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L001_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L002_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L002_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L001_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L001_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L002_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L002_R2_001.fastq.gz 
10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L001_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L001_R2_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L002_R1_001.fastq.gz 10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L002_R2_001.fastq.gz HiC-Libraries-GeoHiC-C3-N701_S18_L001_R1_001.fastq.gz HiC-Libraries-GeoHiC-C3-N701_S18_L001_R2_001.fastq.gz HiC-Libraries-GeoHiC-C3-N701_S18_L002_R1_001.fastq.gz HiC-Libraries-GeoHiC-C3-N701_S18_L002_R2_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L001_R1_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L001_R2_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L002_R1_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L002_R2_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L001_R1_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L001_R2_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L002_R1_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L002_R2_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L001_R1_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L001_R2_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L002_R1_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L002_R2_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L001_R1_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L001_R2_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L002_R1_001.fastq.gz Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L002_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L001_R1_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L001_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L002_R1_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L002_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L001_R1_001.fastq.gz 
Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L001_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L002_R1_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L002_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L001_R1_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L001_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L002_R1_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L002_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L001_R1_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L001_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L002_R1_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L002_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L001_R1_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L001_R2_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L002_R1_001.fastq.gz Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L002_R2_001.fastq.gz 

from Sam’s Notebook https://ift.tt/2Im5uUS

Grace’s Notebook: May 14, 2018, R and figuring out qPCR and DNA Qubit data

Data organization

Today I combined the qPCR and DNA Qubit data from Pam with the subset of samples that had 20ng/µL of RNA or higher and three samples per crab based on FRP.

This is the code I used:

The file (raw) is kind of crazy now. I want to figure out how to match the data within the spreadsheet between columns.

This file is all the samples that have:

  • 3 samples per crab
  • 20ng/µL RNA or higher
  • Pam’s new data from qPCR (but only for the samples that have 3 samples/crab and 20ng/µL RNA or higher)

The file has three columns that I want to compare to see if the infection status of the samples has changed from what was originally determined by conventional PCR.

The initial column is “infection_status”. This column has “0” for uninfected and “1” for infected, based on conventional PCR.

A second column is “pos_neg_no_quant” and has “POS” for infected, and “NEG” for uninfected, and “0” as a placeholder for the empty cells.

A third column is “pos_neg” and has “pos” for infected, and “neg” for uninfected and “0” as a placeholder for empty cells.

I can’t really figure out from Pam’s email responses to my questions what the difference is between “POS” and “pos”, or between “NEG” and “neg”… I’ll show Sam what she sent me and maybe we can figure it out together on Wednesday.

But by the end of this week, for sure, I will have a visual of all the good samples within the different categories of treatments (infected and uninfected: cold, ambient, and warm) at all three sample dates. Then I will know whether I need to isolate more samples, or whether we can start preparing more seriously for sending them off for sequencing.

from Grace’s Lab Notebook https://ift.tt/2IlToXN

Sam’s Notebook: Assembly – Geoduck Hi-C Assembly Subsetting


Steven asked me to create a couple of subsets of our Phase Genomics Hi-C geoduck genome assembly:

  • Contigs >10kbp
  • Contigs >30kbp

I used pyfaidx and the following commands:

 faidx --size-range 10000,100000000 PGA_assembly.fasta > PGA_assembly_10k_plus.fasta  
 faidx --size-range 30000,100000000 PGA_assembly.fasta > PGA_assembly_30k_plus.fasta  

Output folder: 20180512_geoduck_fasta_subsets/

10kbp contigs (FastA): 20180512_geoduck_fasta_subsets/PGA_assembly_10k_plus.fasta

30kbp contigs (FastA): 20180512_geoduck_fasta_subsets/PGA_assembly_30k_plus.fasta

from Sam’s Notebook https://ift.tt/2IhZSHg

Yaamini’s Notebook: Gonad Methylation Analysis Part 13

The whole enchilada (but better this time)

Now that I’ve tested the entire bismark-to-methylKit pipeline on a 10,000-read subset of each C. virginica sequencing sample, I can begin validating the full pipeline. Steven already ran the full samples through this pipeline in this notebook. My job is to reproduce his workflow.

I opened my full pipeline Jupyter notebook. Since I’m using trimmed files from this directory, the first thing I wanted to do was test my find and xargs commands to ensure I isolated the right files.


Figure 1. Testing find and xargs on a new set of filenames.

Then, I piped the find and xargs output into bismark. Unlike Steven, I did not set a parameter for --score_min. In our conversation today, we decided that I should run the alignment with the default parameter so we can compare our results. The default is L,0,-0.2 (instead of the L,0,-1.2 Steven used).


Figure 2. Description of each line of code, along with the actual command.
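The general shape of that pipeline looks like this. The directory names below are hypothetical stand-ins, and the bismark call is only echoed as a dry run instead of launching the multi-day alignment:

```shell
# Dry run: build the bismark call from find + xargs and echo it instead of
# executing it. --score_min is omitted, so bismark would fall back to its
# default of L,0,-0.2.
TRIMMED_DIR="trimmed_demo"                  # hypothetical directory names
GENOME_DIR="Cvirginica_genome"
mkdir -p "$TRIMMED_DIR"
touch "$TRIMMED_DIR/sample1_trimmed.fq.gz"  # stand-in for a trimmed file
find "$TRIMMED_DIR" -name "*.fq.gz" \
  | xargs -I{} echo bismark --genome "$GENOME_DIR" {} \
  | tee bismark_cmds.txt
```

Echoing the constructed command first is a cheap way to catch the path and typo problems described below before committing to a days-long run.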

It took me a bit of finagling to get that code to work because of some typos! I’m going to jot down a few reminders for myself:

  • Always check to see if each line of a multi-line command has its “\” continuation character
  • Use tab complete to ensure lack of typos and the correct path to files
  • Verify you’re using the correct command
  • Test small bits of code before putting everything together

I’ll check on the alignment progress tomorrow, but my guess is that this will take 2-3 days to complete. Once the alignment is finished, I will deduplicate, sort, and index the resulting .bam files.
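Those three follow-up steps look roughly like this (filenames hypothetical; the commands are echoed as a dry run and saved to a list rather than executed):

```shell
# Dry run of the post-alignment bookkeeping: deduplicate with Bismark's
# helper script, then coordinate-sort and index the result with samtools.
{
  echo deduplicate_bismark --bam sample1_bismark_bt2.bam
  echo samtools sort sample1_bismark_bt2.deduplicated.bam \
    -o sample1.deduplicated.sorted.bam
  echo samtools index sample1.deduplicated.sorted.bam
} | tee post_alignment_cmds.txt
```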


from the responsible grad student https://ift.tt/2I7zZcR

Yaamini’s Notebook: Gonad Methylation Analysis Part 12

Reproducibility rodeo

My goal was to run my samples through Steven’s methylKit R code. The first thing I did was copy his code into this R script. I then added more descriptive names for the objects and commented each line of code so I understood what was happening.

I was able to run processBismarkAln on my deduplicated.sorted.bam files! However, I could not unite these files. That’s when I posted a Github issue. After some back-and-forth, Steven suggested I download some of his dedup.sorted.bam files and run them through the R script. By using his files, I could figure out whether the unite issue stemmed from my files or from software differences on the Mac mini.

I created a new R script and downloaded dedup.sorted.bam files 1 and 10 (i.e. one from each treatment). I was able to successfully unite these files and go through the entire script! While I couldn’t identify any DMRs between the two files, I did come up with two reasons why Steven’s files worked and mine did not:

  1. Steven’s files are trimmed, but mine are not
  2. Steven’s files used the first 1,000,000 reads as a subset, but mine used the first 10,000

Either way, Steven said I should move on to the full pipeline and validate everything. Sam and Steven both pointed me to this directory, which has the trimmed files. Steven also said that I should remove the --score_min argument. This will make bismark run with the default setting for alignment mismatches and allow us to compare the alignment results. Time to start a computer process that will take several days to run!


from the responsible grad student https://ift.tt/2Im6dVp