Spent Friday looking for differentially methylated regions in the Day 10 geoduck samples of Hollie’s.
We’re using the MACAU algorithm for DMR identification, and that requires four files to properly run.
- Vector of predictor variables (in our case, OA treatment of Ambient/Low/Super Low)
- n x n matrix of pairwise relatedness values for n samples
- Methylation counts for samples, with columns being samples, and rows being site IDS: A note regarding MethylExtract output, the output has a vector of scaffolds and a vector of nucleotide positions, these need to be concatenated in to a single vector to conform to MACAUs requirement.
- Total Coverage counts for samples. Has the same requirements as the methylation counts for single scaffold/position vector.
I had built the predictor file previously. Ambient samples were coded 1, Low pH samples were coded 2, and Super-low pH samples were coded 3. Sample information was found [here](https://github.com/hputnam/project_juvenile_geoduck_OA/tree/master/Sample_Processing)
For the relatedness matrix we’re using VCFtools based on SNPs:
I was unable to get MethylExtract to operate on a group of samples properly, so I had to run each sample individually. This necessitated an additional step in merging all of the sample SNPs in to a single file. This is also done in VCFtools via the `vcd-merge` command, which requires bgzipped SNP files, with tabix index files. The code below creates the bgzip SNP files and the index files in tabix, then runs vcf-merge. This takes a while, ~ 25 minutes for the Day 10 samples.
After that, we feed the aggregated SNP file in to vcftools
which results in
VCFtools output is not quite appropriate for MACAU, as it’s a 3 x ((n-1)(n)/2) length vector so I need to reformat it in Excel to the n x n format we want. This is just a whole lot of copy and transpose paste. The result in excel looks like:
Notice there are no column or row headers. Meaning for values is derived by the order put in, so count files, predictor files, and relatedness files have to have the same order.
Methylation and Total Coverage count files were created simultaneously in R.
A quick walkthrough of the code below. First, I make a vector of samples I want to iterate over. Then I work on my first sample outside of the loop, to avoid any fencepost issues. I read the data in, turn it in to a data.table (an expanded data frame, which deals with very large files better), sum across both the Watson and Crick strands for methylation and total coverage, saving these to new vectors in my data table and then save the Scaffold, location, context, and sum of methylation/coverage to a new data table.
After I have my first post created, I iterate through all other samples, using the same approach. The end result are two data tables with Columns for Scaffold, Position, Context, and counts for each sample.
This isn’t quite the proper format for MACAU, plus it has a bunch of locations that don’t have a complete set of reads across all samples at the proper sequencing depth, so I find all of the samples we want to actually analyze below
I begin by creating a new data table that only has positions that have data for each sample. Then I further subset the data by only accepting sites that have 10x coverage across all samples. I then created a new data table with a concatenated Scaffold/Position column, and count data from each sample that meets our requirements. These files are saved as tsv files for input into MACAU
I now have all the files I need to run the MACAU algorithm, which I do below. The output isn’t very exciting, but it’s done!
The results look like:
MACAU creates a binomial mixed model for each location, and tests if the coefficient on the predictor value, beta, is significant. In essence, does the predictor/treatment add meaningful/significant information to the model for the location, which I believe indicates that the predictor/treatment has a significant effect on the methylation pattern at that site.
With the MACAU results, I want to go through and find the locations are significantly affected by the OA treatment. I do that below. First I save all files with beta p-values less than 0.05, then I create a new data table following the .bed format of scaffold, location a, location b, and a value, I chose the beta value, but that was just an off the cuff decision. I then save this as a bed file for viewing in IGV.
I had some percent methylation bed graph files I had created earlier that I compared to the bed file I created above in IGV. I didn’t save screenshots from IGV, but will update this post with screenshots once I’m done downloading the files.