Checking stacked barplots and post-hoc tests
Previously I remade M. capitata genome tracks and my genomic location stacked barplot. Something looked funky, but I tabled my investigation until later. Later is now, and my goal is to resolve that issue and finalize post-hoc tests to understand GLM output for methylation status and genomic location.
M. capitata genomic location
The first thing I did was check my
bedtools code in this Jupyter notebook. That code looked good, so I figured there must have been an issue when I processed line count output in this R Markdown script. Before I began, I cleared by environment. I then reran the script to produce the overlap counts and overlap percents tables. Looking at the tables, WGBS detected 10% of CpGs that overlapped with CDS, while MBD-BS detected 14% of CpGs overlapping with CDS. This was not reflected in the stacked barplot I made! I ran the code fo my barplot and saw that it now reflected those percentages. My hunch is that clearing my environment did the trick. Since my union data and individual data scripts use the same dataframe names, I think the barplot I made last week called the correct dataframe name, but the content was for individual samples. I posted the revised barplot in this issue and moved on.
Figure 1. M. capitata and P. acuta overlaps with various genome features (pdf here).
I felt good about my PCoA, perMANOVAs, and models, but I still didn’t have all the information I needed. For each CpG methylation status or genomic feature overlap, the models test WGBS vs. RRBS or WGBS vs. MBD-BS, since WGBS is the base condition. However, the models don’t compare RRBS with MBD-BS. I wasn’t sure what post-hoc test to run, so I asked Evan. He suggested I use
emmeans to run post-hoc tests; I did so in this script. The package runs estimated marginal means tests on each model and corrects p-values. I can then obtain log odds ratio test output and corrected p-values with a FDR. The command itself is very simple:
emmeans(McapCpGHighModel, pairwise ~ seqMethod, adjust = "FDR")
The output provides confidence intervals for proportion overlap for each method on the logit scale, as well as log odds ratio results. I only wanted the test results since we’re not concerned with absolute proportions. To do this, I specified
$contrasts. I also took the output with adjusted p-values and saved it as a dataframe:
McapCpGHighPostHoc <- data.frame(emmeans(McapCpGHighModel, pairwise ~ seqMethod, adjust = "FDR")$contrasts) #Run pairwise comparisons (estimated marginal means). Obtain log odd ratio results and not confidence intervals for individual methods in a dataframe format. Specify FDR instead of Tukey post-hoc test (default) head(McapCpGHighPostHoc) #Look at log odd ratio results
rbind to combine statistical output for each model:
After I knit my script, I shared the output in this issue and on Slack. I still think Meth 8 may be an outlier, and I don’t know if the PCoAs should be included with the stacked barplot as a multipanel plot, or if they should just go in the supplement.
I think that’s it for my statistical analyses! Based on what we want for figures I can go back and make the PCoAs look pretty. Now I need to understand these results and update the paper.
- Update methods
- Update results
- Update figures
- Look into program for mCpG identification