Hawaii Gigas Methylation Analysis Part 17

Trying DSS for DML identification

Initially, I wanted to test-run DSS and ramwas. I looked into the manuals for both packages, and wasn’t convinced that ramwas was a good fit for my data. I could associate methylation information with genotype data, which seemed valuable. However, my genotype data came from BS-Snper analyses with my WGBS data, and looking for associations between WGBS and WGBS-derived genotype data didn’t make sense to me! Instead, I focused on using DSS to identify differentially methylated loci.

DSS background

DSS uses a Bayesian hierarchical model to estimate dispersions, then uses a Wald test for differential methylation, to calculate CpG methylation in a general experimental design. I was interested in using this package because I wanted to test pH and ploidy simultaneously, instead of using each variable as a covariate in my analysis! Coverage data (i.e. count data) are modeled using a beta-binomial distribution with an arcisne link function and log-normal priors, and the model is fit on the transformed data using the generalized least squares method. A Wald test is used at each CpG site for statistical testing. The test itself depends on sequencing depth and a dispersion parameter. This dispersion parameter is defined by a shrinkage estimator from the Bayesian hierarchical model. After conducting the Wald test, I can define DML based on p-value and FDR thresholds.

According to the package creators, this method is better than the Fisher’s test used by methylKit. With the Fisher’s test, read counts are added between samples in a treatment (think unite). By doing this, the test assumes that data in each sample are distributed in the same way, and obscures any variation between samples in a treatment. Based on the data I’ve seen from methylKit, the data do have similar distributions, but I have seen certain samples look like outliers in the PCA. Retaining the biological variation in each treatment with the dispersion parameters seems like a good step in the right direction.

Another thing DSS does while modeling is smoothing, which involves using information from nearby CpG sites to better inform methylation calculations at a specific CpG site. The manual made it sound like smoothing was always recommended to improve calculations, but the paper only suggested using smoothing when the data is “dense” and there are CpG sites closeby. This immediately made me think of differences between the vertebrate and invertebrate methylation model. It’s not likely that I’ll have many CpGs nearby to estimate methylation information, so I decided not to use any smoothing.

Oh, did I mention that DSS was designed to be less computationally intensive? Very important.

Installing packages on R Studio server

The first step of my analysis was installing packages on mox to use with the R Studio server! I needed to install DSS and its dependency bsseq. At first, I opened a build node and tried installing the packages that way, but I ran into continual errors. I then tried to install the packages from the R Studio server interface itself, and ran into similar issues. Turns out Aidan was having similar issues and posted this discussion. Sam mentioned that if we wanted to install packages to use on the server, we needed to create our own singularity container.

I followed Sam’s instructions in the handbook to create a singularity container. First I created a definition file:

Screen Shot 2021-05-25 at 2 35 52 PM

I then created the script r_packages_installs.R to install specific R packages, like methylKit, bsseq, and DSS:

Screen Shot 2021-05-25 at 2 36 14 PM

After logging into a build node, I loaded the singularity module and built the container rstudio-4.0.2.yrv-v1.0.sif. Finally, I modified my R Studio SLURM script, found here, to call my new singularity container! I think I ran into some errors later on because my singularity container was in my login node instead of somewhere wiht more storage, but that didn’t seem to be the sole cause. There might be something weird with my container or SLURM script, but it didn’t fully affect my ability to do the analysis.

Working through DSS

To work through this analysis, I relied heavily on the DSS manual! I pasted the code that ran on the R Studio server into this R Markdown script so it was easier to follow.

First, I modified my merged coverage files from bismark to fit DSS input requirements. Each sample needed a corresponding text file with chromosome, position, read coverage, and reads methylated, and required a header. For position, I used the start of the CpG dinucleotide.

#Create chr, pos, read cov, and reads meth columns for f in *cov do awk '{print $1"\t"$2"\t"$5+$6"\t"$5}' ${f} \ > ${f}.txt done #Add header for f in *txt do sed -i '1i chr\tpos\tN\tX' ${f} done #Rename files: zr3644_#.txt for f in *txt do [ -f ${f} ] || continue mv "${f}" "${f//_R1_val_1_val_1_val_1_bismark_bt2_pe..CpG_report.merged_CpG_evidence.cov/}" done 

To import multiple files into R Studio, I used a method with list2env I found two years ago:

coverageFiles <- list.files(pattern = "*txt") #Create a file list for all 24 files to import list2env(lapply(setNames(coverageFiles, make.names(gsub("", "", coverageFiles))), read.table, header = TRUE), envir = .GlobalEnv) #Import all coverage files into R global environment, using first line as header 

Once I had the data in R Studio, I created a BSseq object (this is why I needed the bsseq dependency). This object is just a large conglomeration of coverage information for each sample and sample ID:

BSobj <- makeBSseqData(dat = list(zr3644_1.txt, zr3644_2.txt, zr3644_3.txt, zr3644_4.txt, zr3644_5.txt, zr3644_6.txt, zr3644_7.txt, zr3644_8.txt, zr3644_9.txt, zr3644_10.txt, zr3644_11.txt, zr3644_12.txt, zr3644_13.txt, zr3644_14.txt, zr3644_15.txt, zr3644_16.txt, zr3644_17.txt, zr3644_18.txt, zr3644_19.txt, zr3644_20.txt, zr3644_21.txt, zr3644_22.txt, zr3644_23.txt, zr3644_24.txt), sampleNames = c("2H-1", "2H-2", "2H-3", "2H-4", "2H-5", "2H-6", "2L-1", "2L-2", "2L-3", "2L-4", "2L-5", "2L-6", "3H-1", "3H-2", "3H-3", "3H-4", "3H-5", "3H-6", "3L-1", "3L-2", "3L-3", "3L-4", "3L-5", "3L-6")) #Make BSseq object. dat = list of dataframes with coverage information. sampleNames = sample names. 

The BSseq object is what’s used when fitting the model! After creating the BSseq object, I defined the experimental design, making sure to match sample ID information in the row names with what I included in the BSseq object:

design <- data.frame("ploidy" = c(rep("D", times = 12), rep("T", times = 12)), "pH" = c(rep("H", times = 6), rep("L", times = 6), rep("H", times = 6), rep("L", times = 6))) #Create dataframe with experimental design information row.names(design) <- c("2H-1", "2H-2", "2H-3", "2H-4", "2H-5", "2H-6", "2L-1", "2L-2", "2L-3", "2L-4", "2L-5", "2L-6", "3H-1", "3H-2", "3H-3", "3H-4", "3H-5", "3H-6", "3L-1", "3L-2", "3L-3", "3L-4", "3L-5", "3L-6") #Set sampleID as rownames 

To fit the general linear model to the data, I used the DMLfit.multiFactor function, which allowed me to specify a model with ploidy, pH, and in the interaction between the two variables. After fitting the model, I checked the column names to 1) confirm that the model fit the variables I wanted and 2) see which condition was picked at the “treatment” condition (this usually occurs alphabetically, and luckily for me my treatments are L and T, which come after H and D):

DMLfit <- DMLfit.multiFactor(BSobj, design = design, formula = ~ploidy + pH + ploidy:pH) #Formula: intercept, additive effect of ploidy and pH, and interaction colnames(DMLfit$X) #Column names of the design matrix (ploidyT = ploidy triploid, pHL = pH low, ploidyT:pHL = interaction) 

For each coefficient, I extracted model fit information, then identified DML using a p-value < 0.05 and FDR < 0.01. This is roughly equivalent with a p-value < 0.05 and q-value < 0.01 I used with methylKit:

DMLtest.pH <- DMLtest.multiFactor(DMLfit, coef = "pHL") #Test the pH effect pH.DML <- subset(DMLtest.pH, subset = pvals < 0.05 & fdrs < 0.01) #Obtain loci with p-value < 0.05 and FDR < 0.01 

One important thing to note is that I couldn’t select a percent methylation difference with a generalized linear model. The model doesn’t produce methylation values to begin with, and using multiple regressions makes it more complicated to return these kinds of values (see the FAQ here).

Looking at DML in IGV

Next step: look at DML in IGV! I opened this IGV session and added in DSS DML BEDfile tracks I produced in this Jupyter notebook. When I looked at the DML, it was easier to understand why some DML were called over others. I definitely ended up in a confusing/existential place: if we don’t quite understand what methylation levels really mean, can I look at a DML in IGV and make a judgement call as to what looks like differential methylation? I decided to keep my DSS DML moving forward, even if I couldn’t understand why certain loci were identified as differentially methylated.

DML overlaps between methods

Based on the differences I saw with methylKit and DSS DML, I was interested in what overlaps existed between the different sets. In this Jupyter notebook, I used intersectBed to look at those overlaps.

Table 1. Number of DML common between contrasts and methods. 25 DML overlapped between all DSS lists.

DML List methylKit pH methylKit ploidy DSS pH DSS ploidy DSS ploidy:pH
methylKit pH 42 2 1 N/A N/A
methylKit ploidy 29 N/A 3 N/A
DSS pH 178 21 11
DSS ploidy 154 17
DSS ploidy:pH 53

I found it interesting that only 1 DML overlapped between the different pH lists, and there were 3 common between the ploidy lists. This is probably a testament to the different methods used to call differential methylation (Fisher’s exact test vs. Wald test).

DML overlaps with genomic features

For each DSS list, I used intersectBed to look at genomic locations of DML in this Jupyter notebook.

Table 2. DML distribution in genomic features for DSS lists

Genome Feature pH ploidy ploidy:pH
Total 178 154 53
Genes 123 145 48
Exon UTR 4 8 3
CDS 15 26 6
Introns 104 114 41
Upstream flanks 2 8 0
Downstream flanks 2 8 0
Intergenic regions 27 18 5
lncRNA 4 1 2
Transposable elements 86 66 14
Unique C/T SNPs 5 3 1

Even though the number of DML are different between the two methods, the genomic locations are basically the same: most DML are in introns, followed by transposable elements. One thing that is important to note is that the percent of DML that overlap with C/T SNPs is much less with DSS than methylKit (~2% vs. ~20%).

So, I guess I just have multiple DML lists now? I’ll probably move forward with each list separately for enrichment analysis, then make some sort of decision after that. I will also need to find a way to construct a randomization test with DSS! My guess is that I could get results from that randomization test sooner rather than later, since DSS was less computationally intensive than methylKit.

Going forward

  1. Perform enrichment
  2. Conduct randomization test with DSS
  3. Try EpiDiverse/snp for SNP extraction from WGBS data
  4. Run methylKit randomization test on mox
  5. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  6. Transfer scripts used to a nextflow workflow
  7. Update methods
  8. Update results
  9. Create figures

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/2RLvHlE

TWIP 17 – What Happens when Steven does not show up?

Hawaii Gigas Methylation Analysis Part 16

Reworking pH-DML

When writing up my methods, I realized that I incorrectly re-assigned treatments in methylKit when I was shuffling between ploidy and pH treatment assignments! Now that I was using min.per.group with unite, I couldn’t just re-assign treatments because the min.per.group would always be based off of the ploidy treatment information, not the pH treatment information. This could mean that I would call DML with less than eight samples per pH treatment! So, back to the code I go.

Returning to methylKit

I opened the R Studio server and my R script to recode the treatment re-assignment. I just needed to re-assign treatments the step before unite:

methylationInformationFilteredCov5T <- methylKit::reorganize(processedFilteredFilesCov5, sample.id = sampleMetadata$sampleID, treatment = sampleMetadata$pHTreatment) %>% methylKit::unite(., destrand = FALSE, min.per.group = 8L) #Re-assign treatment information to filtered and normalized 5x CpG loci, then unite 5x loci with coverage in at least 8/12 samples/treatment 

With this re-assignment, I had 5,086,421 CpGs with data in at least eight samples per pH treatment. I then used this new object to identify DML! After finishing up on mox, I used rsync to move my revised DML lists and data to gannet.

Unique C/T SNPs

The next thing I wanted to do was determine how many unique C/T SNPs were in my data in this Jupyter notebook. First, I used cat to combine all SNP lists:

#Combine C/T SNPs into one file !cat *CT-SNPs.tab >> all-CT-SNPs.tab 

Then, I used sort and uniq to pull out all the unique SNPs:

!sort all-CT-SNPs.tab \ | uniq \ > unique-CT-SNPs.tab 

When I looked at the output, I realized there were still duplicate C/T SNPs! The eighth column had information from the VCF file that was unique from each sample, so that was likely preventing me from pulling out the unique SNPs properly. Instead, I used awk to pull the first five columns out (chr, bp, ?, C, T), and used that as input into sort and uniq.

!awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5}' all-CT-SNPs.tab \ | sort \ | uniq \ > unique-CT-SNPs.tab 

That gave me 289,826 unique C/T SNPs in my data.

DML characterization

In this Jupyter notebook, I used my new pH-DML and unique C/T SNP lists to look at DML locations in the genome.

Table 1 Overlaps between DML and various genome features.

Genome Feature pH-DML ploidy-DML Common DML
Total 42 29 2
Genes 36 25 2
Exon UTR 5 0 0
CDS 7 7 1
Introns 24 18 1
Upstream flanks 0 0 0
Downstream flanks 4 1 0
Intergenic regions 2 3 0
lncRNA 3 0 0
Transposable elements 16 8 0
Unique C/T SNPs 6 5 0

Going forward

  1. Test-run DSS and ramwas
  2. Try EpiDiverse/snp for SNP extraction from WGBS data
  3. Run methylKit randomization test on mox
  4. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  5. Transfer scripts used to a nextflow workflow
  6. Update methods
  7. Update results
  8. Create figures

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3oEWjAG

TWIP 15 – Oh look! They are twins!

Manual Clustering

Moving Forward? Good news: I made a whole bunch of progress Bad news: Doesn’t seem like WGCNA will work out well So I figured out how to run RStudio on Mox, and then used that to run WGCNA. I ran every group of crab over time, and with all possible transcriptomes. In case you haven’t been following along too closely, that means I ran WGCNA on ambient-temperature crabs (that’s crabs A, B, and C), aligned to a transcriptome unfiltered by taxa (cbai_transcriptomev2.0). I then repeated for those same libraries, but aligned to a transcriptome BLASTed against a Chionoecetes genome (cbai_transcriptomev4.0) and another transcriptome filtered to contain only Alveolata sequences (hemat_transcriptomev1.6). I then repeated the process for elevated-temperature crabs (or crabs G, H, and I). All those scripts are available in my scripts directory, with the notation 5#_WGCNA_[transcriptome of choice]_[temperature_treatment]Crabs.Rmd Results are available within output/WGCNA_output. WGCNA produces modules, which cluster the…

from Aidan F. Coyle https://ift.tt/3fqBzsk

Trimming – O.lurida BGI FastQs with FastP on Mox

After attempting to submit our Ostrea lurida (Olympia oyster) genome assembly annotations (via GFF) to NCBI, the submission process also highlighted some short comings of the Olurida_v081 assembly. When getting ready to submit the genome annotations to NCBI, I was required to calculate the genome coverage we had. NCBI suggested to calculate this simply by counting the number of bases sequenced and divide it by the genome size. Doing this resulted in an estimated coverage of ~55X coverage, yet we have significant stretches of Ns throughout the assembly. I understand why this is still technically possible, but it’s just sticking in my craw. So, I’ve decided to set up a quick assembly to see what I can come up with. Of note, the canonical assembly we’ve been using relied on the scaffolded assembly provided by BGI; we never attempted our own assembly from the raw data.

To start, I needed to trim the data. I ran the BGI Illumina sequencing data through fastp on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210518_olur_fastp_bgi ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210518_olur_fastp_bgi ### Fastp 10x Genomics data used for P.generosa genome assembly by Phase Genomics. ### In preparation for use in BlobToolKit ### Expects input filenames to be in format: *.fastq.gz ################################################################################### # These variables need to be set by user ## Assign Variables # Set number of CPUs to use threads=40 # Input/output files trimmed_checksums=trimmed_fastq_checksums.md5 raw_reads_dir=/gscratch/srlab/sam/data/O_lurida/DNAseq fastq_checksums=raw_fastq_checksums.md5 # Paths to programs fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() # Programs associative array declare -A programs_array programs_array=( [fastp]="${fastp}" \ [multiqc]="${multiqc}" ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Capture date timestamp=$(date +%Y%m%d) # Sync raw FastQ files to working directory rsync --archive --verbose \ "${raw_reads_dir}"/1*.fq.gz . # Create arrays of fastq R1 files and sample names for fastq in 1*_1*.fq.gz do fastq_array_R1+=("${fastq}") done # Create array of fastq R2 files for fastq in 1*_2*.fq.gz do fastq_array_R2+=("${fastq}") done # Run fastp on files # Trim 10bp from 5' from each read # Adds JSON report output for downstream usage by MultiQC for index in "${!fastq_array_R1[@]}" do # Remove .fastq.gz from end of file names R1_sample_name=$(echo "${fastq_array_R1[index]}" | sed 's/.fq.gz//') R2_sample_name=$(echo "${fastq_array_R2[index]}" | sed 's/.fq.gz//') # Get sample name without R1/R2 labels sample_name=$(echo "${fastq_array_R1[index]}" | sed 's/_[12].*//') echo "" echo "fastp started on ${sample_name} FastQs." # Run fastp # Specifies reports in HTML and JSON formats ${fastp} \ --in1 ${fastq_array_R1[index]} \ --in2 ${fastq_array_R2[index]} \ --detect_adapter_for_pe \ --thread ${threads} \ --html "${sample_name}".fastp-trim."${timestamp}".report.html \ --json "${sample_name}".fastp-trim."${timestamp}".report.json \ --out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \ --out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz echo "fastp completed on ${sample_name} FastQs" echo "" # Generate md5 checksums for newly trimmed files { md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz } >> "${trimmed_checksums}" # Create MD5 checksum for reference { md5sum "${fastq_array_R1[index]}" md5sum "${fastq_array_R2[index]}" } >> ${fastq_checksums} # Remove original FastQ files rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}" done # Run MultiQC ${programs_array[multiqc]} . ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "

Hawaii Gigas Methylation Analysis Part 15

Investigating methylation data

Quick notebook post to update my progress characterizing this dataset! I followed workflows I established with the Manchester data to characterize the general methylation landscape and understand DML locations.

Characterizing methylation landscapes

I used this Jupyter notebook to obtain methylated, sparsely methylated, and unmethylated CpG loci with at least 5x coverage in my sequencing data. Similar to the Manchester dataset, I created a union BEDgraph to concatenate percent methylation across samples, then used this union file and individual sequence files for downstream analyses. As I used intersectBed to identify methylated, sparsely methylated, and unmethylated loci for 25 files, then determined the genomic location of the loci in those output files, I realized I was producing a lot of intermediate output files! When I go visualize this data, I need to find a way to easily take file line counts and turn into a data table. But that’s a problem for future Yaamini.

DML locations

Now to DML locations! Based on my methylKit results, I decided to use a 25% methylation difference to define differential methylation. I first converted the DML lists into BEDfiles in this Jupyter notebook, and visualized the DML and 5x sample BEDgraphs in this IGV session. I color-coded the samples so that light blue = 2N + high pH, light purple = 3N + high pH, dark blue = 2N + low pH, and dark purple = 3H + low pH. I looked at a couple of the DML and saw the patterns matched pretty well with what I thought a pH- or ploidy-DML should look like. After having a quick look, I used intersectBed to find overlaps between pH- and ploidy-DML, and discern the genomic location of DML. Interestingly, only two DML overlapped between treatments! As expected, a majority of DML were found in genic regions. I also looked at overlaps between DML and C/T SNPs I filtered in this Jupyter notebook. Again, I need to find an efficient way to take line counts from my intersectBed output files and format them in a table!

Going forward

  1. Test-run DSS and ramwas
  2. Try EpiDiverse/snp for SNP extraction from WGBS data
  3. Run methylKit randomization test on mox
  4. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  5. Transfer scripts used to a nextflow workflow
  6. Update methods
  7. Update results
  8. Create figures

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3bClFdm

WGBS Analysis Part 29

Running randomization tests on mox

I initially tried running the randomization tests on the R Studio server, but encountered several timeouts. I decided to run an R script through a SLURM script, like I’ve tried previously.

Editing code

First, I removed my randomization test code from my DML identification script, and copied my code into a new randomization test script. I trimmed down the randomization test code to only include the 75% difference for female samples and 99% difference for indeterminate samples in calculateDiffMeth. I also pointed to the revised script in this SLURM script, and modified my output text file name so I could easily find it.

Then I ran the SLURM script! Of course I ran into package installation issues. When loading methylKit, R could not find the GenomicRanges package. Thinking this had something do to with the ~/.Renviron file I modified for the R Studio server, I set a new path for my R packges: /gscratch/srlab/rpackages. I realized soon after that this wouldn’t solve the problem, since methylKit and all its associated packages are nested in my user directory: /gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/! In a build node, I went down the rabbit hole of packages that needed to be explicitly loaded, and ended up with the following list to get methylKit loaded successfully:

require(BiocGenerics, lib.loc = "/gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/") require(S4Vectors, lib.loc = "/gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/") require(IRanges, lib.loc = "/gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/") require(GenomeInfoDb, lib.loc = "/gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/") require(GenomicRanges, lib.loc = "/gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/") require(methylKit, lib.loc = "/gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/") 

Another mox error

Once I cleared this hurdle, I encountered another one. My randomization test was not running because the sample ID information was not formatted correctly. I thought this was strange becaus I ran this code on the R Studio server without running into this error. In any case, I looked at the output file and saw that the sample ID information was saved as factors, and not characters. After creating the sampleMetadataFem dataframe with information, I converted the factors to characters:

sampleMetadataFem <- data.frame("sampleID" = c("1", "3", "4", "6", "7", "8"), "slide-label" = c("02-T1", "06-T1", "08-T2", "UK-02", "UK-04", "UK-06"), "pHTreatment" = c(rep(1, times = 3), rep(0, times = 3))) #Create metadata table sampleMetadataFem$sampleID <- as.character(sampleMetadataFem$sampleID) #Convert to character format 

I did this for the indeterminate samples too, since I’d likely run into the same error! When I ran my revised script, it started cranking through. Then I encountered other errors I hadn’t seen before:

Screen Shot 2021-05-17 at 5 48 46 PM

I posted this discussion to get feedback, and started running the randomization test without the mc.cores argument to see if I could reproduce the error. Two hours after running the test without mc.cores in calculateDiffMeth, I still hadn’t encountered that error. I wonder if these errors could have contributed to the R Studio server crashing when I used mc.cores previously. In any case, I’m not sure how I can run these tests now…

Going forward

  1. Run methylKit randomization test
  2. Update mox handbook with R information
  3. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  4. Identify overlaps between SNP data and other epigenomic data
  5. Write methods
  6. Write results
  7. Identify significantly enriched GOterms associated with DML
  8. Identify methylation islands and non-methylated regions
  9. Determine if larval DNA/RNA should be extracted

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3eVgGGL

WGBS Analysis Part 28

Working with BS-Snper SNPs

I’m returning to the BS-Snper analysis so I can characterize overlaps with SNPs and DML! This will help me understand if the differential methylation is solely attributable to treatment differences, or if there is also a genetic component.

Understanding VCF output

First, I wanted to understand (roughly) the VCF output I got from BS-Snper. I found this link that had good explanations of the header and columns. The first ~100 lines of the file are part of the VCF header, and I ignored it for the most part. Once I got to the column header information and actual output, I appreciated the clear information from the GATK website. Based on my workflow, I needed a way to filter SNPs with a C reference allele and T alternative allele.

Screen Shot 2021-05-15 at 8 12 06 PM

Filtering C/T SNPs

Steven suggested a few weeks ago that I could use bedtools intersect to look at overlaps between SNPs and CG sites or DML. I opened this Jupyter notebook to first filter C/T SNPs. I thought knowing how many C/T SNPs were present in my data would be helpful.

I knew I could use VCF files with intersectBed, and at first I tried intersecting the VCF with the CpG methylation output. However, intersectBed wouldn’t accept the format of the CpG tab-delimited output, since it only had one column with base pair position information. Instead, I decided to use the CG motif track I generated. Before writing out the file, I looked at the head and saw there were G/A SNPs that intersected with my CG motif track! This makes sense, because the CG motif track has positions for the Cs AND Gs that make up the CpG dinucleotide. Based on Mac’s comment on my previous discussion, she was looking at C/T SNPs only. I then used grep to filter the standard input so the reference and alternative allele columns were C and T, respectively:

#Intersect VCF with SNP locations and CG motif track #BEDtools output has C/T and non-C/T SNPs #Use grep to isolate C/T SNPs (tab between C and T required) !{bedtoolsDirectory}intersectBed \ -u \ -a /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/SNP-results.vcf \ -b ../../genome-feature-files/cgigas_uk_roslin_v1_fuzznuc_CGmotif.gff \ | grep "C T" \ > /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/CT-SNPs.tab 

My filtered SNP file can be found here! I proceeded to do something similar with SNPs from individual samples:

%%bash for f in zr3616*vcf do /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \ -u \ -a ${f} \ -b /Users/yaamini/Documents/project-gigas-oa-meth/genome-feature-files/cgigas_uk_roslin_v1_fuzznuc_CGmotif.gff \ | grep "C T" \ > ${f}_CT-SNPs.tab head ${f}_CT-SNPs.tab wc -l ${f}_CT-SNPs.tab done 

At this point, I ended up with some really ugly file names (ex. zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe.SNP-results.vcf_CT-SNPs.tab ). I couldn’t figure out how to get xargs | basename to work within a loop, so I looked for another way to rename files en masse. Stack Overflow reminded me that mv existed beyond moving files from one place to another! The syntax looked a lot like sed, but it allowed me to remove the same pattern from all my filenames:

%%bash for f in zr3616*CT-SNPs.tab do [ -f ${f} ] || continue mv "${f}" "${f//_R1_val_1_val_1_val_1_bismark_bt2_pe.SNP-results.vcf/}" done 

The merged BAM file contained ~5 million SNPs, and ~120,000 of those were C/T SNPs (Table 1). Looking at individual files, there were ~3 million overall SNPs for each sample, with ~65,000-70,000 C/T SNPs. That means about 2% of SNPs in any BAM file were C/T SNPs. I meant to look at the C/T SNPs in IGV, but after I added my BAM files and was adding SNP files, the application crashed. I’ll try again later.

Overlaps with DML

I used intersectBed to look at overlaps between merged SNPs or individual SNPs, and either female-, indeterminate- or common DML in this notebook.

Table 1. Number of overall SNPs, C/T SNPs, and SNPs overlapping with various DML categories.

BAM All SNPs C/T SNPs Female-DML Indeterminate-DML Common DML
Merged 5,519,308 122,306 38 298 0
1 3,083,382 69,845 18 130 0
2 3,105,080 70,244 17 201 0
3 3,202,988 72,064 15 145 0
4 3,232,583 73,664 17 138 0
5 3,007,518 66,965 18 155 0
6 3,204,395 72,123 19 138 0
7 3,083,706 68,856 21 128 0
8 3,100,801 69,695 23 138 0

For the merged BAM file, ~12% of female-DML are SNPs, while ~4% of the indeterminate-DML are! I’m not really sure what this means, and how many of the individual DML overlap with eachother or the merged file. I think I’ll need to mess around with the SNP-DML overlap lists in R to understand how many unique locations are SNPs, and create some sort of visualization.

Going forward

  1. Finish running methylKit randomization tests on R Studio server
  2. Update mox handbook with R information
  3. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  4. Identify overlaps between SNP data and other epigenomic data
  5. Write methods
  6. Write results
  7. Identify significantly enriched GOterms associated with DML
  8. Identify methylation islands and non-methylated regions
  9. Determine if larval DNA/RNA should be extracted

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3tTJ6VE

Hawaii Gigas Methylation Analysis Part 14

Messing around with methylKit DML settings

The R Studio server works!

Last time I mentioned how the R Studio server wasn’t playing well with my R Markdown script, and that I started running an R script for the analysis instead. Turns out making the switch is what I needed to run my code! When I was working on the Manchester analysis, I did end up using an R script instead of an R Markdown file. This makes me think that there’s some sort of issue using R Markdown on R Studio server, specifically with calculateDiffMeth. Something to investigate at a later date.

Evaluating methylKit DML output

After unite, I had 1,984,530 CpG loci with data in all 24 samples. Using this object with calculateDiffMeth (including covariates and an overdispersion correction), I didn’t obtain many ploidy- or pH-DML:

Table 1. Number of DML obtained from various methylation difference thresholds.

Contrast 25% 50% 75%
Ploidy 1 0 0
pH 4 1 0

Table 2. DML obtained at various percent methylation differences for ploidy or pH contrasts.

Contrast chr start end pvalue qvalue meth.diff
ploidy NC_047566.1 46447078 46447080 8.71E-13 1.62E-06 37.3155448
pH NC_047561.1 7843128 7843130 5.50E-08 0.0044294 -26.315789
pH NC_047565.1 4762558 4762560 3.96E-08 0.00364724 -26.731667
pH NC_047565.1 44578741 44578743 1.23E-07 0.00671761 -26.789645
pH NC_047567.1 9113140 9113142 9.18E-08 0.00610252 50.1223071
pH NC_047567.1 9113140 9113142 9.18E-08 0.00610252 50.1223071

I wasn’t expecting many DML from this analysis because of the covariate information, but I’m surprised I don’t have at least a hundred for one of the contrasts. It’s also interesting that all of the DML are found in one chromosome. I should look into that further when I have my final DML list.

Changing min.per.group within unite

Steven suggested I change the min.per.group parameter in the unite command to allow for loci without data in all samples to be included in the analysis. I started by allowing a CpG to be present in 9/12 samples per treatment (75%):

methylationInformationFilteredCov5 <- methylKit::unite(processedFilteredFilesCov5, destrand = FALSE, min.per.group = 9L) #Combine all processed files into a single table. Use destrand = TRUE to not destrand. Based with data in at least 9/12 samples peer treatment will be included 

With this change, I had 4,557,452 CpG loci with data in at least 9 samples per treatment. I got more DML this time, but still less than 100! Looking at my reference script from another paper, it seemed like they tried out several min.per.group options. For reference, I have between 5,173,369 and 8,057,460 CpG loci with data with at least 5x coverage, depending on the sample. Since I’m getting 4,557,452 after unite with min.per.group = 9L, I’m not sure if it’s worth doing something similar to 1) count the number of CpG loci with data and 2) count the number of DML obtained with multiple settings. However, I decided to try min.per.group = 8L, which gave me 5,103,729 CpGs.

Table 3. Number of DML obtained from various methylation difference thresholds using min.per.group = 9L or min.per.group = 8L.

Contrast min.per.group 25% 50% 75%
Ploidy 9 19 0 0
Ploidy 8 29 1 0
pH 9 32 2 0
pH 8 42 3 0

Based on Table 3, I think I need to use a 25% threshold for calling DML. Again, I’m pretty sure my use of covariates is affecting the P– and q-values I get from methylKit. I think I’ll also use the min.per.group = 8L output, since I get marginally more DML.

Redoing comparative analysis with min.per.group = 8L

My previous comparative analysis used the default settings: CpGs must have data in all samples to be included in downstream analysis. I re-did the comparative analysis with the min.per.group = 8L parameter.


Figure 1. Correlation plot


Figure 2. Clustering diagram


Figure 3. PCA

The PCA and clustering diagram don’t look very different, but the correlations between samples are lower 0.01. Overall, nothing that rocks the boat.

Going forward

  1. Test-run DSS and ramwas
  2. Look at DML in IGV
  3. Try EpiDiverse/snp for SNP extraction from WGBS data
  4. Run methylKit randomization test on mox
  5. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  6. Characterize the general methylation landscape
  7. Identify genomic locations of DML
  8. Transfer scripts used to a nextflow workflow
  9. Update methods
  10. Update results

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3oiFueK