WGBS Analysis Part 22

Running R scripts on mox (for real this time)

So, clearly things aren’t going well. I tried running an R script on mox, but landed in a seemingly endless loop of installing dependencies, running the script, having it fail, and trying to install yet another dependency. My theory was that the mechanism of loading R packages in a SLURM script is different than loading packages in an R module. Time to test it out.

Sanity check with the build node

Since I was able to load packages in the build node, I thought I would see if I could run part of my code interactively as a sanity check. First, I opened a build node for four hours:

srun -p build --time=4:00:00 --mem=10G --pty /bin/bash
module load r_3.6.0 #Load R module
R #Open R

Then, I loaded the devtools, methylKit, and dplyr packages, confirmed the packages were loaded, and set my working directory to the folder with my bismark output:

require(devtools)
require(methylKit)
require(dplyr)
sessionInfo() #Confirm packages are loaded

Screen Shot 2021-04-26 at 11 05 42 AM

getwd() #Confirm I am in my home directory
setwd("/gscratch/scrubbed/yaaminiv/Manchester/analyses/methylKit") #Change directory to where bismark output is

I then started running code to confirm that methylKit and dplyr R commands would work as long as the packages were loaded. I was able to quickly read files into R with methRead:

Screen Shot 2021-04-26 at 11 10 09 AM

I then successfully ran the following code chunk to process bismark alignments and normalize coverage between samples!

processedFilteredFilesCov5 <- methylKit::filterByCoverage(processedFiles, lo.count = 5, lo.perc = NULL, high.count = NULL, high.perc = 99.9) %>% methylKit::normalizeCoverage(.) 

Screen Shot 2021-04-26 at 11 16 03 AM

At this point, I saved my R data and knew that as long as I could reference my packages correctly, I could run my code.

Calling an R Script in a SLURM script

Clearly calling an R module worked better than changing the shebang and running an R script directly on mox! I wanted to try another method: calling an R script within a SLURM script. First, I needed to put my R code in a separate script. I copied and pasted my code and created this R script. Based on the hyak documentation, I needed to create a SLURM script to call the R script. For this SLURM script, I used a 10 day walltime and 100 G memory node. Hopefully I won’t need more than that! Within the script, I needed two lines of code:

module load r_3.6.0 #Load R version 3.6.0
Rscript /gscratch/home/yaaminiv/06-methylKit.R > output.txt 2>&1 #Run my R script (/gscratch/home/yaaminiv/06-methylKit.R) and send both standard output and standard error to output.txt
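For context, here is a minimal sketch of what a complete SLURM script built around those two lines could look like. It writes the script to a file with a heredoc so the contents can be checked before submitting. The helper name write_slurm_script, the job name, the account, the partition, and the file name methylkit.slurm are assumptions for illustration and are not taken from the actual script. The walltime, memory, module, and R script path are the ones described above.

```shell
#!/bin/bash
# Sketch: generate a SLURM script that loads the R module and calls an R script.
# ASSUMPTIONS (hypothetical, not from the original script): the helper name,
# the job name, the account, and the partition.

write_slurm_script() {
  # $1 = path of the SLURM script to create
  # $2 = path of the R script that the job should run
  cat > "$1" <<EOF
#!/bin/bash
#SBATCH --job-name=methylKit
#SBATCH --account=srlab
#SBATCH --partition=srlab
#SBATCH --nodes=1
#SBATCH --time=10-00:00:00
#SBATCH --mem=100G

module load r_3.6.0
Rscript "$2" > output.txt 2>&1
EOF
}

# Example usage (left commented out so the sketch has no side effects):
# write_slurm_script methylkit.slurm /gscratch/home/yaaminiv/06-methylKit.R
# sbatch methylkit.slurm
```

The time value uses SLURM's days-hours:minutes:seconds format, so 10-00:00:00 requests ten days.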

I then ran my SLURM script. It ended after 20 minutes (so…a bit longer than the 18 minute run I was used to previously!) due to more package struggles. Thankfully they were different package problems than before! I was unable to load devtools or dplyr because R could not find the correct versions of dependency packages. However, methylKit loaded with no issues:

Screen Shot 2021-04-26 at 1 09 12 PM

Screen Shot 2021-04-26 at 1 09 29 PM

Screen Shot 2021-04-26 at 1 09 39 PM

I finagled with how I loaded packages in my R script and decided to run require(devtools) with no lib.loc argument. When I would load packages in the build node, I never specified where packages were found and did not encounter any error. I also tried require(tidyverse, lib.loc = "/gscratch/srlab/rpackages") to see if loading tidyverse would bypass any issues I had loading dplyr. I was able to load devtools with no problems, but still ran into issues with dplyr and tidyverse!

Screen Shot 2021-04-26 at 1 17 01 PM

Screen Shot 2021-04-26 at 1 17 12 PM

Screen Shot 2021-04-26 at 1 23 45 PM

Since devtools loaded without specifying a library location, I figured I could do the same for dplyr. The final configuration of loading R packages that worked went as follows:

require(devtools) #Load devtools
require(methylKit, lib.loc = "/gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/") #Load methylKit. It loaded with no issues when I included the library location, so I didn't change it
require(dplyr) #Load dplyr
sessionInfo() #Confirm packages are loaded
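One untested alternative to passing lib.loc in individual require() calls is to put both library folders on R's search path before R starts, using the R_LIBS environment variable (a colon-separated list of library directories). The sketch below only builds and exports that variable. The helper name join_r_libs is hypothetical, and whether this would have avoided the dependency errors on mox is an assumption, not something confirmed here.

```shell
#!/bin/bash
# Sketch: expose two R library folders through R_LIBS before calling Rscript.
# join_r_libs is a hypothetical helper, not part of R or SLURM.

join_r_libs() {
  # Join all arguments with ':' (the separator R expects in R_LIBS)
  local IFS=':'
  echo "$*"
}

# The two library locations mentioned in this post
R_LIBS="$(join_r_libs \
  /gscratch/srlab/rpackages \
  /gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6)"
export R_LIBS

# In a SLURM script, these lines would come before the Rscript call, e.g.:
# module load r_3.6.0
# Rscript /gscratch/home/yaaminiv/06-methylKit.R > output.txt 2>&1
```

With R_LIBS set this way, plain require(dplyr) or require(methylKit) calls would search both folders, which may or may not change which dependency versions R picks up.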

At this point, my script truly ran…for 30 minutes! I didn’t properly reference my covariate matrix in my calculateDiffMeth command, but once I did that the command ran without any issues (so far). Guess we’ll wait and see if I can indeed run calculateDiffMeth with a covariate matrix and overdispersion correction on mox!

Screen Shot 2021-04-26 at 4 19 39 PM

Screen Shot 2021-04-26 at 4 19 58 PM

Going forward

  1. Write methods
  2. Write results
  3. Update mox handbook with R information
  4. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  5. Identify genomic location of DML
  6. Determine if RNA should be extracted
  7. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/2PnTIhk
via IFTTT

WGBS Analysis Part 21

Running R scripts on mox

Alright, I have R installed, which is maybe a moot point since I couldn’t get methylKit installed. Let’s see if I can actually run an R SLURM script today.

Installing packages (round 2)

To install methylKit, I decided to use an older version of R. I first loaded the module:

module load r_3.6.0 #Load R version 3.6.0
R #Start running R

Once I had the older R version, I was able to install devtools:

install.packages("devtools", lib = "/gscratch/srlab/rpackages") #Install devtools to the specified folder
require(devtools) #Load devtools

My next step was installing Bioconductor. I followed the installation instructions from the Bioconductor website:

install.packages("BiocManager", lib = "/gscratch/srlab/rpackages") #Install BiocManager to the specified folder
BiocManager::install(version = "3.10") #Install the Bioconductor release that matches the R version used

Turns out each Bioconductor release is tied to a specific R version! I used this Bioconductor release guide to determine which release I needed to install. Since I was using R 3.6.0, I could use Bioconductor 3.9 or 3.10. I figured I’d use 3.10.

Finally, I installed methylKit:

BiocManager::install("methylKit") #Install methylKit 

The package started installing! However, I got a warning that I was using too much of the CPU. That’s when I realized I wasn’t on a build node! I stopped the package installation, quit R, and interrupted my mox session. I then started a build node:

srun -p build --time=4:00:00 --mem=10G --pty /bin/bash #Request a build node for four hours 

I loaded the R module again, then installed methylKit:

require(BiocManager) #Load package
BiocManager::install("methylKit") #Install methylKit
require(methylKit) #Load package

It worked! The last package I needed (and almost forgot about) was dplyr. I ran require(dplyr) just to see what happened:

Screen Shot 2021-04-21 at 10 34 00 AM

The package was already installed! I closed the Terminal window, logged in and requested another build node, and ran require(methylKit) to ensure I wouldn’t have to install the package again in my SLURM script:

Screen Shot 2021-04-21 at 10 36 27 AM

Since that worked too, I tried running sessionInfo(). Hopefully this information would be saved into my slurm-out file.

Screen Shot 2021-04-21 at 10 37 28 AM

I exited R and my build node to finish up my preparation.

File paths on mox

When working in RStudio, it’s a lot easier for me to save files to various places, or source the data from a different folder, since I can set the working directory in a chunk. For the purpose of the R SLURM script, I think it’s easier to have all the data and output files in the same folder. I created a /gscratch/scrubbed/yaaminiv/Manchester/analyses/methylKit folder to house all relevant files. Then, I navigated to that folder and copied the merged CpG coverage files from gannet to mox:

rsync --archive --progress --verbose yaamini@172.25.149.226:/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/*merged_CpG_evidence.cov . 

The next thing I wanted to do was create a subdirectory structure that mirrored where I saved output files in this R Markdown script. I usually do this within the script itself since I can switch between bash and R, but I will not be able to do that in a SLURM script. I created:

  • /gscratch/scrubbed/yaaminiv/Manchester/analyses/methylKit/general-stats for individual-sample and comparative analysis plots
  • /gscratch/scrubbed/yaaminiv/Manchester/analyses/methylKit/DML for DML lists
  • /gscratch/scrubbed/yaaminiv/Manchester/analyses/methylKit/rand-test for randomization test output
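Since the SLURM script can't create these folders partway through the R code, they need to exist before the job starts. A minimal sketch of creating all three in one go is below; the helper name make_output_dirs is hypothetical, and the call is left commented out so nothing is created unless it is run on purpose.

```shell
#!/bin/bash
# Sketch: create the output subdirectories the R script expects.
# make_output_dirs is a hypothetical helper name.

make_output_dirs() {
  # $1 = analysis folder that holds the coverage files
  for sub in general-stats DML rand-test; do
    # -p creates missing parents and does not fail if the folder already exists
    mkdir -p "$1/$sub"
  done
}

# Example usage on mox:
# make_output_dirs /gscratch/scrubbed/yaaminiv/Manchester/analyses/methylKit
```

Because mkdir -p leaves existing folders alone, the helper can be run again without touching files already inside them.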

Running the R SLURM script

All that was left to do was create the SLURM script! I copied my R Markdown script into this SLURM script. Then, I ran the script. When I checked the queue (squeue | grep "srlab"), I found that my script wasn’t running! When I looked at the SLURM information at the top of the script, I saw SBATCH --mem=500G. I changed it to SBATCH --mem=100G, and ran the script again. Unfortunately, it timed out immediately!

When I looked at the slurm.out file, I saw the following error:

Screen Shot 2021-04-21 at 9 27 29 PM

I then posted in this discussion to see where I should specify --save, --no-save, or --vanilla. Sam responded and said my shebang should be #!/gscratch/srlab/programs/R-3.6.2/bin/Rscript, and not #!/gscratch/srlab/programs/R-3.6.2/bin/R! I changed the shebang and ran the script again.

Obviously, my script timed out again. Looking through the slurm.out, I confirmed a few things. First, any head() command does print to the slurm.out. Second, I got an error that dplyr was not available when I ran require(dplyr). Third, there were some packages attached to methylKit that didn’t load. I opened another build node to install dplyr:

install.packages("dplyr", lib = "/gscratch/srlab/rpackages") #Install dplyr
require(BiocManager) #Load BiocManager
devtools::install_github("al2na/methylKit", build_vignettes = FALSE, repos = BiocManager::repositories(), dependencies = TRUE) #Install methylKit from GitHub along with its dependencies
require(methylKit) #Check that all associated packages load

I then modified the script to load several packages at the top:

# Load packages
require(devtools)
require(BiocManager)
require(methylKit)
require(dplyr)
sessionInfo()

Screen Shot 2021-04-22 at 9 43 51 AM

Screen Shot 2021-04-22 at 9 42 41 AM

Once I ran this revised script, I ran into the same error! Based on the error messages, I think R was unable to find my specified packages.

I know I installed these packages, so I think they’re not being loaded from their actual location. BiocManager, devtools, and dplyr are in the /gscratch/srlab/rpackages/ directory:

Screen Shot 2021-04-22 at 9 47 59 AM

methylKit is installed in /gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/:

Screen Shot 2021-04-22 at 9 50 32 AM

I posted this discussion to see if there was a way to reference library locations in require(). Why I posted this discussion before actually Googling I don’t know, but Sam and I arrived at the same conclusion: include lib.loc in require to specify the library location. This is important especially because I have packages installed in two separate locations! I modified my script and ran it again and encountered a new error:

Screen Shot 2021-04-23 at 10 51 03 AM

Screen Shot 2021-04-23 at 10 51 17 AM

Interestingly, when I loaded packages in the SLURM script, R was unable to find dependencies, even when they were installed (like usethis). I confirmed that these errors were precluding me from loading packages by running sessionInfo:

Screen Shot 2021-04-23 at 10 58 34 AM

This began a series of installing packages, running my R script, and finding out I needed to explicitly install another dependency:

Screen Shot 2021-04-23 at 11 31 20 AM

Screen Shot 2021-04-23 at 11 32 12 AM

Screen Shot 2021-04-26 at 9 30 36 AM

Screen Shot 2021-04-26 at 9 30 59 AM
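One way to shorten an install, run, fail cycle like this is to pull every missing package name out of the log in a single pass, rather than reading the errors one at a time. The sketch below assumes the log contains R's standard "there is no package called" message. The helper name list_missing_packages is hypothetical, and it only finds packages R actually reported before the script stopped.

```shell
#!/bin/bash
# Sketch: list the packages R reported as missing in a log file.
# list_missing_packages is a hypothetical helper name.

list_missing_packages() {
  # $1 = log file (for example, slurm.out or output.txt)
  # Keep only the package name that follows R's "there is no package called"
  # message, skipping whatever quote characters surround it, then de-duplicate.
  sed -n 's/.*there is no package called [^[:alnum:]]*\([[:alnum:]._]*\).*/\1/p' "$1" | sort -u
}

# Example usage:
# list_missing_packages output.txt
```

The resulting names could then be installed together on a build node before resubmitting the job.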

…so this is where I quit for now.

Going forward

  1. Try different methods to run R script on mox
  2. Write methods
  3. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  4. Write results
  5. Identify genomic location of DML
  6. Determine if RNA should be extracted
  7. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/3gGo8GX
via IFTTT

TWIP 12 – Majorly Zoomed

This week we “visualize” the state of the larval oyster proteome.