Figuring Out WGCNA, Part 2

Recap Alright, so last time, I hit a wall with memory allocation when attempting to analyze ambient-temperature crabs. At first, I tried to get around this issue by just heading into UW and using the computers there. That partially solved the issue, but not entirely. At one stage in the pipeline, you set a bar that determines what genes should be discarded – the original recommendation is genes where counts are below 10 in 90% of samples, but since I had only 9 samples, just changed it to counts below 10 in all samples. Problem: that crashes the lab computers entirely, and even upping the bar to 15 takes practically an entire day to run. So as an alternative, I started using RStudio on Mox, following the guidelines figured out here. That partially worked, but it still takes a really long time to run 10 samples, so I decided to…

from Aidan F. Coyle https://ift.tt/3nBGDh3
via IFTTT

Singularity – RStudio Server Container on Mox

Aidan recently needed to use R on a machine with more memory. Additionally, it would be ideal if he could use RStudio. So, I managed to figure out how to set up a Singularity container running rocker/rstudio.

For those of you not interested in the process, just jump over to our handbook guide on how to use the container to run RStudio Server on Mox (or, build your own).

Now, for the lengthy, and probably unneccessary, parts… Sometimes it’s useful to document the pain.

The UW high-perfomance computing (HPC) cluster, Hyak, used to have an RStudio Server module installed, as well as instructions on how to get it going. However, that module no longer exists, so wasn’t usable. I hit up UW IT and they directed me to this Rocker SLURM job example.. After a few different attempts to get this work, I finally found someone else who’s had the same issue!

I created a functional SLURM script modified from this GitHub Issue and it totally worked! Except, due to the fact that a SLURM script is run on a execute/compute node on Mox, there’s no internet access; which means no installing new packages via RStudio Server when it’s running on Mox. Although I haven’t documented it yet, I was unable to get this to work on build node on Mox (which does have internet access) because parts of the process require the use of sudo which isn’t available on Mox.

So, I turned to installing Singularity on my own computer and building/updating the container locally. Singularity installation didn’t go as smoothly as I would’ve hoped, but I eventually found this Singularity installation guide (GitHub Issue) which took care of any issues I had encountered.

To retrieve the container from Rocker and make it accessible, I needed to build it as a sandbox. The command below also specifies to pull Rstudio with R v4.0.2.

singularity build --sandbox rstudio-4.0.2.sandbox.simg docker://rocker/rstudio:4.0.2

From here, I learned how to get into the container, which would potentially allow me to update/install system and R packages, but you need root access inside the container. To get that, I ran:

sudo singularity shell rstudio-4.0.2.sandbox.simg/

screenshot showing differences in singularity with/without sudo

I also quickly learned that the container needs to be writable:

screenshot of not writable

screenshot of writable

After making some feable attempt to install some R packages, I encountered a number of errors indicating missing system packages. Those were installed via the shell in the container:

Install libbz2:

apt install libbz2-dev

Install liblzma:

apt install liblzma-dev

Install libxml2

apt install libxml2

Install libz-dev

apt install libz-dev

Install libxtst6 to allow R Markdown image rendering:

apt install libxtst6

After resolving those issues, I could start R and install stuff:

BioConductor:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(version = "3.12") 

DESeq2:

BiocManager::install("DESeq2") 

Errors:

first error

second error

MatrixGenerics:

BiocManager::install("MatrixGenerics") 

Error:

matrixStats version too old

matrixStats:

install.packages("https://cran.rstudio.com/src/contrib/matrixStats_0.58.0.tar.gz", repos=NULL, type="source") 

Methylkit:

BiocManager::install("methylKit") 

WGCNA:

BiocManager::install("WGCNA") 

tidyverse:

install.packages("tidyverse") 

After getting all the desired packages installed, I needed to actually build the container image. The image is a single file, whereas the sandbox representation is actually a way to interact with all the system directories.

Build container (from StackOverFlow):

sudo singularity build rstudio-4.0.2.sjw-01 rstudio-4.0.2.sandbox.simg/

This container image was then rsync’d to the following Mox location for everyone to access:

/gscratch/srlab/programs/singularity_containers

This will suffice for now to allow people to play around/test/use it for some things. However, there are other things I’d like to work on:

  • Ideally, get this set up to be able to build containers on a Mox build node.
  • Get Singularity installed on one of the birds (e.g. Emu, Roadrunner) so that people can build their own containers. Singularity is only avialbe for Linux, thus most other lab members won’t be able to use their own computers to build containers; they’ll need one of the birds which run Ubuntu.

from Sam’s Notebook https://ift.tt/3gTaJeH
via IFTTT

WGBS Analysis Part 23

methylKit with R Studio server on mox

TL;DR I can run calculateDiffMeth and get DML! Please enjoy this saga of troubleshooting and not reading errors properly.

Alas, after running for five hours my R SLURM script failed cries. When I looked at the SLURM output, I saw the following error:

Screen Shot 2021-04-27 at 9 33 51 AM

calculateDiffMeth was giving me problems again! I posted this discussion to get input from Sam and Steven about what to do next.

Revisiting the build node

In the meantime, I decided to revisit the build node. I requested a build node and loaded R:

srun -p build --time=4:00:00 --mem=10G --pty /bin/bash #Request a build node for four hours module load r_3.6.0 #Load R version 3.6.0 R #Start R 

I loaded my R data and ran calculateDiffMeth without a covariate matrix or overdispersion correction. It finished running and I ran warnings(). They were all glm.fit errors, which I’ve encountered before. I was able to produce a methylDiff object, so that’s a good sign!

Screen Shot 2021-04-27 at 1 00 47 PM

Screen Shot 2021-04-27 at 1 01 23 PM

I then tried running calculateDiffMeth with the covariate matrix and overdispersion correction. Around the four hour mark, my command timed out. I initially thought it was related to calculateDiffMeth, but I then I realized it was probably because the four hour reservation ended! In any case, it was ready for me to move onto a different option: R Studio server.

Troubleshooting R Studio server

When I started running calculateDiffMeth on the build node, I also started working through the R Studio server Sam set up on mox. I followed the instructions he laid out in this discussion. To run R Studio server, I needed to change my R library directory information in ~/.Renviron, then run a SLURM script to get R Studio login credentials. I could then tunnel into mox in a separate Terminal window to run R Studio.

When I went to change my ~/.Renviron library location, I realized I didn’t have a ~/.Renviron file to begin with! Sam said I should make one, so I did.

Screen Shot 2021-04-28 at 1 28 24 PM

I created this script to get R Studio log in credentials. After running it, I was unable to tunnel into mox and access the R Studio server! I posted this discussion with the specific “connection refused” error I got:

Screen Shot 2021-04-27 at 1 02 18 PM

Turns out I misinterpreted Sam’s original script, and didn’t include any of the important things that started the R Studio server session! I added the script sections I needed to, then ran it again and was able to access R Studio server. I found my R Script and started running calculateDiffMeth. According to Sam, the session should still run even if I close the window.

Later at night, I decided to test that theory. When I closed the R Studio server browser on my local machine, I got an error in my Terminal window. I logged back into R Studio server on genefish, I saw the following error:

Screen Shot 2021-04-27 at 10 34 46 PM

It seemed like calculateDiffMeth finished running, but there was an “error writing to connection” that I didn’t understand. I set up calculateDiffMeth to run through the night. I saw that I was logged out after 60 minutes due to inactivity. When I logged back in, I saw a slew of error messages:

Screen Shot 2021-04-28 at 11 26 07 AM

  1. There were 50 or more warnings (use warnings() to see the first 50): Checked warnings(), they’re glm.fit errors (not related to the issue at hand). At least I know calculateDiffMeth ran!

Screen Shot 2021-04-28 at 11 26 30 AM

  1. Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'head': object 'differentialMethylationStatsTreatment' not found: My fault, didn’t reference the correct data frame. When I referenced the correct calculateDiffMeth output, the command worked!
  2. Same error writing to connection error! However, it my calculateDiffMeth command DID finished running, and my global environment was loaded again when I logged back into the R Studio server.

Sam investigated it further and saw that my home directory may be full, leading to errors saving any R Studio output, like ~/.rstudio or ~/.Rdata in that directory. I was saving my R data in /gscratch/scrubbed/yaaminiv/Manchester/analyses/methylKit, but it’s possible that the large methylKit objects were causing resulting in a ~/.rstudio file. Interestingly, my global environment was loading each time I logged into the R Studio server. I checked my login node quota and ruled out that it was a storage issue on my end! Since the R Studio server was still working for me and I was able to save output to my scrubbed directory, I kept going.

The bottom line

The force termination was likely because the build node timed out because I only requested one for four hours, and the head command error was because I was incorrectly calling the dataframe. Thanks to R Studio server, I could confirm that this was the case and ensure that the warnings were not devastating! I updated my “connection refused” discussion with explanations of my error messages, and posted an update in my original discussion about calculateDiffMeth.

But the best part was that I ran getMethylDiff AND I HAVE DML! I’ll post about that in a separate notebook entry.

So really, it was a typo all along.

Going forward

  1. Finish running methylKit on R Studio server
  2. Write methods
  3. Write results
  4. Update mox handbook with R information
  5. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  6. Identify genomic location of DML
  7. Determine if RNA should be extracted
  8. Determine if larval DNA/RNA should be extracted

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3t7yBhm
via IFTTT

WGBS Analysis Part 22

Running R scripts on mox (for real this time)

So, clearly things aren’t going well. I tried running an R script on mox, but landed in a seemingly endless loop of installing dependencies, running the script, having it fail, and trying to install yet another dependency. My theory was that the mechanism of loading R packages in a SLURM script is different than loading packages in an R module. Time to test it out.

Sanity check with the build node

Since I was able to load packages in the build node, I thought I would see if I could run part of my code interactively as a sanity check. First, I opened a build node for four hours:

srun -p build --time=4:00:00 --mem=10G --pty /bin/bash module load r_3.6.0 #Load R module R #Open R 

Then, I loaded the devtools, methylKit, and dplyr packges, confirmed packages were loaded, and set my working directory to folder with my bismark output:

require(devtools) require(methylKit) require(dplyr) sessionInfo() #Confirm packages are loaded 

Screen Shot 2021-04-26 at 11 05 42 AM

getwd() #Confirm I am in my home directory setwd("/gscratch/scrubbed/yaaminiv/Manchester/analyses/methylKit") #Change directory to where bismark output is 

I then started running code to confirm that methylKit and dplyr R commands would work as long as the packages were loaded. I was able to quickly read files into R with methRead:

Screen Shot 2021-04-26 at 11 10 09 AM

I then successfully ran the following code chunk to process bismark alignments and normalize coverage between samples!

processedFilteredFilesCov5 <- methylKit::filterByCoverage(processedFiles, lo.count = 5, lo.perc = NULL, high.count = NULL, high.perc = 99.9) %>% methylKit::normalizeCoverage(.) 

Screen Shot 2021-04-26 at 11 16 03 AM

At this point, I saved my R data and knew that as long as I could reference my packages correctly, I could run my code.

Calling an R Script in a SLURM script

Clearly calling an R module worked better than changing the shebang and running an R script directly on mox! I wanted to try another method: calling an R script within a SLURM script. First, I needed to put my R code in a separate script. I copied and pasted my code and created this R script. Based on the hyak documentation, I needed to create a SLURM script to call the R script. For this SLURM script, I used a 10 day walltime and 100 G memory node. Hopefully I won’t need more than that! Within the script, I needed two lines of code:

module load r_3.6.0 #Load R version 3.6.0 Rscript > output.txt 2>&1 /gscratch/home/yaaminiv/06-methylKit.R #Specify my standard error file (output.txt) and R script location (/gscratch/home/yaaminiv/06-methylKit.R) 

I then ran my SLURM script. It ended after 20 minutes (so…a bit longer than the 18 minute run I was used to previously!) due to more package struggles. Thankfully they were different package problems than before! I was unable to load devtools or dplyr because R could not find the correct versions of dependency packages. However, methylKit loaded with no issues:

Screen Shot 2021-04-26 at 1 09 12 PM

Screen Shot 2021-04-26 at 1 09 29 PM

Screen Shot 2021-04-26 at 1 09 39 PM

I finagled with how I loaded packages in my R script and decided to run require(devtools) with no lib.loc argument. When I would load packages in the build node, I never specified where packages were found and did not encounter any error. I also tried require(tidyverse, lib.loc = "/gscratch/srlab/rpackages") to see if loading tidyverse would bypass any issues I had loading dplyr. I was able to load devtools with no problems, but still ran into issues with dplyr and tidyverse!

Screen Shot 2021-04-26 at 1 17 01 PM

Screen Shot 2021-04-26 at 1 17 12 PM

Screen Shot 2021-04-26 at 1 23 45 PM

Since devtools loaded without specifying a library location, I figured I could do the same for dplyr. The final configuration of loading R packages that worked went as follows:

require(devtools) #Load devtools require(methylKit, lib.loc = "/gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/") #Load methylKit. I was able to load with no issues including library location, so I didn't change it require(dplyr) #Load devtools sessionInfo() #Confirm packages are loaded 

At this point, my script truly ran…for 30 minutes! I didn’t properly reference my covariate matrix in my calculateDiffMeth command, but once I did that the command ran without any issues (so far). Guess we’ll wait and see if I can indeed run calculateDiffMeth with a covariate matrix and overdispersion correction on mox!

Screen Shot 2021-04-26 at 4 19 39 PM

Screen Shot 2021-04-26 at 4 19 58 PM

Going forward

  1. Write methods
  2. Write results
  3. Update mox handbook with R information
  4. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  5. Identify genomic location of DML
  6. Determine if RNA should be extracted
  7. Determine if larval DNA/RNA should be extracted

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/2PnTIhk
via IFTTT

WGBS Analysis Part 21

Running R scripts on mox

Alright, I have R installed, which is maybe a moot point but I couldn’t get methylKit installed. Let’s see if I can actually run an R SLURM script today.

Installing packages (round 2)

To install methylKit, I decided to use an older version of R. I first loaded the module:

module load r_3.6.0 #Load R version 3.6.0 R #Start running R 

Once I had the older R version, I was able to run install devtools!:

 install.packages("devtools", lib = "/gscratch/srlab/rpackages") #Install devtools to the specified folder require(devtools) #Load devtools 

My next step was installing Bioconductor. I followed the installation instructions from the Bioconductor website:

install.packages("BiocManager", lib = "/gscratch/srlab/rpackages") #Install BiocManager to the specified folder BiocManager::install(version = "3.10") #Install the correct version of BiocManager for the R version used 

Turns out there are specific BiocManager versions for each R version! I used this Bioconductor release guide to determine which BiocManager version I needed to install. Since I was using R.3.6.0, I could use BiocManager versions 3.9 or 3.10. I figured I’d use 3.10.

Finally, I installed methylKit:

BiocManager::install("methylKit") #Install methylKit 

The package started installing! However, I got a warning that I was using too much of the CPU. That’s when I realized I wasn’t on a build node! I stopped the package installation, quit R, and interrupted my mox session. I then started a build node:

srun -p build --time=4:00:00 --mem=10G --pty /bin/bash #Request a build node for four hours 

I loaded the R module again, then installed methylKit:

require(BiocManager) #Load package BiocManager::install("methylKit") #Install methylKit require(methylKit) #Load package 

It worked! The last package I needed (and almost forgot about) was dplyr. I ran require(dplyr) just to see what happened:

Screen Shot 2021-04-21 at 10 34 00 AM

The package was already installed! I closed the Terminal window, logged in and requested another build node, and ran require(methylKit) to ensure I wouldn’t have to install the package again in my SLURM script:

Screen Shot 2021-04-21 at 10 36 27 AM

Since that worked too, I tried running sessionInfo(). Hopefully this information would be saved into my slurm-out file.

Screen Shot 2021-04-21 at 10 37 28 AM

I exited R and my build node to finish up my preparation.

File paths on mox

When working in R Studio, it’s a lot easier for me to save files to various places, or source the data from a different folder since I can set the working directory in a chunk. For the purpose of the R SLURM script, I think it’s easier to have all the data and output files in the same folder. I created a /gscratch/scrubbed/yaaminiv/Manchester/analyses/methylKit folder to house all relevant files. Then, I navigated to that folder and copied the merged CpG coverage files from gannet to mox:

rsync --archive --progress --verbose yaamini@172.25.149.226:/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/*merged_CpG_evidence.cov . 

The next thing I wanted to do was create a subdirectory structure that mirrored where I saved output files in this R Markdown script. I usually do this within the script itself since I can switch between bash and R, but I will not be able to do that in a SLURM script. I created:

  • /gscratch/scrubbed/yamainiv/Manchester/analyses/methylKit/general-stats for individual-sample and comparative analysis plots
  • /gscratch/scrubbed/yamainiv/Manchester/analyses/methylKit/DML for DML lists
  • /gscratch/scrubbed/yamainiv/Manchester/analyses/methylKit/rand-test for randomization test output

Running the R SLURM script

All that’s left to do was create the SLURM script! I copied my R Markdown script into this SLURM script. Then, I ran the script. When I checked the queue (squeue | grep "srlab"), I found that my script wasn’t running! When I looked at the SLURM information at the top of the script, I saw SBATCH --mem=500G. I changed it to SBATCH --mem=100G, and ran the script again. Unfortunately, it timed out immediately!

When I looked at the slurm.out file, I saw the following error:

Screen Shot 2021-04-21 at 9 27 29 PM

I then posted in this discussion to see where I should specify --save, --no-save, or --vanilla. Sam responded and said my shebang should be #!/gscratch/srlab/programs/R-3.6.2/bin/Rscript, and not #!/gscratch/srlab/programs/R-3.6.2/bin/R! I changed the shebang and ran the script again.

Obviously, my script timed out again. Looking through the slurm.out, I confirmed a few things. One, any head() command does print to the slurm.out. Second, I got an error that dplyr was not available when I ran require(dplyr). Additionally, there were some packages attached to methylKit that didn’t load. I opened another build node to install dplyr:

install.packages("dplyr", lib = "/gscratch/srlab/rpackages") #Install dplyr require(BiocManager) #Load BiocManager install_github("al2na/methylKit", build_vignettes = FALSE, repos = BiocManager::repositories(), dependencies = TRUE) #Install more methylKit options require(methylKit) #Check that all associated packages load 

I then modified the script to load several packages at the top:

# Load packages require(devtools) require(BiocManager) require(methylKit) require(dplyr) sessionInfo() 

Screen Shot 2021-04-22 at 9 43 51 AM

Screen Shot 2021-04-22 at 9 42 41 AM

Once I ran this revised script, I ran into the same error! Based on the error messages, I think R was unable to find my specified packages. Screen Shot 2021-04-22 at 9 43 51 AM

Screen Shot 2021-04-22 at 9 42 41 AM

I know I installed these packages, so I think they’re not being installed from their actual location. BiocManager, devtools, and dplyr are in the /gscratch/srlab/rpackages/ directory:

Screen Shot 2021-04-22 at 9 47 59 AM

methylKit is installed in /gscratch/home/yaaminiv/R/x86_64-pc-linux-gnu-library/3.6/:

Screen Shot 2021-04-22 at 9 50 32 AM

I posted this discussion to see if there was a way to reference library locations in require(). Why I posted this discussion before actually Googling I don’t know, but Sam and I arrived at the same conclusion: include lib.loc in require to specify the library location. This is important especially because I have packages installed in two separate locations! I modified my script and ran it again and encountered a new error:

Screen Shot 2021-04-23 at 10 51 03 AM

Screen Shot 2021-04-23 at 10 51 17 AM

Interestingly, when I loaded packages in the SLURM script, R was unable to find dependencies, even when they were installed (like usethis). I confirmed that these errors were precluding me from loading packages by running sessionInfo:

Screen Shot 2021-04-23 at 10 58 34 AM

This began a series of installing packages, running my R script, and finding out I needed to explicitly install another dependency:

Screen Shot 2021-04-23 at 11 31 20 AM

Screen Shot 2021-04-23 at 11 32 12 AM

Screen Shot 2021-04-26 at 9 30 36 AM

Screen Shot 2021-04-26 at 9 30 59 AM

…so this is where I quit for now.

Going forward

  1. Try different methods to run R script on mox
  2. Write methods
  3. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  4. Write results
  5. Identify genomic location of DML
  6. Determine if RNA should be extracted
  7. Determine if larval DNA/RNA should be extracted

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3gGo8GX
via IFTTT

TWIP 12 – Majorly Zoomed

This week we “visualize” the state of the larval oyster proteome.

Genome Annotation – P.generosa v1.0 Assembly Using DIAMOND BLASTx for BlobToolKit on Mox

To continue towards getting our Panopea generosa (Pacific geoduck) genome assembly (v1.0) analyzed with BlobToolKit, per this GitHub Issue, I’ve decided to run each aspect of the pipeline manually, as I continue to have issues utilizing the automatic pipeline. As such, I’ve run DIAMOND BLASTx according to the BlobToolKit “Getting Started” guide on Mox.

IMPORTANT: This is BLAST’ed against a customized UniProt database, per the BlobToolKit instructions here.. For posterity, here’re the instuctions provided on the website:

mkdir -p uniprot wget -q -O uniprot/reference_proteomes.tar.gz \ ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/$(curl \ -vs ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/ 2>&1 | \ awk '/tar.gz/ {print $9}') cd uniprot tar xf reference_proteomes.tar.gz touch reference_proteomes.fasta.gz find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz echo "accession\taccession.version\ttaxid\tgi" > reference_proteomes.taxid_map zcat */*/*.idmapping.gz | grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> reference_proteomes.taxid_map diamond makedb -p 16 --in reference_proteomes.fasta.gz --taxonmap reference_proteomes.taxid_map --taxonnodes ../taxdump/nodes.dmp -d reference_proteomes.dmnd cd - 

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20210415_pgen_diamond_blastx_Panopea-generosa-v1.0 ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=500G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210415_pgen_diamond_blastx_Panopea-generosa-v1.0 ### DIAMOND BLASTx of Panopea-generosa-v1.0 against customized UniProt database ### for import into BlobToolKit. ### Output is customized for input into BlobToolKit ################################################################################### # These variables need to be set by user # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Programs array declare -A programs_array programs_array=( [diamond]="/gscratch/srlab/programs/diamond-0.9.29/diamond" ) # DIAMOND UniProt database dmnd=/gscratch/srlab/blastdbs/20210401_uniprot_btk/reference_proteomes.dmnd # Genome (FastA) fasta=/gscratch/srlab/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.fa ################################################################################### # Strip leading path and extensions no_path=$(echo "${fasta##*/}") no_ext=$(echo "${no_path%.*}") # Run DIAMOND with blastx # Customized output format for import into BlobToolKit ${programs_array[diamond]} blastx \ --db ${dmnd} \ --query "${fasta}" \ --out "${no_ext}".blastx.btk.outfmt6 \ --outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore \ --sensitive \ --evalue 1e-25 \ --max-target-seqs 1 \ --block-size 15.0 \ --index-chunks 4 # Generate checksums for future reference echo "" echo "Generating checksum for ${fasta}." md5sum "${fasta}">> fastq.checksums.md5 echo "Completed checksum for ${fasta}." echo "" ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "

Genome Annotation – P.generosa v1.0 Assembly Using BLASTn for BlobToolKit on Mox

To continue towards getting our Panopea generosa (Pacific geoduck) genome assembly (v1.0) analyzed with BlobToolKit, per this GitHub Issue, I’ve decided to run each aspect of the pipeline manually, as I continue to have issues utilizing the automatic pipeline. As such, I’ve run BLASTn according to the BlobToolKit “Getting Started” guide on Mox.

SBATCH script (GitHub):


#!/bin/bash ## Job Name #SBATCH --job-name=20210415_pgen_blastn-nt_Panopea-generosa-v1.0 ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210415_pgen_blastn-nt_Panopea-generosa-v1.0 ### BLASTn of P.generosa genome assembly Panopea-generosa-v1.0.fa ### against NCBI nt database. ### In preparation for use in BlobTools2 ################################################################################### # These variables need to be set by user # Set number of CPUs to use threads=40 # Input/output files fasta="/gscratch/srlab/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.fa" blast_db="/gscratch/srlab/blastdbs/20210401_ncbi_nt/nt" # Programs blastn="/gscratch/srlab/programs/ncbi-blast-2.10.1+/bin/blastn" # Programs associative array declare -A programs_array programs_array=( [blastn]="${blastn}" ) ################################################################################### # Exit script if any command fails set -e # Run BLASTn with custom format/settings for use in blobtools2 ${programs_array[blastn]} \ -db ${blast_db} \ -query ${fasta} \ -outfmt "6 qseqid staxids bitscore std" \ -max_target_seqs 10 \ -max_hsps 1 \ -evalue 1e-25 \ -num_threads ${threads} \ -out Panopea-generosa-v1.0_blobtools2_blast.out ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]] \ || [[ "${program}" == "blastn" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "

TWIP 11 – Easter Pistachios

This week we welcome Ariana to the mix and get Olivia booted up.

WGBS Analysis Part 19

Separating female and indeterminate samples

My intended R code works, so now I need to finalize my script and create an analogous methylKit script for mox!

Before creating a mox script, I wanted to separate out the indeterminate and female oyster samples. When looking for covariate metadata, I realized that not all samples are from female oysters! When I picking extracted DNA samples to sequence, I tried to get as many female samples as possible, but due to some issues with having enough material for sequencing, I ended up sequencing one indeterminate sample per treatment. My plan was to use this indeterminate sample to see if any DML I identified were more related to maturation status, and how immature oyster gonad methylation could be impacted. I promptly forgot I had this plan until looking at previous lab notebooks, so thank you past Yaamini.

My first instinct was to try subsetting the methylRawListDB created by methRead like a standard dataframe, but that didn’t work. I returned to the methylKit handbook and found a convenience function to subset objects, select. However, this function doesn’t work for methylRawListDB! A quick search lead my back to reorganize, which is the function I used to set up randomization trials. Looking at the examples for the function, I realized I could subset the methylBase object created by unite, instead of subseting the methylRawListDB, filtering data, and combining! I think the comparative analysis portion (PCA and correlation plot) is useful when all samples are included, and that’s created using the unite methylBase object. If I could subset the female or indeterminate samples from this object, then I can identify sex-specific DML in a less computationally-intensive way.

I successfully used the following code to separate female samples from the methylBase object:

methylationInformationFilteredCov5Fem <- methylKit::reorganize(methylationInformationFilteredCov5, sample.ids = c("1", "3", "4", "6", "7", "8"), treatment = c(rep(1, times = 3), rep(0, times = 3))) #Get female sample information from methylBase object created by unite by specifying sample IDs. Include treatment information as well 

By specifying sample.ids, I could pull out the samples I wanted. I confirmed the function worked by looking at my original object (methylationInformationFilteredCov5) and the subsetted object (methylationInformationFilteredCov5Fem):

Screen Shot 2021-04-11 at 1 24 08 PM

Screen Shot 2021-04-11 at 1 24 15 PM

Figures 1-2. methylBase objects before and after subsetting

Sample 2 in the subsetted object is the same as sample 3 in the original object, so I think it worked! I then cleaned up my R Markdown script and created separate sections for female- and indeterminate-DML identification. I used covariate and overdispersion corrections for the female data, but not for the 1 vs. 1 indeterminate sample comparison. I ran separate comparative analyses on the each data subset just to see what it might look like, but I didn’t pay too much attention to the output since I haven’t updated the data I’m working with to the Roslin alignments (I modified the code, but I haven’t rerun those chunks because they’re time-consuming and I’m going to do it on mox anyways). Knowing that I could subset from the methylBase object and assign treatment information in that subsetting process, I also updated my randomization trial code:

for (i in 1:1000) { #For each iteration pHRandTreat <- sample(sampleMetadataFem$pHTreatment, size = length(sampleMetadataFem$pHTreatment), replace = FALSE) #Randomize female treatment information pHRandDML <- methylKit::reorganize(methylationInformationFilteredCov5Fem, sample.ids = sampleMetadataFem$sampleID, treatment = pHRandTreat) %>% methylKit::calculateDiffMeth(., covariates = covariateMetadataFem, overdispersion = "MN", test = "Chisq") %>% methylKit::getMethylDiff(., difference = 25, qvalue = 0.01) %>% nrow() -> pHRandTest25Fem[i] #Shuffle treatments from methylBase object created by unite. Calculate diffMeth and obtain DML that are 25% and have a q-value of 0.01. Save the number of DML identified } 

Going forward

  1. Write methods
  2. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  3. Run R script on mox
  4. Write results
  5. Identify genomic location of DML
  6. Determine if RNA should be extracted
  7. Determine if larval DNA/RNA should be extracted

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3gh5BAZ
via IFTTT