WGBS Analysis Part 13

Starting bismark with Manchester data

After evaluating trimgalore output, I decided it’s time to start [bismark](https://github.com/FelixKrueger/Bismark)! To start, I created this script. It’s the same as the one I used for the Hawaii data, but I updated the paths to all the files. I also moved the bisulfite converted genome from the Hawaii data folder to the Manchester data folder. I transferred the script to mox with rsync and it started running…

…only to fail two seconds later. Looking at the slurm output, I saw that it was unable to navigate to the bisulfite converted genome:

[Screenshot of the slurm output error]

I navigated to the directory and saw that it was empty! The files must have been on the computer for longer than 30 days before I moved the genome folder from the Hawaii to Manchester subdirectories. I used wget (not curl because mox doesn’t like it) to download the bisulfite converted genome Sam created from this link. I then extracted the files:

tar -xvzf Crassostrea_gigas.oyster_v9.dna_sm.toplevel_bisulfite.tar.gz #Extract (x) and decompress (z) files (f) in a verbose (v) manner 

I navigated into the directory to double check that there was actually a bisulfite converted genome inside /gscratch/scrubbed/yaaminiv/Manchester/data/Crassostrea_gigas.oyster_v9.dna_sm.toplevel/. Then, I ran the script. Since I only have eight samples, I’m hoping it’ll finish processing in a week!
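Since files on `/gscratch/scrubbed` are purged after 30 days, a small guard like this (hypothetical helper; genome path copied from above) could catch an empty genome directory before the job is submitted:

```shell
# Hypothetical guard: confirm the genome directory still exists and is non-empty
# before submitting the bismark job (files on /gscratch/scrubbed are purged after ~30 days).
check_genome_dir() {
    local dir="$1"
    if [ -d "${dir}" ] && [ -n "$(ls -A "${dir}" 2>/dev/null)" ]; then
        echo "OK: ${dir} is present and non-empty"
    else
        echo "MISSING: ${dir} is absent or empty -- re-download the genome" >&2
        return 1
    fi
}
```

On mox, usage would be `check_genome_dir /gscratch/scrubbed/yaaminiv/Manchester/data/Crassostrea_gigas.oyster_v9.dna_sm.toplevel` right before the `sbatch` call.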

Going forward

  1. Update the repository README files
  2. Write methods
  3. Write results
  4. Identify DML
  5. Determine if RNA should be extracted
  6. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/3jJEQVl

WGBS Analysis Part 12

Reviewing trimgalore output with multiqc

Yesterday I started running [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) so I could evaluate trimgalore output. After my script finished running, I transferred the files to relevant subdirectories in this gannet folder, and moved the HTML reports to my repository and class repository. Then, I looked at the multiqc reports after the first, second, and third trims.

The main thing I wanted to check was the overrepresented sequences remaining in the analysis files, so I started by checking the summary module in each report:

[Screenshots of the summary module from each MultiQC report]

Figures 1-3. MultiQC status checks after the first, second, and third trims.

All samples passed the overrepresented sequences check! When I dug into the reports further, I found that some files still had adapter sequences after the second trim, but they were gone after the third trim:

[Screenshots of the overrepresented sequences module after each trim]

Figures 4-6. MultiQC overrepresented sequences for sample 7, read 1.
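As an aside, each FastQC report archive includes a `summary.txt` with one PASS/WARN/FAIL line per module, so a module's status can be re-checked across samples with `grep`. A sketch, with a mock file standing in for a real report:

```shell
# Sketch: FastQC's summary.txt lists one STATUS<tab>Module<tab>Filename line per
# module; grepping it re-checks a module without opening the HTML report.
# Mock file in place of a real FastQC output.
printf 'PASS\tOverrepresented sequences\tzr3616_7_R1_val_1.fq.gz\n' > summary.txt

grep "Overrepresented sequences" summary.txt
```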

I looked at the rest of the MultiQC modules from the third trim to see if there were any other inconsistencies between samples:

[Screenshots of the flagged modules]

Figures 7-8. Modules with inconsistencies.

The per sequence GC content had 2 files (sample 2 reads 1 and 2) that did not pass the test. There were no spikes that indicated a poly-G tail, and the distributions weren’t completely different from the other samples, so I’m not concerned. I also had 10 files (samples 1-4 and 6, reads 1 and 2) that didn’t pass sequence duplication levels. Again, the distributions didn’t look too different from the other samples. The one thing that does concern me is that all of these samples are from the same treatment: 3N and high pH. The only other sample in that treatment, sample 5, passed the sequence duplication test. When looking at sample methylation levels in a PCA, I’ll need to check if all six samples cluster together, or if the sequence duplication levels will affect that clustering.

Going forward

  1. Start bismark
  2. Update the repository README files
  3. Write methods
  4. Write results
  5. Identify DML
  6. Determine if RNA should be extracted
  7. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/3a7U0AL

TWIP Episode 02: Prepared to Argue

This week we increase in numbers and almost hear an argument. Plus – Laura’s audio is amazing!

Back to Square One?

My analysis outline? Incorrect? It’s more likely than you think. Last time we checked in, I had finally run GO-MWU on my four analyses, and came up with more or less nothing. However, after running GO-MWU, I realized I had made an error when determining which analyses to perform. My analyses largely involved comparing different temperature treatments (elevated vs. ambient vs. low) by pooling together samples from Day 0 and Day 2. For instance, to compare elevated and low, I pooled all elevated crab from Day 0+2 and compared gene expression against all lowered-temp crab from Day 0+2. However, as it turns out, Day 0 samples were taken prior to any exposure to different temperatures! As a result, I decided to redo my whole analysis from scratch. This wasn’t as big of a downside as you may think – I started this project back when I was a tiny little…

from Aidan F. Coyle https://ift.tt/3jAWtXp

WGBS Analysis Part 11

trimgalore output and fastqc

Last week, I started trimgalore. My mox script finished running, so I wanted to check the output before I started bismark.


I checked the mox directories to start transferring files onto gannet. The trimming worked successfully, but there was no fastqc output! This was weird because the script I used for these samples was the same as what I used for the Hawaii samples. Confused, I started this discussion with my scripts and slurm output to determine why I didn’t get any fastqc output. Sam looked at the slurm output and saw that I had an error associated with my path:

>>> Now running FastQC on the validated data zr3616_8_R1_val_1.fq.gz<<<
Can't exec "fastqc": No such file or directory at /gscratch/srlab/programs/TrimGalore-0.6.6/trim_galore line 1525, <IN2> line 5536487816.
>>> Now running FastQC on the validated data zr3616_8_R2_val_2.fq.gz<<<
Can't exec "fastqc": No such file or directory at /gscratch/srlab/programs/TrimGalore-0.6.6/trim_galore line 1535, <IN2> line 5536487816.
Deleting both intermediate output files zr3616_8_R1_trimmed.fq.gz and zr3616_8_R2_trimmed.fq.gz

I’m not sure why fastqc would disappear from my path after a few weeks. In any case, I used rsync to transfer all the output to this gannet folder, organized into various subfolders. Then, I followed Sam’s advice to run fastqc separately to determine if it was truly a path issue.


I created this script to run fastqc on all my trimmed samples. In the script, I specified the fastqc and multiqc paths, then used those variables throughout the script:

# Paths to programs
fastqc=/gscratch/srlab/programs/fastqc_v0.11.9/fastqc
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

To run fastqc, I first specified files to analyze by including the absolute path to the directory. I changed the directory path for each trimming iteration:

# Populate array with FastQ files
fastq_array=(/gscratch/scrubbed/yaaminiv/Manchester/analyses/trimgalore/*.fq.gz)

# Pass array contents to new variable
fastqc_list=$(echo "${fastq_array[*]}")
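For what it's worth, the array-to-string pattern above can be demonstrated with mock files: the glob fills the array, and `"${fastq_array[*]}"` joins the elements into one space-separated list (bash-specific syntax):

```shell
# Demonstration of the array pattern with mock files (bash syntax):
# a glob populates the array, and "${arr[*]}" joins elements with spaces.
mkdir -p demo_trim
touch demo_trim/a_R1.fq.gz demo_trim/a_R2.fq.gz

fastq_array=(demo_trim/*.fq.gz)             # glob expands to the matching files
fastqc_list=$(echo "${fastq_array[*]}")     # join into one space-separated string

echo "${fastqc_list}"   # prints: demo_trim/a_R1.fq.gz demo_trim/a_R2.fq.gz
```

Passing that unquoted string to `fastqc` is what lets it see each filename as a separate argument.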

When running fastqc, I also specified the outdir so the output would be written to the same folder as the trimgalore output.

# Run FastQC
# NOTE: Do NOT quote ${fastqc_list}
${fastqc} \
--threads ${threads} \
--outdir /gscratch/scrubbed/yaaminiv/Manchester/analyses/trimgalore \
${fastqc_list}

Finally, I created new multiqc reports:

#MultiQC
${multiqc} \
/gscratch/scrubbed/yaaminiv/Manchester/analyses/trimgalore/.

Unfortunately, I didn’t include the --outdir argument, so the reports were written to the same directory as the slurm file. Next time! Once the script finished running, I moved all the fastqc and multiqc output files to gannet, put the HTML reports in this output subdirectory, and added them to my class repository. Tomorrow, I’ll review the output to make sure the trimming went well.
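A sketch of the same MultiQC call with `--outdir` added (paths assumed from this entry; shown as a dry run since `multiqc` lives on mox):

```shell
# Sketch (assumed paths): the MultiQC call with --outdir added, so reports land
# with the rest of the trimgalore output instead of next to the slurm file.
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc
out_dir=/gscratch/scrubbed/yaaminiv/Manchester/analyses/trimgalore

cmd="${multiqc} --outdir ${out_dir} ${out_dir}/."
echo "${cmd}"   # dry run here; on mox, run the command directly
```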

Going forward

  1. Update the repository README files
  2. Check trimming output
  3. Start bismark
  4. Write methods
  5. Write results
  6. Identify DML
  7. Determine if RNA should be extracted
  8. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/3p43AIO

Mox Job – blastx

## Job Name
#SBATCH --job-name=blasts
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=0-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/020821-cgig-sp

# Load Python Mox module for Python module availability

module load intel-python3_2017

source /gscratch/srlab/programs/scripts/paths.sh

/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/blastx \
-query /gscratch/srlab/sr320/data/cg/GCF_000297895.1_oyster_v9_cds_from_genomic.fna \
-db /gscratch/srlab/blastdbs/UniProtKB_20190109/uniprot_sprot.fasta  \
-out /gscratch/scrubbed/sr320/020821-cgig-sp/Cg-blastx-sp.tab \
-evalue 1E-05 \
-num_threads 40 \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt "6 qaccver saccver evalue"
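The `-outfmt "6 qaccver saccver evalue"` setting writes a three-column, tab-separated table, so the output is easy to filter with standard tools. A sketch, using a mock file in place of the real `Cg-blastx-sp.tab`:

```shell
# Sketch: the job writes tab-separated qaccver / saccver / evalue rows.
# A mock file stands in for Cg-blastx-sp.tab to show filtering by e-value.
printf 'q1\tsp|P12345|NAME1\t1e-30\nq2\tsp|P67890|NAME2\t2e-08\n' > Cg-blastx-sp.tab

# Keep only hits stronger than 1e-20 (the job itself already filtered at 1e-05)
awk -F'\t' '$3 + 0 < 1e-20' Cg-blastx-sp.tab
```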


WGBS Analysis Part 10

Analyzing the full dataset

At the end of last week, I got my samples back from ZymoResearch! These are WGBS samples of C. gigas female gonad tissue from my Manchester experiment. The purpose of sequencing these samples is to determine if the low pH exposure altered the methylome. If we find something interesting from these samples, we may sequence some larval samples, or look into gonad RNA.

Obtaining sequences

The first thing I needed to do was download the data. ZymoResearch provided two URLs: one for sample fastq files, and another for sample checksums. Each URL directed to a text file. The file with fastq information contained a link that could be curled to access the individual files. In order to get the data, I needed a straightforward way to specify multiple URLs in a curl or wget command.

I found this link with information on how to do just that! Once I downloaded the text file, I could use xargs to specify the individual URLs to download:

curl https://ift.tt/3rccANF > download_fastq.txt #Download file provided by ZymoResearch
xargs -n 1 curl -O < download_fastq.txt #Download fastq files

Then, I verified checksums:

curl https://ift.tt/39EDuI3 > zr3616_MD5.txt #Download MD5 checksums from ZymoResearch
find *fq.gz | md5sum -c zr3616_MD5.txt #Cross-reference original checksums with downloaded files. All files passed
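As a side note on how the verification works, `md5sum -c` reads checksum/filename pairs from the manifest and recomputes each listed file's digest. A minimal sketch with mock files:

```shell
# Sketch of md5sum -c: it reads "checksum  filename" lines from a manifest and
# recomputes each listed file's digest (mock file in place of the Zymo downloads).
printf 'ACGT\n' > demo.fq.gz          # mock "fastq" file
md5sum demo.fq.gz > demo_MD5.txt      # manifest in the same format as zr3616_MD5.txt
md5sum -c demo_MD5.txt                # prints "demo.fq.gz: OK" when digests match
```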

Once I had the data, I needed to move it to `owl` and include metadata in the [`Nightingales Google Sheet`](https://b.link/nightingales). I used `mv` to move samples from my Desktop to `owl` mounted on my computer, but couldn't write to the directory. I posted [this issue](https://github.com/RobertsLab/resources/issues/1085) to get write access to both the `owl` folder and Google Sheet. As an aside, Sam reminded me I should use `rsync` to transfer the files:

rsync --archive --progress --verbose zr3616_* /Volumes/web-1/nightingales/C_gigas/ #Transfer to nightingales

While the files transferred over many hours, I updated the Google Sheet with sample metadata.

Assessing raw data quality

When I received the data from ZymoResearch, the pdf implied that adapters were trimmed out of the data. Confused, I posted [this discussion](https://github.com/RobertsLab/resources/discussions/1082) to determine if I needed to modify my trimming protocol. Sam informed me that the samples were likely not trimmed, and I would see this if I ran [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Once the files were transferred to Nightingales, I needed to transfer them to `mox` and do just that. I used `rsync` to transfer the files:

rsync --archive --verbose --progress yaamini@*gz /gscratch/scrubbed/yaaminiv/Manchester/data/ #Transfer to mox

Then, I set up [a `fastqc` script](https://github.com/RobertsLab/project-gigas-oa-meth/blob/master/code/01-fastqc.sh). To streamline the repository, I moved my preliminary analyses to the [pooled subset](https://github.com/RobertsLab/project-gigas-oa-meth/blob/master/pooled-subset) subdirectory. This way, my scripts will remain linked to this repository without cluttering it.

Running `fastqc` seemed simple. I located the program on mox in the `srlab` programs directory. The command to run the program is `fastqc` + filenames, so I specified the directory with my data, and included code for [`MultiQC`](https://multiqc.info/) report compilation. When I ran this script, it obviously didn't work! Figuring Sam must have run `fastqc` recently, I found [this lab notebook entry](https://robertslab.github.io/sams-notebook/2020/11/10/FastQC-MultiQC-C.gigas-Ploidy-WGBS-Raw-Sequence-Data-from-Ronits-Project-on-Mox.html) with a `fastqc` script for `mox`. I wasn't sure why Sam needed to work with the files in an array instead of calling `fastqc` directly, but I figured it had to do with the error I encountered when I tried to do just that. I copied the code into my script and modified the variables so they pointed to the correct directories and samples. With his code, `fastqc` will generate sequence quality reports, `multiqc` will compile them, and checksums will be generated. I probably don't need the checksum information since `rsync` verifies checksums when moving files, and I have the original checksum information from ZymoResearch. In any case, it couldn't hurt. Once I finished modifying the script, I transferred it to mox and ran it:

rsync --archive --verbose --progress yaamini@ . #Transfer script to user directory
sbatch 01-fastqc.sh #Run script

Once the initial quality assessment is done, I’ll run trimgalore to remove adapter sequences!

Going forward

  1. Trim samples, checking specifically for reasons to trim multiple times
  2. Start bismark
  3. Identify DML
  4. Determine if RNA should be extracted
  5. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/3td5bj3

Starting GO-MWU

Getting GO-MWU Input

Alright, I finally progressed to the next stage – running GO-MWU! In order to run GO-MWU, I needed two files. First, a two-column CSV table with gene IDs and a measure of significance (I used unadjusted p-value). Second, a two-column tab-separated file with gene IDs and GO terms. Obtaining the first was fairly straightforward – my R script can be found here. In that script, I pulled all genes – including those with an unadjusted p-value of NA. That might be relevant later – I may want to redo this with only non-NA values. Obtaining the second was also straightforward, but took a long, long time. I had previously created a newline-separated file of accession IDs for each of my 4 comparisons (reminder: elevated vs ambient, elev. vs low, amb. vs low, day 0 vs day 17 amb.). I plugged that into a script from Sam, which got the…

from Aidan F. Coyle https://ift.tt/36BB630

February 2021 Goals


With the amount of work I need to get done this short month, I’m definitely on the meow meow train. But I think at the end of this month, I’ll have a much clearer understanding of how much work is left for me before I’m solely in an analysis and writing phase, and when may be a reasonable time to defend!

January Goals Recap

Hawaii Gigas Methylation:

Gigas Gonad Methylation:

  • I didn’t get any of my data until last week, so I haven’t touched it yet!

Virginica Labwork:

  • Obtained WGBS and RNA-Seq quotes from Zymo
  • Started getting PO numbers for each quote
  • Was not able to send samples for sequencing since I don’t have both PO numbers
  • Did not determine if there are additional samples that should be extracted and sequenced

My plate was pretty full with bioinformatic analyses and quarantining after getting back from California, so I didn’t make any progress on ATAC-Seq Labwork or Gigas Labwork!

February goals

Hawaii Gigas Methylation:

  • Evaluate bismark output using coral methylation workflow
  • Complete preliminary assessment of DML with methylKit
  • Try identifying DML with DSS
  • Compare methylKit and DSS DML output and determine which approach is suitable
  • Determine genomic location of DML
  • Identify significantly enriched GO terms associated with DML
  • Identify methylation islands and non-methylated regions
  • Locate an ATAC-Seq dataset that could be used to practice integrating chromatin information with methylation
  • Start drafting manuscript

Gigas Gonad Methylation:

  • Trim samples
  • Align samples with bismark and evaluate output
  • Identify DML using either methylKit or DSS
  • Determine genomic location of DML
  • Identify significantly enriched GO terms associated with DML
  • Identify methylation islands and non-methylated regions
  • Decide if it’s worth extracting gonad RNA for integrated RNA-Seq and methylation analyses
  • Start drafting manuscript

Virginica Labwork:

  • Send samples for sequencing!

ATAC-Seq Labwork:

  • Purchase reagents and identify samples to test cell dissociation protocols
  • Ensure protocol is easy to follow and is accessible for the lab
  • Dissociate and cryopreserve some cells


  • Continue work on ocean acidification and reproduction review in order to submit the manuscript this quarter
  • Complete review for Molecular Ecology
  • Watch SICB talks and prepare for live discussion session


from the responsible grad student https://ift.tt/2MLZaJ8