TWIP: Episode 4 – Elevated Paprika

This week we hear about breakfast dos and don’ts and everyone’s goals, and get to listen to a LIVE paper submission!

Hawaii Gigas Methylation Analysis Part 5

Evaluating bismark output

I’ve been sitting on the Hawaii bismark data for a while, so it’s time I actually look at it! I referred to the MethCompare repository to remind myself what Shelly did to evaluate bismark output quality. I found this script and realized all I needed to do was 1) get multiqc information for bismark output, and 2) look at the data in IGV!

multiqc

When I looked at my script, I realized I never ran multiqc on the output! I navigated to where I stored the output on gannet and ran multiqc * to get the reports. From this command, I got an HTML report with relevant summary statistics, as well as individual text files with the raw data (found in this folder). Looking at the general statistics, sample alignment averaged 60%, which feels pretty good! About 12% of reads per sample aligned ambiguously, and 26% did not align at all. I was more interested in the read duplication statistics. Since I’m comparing diploid and triploid oysters, sequence duplication would be a bigger issue if more duplicates were removed from triploid oysters than from diploid ones. The percentage of duplicated reads removed ranged from 10-12% for each treatment combination (ploidy x pH), so there doesn’t seem to be a treatment effect (at least at first glance). I did notice that sample 7 had more reads than any other sample by a long shot.
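For reference, this is all MultiQC needs; a minimal sketch, assuming you are sitting in the directory on gannet that holds the Bismark output (the path below is a placeholder, not the real location):

# Move to the Bismark output directory (placeholder path)
cd /path/to/hawaii-bismark-output
# Aggregate all Bismark alignment, deduplication, and methylation extraction logs
# into multiqc_report.html, with raw per-module text files in multiqc_data/
multiqc *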

Figure 1. Reads aligned for each sample

IGV

My next step was looking at the coverage files in IGV! I created coverage files that merged CpG locations (*merged_CpG_evidence.cov), filtered the data to 5x coverage, then created bedgraphs (*5x.bedgraph). I wanted to look at these bedgraphs in IGV with respect to genome feature files.
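The 5x filtering step mirrors the commented-out block in the job script further down; a minimal sketch, run in the directory containing the merged coverage files:

for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
  # Columns 5 and 6 are the methylated and unmethylated counts; keep CpGs with
  # >= 5x combined coverage and write chrom, start, end, % methylation.
  awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4}}' "${f}" \
  > "${STEM}"_5x.bedgraph
done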

The first thing I did was download the latest version of IGV (v 2.9.2) from this link. I found the links to the C. gigas genome and genome feature files in the Genomic Resources wiki. I created this IGV file using the eagle and gannet links, and colored samples based on treatment (light blue (1-6): 2N high pH; light red (7-12): 2N low pH; dark blue (13-18): 3N high pH; dark red (19-24): 3N low pH). From my quick browse, most methylation seems to be in gene regions (which makes sense for invertebrates), but I did find some instances of transposable element methylation.

Figures 2-3. IGV tracks

Now that I’ve looked at the data, my next steps are to load it into R and explore it with methylKit. While I won’t be able to conduct a full analysis, I can obtain sample methylation histograms and PCAs. To identify DML, I’m going to look into DSS and ramwas. Additionally, Steven brought up that there’s a revised C. gigas genome! While I work through DML identification with this dataset, I should also repeat bismark with the new genome.

Going forward

  1. Align samples to the revised C. gigas genome
  2. Obtain preliminary methylation assessment from methylKit
  3. Test-run DSS and ramwas
  4. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  5. Transfer scripts used to a nextflow workflow
  6. Update methods
  7. Update results


from the responsible grad student https://ift.tt/3umhXfG
via IFTTT

job-name=hw-bsnlP

#!/bin/bash
## Job Name
#SBATCH --job-name=hw-bsnlP
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=100G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/021921-hw-bsnP
 
 
 
# Directories and programs
bismark_dir="/gscratch/srlab/programs/Bismark-0.21.0"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
reads_dir="/gscratch/srlab/sr320/data/cg/"
genome_folder="/gscratch/srlab/sr320/data/Cgig-genome/Crassostrea_gigas.oyster_v9.dna_sm.toplevel/"
 
source /gscratch/srlab/programs/scripts/paths.sh
 
 
 
#${bismark_dir}/bismark_genome_preparation \
#--verbose \
#--parallel 28 \
#--path_to_aligner ${bowtie2_dir} \
#${genome_folder}
 
 
#/zr3644_11_R2.fastp-trim.20201206.fq.gz
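
# Align every trimmed paired-end library with Bismark: basename strips the R1
# suffix to yield each sample prefix, and xargs launches up to 6 samples in
# parallel (-P 6), each with 4 bowtie2 threads (-p 4).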
 
find ${reads_dir}*_R1.fastp-trim.20201206.fq.gz \
| xargs basename -s _R1.fastp-trim.20201206.fq.gz | xargs -n 1 -P 6 -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
--non_directional \
-1 ${reads_dir}{}_R1.fastp-trim.20201206.fq.gz \
-2 ${reads_dir}{}_R2.fastp-trim.20201206.fq.gz
 
 
 
find *.bam | \
xargs basename -s .bam | \
xargs -I{} ${bismark_dir}/deduplicate_bismark \
--bam \
--paired \
{}.bam
 
 
 
${bismark_dir}/bismark_methylation_extractor \
--bedGraph --counts --scaffolds \
--multicore 14 \
--buffer_size 75% \
*deduplicated.bam
 
 
 
# Bismark processing report
 
${bismark_dir}/bismark2report
 
#Bismark summary report
 
${bismark_dir}/bismark2summary
 
 
 
# Sort files for methylkit and IGV
 
find *deduplicated.bam | \
xargs basename -s .bam | \
xargs -I{} ${samtools} \
sort --threads 28 {}.bam \
-o {}.sorted.bam
 
# Index sorted files for IGV
# The "-@ 28" below specifies the number of CPU threads to use.
 
find *.sorted.bam | \
xargs basename -s .sorted.bam | \
xargs -I{} ${samtools} \
index -@ 28 {}.sorted.bam

#parallel, #path_to_aligner, #verbose, #bismark, #sbatch

GO-MWU Wrapup + Future Plans

Wrapup of GO-MWU

Last post, I had run GO-MWU on my two analyses – Elevated Day 2 vs. Ambient Day 0+2 (individual libraries only), and Ambient Day 0+2+17 + Elevated Day 0 + Lowered Day 0 vs. Elevated Day 2 (individual and pooled libraries). I’ve been continuing with a bunch of different analyses and running them through my pipeline. Reminder: the pipeline is kallisto matrix -> DESeq2 -> GO-MWU (with steps in between for formatting). I’ve run the following pairwise comparisons all the way through both:

  1. Ambient Day 0 vs. Ambient Day 2 (individual libraries only)
  2. Ambient Day 0 vs. Ambient Day 17 (individual libraries only)
  3. Ambient Day 2 vs. Ambient Day 17 (individual libraries only)
  4. Ambient Day 2 vs. Elevated Day 2 (individual libraries only)
  5. Elevated Day 0 vs. Elevated Day 2 (individual libraries only)

Here’s a link to all visualizations of significant GO categories, and here’s a link to…

from Aidan F. Coyle https://ift.tt/2ZmYugx
via IFTTT

job-name=hw-bs

#!/bin/bash
## Job Name
#SBATCH --job-name=hw-bs
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=20-00:00:00
## Memory per node
#SBATCH --mem=100G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/021321-hw-bs



# Directories and programs
bismark_dir="/gscratch/srlab/programs/Bismark-0.21.0"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
reads_dir="/gscratch/srlab/sr320/data/cg/"
genome_folder="/gscratch/srlab/sr320/data/Cgig-genome/Crassostrea_gigas.oyster_v9.dna_sm.toplevel/"

source /gscratch/srlab/programs/scripts/paths.sh



#${bismark_dir}/bismark_genome_preparation \
#--verbose \
#--parallel 28 \
#--path_to_aligner ${bowtie2_dir} \
#${genome_folder}


#/zr3644_11_R2.fastp-trim.20201206.fq.gz
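
# Align every trimmed paired-end library with Bismark. This run uses a relaxed
# minimum alignment score (--score_min L,0,-0.6 instead of Bismark's default
# L,0,-0.2), which tolerates more mismatches per read.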

find ${reads_dir}*_R1.fastp-trim.20201206.fq.gz \
| xargs basename -s _R1.fastp-trim.20201206.fq.gz | xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
-score_min L,0,-0.6 \
--non_directional \
-1 ${reads_dir}{}_R1.fastp-trim.20201206.fq.gz \
-2 ${reads_dir}{}_R2.fastp-trim.20201206.fq.gz



find *.bam | \
xargs basename -s .bam | \
xargs -I{} ${bismark_dir}/deduplicate_bismark \
--bam \
--paired \
{}.bam



${bismark_dir}/bismark_methylation_extractor \
--bedGraph --counts --scaffolds \
--multicore 14 \
--buffer_size 75% \
*deduplicated.bam



# Bismark processing report

${bismark_dir}/bismark2report

#Bismark summary report

${bismark_dir}/bismark2summary



# Sort files for methylkit and IGV

find *deduplicated.bam | \
xargs basename -s .bam | \
xargs -I{} ${samtools} \
sort --threads 28 {}.bam \
-o {}.sorted.bam

# Index sorted files for IGV
# The "-@ 28" below specifies the number of CPU threads to use.

find *.sorted.bam | \
xargs basename -s .sorted.bam | \
xargs -I{} ${samtools} \
index -@ 28 {}.sorted.bam



# 
# 
# find *deduplicated.bismark.cov.gz \
# | xargs basename -s _R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz \
# | xargs -I{} ${bismark_dir}/coverage2cytosine \
# --genome_folder ${genome_folder} \
# -o {} \
# --merge_CpG \
# --zero_based \
# {}_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz
# 
# 
# #creating bedgraphs post merge
# 
# for f in *merged_CpG_evidence.cov
# do
#   STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
#   cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
#   > "${STEM}"_10x.bedgraph
# done
# 
# 
# 
# for f in *merged_CpG_evidence.cov
# do
#   STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
#   cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4}}' \
#   > "${STEM}"_5x.bedgraph
# done
# 
# 
# #creating tab files with raw count for glms
# 
# for f in *merged_CpG_evidence.cov
# do
#   STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
#   cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4, $5, $6}}' \
#   > "${STEM}"_10x.tab
# done
# 
# 
# for f in *merged_CpG_evidence.cov
# do
#   STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
#   cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4, $5, $6}}' \
#   > "${STEM}"_5x.tab
# done

#parallel, #path_to_aligner, #verbose, #bismark, #creating, #sbatch

job-name=cg-spur

#!/bin/bash
## Job Name
#SBATCH --job-name=cg-spur
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=4-12:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/021321-cgig-spur


# Load Python Mox module for Python module availability

module load intel-python3_2017




source /gscratch/srlab/programs/scripts/paths.sh
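
# blastx: search the translated C. gigas CDS sequences against the S. purpuratus
# protein database (ProteinsSpur5.0), reporting at most one hit (one HSP) per
# query and writing query accession, subject accession, and e-value to a tab file.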



/gscratch/srlab/programs/ncbi-blast-2.10.1+/bin/blastx \
-query /gscratch/srlab/sr320/data/cg/GCF_000297895.1_oyster_v9_cds_from_genomic.fna \
-db /gscratch/srlab/sr320/blastdb/ProteinsSpur5.0  \
-out /gscratch/scrubbed/sr320/021321-cgig-spur/Cg-blastx-spur.tab \
-num_threads 40 \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt "6 qaccver saccver evalue"

#sbatch

TWIP: Episode 03 – The number one hypothesis

This week we hear about the number one hypothesis and get some insight into Sam’s feelings.

SRA Submission – A. elegantissima ONT Fast5 from Jay Dimond

At the beginning of the month, Jay sent us his A. elegantissima ONT Fast5 data so we could help him get it submitted to the NCBI Sequence Read Archive (SRA).

Due to the large amount of data (>300GB), it took a number of days to get the files transferred, but they’re there!

All data is available under the following SRA BioProject: PRJNA700526

The BioProject accession above is what NCBI recommends for citation. However, here’s a list of each individual file submitted and its corresponding SRA accession:

| SeqID | Library_Name | SRA | BioProject |
| --- | --- | --- | --- |
| aele_A1_aposymb | bar05 | SRR13685276 | PRJNA700526 |
| aele_A3_aposymb | bar03 | SRR13685275 | PRJNA700526 |
| aele_A4_aposymb | bar01 | [SRR13685274](https://ift.tt/3tTXMoY) | PRJNA700526 |
| aele_G2_symb | bar02 | [SRR13685273](https://ift.tt/3rMzk75) | PRJNA700526 |
| aele_G3_symb | bar06 | [SRR13685272](https://ift.tt/3tRhe5U) | PRJNA700526 |
| aele_G4_symb | bar04 | SRR13685271 | PRJNA700526 |

from Sam’s Notebook https://ift.tt/3tThoth
via IFTTT

Completed GO-MWU for new analyses

Speeding Things Up

As of my last post (yesterday), I was trying to figure out a way to rapidly obtain large numbers of GO IDs. Last time, I used Sam’s shell script by calling it inside this script. However, Sam’s script took an extremely long time to run. A rough ballpark: for my 2 analyses (each with ~120,000 accession IDs), it would take a total of about 8 days on my local machine. That’s a huge pipeline bottleneck, so we found an alternative.

New Script for Uniprot to GO

I created an R script for obtaining GO terms from accession IDs. Downside: it requires you to manually download the SwissProt database – with all GO terms included – from https://ift.tt/3p9XlU6. Upside: it’s much, much, much faster. It ran in a few minutes, which, by my calculation, is a bit faster than 8 days. After getting GO terms, you need…
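As an aside, once the SwissProt download has been reduced to a simple two-column table, the accession-to-GO lookup can also be done on the command line. This is only a hypothetical sketch (the file names and column layout are assumed), not the R script described above:

# accessions.txt   : one UniProt accession per line (e.g., from the DESeq2 results)
# swissprot_go.tab : tab-delimited, accession in column 1, semicolon-separated
#                    GO IDs in column 2 (exported from the SwissProt download)
sort accessions.txt > accessions.sorted.txt
sort -t $'\t' -k1,1 swissprot_go.tab > swissprot_go.sorted.tab
# join matches each accession to its GO terms on the first field of both files
join -t $'\t' accessions.sorted.txt swissprot_go.sorted.tab > accession2go.tab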

from Aidan F. Coyle https://ift.tt/3cXSMd2
via IFTTT

Virginica Gonad DNA Extractions Part 14

Sending samples for sequencing

Quotes obtained? Check. PO numbers generated? Check. Sequence submission forms submitted? Check.

Then there’s nothing left to do but send out the samples! I labelled screw top tubes for all samples with enough DNA and RNA yield based on the extraction metadata. Basically, this meant I labelled DNA and RNA tubes for every sample except sample 57. I used screw top tubes because they’re more secure and I didn’t want any samples spilling! Once the tubes were labelled, I grabbed the samples from the -80 ºC freezer and let them equilibrate to room temperature. I centrifuged each sample to ensure all the liquid could easily be pipetted, then transferred it to the appropriate labelled screw top tube. I then put that tube in a labelled ziploc bag, using separate bags for DNA and RNA. I placed the filled ziploc bags at the bottom of a cooler with an insulating lining, and filled the lining with dry ice so there was ample ice to keep the samples cold. Once the samples were packaged properly, I sent the package to ZymoResearch via FedEx Overnight!

Going forward

  1. Wait for sequencing data!


from the responsible grad student https://ift.tt/3tJpbdf
via IFTTT