
This week we hear about breakfast dos and don’ts and everyone’s goals, and get to listen to a LIVE paper submission!
bismark output

I’ve been sitting on the Hawaii bismark data for a while, so it’s time I actually look at it! I referred to the MethCompare repository to remind myself what Shelly did to evaluate bismark output quality. I found this script and realized all I needed to do was 1) get multiqc information for bismark output, and 2) look at the data in IGV!
multiqc

When I looked at my script, I realized I never ran multiqc on the output! I navigated to where I stored the output on gannet and ran multiqc * to get the reports.
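For reference, the whole QC step boils down to a couple of commands; a minimal sketch, where the directory path is a placeholder for wherever the Bismark output actually lives on gannet:

```bash
# Hypothetical path to the Bismark output on gannet
cd /volume2/web/path/to/hawaii-bismark-output/

# MultiQC scans everything it is pointed at and aggregates any reports it
# recognizes (Bismark alignment, deduplication, and splitting reports)
multiqc *

# Results land in multiqc_report.html, plus a multiqc_data/ folder holding
# the raw summary statistics as plain-text tables
```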
From this command, I got an HTML report with relevant summary statistics, as well as individual text files with the raw data (found in this folder). Looking at the general statistics, sample alignment averaged 60%, which feels pretty good! About 12% of reads per sample aligned ambiguously, and 26% did not align at all. I was more interested in duplicated read alignment: since I’m comparing diploid and triploid oysters, sequence duplication would be a bigger issue if more duplicates were removed from triploid oysters. The percentage of duplicated reads removed ranged from 10-12% for each treatment combination (ploidy x pH), so there doesn’t seem to be a treatment effect (at least at first glance). I did notice that sample 7 had more reads than any other sample by a long shot.
Figure 1. Reads aligned for each sample
My next step was looking at the coverage files in IGV! I created coverage files that merged CpG locations (*merged_CpG_evidence.cov), filtered the data to 5x coverage, then created bedgraphs (*5x.bedgraph). I wanted to look at these bedgraphs in IGV with respect to genome feature files.
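The 5x filter itself is a one-liner per file over the merged coverage output; a minimal sketch, assuming Bismark’s merged .cov column layout (chromosome, start, end, % methylation, methylated count, unmethylated count), mirroring the commented-out loop in the job script further down this digest:

```bash
# Keep only CpG loci where total coverage (methylated + unmethylated
# counts) is at least 5, writing a four-column bedgraph
for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
  awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4}}' "${f}" \
  > "${STEM}"_5x.bedgraph
done
```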
The first thing I did was download the latest version of IGV (v 2.9.2) from this link. I found the links to the C. gigas genome and genome feature files in the Genomic Resources wiki. I created this IGV file using the eagle and gannet links, and colored samples based on treatment (light blue (1-6): 2N high pH; light red (7-12): 2N low pH; dark blue (13-18): 3N high pH; dark red (19-24): 3N low pH). From my quick browse, most methylation seems to be in gene regions (which makes sense for invertebrates), but I did find some instances of transposable element methylation.
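Since the session file saves the genome, tracks, and colors together, reopening everything later is a one-liner; a sketch, assuming a local IGV install and a hypothetical session file name:

```bash
# igv.sh is the launcher that ships with the IGV download (Linux/macOS);
# passing a saved session restores the genome, tracks, and colors in one go
./igv.sh hawaii-ploidy-pH-session.xml
```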
Figure 2-3. IGV tracks
Now that I’ve looked at the data, my next steps are to load the data into R to explore with methylKit. While I won’t be able to conduct a full analysis, I can obtain sample methylation histograms and PCAs. To identify DML, I’m going to look into DSS and ramwas. Additionally, Steven brought up that there’s a revised C. gigas genome! While I work through DML identification with this dataset, I should also repeat bismark with the new genome.
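Re-running bismark would start with genome preparation against the new assembly, just like the genome-preparation step that is commented out in the job scripts further down; a minimal sketch, where the path to the revised genome folder is a placeholder:

```bash
# Placeholder paths: adjust to where the revised assembly actually lives
bismark_dir="/gscratch/srlab/programs/Bismark-0.21.0"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
new_genome_folder="/path/to/revised-cgigas-genome/"

# Build the bisulfite-converted bowtie2 indices for the new assembly;
# alignment runs can then point their genome argument at this folder
${bismark_dir}/bismark_genome_preparation \
--verbose \
--parallel 28 \
--path_to_aligner ${bowtie2_dir} \
${new_genome_folder}
```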
from the responsible grad student https://ift.tt/3umhXfG
via IFTTT
```bash
#!/bin/bash
## Job Name
#SBATCH --job-name=hw-bsnlP
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=100G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/021921-hw-bsnP

# Directories and programs
bismark_dir="/gscratch/srlab/programs/Bismark-0.21.0"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
reads_dir="/gscratch/srlab/sr320/data/cg/"
genome_folder="/gscratch/srlab/sr320/data/Cgig-genome/Crassostrea_gigas.oyster_v9.dna_sm.toplevel/"

source /gscratch/srlab/programs/scripts/paths.sh

#${bismark_dir}/bismark_genome_preparation \
#--verbose \
#--parallel 28 \
#--path_to_aligner ${bowtie2_dir} \
#${genome_folder}

#/zr3644_11_R2.fastp-trim.20201206.fq.gz

find ${reads_dir}*_R1.fastp-trim.20201206.fq.gz \
| xargs basename -s _R1.fastp-trim.20201206.fq.gz \
| xargs -n 1 -P 6 -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
--non_directional \
-1 ${reads_dir}{}_R1.fastp-trim.20201206.fq.gz \
-2 ${reads_dir}{}_R2.fastp-trim.20201206.fq.gz

find *.bam | \
xargs basename -s .bam | \
xargs -I{} ${bismark_dir}/deduplicate_bismark \
--bam \
--paired \
{}.bam

${bismark_dir}/bismark_methylation_extractor \
--bedGraph --counts --scaffolds \
--multicore 14 \
--buffer_size 75% \
*deduplicated.bam

# Bismark processing report
${bismark_dir}/bismark2report

# Bismark summary report
${bismark_dir}/bismark2summary

# Sort files for methylkit and IGV
find *deduplicated.bam | \
xargs basename -s .bam | \
xargs -I{} ${samtools} \
sort --threads 28 {}.bam \
-o {}.sorted.bam

# Index sorted files for IGV
# The "-@ 28" below specifies number of CPU threads to use.
find *.sorted.bam | \
xargs basename -s .sorted.bam | \
xargs -I{} ${samtools} \
index -@ 28 {}.sorted.bam
```
Wrapup of GO-MWU

Last post, I had run GO-MWU on my two analyses – Elevated Day 2 vs. Ambient Day 0+2 (individual libraries only), and Ambient Day 0+2+17 + Elevated Day 0 + Lowered Day 0 vs. Elevated Day 2 (individual and pooled libraries). I’ve been continuing on a bunch of different analyses and running them through my pipeline. Reminder: the pipeline is kallisto matrix -> DESeq2 -> GO-MWU (with steps in between for formatting). I’ve run the following pairwise comparisons all the way through the pipeline:

- Ambient Day 0 vs. Ambient Day 2 (individual libraries only)
- Ambient Day 0 vs. Ambient Day 17 (individual libraries only)
- Ambient Day 2 vs. Ambient Day 17 (individual libraries only)
- Ambient Day 2 vs. Elevated Day 2 (individual libraries only)
- Elevated Day 0 vs. Elevated Day 2 (individual libraries only)

Here’s a link to all visualizations of significant GO categories, and here’s a link to…
from Aidan F. Coyle https://ift.tt/2ZmYugx
via IFTTT
```bash
#!/bin/bash
## Job Name
#SBATCH --job-name=hw-bs
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=20-00:00:00
## Memory per node
#SBATCH --mem=100G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/021321-hw-bs

# Directories and programs
bismark_dir="/gscratch/srlab/programs/Bismark-0.21.0"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
reads_dir="/gscratch/srlab/sr320/data/cg/"
genome_folder="/gscratch/srlab/sr320/data/Cgig-genome/Crassostrea_gigas.oyster_v9.dna_sm.toplevel/"

source /gscratch/srlab/programs/scripts/paths.sh

#${bismark_dir}/bismark_genome_preparation \
#--verbose \
#--parallel 28 \
#--path_to_aligner ${bowtie2_dir} \
#${genome_folder}

#/zr3644_11_R2.fastp-trim.20201206.fq.gz

find ${reads_dir}*_R1.fastp-trim.20201206.fq.gz \
| xargs basename -s _R1.fastp-trim.20201206.fq.gz \
| xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
-score_min L,0,-0.6 \
--non_directional \
-1 ${reads_dir}{}_R1.fastp-trim.20201206.fq.gz \
-2 ${reads_dir}{}_R2.fastp-trim.20201206.fq.gz

find *.bam | \
xargs basename -s .bam | \
xargs -I{} ${bismark_dir}/deduplicate_bismark \
--bam \
--paired \
{}.bam

${bismark_dir}/bismark_methylation_extractor \
--bedGraph --counts --scaffolds \
--multicore 14 \
--buffer_size 75% \
*deduplicated.bam

# Bismark processing report
${bismark_dir}/bismark2report

# Bismark summary report
${bismark_dir}/bismark2summary

# Sort files for methylkit and IGV
find *deduplicated.bam | \
xargs basename -s .bam | \
xargs -I{} ${samtools} \
sort --threads 28 {}.bam \
-o {}.sorted.bam

# Index sorted files for IGV
# The "-@ 28" below specifies number of CPU threads to use.
find *.sorted.bam | \
xargs basename -s .sorted.bam | \
xargs -I{} ${samtools} \
index -@ 28 {}.sorted.bam

# find *deduplicated.bismark.cov.gz \
# | xargs basename -s _R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz \
# | xargs -I{} ${bismark_dir}/coverage2cytosine \
# --genome_folder ${genome_folder} \
# -o {} \
# --merge_CpG \
# --zero_based \
# {}_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz

# creating bedgraphs post merge

# for f in *merged_CpG_evidence.cov
# do
#   STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
#   cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
#   > "${STEM}"_10x.bedgraph
# done

# for f in *merged_CpG_evidence.cov
# do
#   STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
#   cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4}}' \
#   > "${STEM}"_5x.bedgraph
# done

# creating tab files with raw count for glms

# for f in *merged_CpG_evidence.cov
# do
#   STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
#   cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4, $5, $6}}' \
#   > "${STEM}"_10x.tab
# done

# for f in *merged_CpG_evidence.cov
# do
#   STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
#   cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4, $5, $6}}' \
#   > "${STEM}"_5x.tab
# done
```
```bash
#!/bin/bash
## Job Name
#SBATCH --job-name=cg-spur
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=4-12:00:00
## Memory per node
#SBATCH --mem=120G
## turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/021321-cgig-spur

# Load Python Mox module for Python module availability
module load intel-python3_2017

source /gscratch/srlab/programs/scripts/paths.sh

/gscratch/srlab/programs/ncbi-blast-2.10.1+/bin/blastx \
-query /gscratch/srlab/sr320/data/cg/GCF_000297895.1_oyster_v9_cds_from_genomic.fna \
-db /gscratch/srlab/sr320/blastdb/ProteinsSpur5.0 \
-out /gscratch/scrubbed/sr320/021321-cgig-spur/Cg-blastx-spur.tab \
-num_threads 40 \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt "6 qaccver saccver evalue"
```
This week we hear about the number one hypothesis and get some insight into Sam’s feelings.
At the beginning of the month, Jay sent us his A. elegantissima ONT Fast5 data in order to help him get it submitted to the NCBI Sequence Read Archive (SRA).
Due to the large amount of data (>300GB), it took a number of days to get the files transferred, but they’re there!
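For transfers this size, NCBI’s Aspera preload route is much faster than FTP. Purely as a sketch of how those uploads usually go (not necessarily the exact command used here; the key file and upload folder are account-specific placeholders issued by the SRA submission portal):

```bash
# Preload the Fast5 data to the SRA upload area via Aspera;
# -k1 enables resume after interruptions, -l100m caps the rate at 100 Mbps
ascp -i ~/aspera.openssh -QT -l100m -k1 -d \
/path/to/aele_fast5/ \
subasp@upload.ncbi.nlm.nih.gov:uploads/your_email_XYZ123/
```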
All data is available under the following SRA BioProject: PRJNA700526.
The above BioProject accession is what NCBI recommends for citation. However, here’s the list of each individual file submitted and its corresponding SRA accession:
SeqID | Library_Name | SRA | BioProject |
---|---|---|---|
aele_A1_aposymb | bar05 | SRR13685276 | PRJNA700526 |
aele_A3_aposymb | bar03 | SRR13685275 | PRJNA700526 |
aele_A4_aposymb | bar01 | [SRR13685274](https://ift.tt/3tTXMoY) | PRJNA700526 |
aele_G2_symb | bar02 | [SRR13685273](https://ift.tt/3rMzk75) | PRJNA700526 |
aele_G3_symb | bar06 | [SRR13685272](https://ift.tt/3tRhe5U) | PRJNA700526 |
aele_G4_symb | bar04 | SRR13685271 | PRJNA700526 |
from Sam’s Notebook https://ift.tt/3tThoth
via IFTTT
Speeding Things Up

As of my last post (yesterday), I was trying to figure out a way to rapidly obtain large numbers of GO IDs. Last time, I used Sam’s shell script by calling it inside this script. However, Sam’s script took an extremely long time to run. A rough ballpark: for my 2 analyses (each with ~120,000 accession IDs), it would take a total of about 8 days on my local machine. That’s a huge pipeline bottleneck, so we found an alternative.

New Script for Uniprot to GO

I created an R script for obtaining GO terms from accession IDs. Downside: it requires you to manually download the SwissProt database – with all GO terms included – from https://ift.tt/3p9XlU6. Upside: it’s much, much, much faster. It ran in a few minutes, which by my calculation is a bit faster than 8 days. After getting GO terms, you need…
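A command-line sketch of the same idea (this is not the R script from the post): UniProt also publishes a pre-built idmapping table in which, if I recall the layout correctly, column 1 is the accession and column 7 holds the GO IDs, so a local join against an accession list avoids hitting the web API once per accession. File names here are hypothetical:

```bash
# Hypothetical inputs: accessions.txt holds one UniProt accession per line;
# idmapping_selected.tab.gz comes from UniProt's FTP idmapping directory
# (column 1 = accession, column 7 = semicolon-delimited GO IDs).
export LC_ALL=C   # keep sort and join byte-order consistent

sort accessions.txt > accessions.sorted.txt

zcat idmapping_selected.tab.gz \
| awk -F '\t' 'BEGIN {OFS = FS} {print $1, $7}' \
| sort -k1,1 \
| join -t $'\t' accessions.sorted.txt - \
> accession2go.tab
```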
from Aidan F. Coyle https://ift.tt/3cXSMd2
via IFTTT
Quotes obtained? Check. PO numbers generated? Check. Sequence submission forms submitted? Check.
Then there’s nothing left to do but send out the samples! I labelled screw top tubes for all samples with enough DNA and RNA yield based on the extraction metadata. Basically, this meant I labelled DNA and RNA tubes for every sample except sample 57. I used screw top tubes because they’re more secure and I didn’t want any samples spilling! Once the tubes were labelled, I grabbed the samples from the -80 ºC freezer and let them equilibrate to room temperature. I centrifuged each sample so all the liquid could easily be pipetted, then transferred it to the appropriate labelled screw top tube. Each tube went into a labelled ziploc bag, with separate bags for DNA and RNA. I placed the filled ziploc bags at the bottom of a cooler with an insulating lining, then filled the lining with dry ice so there was ample ice to keep the samples cold. Once the samples were packaged properly, I sent the package to ZymoResearch via FedEx Overnight!
from the responsible grad student https://ift.tt/3tJpbdf
via IFTTT