bowtie

#!/bin/bash
## Job Name
#SBATCH --job-name=el_01
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=3-12:00:00
## Memory per node
#SBATCH --mem=100G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/1117c/


# Eleni 20191107
# The purpose of this script is to align fastq files to a genome, and output the # alignments as bam files, whose mapping quality is greater than 30. 
# to run this script, place in same folder as the files you want to move and write ./bowtie2_cluster.sh in terminal 

source /gscratch/srlab/programs/scripts/paths.sh



find /gscratch/scrubbed/sr320/eleni/*.fq | xargs basename -s .fq | xargs -I{} bowtie2 \
-x /gscratch/scrubbed/sr320/eleni/GCA_900700415 \
-U /gscratch/scrubbed/sr320/eleni/{}.fq \
-p 28 \
-S /gscratch/scrubbed/sr320/1117c/{}.sam



find /gscratch/scrubbed/sr320/1117/*.sam | \
xargs basename -s .sam | \
xargs -I{} /gscratch/srlab/programs/samtools-1.9/samtools \
view -b -q 30 /gscratch/scrubbed/sr320/1117c/{}.sam -o /gscratch/scrubbed/sr320/1117c/{}.bam


#for file in $files
#do
    #echo ${file} # print the filename to terminal screen
    #bowtie2 -q -x GCA_900700415 -U ${file}.fq|samtools view -b -q 30 > ${file}.bam #conduct the alignment and output the file
#done




#Explanation of terms:
#bowtie2 -q -x <bt2-idx> -U <r> -S <sam>
#-q query input files are in fastq format
#-x <bt2-idx> Indexed "reference genome" filename prefix (minus trailing .X.bt2).
#-U <r> Files with unpaired reads.

# The default of bowtie2, is to write the output of the alignment to the terminal. 
# Also, bowtie does not write BAM files directly, but SAM output can be converted to BAM on the fly by piping bowtie's output to samtools view. 
# samtools options
#  -b       output BAM
# -q <integer> : discards reads whose mapping quality is below this number


#for file in $files
#do
    #echo ${file} # print the filename to terminal screen
    #bowtie2 -q -x GCA_900700415 -U ${file}.fq|samtools view -b -q 30 > ${file}.bam #conduct the alignment and output the file
#done


#q, #u, #x, #bowtie2, #conduct, #do, #done, #echo, #explanation, #for, #sbatch

Grace’s Notebook: More RNA extracted from Day 9 – RNA in all

I extracted RNA from the same tubes that I worked with a lot the past couple of weeks on the bioanalyzer and Nanodrop and qubit because there ended up being not much left. RNA was in all samples. Details of extraction and next steps in post.

Samples:

Because I used so much of the eluted RNA from the samples that I tried to get DNA free (post), I re-did those samples.

FRP trtmnt_tank day infection_status maturity tube_number
6157 NA 9 0 M 169
6143 NA 9 0 M 10
6228 NA 9 0 I 29
6150 NA 9 0 M 119
6277 NA 9 0 I 177
6156 NA 9 0 I 24
6272 NA 9 0 I 111
6178 NA 9 0 M 124
6161 NA 9 0 M 53
6136 NA 9 0 M 91
6137 NA 9 1 I 81
6112 MA 9 0 I 28

RNA isolation

Used the same process to isolate as previous posts with the Zymo RNA Microprep kit. No obvious, glaring mistakes were made as far as I’m aware.

Results:

I acciddentally didn’t make enough of the Qubit RNA HS working solution for all 12 samples plus the two standards, so I ran 10 samples, and then the remaining two separately.

Google sheets
Raw qubit (10 samples)
Raw qubit (2 sampels)

test_date qubit_tube_conc_ng.ml original_sample_conc_ng.ul sample_vol_ul dilution_factor tube_number extraction_method ul_sample-used elution_vol_ul total-yield_ng
11/15/19 14:57 157 15.7 2 100 91 Zymo_microprep 35 15 204.1
11/15/19 14:57 312 31.2 2 100 53 Zymo_microprep 35 15 405.6
11/15/19 14:57 228 22.8 2 100 124 Zymo_microprep 35 15 296.4
11/15/19 14:57 399 39.9 2 100 111 Zymo_microprep 35 15 518.7
11/15/19 14:56 282 28.2 2 100 24 Zymo_microprep 35 15 366.6
11/15/19 14:56 630 63 2 100 177 Zymo_microprep 35 15 819
11/15/19 14:56 323 32.3 2 100 119 Zymo_microprep 35 15 419.9
11/15/19 14:56 331 33.1 2 100 29 Zymo_microprep 35 15 430.3
11/15/19 14:56 229 22.9 2 100 10 Zymo_microprep 35 15 297.7
11/15/19 14:56 281 28.1 2 100 169 Zymo_microprep 35 15 365.3
11/15/19 15:04 221 22.1 2 100 28 Zymo_microprep 35 15 287.3
11/15/19 15:04 510 51 2 100 81 Zymo_microprep 35 15 663

RNA in all!!

Next steps:

Come up with pooling plan for the next 6 libraries:

Infection status Day Temp trtmnt
Infected 9 NA
Uninfected 9 NA
Infected 12 cold
Uninfected 12 cold
Infected 12 warm
Uninfected 12 warm

Pool them next week, and hopefully by then we’ll have a quote from NWGC, so that we can submit them next week for library prep and seq.

from Grace’s Lab Notebook https://ift.tt/2OeYsl2
via IFTTT

Grace’s Notebook: GSS 2019 Preparations – Differential Expression between infected and uninfected

Today, I continued to prepare a presentation for GSS 2019 (this Thursday – my talk is at 1:30pm). I met with Steven and discussed a plan for results, and he shared with me the analyses he’s done during the past couple of days. Details of results and what I’ve done in the post. I also tried doing edgeR stuff yesterday and didn’t get super far, but details will be at the end of the post for my future self (won’t do edgeR for GSS).

Stuff SR did (as far as I can understand at the moment)

sr320/nb-2019/C_bairdi/11-Deseq2.html

He joined three files

  1. Abundance-merge.txt – made from the outputs I made in kallisto (bairdisamples-kallisto-updates).
  2. sr320/nb-2019/Crab_DEGlist.txt – a list of all the Differentially Expressed Genes
  3. Cb_v1_blastx_sp_imac.tab – which is the blastoutput from the first assembled transcriptome against uniprot

The combined files resulted in: bigtable.txt

sr320/nb-2019/C_bairdi/11-Deseq2.html

In the above link, Steven used DESeq2 to get the differentially expressed genes from the Abundance-merge.txt table.

DEGs (p-value < 0.1):

## out of 137634 with nonzero total read count ## adjusted p-value < 0.1 ## LFC > 0 (up) : 4529, 3.3% ## LFC < 0 (down) : 59567, 43% ## outliers [1] : 858, 0.62% ## low counts [2] : 10635, 7.7% ## (mean count < 1) ## [1] see 'cooksCutoff' argument of ?results ## [2] see 'independentFiltering' argument of ?results 
  • Total DEGs with adj p-value <0.1 = 64096
  • DEGs in the infected group (down; negative LFC values) = 59567
  • DEGs in the uninfected group (up; positive LFC values) = 4529

DEGs (p-value < 0.05):

## out of 137634 with nonzero total read count ## adjusted p-value < 0.05 ## LFC > 0 (up) : 3062, 2.2% ## LFC < 0 (down) : 55406, 40% ## outliers [1] : 858, 0.62% ## low counts [2] : 13314, 9.7% ## (mean count < 1) ## [1] see 'cooksCutoff' argument of ?results ## [2] see 'independentFiltering' argument of ?results 
  • Total DEGs with adj p-value <0.05 = 58468
  • DEGs in the infected group (down; negative LFC values) = 55406
  • DEGs in the uninfected group (up; positive LFC values) = 3062

from Grace’s Lab Notebook https://ift.tt/2NF7msK
via IFTTT

Shelly’s Notebook: Nov. 7-9, Geoduck DMR feature analysis

Generate appropriate background

  • I previously used within-sample DMRs filtered for coverage in 3/4 individuals/group as the background.
    • HOWEVER, these did not include all sites that had the potential to be methylated
    • To create a more inclusive background, I need to look at all CG sites considered prior to determining within-sample DMRs
  • Within-sample DMRs were determined by:
    • having at least 3 differentially methylated sites (DMS)
      • DMS were determined by:
        • having 5x coverage
          • if sites are within a 30bp window their counts can be combined
        • passing an RMS test significance threshold of 0.01
    • within a max distance of 250bp
  • ATTEMPT 1: If I remove the significance threshold (and set –sig-cutoff to 1 instead of 0.01) for DMS, than I should get results from all sites considered for DMS.
    • I attempted this using this mox script: 20191108_DMRfindAllEPInoSig.sh
      • RESULTS: Resulting files indicate some other filtering is happening because because the number of regions is the nearly the same as when I ran DMRfind with significance level set to 0.01. I had expected a much longer list of regions.
  • ATTEMPT 2: To address if the software only consider CG sites with a value in the methylation counts column of the allc file I created new allc files with the coverage counts column also as the methylation counts column (assuming 100% methylation at every site).
  • ATTEMPT 3: To adjust other parameters in attempt to remove all filtering, I tried the following settings in this mox script here: 20191108_DMRfindAllEPItotCountsRes1.sh on ambient samples only as a test:
    • –resid-cutoff 1
    • –min-tests 1
    • –num-sims 1
    • –sig-cutoff 1
    • RESULTS: resulting files indicate certain criteria in the software are not met as output contains only a header

4th times the charm!

Plot background vs. sig. DMR features

I updated the Rmarkdown file I used previously to generate a bar plot showing the proportion of features that significant DMRs vs. background sites fall into.

BkgdvSigDMRsxfeatures_PropBarplot_GroupFacet.jpg

RESULTS: There are no strong differences between significant DMRs and background CpG sites in the proportion of features they overlap with, except:

  • Day145 comparison where CDS and exon features are under-represented, and repeat_regions are over-represented.
  • CDS and exon are slightly over-represented in Day135 and all ambient sample comparisons

Next steps

  • Look deeper into repeat regions (there are ~6 different catagories, so can see if any are particularly different)
  • For DMRs no in features, check the nearest features
  • Compare genes with DMRs to the ones identified by Hollie’s method
  • Continue GO analysis
    • generate appropriate GO background to use for each comparison
    • run TopGO
  • Go through manuscript methods section
    • update new stuff
    • add comments to areas of concern

from shellytrigg https://ift.tt/2Qdl9J1
via IFTTT

Grace’s Notebook: Still no clear RNA bands on Bioanalyzer; Also working to get diff. exp. for GSS

Today I went through the DNase-free plan with 4 of the extracted crab samples. There was initially quite a bit of DNA, then after I used the Turbo DNA-free Kit, there was significantly less DNA. The Bioanalyzer still didn’t look super great… detail in post. Also, I’m working through kallisto and have abundance files – working to figure out next steps in order to hopefully create a heatmap of differential expression between the infected and uninfected crabs (combining day 12 and day 26). Details in post.

DNA-free plan and results

GitHub Issue: #792

Plan:

  1. Run four samples from the above group on Qubit using DNA HS
  2. Use Turbo DNA-free Kit on those samples
  3. Run those samples on Bioanalyzer
  4. Re-run those samples on Qubit using DNA HS

Results:

  1. Done –> results
  2. Done (Did the routine DNase treatment)
  3. Done (screenshots below) ran each of the 4 samples twice (##-a and ##-b)
  4. Done – samples were eluted in ~40-45ul
    (ran dsDNA HS results; ran RNA HS results)

Electropherogram:
img

Gel:
img

Summary:

There was a LOT less DNA in the samples after the DNA-free Kit process. However, it resulted in pretty dilute RNA because I had to add enough RNA-free water in the beginning in order to have 50ul of sample for the reagents. Also, the RNA still didn’t band very well on the bioanalyzer. I think Steven said we’ll just pool them and send them off the NWGC… will confirm next week.

I’ll have to re-extract those 11 samples that I worked with this week on the qubit and bioanalyzer becuase there is not much material left!! Not a big deal. I can do that Monday or Friday (want to have enough time to prepare for GSS talk next Thursday as well).

Working for new analyses for GSS: Kallisto

Sam helped me a TON today.

He helped me set up hummingbird in FTR 213 so that I have my own user account, and that made everything a lot easier. He also helped me get git working from command line so that I can keep all of my work on GitHub in project crab.

Here’s my jupyter notebook: 2019-11-06-bairdi-kallisto.ipynb

There are so many cells because each sample (4 total) has 2 samples for both lanes (4 files total per sample).

I made individual directories in project-crab/analyses for each sample (number starting with 3#####) and each lane.
project-crab/analyses

Each directory contains three files:

  • abundance.h5
  • abundance.tsv
  • run_info.json

The interesting stuff is in the abundance.tsv files because it has all the count data.

Next steps are to create a table combining all the count data for all the samples, and then using that in DESeq2 in R to create a heatmap… I think. Still trying to figure that out…

Notes:

GitHub Issue #790

from Grace’s Lab Notebook https://ift.tt/2p1sFLL
via IFTTT

Grace’s Notebook: DNA-free, Qubit, and Bioanalyzer Plan; Kallisto work

Today we came up with a plan to address the issues that came up with the weird bioanalyzer results from yesterday (GitHub Issue #792). I also got some help with kallisto and am working on getting some analysis done for looking at differential expression (hopefully have some new results for GSS talk next week).

Weird Bioanalyzer results from yesterday

Yesterday’s post: here

What could be happening?

  • We know reagents work becaue I ran a plate with just the reagents, and all looked good
  • We know there’s RNA in my samples because I have detectable RNA from the Qubit RNA HS Kit (only will show RNA results)
  • The NanoDrop results also looked good
  • The elution substance for all samples was RNase-free water (0.1% DEPC-treated water), so it’s likely not the elution substance

It could be that there is DNA in the samples that is screwing up the Bioanalyzer’s ability to read the RNA. A way to figure out if that’s the case was come up with during today’s meeting:

  1. Run 4 of the samples that I ran on the Bioanalyzer yesterday on the Qubit using the dsDNA HS Kit
  2. Use the Turbo DNA-free Kit on those same four samples to get rid of DNA
  3. Run those same four samples on the Bioanalyzer (hoping to see JUST the RNA at that point)
  4. Re-run those samples on the Qubit using the dsDNA HS Kit to see if there’s still any DNA

I ran 2ul of 4 samples on Qubit using dsDNA HS Kit

Results:

Run ID Assay Name Test Name Test Date QubitÆ tube conc. Units Original sample conc. Units Sample Volume (µL) Dilution Factor tube_number
2019-11-07_164551 dsDNA High sensitivity Sample_#191107-164649 11/7/19 16:46 23.4 ng/mL 2.34 ng/µL 2 100 53
2019-11-07_164551 dsDNA High sensitivity Sample_#191107-164640 11/7/19 16:46 93.2 ng/mL 9.32 ng/µL 2 100 24
2019-11-07_164551 dsDNA High sensitivity Sample_#191107-164630 11/7/19 16:46 220 ng/mL 22 ng/µL 2 100 29
2019-11-07_164551 dsDNA High sensitivity Sample_#191107-164616 11/7/19 16:46 155 ng/mL 15.5 ng/µL 2 100 10

Next steps:

Will do that tomorrow. Then will follow with the Bioanalyzer, and re-run on Qubit with dsDNA HS Kit.

I’m hoping there’s enough of the samples left to do all of this!

Here’s how much volume I think is in those 4 samples:

  • Started with 15ul of eluted RNA in RNase-free water
  • Used 2ul of each sample on Qubit with RNA HS Kit (now 13 ul)
  • Used 1ul 3 separate times to run on the Bioanalyzer (now 10ul)
  • Used 1ul of each on the NanoDrop (now 9ul)
  • Used 2ul of each on the Qubit with the dsDNA HS Kit (now 7ul)

I will use 1ul of each on the Bioanalyzer after using the DNA-free kit, then will use 2ul of each sample on the Qubit with the dsDNA HS Kit. I will end up with ~4ul… I will definitely have to re-extract RNA from those samples. LUCKILY I have plenty of hemolymph in the freezer for those samples left 🙂 .

Kallisto

I showed my notebook for using kallisto and the error that I was getting. It kept saying something about how it needs to have an even number of files to work with, but that I was giving it an odd number…

Turns out the ACTUAL problem was that I had wrote “-out” in the code, where I should have written “-o” OR “-output”…

Anyway, it works now (2019-11-06-bairdi-kallisto.ipynb), but I’ll have to run this on a more powerful computer, because my laptop will not be able to handle it. And the way I have my notebook now, it’d be using the samples from owl, which would require wifi connection, and would further decrease the speed.

I’m trying to figure out which computer I should use, or if I should just try to figure out how to make an .sh script (which I wouldn’t really know how to start) and run it on Mox.

I need to make decisions soon because my talk for GSS is next Thursday at 1:30pm!

from Grace’s Lab Notebook https://ift.tt/2NTpm1l
via IFTTT

Shelly’s Notebook: Wed. Nov. 6, Geoduck DMR Summary and Feature Analysis

Summary of DMRs

  • Within-sample DMRs were called (mox script here: 20191024_DMRfindAllEPI.sh; DMR output files here: https://gannet.fish.washington.edu/metacarcinus/Pgenerosa/analyses/20191024/, see files ending in _DMR250bp_MCmax30_cov5x_rms_results_collapsed.tsv) and the total number of DMRs called for each of the following comparisons were:
    • All ambient samples over time: 249
    • All day 10 samples: 182
    • All day 135 samples: 109
    • All day 145 samples: 213
  • Filtered DMRs: Within-sample DMRs were filtered for those that show coverage in at least 3/4 individuals per experimental group (R script here: mcmax30_DMR_cov_in_0.75_SamplesPerCategory.R, Rproj here: DMR_cov_in_0.75_SamplesPerCategory.Rproj). After these steps the total number of DMRs that went into the ANOVA test for experimental group differences for each of the following comparisons were:
    • All ambient samples over time: 82
    • All day 10 samples: 99
    • All day 135 samples: 42
    • All day 145 samples: 26
  • Significant DMRs After running group statistics (post here: https://shellytrigg.github.io/209th-post/), the fraction of DMRs with an experimental group effect significant at a 1way ANOVA p.value of < 0.1 for each of the following comparisons were:
    • 38/82
    • 29/99
    • 13/42
    • 5/26

DMR Summary Table:

| comparison | num.exp.groups | num.within.sample.DMRs | num.DMRs.filtered.0.75X.indv.per.group | num.DMRs.AOV.sig.at0.1 | num.DMRs.AOV.sig.at0.05 | fraction.filtered.DMRs | fraction.total.DMR.sig.at0.1 | fraction.filtered.DMR.sig.at0.1 | fraction.total.DMR.sig.at0.05 | fraction.filtered.DMR.sig.at0.05 | |———————|—————-|————————|—————————————-|————————|————————-|————————|——————————|———————————|——————————-|———————————-| | all ambient samples | 4 | 249 | 82 | 38 | 33 | 0.33 | 0.15 | 0.46 | 0.13 | 0.40 | | all day 10 samples | 3 | 182 | 99 | 29 | 14 | 0.54 | 0.16 | 0.29 | 0.08 | 0.14 | | all day 135 samples | 3 | 109 | 42 | 13 | 9 | 0.39 | 0.12 | 0.31 | 0.08 | 0.21 | | all day 145 samples | 6 | 213 | 26 | 5 | 3 | 0.12 | 0.02 | 0.19 | 0.01 | 0.12 |

  • no DMR was significant at FDR corrected p.value < 0.05
  • 5 DMRs from all-ambient samples and 1 DMR from Day 10 samples were significant at FDR corrected p.value of < 0.1

Feature analysis including non-significant DMRs

My previous post looked into the number of different genomic features that significant DMRs overlap with. This analysis looks at the number of different genomic features that all DMRs (significant + non-significant) overlap with and compares to that of significant DMRs.

  1. I matched filtered DMRs to the most recent GFF Panopea-generosa-vv0.74.a4-merged-2019-10-07-4-46-46.gff3 using this jupyter notebook: 20191106_functional_analysis.ipynb and created files with genome feature annotations in this directory: 20191106_anno
  2. I added on the Rmarkdown script I previously started 20191102_anno.Rmd to plot the filtered DMRs (“All DMRs”, left side) next to the ANOVA significant DMRs (“significant DMRs”, right side). Each sample comparison (all ambient samples, day10 samples, day135 samples, and day145 samples) is plotted row-wise.

AllvSigDMRsxfeatures_PropBarplot_GroupFacet.jpg

  • There are no obvious differences between significant DMRs and all DMRs in the proportion of features they overlap with, except for the Day145 comparison where CDS and exon features are under-represented, and mRNA and repeat_region are over-represented.

Next steps

  • Look deeper into repeat regions (there are ~6 different catagories, so can see if any are particularly different)
  • For DMRs no in features, check the nearest features
  • Compare genes with DMRs to the ones identified by Hollie’s method
  • Continue GO analysis
    • generate appropriate GO background to use for each comparison
    • run TopGO
  • Go through manuscript methods section
    • update new stuff
    • add comments to areas of concern

from shellytrigg https://ift.tt/2NOGNAo
via IFTTT