Shelly’s Notebook: Apr 24-30, 2020 Sea lice methylome

Sample details

  • 2 lice samples from 20190523
    • 2 adult females
  • Samples from 20191118 (in fridge in 209)
    • 2 pools of adult sea lice from different populations
    • from two different salmon farms
    • different salinities? temperatures?
  • *** need more info about both of sample groups; waiting to hear back from Cristian’s lab

Methylation data

re-trimmed all samples with bismark recommended trimming parameters

aligned trimmed reads to genome

desc. stat Sealice F1 S20 Sealice F2 S22
total reads before trim 226181237 163876559
perc reads trim removed 0.56 0.43
total reads after trim 224907116 163179589
uniq aligned reads 62174415 34864893
perc uniq aligned 27.64 21.37
ambig reads 49967995 31414190
perc ambig aligned 22.22 19.25
perc no align 50.14 59.38
dedup reads 39280061 26425601
dedup reads percent 63.18 75.79
dup reads 22894354 8439292
dup reads percent 36.82 24.21
percent cpg meth 1.1 1.0
percent chg meth 1.0 0.8
percent chh meth 1.4 1.5

prepared merged CpG 5x cov files

categorized CpGs with 5x cov:

5x CpG summary tables:

Sample|Methylated CpG (>= 50%)|Sparsely methylated CpG (10% – 50%)|Unmethylated CpG (< 10%) :—–:|:—–:|:—–:|:—–: F1|2335|391515|8890795 F2|1864|2232274|6108325

CpG category|F1:F2 CpG overlap|Uniq F1 CpGs|Uniq F2 CpGs|frac F1 mCpG overlapping|frac F2 mCpG overlapping :—–:|:—–:|:—–:|:—–:|:—–:|:—–: Methylated CpG(>= 50%) | 545 | 1790 | 1319 |23.34 | 29.24 Sparsely methylated CpG(10% – 50%) | 19838 | 371677 |212436| 5.07 | 8.54 Unmethylated CpG(< 10%) |5431008 |3459787| 677317 |61.09| 88.91

5x merged CpG summary tables:

Sample Methylated CpG (>= 50%) Sparsely methylated CpG (10% – 50%) Unmethylated CpG (< 10%)
F1 1342 233941 6831337
F2 1102 165423 5130433
CpG category F1:F2 CpG overlap Uniq F1 CpGs Uniq F2 CpGs frac F1 mCpG overlapping frac F2 mCpG overlapping
Methylated CpG(>= 50%) 314 1028 788 23.40 28.49
Sparsely methylated CpG(10% – 50%) 13570 220371 151853 5.80 8.20
Unmethylated CpG(< 10%) 4824175 2007162 306258 70.62 94.03

IGV session

Example of highly methylated CpG overlapping between both samples Screen%20Shot%202020-04-29%20at%202.54.54%20PM.png

zoomed in view Screen%20Shot%202020-04-29%20at%202.55.57%20PM.png

Genomic Feature analysis

Feature num. of features
CDS 30022
exon 30022
gene 23686
mRNA 23686
  • Checked for features overlapping with CpGs methylated >=50%
Sample mCpG(>= 50%) mCpG overlapping with genes/mRNA mCpG overlapping with exon/CDS
F1 2335 114 106
F2 1864 103 95
F1.merged 1342 60 58
F2.merged 1102 45 42

Moving forward

  • Overall most mCpGs are not located in genic regions so it’s hard to say how to target this sparse methylation aside from MBD
  • Would be great to get repeat regions and other features (UTR, etc)
  • Options for 1 sequencing run:
    • resequence 2 individuals to attempt to acheive 100% genome coverage
      • this would give at least 500M reads more data per individual
      • cost: $4,940 (1 Novaseq run)
    • WGBS 4 individuals aiming for 400M reads each to attempt acheive 5x coverage of >95% of genome
      • cost: ~$5150 = $4,940 (1 Novaseq run) + library prep (~ $50/sample) + time
      • logic: 226M reads gave 65% genome coverage @5x read depth, 168M gave 45% genome coverage @5x read depth
        • assuming a linear relationship between read depth and genome coverage (2.9 * 100) + 37.5 = ~340M ; see chart below
        • 5x%20Coverage%20(read%20depth).png
    • WGBS 2-3 individuals aiming for 500M reads each to attempt to achieve 10x coverage of > 95% of genome
      • cost: ~$5100 = $4,940 (1 Novaseq run) + library prep (~ $50/sample) + time
      • logic: 226M reads gave 46% genome coverage @ 10x read depth, 168M gave 31% genome coverage @10x read depth
        • assuming a linear relationship between read depth and genome coverage (3.77 * 100) + 51.8 = ~430M ; see chart below
        • 10x%20Coverage%20(read%20depth).png
    • MBD-BS on 10 individuals from each population:

from shellytrigg https://ift.tt/2KKAxbM
via IFTTT

Grace’s Notebook: Annotate 2019 infection status DEG list; short-term paper plan

Today I read a lot about crustacean immune systems and macropinocytosis. Notes in my crab paper. Additionally, SR and I met and chatted about short-term plan. I annotated the 2019 infection status DEG list. Details in post

Annotating DEG list from 2019 infection status comparison

Rmd: annotate-2019infection_status-DEGs.Rmd

Result: 2019-infection_annot-DEGlist.tab

Need for the file was pointed out by Steven when discussing results section of the crab paper. Added link to the file in the paper.

New short-term paper and crab project plan:

Goal is to focus on the paper and smooth it out. Get Intro and other sections in a more paragraph form.

Talk a little about macropinocytosis now, but don’t get hung up on it. Can go back to it later after the big picture things are written and described.

After I get things more smooth in the paper, I’ll look for some similar papers in crustacea or crab and compare the transcriptome assemblies by number of genes ID-ed, average size and N50s, etc.

FOCUSING ON BIG PICUTRE PAPER PROGRESS CURRENTLY.

from Grace’s Lab Notebook https://ift.tt/2YnAzOW
via IFTTT

CHiP-Seq and ATAC-Seq

https://bioinformatics-core-shared-training.github.io/cruk-autumn-school-2017/ChIP/Materials/Lectures/Lecture4_Introduction%20to%20ChIP-seq%20and%20ATAC-seq_SS.pdf

Sam’s Notebook: Transcript Abundance – C.bairdi Alignment-free with Salmon Using 2020-GW Data on Mox

Clarified with Steven an approach for tackling multi-condition comparisons (see this GitHub Issue). As such, I need to have individual transcript abundances for each sample from the 2020 Genewiz RNAseq data before I can proceed. So, I ran salmon (v1.2.1) to perform an alignment-free set of transcript abundances. It’s ridiculously fast, btw…

This was run on Mox and used the C.bairdi-specific reads that were extracted using MEGAN6 on 202020330.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=cbai_DEG_basic ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200429_cbai_salmon_2020GW_transcript_abundances # Script to generate set of transcript abundances for all C.bairdi Genewiz 2020 data. # # C.bairdi-specific reads were extracted with MEGAN6: # https://robertslab.github.io/sams-notebook/2020/03/30/RNAseq-Reads-Extractions-C.bairdi-Taxonomic-Reads-Extractions-with-MEGAN6-on-swoose.html # # Transcriptome was produced here: https://robertslab.github.io/sams-notebook/2020/03/30/Transcriptome-Assembly-C.bairdi-with-MEGAN6-Taxonomy-specific-Reads-with-Trinity-on-Mox.html # Transcriptome is the same as: cbai_transcriptome_v1.5.fasta # # Salmon index generated during a previous gene expression analysis: # https://robertslab.github.io/sams-notebook/2020/04/22/Gene-Expression-C.bairdi-Pairwise-DEG-Comparisons-with-2019-RNAseq-using-Trinity-Salmon-EdgeR-on-Mox.html ################################################################################################################### # BEGIN USER SETTINGS # Programs array declare -A programs_array programs_array=([salmon]="/gscratch/srlab/programs/salmon-1.2.1_linux_x86_64/bin/salmon") ## Designate input files and locations fastq_dir="/gscratch/srlab/sam/data/C_bairdi/RNAseq/" salmon_index="/gscratch/srlab/sam/data/C_bairdi/transcriptomes/20200408.C_bairdi.megan.Trinity.fasta.salmon.idx" # Set number of CPU threads # Salmon default is 56 threads - so not needed # threads=28 # END USER SETTINGS #################################################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log # Caputure working directory #wd="$(pwd)" # Capture program options ## NOTE: This particular instance is specific to salmon! for program in "${!programs_array[@]}" do { echo "Program options for ${programs_array[$program]}: " echo "" ${programs_array[$program]} quant --help echo "" ${programs_array[$program]} quant --help-reads echo "" echo "

Grace’s Notebook: Made new crab transcriptome stress response gene list; notes on crab immunity and macropinocytosis

Today I made a new crab stress response table with more information. I also started an outline for the intro based on the meeting I had with Steven yesterday. I’ve been reading/finding resources about crab/crustacean immunity and macropinocytosis. Will add more to intro and discussion as I learn more.

New stress response gene list

Rmd: 042720-get_crab-stress_response-genelist.Rmd

New list: crab_stress-response-genes.tab

The new list now has BLAST output from the crab transcriptome and the Uniprot-SP-GO information join-ed with the “stress response” list.

Crab paper

Working on:

  • Intro (reading about crab immunity to add to intro)
  • Results (will describe transcriptome more)
  • Discussion (will add more about macropinocytosis and what it means)

from Grace’s Lab Notebook https://ift.tt/2y46RDQ
via IFTTT

Sam’s Notebook: Go To Goslim C.bairdi Enriched Go Terms From 20200422 Degs

pu— layout: post title: GO to GOslim – C.bairdi Enriched GO Terms from 20200422 DEGs date: ‘2020-04-27 12:04’ tags:

  • Tanner crab
  • GO
  • GOslim
  • gene ontology
  • R
  • Chionoecetes bairdi categories:
  • Miscellaneous

    After running pairwise comparisons and identify differentially expressed genes (DEGs) on 20200422 and finding enriched gene ontology terms, I decided to map the GO terms to Biological Process GOslims. Additionally, I decided to try another level of comparison (I’m not sure how valid it is), whereby I will count the number of GO terms assigned to each GOslim and then calculate the percentage of GOterms that get assigned to each of the GOslim categories. The idea being that it might help identify Biological Processes that are “favored” in a given set of DEGs. I decided to set up “fancy” pyramid plots to view a given set of GO-GOslims for each DEG comparison.

All work was done in R. The initial counting/percentage calculations were done with the following R script (note: all of the followin R code are part of an R Project – link is in the RESULTS section of notebook). The R script uses a “flattened” set of the enriched GO terms identified by Trinity/GOseq, where flattened means one GO term per row. So, a gene may be represented multiple times, in multiple rows if there were multiple GO terms assigned to it by Trinity/GOseq.

The script then relies on the GSEABase package (Bioconductor) and the GO Consortium’s “Generic GO subset”:

After that, I plotted the outputs.

GO to GOslim R script:

library(GSEABase) library(tidyverse) ######################################################################### # Scipt to map C.bairdi DEG enriched GO terms to GOslims # and identify the GO terms contributing to each GOslim ######################################################################### ### Download files and specify destination directory download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/D9-D12/edgeR.169728.dir/salmon.gene.counts.matrix.D12_vs_D9.edgeR.DE_results.P0.05_C1.D12-UP.subset.GOseq.enriched.flattened", destfile = "./data/D9-D12/salmon.gene.counts.matrix.D12_vs_D9.edgeR.DE_results.P0.05_C1.D12-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/D9-D12/edgeR.169728.dir/salmon.gene.counts.matrix.D12_vs_D9.edgeR.DE_results.P0.05_C1.D9-UP.subset.GOseq.enriched.flattened", destfile = "./data/D9-D12/salmon.gene.counts.matrix.D12_vs_D9.edgeR.DE_results.P0.05_C1.D9-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/D9-D26/edgeR.200352.dir/salmon.gene.counts.matrix.D26_vs_D9.edgeR.DE_results.P0.05_C1.D9-UP.subset.GOseq.enriched.flattened", destfile = "./data/D9-D26/salmon.gene.counts.matrix.D26_vs_D9.edgeR.DE_results.P0.05_C1.D9-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/D9-D26/edgeR.200352.dir/salmon.gene.counts.matrix.D26_vs_D9.edgeR.DE_results.P0.05_C1.D26-UP.subset.GOseq.enriched.flattened", destfile = "./data/D9-D26/salmon.gene.counts.matrix.D26_vs_D9.edgeR.DE_results.P0.05_C1.D26-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/D12-D26/edgeR.230922.dir/salmon.gene.counts.matrix.D12_vs_D26.edgeR.DE_results.P0.05_C1.D26-UP.subset.GOseq.enriched.flattened", destfile = "./data/D12-D26/salmon.gene.counts.matrix.D12_vs_D26.edgeR.DE_results.P0.05_C1.D26-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/ambient-cold/edgeR.267393.dir/salmon.gene.counts.matrix.ambient_vs_cold.edgeR.DE_results.P0.05_C1.ambient-UP.subset.GOseq.enriched.flattened", destfile = "./data/ambient-cold/salmon.gene.counts.matrix.ambient_vs_cold.edgeR.DE_results.P0.05_C1.ambient-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/ambient-cold/edgeR.267393.dir/salmon.gene.counts.matrix.ambient_vs_cold.edgeR.DE_results.P0.05_C1.cold-UP.subset.GOseq.enriched.flattened", destfile = "./data/ambient-cold/salmon.gene.counts.matrix.ambient_vs_cold.edgeR.DE_results.P0.05_C1.cold-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/ambient-warm/edgeR.297991.dir/salmon.gene.counts.matrix.ambient_vs_warm.edgeR.DE_results.P0.05_C1.warm-UP.subset.GOseq.enriched.flattened", destfile = "./data/ambient-warm/salmon.gene.counts.matrix.ambient_vs_warm.edgeR.DE_results.P0.05_C1.warm-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/ambient-warm/edgeR.297991.dir/salmon.gene.counts.matrix.ambient_vs_warm.edgeR.DE_results.P0.05_C1.ambient-UP.subset.GOseq.enriched.flattened", destfile = "./data/ambient-warm/salmon.gene.counts.matrix.ambient_vs_warm.edgeR.DE_results.P0.05_C1.ambient-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/cold-warm/edgeR.328585.dir/salmon.gene.counts.matrix.cold_vs_warm.edgeR.DE_results.P0.05_C1.warm-UP.subset.GOseq.enriched.flattened", destfile = "./data/cold-warm/salmon.gene.counts.matrix.cold_vs_warm.edgeR.DE_results.P0.05_C1.warm-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/cold-warm/edgeR.328585.dir/salmon.gene.counts.matrix.cold_vs_warm.edgeR.DE_results.P0.05_C1.cold-UP.subset.GOseq.enriched.flattened", destfile = "./data/cold-warm/salmon.gene.counts.matrix.cold_vs_warm.edgeR.DE_results.P0.05_C1.cold-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/infected-uninfected/edgeR.132470.dir/salmon.gene.counts.matrix.infected_vs_uninfected.edgeR.DE_results.P0.05_C1.infected-UP.subset.GOseq.enriched.flattened", destfile = "./data/infected-uninfected/salmon.gene.counts.matrix.infected_vs_uninfected.edgeR.DE_results.P0.05_C1.infected-UP.subset.GOseq.enriched.flattened") download.file(url = "https://gannet.fish.washington.edu/Atumefaciens/20200422_cbai_DEG_basic_comparisons/infected-uninfected/edgeR.132470.dir/salmon.gene.counts.matrix.infected_vs_uninfected.edgeR.DE_results.P0.05_C1.uninfected-UP.subset.GOseq.enriched.flattened", destfile = "./data/infected-uninfected/salmon.gene.counts.matrix.infected_vs_uninfected.edgeR.DE_results.P0.05_C1.uninfected-UP.subset.GOseq.enriched.flattened") ### Set false discovery rate (FDR) filter, if desired fdr <- as.character("1.0") ### Create list of files goseq_files <- list.files(path = "./data", pattern = "\\.GOseq.[de]", recursive = TRUE, full.names = TRUE) ### Set output filename suffix output_suffix=("GOslims.csv") ### Strip path from goseq files goseq_filename <- basename(goseq_files) ### Vector of GOslim ontologies (e.g. Biological Process = BP, Molecular Function = MF, Cellular Component = CC) ontologies <- c("BP", "CC", "MF") for (slim_ontology in ontologies) { ### Set GOOFFSPRING database, based on ontology group set above go_offspring <- paste("GO", slim_ontology, "OFFSPRING", sep = "") for (item in goseq_files) { ## Get max number of fields # Needed to handle reading in file with different number of columns in each row # as there may be multiple max_fields <- max(count.fields(item, sep = "\t"), na.rm = TRUE) ## Read in tab-delimited GOseq file # Use "max_fields" to populate all columns with a sequentially numbered header go_seqs <- read.delim(item, col.names = paste0("V",seq_len(max_fields))) ## Filter enriched GOterms with false discovery rate goseqs_fdr <- filter(go_seqs, V8 <= as.numeric(fdr)) ## Grab just the individual GO terms from the "category" column) goterms <- as.character(goseqs_fdr$V1) ### Use GSEA to map GO terms to GOslims ## Store goterms as GSEA object myCollection <- GOCollection(goterms) ## Use generic GOslim file to create a GOslim collection # I downloaded goslim_generic.obo from http://geneontology.org/docs/go-subset-guide/ # then i moved it to the R library for GSEABase in the extdata folder # in addition to using the command here - I think they're both required. slim <- getOBOCollection("./data/goslim_generic.obo") ## Map GO terms to GOslims and select Biological Processes group slimsdf <- goSlim(myCollection, slim, slim_ontology) ## Need to know the 'offspring' of each term in the ontology, and this is given by the data in: # GO.db::getFromNamespace(go_offspring, "GO.db") ## Create function to parse out GO terms assigned to each GOslim ## Courtesy Bioconductor Support: https://support.bioconductor.org/p/128407/ mappedIds <- function(df, collection, OFFSPRING) { map <- as.list(OFFSPRING[rownames(df)]) mapped <- lapply(map, intersect, ids(collection)) df[["go_terms"]] <- vapply(unname(mapped), paste, collapse = ";", character(1L)) df } ## Run the function slimsdf <- mappedIds(slimsdf, myCollection, getFromNamespace(go_offspring, "GO.db")) ## Provide column name for first column slimsdf <- cbind(GOslim = rownames(slimsdf), slimsdf) rownames(slimsdf) <- NULL ### Prep output file naming structure ## Split filenames on periods ## Creates a list split_filename <- strsplit(item, ".", fixed = TRUE) ## Split filename on directories ## Creates a list split_dirs <- strsplit(item, "/", fixed = TRUE) ## Slice split_filename list from position 9 to the last position of the list ## Paste these together using a period goseq_filename_split <-paste(split_filename[[1]][9:lengths(split_filename)], collapse = ".") ## Slice split_dirs list at position ## Paste elements together to form output filename fdr_file_out <- paste("FDR", fdr, sep = "_") outfilename <- paste(goseq_filename_split, fdr_file_out, slim_ontology ,output_suffix, collapse = ".", sep = ".") ## Set output file destination and name ## Adds proper subdirectory from split_dirs list outfile_dest <- file.path("./analyses", split_dirs[[1]][3], outfilename) ## Write output file write.csv(slimsdf, file = outfile_dest, quote = FALSE, row.names = FALSE) } } 

Since I got “lazy” and didn’t want to try to figure out how to properly loop through all of the output files from the above script, I just made individual scripts to plot each set of comparison GO terms percentages assigned to GOslims. Here’s an example:

# Script to generate a "pyramid" plot # comparing the percentages of enriched GO terms assinged # to each category of Biological Process GOslims library(dplyr) library(ggplot2) ##################################################### # Set the following variables for the appropriate comparisons/files ##################################################### # Comparison comparison <- "infected-uninfected" # Treatments treatment_01 <- "infected" treatment_02 <- "uninfected" # Read in first comparsion files df1 <- read.csv("analyses/infected-uninfected/P0.05_C1.infected-UP.subset.GOseq.enriched.flattened.FDR_1.0.BP.GOslims.csv") # Read in second comparison file df2 <- read.csv("analyses/infected-uninfected/P0.05_C1.uninfected-UP.subset.GOseq.enriched.flattened.FDR_1.0.BP.GOslims.csv") ###################################################### # CHANGES BELOW HERE ARE PROBABLY NOT NECESSARY ###################################################### # GOslim categories ontologies <- c("BP", "CC", "MF") # Remove generic "biological_process" category df1 <- df1[df1$GOslim != "GO:0008150",] df2 <- df2[df2$GOslim != "GO:0008150",] # Remove generic "cellular_component" category df1 <- df1[df1$GOslim != "GO:0005575",] df2 <- df2[df2$GOslim != "GO:0005575",] # Remove generic "molecular_function" category df1 <- df1[df1$GOslim != "GO:0003674",] df2 <- df2[df2$GOslim != "GO:0003674",] # Select columns df1 <- df1 %>% select(Term, Percent) df2 <- df2 %>% select(Term, Percent) # Create treatment column and assign term to all rows df1$treatment <- treatment_01 df2$treatment <- treatment_02 # Concatenate dataframes df3 <- rbind(df1, df2) # Filename for plot pyramid <- paste(comparison, "GOslims", "BP", "png", sep = ".") pyramid_path <- paste(comparison, pyramid, sep = "/") pyramid_dest <- file.path("./analyses", pyramid_path) # "Open" PNG file for saving subsequent plot png(pyramid_dest, width = 600, height = 1200, units = "px", pointsize = 12) # Create "pyramid" plot ggplot(df3, aes(x = Term, fill = treatment, y = ifelse(test = treatment == treatment_01, yes = -Percent, no = Percent))) + geom_bar(stat = "identity") + scale_y_continuous(labels = abs, limits = max(df3$Percent) * c(-1,1)) + labs(title = "Percentages of GO terms assigned to BP GOslims", x = "GOslim", y = "Percent GO terms in GOslim") + scale_x_discrete(expand = c(-1,0)) + coord_flip() # Close PNG file dev.off() 

Yaamini’s Notebook: MethCompare Part 8

Characterizing CpGs in 5x data

After confirming last week that mapping to a pan-genome wasn’t changing our output, I went ahead with the CpG characterization pipeline. We decided to move ahead with 5x data with the option to return to 10x data later. I decided to create a new Jupyter notebook for running 5x data through the pipeline.

Repo fiasco

Turns out we haven’t mastered how to have five people working on a repo simultaneously without having pull-and-push issues. When Sam reverted to an old repo version, I lost the checksum code I worked on earlier. The first thing I did was add the code back in to my 10x Jupyter notebook and to the new notebook I created for 5x data. This isn’t the first time a repo commit from someone else lost some of my code, so hopefully we figure out!

M. capitata

Using 5x files provided more CpG data to characterize, but the percentages of methylated CpGs total and in various genome features were mostly consistent between 5x and 10x data.

Table 1. Methylated, sparsely methylated, and unmethylated CpGs in M. capitata

Sample Method CpG with Data Methylated CpG Sparsely Methylated CpG Unmethylated CpG
10 WGBS 4571288 450582 (9.9%) 547868 (12.0%) 3572838 (78.2%)
11 WGBS 4661716 528902 (11.3%) 517805 (11.1%) 3615009 (77.5%)
12 WGBS 8791700 1059904 (12.1%) 1000337 (11.4%) 6731459 (76.6%)
13 RRBS 3173254 257741 (8.1%) 152042 (4.8%) 2763471 (87.1%)
14 RRBS 2648697 184742 (7.0%) 135052 (5.1%) 2328903 (87.9%)
15 RRBS 3176517 231347 (7.3%) 179454 (5.6%) 2765716 (87.1%)
16 MBD-BSSeq 583599 106695 (18.3%) 74839 (12.8%) 402065 (68.9%)
17 MBD-BSSeq 242390 45506 (18.8%) 28850 (11.9%) 168034 (69.3%)
18 MBD-BSSeq 153392 29468 (19.2%) 16793 (10.9%) 107131 (69.8%)

Table 2. Number and percent of CpGs that overlap with genes in M. capitata.

Sample Method CpGs with Data Methylated CpGs Sparsely Methylated CpGs Unmethylated CpGs
10 WGBS 1683069 230343 (13.7%) 196827 (11.7%) 1255899 (74.6%)
11 WGBS 1740149 267181 (15.3%) 188348 (10.8%) 1284620 (73.8%)
12 WGBS 3204885 533230 (16.6%) 348967 (10.9%) 2322688 (72.5%)
13 RRBS 1062577 123271 (11.6%) 53235 (5.0%) 886071 (83.4%)
14 RRBS 895852 89865 (10.0%) 48499 (5.4%) 757488 (84.6%)
15 RRBS 1064769 113217 (10.6%) 64106 (6.0%) 887446 (83.3%)
16 MBD-BSSeq 208503 48436 (23.2%) 25316 (12.1%) 134751 (64.6%)
17 MBD-BSSeq 83519 20458 (24.5%) 9441 (11.3%) 53620 (64.2%)
18 MBD-BSSeq 50302 12960 (25.8%) 4843 (9.6%) 32499 (64.6%)

Table 3. Number and percent of CpGs that overlap with CDS in M. capitata.

Sample Method CpGs with Data Methylated CpGs Sparsely Methylated CpGs Unmethylated CpGs
10 WGBS 476579 54412 (11.4%) 60266 (12.6%) 361901 (75.9%)
11 WGBS 496887 64070 (12.9%) 58258 (11.7%) 374559 (75.4%)
12 WGBS 791965 113396 (14.3%) 89455 (11.3%) 589114 (74.4%)
13 RRBS 200808 18158 (9.0%) 11383 (5.7%) 171267 (85.3%)
14 RRBS 172313 12833 (7.4%) 10486 (6.1%) 148994 (86.5%)
15 RRBS 203147 15130 (7.4%) 14015 (6.9%) 174002 (85.7%)
16 MBD-BSSeq 72490 13535 (18.7%) 9666 (13.3%) 49289 (68.0%)
17 MBD-BSSeq 29305 5649 (19.3%) 3690 (12.6%) 19966 (68.1%)
18 MBD-BSSeq 17619 3992 (22.7%) 1907 (10.8%) 11720 (66.5%)

Table 4. Number and percent of CpGs that overlap with introns in M. capitata.

Sample Method CpGs with Data Methylated CpGs Sparsely Methylated CpGs Unmethylated CpGs
10 WGBS 1207609 176102 (14.6%) 136671 (11.3%) 894836 (74.1%)
11 WGBS 1244403 203311 (16.3%) 130195 (10.5%) 910897 (73.2%)
12 WGBS 2415036 420249 (17.4%) 259702 (10.8%) 1735085 (71.8%)
13 RRBS 862191 105145 (12.2%) 41866 (4.9%) 715180 (82.9%)
14 RRBS 723933 77053 (10.6%) 38032 (5.3%) 608848 (84.1%)
15 RRBS 862091 98111 (11.4%) 50116 (5.8%) 713864 (82.8%)
16 MBD-BSSeq 136153 34929 (25.7%) 15666 (11.5%) 85558 (62.8%)
17 MBD-BSSeq 54275 14821 (27.3%) 5755 (10.6%) 33699 (62.1%)
18 MBD-BSSeq 32717 8973 (27.4%) 2942 (9.0%) 20802 (63.6%)

Table 5. Number and percent of CpGs that overlap with intergenic regions in M. capitata.

Sample Method CpGs with Data Methylated CpGs Sparsely Methylated CpGs Unmethylated CpGs
10 WGBS 2888219 220239 (7.6%) 351041 (12.2%) 2316939 (80.2%)
11 WGBS 2921567 261721 (9.0%) 329457 (11.3%) 2330389 (79.8%)
12 WGBS 5586815 526674 (9.4%) 651370 (11.7%) 4408771 (78.9%)
13 RRBS 2110677 134470 (6.4%) 98807 (4.7%) 1877400 (88.9%)
14 RRBS 1752845 94877 (5.4%) 86553 (4.9%) 1571415 (89.6%)
15 RRBS 2111748 118130 (5.6%) 115348 (5.5%) 1878270 (88.9%)
16 MBD-BSSeq 375096 58259 (15.5%) 49523 (13.2%) 267314 (71.3%)
17 MBD-BSSeq 158871 25048 (15.8%) 19409 (12.2%) 114414 (72.0%)
18 MBD-BSSeq 103090 16508 (16.0%) 11950 (11.6%) 74632 (72.4%)

P. acuta

Table 6. Methylated, sparsely methylated, and unmethylated CpGs in P. acuta

Sample Method CpG with Data Methylated CpG Sparsely Methylated CpG Unmethylated CpG
1 WGBS 5546051 110364 (2.0%) 367019 (6.6%) 5068668 (91.4%)
2 WGBS 6358722 126440 (2.0%) 345887 (5.4%) 5886395 (92.6%)
3 WGBS 5866786 124819 (2.1%) 385346 (6.6%) 5356621 (91.3%)
4 RRBS 1835561 31047 (1.7%) 137700 (7.5%) 1666814 (90.8%)
5 RRBS 1451229 30345 (2.1%) 64837 (4.5%) 1356047 (93.4%)
6 RRBS 1517358 26617 (1.8%) 89246 (5.9%) 1401495 (92.4%)
7 MBD-BSSeq 2640625 258222 (9.8%) 296059 (11.2%) 2086344 (79.0%)
8 MBD-BSSeq 539008 213342 (39.6%) 80086 (14.9%) 245580 (45.6%)
9 MBD-BSSeq 2732607 255370 (9.3%) 337855 (12.4%) 2139382 (78.3%)

Table 7. Number and percent of CpGs that overlap with genes in P. acuta.

Sample Method CpG with Data Methylated CpG Sparsely Methylated CpG Unmethylated CpG
1 WGBS 2466992 73959 (3.0%) 157337 (6.4%) 2235696 (90.6%)
2 WGBS 2761956 85861 (3.1%) 144292 (5.2%) 2531803 (91.7%)
3 WGBS 2588278 82377 (3.2%) 161791 (6.3%) 2344110 (90.6%)
4 RRBS 776123 13588 (1.8%) 56290 (7.3%) 706245 (91.0%)
5 RRBS 607225 12789 (2.1%) 25954 (4.3%) 568482 (93.6%)
6 RRBS 639570 11396 (1.8%) 35633 (5.6%) 592541 (92.6%)
7 MBD-BSSeq 1253805 118291 (9.4%) 120485 (9.6%) 1015029 (81.0%)
8 MBD-BSSeq 220096 86074 (39.1%) 27902 (12.7%) 106120 (48.2%)
9 MBD-BSSeq 1281644 125536 (9.8%) 139034 (10.8%) 1017074 (79.4%)

Table 8. Number and percent of CpGs that overlap with CDS in P. acuta.

Sample Method CpG with Data Methylated CpG Sparsely Methylated CpG Unmethylated CpG
1 WGBS 1494340 59188 (4.0%) 89863 (6.0%) 1345289 (90.0%)
2 WGBS 1620632 66365 (4.1%) 76868 (4.7%) 1477399 (91.2%)
3 WGBS 1552715 65245 (4.2%) 89654 (5.8%) 1397816 (90.0%)
4 RRBS 500779 9644 (1.9%) 36616 (7.3%) 454519 (90.8%)
5 RRBS 395869 8808 (2.2%) 17345 (4.4%) 369716 (93.4%)
6 RRBS 415529 7954 (1.9%) 23679 (5.7%) 383896 (92.4%)
7 MBD-BSSeq 931250 92559 (9.9%) 87369 (9.4%) 751322 (80.7%)
8 MBD-BSSeq 184606 70696 (38.3%) 21973 (11.9%) 91937 (49.8%)
9 MBD-BSSeq 924825 95614 (10.3%) 98699 (10.7%) 730512 (79.0%)

Table 9. Number and percent of CpGs that overlap with introns in P. acuta.

Sample Method CpG with Data Methylated CpG Sparsely Methylated CpG Unmethylated CpG
1 WGBS 1840138 41787 (2.3%) 122271 (6.6%) 1676080 (91.1%)
2 WGBS 2113295 51567 (2.4%) 117738 (5.6%) 1943990 (92.0%)
3 WGBS 1946150 47352 (2.4%) 128122 (6.6%) 1770676 (91.0%)
4 RRBS 539524 8446 (1.6%) 38589 (7.2%) 492489 (91.3%)
5 RRBS 414147 8375 (2.0%) 17239 (4.2%) 388533 (93.8%)
6 RRBS 439844 7162 (1.6%) 23825 (5.4%) 408857 (93.0%)
7 MBD-BSSeq 746710 64572 (8.6%) 71046 (9.5%) 611092 (81.8%)
8 MBD-BSSeq 101087 42068 (41.6%) 13365 (13.2%) 45654 (45.2%)
9 MBD-BSSeq 794295 72530 (9.1%) 85117 (10.7%) 636648 (80.2%)

Table 10. Number and percent of CpGs that overlap with intergenic regions in P. acuta.

Sample Method CpG with Data Methylated CpG Sparsely Methylated CpG Unmethylated CpG
1 WGBS 3080835 3080835 (1.2%) 209781 (6.8%) 2834593 (92.0%)
2 WGBS 3598848 40642 (1.1%) 201712 (5.6%) 3356494 (93.3%)
3 WGBS 3280357 42507 (1.3%) 223666 (6.8%) 3014184 (91.9%)
4 RRBS 1060062 17473 (1.6%) 81473 (7.7%) 961116 (90.7%)
5 RRBS 844509 17581 (2.1%) 38900 (4.6%) 788028 (93.3%)
6 RRBS 878266 15232 (1.7%) 53637 (6.1%) 809397 (92.2%)
7 MBD-BSSeq 1387623 140057 (10.1%) 175668 (12.7%) 1071898 (77.2%)
8 MBD-BSSeq 319125 127376 (39.9%) 52215 (16.4%) 139534 (43.7%)
9 MBD-BSSeq 1451853 129949 (9.0%) 198940 (13.7%) 1122964 (77.3%)

Going forward

  1. Create figures for CpG characterization in various genome features
  2. Update code for methylation frequency distribution figure
  3. Generate repeat tracks for each species
  4. Find a way to generate tables programmatically
  5. Figure out how to meaningfully concatenate data for each method
  6. Figure out methylation island analysis

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/35ak8GU
via IFTTT

Sam’s Notebook: Gene Expression – C.bairdi Pairwise DEG Comparisons with 2019 RNAseq using Trinity-Salmon-EdgeR on Mox

Per a Slack request, Steven asked me to take the Genewize RNAseq data (received 2020318) through edgeR. Ran the analysis using the Trinity differential expression pipeline:

Here’re the core input files used for this analysis:

The analyses will perform the following pairwise comparisons:

  • infected-uninfected
  • D9-D12
  • D9-D26
  • D12-D26
  • ambient-cold
  • ambient-warm
  • cold-warm

It will identify differentially expressed genes with >=2-fold log change in expression and a false discovery rate of <=0.05. Additionally, it will perform gene ontology (GO) enrichment analysis using GOseq.

As a brief aside, I’m pretty stoked about the SBATCH script below! It automates FastQ file selection for each comparison, creates appropriately named subdirectories and creates proper Trinity samples list file needed.

After running the DEG analysis, I “flattened” the enriched GO terms files for later use in R to map these GO terms to GOslims. That was run separately and the script is after the SBATCH script.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=cbai_DEG_basic ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200422_cbai_DEG_basic_comparisons # This is a script to identify differentially expressed genes (DEGs) in C.bairdi # using pairwise comparisions of from just the "2020-GW" (i.e. just Genewiz) RNAseq data # which has been taxonomically selected for all Arthropoda reads. See Sam's notebook from 20200419 # https://robertslab.github.io/sams-notebook/ # Script will run Trinity's builtin differential gene expression analysis using: # - Salmon alignment-free transcript abundance estimation # - edgeR # Cutoffs of 2-fold difference in expression and FDR of <=0.05. ################################################################################### # These variables need to be set by user fastq_dir="/gscratch/srlab/sam/data/C_bairdi/RNAseq/" fasta_prefix="20200408.C_bairdi.megan.Trinity" transcriptome_dir="/gscratch/srlab/sam/data/C_bairdi/transcriptomes" trinotate_feature_map="${transcriptome_dir}/20200409.cbai.trinotate.annotation_feature_map.txt" go_annotations="${transcriptome_dir}/20200409.cbai.trinotate.go_annotations.txt" # Array of the various comparisons to evaluate # Each condition in each comparison should be separated by a "-" comparisons_array=( infected-uninfected \ D9-D12 \ D9-D26 \ D12-D26 \ ambient-cold \ ambient-warm \ cold-warm ) # Functions # Expects input (i.e. "$1") to be in the following format: # e.g. 20200413.C_bairdi.113.D9.uninfected.cold.megan_R2.fq get_day () { day=$(echo "$1" | awk -F"." '{print $4}'); } get_inf () { inf=$(echo "$1" | awk -F"." '{print $5}'); } get_temp () { temp=$(echo "$1" | awk -F"." '{print $6}'); } ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log wd="$(pwd)" threads=28 ## Designate input file locations transcriptome="${transcriptome_dir}/${fasta_prefix}.fasta" fasta_seq_lengths="${transcriptome_dir}/${fasta_prefix}.fasta.seq_lens" gene_map="${transcriptome_dir}/${fasta_prefix}.fasta.gene_trans_map" transcriptome="${transcriptome_dir}/${fasta_prefix}.fasta" # Standard output/error files diff_expr_stdout="diff_expr_stdout.txt" diff_expr_stderr="diff_expr_stderr.txt" matrix_stdout="matrix_stdout.txt" matrix_stderr="matrix_stderr.txt" salmon_stdout="salmon_stdout.txt" salmon_stderr="salmon_stderr.txt" tpm_length_stdout="tpm_length_stdout.txt" tpm_length_stderr="tpm_length_stderr.txt" trinity_DE_stdout="trinity_DE_stdout.txt" trinity_DE_stderr="trinity_DE_stderr.txt" edgeR_dir="" #programs trinity_home=/gscratch/srlab/programs/trinityrnaseq-v2.9.0 trinity_annotate_matrix="${trinity_home}/Analysis/DifferentialExpression/rename_matrix_feature_identifiers.pl" trinity_abundance=${trinity_home}/util/align_and_estimate_abundance.pl trinity_matrix=${trinity_home}/util/abundance_estimates_to_matrix.pl trinity_DE=${trinity_home}/Analysis/DifferentialExpression/run_DE_analysis.pl diff_expr=${trinity_home}/Analysis/DifferentialExpression/analyze_diff_expr.pl trinity_tpm_length=${trinity_home}/util/misc/TPM_weighted_gene_length.py # Loop through each comparison # Will create comparison-specific direcctories and copy # appropriate FastQ files for each comparison. # After file transfer, will create necessary sample list file for use # by Trinity for running differential gene expression analysis and GO enrichment. for comparison in "${!comparisons_array[@]}" do # Assign variables cond1_count=0 cond2_count=0 comparison=${comparisons_array[${comparison}]} comparison_dir=${wd}/${comparison}/ salmon_gene_matrix=${comparison_dir}/salmon.gene.TMM.EXPR.matrix salmon_iso_matrix=${comparison_dir}/salmon.isoform.TMM.EXPR.matrix samples=${comparison_dir}${comparison}.samples.txt # Extract each comparison from comparisons array # Conditions must be separated by a "-" cond1=$(echo "${comparison}" | awk -F"-" '{print $1}') cond2=$(echo "${comparison}" | awk -F"-" '{print $2}') mkdir --parents "${comparison}" cd "${comparison}" || exit # Series of if statements to identify which FastQ files to rsync to working directory if [[ "${comparison}" == "infected-uninfected" ]]; then rsync --archive --verbose ${fastq_dir}*.fq . fi if [[ "${comparison}" == "D9-D12" ]]; then for fastq in "${fastq_dir}"*.fq do get_day "${fastq}" if [[ "${day}" == "D9" || "${day}" == "D12" ]]; then rsync --archive --verbose "${fastq}" . fi done fi if [[ "${comparison}" == "D9-D26" ]]; then for fastq in "${fastq_dir}"*.fq do get_day "${fastq}" if [[ "${day}" == "D9" || "${day}" == "D26" ]]; then rsync --archive --verbose "${fastq}" . fi done fi if [[ "${comparison}" == "D12-D26" ]]; then for fastq in "${fastq_dir}"*.fq do get_day "${fastq}" if [[ "${day}" == "D12" || "${day}" == "D26" ]]; then rsync --archive --verbose "${fastq}" . fi done fi if [[ "${comparison}" == "ambient-cold" ]]; then #statements for fastq in "${fastq_dir}"*.fq do get_temp "${fastq}" if [[ "${temp}" == "ambient" || "${temp}" == "cold" ]]; then rsync --archive --verbose "${fastq}" . fi done fi if [[ "${comparison}" == "ambient-warm" ]]; then for fastq in "${fastq_dir}"*.fq do get_temp "${fastq}" if [[ "${temp}" == "ambient" || "${temp}" == "warm" ]]; then rsync --archive --verbose "${fastq}" . fi done fi if [[ "${comparison}" == "cold-warm" ]]; then for fastq in "${fastq_dir}"*.fq do get_temp "${fastq}" if [[ "${temp}" == "cold" || "${temp}" == "warm" ]]; then rsync --archive --verbose "${fastq}" . fi done fi # Create reads array # Paired reads files will be sequentially listed in array (e.g. 111_R1 111_R2) reads_array=(*.fq) echo "" echo "Created reads_array" # Loop to create sample list file # Sample file list is tab-delimited like this: # cond_A cond_A_rep1 A_rep1_left.fq A_rep1_right.fq # cond_A cond_A_rep2 A_rep2_left.fq A_rep2_right.fq # cond_B cond_B_rep1 B_rep1_left.fq B_rep1_right.fq # cond_B cond_B_rep2 B_rep2_left.fq B_rep2_right.fq # Increment by 2 to process next pair of FastQ files for (( i=0; i<${#reads_array[@]} ; i+=2 )) do echo "" echo "Evaluating ${reads_array[i]} and ${reads_array[i+1]}" get_day "${reads_array[i]}" get_inf "${reads_array[i]}" get_temp "${reads_array[i]}" echo "" echo "Got day (${day}), infection status (${inf}), and temp (${temp})." echo "" echo "Condition 1 is: ${cond1}" echo "condition 2 is: ${cond2}" # Evaluate specified treatment conditions and format sample file list appropriately. if [[ "${cond1}" == "${day}" || "${cond1}" == "${inf}" || "${cond1}" == "${temp}" ]]; then cond1_count=$((cond1_count+1)) echo "" echo "Condition 1 evaluated." # Create tab-delimited samples file. printf "%s\t%s%02d\t%s\t%s\n" "${cond1}" "${cond1}_" "${cond1_count}" "${comparison_dir}${reads_array[i]}" "${comparison_dir}${reads_array[i+1]}" \ >> "${samples}" elif [[ "${cond2}" == "${day}" || "${cond2}" == "${inf}" || "${cond2}" == "${temp}" ]]; then cond2_count=$((cond2_count+1)) echo "" echo "Condition 2 evaluated." # Create tab-delimited samples file. printf "%s\t%s%02d\t%s\t%s\n" "${cond2}" "${cond2}_" "${cond2_count}" "${comparison_dir}${reads_array[i]}" "${comparison_dir}${reads_array[i+1]}" \ >> "${samples}" fi # Copy sample list file to transcriptome directory cp "${samples}" "${transcriptome_dir}" done echo "Created ${comparison} sample list file." # Create directory/sample list for ${trinity_matrix} command trin_matrix_list=$(awk '{printf "%s%s", $2, "/quant.sf " }' "${samples}") # Determine transcript abundances using Salmon alignment-free # abundance estimate. ${trinity_abundance} \ --output_dir "${comparison_dir}" \ --transcripts ${transcriptome} \ --seqType fq \ --samples_file "${samples}" \ --est_method salmon \ --aln_method bowtie2 \ --gene_trans_map "${gene_map}" \ --prep_reference \ --thread_count "${threads}" \ 1> "${comparison_dir}"${salmon_stdout} \ 2> "${comparison_dir}"${salmon_stderr} # Convert abundance estimates to matrix ${trinity_matrix} \ --est_method salmon \ --gene_trans_map ${gene_map} \ --out_prefix salmon \ --name_sample_by_basedir \ ${trin_matrix_list} \ 1> ${matrix_stdout} \ 2> ${matrix_stderr} # Integrate Trinotate functional annotations "${trinity_annotate_matrix}" \ "${trinotate_feature_map}" \ salmon.gene.counts.matrix \ > salmon.gene.counts.annotated.matrix # Generate weighted gene lengths "${trinity_tpm_length}" \ --gene_trans_map "${gene_map}" \ --trans_lengths "${fasta_seq_lengths}" \ --TPM_matrix "${salmon_iso_matrix}" \ > Trinity.gene_lengths.txt \ 2> ${tpm_length_stderr} # Differential expression analysis # Utilizes edgeR. # Needs to be run in same directory as transcriptome. cd ${transcriptome_dir} || exit ${trinity_DE} \ --matrix "${comparison_dir}salmon.gene.counts.matrix" \ --method edgeR \ --samples_file "${samples}" \ 1> ${trinity_DE_stdout} \ 2> ${trinity_DE_stderr} mv edgeR* "${comparison_dir}" # Run differential expression on edgeR output matrix # Set fold difference to 2-fold (ie. -C 1 = 2^1) # P value <= 0.05 # Has to run from edgeR output directory # Pulls edgeR directory name and removes leading ./ in find output # Using find is required because edgeR names directory using PID # and I don't know how to find that out cd "${comparison_dir}" || exit edgeR_dir=$(find . -type d -name "edgeR*" | sed 's%./%%') cd "${edgeR_dir}" || exit mv "${transcriptome_dir}/${trinity_DE_stdout}" . mv "${transcriptome_dir}/${trinity_DE_stderr}" . ${diff_expr} \ --matrix "${salmon_gene_matrix}" \ --samples "${samples}" \ --examine_GO_enrichment \ --GO_annots "${go_annotations}" \ --include_GOplot \ --gene_lengths "${comparison_dir}Trinity.gene_lengths.txt" \ -C 1 \ -P 0.05 \ 1> ${diff_expr_stdout} \ 2> ${diff_expr_stderr} cd "${wd}" || exit done 

Flatten enriched GO terms file (GitHub):

#!/bin/bash ############################################################# # Script to "flatten" Trinity edgeR GOseq enrichment format # so each line contains a single gene/transcript ID # and associated GO term ############################################################# # Enable globstar for recursive searching shopt -s globstar # Declare variables output_file="" wd=$(pwd) # Input file ## Expects Trinity edgeR GOseq enrichment format: ## category over_represented_pvalue under_represented_pvalue numDEInCat numInCat term ontology over_represented_FDR go_term gene_ids ## Field 10 (gene_ids) contains comma separated gene_ids that fall in the given GO term in the "category" column for goseq in **/*UP.subset*.enriched do # Capture path to file dir=${goseq%/*} cd "${dir}" || exit tmp_file=$(mktemp) # Count lines in file linecount=$(cat "${goseq}" | wc -l) # If file is not empty if (( "${linecount}" > 1 )) then output_file="${goseq}.flattened" # 1st: Convert comma-delimited gene IDs in column 10 to tab-delimited # Also, set output (OFS) to be tab-delimited # 2nd: Convert spaces to underscores and keep output as tab-delimited # 3rd: Sort on Trinity IDs (column 10) and keep only uniques awk 'BEGIN{FS="\t";OFS="\t"} {gsub(/, /, "\t", $10); print}' "${goseq}" \ | awk 'BEGIN{F="\t";OFS="\t"} NR==1; NR > 1 {gsub(/ /, "_", $0); print}' \ > "${tmp_file}" # Identify the first line number which contains a gene_id begin_goterms=$(grep --line-number "TRINITY" "${tmp_file}" \ | awk '{for (i=1;i<=NF;i++) if($i ~/TRINITY/) print i}' \ | sort --general-numeric-sort --unique | head -n1) # "Unfolds" gene_ids to a single gene_id per row while read -r line do # Capture the length of the longest row max_field=$(echo "$line" | awk -F "\t" '{print NF}') # Retain the first 8 fields (i.e. categories) fixed_fields=$(echo "$line" | cut -f1-8) # Since not all the lines contain the same number of fields (e.g. may not have GO terms), # evaluate the number of fields in each line to determine how to handle current line. # If the value in max_field is less than the field number where the GO terms begin, # then just print the current line (%s) followed by a newline (\n). if (( "$max_field" < "$begin_goterms" )) then printf "%s\n" "$line" else goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field") # Assign values in the variable "goterms" to a new indexed array (called "array"), # with tab delimiter (IFS=$'\t') IFS=$'\t' read -r -a array <<<"$goterms" # Iterate through each element of the array. # Print the first n fields (i.e. the fields stored in "fixed_fields") followed by a tab (%s\t). # Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n). for element in "${!array[@]}" do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}" done fi done < "${tmp_file}" > "${output_file}" fi # Cleanup rm "${tmp_file}" cd "${wd}" || exit done 

Sam’s Notebook: FastQC-MultiQC – Laura Spencer’s QuantSeq Data

Laura Spencer received her O.lurida QuantSeq data, so I put it through FastQC/MultiQC and put the pertinent info in the nightingales Google Sheet. I also moved the data to /owl/nightingales/O_lurida, updated the readme file and checksums file. There were 148 individual samples, so I won’t list them all here.

Grace’s Notebook: Crab Results progress – GOslim transcriptome plot, macropinocytosis list and notes

Crab transcriptome GOslim

Got the BLAST output from the crab transcriptome to GO slim terms using 042320-bairdi_transcriptome-BLAST-to-GOslim.ipynb

Saved this GOslim text file: GOslim-P-pie.txt

GOslim was made in excel because that was easy for me. Removed “other biological processes” and other non-descriptive GOslim terms.

img

Macropinocytosis

Macropinocytosis was identified as the only significantly enriched GO term from the 2019 crab dataset comparing the infected and uninfected crabs. (DAVID output: analyses/chart_12EFD.txt)

I made a list of the 12 genes that comprised the macropinocytosis in this Rmd. The list is here.

The genes were all (except one) described in Dictyostelium (slime mold), which is a protist. Hematodinium is also a protist… interesting. I’ve added some notes from papers and other things that I’ve found in the crab paper discussion section.

Mortality

Figure made in excel: analyses/mort_fig.xlsx
img

from Grace’s Lab Notebook https://ift.tt/2VWWsS7
via IFTTT