Workflow Standardization

Research Update – Hematodinium Investigation Yesterday’s notebook update focused on my plans to investigate future research questions. This one focuses on my current project – analyzing differential expression in Hematodinium in Grace’s samples. I’m roughly at the same stage as before – I’ve got all my differentialy-expressed GO terms, but realized a few things. The most important of those: for analysis with GO-MWU/GOseq/some other gene enrichment analysis, I need to provide all my genes, not just the differentially-expressed ones. Because of that, I had to start over on my R script that produces a newline-separated file of UniProt accessions. When going through my R script, I realized I’ve learned a lot in the last few months. The script was badly written, and I had to either sub out specific parts each time, or have differnt scripts for different files. Because of that, I decided to rewrite my scripts, and turned…

from Aidan F. Coyle https://ift.tt/3qBqWqp
via IFTTT

Project Proposal Idea

Research Update – Future Plans I’ve been somewhat neglecting my notebook duties for the last while – I’ve really been working along two avenues, and felt like I didn’t have enough in either to warrant a full post. Well, until recently, that is! This is going to be a two-part post. The first (this), will focus on how I’m progressing with planning for future research – finding a good question, determining feasibility, diving into the lit, and so on. The second will focus on how I’m progressing with my analysis of differential gene expression in Grace’s hematodinium samples. Alright. So before I begin describing my potential research project, some quick housekeeping. I imported all of Grace’s thesis lit (she kindly shared her folder with me) into PaperPile, and linked each with a PDF. I’m now on the hunt for a good method for tagging and sorting all 250-odd papers! Enough…

from Aidan F. Coyle https://ift.tt/35WBLeQ
via IFTTT

R package: usethis

install.packages(“usethis”) #Install packages
usethis::use_course(“hglanz/rworkshop_Jan2021”) #Download materials from GitHub repo (https://github.com/hglanz/rworkshop_Jan2021) to machine

Samples Submitted – M.magister MBD-BSseq Libraries to Univ. of Oregon GC3F

Submitted the M.magister MBD-BSseq libraries created 20201124 using the 4nM aliquots created for the MiSeq test run on 20201202 to the Univ. of Oregon GC3F sequencing core.

For library pooling, I used an updated set of calculations determined by Mac to help equalize individual library contributions, based on the results from the MiSeq run:

Sample Volume_4nMsample_toBalance_uL
CH01-06 2.43
CH01-14 3.1
CH01-22 2.58
CH01-38 1.63
CH03-04 1.66
CH03-15 4.03
CH03-33 1.71
CH05-01 2.1
CH05-06 4.38
CH05-21 1.84
CH05-24 2.98
CH07-06 1.79
CH07-11 2.83
CH07-24 2.98
CH09-02 1.36
CH09-13 4.57
CH09-28 1.4
CH10-01 1.29
CH10-08 1.83
CH10-11 0.99

Full Google Sheet for this project:

from Sam’s Notebook https://ift.tt/2N0fKFw
via IFTTT

Transcriptome Comparisons – C.bairdi Transcriptomes Evaluations with DETONATE rsem-eval on Mox

UPDATE: I’ll lead in with the fact that this failed with an error message that I can’t figure out. This will save the reader some time. I’ve posted the problem as an Issue on the DETONATE GitHub repo, however it’s clear that this software is no longer maintained, as the repo hasn’t been updated in >3yrs; even lacking responses to Issues that are that old.

Here’s the error message and some other details that could be useful for troubleshooting (which are beyond my knowledge – although I suspect that the XM tag is the culprit and the first entry in the BAM file has XM:i:2 and the error message might suggest that 2 is not an acceptable value e.g. Assertion val == 0 val == 1 val == 5’ failed.`):
rsem-synthesis-reference-transcripts cbai_transcriptome_v3.1.fasta.temp/cbai_transcriptome_v3.1.fasta 0 0 0 /gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v3.1.fasta Transcript Information File is generated! Group File is generated! Extracted Sequences File is generated! rsem-preref cbai_transcriptome_v3.1.fasta.temp/cbai_transcriptome_v3.1.fasta.transcripts.fa 1 cbai_transcriptome_v3.1.fasta.temp/cbai_transcriptome_v3.1.fasta Refs.makeRefs finished! Refs.saveRefs finished! cbai_transcriptome_v3.1.fasta.temp/cbai_transcriptome_v3.1.fasta.idx.fa is generated! cbai_transcriptome_v3.1.fasta.temp/cbai_transcriptome_v3.1.fasta.n2g.idx.fa is generated! rsem-parse-alignments cbai_transcriptome_v3.1.fasta.temp/cbai_transcriptome_v3.1.fasta cbai_transcriptome_v3.1.fasta.temp/cbai_transcriptome_v3.1.fasta cbai_transcriptome_v3.1.fasta.stat/cbai_transcriptome_v3.1.fasta b /gscratch/scrubbed/samwhite/outputs/20201224_cbai_bowtie2_transcriptomes_alignments/cbai_transcriptome_v3.1.fasta.sorted.bam -t 3 -tag XM rsem-parse-alignments: parseIt.cpp:92: void parseIt(SamParser*) [with ReadType = PairedEndReadQ; HitType = PairedEndHit]: Assertion `val == 0 || val == 1 || val == 5' failed. "rsem-parse-alignments cbai_transcriptome_v3.1.fasta.temp/cbai_transcriptome_v3.1.fasta cbai_transcriptome_v3.1.fasta.temp/cbai_transcriptome_v3.1.fasta cbai_transcriptome_v3.1.fasta.stat/cbai_transcriptome_v3.1.fasta b /gscratch/scrubbed/samwhite/outputs/20201224_cbai_bowtie2_transcriptomes_alignments/cbai_transcriptome_v3.1.fasta.sorted.bam -t 3 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline! 

Here’s what the head of the BAM file looks like:

[samwhite@n2233 20201224_cbai_bowtie2_transcriptomes_alignments]$ /gscratch/srlab/programs/samtools-1.10/samtools view cbai_transcriptome_v3.1.fasta.sorted.bam | head A00147:108:HLLJFDMXX:1:1369:3893:6637 163 TRINITY_DN5604_c0_g2_i1 1 42 101M = 81 181 GAAAGAAAAACCGACAGGAGGAATTTCTTTGTTACCAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAA :FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFF AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:2G31A66 YS:i:0 YT:Z:CP A00147:121:HLLVMDMXX:1:2159:5674:25786 163 TRINITY_DN5604_c0_g2_i1 4 42 101M = 72 169 AGAAAAACCGACAGGAGGAATTTCTTTGTTAACAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAATTA FFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP A00147:108:HLLJFDMXX:2:1420:13702:14920 163 TRINITY_DN5604_c0_g2_i1 5 42 101M = 197 293 GAAAAACCGACAGGAGGAATTTCTTTGTTAACAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAATTAA FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP A00147:108:HLLJFDMXX:2:1420:13431:15796 163 TRINITY_DN5604_c0_g2_i1 5 42 101M = 197 293 GAAAAACCGACAGGAGGAATTTCTTTGTTAACAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAATTAA FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP A00147:121:HLLVMDMXX:1:1258:5141:32377 163 TRINITY_DN5604_c0_g2_i1 5 42 101M = 107 203 GAAAAACCGACAGGAGGAATTTCTTTGTTACCAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAATTAA FFFFF,FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:F:FFFFFFFF AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:30A70 YS:i:0 YT:Z:CP A00147:121:HLLVMDMXX:1:1259:7645:5838 163 TRINITY_DN5604_c0_g2_i1 5 42 101M = 107 203 GAAAAACCGACAGGAGGAATTTCTTTGTTACCAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAATTAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:30A70 YS:i:0 YT:Z:CP A00147:108:HLLJFDMXX:1:1351:27082:1705 163 TRINITY_DN5604_c0_g2_i1 6 42 101M = 158 253 AAAAACCGACAGGAGGAATTTCTTTGTTAACAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAATTAAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF: AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP A00147:108:HLLJFDMXX:1:1475:28926:32706 163 TRINITY_DN5604_c0_g2_i1 6 42 101M = 158 253 AAAAACCGACAGGAGGAATTTCTTTGTTAACAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAATTAAA FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP A00147:121:HLLVMDMXX:1:2162:31385:26381 163 TRINITY_DN5604_c0_g2_i1 6 42 101M = 158 253 AAAAACCGACAGGAGGAANTTCTTTGTTAACAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAATTAAA FFFFFFFFFFFFFFFFFF#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF: AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:18T82 YS:i:0 YT:Z:CP A00147:121:HLLVMDMXX:2:1425:28745:30639 163 TRINITY_DN5604_c0_g2_i1 7 42 101M = 169 263 AAAACCGACAGGAGGAATTTCTTTGTTAACAACAAAAACTAATATATTTCGCATACCTGACAGACATGGTGACAGCGCCTCTGATGTTCGCCGAATTAAAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF: AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP 

bowtie2 commands to generate the BAM file:

 # Use bowtie2 and paired-end options # Uses settings specified for use with DETONATE # and for paired end reads when using DETONATE. ${programs_array[bowtie2]} \ -x ${transcriptome_name} \ -S ${transcriptome_name}.sam \ --threads ${threads} \ -1 ${R1_list} \ -2 ${R2_list} \ --sensitive \ --dpad 0 \ --gbar 99999999 \ --mp 1,1 \ --np 1 \ --score-min L,0,-0.1 \ --no-mixed \ --no-discordant # Convert SAM to sorted BAM # ${programs_array[samtools_view]} \ -b \ ${transcriptome_name}.sam \ | ${programs_array[samtools_sort]} \ -m ${mem_per_thread} \ --threads ${threads} \ -o ${transcriptome_name}.sorted.bam \ - 

With all of that out of the way, you can find the original post below.

Alignments – C.bairdi RNAseq Transcriptome Alignments Using Bowtie2 on Mox

I had previously attempted to compare all of our C.bairdi transcriptome assemblies using DETONATE on 20200601, but, due to hitting time limits on Mox, failed to successfully get the analysis to complete. I realized that the limiting factor was performing FastQ alignments, so I decided to run this step independently to see if I could at least get that step resolved. DETONATE (rsem-eval) will accept BAM files as input, so I’m hoping I can power through this alignment step and then provided DETONATE (rsem-eval) with the BAM files.

I ran bowtie2 on Mox using the alignment settings described in the DETONATE (rsem-eval) documentation (see SBATCH script below for actual bowtie2 parameters).

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=20201224_cbai_bowtie2_transcriptomes_alignments ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=15-00:00:00 ## Memory per node #SBATCH --mem=500G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201224_cbai_bowtie2_transcriptomes_alignments # This is a script to generate BAM files for use in DETONATE's # rsem-eval program to compare C.bairdi transcriptome assembly "qualities". ################################################################################### # These variables need to be set by user # Assign Variables reads_dir=/gscratch/srlab/sam/data/C_bairdi/RNAseq transcriptomes_dir=/gscratch/srlab/sam/data/C_bairdi/transcriptomes threads=28 mem_per_thread=10G # Program paths bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.4.2-linux-x86_64" samtools="/gscratch/srlab/programs/samtools-1.10/samtools" # Array of the various comparisons to evaluate # Each condition in each comparison should be separated by a "-" transcriptomes_array=( "${transcriptomes_dir}"/cbai_transcriptome_v1.0.fasta \ "${transcriptomes_dir}"/cbai_transcriptome_v1.5.fasta \ "${transcriptomes_dir}"/cbai_transcriptome_v1.6.fasta \ "${transcriptomes_dir}"/cbai_transcriptome_v1.7.fasta \ "${transcriptomes_dir}"/cbai_transcriptome_v2.0.fasta \ "${transcriptomes_dir}"/cbai_transcriptome_v2.1.fasta \ "${transcriptomes_dir}"/cbai_transcriptome_v3.0.fasta \ "${transcriptomes_dir}"/cbai_transcriptome_v3.1.fasta ) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Programs array declare -A programs_array programs_array=( [bowtie2]="${bowtie2_dir}/bowtie2" \ [bowtie2_build]="${bowtie2_dir}/bowtie2-build" \ [samtools_index]="${samtools} index" \ [samtools_sort]="${samtools} sort" \ [samtools_view]="${samtools} view" ) # Loop through each comparison for transcriptome in "${!transcriptomes_array[@]}" do ## Inititalize arrays R1_array=() R2_array=() reads_array=() # Variables R1_list="" R2_list="" transcriptome_name="${transcriptomes_array[$transcriptome]##*/}" # Capture FastA checksums for verification echo "Generating checksum for ${transcriptome_name}" md5sum "${transcriptomes_array[transcriptome]}" >> fasta.checksums.md5 echo "Finished generating checksum for ${transcriptome_name}" echo "" if [[ "${transcriptome_name}" == "cbai_transcriptome_v1.0.fasta" ]]; then reads_array=("${reads_dir}"/20200[15][13][138]*megan*.fq) # Create array of fastq R1 files R1_array=("${reads_dir}"/20200[15][13][138]*megan*R1.fq) # Create array of fastq R2 files R2_array=("${reads_dir}"/20200[15][13][138]*megan*R2.fq) elif [[ "${transcriptome_name}" == "cbai_transcriptome_v1.5.fasta" ]]; then reads_array=("${reads_dir}"/20200[145][13][138]*megan*.fq) # Create array of fastq R1 files R1_array=("${reads_dir}"/20200[145][13][138]*megan*R1.fq) # Create array of fastq R2 files R2_array=("${reads_dir}"/20200[145][13][138]*megan*R2.fq) elif [[ "${transcriptome_name}" == "cbai_transcriptome_v1.6.fasta" ]]; then reads_array=("${reads_dir}"/*megan*.fq) # Create array of fastq R1 files R1_array=("${reads_dir}"/*megan*R1.fq) # Create array of fastq R2 files R2_array=("${reads_dir}"/*megan*R2.fq) elif [[ "${transcriptome_name}" == "cbai_transcriptome_v1.7.fasta" ]]; then reads_array=("${reads_dir}"/20200[145][13][189]*megan*.fq) # Create array of fastq R1 files R1_array=("${reads_dir}"/20200[145][13][189]*megan*R1.fq) # Create array of fastq R2 files R2_array=("${reads_dir}"/20200[145][13][189]*megan*R2.fq) elif [[ "${transcriptome_name}" == "cbai_transcriptome_v2.0.fasta" ]] \ || [[ "${transcriptome_name}" == "cbai_transcriptome_v2.1.fasta" ]]; then reads_array=("${reads_dir}"/*fastp-trim*.fq) # Create array of fastq R1 files R1_array=("${reads_dir}"/*R1*fastp-trim*.fq) # Create array of fastq R2 files R2_array=("${reads_dir}"/*R2*fastp-trim*.fq) elif [[ "${transcriptome_name}" == "cbai_transcriptome_v3.0.fasta" ]] \ || [[ "${transcriptome_name}" == "cbai_transcriptome_v3.1.fasta" ]]; then reads_array=("${reads_dir}"/*fastp-trim*20[12][09][01][24]1[48]*.fq) # Create array of fastq R1 files R1_array=("${reads_dir}"/*R1*fastp-trim*20[12][09][01][24]1[48]*.fq) # Create array of fastq R2 files R2_array=("${reads_dir}"/*R2*fastp-trim*20[12][09][01][24]1[48]*.fq) fi # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename printf "%s\n" "${reads_array[@]##*/}" >> "${transcriptome_name}".fastq.list.txt # Create comma-separated lists of FastQ reads R1_list=$(echo "${R1_array[@]}" | tr " " ",") R2_list=$(echo "${R2_array[@]}" | tr " " ",") # Build Bowtie2 index # Transcriptome name is used as index basename ${programs_array[bowtie2_build]} \ --threads ${threads} \ ${transcriptomes_array[$transcriptome]} \ ${transcriptome_name} # Run rsem-eval # Use bowtie2 and paired-end options # Uses settings specified for use with DETONATE # and for paired end reads when using DETONATE. ${programs_array[bowtie2]} \ -x ${transcriptome_name} \ -S ${transcriptome_name}.sam \ --threads ${threads} \ -1 ${R1_list} \ -2 ${R2_list} \ --sensitive \ --dpad 0 \ --gbar 99999999 \ --mp 1,1 \ --np 1 \ --score-min L,0,-0.1 \ --no-mixed \ --no-discordant # Convert SAM to sorted BAM # ${programs_array[samtools_view]} \ -b \ ${transcriptome_name}.sam \ | ${programs_array[samtools_sort]} \ -m ${mem_per_thread} \ --threads ${threads} \ -o ${transcriptome_name}.sorted.bam \ - # Capture BAM checksums for verification echo "Generating checksum for ${transcriptome_name}.sorted.bam" md5sum ${transcriptome_name}.sorted.bam >> bam.checksums.md5 echo "Finished generating checksum for ${transcriptome_name}.sorted.bam" echo "" done # Remove leftover SAM files rm *.sam # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} fi ${programs_array[$program]} -h echo "" echo "" echo "

GO terms and Venn Diagrams

Obtaining GO IDs My analysis prior to this used DESeq2 to obtain a series of tables of genes with significantly-different expressions for each of my four comparisons. I wrote a script in R that took in each DESeq2 output file, matched it against this annotated transcriptome, and produced a newline-separated file of UniProt accessions. I then took each newline-separated file and put it into this Bash script from Sam, which retrieved the Gene Ontology terms for each Accession ID. This is progressing fairly well! Next steps are as follows: Get the GOslim terms using the GSEAbase package Perform gene enrichment analysis with TopGO or GO-MWU When time allows, go back and annotate the transcriptome myself to understand it more fully Producing Venn Diagrams In order to determine the overlap between DEGs for my different comparisons, I produced a few Venn diagrams. Since a quick analysis found practically no overlap between…

from Aidan F. Coyle https://ift.tt/3hcKQof
via IFTTT

Lists of Differentially-Expressed Genes (plus graphs)

Outline I put a whole bunch of libraries through the same protocol as before – use Kallisto to create pseudoalignments for libraries, use a Trinity pipeline script to merge Kallisto counts into a matrix, analyze matrix with DESeq2, produce list of significantly different genes (padj <= 0.005) At this point,…

from Aidan F. Coyle https://ift.tt/3p9Gq4e
via IFTTT

Adding Individual Libraries

DESeq2 First, I finished up my first quarter! Congratulations to me! Alright, I’ve made a lot of progress since last time. For the moment, building an index for transcriptome 3.0 is on the backburner – the top priority is just getting a list of differentially-expressed genes. I was able to…

from Aidan F. Coyle https://ift.tt/3r3DuYK
via IFTTT

Day 0 Ambient vs. Day 17 Ambient

Comparison Yesterday, I described how I planned to start with a comparison of Day 0 and Day 17 libraries at ambient temperatures. I was able to complete that! Methods followed the same protocol outlined previously – use Kallisto to create pseudoalignments for libraries, use a script from Trinity pipeline to…

from Aidan F. Coyle https://ift.tt/38f0AD7
via IFTTT