Grace’s Notebook: Crab Analyses Progress

I made some progress in getting new results for the crab part of my thesis. I ran DESeq2 on libraries 8-11 based on infection, but taking temperature into account. I then took those results and contrasted the two temperature treatments to get an effect of temperature on infection gene expression. I also am trying to do time-series with the individual crab RNAseq… and that’s been a bit tricky because in some cases we don’t have any replicates, and in ones where we do, there are three time points and I’m struggling to figure out how to analyze them… links to Rmd and results files in post:

DESeq2 on libraries 8-11 based on infection taking temperature into account:

Rmd: scripts/DESeq-libraries8-11.Rmd
.HTML preview: /scripts/DESeq-libraries8-11.htm

Heatmaps are viewable in the .html linked above^

Infection comparison taking temperature into account
DEGlist (.tab)
analyses/DEGlist-infection_temp-libs8-11.tab

DEGlist (.txt with Trinity_ID column):
analyses/DEGlist-infection_temp-libs8-11.txt

With count data:
/analyses/DEGlist-inf-temp-libs8-11_counts.tab

Annotated with transcriptome v. 1.5 blast output and uniprot-GO:
analyses/DEGlist-infection_temp-annot.tab

Contrast temperature treatments from the above results:

DEGlist (.tab)
analyses/DEGlist-contrast_temp-libs8-11.tab

DEGlist (.txt)
[analyses/DEGlist-contrast_temp-libs8-11.txt(https://ift.tt/3dhcehg)

With count data:
analyses/DEGlist-contrast-temp-libs8-11_counts.tab

Annotated with transcriptome v. 1.5 blast output and uniprot-GO:
analyses/DEGlist-contrasttemp-annot.tab

Time series with individual crab RNAseq data attempts:

Rmd: scripts/DESeq-time-series.Rmd

.HTML: scripts/DESeq-time-series.html

To Do for time series:

  • try to figure out how to plot gene counts of crab ABC over time
  • Gene list for ABC vs GHI (I think I have this… but I think because it’s two temperatures at two time points, I have to do what I did above for libraries 8 -11 where I do the initital deseq, and then do a contrast to get the final gene list)

from Grace’s Lab Notebook https://ift.tt/3dcVsA1
via IFTTT

Yaamini’s Notebook: MethCompare Part 19

Multivariate analysis for compositional data

Last week I reached out to Evan and Julian to see if they could help me decide on a statistical approach. I posted Evan’s suggestions in this issue. In short, he suggested pairing multivariate and generalized linear model approaches, giving me more power to discern if sequencing method influences CpG detection in genome features or methylation status.

Data transformation

I opened this R Markdown script to get started. To create my dataframe, I used cbind to combine the CpG type and location percentages for each sample. I used the genomic location data the full sample, and not the individual columns for a sample that looked at proportino of a specific methylation status in that genomic location. I figured the point of a multivariate analysis would be to determine if these variables are related in any sense, so I didn’t want to add any variables that would be redundant.

Since I was working with proportion data, Evan suggested I borrow methods from compositional data analysis techniques. To do this, I used a centered log-ratio transformation on my matrix:

McapCpGMultiTrans <- data.frame(clr(McapCpGMultiMaster)) #Use centered log-ratio transformation 

This is where I ran into my first point of confusion. Looking over Julian’s email, he suggested I use a Hellinger’s transformation on my data to give low weights to smaller proportions:

McapCpGMultiTrans <- data.trans(McapCpGMultiMaster, method = "hellingers", plot = FALSE) #Hellinger (asymmetric) transformation 

When I tried this code, I got a data matrix filled with N/As! I wasn’t sure why this was happening, so I decided to just continue with the CLR transformation and ask Evan which method would be more suitable later.

NMDS and ANOSIM

Although we already had stacked barplots for visualization, I wanted to go through the exercise of creating and NMDS to look at any influential loadings. I followed the same process I used for my DNR paper:

  1. Create a scree plot to determine the optimal number of axes
  2. Confirm that the number of axes returns a low stress value using a Montecarlo permutation
  3. Calculate loadings
  4. Plot an NMDS with loadings that have p-values < 0.05
  5. Conduct global and pairwise ANOSIM tests
  6. Conduct post-hoc tests for significant ANOSIM if necessary

At steps 1 and 2, I got errors that suggested I had insufficient data. Additionally, I was unable to create a scree plot, and my stress value was very close to zero. I recorded these errors but decided to proceed.

For M. capitata, all loadings were significant below 0.05, but for P. acuta, CDS was not an influential loading! Interesting. When I created my NMDS, all samples for the same method were plotted on top of eachother for M. capitata. For P. acuta, my WGBS and RRBS plotted on top of eachother. However, my MBD-BS samples were separated, and sample 8 looked like an outlier within that group.

Related to my lack of sufficient data, I was able to conduct a global ANOSIM for each species, but I got errors suggesting I had insufficient data when I investigated the significant global ANOSIM results with pairwise tests. During the pairwise tests, all possible permutations were conducted (“complete enumeration”). My pairwise tests were not significant, but since my global ANOSIM was, I have a feeling that I do not have enough statistical power to investigate that difference using this multivariate method.

I knit the script into this document and shared it with Evan to get his take on my insufficient data and transformation issues. Most importantly, I wonder if his original suggestion to pair multivariate and generalized linear model approaches is still the best option considering I am data limited.

Going forward

  1. Conduct glm analysis and revise multivariate analysis
  2. Locate TE tracks
  3. [Characterize intersections between data and TE, and create summary tables]
  4. Look into program for mCpG identification

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/2zE4zvL
via IFTTT

Shelly’s Notebook:

from Shelly A. Trigg, Ph.D. https://ift.tt/2BaWVte
via IFTTT

Grace’s Notebook: June 2020 Goals

Writing this halfway through the month of June, but it’s ok.

Thesis

Successfully defended on June 2nd!!

To dos:

  • write an intro for my thesis tying two projects together (why is temperature important; why are these two species important)
  • submit thesis to Graduate School by June 25th (last day is Friday, June 26th) (electronic thesis)
  • submit master’s thesis approval form (committee) (form) by June 26th (I don’t think this has been submitted yet…?)

Crab stuff

  • DESeq2 on libraries 8-11 –> compare based on infection status to get effect of temperature (in progress: here)
  • Time series of individual crab RNAseq data (in progress: here)

Resources that I’ve been using to figure out the above tasks:

Things that might come in handy as I continue to try to figure these tasks out:

I’ll create a more detailed post after I find out what works and why.

Oyster stuff

I think just go through the comments from Chelsea and Pam and clean it up.

Goal timeline:

By/on June 18th (this Thursday):

  • have new results for crab project
  • have meeting with Pam (and Steven and Sam) to go over crab project

By Jun 23rd (next Tuesday):

  • have thesis intro written
  • have oytser comments addressed
  • have progress made on new crab results discussion, etc

June 24th (next Wednesday):

  • make sure I know how I’ll submit thesis electronically (info linked in thesis section above)

Submit thesis to graduate school by June 25th!! (June 26th if needed)
Make sure committee signs and submits the form approving my submittal of an electronic thesis to the Graduate School (get signatures) by June 26th at the latest.

from Grace’s Lab Notebook https://ift.tt/2LWlJX1
via IFTTT

Yaamini’s Notebook: MethCompare Part 18

Formatting individual sample information

In this issue we’ve discussed how to revise our CpG methylation status and genomic location statistical analyses. We know we want to compare proportoins while investigating if sequencing method affects the proportion of different methylation statuses or in genomic locations. I posted some suggestions, but in the meantime I thought I could obtain individual-level proportion data.

Thankfully for me, most of the pipeline was already set up! In this Jupyter notebook I counted CpGs for each methylation status and in various genomic features. The only things I needed to modify were ensuring I used bedtools -u, adding code for upstream and downstream flank overlaps, and adding the path to the explicit intragenic region tracks. I took the output files (line counts) and used them in this R Markdown script and used them to create summary tables:

M. capitata:

P. acuta:

Going forward

  1. Conduct statistical analysis
  2. Locate TE tracks
  3. [Characterize intersections between data and TE, and create summary tables]
  4. Look into program for mCpG identification

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/2XU7I3V
via IFTTT

Shelly’s Notebook:

from Shelly A. Trigg, Ph.D. https://ift.tt/2XHUM0V
via IFTTT

Sam’s Notebook: Transcriptome Annotation – C.bairdi Transcriptomes v2.1 and v3.1 Using DIAMOND BLASTx on Mox

Decided to annotate the two C.bairdi transcriptomes , cbai_transcriptome_v2.1 and cbai_transcriptome_v3.1, generated on 20200605 using DIAMOND BLASTx on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=cbai_diamond_blastx_v2.1_v3.1 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=0-08:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200608_cbai_diamond_blastx_v2.1_v3.1 ## Script for running BLASTx (using DIAMOND) to annotate ## C.bairdi transcriptomes v2.1 and v3.1 against SwissProt database. ## Output will be in standard BLAST output format 6. ################################################################################### # These variables need to be set by user threads=28 # Programs array declare -A programs_array programs_array=( [diamond]="/gscratch/srlab/programs/diamond-0.9.29/diamond" ) # Transcriptomes arrays transcriptomes_dir="/gscratch/srlab/sam/data/C_bairdi/transcriptomes" transcriptomes=("${transcriptomes_dir}/cbai_transcriptome_v2.1.fasta" \ "${transcriptomes_dir}/cbai_transcriptome_v3.1.fasta") # DIAMOND UniProt database dmnd=/gscratch/srlab/blastdbs/uniprot_sprot_20200123/uniprot_sprot.dmnd ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 for fasta in "${!transcriptomes[@]}" do # Generate checksums for reference md5sum "${transcriptomes[$fasta]}">> fasta.checksums.md5 # Strip leading path and extensions no_path="${transcriptomes[$fasta]##*/}" no_ext="${no_path%.*}" # Run DIAMOND with blastx # Output format 6 produces a standard BLAST tab-delimited file ${programs_array[diamond]} blastx \ --db ${dmnd} \ --query "${transcriptomes[$fasta]}" \ --out "${no_ext}".blastx.outfmt6 \ --outfmt 6 \ --evalue 1e-4 \ --max-target-seqs 1 \ --block-size 15.0 \ --index-chunks 4 done ################################################################################### # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : n } >> system_path.log # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" ${programs_array[$program]} --help echo "" echo "" echo "

Sam’s Notebook: Transcriptome Assessment – BUSCO Metazoa on C.bairdi Transcriptome v3.1

Continuing to try to identify the best C.bairdi transcriptome, we decided to extract all non-dinoflagellate sequences from cbai_transcriptome_v2.0 (RNAseq shorthand: 2018, 2019, 2020-GW, 2020-UW) and cbai_transcriptome_v3.0 (RNAseq shorthand: 2018, 2019, 2020-UW).

Now, want to assess its cbai_transcriptome_v3.1 “completeness” using BUSCO and the metazoa_odb9 database.

BUSCO was run with the --mode transcriptome option on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=cbai_busco_v3.1_transcriptome ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=1-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200605_cbai_busco_transcriptome_v3.1 ### C.bairdi transcriptome assembly completeness assessment using BUSCO. ### This is checking cbai_transcriptome_v1.7.fasta # Load Python Mox module for Python module availability module load intel-python3_2017 # Load Open MPI module for parallel, multi-node processing module load icc_19-ompi_3.1.2 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log ## Input files and settings busco_db=/gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9 transcriptome_fasta=/gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v3.1.fasta augustus_species=fly threads=28 ## Save working directory wd=$(pwd) # Extract FastA filename fasta_name=${transcriptome_fasta##*/} ## Set program paths augustus_bin=/gscratch/srlab/programs/Augustus-3.3.2/bin augustus_scripts=/gscratch/srlab/programs/Augustus-3.3.2/scripts blast_dir=/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/ busco=/gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py hmm_dir=/gscratch/srlab/programs/hmmer-3.2.1/src/ ## Augustus configs augustus_dir=${wd}/augustus augustus_config_dir=${augustus_dir}/config augustus_orig_config_dir=/gscratch/srlab/programs/Augustus-3.3.2/config ## BUSCO configs busco_config_default=/gscratch/srlab/programs/busco-v3/config/config.ini.default busco_config_ini=${wd}/config.ini # Export BUSCO config file location export BUSCO_CONFIG_FILE="${busco_config_ini}" # Export Augustus variable export PATH="${augustus_bin}:$PATH" export PATH="${augustus_scripts}:$PATH" export AUGUSTUS_CONFIG_PATH="${augustus_config_dir}" # Copy BUSCO config file cp ${busco_config_default} "${busco_config_ini}" # Make Augustus directory if it doesn't exist if [ ! -d "${augustus_dir}" ]; then mkdir --parents "${augustus_dir}" fi # Copy Augustus config directory cp --preserve -r ${augustus_orig_config_dir} "${augustus_dir}" # Edit BUSCO config file ## Set paths to various programs ### The use of the % symbol sets the delimiter sed uses for arguments. ### Normally, the delimiter that most examples use is a slash "/". ### But, we need to expand the variables into a full path with slashes, which screws up sed. ### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%"). sed -i "/^;cpu/ s/1/${threads}/" "${busco_config_ini}" sed -i "/^tblastn_path/ s%tblastn_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}" sed -i "/^makeblastdb_path/ s%makeblastdb_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}" sed -i "/^augustus_path/ s%augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}" sed -i "/^etraining_path/ s%etraining_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}" sed -i "/^gff2gbSmallDNA_path/ s%gff2gbSmallDNA_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^new_species_path/ s%new_species_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^optimize_augustus_path/ s%optimize_augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^hmmsearch_path/ s%hmmsearch_path = /home/osboxes/BUSCOVM/hmmer/hmmer-3.1b2-linux-intel-ia32/binaries/%path = ${hmm_dir}%" "${busco_config_ini}" # Run BUSCO/Augustus training ${busco} \ --in ${transcriptome_fasta} \ --out ${fasta_name} \ --lineage_path ${busco_db} \ --mode transcriptome \ --cpu ${threads} \ --long \ --species ${augustus_species} \ --tarzip \ --augustus_parameters='--progress=true' # Create checksum for potential verification md5sum "${transcriptome_fasta}" >> "${fasta_name}".checksum.md5 

Sam’s Notebook: Transcriptome Assessment – BUSCO Metazoa on C.bairdi Transcriptome v2.1

Continuing to try to identify the best C.bairdi transcriptome, we decided to extract all non-dinoflagellate sequences from cbai_transcriptome_v2.0 (RNAseq shorthand: 2018, 2019, 2020-GW, 2020-UW) and cbai_transcriptome_v3.0 (RNAseq shorthand: 2018, 2019, 2020-UW).

Now, want to assess cbai_transcriptome_v2.1 “completeness” using BUSCO and the metazoa_odb9 database.

BUSCO was run with the --mode transcriptome option on Mox.

SBATCH script (GitHub):

#!/bin/bash ## Job Name #SBATCH --job-name=cbai_busco_v2.1_transcriptome ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=1-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200605_cbai_busco_transcriptome_v2.1 ### C.bairdi transcriptome assembly completeness assessment using BUSCO. ### This is checking cbai_transcriptome_v1.7.fasta # Load Python Mox module for Python module availability module load intel-python3_2017 # Load Open MPI module for parallel, multi-node processing module load icc_19-ompi_3.1.2 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log ## Input files and settings busco_db=/gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9 transcriptome_fasta=/gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v2.1.fasta augustus_species=fly threads=28 ## Save working directory wd=$(pwd) # Extract FastA filename fasta_name=${transcriptome_fasta##*/} ## Set program paths augustus_bin=/gscratch/srlab/programs/Augustus-3.3.2/bin augustus_scripts=/gscratch/srlab/programs/Augustus-3.3.2/scripts blast_dir=/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/ busco=/gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py hmm_dir=/gscratch/srlab/programs/hmmer-3.2.1/src/ ## Augustus configs augustus_dir=${wd}/augustus augustus_config_dir=${augustus_dir}/config augustus_orig_config_dir=/gscratch/srlab/programs/Augustus-3.3.2/config ## BUSCO configs busco_config_default=/gscratch/srlab/programs/busco-v3/config/config.ini.default busco_config_ini=${wd}/config.ini # Export BUSCO config file location export BUSCO_CONFIG_FILE="${busco_config_ini}" # Export Augustus variable export PATH="${augustus_bin}:$PATH" export PATH="${augustus_scripts}:$PATH" export AUGUSTUS_CONFIG_PATH="${augustus_config_dir}" # Copy BUSCO config file cp ${busco_config_default} "${busco_config_ini}" # Make Augustus directory if it doesn't exist if [ ! -d "${augustus_dir}" ]; then mkdir --parents "${augustus_dir}" fi # Copy Augustus config directory cp --preserve -r ${augustus_orig_config_dir} "${augustus_dir}" # Edit BUSCO config file ## Set paths to various programs ### The use of the % symbol sets the delimiter sed uses for arguments. ### Normally, the delimiter that most examples use is a slash "/". ### But, we need to expand the variables into a full path with slashes, which screws up sed. ### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%"). sed -i "/^;cpu/ s/1/${threads}/" "${busco_config_ini}" sed -i "/^tblastn_path/ s%tblastn_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}" sed -i "/^makeblastdb_path/ s%makeblastdb_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}" sed -i "/^augustus_path/ s%augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}" sed -i "/^etraining_path/ s%etraining_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}" sed -i "/^gff2gbSmallDNA_path/ s%gff2gbSmallDNA_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^new_species_path/ s%new_species_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^optimize_augustus_path/ s%optimize_augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}" sed -i "/^hmmsearch_path/ s%hmmsearch_path = /home/osboxes/BUSCOVM/hmmer/hmmer-3.1b2-linux-intel-ia32/binaries/%path = ${hmm_dir}%" "${busco_config_ini}" # Run BUSCO/Augustus training ${busco} \ --in ${transcriptome_fasta} \ --out ${fasta_name} \ --lineage_path ${busco_db} \ --mode transcriptome \ --cpu ${threads} \ --long \ --species ${augustus_species} \ --tarzip \ --augustus_parameters='--progress=true' # Create checksum for potential verification md5sum "${transcriptome_fasta}" >> "${fasta_name}".checksum.md5 

Sam’s Notebook: Sequence Extractions – C.bairdi Transcriptomes v2.0 and v3.0 Excluding Alveolata with MEGAN6 on Swoose

Continuing to try to identify the best C.bairdi transcriptome, we decided to extract all non-dinoflagellate sequences from cbai_transcriptome_v2.0 (RNAseq shorthand: 2018, 2019, 2020-GW, 2020-UW) and cbai_transcriptome_v3.0 (RNAseq shorthand: 2018, 2019, 2020-UW). Both of these transcriptomes were assembled without any taxonomic filter applied. DIAMOND BLASTx and conversion to MEGAN6 RMA6 files was performed yesterday (20200604).

Will import RMA6 files into MEGAN6 and extract all non_Alveolata (dinoflagellates) sequences to create new transcriptomes. The new transcriptomes will be named cbai_transcriptome_v2.1 and cbai_transcriptome_v3.1. I’ll also extract only Alveolata sequences to generate a Hematodinium sp. transcriptome.