Server Maintenance – Fix Server Certificate Authentication Issues

We had been encountering issues when linking to images hosted on our servers (primarily Gannet) in GitHub (e.g. notebooks, Issues/Discussions). Images always showed up as broken links and, with some digging, we could find an error message related to server authentication. More recently, I also noticed that Jupyter Notebooks hosted on our servers could not be viewed in NB Viewer; attempting to view one results in a 404 error, with a note regarding server certificate problems. Finally, the most annoying issue came up when running the shell program wget to retrieve files from our servers: it always threw an error regarding our server certificates. The only way to run wget without that error was to add the option --no-check-certificate (which, thankfully, was suggested by the wget error message itself).

The perplexing thing is that we have InCommon CA certificates provided by the University of Washington for each of our servers, so it was confusing that we kept encountering these certificate-related errors. I finally got fed up with the problem and decided to do some sleuthing. I had done so in the past, but never came across anything that could point me to what was wrong. Well, today, I stumbled across the problem:

We didn’t have an intermediate InCommon certificate!

I wasn’t previously aware this was a requirement, as it was not mentioned when I initially requested certificates from our department IT group. But, during my (remarkably) brief searching, I stumbled across this guide for installing a certificate on a Synology NAS (the brand of all three of our servers) and noticed that the process included an intermediate certificate. The phrasing of the instructions also implied that it was necessary, which is what prompted me to explore this further. Of note, not only did department IT not mention intermediate certificates, but the Synology “wizard” that guides the user through the certificate generation/upload process doesn’t require an intermediate certificate, suggesting it’s optional (which, I guess, technically, it is; but it’d be nice if they explained the impacts of not providing one).

Searching for UW intermediate server certificates led me to this page, which actually provides the InCommon RSA Server intermediate certificates for UW! Doing a bit more research, I found the UW page describing InCommon CA server certificates. On that page is this crucial piece of information:

Server admins must install the InCommon CA intermediate certificate.

Notably, that page doesn’t provide a link to find the InCommon CA intermediate certificate…

So, seeing those two pages made me realize that the solution was, indeed, installing the InCommon CA intermediate certificate from that UW page! To do so, I generated a new Certificate Signing Request (CSR) for each server via the Synology interface, sent the CSRs off to our department IT group, and then used the Synology interface to import the updated server certificate and the intermediate certificate.
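
As a quick way to confirm the fix, one can ask a server for the certificate chain it presents (a minimal sketch; the hostname is just an example and any of our servers could be substituted):

# Show the certificate chain presented by the server.
# Before the fix, only the server (leaf) certificate should be listed;
# after importing the intermediate, an InCommon RSA Server CA entry should appear as well.
echo | openssl s_client -connect gannet.fish.washington.edu:443 -showcerts 2>/dev/null | grep -E '^ [0-9] s:|^   i:'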

And with that, I have managed to resolve all of the issues described in the initial paragraph! It is a great day! Huzzah!

from Sam’s Notebook https://ift.tt/GOWKMbC
via IFTTT

#ifttt, #sams-notebook

Differential Gene Expression – P.generosa DGE Between Tissues Using Nextflow NF-Core RNAseq Pipeline on Mox

Steven asked that I obtain relative expression values for various geoduck tissues (GitHub Issue). So, I decided to use this as an opportunity to try a Nextflow pipeline. There’s an RNAseq pipeline, NF-Core RNAseq, which I decided to use. The pipeline appears to be ridiculously thorough (e.g. trims, removes gDNA/rRNA contamination, allows for multiple aligners to be used, quantifies/visualizes feature assignments by reads, performs differential gene expression analysis and visualization), all in one package. Sounds great, but I did have some initial problems getting things up and running. Overall, getting things set up took longer than the actual pipeline run! Oh well, it’s a learning process, so that’s not totally unexpected.

For this pipeline run, I made some modifications to the genome GFF input file. First, I added a “gene_biotype” attribute for the pipeline to use when visualizing read assignments to different components of the genome. I did that in the following fashion:

# Copies header to new GFF
awk 'NR < 4 {print $0}' Panopea-generosa-v1.0.gff > Panopea-generosa-v1.0_biotype.gff
# Adds "gene_biotype" to end of line that matches feature field ($3)
awk 'NR > 3 {print $0";gene_biotype="$3}' Panopea-generosa-v1.0.gff >> Panopea-generosa-v1.0_biotype.gff
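
To make the effect concrete, here’s a before/after for a single feature line (an illustrative example with a made-up source and coordinates, not copied from the actual GFF):

# Before ($3 is the feature type, e.g. "mRNA"):
Scaffold_01   GenSAS   mRNA   2   4719   .   +   .   ID=PGEN_.00g000010.m01
# After (";gene_biotype=" plus the feature type appended):
Scaffold_01   GenSAS   mRNA   2   4719   .   +   .   ID=PGEN_.00g000010.m01;gene_biotype=mRNA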

Then, I modified it further to convert the tRNA strand field to + instead of . (to avoid RSEM errors regarding strand info) and to remove RNAmmer features (also to avoid RSEM strand errors).

# Converts strand field ($7) to `+` instead of `.`
# Works just on tRNA entries
# FS/OFS set to tab so modified lines remain valid tab-delimited GFF
awk 'BEGIN{FS=OFS="\t"} $2 == "GenSAS_5d82b316cd298-trnascan" {$7="+"}1' Panopea-generosa-v1.0.a4_biotype.gff > Panopea-generosa-v1.0.a4_biotype-trna_strand_converted.gff
# Prints all lines which are not rRNA
awk '$2 != "RNAmmer-1.2"' Panopea-generosa-v1.0.a4_biotype-trna_strand_converted.gff > Panopea-generosa-v1.0.a4_biotype-trna_strand_converted-no_RNAmmer.gff
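
And a quick sanity check (my addition, not part of the original commands) to confirm that no . strands remain among the feature lines:

# Should print nothing if all remaining features have +/- strand info
awk -F'\t' '!/^#/ && $7 == "."' Panopea-generosa-v1.0.a4_biotype-trna_strand_converted-no_RNAmmer.gff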

Then, this was all run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20220323-pgen-nextflow_rnaseq-tissues
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=17-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20220323-pgen-nextflow_rnaseq-tissues

# Script to run Nextflow NF Core RNAseq pipeline for RNAseq analysis of P.generosa, per this GitHub Issue:
# https://github.com/RobertsLab/resources/issues/1423

# See variable assignments below for input files used: genome, GFF, transcriptome
# List of input FastQs will be generated during run in: sample_sheet-"${SLURM_JOB_ID}".csv
# Custom config file for maximum memory and CPU thread settings

# Outputs explanations are here: https://nf-co.re/rnaseq/3.6/output
# Input parameter explanations are here: https://nf-co.re/rnaseq/3.6/parameters

###################################################################################

# These variables need to be set by user

## Assign Variables

## PROGRAMS ##
# NF Core RNAseq workflow directory
nf_core_rnaseq="/gscratch/srlab/programs/nf-core-rnaseq-3.6/workflow"

# NF Core RNAseq custom config file
nf_core_rnaseq_config=/gscratch/srlab/programs/nf-core-rnaseq-3.6/configs/conf/base-srlab_500GB_node.config

## FILES AND DIRECTORIES ##
# Working directory
wd=$(pwd)

# RNAseq FastQs directory
reads_dir=/gscratch/srlab/sam/data/P_generosa/RNAseq

# Genome FastA
genome_fasta=/gscratch/srlab/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.fa

# Genome GFF3
# This was manually modified by me to add gene_biotype to end of each entry.
# Improves NF-Core RNAseq pipeline analysis/visualization to have this info present.
genome_gff=/gscratch/srlab/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.a4_biotype-trna_strand_converted-no_RNAmmer.gff

## INITIALIZE ARRAYS ##
# Leave empty!!
R1_array=()
R2_array=()
R1_uncompressed_array=()
R2_uncompressed_array=()

###################################################################################

# Exit script if a command fails
set -e

# Load Anaconda
# Unknown why this is needed, but Anaconda will not run if this line is not included.
. "/gscratch/srlab/programs/anaconda3/etc/profile.d/conda.sh"

# Activate NF-core conda environment
conda activate nf-core_env

# Load Singularity Mox module for NF Core/Nextflow
module load singularity


# NF Core RNAseq sample sheet header
sample_sheet_header="sample,fastq_1,fastq_2,strandedness"
printf "%s\n" "${sample_sheet_header}" >> sample_sheet-"${SLURM_JOB_ID}".csv

# Create array of original uncompressed fastq R1 files
# Set filename pattern
R1_uncompressed_array=("${reads_dir}"/*_1.fastq)

# Create array of original uncompressed fastq R2 files
# Set filename pattern
R2_uncompressed_array=("${reads_dir}"/*_2.fastq)

# Check that R1 and R2 array sizes match
# Exit if mismatch
if [[ "${#R1_uncompressed_array[@]}" != "${#R2_uncompressed_array[@]}" ]]
then
  echo ""
  echo "Uncompressed array sizes don't match."
  echo "Confirm all expected FastQs are present in ${reads_dir}"
  echo ""

  exit 1
fi

# Create list of original uncompressed fastq files
## Uses parameter substitution to strip leading path from filename
for fastq in "${!R1_uncompressed_array[@]}"
do
  # Strip leading path
  no_path="${R1_uncompressed_array[${fastq}]##*/}"

  # Grab SRA name
  sra=$(echo "${no_path}" | awk -F "_" '{print $1}')

  # Only gzip matching FastQs
  # Only generate MD5 checksums for matching FastQs
  if [[ "${sra}" == "SRR12218868" ]] \
    || [[ "${sra}" == "SRR12218869" ]] \
    || [[ "${sra}" == "SRR12226692" ]] \
    || [[ "${sra}" == "SRR12218870" ]] \
    || [[ "${sra}" == "SRR12226693" ]] \
    || [[ "${sra}" == "SRR12207404" ]] \
    || [[ "${sra}" == "SRR12207405" ]] \
    || [[ "${sra}" == "SRR12227930" ]] \
    || [[ "${sra}" == "SRR12207406" ]] \
    || [[ "${sra}" == "SRR12207407" ]] \
    || [[ "${sra}" == "SRR12227931" ]] \
    || [[ "${sra}" == "SRR12212519" ]] \
    || [[ "${sra}" == "SRR12227929" ]] \
    || [[ "${sra}" == "SRR8788211" ]]
  then
    echo ""
    echo "Generating MD5 checksums for ${R1_uncompressed_array[${fastq}]} and ${R2_uncompressed_array[${fastq}]}."
    echo ""
    # Generate MD5 checksums of uncompressed FastQs
    {
      md5sum "${R1_uncompressed_array[${fastq}]}"
      md5sum "${R2_uncompressed_array[${fastq}]}"
    } >> uncompressed_fastqs-"${SLURM_JOB_ID}".md5

    # Gzip FastQs; NF Core RNAseq requires gzipped FastQs as inputs
    echo "Compressing FastQ files."
    if [ ! -f "${R1_uncompressed_array[${fastq}]}.gz" ]
    then 
      gzip --keep "${R1_uncompressed_array[${fastq}]}"
      gzip --keep "${R2_uncompressed_array[${fastq}]}"
    else 
      echo "${R1_uncompressed_array[${fastq}]}.gz already exists. Skipping."
    fi
    echo ""



  fi
done


# Create array of fastq R1 files
# Set filename pattern
R1_array=("${reads_dir}"/*_1.fastq.gz)

# Create array of fastq R2 files
# Set filename pattern
R2_array=("${reads_dir}"/*_2.fastq.gz)

# Check that R1 and R2 array sizes match
# Exit if mismatch
if [[ "${#R1_array[@]}" != "${#R2_array[@]}" ]]
  then
    echo ""
    echo "Read1 and Read2 compressed FastQ array sizes don't match."
    echo "Confirm all expected compressed FastQs are present in ${reads_dir}"
    echo ""

    exit 1
fi

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${!R1_array[@]}"
do
  echo ""
  echo "Generating MD5 checksums for ${R1_array[${fastq}]} and ${R2_array[${fastq}]}."
  echo ""
  # Generate MD5 checksums for compressed FastQs used in NF Core RNAseq analysis
  {
    md5sum "${R1_array[${fastq}]}"
    md5sum "${R2_array[${fastq}]}"
  } >> input_fastqs-"${SLURM_JOB_ID}".md5

  # Strip leading path
  no_path="${R1_array[${fastq}]##*/}"

  # Grab SRA name
  sra=$(echo "${no_path}" | awk -F "_" '{print $1}')

  # Set tissue type
  if [[ "${sra}" == "SRR12218868" ]]
  then
    tissue="heart"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv

  elif [[ "${sra}" == "SRR12218869" ]] \
    || [[ "${sra}" == "SRR12226692" ]]

  then
    tissue="gonad"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv

  elif [[ "${sra}" == "SRR12218870" ]] \
    || [[ "${sra}" == "SRR12226693" ]]
  then
    tissue="ctenidia"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv

  elif [[ "${sra}" == "SRR12207404" ]] \
    || [[ "${sra}" == "SRR12207405" ]] \
    || [[ "${sra}" == "SRR12227930" ]] \
    || [[ "${sra}" == "SRR12207406" ]] \
    || [[ "${sra}" == "SRR12207407" ]] \
    || [[ "${sra}" == "SRR12227931" ]]
  then
    tissue="juvenile"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv

  elif [[ "${sra}" == "SRR12212519" ]] \
    || [[ "${sra}" == "SRR12227929" ]] \
    || [[ "${sra}" == "SRR8788211" ]]
  then
    tissue="larvae"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv
  fi

done

echo "Beginning NF-Core RNAseq pipeline..."
echo ""
# Run NF Core RNAseq workflow
nextflow run ${nf_core_rnaseq} \
-profile singularity \
-c ${nf_core_rnaseq_config} \
--input sample_sheet-"${SLURM_JOB_ID}".csv \
--outdir ${wd} \
--multiqc_title "20220321-pgen-nextflow_rnaseq-tissues-${SLURM_JOB_ID}" \
--fasta ${genome_fasta} \
--gff ${genome_gff} \
--save_reference \
--gtf_extra_attributes gene_name \
--gtf_group_features gene_id \
--featurecounts_group_type gene_biotype \
--featurecounts_feature_type exon \
--trim_nextseq 20 \
--save_trimmed \
--aligner star_salmon \
--pseudo_aligner salmon \
--min_mapped_reads 5 \
--save_align_intermeds \
--rseqc_modules bam_stat,\
inner_distance,\
infer_experiment,\
junction_annotation,\
junction_saturation,\
read_distribution,\
read_duplication

##############################################################
# Copy config file for later reference, if needed
cp "${nf_core_rnaseq_config}" .

# Document programs in PATH (primarily for program version ID)
{
  date
  echo ""
  echo "System PATH for $SLURM_JOB_ID"
  echo ""
  printf "%0.s-" {1..10}
  echo "${PATH}" | tr : \\n
} >> system_path.log

echo "Finished logging system PATH"

RESULTS

Runtime was surprisingly fast, at just a bit over 2.3 days…

Screencap of NF-Core RNAseq runtime on Mox showing runtime of 2 days, 9hrs and some change

There is a ton of data here to unpack, so I’ll just link to some of the most useful files.

Refer to the NF-Core/RNAseq “Output docs” for all the things generated, as well as a thorough explanation of the MultiQC Report:

Also, the NF-Core/RNAseq pipeline provides a nice progress report to show you which options are running/completed. This screenshot is from after the pipeline finished successfully:

Screencap of NF-Core/RNAseq pipeline upon completion. Shows percentages and checkboxes to indicate each process completion. Also provides a list of samples passing STAR mapping threshold and pipeline runtime.

Output folder:

from Sam’s Notebook https://ift.tt/7YAu6j0
via IFTTT

#ifttt, #sams-notebook

Nextflow – Trials and Tribulations of Installing and Using NF-Core RNAseq

INSTALLATION

For some reason, I struggled to get things installed correctly. Installing Nextflow was straightforward and didn’t present any issues. The first bump in the road came from the installation directions (https://ift.tt/YwEzc3j) for NF-Core RNAseq. The instructions say to “Download the pipeline and test it…”. Well, that doesn’t indicate how/where to download. To add to the confusion (it will become evident how in a bit), the previous step says to “only use Conda as a last resort”. Okay, so how do I get NF-Core RNAseq? I downloaded the latest release from the GitHub Releases page. After doing that, I couldn’t figure out how to get even a basic command (e.g. nf-core download) to run. I eventually discovered that I likely needed to install nf-core Tools. However, doing that meant I would have to use either Conda or Docker. Docker seemed overly complicated (I didn’t want to re-learn container/image usage within the confines of a SLURM environment like Mox – this whole thing was supposed to be quick and easy!), and using Conda was strongly discouraged for NF-Core RNAseq. For “simplicity”, I went the Conda route. This is where everything went awry…

The pipeline is designed to have internet access so it can update and download newer config files automatically. Knowing that a running SLURM job wouldn’t have internet access, I followed the instructions for downloading a pipeline for offline use. Running the nf-core download command initiated an interactive mode to specify things like which version to download, whether or not to provide a directory for $NXF_SINGULARITY_CACHEDIR, and whether or not to compress the Singularity container that would be downloaded. Setting $NXF_SINGULARITY_CACHEDIR seemed to cause issues. Additionally, the Singularity download was surprisingly fast.

As it all turns out, I think the issue was that I had both downloaded NF-Core RNAseq from the GitHub release and installed it via Conda. Essentially, Conda creates the same directory structure in the same location as the install from the GitHub release, and that, I think, is what broke everything.

To fix the installation, I uninstalled everything related to NF-Core RNAseq, but left the Nextflow installation in place. Then, I installed nf-core Tools (and, subsequently, the RNAseq pipeline) using only the Conda installation. Once I did that (and activated the nf-core/rnaseq conda environment) and ran nf-core download, the Singularity download took significantly longer than in previous attempts, which led me to think that things were finally working.
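
For posterity, the working sequence looked roughly like this (a sketch from memory; the bioconda package name and the nf-core download options are my best recollection and may differ between nf-core Tools versions):

# Create a conda environment containing nf-core Tools
conda create --name nf-core_env -c conda-forge -c bioconda nf-core

# Activate the environment
conda activate nf-core_env

# Download the RNAseq pipeline (v3.6), including its Singularity containers, for offline use
nf-core download rnaseq --revision 3.6 --container singularity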

RUNNING

After finally getting things installed properly, I encountered a number of problems just trying to get the pipeline to run successfully. Several small issues led to a lot of troubleshooting:

Error:

Transcript names not matching GFF.

Solution:

If supplying a transcriptome, the names in the FastA description lines have to match those in the GFF. Since we de novo assembled our transcriptome using Trinity, it has Trinity IDs, which do not correspond to the genome assembly/annotation. So, I dropped the transcriptome option for the run.
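
A quick way to spot this kind of mismatch up front (an illustrative check with placeholder filenames, not a command from the actual run):

# FastA description lines (Trinity IDs look like ">TRINITY_DN1000_c115_g5_i1")
grep '^>' transcriptome.fa | head

# GFF mRNA attributes (genome annotation IDs look like "ID=PGEN_.00g000010.m01")
awk -F'\t' '$3 == "mRNA" {print $9}' genome.gff | head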

Error:

Unable to parse config file: '/gscratch/srlab/programs/nf-core-rnaseq-3.6/workflow/nextflow.config'

  Cannot compare java.lang.Boolean with value 'true' and java.lang.Integer with value '0'

Solution:

The --trim_nextseq option actually requires an integer. The documentation didn’t indicate that at the time, but may now, as I brought this to the developers’ attention via their Slack group and the discussion indicated they would add this info to the website.

Error:

Unable to parse config file: '/gscratch/srlab/programs/nf-core-rnaseq-3.6/workflow/nextflow.config'

Solution:

When using a custom config file, one cannot use the check_max() function. I had just been modifying the values for RAM, CPUs, and run times that were in the base.config file, which utilizes the check_max() function. After a bunch of sleuthing in the NF-Core RNAseq GitHub and Slack groups, I finally discovered this solution. Here’s what the custom, functional config looks like:

  • /gscratch/srlab/programs/nf-core-rnaseq-3.6/configs/conf/base-srlab_500GB_node.config
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    nf-core/rnaseq Nextflow base config file modified by Sam White on 20220321 for use
    on Roberts Lab 500GB Mox node.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    A 'blank slate' config file, appropriate for general use on most high performance
    compute environments. Assumes that all software is installed and available on
    the PATH. Runs in `local` mode - all jobs will be run on the logged in environment.
----------------------------------------------------------------------------------------
*/

process {

    cpus   = 28
    memory = 500.GB
    time   = 30.d

    errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'

    // Process-specific resource requirements
    withLabel:process_low {
        cpus   = 28
        memory = 500.GB
        time   = 30.d
    }
    withLabel:process_medium {
        cpus   = 28
        memory = 500.GB
        time   = 30.d
    }
    withLabel:process_high {
        cpus   = 28
        memory = 500.GB
        time   = 30.d
    }
    withLabel:process_long {
        time   = 30.d
    }
    withLabel:process_high_memory {
        memory = 500.GB
    }
    withLabel:error_ignore {
        errorStrategy = 'ignore'
    }
    withLabel:error_retry {
        errorStrategy = 'retry'
        maxRetries    = 2
    }
    withName:CUSTOM_DUMPSOFTWAREVERSIONS {
        cache = false
    }
}

Error:

Error Message: Strand is neither '+' nor '-'!

ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA (rsem/Panopea-generosa-v1.0.fa)'
Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA (rsem/Panopea-generosa-v1.0.fa)` terminated with an error exit status (255)
Command executed:
  rsem-prepare-reference \
      --gtf Panopea-generosa-v1.0_genes.gtf \
      --num-threads 28 \
       \
      rsem/Panopea-generosa-v1.0.fa \
      rsem/genome
  
  cp rsem/genome.transcripts.fa .
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA":
      rsem: $(rsem-calculate-expression --version | sed -e "s/Current version: RSEM v//g")
      star: $(STAR --version | sed -e "s/STAR_//g")
  END_VERSIONS
Command exit status:
  255
Command output:
  rsem-extract-reference-transcripts rsem/genome 0 Panopea-generosa-v1.0_genes.gtf None 0 rsem/Panopea-generosa-v1.0.fa
  "rsem-extract-reference-transcripts rsem/genome 0 Panopea-generosa-v1.0_genes.gtf None 0 rsem/Panopea-generosa-v1.0.fa" failed! Plase check if you provide correct parameters/options for the pipeline!
Command error:
  The GTF file might be corrupted!
  Stop at line : Scaffold_02    GenSAS_5d82b316cd298-trnascan   transcript      27467   27541   38.1    .       .       transcript_id "21513.GS22252506.PGEN_.tRNA00000001"; gene_id "21513.GS22252506.PGEN_.tRNA00000001"; Name "Thr"; anti_codon "CGT"; gene_no "1376";

Solution:

As it turns out, this wasn’t an issue with the pipeline; it’s really an issue with how RSEM handles GTF strand parsing. Even though the GFF spec indicates that the strand column can be one of +, -, ., or ?, RSEM only parses for + or -. And, as it turns out, our genome GFF has some . values for strand info. Looking through our “merged” GenSAS GFF, there are two sets of annotations that only have . for strand info (GenSAS_5d82b316cd298-trnascan & RNAmmer-1.2). So, a decision needed to be made: convert these sets’ strands to an “artificial” value (e.g. set all of them to +), or eliminate them from the input GFF. I ended up converting the GenSAS_5d82b316cd298-trnascan strands to + and eliminating RNAmmer-1.2 from the final input GFF.
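
For the record, identifying which annotation sources carry . strands only takes a one-liner (a sketch; the filename assumes the biotype GFF from the earlier entry):

# List annotation sources (column 2) for features with `.` in the strand column (column 7)
awk -F'\t' '!/^#/ && $7 == "." {print $2}' Panopea-generosa-v1.0.a4_biotype.gff | sort -u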

Overall, it was a bit of an arduous process, but it’s all running now… Will update if I encounter any more hurdles.

from Sam’s Notebook https://ift.tt/d8t17hW
via IFTTT

#ifttt, #sams-notebook

Data Wrangling – P.generosa Genomic Feature FastA Creation

Steven wanted me to generate FastA files (GitHub Issue) for Panopea generosa (Pacific geoduck) coding sequences (CDS), genes, and mRNAs. One of the primary needs, though, was to have an ID that could be used for downstream table joining/mapping. I ended up using a combination of GFFutils and bedtools getfasta. I took advantage of being able to create a custom name column in BED files to generate the desired FastA description lines, with IDs that can identify, and map, CDS, genes, and mRNAs across FastAs and GFFs.
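
The gist of the approach is sketched below (coordinates are made up; note that the ::Scaffold_01:2-125-style suffix seen in the description lines further down is appended by bedtools itself when -name is used):

# BED name column (field 4) concatenates CDS ID, parent mRNA ID, and parent gene ID with pipes:
# Scaffold_01   2   125   PGEN_.00g000010.m01.CDS01|PGEN_.00g000010.m01|PGEN_.00g000010

# Extract sequences; -name uses the BED name column (plus coordinates) as the FastA description line
bedtools getfasta -name -fi Panopea-generosa-v1.0.fa -bed cds.bed -fo cds.fa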

This was all documented in a Jupyter Notebook:

GitHub:

NB Viewer:


RESULTS

Output folder:


CDS FastA description lines look like this:

  • >PGEN_.00g000010.m01.CDS01|PGEN_.00g000010.m01|PGEN_.00g000010::Scaffold_01:2-125

Explanation for CDS:

  • PGEN_.00g000010.m01.CDS01: Unique sequence ID.
  • PGEN_.00g000010.m01: “Parent” ID. Corresponds to unique mRNA ID.
  • PGEN_.00g000010: “Parent” ID. Corresponds to unique gene ID.
  • Scaffold_01: Originating scaffold.
  • 2-125: Sequence coordinates from scaffold mentioned above.

mRNA FastA description lines look like this:

  • >PGEN_.00g000030.m01|PGEN_.00g000030::Scaffold_01:49248-52578

Explanation for mRNA:

  • PGEN_.00g000030.m01: Unique sequence ID.
  • PGEN_.00g000030: “Parent” ID. Corresponds to unique gene ID.
  • Scaffold_01: Originating scaffold.
  • 49248-52578: Sequence coordinates from scaffold mentioned above.

from Sam’s Notebook https://ift.tt/8IljR1e
via IFTTT

#ifttt, #sams-notebook

Fri. Feb. 18, 2022

NOPP-gigas-ploidy-temp analysis – Part 1 –
Project name: NOPP-gigas-ploidy-temp
Funding source: National Oceanographic Partnership Program
Species: Crassostrea gigas
Variables: ploidy, elevated seawater temperature, desiccation

Background: We sent 72 samples for RNA sequencing (TagSeq) to the Genomic Sequencing and Analysis Facility at the University of Texas at Austin. Here is a…

from Matthew N. George, Ph.D. https://ift.tt/CqRAV3Z
via IFTTT

#ifttt, #matthew-n-george, #ph-d

Fri. Mar. 19, 2022

Honeywell UDA2182 controller setup –
Background: Pictures of a working setup for the Honeywell UDA2182 dual analyzer that is controlling to a pH set point after a Durafet probe and CO2/solenoids are installed. CO2 is injected into a header tank using an ozone injector.

Pictures: UDA2182 home screen; Relay – connected to solenoid that is…

from Matthew N. George, Ph.D. https://ift.tt/rLdGEl5
via IFTTT

#ifttt, #matthew-n-george, #ph-d

Thu. Mar. 17, 2022

Oyster family transfer –
Background: Picked up four C. gigas families (2n/3n within each family) from Pacific Hybreed in Manchester. Parents came from Hood Canal, Baywater. I dropped them off in heath trays at Jamestown S’Klallam’s hatchery at Pt. Whitney. Pictures are provided below.

Pictures: Heath trays w/ labels; Tray 1 – F5 (2n)…

from Matthew N. George, Ph.D. https://ift.tt/z8VD3wS
via IFTTT

#ifttt, #matthew-n-george, #ph-d

Sun. Mar. 13, 2022

Monday Meeting Notes –
9am – FISH 441 meeting: JupyterHub; link to course site; publish course to get syllabus out there; update all dates to spring; outline activities by week. 11am – Journal club. 2pm – OSU/WA meeting: submit WRAC pre-proposal (here is the RFP; due April 15th); submit USDA-NIFA proposal…

from Matthew N. George, Ph.D. https://ift.tt/DozRtjc
via IFTTT

#ifttt, #matthew-n-george, #ph-d

Data Analysis – C.virginica BSseq Unmapped Reads Using MEGAN6

After performing DIAMOND BLASTx and DAA “meganization” on 20220302, the next step was to import the DAA files into MEGAN6 for analyzing the resulting taxonomic assignments of the Crassostrea virginica (Eastern oyster) unmapped BSseq reads that Steven generated.

“Meganized” DAA files were imported into MEGAN6 (v.6.21.5) via the “Import from BLAST” dialog menu. All paired-read DAA files were imported together (but not using the “paired reads” option, since I didn’t provide the corresponding FastQ files – no real reason; I just didn’t think it necessary for this particular analysis), using the megan-map-Jan2021.db for taxonomy. This importation generated corresponding RMA6 files (i.e. one RMA6 file for each pair of DAA files). In retrospect, I should have just converted the DAA files to RMA6 during the DIAMOND BLASTx and DAA “meganization” on 20220302. It would’ve saved a lot of memory-related issues when trying to import many DAA files in a single MEGAN6 session…
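
For reference, that conversion can be done from the command line with MEGAN’s daa2rma tool; something along these lines (a sketch; I haven’t verified these exact options against this MEGAN version):

# Convert a "meganized" DAA file directly to RMA6, skipping the memory-hungry GUI import
daa2rma --in sample.daa --out sample.rma6 --mapDB megan-map-Jan2021.db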

After conversion to RMA6, a new “Comparison” was created, which allows visualization of taxonomic assignments across all samples, simultaneously.


RESULTS

Output folder:

It looked like the bulk of reads in all samples (i.e. > 75%) were assigned to the genus Crassostrea (see images at bottom of post).


Wordcloud of taxonomic assignments of reads for each sample (Genus level):

Wordcloud of taxonomic assignments for all unmapped BSseq reads in each sample at the "Genus" level generated using MEGAN6 software, with the genus "Crassostrea" as the most prominent of all the words present. In fact, all other words are so small as to be illegible.


Phylogenetic tree of taxonomic assignments of reads for each sample (Genus level):

Phylogenetic tree of unmapped BSseq reads generated by MEGAN6 software showing that most reads in all samples examined are assigned to the genus "Crassostrea"


And, here’s the tree at the species level to help confirm that there’s not a bunch of C.gigas contamination and that it’s primarily Crassostrea virginica (Eastern oyster):

Phylogenetic tree of unmapped BSseq reads generated by MEGAN6 software showing that most reads in all samples examined are assigned to the species "Crassostrea virginica"


So, I’m not sure why they’re not getting mapped in the first place.

I’m assuming these came from Bismark? If so, the manual provides a bit of insight into capturing unmapped reads. Additionally, a comparison of unmapped reads and ambiguous reads might allow a better grasp of what’s happening:

--un

Write all reads that could not be aligned to the file _unmapped_reads.fq.gz in the output directory. Written reads will appear as they did in the input, without any translation of quality values that may have taken place within Bowtie or Bismark. Paired-end reads will be written to two parallel files with _1 and _2 inserted in their filenames, i.e. unmapped_reads_1.fq.gz and unmapped_reads_2.fq.gz. Reads with more than one valid alignment with the same number of lowest mismatches (ambiguous mapping) are also written to unmapped_reads.fq.gz unless --ambiguous is also specified.

--ambiguous

Write all reads which produce more than one valid alignment with the same number of lowest mismatches or other reads that fail to align uniquely to _ambiguous_reads.fq. Written reads will appear as they did in the input, without any of the translation of quality values that may have taken place within Bowtie or Bismark. Paired-end reads will be written to two parallel files with _1 and _2 inserted in their filenames, i.e. _ambiguous_reads_1.fq and _ambiguous_reads_2.fq. These reads are not written to the file specified with --un.
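
If these reads did come from a Bismark run, capturing the unmapped and ambiguous reads would look something like this (a sketch with placeholder paths, not the actual command that was used):

# Align bisulfite reads, writing unmapped and ambiguous reads to separate files
bismark --un --ambiguous /path/to/bismark_genome_folder -1 sample_R1.fq.gz -2 sample_R2.fq.gz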

from Sam’s Notebook https://ift.tt/MBEHKsJ
via IFTTT

#ifttt, #sams-notebook

Data Wrangling – P.generosa Genome GFF Conversion to GTF Using gffread

Steven asked in this GitHub Issue to convert our Panopea generosa (Pacific geoduck) genomic GFF to a GTF for use in the 10x Genomics Cell Ranger software. This conversion was performed using GffRead in a Jupyter Notebook.
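
The heart of the conversion is a single gffread call, roughly like this (a sketch; the input/output filenames are assumptions based on the genome files used elsewhere in this notebook):

# -T : output GTF instead of the default GFF3
gffread Panopea-generosa-v1.0.a4.gff -T -o Panopea-generosa-v1.0.a4.gtf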

Jupyter Notebook:


RESULTS

Output folder:

from Sam’s Notebook https://ift.tt/ovbg2TP
via IFTTT

#ifttt, #sams-notebook