January Summary

Thursday, January 5

Happy New Year! Alright, in my last post I was ironing out some issues related to temperature vs. bitterness. Goals for this month (I’ll keep this short, as I’m writing this during lab meeting!):

Modeling Visual Positives:
- For each final weighted model, create a graph of each variable’s relationship to bitterness
- Continue writing up results and discussion
- Outline introduction

Examining Short-Term Temperature Response:
- Get in touch with the journal – I’ve fixed all issues, but they haven’t gotten back to me, so I need to communicate with them!
- Set up a structure in the Google Doc where I methodically lay out the reviewer comments alongside my existing paper, and start fixing it up. (Note: I’ve heard back from Reviewer 1 but not Reviewer 2.)
- Start fixing up graphs to be paper-quality

Overall:
- Choose a day for the thesis defense

That finishes it up – let’s see how it goes!

from Aidan F. Coyle https://ift.tt/tPM5TBq
via IFTTT

December Summary

Monday, Dec. 5th

Happy holidays – first post of December! Took the first few days of the month off to head back to Cleveland (congrats on the wedding, David!). To recap, I’ve got two papers that I’m working on at the moment. First, a paper examining the relationship of various environmental and physical variables to visually BCS-positive C. bairdi, using GLMMs on a huge ADFG dataset. This is the focus of most of my work at this point – I’m in the writing-up stage! The second is a paper that uses RNAseq to examine the response of infected C. bairdi to short-term temperature changes. I’ve submitted this one to Frontiers in Marine Science, but still have edits to make. My to-do lists for each task are below:

Modeling Visual Positives:
- For each final weighted model, create a graph of each variable’s relationship to bitterness
- Upload existing graphs to…

from Aidan F. Coyle https://ift.tt/BU7CKmv
via IFTTT

November Summary

Thanksgiving Update: 2022-11-23

Alright, it’s been quite a while since I’ve posted, so it’s definitely time for an update (apologies for the delay)!

Big news: Got my paper submitted (the one examining gene expression in C. bairdi and Hematodinium)! Hopefully I’ll hear back soon, but I’m glad the bulk of it is over!

Other updates: I’ve begun working on a paper on the modeling work I’ve been doing! Finished up the Methods section (well, almost – just need to confirm my methods are sound), and have been moving on to the Results. Next goal is to produce some pretty figures for that! One big obstacle to getting there is saving the model results – I ran all four models once and the results sure look saved just fine, but to confirm I’m using the correct read-in function, I need to run the model again so I can have a before/after of reading…

from Aidan F. Coyle https://ift.tt/CL7Hh19
via IFTTT

September Summary

Alright, monthly goals are to do the following:
- Get feedback on models, use proper diagnostics
- Interpret modeling results
- Finish methods of modeling paper, start on results/intro

We also have some secondary goals – these are more dependent on when I get the edits back:
- Another round of edits on the temperature analysis paper

Thursday, September 1st

Modeling: Crafted some nice loops to do some diagnostics. Found a persistent issue – when I use simulate() to check model accuracy and then compare the number of simulated BCS+ crab vs. the true number, it persistently underestimates (by a little for the All Crabs model, by a lot for the Females model) or overestimates (for the Males model) the number of BCS+ crab. I’m not sure exactly what is going on here. Could be a minor issue with how random effects are specified in simulate(), could be a more major issue with the…

from Aidan F. Coyle https://ift.tt/68ae7LW
via IFTTT

August Summary

Alright, monthly goals are to do the following:
- Clean survey data to prepare for modeling (ASAP)
- Carry out modeling project (by end of month)

We also have some secondary goals – these are more dependent on when I get the edits back:
- Incorporate Steven’s edits in the proposal and submit
- Incorporate Steven’s edits in the paper and finish Draft 2

Monday, August 1st

After a lot (a lot) of edits and revisions to the code, I managed to clean the (very messy) temperature files from all survey legs. This typically required re-doing the read-in process -…

from Aidan F. Coyle https://ift.tt/JLnftxC
via IFTTT

July Summary

I took the first week of July off – headed back to Ohio for Pop’s funeral – but now it’s back to daily notebook posts! Goal is to do the following:
- Incorporate edits on proposal (ASAP)
- Finish first draft of paper (ASAP)
- Clean survey data to prepare for modeling (by end of month)

Monday, July 11th

Spent the day examining Chelsea’s edits to my proposal! There are varying degrees of difficulty among them; I focused on the easy ones for today. I tend to not have the easiest time incorporating edits (I get stressed out and avoid them), so this might take longer than it should. Still, trying to push myself to get ’em back quick!

Tuesday, July 12th

Another day of proposal editing! Got to a fairly good place – hopefully I should have them all finished by tomorrow. Need to rewrite a paragraph or two, and then should be set…

from Aidan F. Coyle https://ift.tt/G1ve5Y2
via IFTTT

June Summary

Alright, I’ve been working for a while without a post! That’s because I’ve been nearly entirely focused on writing, which I haven’t considered post-worthy. But after a lab meeting, Chris suggested making a quick summary of each day’s work, so I can still track progress without writing up full posts! So that’s what I’m doing!

Thurs, June 23rd

Got a good amount done today – generally focused on tightening up my methods, correcting word choices, etc. Remade my two WGCNA heatmaps (the ones I’ll be using as figures in my paper) to make them more publication-worthy. Specific changes include:
- Adding contig counts to eigengene labels
- Removing the title
- Changing x-axis labels to be more publication-worthy
- Rotating x-axis labels
- Removing the “ME” prefix from all eigengene labels

I also moved the structure of the scripts around a bit – I moved the heatmap to the end of…

from Aidan F. Coyle https://ift.tt/YRtyvzP
via IFTTT

Server Maintenance – Fix Server Certificate Authentication Issues

We had been encountering issues when linking to images in GitHub (e.g. notebooks, Issues/Discussions) hosted on our servers (primarily Gannet). Images always showed up as broken links and, with some work, we could see an error message related to server authentication. More recently, I also noticed that Jupyter Notebooks hosted on our servers could not be viewed in NB Viewer: attempting to view one results in a 404 error, with a note regarding server certificate problems. Finally, the most annoying issue came up when running the shell program wget to retrieve files from our servers. wget always threw an error regarding our server certificates, and the only way to run it without this error was to add the option --no-check-certificate (which, thankfully, the wget error message itself suggested).
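
In practice, the workaround looked something like this (the file URL here is a stand-in, not an actual file):

# Fails with a certificate verification error:
wget https://gannet.fish.washington.edu/some/file.txt

# Works, but skips certificate verification entirely (not ideal):
wget --no-check-certificate https://gannet.fish.washington.edu/some/file.txt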

The perplexing thing is that we have InCommon CA certificates provided by the University of Washington for each of our servers, so it was baffling that we kept encountering these certificate-related errors. I finally got fed up with the problem and decided to do some sleuthing. I had tried this in the past, but never came across anything that could point me to what was wrong. Well, today, I stumbled across the problem:

We didn’t have an intermediate InCommon certificate!

I wasn’t previously aware this was a requirement, as it was not mentioned when I initially requested certificates from our department IT group. But, during my (remarkably) brief searching, I stumbled across this guide for installing a certificate on a Synology NAS (the brand of all three of our servers) and noticed that the process included an intermediate certificate. The phrasing of the instructions also implied that it was necessary, which is what prompted me to explore this further. Of note, not only did department IT not mention intermediate certificates, but the Synology “wizard” that guides the user through the certificate generation/upload process doesn’t require an intermediate certificate, suggesting it’s optional (which, I guess, it technically is; but it’d be nice if they explained the impact of not providing one).

Searching for UW intermediate server certificates led me to this page, which actually provides the InCommon RSA Server intermediate certificates for UW! Doing a bit more research, I found the UW page describing InCommon CA server certificates. On that page is this crucial piece of information:

Server admins must install the InCommon CA intermediate certificate.

Notably, that page doesn’t provide a link to find the InCommon CA intermediate certificate…

So, seeing those two pages made me realize that the solution was, indeed, installing the InCommon CA intermediate certificate from that UW page! To do so, I generated a new Certificate Signing Request (CSR) for each server via the Synology interface, sent them off to our department IT group, and then used the Synology interface to import the updated server certificate and the intermediate certificate.
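
One quick way to verify this kind of fix (a sketch – the hostname is a placeholder for one of our servers) is to ask the server for the certificate chain it presents, which should now include the intermediate certificate:

# Show the certificate chain presented by the server;
# a correctly configured server includes the InCommon intermediate.
openssl s_client -connect gannet.fish.washington.edu:443 -showcerts </dev/null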

And with that, I have managed to resolve all of the issues described in the initial paragraph! It is a great day! Huzzah!

from Sam’s Notebook https://ift.tt/GOWKMbC
via IFTTT

Differential Gene Expression – P.generosa DGE Between Tissues Using Nextflow NF-Core RNAseq Pipeline on Mox

Steven asked that I obtain relative expression values for various geoduck tissues (GitHub Issue). So, I decided to use this as an opportunity to try a Nextflow pipeline. There’s an RNAseq pipeline, NF-Core RNAseq, which I decided to use. The pipeline appears to be ridiculously thorough (e.g. trims, removes gDNA/rRNA contamination, allows for multiple aligners to be used, quantifies/visualizes feature assignments by reads, performs differential gene expression analysis and visualization), all in one package. Sounds great, but I did have some initial problems getting things up and running. Overall, getting things set up to actually run took longer than the actual pipeline run! Oh well, it’s a learning process, so that’s not totally unexpected.

For this pipeline run, I made some modifications to the genome GFF input file used. First, I attempted to create a “gene_biotype” description for the pipeline to use to get some visualizations of read assignments to different components of the genome. I did that in the following fashion:

# Copies header to new GFF
awk 'NR < 4 {print $0}' Panopea-generosa-v1.0.gff > Panopea-generosa-v1.0_biotype.gff
# Adds "gene_biotype" to end of line that matches feature field ($3)
awk 'NR > 3 {print $0";gene_biotype="$3}' Panopea-generosa-v1.0.gff >> Panopea-generosa-v1.0_biotype.gff
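
To make the effect concrete, here’s a hypothetical before/after (the entry itself is made up – only the gene_biotype suffix comes from the awk above):

# Before (tab-separated; the feature field is column 3):
Scaffold_01	example_source	mRNA	100	2000	.	+	.	ID=example_transcript01
# After:
Scaffold_01	example_source	mRNA	100	2000	.	+	.	ID=example_transcript01;gene_biotype=mRNA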

Then, I modified it further: converted the tRNA strand field to + instead of . (to avoid RSEM errors regarding strand info) and removed RNAmmer features (also to avoid RSEM strand errors).

# Converts strand field ($7) to `+` instead of `.`.
# Works just on tRNA entries
awk '$2 == "GenSAS_5d82b316cd298-trnascan" {$7="+"}1' Panopea-generosa-v1.0.a4_biotype.gff > Panopea-generosa-v1.0.a4_biotype-trna_strand_converted.gff
# Prints all lines which are not rRNA
awk '$2 != "RNAmmer-1.2"' Panopea-generosa-v1.0.a4_biotype-trna_strand_converted.gff > Panopea-generosa-v1.0.a4_biotype-trna_strand_converted-no_RNAmmer.gff

Then, this was all run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20220323-pgen-nextflow_rnaseq-tissues
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=17-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20220323-pgen-nextflow_rnaseq-tissues

# Script to run Nextflow NF Core RNAseq pipeline for RNAseq analysis of P.generosa, per this GitHub Issue:
# https://github.com/RobertsLab/resources/issues/1423

# See variable assignments below for input files used: genome, GFF, transcriptome
# List of input FastQs will be generated during run in: sample_sheet-"${SLURM_JOB_ID}".csv
# Custom config file for maximum memory and CPU thread settings

# Outputs explanations are here: https://nf-co.re/rnaseq/3.6/output
# Input parameter explanations are here: https://nf-co.re/rnaseq/3.6/parameters

###################################################################################

# These variables need to be set by user

## Assign Variables

## PROGRAMS ##
# NF Core RNAseq workflow directory
nf_core_rnaseq="/gscratch/srlab/programs/nf-core-rnaseq-3.6/workflow"

# NF Core RNAseq custom config file
nf_core_rnaseq_config=/gscratch/srlab/programs/nf-core-rnaseq-3.6/configs/conf/base-srlab_500GB_node.config

## FILES AND DIRECTORIES ##
# Working directory
wd=$(pwd)

# RNAseq FastQs directory
reads_dir=/gscratch/srlab/sam/data/P_generosa/RNAseq

# Genome FastA
genome_fasta=/gscratch/srlab/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.fa

# Genome GFF3
# This was manually modified by me to add gene_biotype to end of each entry.
# Improves NF-Core RNAseq pipeline analysis/visualization to have this info present.
genome_gff=/gscratch/srlab/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.a4_biotype-trna_strand_converted-no_RNAmmer.gff

## INITIALIZE ARRAYS ##
# Leave empty!!
R1_array=()
R2_array=()
R1_uncompressed_array=()
R2_uncompressed_array=()

###################################################################################

# Exit script if a command fails
set -e

# Load Anaconda
# Unknown why this is needed, but Anaconda will not run if this line is not included.
. "/gscratch/srlab/programs/anaconda3/etc/profile.d/conda.sh"

# Activate NF-core conda environment
conda activate nf-core_env

# Load Singularity Mox module for NF Core/Nextflow
module load singularity


# NF Core RNAseq sample sheet header
sample_sheet_header="sample,fastq_1,fastq_2,strandedness"
printf "%s\n" "${sample_sheet_header}" >> sample_sheet-"${SLURM_JOB_ID}".csv

# Create array of original uncompressed fastq R1 files
# Set filename pattern
R1_uncompressed_array=("${reads_dir}"/*_1.fastq)

# Create array of original uncompressed fastq R2 files
# Set filename pattern
R2_uncompressed_array=("${reads_dir}"/*_2.fastq)

# Check that R1 and R2 array sizes match to confirm all expected samples are present
# Exit if mismatch
if [[ "${#R1_uncompressed_array[@]}" != "${#R2_uncompressed_array[@]}" ]]
then
  echo ""
  echo "Uncompressed array sizes don't match."
  echo "Confirm all expected FastQs are present in ${reads_dir}"
  echo ""

  exit 1
fi

# Create list of original uncompressed fastq files
## Uses parameter substitution to strip leading path from filename
for fastq in "${!R1_uncompressed_array[@]}"
do
  # Strip leading path
  no_path="${R1_uncompressed_array[${fastq}]##*/}"

  # Grab SRA name
  sra=$(echo "${no_path}" | awk -F "_" '{print $1}')

  # Only gzip matching FastQs
  # Only generate MD5 checksums for matching FastQs
  if [[ "${sra}" == "SRR12218868" ]] \
    || [[ "${sra}" == "SRR12218869" ]] \
    || [[ "${sra}" == "SRR12226692" ]] \
    || [[ "${sra}" == "SRR12218870" ]] \
    || [[ "${sra}" == "SRR12226693" ]] \
    || [[ "${sra}" == "SRR12207404" ]] \
    || [[ "${sra}" == "SRR12207405" ]] \
    || [[ "${sra}" == "SRR12227930" ]] \
    || [[ "${sra}" == "SRR12207406" ]] \
    || [[ "${sra}" == "SRR12207407" ]] \
    || [[ "${sra}" == "SRR12227931" ]] \
    || [[ "${sra}" == "SRR12212519" ]] \
    || [[ "${sra}" == "SRR12227929" ]] \
    || [[ "${sra}" == "SRR8788211" ]]
  then
    echo ""
    echo "Generating MD5 checksums for ${R1_uncompressed_array[${fastq}]} and ${R2_uncompressed_array[${fastq}]}."
    echo ""
    # Generate MD5 checksums of uncompressed FastQs
    {
      md5sum "${R1_uncompressed_array[${fastq}]}"
      md5sum "${R2_uncompressed_array[${fastq}]}"
    } >> uncompressed_fastqs-"${SLURM_JOB_ID}".md5

    # Gzip FastQs; NF Core RNAseq requires gzipped FastQs as inputs
    echo "Compressing FastQ files."
    if [ ! -f "${R1_uncompressed_array[${fastq}]}.gz" ]
    then 
      gzip --keep "${R1_uncompressed_array[${fastq}]}"
      gzip --keep "${R2_uncompressed_array[${fastq}]}"
    else 
      echo "${R1_uncompressed_array[${fastq}]}.gz already exists. Skipping."
    fi
    echo ""



  fi
done


# Create array of fastq R1 files
# Set filename pattern
R1_array=("${reads_dir}"/*_1.fastq.gz)

# Create array of fastq R2 files
# Set filename pattern
R2_array=("${reads_dir}"/*_2.fastq.gz)

# Check array sizes to confirm they are same size
# Exit if mismatch
if [[ "${#R1_array[@]}" != "${#R2_array[@]}" ]]
  then
    echo ""
    echo "Read1 and Read2 compressed FastQ array sizes don't match."
    echo "Confirm all expected compressed FastQs are present in ${reads_dir}"
    echo ""

    exit 1
fi

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${!R1_array[@]}"
do
  echo ""
  echo "Generating MD5 checksums for ${R1_array[${fastq}]} and ${R2_array[${fastq}]}."
  echo ""
  # Generate MD5 checksums for compressed FastQs used in NF Core RNAseq analysis
  {
    md5sum "${R1_array[${fastq}]}"
    md5sum "${R2_array[${fastq}]}"
  } >> input_fastqs-"${SLURM_JOB_ID}".md5

  # Strip leading path
  no_path="${R1_array[${fastq}]##*/}"

  # Grab SRA name
  sra=$(echo "${no_path}" | awk -F "_" '{print $1}')

  # Set tissue type
  if [[ "${sra}" == "SRR12218868" ]]
  then
    tissue="heart"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv

  elif [[ "${sra}" == "SRR12218869" ]] \
    || [[ "${sra}" == "SRR12226692" ]]

  then
    tissue="gonad"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv

  elif [[ "${sra}" == "SRR12218870" ]] \
    || [[ "${sra}" == "SRR12226693" ]]
  then
    tissue="ctenidia"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv

  elif [[ "${sra}" == "SRR12207404" ]] \
    || [[ "${sra}" == "SRR12207405" ]] \
    || [[ "${sra}" == "SRR12227930" ]] \
    || [[ "${sra}" == "SRR12207406" ]] \
    || [[ "${sra}" == "SRR12207407" ]] \
    || [[ "${sra}" == "SRR12227931" ]]
  then
    tissue="juvenile"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv

  elif [[ "${sra}" == "SRR12212519" ]] \
    || [[ "${sra}" == "SRR12227929" ]] \
    || [[ "${sra}" == "SRR8788211" ]]
  then
    tissue="larvae"

    # Add to NF Core RNAseq sample sheet
    printf "%s,%s,%s,%s\n" "${tissue}" "${R1_array[${fastq}]}" "${R2_array[${fastq}]}" "reverse" \
    >> sample_sheet-"${SLURM_JOB_ID}".csv
  fi

done

echo "Beginning NF-Core RNAseq pipeline..."
echo ""
# Run NF Core RNAseq workflow
nextflow run ${nf_core_rnaseq} \
-profile singularity \
-c ${nf_core_rnaseq_config} \
--input sample_sheet-"${SLURM_JOB_ID}".csv \
--outdir ${wd} \
--multiqc_title "20220321-pgen-nextflow_rnaseq-tissues-${SLURM_JOB_ID}" \
--fasta ${genome_fasta} \
--gff ${genome_gff} \
--save_reference \
--gtf_extra_attributes gene_name \
--gtf_group_features gene_id \
--featurecounts_group_type gene_biotype \
--featurecounts_feature_type exon \
--trim_nextseq 20 \
--save_trimmed \
--aligner star_salmon \
--pseudo_aligner salmon \
--min_mapped_reads 5 \
--save_align_intermeds \
--rseqc_modules bam_stat,\
inner_distance,\
infer_experiment,\
junction_annotation,\
junction_saturation,\
read_distribution,\
read_duplication

##############################################################
# Copy config file for later reference, if needed
cp "${nf_core_rnaseq_config}" .

# Document programs in PATH (primarily for program version ID)
{
  date
  echo ""
  echo "System PATH for $SLURM_JOB_ID"
  echo ""
  printf "%0.s-" {1..10}
  echo "${PATH}" | tr : \\n
} >> system_path.log

echo "Finished logging system PATH"

RESULTS

Runtime was surprisingly fast, at just a bit over 2.3 days…

[Screencap: NF-Core RNAseq runtime on Mox, showing a runtime of 2 days, 9 hours and some change]

There is a ton of data here to unpack, so I’ll just link to some of the most useful files.

Refer to the NF-Core/RNAseq “Output docs” for all the things generated, as well as a thorough explanation of the MultiQC Report:

Also, the NF-Core/RNAseq pipeline provides a nice progress report to show you which options are running/completed. This screenshot is from after the pipeline finished successfully:

[Screencap: NF-Core/RNAseq pipeline upon completion, showing percentages and checkboxes indicating each process’s completion, plus a list of samples passing the STAR mapping threshold and the pipeline runtime]

Output folder:

from Sam’s Notebook https://ift.tt/7YAu6j0
via IFTTT

Nextflow – Trials and Tribulations of Installing and Using NF-Core RNAseq

INSTALLATION

For some reason, I struggled to get things installed correctly. Installing Nextflow was straightforward and didn’t have any issues. The first bump in the road came from the installation directions (https://ift.tt/YwEzc3j) for NF-Core RNAseq. The instructions say “Download the pipeline and test it…”. Well, that doesn’t indicate how/where to download. To add to the confusion (it will be evident how in a bit), the previous step says to “only use Conda as a last resort”. Okay, so how do I get NF-Core RNAseq? I downloaded the latest release from the GitHub Release page. After doing that, I couldn’t figure out how to get even a basic command (e.g. nf-core download) to run. I eventually discovered that I likely needed to install nf-core Tools. However, doing that meant I would have to use either Conda or Docker. Docker seemed overly complicated (I didn’t want to have to deal with re-learning container/image usage within the confines of a SLURM environment like Mox – this whole thing was supposed to be quick and easy!) and using Conda was strongly discouraged for NF-Core RNAseq. For “simplicity” I went the Conda route. This is where everything went awry…

The pipeline is designed to have internet access to update and download newer config files automatically. Knowing that a running SLURM job wouldn’t have internet access, I followed the instructions for downloading a pipeline for offline use. Running the nf-core download command initiated an interactive mode to specify things like which version to download, whether or not to provide a directory for $NXF_SINGULARITY_CACHEDIR, and whether or not to compress the Singularity container that would be downloaded. Setting $NXF_SINGULARITY_CACHEDIR seemed to cause issues. Additionally, the Singularity download was surprisingly fast.
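
For reference, the offline-download step looked something like the following (a sketch – the exact flags vary between nf-core tools versions):

# Download the pipeline and its Singularity image for offline use;
# the interactive mode described above covers the same options.
nf-core download rnaseq \
--revision 3.6 \
--container singularity \
--compress none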

As it all turns out, I think the issue was that I had both downloaded NF-Core RNAseq from the GitHub release and installed it via Conda. Essentially, Conda creates the same directory structure in the same location as the NF-Core RNAseq install from the GitHub release. And that fact, I think, is what broke everything.

To fix the installation, I uninstalled all things related to NF-Core RNAseq, but left the Nextflow installation alone. Then, I installed nf-core Tools (and, subsequently, the RNAseq package) using only the Conda installation. Once I had done that (and activated the nf-core/rnaseq conda environment) and ran nf-core download, I noticed that the Singularity download took significantly longer than in previous attempts, which led me to think that things were finally working.

RUNNING

After finally getting things installed properly, I encountered a number of problems just trying to get the pipeline to run successfully. There were a number of small issues that led to a lot of troubleshooting:

Error:

Transcript names not matching GFF.

Solution:

If supplying a transcriptome, the names in the FastA description lines have to match those in the GFF. Since we de novo assembled our transcriptome using Trinity, it has Trinity IDs, which do not correspond to the genome assembly/annotation. So, I dropped the transcriptome option for the run.
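
A quick way to see this kind of mismatch is to compare the sequence names in the transcriptome FastA against the transcript IDs in the GFF; with a Trinity assembly, the two sets share nothing. A rough sketch (the transcriptome filename is a placeholder):

# Sequence names in the Trinity assembly (e.g. TRINITY_DN1000_c0_g1_i1)
grep "^>" transcriptome.fa | awk '{print substr($1, 2)}' | sort > fasta_ids.txt

# Transcript IDs in the genome annotation (mRNA features)
awk -F'\t' '$3 == "mRNA" {print $9}' Panopea-generosa-v1.0.gff \
| grep -o "ID=[^;]*" | sed "s/^ID=//" | sort -u > gff_ids.txt

# Names present in both sets (empty output = no overlap)
comm -12 fasta_ids.txt gff_ids.txt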

Error:

Unable to parse config file: '/gscratch/srlab/programs/nf-core-rnaseq-3.6/workflow/nextflow.config'

  Cannot compare java.lang.Boolean with value 'true' and java.lang.Integer with value '0'

Solution:

The --trim_nextseq option actually requires an integer. The documentation didn’t indicate this at the time, but it may now, as I brought it to the developers’ attention via their Slack group and the discussion indicated they should add this info to the website.
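
In other words (presumably the bare flag is parsed as the Boolean true, which is what the error above is comparing against an integer):

--trim_nextseq       # fails: value parsed as a Boolean
--trim_nextseq 20    # works: integer quality cutoff, as used in the script above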

Error:

Unable to parse config file: '/gscratch/srlab/programs/nf-core-rnaseq-3.6/workflow/nextflow.config'

Solution:

When using a custom config file, one cannot use the check_max() function. I had just been modifying the values for RAM, CPUs, and run times that were in the base.config file, which utilizes the check_max() function. After a bunch of sleuthing on the NF-Core RNAseq GitHub and Slack groups, I finally discovered this solution. Here’s what the custom, functional config looks like:

  • /gscratch/srlab/programs/nf-core-rnaseq-3.6/configs/conf/base-srlab_500GB_node.config
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    nf-core/rnaseq Nextflow base config file modified by Sam White on 20220321 for use
    on Roberts Lab 500GB Mox node.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    A 'blank slate' config file, appropriate for general use on most high performance
    compute environments. Assumes that all software is installed and available on
    the PATH. Runs in `local` mode - all jobs will be run on the logged in environment.
----------------------------------------------------------------------------------------
*/

process {

    cpus   = 28
    memory = 500.GB
    time   = 30.d

    errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'

    // Process-specific resource requirements
    withLabel:process_low {
        cpus   = 28
        memory = 500.GB
        time   = 30.d
    }
    withLabel:process_medium {
        cpus   = 28
        memory = 500.GB
        time   = 30.d
    }
    withLabel:process_high {
        cpus   = 28
        memory = 500.GB
        time   = 30.d
    }
    withLabel:process_long {
        time   = 30.d
    }
    withLabel:process_high_memory {
        memory = 500.GB
    }
    withLabel:error_ignore {
        errorStrategy = 'ignore'
    }
    withLabel:error_retry {
        errorStrategy = 'retry'
        maxRetries    = 2
    }
    withName:CUSTOM_DUMPSOFTWAREVERSIONS {
        cache = false
    }
}

Error:

Error Message: Strand is neither '+' nor '-'!

ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA (rsem/Panopea-generosa-v1.0.fa)'
Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA (rsem/Panopea-generosa-v1.0.fa)` terminated with an error exit status (255)
Command executed:
  rsem-prepare-reference \
      --gtf Panopea-generosa-v1.0_genes.gtf \
      --num-threads 28 \
       \
      rsem/Panopea-generosa-v1.0.fa \
      rsem/genome
  
  cp rsem/genome.transcripts.fa .
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA":
      rsem: $(rsem-calculate-expression --version | sed -e "s/Current version: RSEM v//g")
      star: $(STAR --version | sed -e "s/STAR_//g")
  END_VERSIONS
Command exit status:
  255
Command output:
  rsem-extract-reference-transcripts rsem/genome 0 Panopea-generosa-v1.0_genes.gtf None 0 rsem/Panopea-generosa-v1.0.fa
  "rsem-extract-reference-transcripts rsem/genome 0 Panopea-generosa-v1.0_genes.gtf None 0 rsem/Panopea-generosa-v1.0.fa" failed! Plase check if you provide correct parameters/options for the pipeline!
Command error:
  The GTF file might be corrupted!
  Stop at line : Scaffold_02    GenSAS_5d82b316cd298-trnascan   transcript      27467   27541   38.1    .       .       transcript_id "21513.GS22252506.PGEN_.tRNA00000001"; gene_id "21513.GS22252506.PGEN_.tRNA00000001"; Name "Thr"; anti_codon "CGT"; gene_no "1376";

Solution:

As it turns out, this wasn’t an issue with the pipeline; it’s really an issue with how RSEM handles GTF strand parsing. Even though the GFF spec indicates that the strand column can be one of +, -, ., or ?, RSEM only parses for + or -. And, as it turns out, our genome GFF has some . entries for strand info. Looking through our “merged” GenSAS GFF, it turns out there are two sets of annotations that only have . for strand info (GenSAS_5d82b316cd298-trnascan & RNAmmer-1.2). So, a decision needed to be made: should we convert these sets’ strands to an “artificial” value (e.g. set all of them to +), or eliminate them from the input GFF? I ended up converting the GenSAS_5d82b316cd298-trnascan strands to + and eliminating RNAmmer-1.2 from the final input GFF.
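
For anyone hitting the same thing, a one-liner along these lines is how you’d spot the problem sources (run against the biotype-modified GFF from earlier):

# List annotation sources (field 2) having entries with "." in the strand field (field 7)
awk -F'\t' '$7 == "." {print $2}' Panopea-generosa-v1.0.a4_biotype.gff | sort -u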

Overall, it was a bit of an arduous process, but it’s all running now… Will update if I encounter any more hurdles.

from Sam’s Notebook https://ift.tt/d8t17hW
via IFTTT
