Sam’s Notebook: Genome Comparison – Pgenerosa_v074 vs Pgenerosa_v070 with MUMmer Promer on Mox

In continuing to further improve our geoduck genome annotation, I’m attempting to figure out why Scaffold 1 of our assembly doesn’t have any annotations. As part of that I’ve decided to perform a series of genome comparisons and see how they match up, with an emphasis on Scaffold 1, using MUMmer 3.23 (specifically, promer for protein level comparisons). This software is specifically designed to do this type of comparison.

Basically, MUMmer promer will take a query genome assembly (Pgenerosa_v074 in this case), translate it into all six reading frames, align it to a reference genome, and determine contig similarities/differences. So, this should provide further insight on what is happening in Pgenerosa_v074 Scaffold 1, when compared to related (or different) species’ genomes.

I’ve previously run nucmer (for nucleotide comparisons) and thought it might be a good idea to compare these sequences at the protein level, since there is usually a greater degree of sequence conservation at the protein level than at the nucleotide level – thus an increased possibility of alignment across these different species.

INPUT FILES:

Query:

Reference genome (Panopea generosa – Pacific geoduck):

This was run using MUMmer v3.23 on Mox using the SBATCH script below.

NOTE: The previous nucleotide comparison using nucmer was performed using MUMmer4 nucmer– however, MUMmer4 promer throws an error every time I tried to run it (even after attempting to patch the promer.pl script).

SBATCH script (GitHub):

 #!/bin/bash ## Job Name #SBATCH --job-name=promer_pgen074_vs_pgen070 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=5-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20190813_pgen_mummer_promer_pgen-v074_pgen-v070 # Exit if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Load Open MPI module for parallel, multi-node processing module load icc_19-ompi_3.1.2 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo "${PATH}" | tr : \\n >> system_path.log ### Set variables # Filename prefix prefix="pgen-v074_pgen-v074" pga1_coords="PGA_scaffold1.coords.txt" # Program paths promer="/gscratch/srlab/programs/MUMmer3.23/promer" show_coords="/gscratch/srlab/programs/MUMmer3.23/show-coords" # P.generosa Pgenerosa_v070 FastA pgen_v070_fasta="/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v070.fa" # P.generosa Pgenerosa_v074 FastA pgen_v074_fasta="/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v074.fa" ### Run MUMmer (promer) # Compares pgen_v074 (query) to pgen_v070 (reference) "${promer}" \ -p "${prefix}" \ "${pgen_v070_fasta}" \ "${pgen_v074_fasta}" # Parse promer delta output into more userfriendly format # -b useful for syteny - merges overlapping alignments # -c show percent coverage info # -q option sorts by query "${show_coords}" \ -b \ -c \ -q \ "${prefix}".delta \ > "${prefix}".coords.txt # Parse out PGA_scaffold1__77_contigs__length_89643857 head -n 5 "${prefix}".coords.txt > "${pga1_coords}" grep "PGA_scaffold1__77_contigs__length_89643857" "${prefix}".coords.txt >> "${pga1_coords}"