Sam’s Notebook: Genome Annotation – Pgenerosa_v070 Hisat2 Transcript Isoform Index

This is the first step in getting transcript isoform annotations. The annotations and alignments that will be generated with Stringtie will be used to help us get a better grasp of what’s going on with our annotations of the different Panopea generosa genome assembly versions.

Essentially, the steps below (which is what was done here) are needed to prepare files for use with Stringtie:

  1. Create GTF file (basically a GFF specifically for use with transcripts – thus the “T” in GTF) from input GFF file. Done with GFF utilities software.
  2. Identify splice sites and exons in newly-created GTF. Done with Hisat2 software.
  3. Create a Hisat2 reference index that utilizes the GTF. Done with Hisat2 software.

This was run on Mox.

The SBATCH script has a bunch of leftover extraneous steps that aren’t relevant to this step of the annotation process; specifically the FastQ manipulation steps. This is due to a copy/paste from a previous Hisat2 run that I neglected to edit out and I didn’t want to edit the script after I actually ran it, so have left it in here.

SBATCH script (GitHub):

20190723_hisat2-build_pgen_v070.sh

 #!/bin/bash ## Job Name #SBATCH --job-name=oly_hisat2 ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=25-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_hisat2-build_pgen_v070 # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo "${PATH}" | tr : \\n >> system_path.log threads=28 genome_index_name="Pgenerosa_v070" # Paths to programs gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread" hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0" hisat2_build="${hisat2_dir}/hisat2-build" hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py" hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py" # Input/output files fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq" genome_dir="/gscratch/srlab/sam/data/P_generosa/genomes" genome_gff="${genome_dir}/Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gff" exons="hisat2_exons.tab" genome_fasta="${genome_dir}/Pgenerosa_v070.fa" splice_sites="hisat2_splice_sites.tab" transcripts_gtf="Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gtf" ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() # Create array of fastq R1 files for fastq in "${fastq_dir}"/*R1*.gz do fastq_array_R1+=("${fastq}") done # Create array of fastq R2 files for fastq in "${fastq_dir}"/*R2*.gz do fastq_array_R2+=("${fastq}") done # Create array of sample names ## Uses parameter substitution to strip leading path from filename ## Uses awk to parse out sample name from filename for R1_fastq in "${fastq_dir}"/*R1*.gz do names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}')) done # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in "${fastq_dir}"/*.gz do echo "${fastq#${fastq_dir}}" >> fastq.list.txt done # Create transcipts GTF from genome FastA "${gffread}" -T \ "${genome_gff}" \ -o "${transcripts_gtf}" # Create Hisat2 exons tab file "${hisat2_exons}" \ "${transcripts_gtf}" \ > "${exons}" # Create Hisate2 splice sites tab file "${hisat2_splice_sites}" \ "${transcripts_gtf}" \ > "${splice_sites}" # Build Hisat2 reference index using splice sites and exons "${hisat2_build}" \ "${genome_fasta}" \ "${genome_index_name}" \ --exon "${exons}" \ --ss "${splice_sites}" \ -p "${threads}" \ 2> hisat2_build.err # Copy Hisat2 index files to my data directory rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"  

RESULTS

Took about an hour to run:

Screencap of Hisat2 runtime

Output folder:

The Hisat2 index files are: *.ht2. These will be used with Stringtie for transcript isoform annotation.

from Sam’s Notebook https://ift.tt/2JMm9jt
via IFTTT