Sam’s Notebook: Genome Annotation – Pgenerosa_v074 Hisat2 Transcript Isoform Index

Essentially, the steps below are what's needed (and what was done here) to prepare files for use with Stringtie:

  1. Create GTF file (basically a GFF specifically for use with transcripts – thus the “T” in GTF) from input GFF file. Done with GFF utilities software.
  2. Identify splice sites and exons in newly-created GTF. Done with Hisat2 software.
  3. Create a Hisat2 reference index that utilizes the GTF. Done with Hisat2 software.
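For intuition, the GFF-to-GTF conversion in step 1 is mostly a column 9 reformatting exercise: GFF3 stores attributes as key=value pairs, while GTF wants gene_id "..."; transcript_id "...";. A toy one-off awk sketch of that mapping (not what gffread actually runs internally):

```shell
# Column 9 of a toy GFF3 transcript line holds key=value pairs;
# a transcript's GFF3 ID/Parent map onto GTF transcript_id/gene_id.
printf 'chr1\tsnap\tmRNA\t100\t400\t.\t+\t.\tID=mRNA0001;Parent=gene0001\n' |
awk -F'\t' '{
  n = split($9, kv, ";")
  for (i = 1; i <= n; i++) {
    split(kv[i], pair, "=")
    attr[pair[1]] = pair[2]
  }
  # Emit the GTF-style attribute string
  print "gene_id \"" attr["Parent"] "\"; transcript_id \"" attr["ID"] "\";"
}'
# -> gene_id "gene0001"; transcript_id "mRNA0001";
```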

This was run on Mox.

The SBATCH script has a bunch of leftover extraneous steps that aren’t relevant to this step of the annotation process; specifically, the FastQ manipulation steps. These came along in a copy/paste from a previous Hisat2 run that I neglected to edit out. Since I didn’t want to modify the script after I had actually run it, I’ve left them in here.

SBATCH script (GitHub):

 #!/bin/bash
 ## Job Name
 #SBATCH --job-name=oly_hisat2
 ## Allocation Definition
 #SBATCH --account=srlab
 #SBATCH --partition=srlab
 ## Resources
 ## Nodes
 #SBATCH --nodes=1
 ## Walltime (days-hours:minutes:seconds format)
 #SBATCH --time=25-00:00:00
 ## Memory per node
 #SBATCH --mem=120G
 ## turn on e-mail notification
 #SBATCH --mail-type=ALL
 #SBATCH --mail-user=samwhite@uw.edu
 ## Specify the working directory for this job
 #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_hisat2-build_pgen_v074

 # Exit script if any command fails
 set -e

 # Load Python Mox module for Python module availability
 module load intel-python3_2017

 # Document programs in PATH (primarily for program version ID)
 date >> system_path.log
 echo "" >> system_path.log
 echo "System PATH for $SLURM_JOB_ID" >> system_path.log
 echo "" >> system_path.log
 printf "%0.s-" {1..10} >> system_path.log
 echo "${PATH}" | tr : \\n >> system_path.log

 threads=28
 genome_index_name="Pgenerosa_v074"

 # Paths to programs
 gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
 hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
 hisat2_build="${hisat2_dir}/hisat2-build"
 hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
 hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

 # Input/output files
 fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq"
 genome_dir="/gscratch/srlab/sam/data/P_generosa/genomes"
 genome_gff="${genome_dir}/Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff"
 exons="hisat2_exons.tab"
 genome_fasta="${genome_dir}/Pgenerosa_v074.fa"
 splice_sites="hisat2_splice_sites.tab"
 transcripts_gtf="Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gtf"

 ## Initialize arrays
 fastq_array_R1=()
 fastq_array_R2=()

 # Create array of fastq R1 files
 for fastq in "${fastq_dir}"/*R1*.gz
 do
   fastq_array_R1+=("${fastq}")
 done

 # Create array of fastq R2 files
 for fastq in "${fastq_dir}"/*R2*.gz
 do
   fastq_array_R2+=("${fastq}")
 done

 # Create array of sample names
 ## Uses parameter substitution to strip leading path from filename
 ## Uses awk to parse out sample name from filename
 for R1_fastq in "${fastq_dir}"/*R1*.gz
 do
   names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
 done

 # Create list of fastq files used in analysis
 ## Uses parameter substitution to strip leading path from filename
 for fastq in "${fastq_dir}"/*.gz
 do
   echo "${fastq#${fastq_dir}}" >> fastq.list.txt
 done

 # Create transcripts GTF from genome GFF
 "${gffread}" -T \
 "${genome_gff}" \
 -o "${transcripts_gtf}"

 # Create Hisat2 exons tab file
 "${hisat2_exons}" \
 "${transcripts_gtf}" \
 > "${exons}"

 # Create Hisat2 splice sites tab file
 "${hisat2_splice_sites}" \
 "${transcripts_gtf}" \
 > "${splice_sites}"

 # Build Hisat2 reference index using splice sites and exons
 "${hisat2_build}" \
 "${genome_fasta}" \
 "${genome_index_name}" \
 --exon "${exons}" \
 --ss "${splice_sites}" \
 -p "${threads}" \
 2> hisat2_build.err

 # Copy Hisat2 index files to my data directory
 rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"
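For reference, the splice-site extraction step conceptually reduces to sorting each transcript's exons and reporting the intron between consecutive exons. A toy sketch using 1-based inclusive intron coordinates (the real hisat2_extract_splice_sites.py uses its own zero-based output convention):

```shell
# Toy GTF with two exons of one transcript.
printf 'chr1\tsnap\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\nchr1\tsnap\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' > toy.gtf

# Sort exons by position, then emit the gap between consecutive exons
# of the same transcript (keyed here on the full attribute string).
sort -k1,1 -k4,4n toy.gtf |
awk -F'\t' '$3 == "exon" {
  if (seen[$9]) print $1 "\t" prev_end[$9] + 1 "\t" $4 - 1 "\t" $7
  seen[$9] = 1
  prev_end[$9] = $5
}'
# -> chr1    201    299    +
```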

Sam’s Notebook: Genome Annotation – Pgenerosa_v070 Hisat2 Transcript Isoform Index

This is the first step in getting transcript isoform annotations. The annotations and alignments that will be generated with Stringtie will be used to help us get a better grasp of what’s going on with our annotations of the different Panopea generosa genome assembly versions.

Essentially, the steps below are what's needed (and what was done here) to prepare files for use with Stringtie:

  1. Create GTF file (basically a GFF specifically for use with transcripts – thus the “T” in GTF) from input GFF file. Done with GFF utilities software.
  2. Identify splice sites and exons in newly-created GTF. Done with Hisat2 software.
  3. Create a Hisat2 reference index that utilizes the GTF. Done with Hisat2 software.

This was run on Mox.

The SBATCH script has a bunch of leftover extraneous steps that aren’t relevant to this step of the annotation process; specifically, the FastQ manipulation steps. These came along in a copy/paste from a previous Hisat2 run that I neglected to edit out. Since I didn’t want to modify the script after I had actually run it, I’ve left them in here.

SBATCH script (GitHub):

20190723_hisat2-build_pgen_v070.sh

 #!/bin/bash
 ## Job Name
 #SBATCH --job-name=oly_hisat2
 ## Allocation Definition
 #SBATCH --account=srlab
 #SBATCH --partition=srlab
 ## Resources
 ## Nodes
 #SBATCH --nodes=1
 ## Walltime (days-hours:minutes:seconds format)
 #SBATCH --time=25-00:00:00
 ## Memory per node
 #SBATCH --mem=120G
 ## turn on e-mail notification
 #SBATCH --mail-type=ALL
 #SBATCH --mail-user=samwhite@uw.edu
 ## Specify the working directory for this job
 #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_hisat2-build_pgen_v070

 # Exit script if any command fails
 set -e

 # Load Python Mox module for Python module availability
 module load intel-python3_2017

 # Document programs in PATH (primarily for program version ID)
 date >> system_path.log
 echo "" >> system_path.log
 echo "System PATH for $SLURM_JOB_ID" >> system_path.log
 echo "" >> system_path.log
 printf "%0.s-" {1..10} >> system_path.log
 echo "${PATH}" | tr : \\n >> system_path.log

 threads=28
 genome_index_name="Pgenerosa_v070"

 # Paths to programs
 gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
 hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
 hisat2_build="${hisat2_dir}/hisat2-build"
 hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
 hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

 # Input/output files
 fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq"
 genome_dir="/gscratch/srlab/sam/data/P_generosa/genomes"
 genome_gff="${genome_dir}/Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gff"
 exons="hisat2_exons.tab"
 genome_fasta="${genome_dir}/Pgenerosa_v070.fa"
 splice_sites="hisat2_splice_sites.tab"
 transcripts_gtf="Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gtf"

 ## Initialize arrays
 fastq_array_R1=()
 fastq_array_R2=()

 # Create array of fastq R1 files
 for fastq in "${fastq_dir}"/*R1*.gz
 do
   fastq_array_R1+=("${fastq}")
 done

 # Create array of fastq R2 files
 for fastq in "${fastq_dir}"/*R2*.gz
 do
   fastq_array_R2+=("${fastq}")
 done

 # Create array of sample names
 ## Uses parameter substitution to strip leading path from filename
 ## Uses awk to parse out sample name from filename
 for R1_fastq in "${fastq_dir}"/*R1*.gz
 do
   names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
 done

 # Create list of fastq files used in analysis
 ## Uses parameter substitution to strip leading path from filename
 for fastq in "${fastq_dir}"/*.gz
 do
   echo "${fastq#${fastq_dir}}" >> fastq.list.txt
 done

 # Create transcripts GTF from genome GFF
 "${gffread}" -T \
 "${genome_gff}" \
 -o "${transcripts_gtf}"

 # Create Hisat2 exons tab file
 "${hisat2_exons}" \
 "${transcripts_gtf}" \
 > "${exons}"

 # Create Hisat2 splice sites tab file
 "${hisat2_splice_sites}" \
 "${transcripts_gtf}" \
 > "${splice_sites}"

 # Build Hisat2 reference index using splice sites and exons
 "${hisat2_build}" \
 "${genome_fasta}" \
 "${genome_index_name}" \
 --exon "${exons}" \
 --ss "${splice_sites}" \
 -p "${threads}" \
 2> hisat2_build.err

 # Copy Hisat2 index files to my data directory
 rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"

RESULTS

Took about an hour to run:

Screencap of Hisat2 runtime

Output folder:

The Hisat2 index files are the *.ht2 files. These will be used with Stringtie for transcript isoform annotation.

from Sam’s Notebook https://ift.tt/2JMm9jt
via IFTTT

Yaamini’s Notebook: Ecology of Infectious Marine Diseases TA Recap

Things I did at FHL that were not my research

For the past five weeks, I’ve been at Friday Harbor Laboratories as a Teaching Assistant for the Ecology of Infectious Marine Diseases course! I thought it would be good to recap what I did, since I was working but didn’t have time for my own research. In addition to finding lab equipment and helping out with fieldwork, I gave some lectures, helped students with a genomics project, and spearheaded the formal science communication section of the curriculum.

Teaching molecular methods

For my first lecture, I taught students about Github and basic Linux commands. I had students navigate to this Github repository and clone it to their machine. I then walked through the steps I laid out in this document. I had students learn to change their working directory and navigate through directories using the command line interface. I opened up a Terminal window and placed a Finder under it so students could see how the commands I typed in the Terminal changed directory structure and files. I also emphasized the use of tab-complete to make things easier and avoid typos in code. The next set of commands I walked through involved downloading FASTA files from web links. Looking back on it now, I probably should have taught them about checksums, but I think their brains may have exploded a bit. I had them create new directories, move FASTA files into directories, and remove extra files. Finally, I went through commands to explore files from the command line.
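The checksum step I skipped could have been demonstrated with something as simple as the following (toy file and filenames for illustration; for a real download you'd compare against the digest published alongside the file):

```shell
# Make a toy FASTA so the example is self-contained.
printf '>seq1\nACGT\n' > sequences.fasta

# Record a SHA-256 digest, then verify it; a mismatch (e.g. after a
# partial or corrupted download) makes the check fail.
sha256sum sequences.fasta > sequences.fasta.sha256
sha256sum -c sequences.fasta.sha256   # prints "sequences.fasta: OK"
```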

Later that day, I taught them how to blast from the command line! I wanted them to type the code themselves, but we were unable to download blast on the computers in the Computer Lab. I walked through the code in this document so they could try reviewing it themselves later. What was more useful for them was going to the Uniprot SwissProt database and teaching them about the database and how to get GOterms.

At the end of our genomics block, I gave a lecture on my own work! Since Colleen focused on transcriptomics, I used my work as case studies on the use of proteomics and epigenetics to study how abiotic stressors affect organismal physiology. Students were really interested in specific methods, so I added in a lot of detail. I think those extra details may have prevented some students from seeing the big picture. I’ll need to work on that if I give that lecture again.

A large component of the course was working on projects. For the project examining NIX prevalence in Kalaloch Beach razor clams, I assisted students with DNA extractions and qPCR. I spent most of my time assisting with the transcriptomics project looking at eelgrass wasting disease host-pathogen interactions. I created this Jupyter notebook to merge transcriptomic data with blast output and Uniprot Swiss-Prot annotations. I also streamlined isoforms into genes. In this R Markdown file, I formatted input files for gene enrichment with GO-MWU. I used the GO-MWU pipeline for eelgrass and pathogen transcriptomic data, but only found two enriched GOterms for eelgrass. I also created this R Markdown document to help students create heatmaps of differentially expressed genes. They used the code I created to create their own heatmaps for genes of interest. I think any molecular method is difficult to follow if you have little to no experience, and if you don’t have a great understanding of R or Linux. One thing that (I think) helped while I was teaching was to constantly remind students the purpose of each step.
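The merge itself boils down to a left join of annotations onto the expression table plus a per-gene sum over isoforms; a toy command-line sketch with made-up sample names and values (the actual Jupyter notebook works on the real blast and annotation tables):

```shell
# Toy per-isoform count table (transcript, gene, counts) and toy blast
# hits (transcript, Uniprot accession); join requires sorted input.
printf 't1.1 t1 10\nt1.2 t1 5\nt2.1 t2 7\n' | sort > counts.tab
printf 't1.1 P12345\nt2.1 Q67890\n' | sort > blast.tab

# Left-join annotations onto counts; unannotated isoforms get "NA".
join -a 1 -e NA -o 1.1,1.2,1.3,2.2 counts.tab blast.tab > annotated.tab

# "Streamline isoforms into genes": sum counts per gene.
awk '{sum[$2] += $3} END {for (g in sum) print g, sum[g]}' annotated.tab | sort
# -> t1 15
#    t2 7
```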

Science communication

The other part of the course I helped with was science communication practice. Students were required to write one blog post for a public audience and create a short talk about a disease for the class. I worked with each student on their blog post and provided targeted feedback when editing their initial drafts. I also had students give practice presentations to me so I could help them where needed. Most of my comments were directed at improving the organization of their pieces or talks. I took it upon myself to help them outline their final papers and reviewed their final presentations so they would not have similar issues.

Overall TAing at FHL was a lot of work but really rewarding! I learned a lot about what it means to be a good instructor and I cannot wait to flex those skills soon.

from the responsible grad student https://ift.tt/2y1cqir
via IFTTT

Shelly’s Notebook: Mon. Jul. 22, 2019 Salmon sea lice methylomes

run fastqc and trim

  • I ended up getting this to run on Ostrich via this jupyter notebook:
  • The multiqc instance I installed on Ostrich didn’t seem to work properly (errors in the jupyter notebook output), so I ran it on emu instead and it worked:
      scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.zip srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
      scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.html srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc

      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ multiqc .
      [INFO   ] multiqc : This is MultiQC v0.9
      [INFO   ] multiqc : Template : default
      [INFO   ] multiqc : Searching '.'
      [INFO   ] fastqc : Found 92 reports
      [INFO   ] multiqc : Report : multiqc_report.html
      [INFO   ] multiqc : Data : multiqc_data
      [INFO   ] multiqc : MultiQC complete

      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rsync --archive --progress --verbose multiqc_data strigg@ostrich.fish.washington.edu:/Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722
      Password:
      building file list ...
      6 files to consider
      multiqc_data/
      multiqc_data/.multiqc.log
               6,880 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=4/6)
      multiqc_data/multiqc_fastqc.txt
              20,380 100%   19.44MB/s    0:00:00 (xfr#2, to-chk=3/6)
      multiqc_data/multiqc_general_stats.txt
               6,880 100%    6.56MB/s    0:00:00 (xfr#3, to-chk=2/6)
      multiqc_data/multiqc_report.html
           2,567,171 100%   18.14MB/s    0:00:00 (xfr#4, to-chk=1/6)
      multiqc_data/multiqc_sources.txt
              12,078 100%   87.37kB/s    0:00:00 (xfr#5, to-chk=0/6)
      sent 2,614,141 bytes  received 500 bytes  747,040.29 bytes/sec
      total size is 2,613,389  speedup is 1.00

      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.html
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.zip
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm -r multiqc_data/
  • Raw sequence FASTQC output folder:
  • Raw sequence MultiQC report (HTML):
  • TrimGalore! output folder (adapter trimmed FastQ files):
  • Adapter trimming MultiQC report (HTML):
  • TrimGalore hard trim output folder (first 5bp trimmed from 5’ of each read):
  • Hard trim MultiQC report (HTML):
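Once MultiQC finishes, the tab-delimited multiqc_data/multiqc_general_stats.txt table is easy to screen programmatically; a sketch with invented column names and values (the real headers vary by MultiQC version and the modules that ran):

```shell
# Toy multiqc_general_stats.txt stand-in.
printf 'Sample\tpercent_duplicates\tpercent_gc\nsample_A\t12.5\t41.0\nsample_B\t55.0\t43.5\n' > multiqc_general_stats.txt

# Flag samples whose duplication rate exceeds 50% for a closer look.
awk -F'\t' 'NR > 1 && $2 > 50 {print $1}' multiqc_general_stats.txt
# -> sample_B
```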

copy the salmon genome and sea lice genome to mox

  • Steven shared the C_rogercresseyi genome on this CaligusLIFE Slack channel thread.
    • I downloaded it locally and then copied it to Mox:
        Shellys-MacBook-Pro:coverage Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Caligus/GENOMES
        Password:
        Enter passcode or select one of the following options:
         1. Duo Push to Android (XXX-XXX-0029)
         2. Phone call to Android (XXX-XXX-0029)
        Duo passcode or option [1-2]: 1
        building file list ...
        1 file to consider
        Caligus_rogercresseyi_Genome.fa
           515420160 100%   19.75MB/s    0:00:24 (xfer#1, to-check=0/1)
        sent 515483225 bytes  received 42 bytes  14122829.23 bytes/sec
        total size is 515420160  speedup is 1.00
    • I also copied it to Gannett
        Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/Caligus
        building file list ...
        1 file to consider
        Caligus_rogercresseyi_Genome.fa
           515420160 100%   12.72MB/s    0:00:38 (xfer#1, to-check=0/1)
        sent 515483225 bytes  received 42 bytes  13050209.29 bytes/sec
        total size is 515420160  speedup is 1.00
  • I downloaded the most recent RefSeq version of the S.salar genome on Gannett:
      ostrich:RefSeq strigg$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
      --2019-07-19 12:21:17--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
                => ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’
      Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::11
      Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
      Logging in as anonymous ... Logged in!
      ==> SYST ... done.    ==> PWD ... done.
      ==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2 ... done.
      ==> SIZE GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... 759073402
      ==> PASV ... done.    ==> RETR GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... done.
      Length: 759073402 (724M) (unauthoritative)
      GCF_000233375.1_ICSASG_v2_genomic.fna.gz 100%[===================>] 723.91M   631KB/s    in 19m 53s
      2019-07-19 12:41:12 (621 KB/s) - ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’ saved [759073402]
    • then copied the S. salar RefSeq genome to Mox:
        Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/v2/RefSeq/GCF_000233375.1_ICSASG_v2_genomic.fna.gz strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Ssalar/GENOMES
        Password:
        Enter passcode or select one of the following options:
         1. Duo Push to Android (XXX-XXX-0029)
         2. Phone call to Android (XXX-XXX-0029)
        Duo passcode or option [1-2]: 1
        building file list ...
        1 file to consider
        GCF_000233375.1_ICSASG_v2_genomic.fna.gz
           759073402 100%    7.68MB/s    0:01:34 (xfer#1, to-check=0/1)
        sent 759166220 bytes  received 42 bytes  7195888.74 bytes/sec
        total size is 759073402  speedup is 1.00

copy data from Gannett to owl

NEXT STEPS:

  1. Concatenate data from different lanes (L001 + L002 for each sample)
  2. transfer concatenated trimmed reads from Gannett to Mox
  3. determine alignment parameters for:
    • subset of Calig. data
    • subset of Salmon data
  4. run bismark on all data
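Step 1 works because gzip streams concatenate cleanly; a self-contained sketch with toy stand-in files (real filenames will differ):

```shell
# gzip members can be concatenated directly, so per-lane FastQ files
# can be joined without decompressing. Toy files stand in for real lanes.
printf '@r1\nACGT\n+\nIIII\n' | gzip > sample1_L001_R1.fastq.gz
printf '@r2\nTTTT\n+\nIIII\n' | gzip > sample1_L002_R1.fastq.gz

cat sample1_L001_R1.fastq.gz sample1_L002_R1.fastq.gz > sample1_R1.fastq.gz

# The combined file holds the reads from both lanes.
gunzip -c sample1_R1.fastq.gz | wc -l   # 8 lines = 2 reads
```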

from shellytrigg https://ift.tt/2JNJh1f
via IFTTT

Shelly’s Notebook: Fri. Jul. 19, 2019 Salmon sea lice methylomes

Received salmon sea lice methylome data!!

copy data to Gannett metacarcinus folder

  • I received this url from the sequencing core
  • I mounted gannet via finder -> connect to server
  • I installed Globus Connect Personal on Ostrich and gave the Globus app writing permissions to my metacarcinus folder on Gannet
  • I navigated to my Globus account’s file manager section via the provided url and could see my sequencing data, “UW_Trigg_190718.tar.gz” (148.22 GB).
  • I selected the “Transfer or Sync to…” option, entered the computer name “Ostrich” (the name I set when configuring the Globus Connect Personal app) and the path “/Volumes/web/metacarcinus/Salmo_Calig/FASTQS/”, then clicked start.
  • I navigated to the Activity section of Globus and could see the data being transferred (26 GB in 10 minutes).
  • After a couple hours, I received an email notification and could see in the status section that the transfer had completed successfully.

copy data from Gannett to mox

ran this command:

 rsync --archive --progress --verbose strigg@ostrich.fish.washington.edu:/Volumes/web/metacarcinus/Salmo_Calig/FASTQS/UW_Trigg_190718.tar.gz /gscratch/srlab/strigg/data/Salmo_Calig/FASTQS/  

I decided to make two species folders rather than have them combined, and wanted to remove the data that was not Salmon or sea lice fastqs, so ran the following commands:

 tar -xvf UW_Trigg_190718.tar.gz
 cd UW_Trigg_190718_done/
 rm -r Reports/
 rm -r Stats/
 rm -r Undetermined_S0_L00*
 cd ../../../
 mkdir Caligus
 mv Salmo_Calig/ Ssalar
 cd Ssalar/
 cd FASTQS/
 cd ..
 cd Caligus/
 mkdir FASTQS
 mv ../Ssalar/FASTQS/UW_Trigg_190718_done/Sealice_F* .
 mv *.gz FASTQS/
 rm UW_Trigg_190718.tar.gz

copy data from Gannett to owl

run fastqc and trim

  • prepared scripts (Salmon: /gscratch/srlab/strigg/jobs/20190719_FASTQC_ADPTRIM_Ssalar.sh and lice: /gscratch/srlab/strigg/jobs/20190719_FASTQC_ADPTRIM_Caligus.sh) to do this on Mox, but they didn’t work because the multiqc and cutadapt programs could not be found as I had them specified. See GitHub issue #712

from shellytrigg https://ift.tt/2M5nkMt
via IFTTT