Sam’s Notebook: Genome Annotation – Pgenerosa_v070 Hisat2 Transcript Isoform Index

This is the first step toward obtaining transcript isoform annotations. The annotations and alignments generated with Stringtie will help us better understand how our annotations differ across the various Panopea generosa genome assembly versions.

Essentially, the steps below (which are what was done here) prepare files for use with Stringtie (a condensed command sketch follows the list):

  1. Create a GTF file (basically a GFF specifically for use with transcripts – thus the “T” in GTF) from the input GFF file. Done with the GFF utilities (gffread) software.
  2. Identify splice sites and exons in the newly created GTF. Done with Hisat2 software.
  3. Create a Hisat2 reference index that utilizes the GTF. Done with Hisat2 software.
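Distilled to the three core commands, this looks something like the sketch below (paths are shortened here for readability; the full SBATCH script further down is the authoritative record):

    # 1. GFF -> GTF (gffread's -T flag writes GTF output)
    gffread -T Pgenerosa_v070.gff -o Pgenerosa_v070.gtf

    # 2. Extract splice sites and exons from the GTF
    # (helper scripts bundled with Hisat2)
    hisat2_extract_splice_sites.py Pgenerosa_v070.gtf > hisat2_splice_sites.tab
    hisat2_extract_exons.py Pgenerosa_v070.gtf > hisat2_exons.tab

    # 3. Build a splice-aware Hisat2 index from the genome FastA
    hisat2-build Pgenerosa_v070.fa Pgenerosa_v070 \
    --ss hisat2_splice_sites.tab --exon hisat2_exons.tab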

This was run on Mox.

The SBATCH script has a bunch of leftover extraneous steps that aren’t relevant to this stage of the annotation process, specifically the FastQ manipulation steps. These are due to a copy/paste from a previous Hisat2 run that I neglected to edit out; since I didn’t want to edit the script after actually running it, I’ve left them in here.

SBATCH script (GitHub):

20190723_hisat2-build_pgen_v070.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=oly_hisat2
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_hisat2-build_pgen_v070

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log

threads=28
genome_index_name="Pgenerosa_v070"

# Paths to programs
gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2_build="${hisat2_dir}/hisat2-build"
hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

# Input/output files
fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq"
genome_dir="/gscratch/srlab/sam/data/P_generosa/genomes"
genome_gff="${genome_dir}/Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gff"
exons="hisat2_exons.tab"
genome_fasta="${genome_dir}/Pgenerosa_v070.fa"
splice_sites="hisat2_splice_sites.tab"
transcripts_gtf="Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gtf"

## Initialize arrays
fastq_array_R1=()
fastq_array_R2=()

# Create array of fastq R1 files
for fastq in "${fastq_dir}"/*R1*.gz
do
  fastq_array_R1+=("${fastq}")
done

# Create array of fastq R2 files
for fastq in "${fastq_dir}"/*R2*.gz
do
  fastq_array_R2+=("${fastq}")
done

# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in "${fastq_dir}"/*R1*.gz
do
  names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
done

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${fastq_dir}"/*.gz
do
  echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done

# Create transcripts GTF from genome GFF
"${gffread}" -T \
"${genome_gff}" \
-o "${transcripts_gtf}"

# Create Hisat2 exons tab file
"${hisat2_exons}" \
"${transcripts_gtf}" \
> "${exons}"

# Create Hisat2 splice sites tab file
"${hisat2_splice_sites}" \
"${transcripts_gtf}" \
> "${splice_sites}"

# Build Hisat2 reference index using splice sites and exons
"${hisat2_build}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2_build.err

# Copy Hisat2 index files to my data directory
rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"

RESULTS

Took about an hour to run:

Screencap of Hisat2 runtime

Output folder:

The Hisat2 index files are the *.ht2 files. These will be used with Stringtie for transcript isoform annotation.

from Sam’s Notebook https://ift.tt/2JMm9jt
via IFTTT

Yaamini’s Notebook: Ecology of Infectious Marine Diseases TA Recap

Things I did at FHL that were not my research

For the past five weeks, I’ve been at Friday Harbor Laboratories as a Teaching Assistant for the Ecology of Infectious Marine Diseases course! I thought it would be good to recap what I did, since I was working but didn’t have time for my own research. In addition to finding lab equipment and helping out with fieldwork, I gave some lectures, helped students with a genomics project, and spearheaded the formal science communication section of the curriculum.

Teaching molecular methods

For my first lecture, I taught students about GitHub and basic Linux commands. I had students navigate to this GitHub repository and clone it to their machines. I then walked through the steps I laid out in this document. I had students learn to change their working directory and navigate through directories using the command-line interface. I opened up a Terminal window and placed a Finder window under it so students could see how the commands I typed in the Terminal changed the directory structure and files. I also emphasized the use of tab-complete to make things easier and avoid typos in code. The next set of commands I walked through involved downloading FASTA files from web links. Looking back on it now, I probably should have taught them about checksums, but I think their brains may have exploded a bit. I had them create new directories, move FASTA files into directories, and remove extra files. Finally, I went through commands to explore files from the command line.
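For reference, a sampler of the kinds of commands we covered (file and directory names here are made up for illustration):

    pwd                                   # show the current working directory
    cd course-repo/data                   # change directories (tab-complete helps!)
    ls -lh                                # list directory contents
    curl -O http://example.com/seqs.fasta # download a FASTA file from a web link
    mkdir fasta-files                     # create a new directory
    mv seqs.fasta fasta-files/            # move the FASTA file into it
    rm extra-file.txt                     # remove an extra file
    head fasta-files/seqs.fasta           # peek at the first 10 lines of a file
    grep -c ">" fasta-files/seqs.fasta    # count sequences in a FASTA file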

Later that day, I taught them how to run BLAST from the command line! I wanted them to type the code themselves, but we were unable to install BLAST on the computers in the Computer Lab. I walked through the code in this document so they could try reviewing it themselves later. What was more useful for them was going to the UniProt Swiss-Prot database and teaching them about the database and how to get GO terms.
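For the curious, a minimal sketch of a command-line BLAST workflow like the one in that document (the database and file names below are illustrative, not the ones we used):

    # Build a protein BLAST database from the UniProt Swiss-Prot FASTA
    makeblastdb -in uniprot_sprot.fasta -dbtype prot -out swissprot

    # Query nucleotide sequences against it; tabular output (-outfmt 6)
    # is easy to parse and join to GO terms later
    blastx -query query_seqs.fasta -db swissprot \
    -evalue 1e-20 -max_target_seqs 1 -outfmt 6 \
    -num_threads 4 -out blast_output.tab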

At the end of our genomics block, I gave a lecture on my own work! Since Colleen focused on transcriptomics, I used my work as case studies on the use of proteomics and epigenetics to study how abiotic stressors affect organismal physiology. Students were really interested in specific methods, so I added in a lot of detail. I think those extra details may have prevented some students from seeing the big picture. I’ll need to work on that if I give that lecture again.

A large component of the course was working on projects. For the project examining NIX prevalence in Kalaloch Beach razor clams, I assisted students with DNA extractions and qPCR. I spent most of my time assisting with the transcriptomics project looking at eelgrass wasting disease host-pathogen interactions. I created this Jupyter notebook to merge transcriptomic data with BLAST output and UniProt Swiss-Prot annotations. I also collapsed isoforms into genes. In this R Markdown file, I formatted input files for gene enrichment with GO-MWU. I used the GO-MWU pipeline for eelgrass and pathogen transcriptomic data, but only found two enriched GO terms for eelgrass. I also created this R Markdown document to help students create heatmaps of differentially expressed genes. They used the code I created to make their own heatmaps for genes of interest. I think any molecular method is difficult to follow if you have little to no experience, especially if you don’t have a great understanding of R or Linux. One thing that (I think) helped while I was teaching was to constantly remind students of the purpose of each step.
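The merging itself lives in the Jupyter notebook, but the core idea can be sketched at the command line, assuming tabular BLAST output and a tab-delimited UniProt annotation table (all file names here are hypothetical):

    # Column 2 of BLAST -outfmt 6 output holds the UniProt accession.
    # Sort both tables on that accession, then join them so each transcript
    # carries its UniProt annotation (including GO terms).
    sort -t $'\t' -k2,2 blast_output.tab > blast_sorted.tab
    sort -t $'\t' -k1,1 uniprot_table.tab > uniprot_sorted.tab
    join -1 2 -2 1 -t $'\t' blast_sorted.tab uniprot_sorted.tab > annotated.tab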

Science communication

The other part of the course I helped with was science communication practice. Students were required to write one blog post for a public audience and create a short talk about a disease for the class. I worked with each student on their blog post and provided targeted feedback when editing their initial drafts. I also had students give practice presentations to me so I could help them where needed. Most of my comments were directed at improving the organization of their pieces or talks. I took it upon myself to help them outline their final papers and reviewed their final presentations so they would not have similar issues.

Overall, TAing at FHL was a lot of work, but it was really rewarding! I learned a lot about what it means to be a good instructor, and I cannot wait to flex those skills soon.

from the responsible grad student https://ift.tt/2y1cqir
via IFTTT

Shelly’s Notebook: Mon. Jul. 22, 2019 Salmon sea lice methylomes

run fastqc and trim

  • I ended up getting this to run on Ostrich via this Jupyter notebook:
  • The multiqc instance I installed on Ostrich didn’t seem to work properly (errors in the Jupyter notebook output), so I ran it on Emu instead, where it worked:
      scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.zip srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
      scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.html srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc

      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ multiqc .
      [INFO   ] multiqc : This is MultiQC v0.9
      [INFO   ] multiqc : Template : default
      [INFO   ] multiqc : Searching '.'
      [INFO   ] fastqc : Found 92 reports
      [INFO   ] multiqc : Report : multiqc_report.html
      [INFO   ] multiqc : Data : multiqc_data
      [INFO   ] multiqc : MultiQC complete

      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rsync --archive --progress --verbose multiqc_data strigg@ostrich.fish.washington.edu:/Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722
      Password:
      building file list ...
      6 files to consider
      multiqc_data/
      multiqc_data/.multiqc.log
            6,880 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=4/6)
      multiqc_data/multiqc_fastqc.txt
           20,380 100% 19.44MB/s 0:00:00 (xfr#2, to-chk=3/6)
      multiqc_data/multiqc_general_stats.txt
            6,880 100% 6.56MB/s 0:00:00 (xfr#3, to-chk=2/6)
      multiqc_data/multiqc_report.html
        2,567,171 100% 18.14MB/s 0:00:00 (xfr#4, to-chk=1/6)
      multiqc_data/multiqc_sources.txt
           12,078 100% 87.37kB/s 0:00:00 (xfr#5, to-chk=0/6)

      sent 2,614,141 bytes received 500 bytes 747,040.29 bytes/sec
      total size is 2,613,389 speedup is 1.00

      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.html
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.zip
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm -r multiqc_data/
  • Raw sequence FASTQC output folder:
  • Raw sequence MultiQC report (HTML):
  • TrimGalore! output folder (adapter trimmed FastQ files):
  • Adapter trimming MultiQC report (HTML):
  • TrimGalore hard trim output folder (first 5bp trimmed from 5’ of each read):
  • Hard trim MultiQC report (HTML):

copy the salmon genome and sea lice genome to mox

  • Steven shared the C_rogercresseyi genome in this CaligusLIFE Slack channel thread.
    • I downloaded it locally and then copied it to Mox:
        Shellys-MacBook-Pro:coverage Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Caligus/GENOMES
        Password:
        Enter passcode or select one of the following options:
         1. Duo Push to Android (XXX-XXX-0029)
         2. Phone call to Android (XXX-XXX-0029)
        Duo passcode or option [1-2]: 1
        building file list ...
        1 file to consider
        Caligus_rogercresseyi_Genome.fa
          515420160 100% 19.75MB/s 0:00:24 (xfer#1, to-check=0/1)

        sent 515483225 bytes received 42 bytes 14122829.23 bytes/sec
        total size is 515420160 speedup is 1.00
    • I also copied it to Gannett
        Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/Caligus
        building file list ...
        1 file to consider
        Caligus_rogercresseyi_Genome.fa
          515420160 100% 12.72MB/s 0:00:38 (xfer#1, to-check=0/1)

        sent 515483225 bytes received 42 bytes 13050209.29 bytes/sec
        total size is 515420160 speedup is 1.00
  • I downloaded the most recent RefSeq version of the S.salar genome on Gannett:
      ostrich:RefSeq strigg$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
      --2019-07-19 12:21:17-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
                => ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’
      Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::11
      Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
      Logging in as anonymous ... Logged in!
      ==> SYST ... done. ==> PWD ... done.
      ==> TYPE I ... done. ==> CWD (1) /genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2 ... done.
      ==> SIZE GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... 759073402
      ==> PASV ... done. ==> RETR GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... done.
      Length: 759073402 (724M) (unauthoritative)

      GCF_000233375.1_ICSASG_v2_genomic.fna.gz 100%[===>] 723.91M 631KB/s in 19m 53s

      2019-07-19 12:41:12 (621 KB/s) - ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’ saved [759073402]
    • then copied the S. salar RefSeq genome to Mox:
        Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/v2/RefSeq/GCF_000233375.1_ICSASG_v2_genomic.fna.gz strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Ssalar/GENOMES
        Password:
        Enter passcode or select one of the following options:
         1. Duo Push to Android (XXX-XXX-0029)
         2. Phone call to Android (XXX-XXX-0029)
        Duo passcode or option [1-2]: 1
        building file list ...
        1 file to consider
        GCF_000233375.1_ICSASG_v2_genomic.fna.gz
          759073402 100% 7.68MB/s 0:01:34 (xfer#1, to-check=0/1)

        sent 759166220 bytes received 42 bytes 7195888.74 bytes/sec
        total size is 759073402 speedup is 1.00

copy data from Gannett to owl

NEXT STEPS:

  1. Concatenate data from different lanes together (L001 + L002 for each sample; see the sketch below)
  2. Transfer concatenated trimmed reads from Gannett to Mox
  3. Determine alignment parameters for:
    • subset of Calig. data
    • subset of Salmon data
  4. Run Bismark on all data
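For step 1, lane concatenation can be done directly with cat, since concatenated gzip files are themselves valid gzip. A minimal sketch, keeping R1 and R2 separate (sample names are hypothetical):

    # Concatenate L001 + L002 per sample and per read direction
    for r in R1 R2
    do
      for sample in Sealice_F1 Sealice_F2   # hypothetical sample names
      do
        cat "${sample}"_L001_"${r}"_001.fastq.gz \
            "${sample}"_L002_"${r}"_001.fastq.gz \
            > "${sample}"_"${r}".fastq.gz
      done
    done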

from shellytrigg https://ift.tt/2JNJh1f
via IFTTT

Shelly’s Notebook: Fri. Jul. 19, 2019 Salmon sea lice methylomes

Received salmon sea lice methylome data!!

copy data to Gannett metacarcinus folder

  • I received this URL from the sequencing core
  • I mounted Gannet via Finder -> Connect to Server
  • I installed Globus Connect Personal on Ostrich and gave the Globus app write permissions to my metacarcinus folder on Gannet
  • I navigated to the file manager section of my Globus account via the provided URL and could see my sequencing data, “UW_Trigg_190718.tar.gz” (148.22 GB)
  • I selected the “Transfer or Sync to…” option, entered the computer name “Ostrich” (which I had entered when setting up the Globus Connect Personal app), entered the path “/Volumes/web/metacarcinus/Salmo_Calig/FASTQS/”, then clicked start
  • I navigated to the Activity section of Globus and could see the data being transferred (26 GB in 10 min)
  • After a couple hours, I received an email notification and could see in the status section that the transfer had completed successfully

copy data from Gannett to mox

ran this command:

 rsync --archive --progress --verbose strigg@ostrich.fish.washington.edu:/Volumes/web/metacarcinus/Salmo_Calig/FASTQS/UW_Trigg_190718.tar.gz /gscratch/srlab/strigg/data/Salmo_Calig/FASTQS/  

I decided to make two species folders rather than keep the data combined, and wanted to remove the files that were not salmon or sea lice FastQs, so I ran the following commands:

 tar -xvf UW_Trigg_190718.tar.gz
 cd UW_Trigg_190718_done/
 rm -r Reports/
 rm -r Stats/
 rm -r Undetermined_S0_L00*
 cd ../../../
 mkdir Caligus
 mv Salmo_Calig/ Ssalar
 cd Ssalar/
 cd FASTQS/
 cd ..
 cd Caligus/
 mkdir FASTQS
 mv ../Ssalar/FASTQS/UW_Trigg_190718_done/Sealice_F* .
 mv *.gz FASTQS/
 rm UW_Trigg_190718.tar.gz

copy data from Gannett to owl

run fastqc and trim

  • I prepared scripts (salmon: /gscratch/srlab/strigg/jobs/20190719_FASTQC_ADPTRIM_Ssalar.sh and lice: /gscratch/srlab/strigg/jobs/20190719_FASTQC_ADPTRIM_Caligus.sh) to do this on Mox, but they didn’t work because the multiqc and cutadapt programs could not be found as I had specified them. See GitHub issue #712.

from shellytrigg https://ift.tt/2M5nkMt
via IFTTT

Sam’s Notebook: Genome Annotation – O.lurida 20190709-v081 Transcript Isoform ID with Stringtie on Mox

Earlier today, I generated the necessary Hisat2 index, which incorporated splice sites and exons, for use with Stringtie in order to identify transcript isoforms in our 20190709-Olurida_v081 annotation. This annotation utilized tissue-specific transcriptome assemblies provided by Katherine Silliman.

I used all the trimmed FastQ files from the 20180827 Trinity transcriptome assembly, as this utilized all of our existing RNAseq data.

Command to pull the trimmed files (from Trimmomatic) out of the Trinity output folder, which is a gzipped tarball:

 tar -ztvf trinity_out_dir.tar.gz \
 | grep -E "*P.qtrim.gz" \
 && tar -zxvf trinity_out_dir.tar.gz \
 -C /home/sam/Downloads/ \
 --wildcards "*P.qtrim.gz"

This was run locally on my computer (swoose) and then rsync‘d to Mox.

NOTE: The “P” in the *.P.qtrim.gz represents trimmed reads that are properly paired, as determined by Trimmomatic/Trinity. See the fastq.list.txt for the list of FastQ files used as input. For more info on input FastQ files, refer to the Nightingales Google Sheet.

Here’s the quick rundown of how transcript isoform annotation with Stringtie proceeds (see the condensed sketch after this list):

  1. Use Hisat2 reference index with identified splice sites and exons (this was done yesterday).
  2. Use Hisat2 to create alignments from each pair of trimmed FastQ files. Need to use the --downstream-transcriptome-assembly option!!!
  3. Use Stringtie to create a GTF for each alignment.
  4. Use Stringtie to create a singular, merged GTF from all of the individual GTFs.
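The per-sample core of that workflow, distilled from the full script below (sample and file names here are placeholders):

    # Splice-aware alignment; --downstream-transcriptome-assembly (--dta)
    # reports alignments tailored for downstream assemblers like Stringtie
    hisat2 -x 20190709-Olurida_v081 --downstream-transcriptome-assembly \
    -1 sample_R1.fq.gz -2 sample_R2.fq.gz -S sample.sam

    # Per-sample GTF, guided by the existing genome GFF
    stringtie sample.sorted.bam -G genome.gff -o sample.gtf

    # Merge all of the per-sample GTFs into a single annotation
    stringtie --merge gtf_list.txt -G genome.gff -o merged.stringtie.gtf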

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=oly_stringtie
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190716_stringtie_20190709-olur-v081

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log

threads=27
genome_index_name="20190709-Olurida_v081"

# Paths to programs
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2="${hisat2_dir}/hisat2"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
stringtie="/gscratch/srlab/programs/stringtie-1.3.6.Linux_x86_64/stringtie"

# Input/output files
genome_gff="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081/20190709-Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added.gff"
genome_index_dir="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081"
fastq_dir="/gscratch/srlab/sam/data/O_lurida/RNAseq/"
gtf_list="gtf_list.txt"

## Initialize arrays
fastq_array_R1=()
fastq_array_R2=()
names_array=()

# Copy Hisat2 genome index
rsync -av "${genome_index_dir}"/${genome_index_name}*.ht2 .

# Generate checksum of GFF file for backtracking to original
# Original named: Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added.gff
# Created in 20190709 Olurida_v081 annotation - renamed to avoid filename clashes with previous annotations.
md5sum "${genome_gff}" > genome_gff.md5

# Create array of fastq R1 files
for fastq in "${fastq_dir}"*R1*.gz
do
  fastq_array_R1+=("${fastq}")
done

# Create array of fastq R2 files
for fastq in "${fastq_dir}"*R2*.gz
do
  fastq_array_R2+=("${fastq}")
done

# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in "${fastq_dir}"*R1*.gz
do
  names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
done

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${fastq_dir}"*.gz
do
  echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done

# Hisat2 alignments
for index in "${!fastq_array_R1[@]}"
do
  sample_name=$(echo "${names_array[index]}")
  "${hisat2}" \
  -x "${genome_index_name}" \
  --downstream-transcriptome-assembly \
  -1 "${fastq_array_R1[index]}" \
  -2 "${fastq_array_R2[index]}" \
  -S "${sample_name}".sam \
  2> "${sample_name}"_hisat2.err

  # Sort SAM files, convert to BAM, and index
  "${samtools}" view \
  -@ "${threads}" \
  -Su "${sample_name}".sam \
  | "${samtools}" sort - \
  -@ "${threads}" \
  -o "${sample_name}".sorted.bam
  "${samtools}" index "${sample_name}".sorted.bam

  # Run stringtie on alignments
  "${stringtie}" "${sample_name}".sorted.bam \
  -p "${threads}" \
  -o "${sample_name}".gtf \
  -G "${genome_gff}" \
  -C "${sample_name}.cov_refs.gtf"

  # Add GTFs to list file
  echo "${sample_name}.gtf" >> "${gtf_list}"
done

# Create singular transcript file, using GTF list file
"${stringtie}" --merge \
"${gtf_list}" \
-p "${threads}" \
-G "${genome_gff}" \
-o "${genome_index_name}".stringtie.gtf

# Delete unnecessary index files
rm "${genome_index_name}"*.ht2

# Delete unneeded SAM files
rm ./*.sam

Sam’s Notebook: Genome Annotation – Pgenerosa_v070 and v074 Top 18 Scaffolds Feature Count Comparisons

After annotating Pgenerosa_v074 on 20190701, we noticed a large discrepancy in the number of transcripts that MAKER identified compared to Pgenerosa_v070. As a reminder, Pgenerosa_v074 is a subset of Pgenerosa_v070 containing only the 18 longest scaffolds. So, we decided to do a quick comparison of the annotations present in these 18 scaffolds between Pgenerosa_v070 and Pgenerosa_v074.

Briefly, I used grep to pull the features located on those same 18 scaffolds out of the annotated Pgenerosa_v070 GFF from 20190228, and then counted the number of features in the newly subsetted GFF. It’s all documented in the Jupyter notebook below.
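The general pattern looks something like this (the scaffold-name list and output names are illustrative; the notebook is the authoritative record):

    # Pull features on the 18 scaffolds out of the v070 GFF
    # (-w avoids partial scaffold-name matches)
    grep -w -f top18_scaffold_names.txt Pgenerosa_v070.gff > Pgenerosa_v070_top18.gff

    # Count each feature type (GFF column 3) in the subsetted GFF
    awk -F'\t' '!/^#/ {print $3}' Pgenerosa_v070_top18.gff | sort | uniq -c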

Jupyter Notebook (GitHub):


#!/bin/bash
## Job Name - can be changed
#SBATCH --job-name=bs-geo
## Allocation Definition - confirm correctness
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes (often you will only use 1)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=100G
## email notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/sr320/0719/
# Exit script if a command fails
# set -e

##########################
# This is a script written to assess bisulfite sequencing reads
# using Bismark. The user needs to supply the following:
# 1. A single directory location containing BSseq reads.
# 2. BSseq reads need to be gzipped FastQ and end with .fq.gz
# 3. A bisulfite-converted genome, produced with Bowtie2.
# 4. Indicate if deduplication should be performed (whole genome bisulfite sequencing or reduced genome sequencing)
#
# Set these values below



### USER NEEDS TO SET VARIABLES FOR THE FOLLOWING:
# Set --workdir= path in SBATCH header above.
#
# Full path to directory with sequencing reads
reads_dir="/gscratch/srlab/strigg/data/Pgenr/FASTQS"

# Full path to bisulfite-converted genome directory
genome_dir="/gscratch/srlab/sr320/data/geoduck/v074"

# Enter y (for yes) or n (for no) between the quotes.
# Yes - Whole genome bisulfite sequencing, MBD.
# No - Reduced genome bisulfite sequencing (e.g. RRBS)
deduplicate="n"

# Run Bismark on desired number of reads/pairs subset
# The default value is 0, which will run Bismark on all reads/pairs
subset="-u 0"

####################################################
# DO NOT EDIT BELOW THIS LINE
####################################################




# Evaluate user-edited variables to make sure they have been filled
[ -z ${deduplicate} ] \
&& { echo "The deduplicate variable is not defined. Please edit the SBATCH script and add y or n to deduplicate variable."; exit 1; }

[ -z ${genome_dir} ] \
&& { echo "The bisulfite genome directory path has not been set. Please edit the SBATCH script."; exit 1; }

[ -z ${reads_dir} ] \
&& { echo "The reads directory path has not been set. Please edit the SBATCH script."; exit 1; }



# Directories and programs
wd=$(pwd)
bismark_dir="/gscratch/srlab/programs/Bismark-0.21.0_dev"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
threads="28"
reads_list="input_fastqs.txt"

## Concatenated FastQ Files
R1=""
R2=""

# Initialize arrays
R1_array=()
R2_array=()

# Create list of input FastQ files for easier confirmation.
for fastq in ${reads_dir}/*.fq.gz
do
  echo ${fastq##*/} >> ${reads_list}
done

# Check for paired-end
# Capture grep output
# >0 means single-end reads
# set +e/set -e prevents error >0 from exiting script
set +e
grep "_R2_" ${reads_list}
paired=$?
set -e

# Confirm even number of FastQ files
num_files=$(wc -l < ${reads_list})
fastq_even_odd=$(echo $(( ${num_files} % 2 )) )


## Save FastQ files to arrays
R1_array=(${reads_dir}/*_R1_*.fq.gz)
## Send comma-delimited list of R1 FastQ to variable
R1=$(echo ${R1_array[@]} | tr " " ",")

# Evaluate if paired-end FastQs
# Run Bismark as paired-end/single-end based on evaluation
if [[ ${paired} -eq 0 ]]; then
  # Evaluate if FastQs have corresponding partner (i.e. R1 and R2 files)
  # Evaluated on even/odd number of files.
  if [[ ${fastq_even_odd} -ne 0 ]]; then
    { echo "Missing at least one FastQ pair from paired-end FastQ set."; \
      echo "Please verify input FastQs all have an R1 and corresponding R2 file.";
      exit 1; \
    }
  fi
  ## Save FastQ files to arrays
  R2_array=(${reads_dir}/*_R2_*.fq.gz)
  ## Send comma-delimited list of R2 FastQ to variable
  R2=$(echo ${R2_array[@]} | tr " " ",")
  # Run bismark using bisulfite-converted genome
  # Generates a set of BAM files as outputs
  # Records stderr to a file for easy viewing of Bismark summary info
  ${bismark_dir}/bismark \
  --path_to_bowtie2 ${bowtie2_dir} \
  --genome ${genome_dir} \
  --samtools_path=${samtools} \
  --non_directional \
  --score_min L,0,-0.6 \
  ${subset} \
  -p ${threads} \
  -1 ${R1} \
  -2 ${R2} \
  2> bismark_summary.txt
else
  # Run Bismark single-end
  ${bismark_dir}/bismark \
  --path_to_bowtie2 ${bowtie2_dir} \
  --genome ${genome_dir} \
  --samtools_path=${samtools} \
  --non_directional \
  ${subset} \
  -p ${threads} \
  ${R1} \
  2> bismark_summary.txt
fi



# Determine if deduplication is necessary
# Then, determine if paired-end or single-end
if [ ${deduplicate} == "y"  ]; then
  # Sort Bismark BAM files by read names instead of chromosomes
  find *.bam \
  | xargs basename -s .bam \
  | xargs -I bam_basename \
  ${samtools} sort \
  --threads ${threads} \
  -n bam_basename.bam \
  -o bam_basename.sorted.bam
  if [ ${paired} -eq 0 ]; then
    # Deduplication
    find *sorted.bam \
    | xargs basename -s .bam \
    | xargs -I bam_basename \
    ${bismark_dir}/deduplicate_bismark \
    --paired \
    --samtools_path=${samtools} \
    bam_basename.bam
  else
    find *sorted.bam \
    | xargs basename -s .bam \
    | xargs -I bam_basename \
    ${bismark_dir}/deduplicate_bismark \
    --single \
    --samtools_path=${samtools} \
    bam_basename.bam
  fi
  # Methylation extraction
  # Extracts methylation info from deduplicated BAM files produced by Bismark
  # Options to create a bedGraph file, a cytosine coverage report, counts, remove spaces from names
  # and to use the "scaffolds" setting.
  ${bismark_dir}/bismark_methylation_extractor \
  --bedGraph \
  --cytosine_report \
  --genome_folder ${genome_dir} \
  --gzip \
  --counts \
  --scaffolds \
  --remove_spaces \
  --multicore ${threads} \
  --buffer_size 75% \
  --samtools_path=${samtools} \
  *deduplicated.bam
  # Sort deduplicated BAM files
  find *deduplicated.bam \
  | xargs basename -s .bam \
  | xargs -I bam_basename \
  ${samtools} sort \
  --threads ${threads} \
  bam_basename.bam \
  -o bam_basename.sorted.bam
  # Index sorted files for IGV
  # The "-@ ${threads}" below specifies number of CPU threads to use.
  find *deduplicated.sorted.bam \
  | xargs -I sorted_bam \
  ${samtools} index \
  -@ ${threads} \
  sorted_bam
else
  # Methylation extraction
  # Extracts methylation info from BAM files produced by Bismark
  # Options to create a bedGraph file, a cytosine coverage report, counts, remove spaces from names
  # and to use the "scaffolds" setting.
  ${bismark_dir}/bismark_methylation_extractor \
  --bedGraph \
  --cytosine_report \
  --genome_folder ${genome_dir} \
  --gzip \
  --counts \
  --scaffolds \
  --remove_spaces \
  --multicore ${threads} \
  --buffer_size 75% \
  --samtools_path=${samtools} \
  *.bam

  # Sort BAM files
  find *.bam \
  | xargs basename -s .bam \
  | xargs -I bam_basename \
  ${samtools} sort \
  --threads ${threads} \
  bam_basename.bam \
  -o bam_basename.sorted.bam
  # Index sorted files for IGV
  # The "-@ ${threads}" below specifies number of CPU threads to use.
  find *sorted.bam \
  | xargs -I sorted_bam \
  ${samtools} index \
  -@ ${threads} \
  sorted_bam
fi


# Bismark processing report
# Generates HTML reports from previously created files
${bismark_dir}/bismark2report

# Bismark summary report
# Generates HTML summary reports from previously created files
${bismark_dir}/bismark2summary

#bismark, #sbatch