Sam’s Notebook: Data Wrangling – FastA Splitting With faSplit

Steven posted an issue on GitHub regarding splitting a FastA file into multiple sequences. Specifically, he wanted a single, large FastA sequence (~89Mbp) split into smaller FastAs for BLASTing.

I downloaded the FastA he provided and split the sequence into 2000bp chunks using the faSplit program:

    faSplit \
    size \
    20190731_faSplit_PGA-scaffold1_splits_2000bp/ \
    2000
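For anyone without faSplit handy, here is a rough sketch of what a fixed-size split amounts to, using only standard Unix tools (grep, tr, fold, awk). It is not faSplit and makes no claim about how faSplit works internally. The function name `split_fasta_chunks`, its arguments, and the `chunk_NNNNN.fa` naming are all made up for illustration, and it assumes a FastA with a single sequence.

```shell
#!/usr/bin/env bash
# Sketch only: mimics a fixed-size split with standard tools, NOT faSplit.
# split_fasta_chunks and the chunk_NNNNN.fa naming are hypothetical.
# Assumes the input FastA holds a single sequence.
split_fasta_chunks() {
  local fasta="$1"       # input FastA (single sequence)
  local chunk_size="$2"  # chunk length in bp, e.g. 2000
  local out_dir="$3"     # directory to write chunk files into

  mkdir -p "${out_dir}"

  # Drop the header, join the sequence lines into one long line,
  # re-wrap at chunk_size bases, then write each wrapped line
  # to its own FastA file with a numbered header.
  grep -v '^>' "${fasta}" \
    | tr -d '\n' \
    | fold -w "${chunk_size}" \
    | awk -v dir="${out_dir}" '{
        file = sprintf("%s/chunk_%05d.fa", dir, NR)
        printf(">chunk_%05d\n%s\n", NR, $0) > file
        close(file)
      }'
}
```

Calling `split_fasta_chunks input.fa 2000 splits/` (with placeholder names) would write one file per 2000bp piece, with the final piece holding whatever bases are left over.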

Sam’s Notebook: Data Summary – P.generosa Transcriptome Assemblies Stats

In our continuing quest to wrangle the geoduck transcriptome assemblies we have, I was tasked with compiling assembly stats for our various assemblies. The table below provides an overview of some stats for each of our assemblies. Links within the table go to the notebook entries for the various methods from which the data was gathered. In general:

  • Genes/Isoforms stats come directly from the Trinity assembly stats output file.
  • transdecoder_pep is a count of headers in the Transdecoder FastA output file, transdecoder_pep.
  • CD-Hit is a count of headers in the CD-Hit-est FastA output file.

Assembly             Genes    Isoforms  transdecoder_pep  CD-Hit
ctenidia             216248   349773    72274             325783
gonad                151263   198748    31706             189378
Juvenile (EPI 115)   199765   320691    78149             297848
Juvenile (EPI 116)   268476   434877    99089             408498
Juvenile (EPI 123)   196131   303568    67398             284852
Juvenile (EPI 124)   255277   421670    93285             395527
Larvae (EPI 99)      249799   425165    77694             379210
MEANS                219566   350642    74228             325871
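
The header counts and the MEANS row are simple to reproduce. Below is a minimal sketch of how that could be done; the function names `count_fasta_headers` and `column_mean` are made up for illustration and are not taken from the notebook entries linked in the table.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper names, not from the original notebooks.

# Count sequences in a FastA file by counting header lines (those starting with ">")
count_fasta_headers() {
  grep -c '^>' "$1"
}

# Read one number per line from stdin and print the mean,
# rounded to the nearest whole number
column_mean() {
  awk '{ sum += $1 } END { if (NR > 0) printf "%.0f\n", sum / NR }'
}
```

For example, piping the seven Genes values from the table into `column_mean` gives 219566, which matches the MEANS row.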

Sam’s Notebook: Transcriptome Compression – P.generosa Transcriptome Assemblies Using CD-Hit-est on Mox

In continued attempts to get a grasp on the geoduck transcriptome size, I decided to “compress” our various assemblies by clustering similar transcripts in each assembly into a single “representative” transcript, using CD-Hit-est. The settings used to run it were taken from the Trinity FAQ regarding “too many transcripts”.

A bash script was used to rsync files to Mox and then execute the SBATCH script.

Bash script (GitHub):

    #!/usr/bin/bash

    # Script to retrieve geoduck Trinity assemblies
    # Assemblies will be used in SBATCH script called at end of this script.
    # Script needs to be run within same directory as SBATCH script.

    # Exit if any command fails
    set -e

    # Set rsync remote path
    gannet="gannet:/volume2/web/Atumefaciens"
    owl="owl:/volume1/web/Athaliana"

    # Create array of directories for storing Trinity assemblies
    assembly_dirs_array=(
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/20180827_assembly
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/heart
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99
    )

    # Array of Trinity assemblies remote paths for rsync-ing
    # (must stay in the same order as assembly_dirs_array)
    assemblies_array=(
    20180827_trinity_geoduck_RNAseq/Trinity.fasta
    20190409_trinity_pgen_ctenidia_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_gonad_RNAseq/trinity_out_dir/Trinity.fasta
    20190215_trinity_geoduck_heart_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI115_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI116_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI123_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI124_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI99_RNAseq/trinity_out_dir/Trinity.fasta
    )

    # Retrieve FastA files via rsync
    for index in "${!assemblies_array[@]}"
    do
      # Remove everything after first slash
      assembly=$(echo "${assemblies_array[index]%%/*}")
      echo "Preparing to download ${assembly}..."

      # The 20180827 assembly lives on owl; all others are on gannet
      if [ "${assembly}" = "20180827_trinity_geoduck_RNAseq" ]; then
        echo "Now syncing ${assembly} to ${assembly_dirs_array[index]}"
        rsync \
        --archive \
        --progress \
        "${owl}/${assemblies_array[index]}" \
        "${assembly_dirs_array[index]}"
      else
        echo "Now syncing ${assembly} to ${assembly_dirs_array[index]}"
        rsync \
        --archive \
        --progress \
        "${gannet}/${assemblies_array[index]}" \
        "${assembly_dirs_array[index]}"
      fi
    done

    # Start SBATCH script to run CD-Hit on all transcriptome assemblies
    sbatch
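
The script above leans on two Bash idioms that are easy to misread: looping over array indices so that two arrays can be walked in parallel, and the `%%/*` parameter expansion for trimming a path down to its first component. Here is a small, self-contained illustration. The array contents and the `pair_assemblies` function name are made-up placeholders, not the real paths.

```shell
#!/usr/bin/env bash
# Illustration of the paired-array loop and the %%/* expansion.
# All names and paths below are hypothetical placeholders.
remote_paths=(
  run_A/trinity_out_dir/Trinity.fasta
  run_B/Trinity.fasta
)
local_dirs=(
  /tmp/example/tissue_A
  /tmp/example/tissue_B
)

pair_assemblies() {
  local index run_name
  # "${!remote_paths[@]}" expands to the array indices (0 1 ...),
  # so the same index can address both arrays.
  for index in "${!remote_paths[@]}"
  do
    # %%/* deletes the longest suffix matching "/*",
    # leaving only the text before the first slash.
    run_name="${remote_paths[index]%%/*}"
    echo "${run_name} -> ${local_dirs[index]}"
  done
}
```

Because the pairing is purely positional, the two arrays have to be kept in the same order; adding an entry to one without the other would silently mismatch sources and destinations.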

SBATCH script (GitHub):

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=cdhit_pgen
    ## Allocation Definition
    #SBATCH --account=srlab
    #SBATCH --partition=srlab
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=5-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ##turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190729_cdhit-est_pgen_transcriptomes

    # This script is called by
    # That script uses rsync to transfer files to Mox via the login node.
    # This is required because Mox execute nodes don't have internet access.

    # Exit script if any command fails
    set -e

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo "${PATH}" | tr : \\n >> system_path.log

    # Set CPU threads
    threads=27

    # Program paths
    cd_hit_est="/gscratch/srlab/programs/cd-hit-v4.8.1-2019-0228/cd-hit-est"

    # Create assembly paths array
    assembly_dirs_array=(
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/20180827_assembly
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/heart
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99
    )

    # Run cd-hit-est on each assembly
    for index in "${!assembly_dirs_array[@]}"
    do
      # Store individual sample name by removing
      # everything up to and including the last slash in path
      sample_name=$(echo "${assembly_dirs_array[index]##*/}")

      # Run cd-hit-est
      "${cd_hit_est}" \
      -o "${sample_name}".cdhit \
      -c 0.98 \
      -i "${assembly_dirs_array[index]}"/Trinity.fasta \
      -p 1 \
      -d 0 \
      -b 3 \
      -T "${threads}" \
      -M 0
    done
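
Once a run finishes, a quick sanity check is to count how many clusters each assembly collapsed into. The sketch below assumes that cd-hit-est wrote its cluster listing next to the `-o` output as `<sample_name>.cdhit.clstr`, with each cluster introduced by a `>Cluster N` line, and that it is run from the job's working directory. The function name `cdhit_cluster_count` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: cdhit_cluster_count is a hypothetical helper.
# Assumes <sample_name>.cdhit.clstr exists in the current directory
# and that each cluster begins with a ">Cluster N" line.
cdhit_cluster_count() {
  local assembly_dir="$1"

  # Same trick as the SBATCH script: ##*/ deletes the longest prefix
  # matching "*/", leaving only the last path component.
  local sample_name="${assembly_dir##*/}"
  local clstr_file="${sample_name}.cdhit.clstr"

  # Member lines start with a number, so only cluster headers match.
  echo "${sample_name} $(grep -c '^>Cluster' "${clstr_file}")"
}
```

Calling it with one of the assembly directories, e.g. `cdhit_cluster_count /gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad`, would print the sample name followed by its cluster count, which should agree with the number of headers in the matching `.cdhit` FastA.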