Sam’s Notebook: Data Wrangling – FastA Subsetting of Pgenerosa_v070.fa Using samtools faidx

In GitHub Issue #705, Steven asked for a subset of Pgenerosa_v070.fa (2.1GB). In that issue, it was determined that a significant portion of the sequencing data assembled by Phase Genomics clustered in “scaffolds” 1 – 18, so Steven asked for a FastA containing just those 18 scaffolds.

This was done with samtools faidx, which indexes a FastA file and can extract individual sequences from it by name.
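For reference, samtools faidx takes the FastA file followed by one or more sequence names (general form: `samtools faidx <in.fa> <name> [<name> ...] > <out.fa>`) and writes the matching records to standard output. The snippet below is only a rough pure-Python stand-in for that extraction step, not the commands actually run. The function name `subset_fasta` and the `scaffold1` to `scaffold18` IDs are hypothetical placeholders, since the real scaffold names in Pgenerosa_v070.fa are not listed here. Unlike samtools, this sketch scans the whole file instead of using an index, and it returns records in file order.

```python
import io


def subset_fasta(handle, wanted):
    """Yield (record_id, sequence) for FastA records whose ID is in `wanted`.

    Hypothetical helper for illustration; it is not part of samtools.
    """
    wanted = set(wanted)
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if name in wanted:
                yield name, "".join(chunks)
            # The record ID is the header text up to the first whitespace.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line.strip())
    # Emit the final record in the file.
    if name in wanted:
        yield name, "".join(chunks)


# Tiny in-memory FastA with made-up sequences (assumption: real IDs differ).
demo = io.StringIO(
    ">scaffold1 demo record\nACGT\nAC\n"
    ">scaffold2\nGGTT\n"
    ">scaffold19\nTTAA\n"
)

# Placeholder names standing in for the 18 scaffolds of interest.
names = [f"scaffold{i}" for i in range(1, 19)]
subset = dict(subset_fasta(demo, names))
```

With the demo input, `subset` holds only the `scaffold1` and `scaffold2` records; `scaffold19` is skipped because it is not among the 18 requested names.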

The process is documented in the following Jupyter Notebook (GitHub):