Sam’s Notebook: Data Wrangling – Create Canonical Olurida_v081 Genes FastA

I finally had some time to tackle this GitHub Issue and create a canonical genes FastA file that uses the MAKER IDs instead of the original contig IDs from our Olympia oyster genome assembly – https://owl.fish.washington.edu/halfshell/genomic-databank/Olurida_v081.fa (FastA; 1.1GB).

Everything was documented in a Jupyter Notebook (see link below), but here’s the skinny on how I did it:

  1. Pull existing FastA-formatted sequences from the fully annotated GFF (GFF; 2.9GB; MAKER appended the FastAs to the end of the GFF).
  2. Use BEDTools ‘fastaFromBed’ to create a FastA of all genes from the gene GFF coordinates, generating a unique FastA header for each sequence.
  3. Use sed to replace the fastaFromBed-generated IDs in the FastA headers with the corresponding MAKER IDs.
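
For anyone who wants the gist without opening the notebook, here is a rough Python sketch of the same three ideas. It is not what I actually ran (that was bedtools and sed), and the function names are made up for illustration. It assumes MAKER's embedded sequences follow a `##FASTA` directive line in the GFF, that gene features carry an `ID=` attribute, and that fastaFromBed headers look like `contig:start-end` with a 0-based start. Check those assumptions against your own files first.

```python
import re


def split_maker_gff(lines):
    """Split GFF lines into (annotation_lines, fasta_lines) at the ##FASTA directive."""
    annotations, fasta = [], []
    in_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True
            continue  # drop the directive line itself
        (fasta if in_fasta else annotations).append(line)
    return annotations, fasta


def gene_header_map(annotation_lines):
    """Map assumed fastaFromBed-style headers (contig:start-end) to MAKER gene IDs."""
    mapping = {}
    for line in annotation_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        match = re.search(r"(?:^|;)ID=([^;]+)", cols[8])
        if not match:
            continue
        # GFF starts are 1-based; the assumed header format uses a 0-based start.
        start0 = int(cols[3]) - 1
        mapping[f"{cols[0]}:{start0}-{cols[4]}"] = match.group(1)
    return mapping


def rename_headers(fasta_lines, mapping):
    """Swap coordinate-style FastA headers for MAKER IDs; leave unknown headers alone."""
    renamed = []
    for line in fasta_lines:
        if line.startswith(">"):
            key = line[1:].strip()
            renamed.append(">" + mapping.get(key, key) + "\n")
        else:
            renamed.append(line)
    return renamed
```

One caveat with this sketch: two genes with identical coordinates on the same contig (e.g. on opposite strands) would collide in the mapping, so a real run should check that the number of mapped headers matches the number of gene features.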

Jupyter Notebook (GitHub):