I finally had some time to tackle this GitHub Issue and create a canonical genes FastA file using the MAKER IDs, instead of the original contig IDs from our Olympia oyster genome assembly – https://owl.fish.washington.edu/halfshell/genomic-databank/Olurida_v081.fa (FastA; 1.1GB).
Everything was documented in a Jupyter Notebook (see link below), but here’s the skinny on how I did it:
- Pull existing FastA-formatted sequences from the fully annotated GFF (GFF; 2.9GB; MAKER appended the FastAs to the end of the GFF).
- Use `bedTools fastaFromBed` to create a FastA of all genes from the gene coordinates in the GFF, which generates a unique FastA header for each sequence.
- Use `sed` to do a substitution using the MAKER IDs and the `bedTools fastaFromBed` IDs.
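For anyone who wants the gist of that last renaming step without digging through the notebook, here's a rough Python sketch of the same idea. It is not the code from the notebook (that used `sed`), and it rests on a few assumptions: that the extracted FastA headers look like `contig:start-end` with a 0-based start, that gene features in the GFF carry an `ID=` attribute holding the MAKER ID, and that the function names and the toy records are made up purely for illustration.

```python
import re


def maker_id_map(gff_lines):
    """Build a lookup of 'contig:start-end' -> MAKER gene ID from GFF gene features."""
    mapping = {}
    for line in gff_lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Only keep well-formed gene features (9 tab-separated columns).
        if len(fields) < 9 or fields[2] != "gene":
            continue
        contig, start, end, attrs = fields[0], int(fields[3]), int(fields[4]), fields[8]
        match = re.search(r"(?:^|;)ID=([^;]+)", attrs)
        if match:
            # Assumption: GFF starts are 1-based, while the extracted FastA
            # header reports a 0-based start, hence start - 1.
            mapping[f"{contig}:{start - 1}-{end}"] = match.group(1)
    return mapping


def rename_headers(fasta_lines, mapping):
    """Yield FastA lines, swapping coordinate-style headers for MAKER IDs."""
    for line in fasta_lines:
        if line.startswith(">"):
            key = line[1:].strip()
            # Headers with no match are passed through unchanged.
            yield ">" + mapping.get(key, key) + "\n"
        else:
            yield line


# Hypothetical toy records, not taken from the real annotation.
example_gff = [
    "##gff-version 3\n",
    "Contig1\tmaker\tgene\t101\t200\t.\t+\t.\tID=example-gene-0.1;Name=example\n",
    "Contig1\tmaker\tmRNA\t101\t200\t.\t+\t.\tID=example-gene-0.1-mRNA-1\n",
]
example_fasta = [">Contig1:100-200\n", "ACGT\n"]

renamed = list(rename_headers(example_fasta, maker_id_map(example_gff)))
print("".join(renamed), end="")
```

Doing the swap through a dictionary lookup rather than one substitution per ID means the FastA is only read once, which matters a bit when there are tens of thousands of genes.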
Jupyter Notebook (GitHub):