Shelly posted a GitHub Issue asking if I could create a file of S. salar genes with their UniProt annotations (e.g. gene name, UniProt accession, GO terms).
Here’s the approach I took:
- Use GFFutils to pull out only gene features, along with:
  - chromosome name
  - start position
  - end position
  - Dbxref attribute (which, in this case, is the NCBI gene ID)
- Submit the NCBI gene IDs to UniProt to map them to UniProt accessions, using the Perl batch submission script provided by UniProt.
- Parse out the fields we were interested in (gene name, UniProt accession, GO terms).
- Join it all together!
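For readers who want a feel for the extract-and-join logic without opening the notebook, here is a minimal Python sketch. It is not what the notebook does: the notebook uses GFFutils and UniProt's Perl batch script, whereas this swaps in plain-Python parsing and assumes the UniProt mapping results have already been parsed into dictionaries. The function names, the `gene_id`/`accession`/`gene_name`/`go_terms` keys, and all of the example data (chromosome name, gene IDs, accessions) are made up for illustration.

```python
def parse_gene_features(gff_text):
    """Return (chrom, start, end, ncbi_gene_id) for each 'gene' feature in GFF3 text."""
    genes = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep only well-formed gene features
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # Dbxref may list several comma-separated cross-references; take the GeneID.
        gene_id = None
        for xref in attrs.get("Dbxref", "").split(","):
            if xref.startswith("GeneID:"):
                gene_id = xref.split(":", 1)[1]
        if gene_id is not None:
            genes.append((cols[0], int(cols[3]), int(cols[4]), gene_id))
    return genes


def join_annotations(genes, uniprot_rows):
    """Left-join gene coordinates to UniProt rows on NCBI gene ID."""
    by_gene = {}
    for row in uniprot_rows:
        by_gene.setdefault(row["gene_id"], []).append(row)
    joined = []
    for chrom, start, end, gene_id in genes:
        # Genes with no UniProt match are kept, with empty annotation fields.
        empty = {"accession": "", "gene_name": "", "go_terms": ""}
        for row in by_gene.get(gene_id, [empty]):
            joined.append((chrom, start, end, gene_id,
                           row["accession"], row["gene_name"], row["go_terms"]))
    return joined


# Hypothetical example data, not taken from the real S. salar annotation.
EXAMPLE_GFF = (
    "##gff-version 3\n"
    "chr_demo\tGnomon\tgene\t100\t900\t.\t+\t.\tID=gene0;Dbxref=GeneID:111;Name=demo1\n"
    "chr_demo\tGnomon\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0;Dbxref=GeneID:111\n"
    "chr_demo\tGnomon\tgene\t1500\t2400\t.\t-\t.\tID=gene1;Dbxref=GeneID:222;Name=demo2\n"
)
EXAMPLE_UNIPROT_ROWS = [
    {"gene_id": "111", "accession": "DEMO01", "gene_name": "demo1",
     "go_terms": "GO:0000001; GO:0000002"},
]

genes = parse_gene_features(EXAMPLE_GFF)
joined = join_annotations(genes, EXAMPLE_UNIPROT_ROWS)
for record in joined:
    print("\t".join(str(field) for field in record))
```

The left join is a deliberate choice in this sketch: genes that UniProt could not map stay in the output with blank annotation columns, so it is easy to see how many were missed.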
All of this is documented in the Jupyter Notebook below:
Jupyter Notebook (GitHub):
Jupyter Notebook (NBviewer):