Data Wrangling – P.generosa Genomic Feature FastA Creation

Steven asked me to generate FastA files (GitHub Issue) for Panopea generosa (Pacific geoduck) coding sequences (CDS), genes, and mRNAs. One of the primary needs was for each sequence to have an ID that could be used for downstream table joining/mapping. I ended up using a combination of GFFutils and bedtools getfasta. BED files allow a custom name column, so I used that column to build FastA description lines whose IDs identify, and map between, CDS, genes, and mRNAs across the FastAs and GFFs.
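
The actual commands are in the notebook, not reproduced here. As a rough illustration of the idea only, here is a minimal Python sketch of how a BED name column could be assembled from GFF3 ID/Parent attributes. It is not the GFFutils-based method used in the notebook. The function names (parse_attributes, gff_to_bed), the example GFF lines, and their coordinates are all hypothetical, and the sketch assumes every feature carries an ID attribute and at most one Parent. GFF coordinates are 1-based and inclusive while BED starts are 0-based, so the start is shifted down by one.

```python
def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    return dict(
        field.split("=", 1) for field in column9.strip().split(";") if "=" in field
    )


def gff_to_bed(gff_lines, feature_type):
    """Return BED lines for one feature type, naming each as ID|parent|grandparent."""
    rows = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip comments/directives and anything that is not a 9-column GFF record.
        if line.startswith("#") or len(cols) != 9:
            continue
        rows.append((cols, parse_attributes(cols[8])))

    # Lookup of each feature's parent, used to walk CDS -> mRNA -> gene.
    parent_of = {a["ID"]: a.get("Parent") for _, a in rows if "ID" in a}

    bed = []
    for cols, attrs in rows:
        if cols[2] != feature_type or "ID" not in attrs:
            continue
        ids = [attrs["ID"]]
        parent = attrs.get("Parent")
        while parent and parent not in ids:  # stop at the top or on a cycle
            ids.append(parent)
            parent = parent_of.get(parent)
        start0 = int(cols[3]) - 1  # GFF is 1-based; BED start is 0-based
        bed.append(
            "\t".join([cols[0], str(start0), cols[4], "|".join(ids), ".", cols[6]])
        )
    return bed


# Hypothetical GFF3 records; the coordinates are made up for illustration.
example_gff = [
    "Scaffold_01\t.\tgene\t3\t4000\t.\t+\t.\tID=PGEN_.00g000010",
    "Scaffold_01\t.\tmRNA\t3\t4000\t.\t+\t.\tID=PGEN_.00g000010.m01;Parent=PGEN_.00g000010",
    "Scaffold_01\t.\tCDS\t3\t125\t.\t+\t0\tID=PGEN_.00g000010.m01.CDS01;Parent=PGEN_.00g000010.m01",
]

for bed_line in gff_to_bed(example_gff, "CDS"):
    print(bed_line)
```

A BED file built this way could then be passed to bedtools getfasta with the option that writes both the name field and the coordinates into the header (-name+ in recent bedtools versions, as far as I know), which is consistent with the name::scaffold:start-end headers shown in the results.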

This was all documented in a Jupyter Notebook:

GitHub:

NB Viewer:


RESULTS

Output folder:


CDS FastA description lines look like this:

  • >PGEN_.00g000010.m01.CDS01|PGEN_.00g000010.m01|PGEN_.00g000010::Scaffold_01:2-125

Explanation for CDS:

  • PGEN_.00g000010.m01.CDS01: Unique sequence ID.
  • PGEN_.00g000010.m01: “Parent” ID. Corresponds to unique mRNA ID.
  • PGEN_.00g000010: “Parent” ID. Corresponds to unique gene ID.
  • Scaffold_01: Originating scaffold.
  • 2-125: Sequence coordinates from scaffold mentioned above.
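
For downstream use, the fields listed above can be pulled apart with plain string splitting. This is a minimal sketch, assuming the header always follows the exact layout shown (three pipe-separated IDs, then "::", then scaffold:start-end, with no strand suffix). CdsHeader and parse_cds_header are hypothetical names, not part of any library.

```python
from typing import NamedTuple


class CdsHeader(NamedTuple):
    cds_id: str
    mrna_id: str
    gene_id: str
    scaffold: str
    start: int
    end: int


def parse_cds_header(line):
    """Split a CDS FastA description line into its component fields."""
    # Drop the leading '>' and separate the custom name from the coordinates.
    name, _, region = line.strip().lstrip(">").partition("::")
    cds_id, mrna_id, gene_id = name.split("|")
    # rpartition keeps any ':' inside a scaffold name attached to the scaffold.
    scaffold, _, span = region.rpartition(":")
    start, end = span.split("-")
    return CdsHeader(cds_id, mrna_id, gene_id, scaffold, int(start), int(end))


header = ">PGEN_.00g000010.m01.CDS01|PGEN_.00g000010.m01|PGEN_.00g000010::Scaffold_01:2-125"
fields = parse_cds_header(header)
print(fields.gene_id, fields.scaffold, fields.start, fields.end)
```

The mrna_id and gene_id fields are the keys that would be used to join CDS records to rows in an mRNA or gene table.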

mRNA FastA description lines look like this:

  • >PGEN_.00g000030.m01|PGEN_.00g000030::Scaffold_01:49248-52578

Explanation for mRNA:

  • PGEN_.00g000030.m01: Unique sequence ID.
  • PGEN_.00g000030: “Parent” ID. Corresponds to unique gene ID.
  • Scaffold_01: Originating scaffold.
  • 49248-52578: Sequence coordinates from scaffold mentioned above.
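
As an example of the table joining/mapping these IDs are meant to support, here is a minimal sketch that builds an mRNA-to-gene lookup from the description lines of an mRNA FastA. It assumes each header line in the file starts with ">" and follows the layout described above. The function name mrna_to_gene and the placeholder sequence line are hypothetical.

```python
def mrna_to_gene(fasta_lines):
    """Map each mRNA ID to its parent gene ID using FastA description lines."""
    mapping = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a description line
        # Keep only the custom name, i.e. everything before '::'.
        name = line[1:].strip().split("::", 1)[0]
        mrna_id, gene_id = name.split("|")
        mapping[mrna_id] = gene_id
    return mapping


example_fasta = [
    ">PGEN_.00g000030.m01|PGEN_.00g000030::Scaffold_01:49248-52578",
    "ACGT",  # placeholder sequence, not real data
]

lookup = mrna_to_gene(example_fasta)
print(lookup["PGEN_.00g000030.m01"])
```

In practice the lines would come from iterating over an open FastA file rather than from a list.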

from Sam’s Notebook https://ift.tt/8IljR1e