Data Wrangling – Gene ID Extraction from P.generosa Genome GFF Using Methylation Machinery List

Per this GitHub Issue, Steven asked that I take a list of gene names associated with DNA methylation and see if I could extract the matching Panopea generosa gene IDs, along with the corresponding BLAST e-value for each, from our P.generosa genome annotation (see Genomic Resources wiki for more info).

Here’s the list of gene names provided:

dnmt1 dnmt3a dnmt3b dnmt3l mbd1 mbd2 mbd3 mbd4 mbd5 mbd6 mecp2 Baz2a Baz2b UHRF1 UHRF2 Kaiso zbtb4 zbtb38b zfp57 klf4 egr1 wt1 ctcf tet1 tet2 tet3 

The operations were run in a Jupyter Notebook. All results are available in the notebook, as well as in the RESULTS section below.

Briefly, here’s how the process was run:

  1. Use list of gene names to scan GenSAS Panopea-generosa-vv0.74.a4.gene.gff3
  2. Use list of matches to scan both GenSAS BLAST results files:
     • Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab
     • Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab
  3. Extract e-values for any matches.
  4. Print out tab-delimited table of P.generosa gene IDs, gene names, and both BLAST results e-values, if present.
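The steps above can be sketched roughly in Python as follows. This is only an illustration, not the code from the notebook. It assumes that gene names appear somewhere in the GFF3 attributes column (column 9), that the IDs in the GFF match the query IDs in the BLAST/DIAMOND files, and that those files use the standard 12-column BLAST tabular layout with the e-value in column 11. The GenSAS files may differ on any of these points. The function names, the sample records, and the "NA" placeholder are all hypothetical.

```python
import re


def find_gene_matches(gff_lines, gene_names):
    """Map GFF3 feature IDs to the first gene name found in their attributes."""
    matches = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments/directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF3 record
        attributes = fields[8]
        id_match = re.search(r"(?:^|;)ID=([^;]+)", attributes)
        if id_match is None:
            continue
        for name in gene_names:
            # Case-insensitive match that is not embedded in a longer
            # alphanumeric token (so "tet1" does not match "tet10").
            pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
            if re.search(pattern, attributes, flags=re.IGNORECASE):
                matches[id_match.group(1)] = name
                break
    return matches


def collect_evalues(blast_lines, gene_ids, query_col=0, evalue_col=10):
    """Return {query ID: e-value} for the first hit of each requested ID.

    Column positions are assumptions (standard BLAST tabular output).
    """
    evalues = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(query_col, evalue_col):
            continue
        query = fields[query_col]
        if query in gene_ids and query not in evalues:
            evalues[query] = fields[evalue_col]
    return evalues


def build_table(matches, blast_evalues, diamond_evalues, missing="NA"):
    """Build tab-delimited rows: gene ID, gene name, BLAST and DIAMOND e-values."""
    rows = ["gene_id\tgene_name\tblast_evalue\tdiamond_evalue"]
    for gene_id, name in sorted(matches.items()):
        rows.append("\t".join([
            gene_id,
            name,
            blast_evalues.get(gene_id, missing),
            diamond_evalues.get(gene_id, missing),
        ]))
    return rows


# Made-up example records (NOT real P.generosa data), for illustration only.
gff_lines = [
    "##gff-version 3",
    "Scaffold_01\tGenSAS\tgene\t100\t900\t.\t+\t.\tID=gene0001;Name=DNMT1_HUMAN",
    "Scaffold_01\tGenSAS\tgene\t1500\t2400\t.\t-\t.\tID=gene0002;Name=ACTB_HUMAN",
    "Scaffold_02\tGenSAS\tgene\t50\t700\t.\t+\t.\tID=gene0003;Name=TET2_MOUSE",
]
blast_lines = [
    "gene0001\tsubject_A\t55.0\t300\t10\t2\t1\t300\t5\t304\t1e-50\t200",
]
diamond_lines = [
    "gene0003\tsubject_B\t48.2\t250\t20\t3\t1\t250\t10\t259\t2e-30\t120",
]

matches = find_gene_matches(gff_lines, ["dnmt1", "tet2"])
blast_evalues = collect_evalues(blast_lines, matches)
diamond_evalues = collect_evalues(diamond_lines, matches)
table = build_table(matches, blast_evalues, diamond_evalues)
print("\n".join(table))
```

With real files, the line lists would come from opening the GFF3 and the two functional .tab files. Only the first hit per ID is kept here, which is a simplification. If a file lists several hits per query, keeping the lowest e-value may be more appropriate.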

Jupyter Notebook: