Sam’s Notebook: Data Wrangling – Renaming, Splitting, and Feature Counts of Updated Pgenerosa_v074 GenSAS Merged GFF

In the final GFF from our GenSAS Pgenerosa_v074.a4 annotation, we noticed that no repeat motifs/sequences had been identified on Scaffold 01. The remaining scaffolds all had repeat motifs present, so something seemed amiss (see this GitHub Issue for more info).

I ended up contacting GenSAS and it turned out there was a bug on their end that led to this issue:

Taein Lee Nov 26, 2019, 7:27 PM (8 days ago) to me, jhumann

Hi Sam,

Thank you so much for your report. There was a bug and it has been fixed. Your gff3 files has been re-generated.

-Taein

From: gensas-admin on behalf of sam white
Sent: Tuesday, November 26, 2019 3:45 PM
To: gensas-admin; jhumann; taein_lee
Subject: [Website feedback] Merged GFF missing repeats on only one chromosome

Sam (https://ift.tt/2LnmNEw) sent a message using the contact form at https://ift.tt/2qkEE7F.

Hi,

I generated a merged GFF after I “published” my annotation. I included RepeatModeler features in the merged GFF.

My genome has 18 chromosomes. All of them except one chromosome (name: PGA_scaffold1__77_contigs__length_89643857) has the expected repeats annotations present.

I looked at the individual RepeatMasker and RepeatModeler jobs, and both of those GFFs identified repeats on PGA_scaffold1__77_contigs__length_89643857.

Would you happen to have any ideas on why PGA_scaffold1__77_contigs__length_89643857 isn’t showing any repeat features in the merged GFF?

This is for my project Pgenerosa_v074.

Thanks for any insight!

Sam

So, now that I have the updated, final GFF, I want to re-run the splitting of the GFF into separate feature files, and regenerate the counts and sequence length stats for all features (including repeats).
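For readers who want a feel for what that step involves, here is a minimal Python sketch. It is not the code from the notebook. It assumes a standard nine-column, tab-delimited GFF in which column 3 holds the feature type and columns 4 and 5 hold 1-based, inclusive start and end coordinates. The function names and the output file naming scheme (`<prefix>.<feature>.gff`) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def parse_gff(lines):
    """Yield (feature_type, start, end, raw_line) for each GFF feature line.

    Comment/directive lines (starting with '#'), blank lines, and lines
    that do not have exactly nine tab-separated columns are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        yield fields[2], int(fields[3]), int(fields[4]), line


def split_by_feature(lines):
    """Group raw GFF lines by feature type (column 3)."""
    groups = defaultdict(list)
    for feature, _start, _end, raw in parse_gff(lines):
        groups[feature].append(raw)
    return dict(groups)


def feature_stats(lines):
    """Return {feature_type: (count, total_length_bp)}.

    GFF coordinates are 1-based and inclusive, so a feature's length
    is end - start + 1.
    """
    stats = defaultdict(lambda: [0, 0])
    for feature, start, end, _raw in parse_gff(lines):
        stats[feature][0] += 1
        stats[feature][1] += end - start + 1
    return {feature: (count, length) for feature, (count, length) in stats.items()}


def write_feature_files(groups, prefix):
    """Write one GFF per feature type; the naming scheme is hypothetical."""
    paths = []
    for feature, records in groups.items():
        path = f"{prefix}.{feature}.gff"
        with open(path, "w") as handle:
            handle.write("##gff-version 3\n")
            handle.write("\n".join(records) + "\n")
        paths.append(path)
    return paths
```

To use it, read the merged GFF into a list of lines, pass that list to `split_by_feature()` and `feature_stats()`, and hand the grouped result to `write_feature_files()` with whatever output prefix you prefer. Note that summing lengths this way double-counts bases wherever features of the same type overlap.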

Everything is documented in this Jupyter Notebook (GitHub):