Data Wrangling – C.virginica Gonad RNAseq Transcript Counts Per Gene Per Sample Using Ballgown

As we continue to work on the analysis of impacts of OA on Crassostrea virginica (Eastern oyster) gonads via DNA methylation and RNAseq (GitHub repo), we decided to compare the number of transcripts expressed per gene per sample (GitHub Issue). As it turns out, it was quite the challenge. Ultimately, I wasn’t able to solve it myself, and turned to StackOverflow for a solution. I should’ve just done this at the beginning, as I got a response (and solution) less than five minutes after posting! Regardless, the data wrangling progress (struggle?) was documented in the following GitHub Discussion:

  • [Help with unwiedldy table(https://ift.tt/3KP3yRp)

The final data wrangling was performed using R and documented in this R Markdown file:


RESULTS

Output file (CSV):

Ultimately, the solution came down to this tiny bit of code (see the R Markdown file linked above for actual info about it):

whole_tx_table %>%
select(starts_with(c("gene_name", "FPKM"))) %>%
group_by(gene_name) %>%
summarise((across(everything(), ~sum(. > 0))))

That’s it!

from Sam’s Notebook https://ift.tt/32Io1WS
via IFTTT

#ifttt, #sams-notebook

January 2022 Goals

image

Break? Over. Rest? Acquired. Projects? Ready to be tackled.

Me? Not ready to work yet but oh well here we go.

I have some lofty goals this month, but my plan is to try and split my time more effectively between my PhD and postdoc projects. I think I’ll work a couple days at home and dedicate those to PhD projects, and spend my office days working on postdoc projects. Hopefully the physical separation will help me compartmentalize the work so I don’t feel guilty for not working on postdoc projects while working on my PhD projects, and vice versa!

November Goals Recap

Gigas Gonad Methylation:

  • SO CLOSE to finishing reworking the discussion! But also not done. Whoops.

Hawaii Gigas Methylation:

  • I didn’t have time to work on the Hawaii paper, but now that I have Rajan’s feedback I can start to incorporate that as I restart my work

Virginica Gonad Methylation:

  • Talked through methods with Steven and Sam
  • Have yet to extract SNPs or perform any actual analysis myself

Coral Transcriptomics:

NSF PRFB:

  • Successfully wrote and submitted my NSF PRFB!

Other:

  • Finished identifying new papers for ocean acidification and reproduction review
  • Completed Molecular Ecology review
  • Identified a killifish RRBS and RNA-Seq dataset to work on with Neel
  • Discussed potential projects involving DNMT-3 knockout zebrafish or menhaden with Neel

January Goals

Gigas Gonad Methylation:

  • FINALLY finish the discussion
  • Revise the introduction and abstract
  • Send to Steven for edits
  • Submit to a journal and bioRXiv

Hawaii Gigas Methylation:

  • Address Rajan’s edits
  • Review DSS script and determine if I should go back to methylKit for better interpretation
  • Extract SNPs with EpiDiverse and create a relatedness matrix
  • Look at methylation islands and non-methylated regions
  • Examine overlaps between DML and other epigenomic datasets from Sascha

Virginica Gonad Methylation:

  • Update methods of draft paper
  • Extract SNPs with BSnper and EpiDiverse
  • Identify methods for linking WGBS and RNA-Seq data
  • Identify lncRNA and miRNAs in dataset

Coral Transcriptomics:

  • Continue testing extraction protocol to identify a successful methods
  • Use gel to check RNA integrity
  • Extract all A. cervicornis samples

Killifish Methylation and Expression:

  • Review BAT mapping and DMR identification protocol
  • Start mapping procedure for methylation data
  • Identify protocol for RNA-Seq data

Ocean Acidification and Reproduction Review:

  • Start integrating new papers into main text and supplemental material
  • Revise figures

Other:

  • Meet with Carolyn, Ann, and Neel to discuss projects and funding opportunities for the next six months
  • Develop any potential projects
  • Coordinate postdoctoral mentoring program with Maggi

Please enable JavaScript to view the comments powered by Disqus.

from the responsible postdoc https://ift.tt/3JMtLQ4
via IFTTT

December Goals

Alright, last time I found a bunch of cool stuff by re-running with signed modules. Before I get into my goals, let’s recap those: WGCNA Previously, I switched from an unsigned to a signed network. After doing so, I realized that it would be optimal to use hemat_level as a continuous, rather than categorical, variable. I’d set it up as categorical earlier, simply because samples either had a quite low SQ mean (sub-5000) or quite high (over 100,000). However, I figured hey, why not have the extra precision of treating it as continuous! So I re-ran WGCNA with that input changed, and definitely got some differences. Nothing major, just some changes in module names and contents. After discarding significant modules in which there was high correlation to a single crab, I examined significance with GO-MWU. No enrichment was detected in any GO module within the host or complete transcriptome. However,…

from Aidan F. Coyle https://ift.tt/3DrrANz
via IFTTT

Doing It All

Alright buckle up, it’s going to be a really really long one. But I promise, tons of cool stuff here. Alright, so last time, I was fixing my GO-MWU analyses by switching over from adjusted p-value to log2 fold change. Most of my time since then has been spent on three things: – Examining decreased-temp libraries with DESeq2 and GO-MWU – Re-running WGCNA with signed networks and binarized variables – Analyzing significant WGCNA modules with GO-MWU 1: Running DESeq2 and GO-MWU Comparisons on Decreased-Temp Libraries I more or less ignored those when I was solely examining infection through a cPCR lens. However, since then, I’ve learned some info that’s made me a lot more confident about qPCR and less about cPCR. If we trust qPCR (which I do at this point), then examining our decreased-temperature results becomes a lot more important. It also lets us examine an experimental group over…

from Aidan F. Coyle https://ift.tt/3HxJgtV
via IFTTT

Fixing GO-MWU

GO-MWU Repair Alright, so I’ll keep this relatively quick, since it’s quite late and I just spent a bunch of time fixing old mistakes. So back in February when I was a bright-eyed and bushy-tailed young graduate student, I ran GO-MWU on a long series of pairwise comparisons for crab libraries aligned to each of the three transcriptomes. I’ve recently been going through my old work to put it together into paper format. In doing so, I ended up poking through the GO-MWU README, and realized that GO-MWU made a really, really important update to their instructions. In June, they added a short section to their FAQ saying the following: NOTE: In read-based gene expression analysis (RNA-seq, TagSeq) p-values may be biased towards highly abundant genes, especially when the read depth is low. This may result in the corresponding GO bias. Use log2-fold changes to avoid this. I had previously…

from Aidan F. Coyle https://ift.tt/3BWeZAT
via IFTTT

Bairdi Immune Genes Lit Search

Bad news: in the time since my last notebook post, I caught the novel coronavirus known as COVID-19. Perhaps you’ve heard of it – it definitely isn’t a barrel of fun. Good news: I have since recovered from the novel coronavirus, and have been rockin’ it all week! This week was spent on examining immune genes in Chionoecetes. So earlier, I created a list of genes with the GO term associated with “immune response” (that’s TK GO TERM) for each of our three transcriptomes. Again, that’s unfiltered (cbai_v2.0), Chionoecetes-only (cbai_v4.0), and Hematodinium-only (hemat_v1.6). To better understand what’s going on with the immune system of these Tanner crab, I assigned myself two goals: 1: Better understand the pathways of the crustacean immune system more broadly 2: Examine the specific genes expressed in the crab (that’s the immune genes observed in cbai_v4.0), and search for the importance of those genes in similar…

from Aidan F. Coyle https://ift.tt/316RisZ
via IFTTT

Chatting with Pam Jensen

Background Longtime readers of Aidan’s lab notebook are no doubt familiar with Pam Jensen, the recently-retired Hematodinium expert. Formerly with NOAA, she worked with Grace to make the 2017 C. bairdi / Hematodinium transcriptome project (referred to later as the 2017 NPRB study). The vast majority of my research so far has been analyzing data collected in this project. Pam retired soon before I joined the lab, and upon her retirement, shipped around 30,000 genetic samples to the Roberts lab. These samples are partially described in our post from this August. Most come from Alaskan snow and Tanner crab from either the Eastern Bering Sea (EBS) or Southeast surveys. However, there’s a decent smattering of samples from other species and locations. These samples were nearly all collected from 2005-2019 To get a better idea of the importance and content of the samples, and to gain additional background on the host/parasite…

from Aidan F. Coyle https://ift.tt/2ZBUwnS
via IFTTT

WGBS Analysis Part 36

Miscellaneous enrichment investigations

I have a two outstanding things I want to look at before I close the book on my enrichment analysis results.

Unannotated GOterms and potential nesting

When I was creating my annotation lists, I noticed that not all GOterms from topGO were present in the genome annotation I generated. I want to see if these terms were nested inside of other GOterms that were enriched, or if they were parent terms of other terms present in my annotation.

My biological process GOterms without matching annotations were GO:0001539 (cilium or flagellum-dependent cell motility), GO:0003341 (cilium movement), GO:0060285 (cilium dependent cell motility), GO:0060294 (cilium movement involved in cell motility), GO:0097722 (sperm motility), and GO:0001700 (embryonic development via the syncytial blastoderm). The terms GO:0002165 (instar larval or pupal development), GO:0009791 (post-embryonic development), and GO:0007391 (dorsal closure) had matching annotations for some transcripts but not others. I looked for parent and child terms on the QuickGO.

Several motility terms were related to eachother, including ones with and without annotations:

Screen Shot 2021-10-25 at 2 51 58 PM

I encountered a similar situation looking at two GOterm lineages for developmental terms:

Screen Shot 2021-10-25 at 3 21 45 PM

Most of the cellular component GOterms were missing matching annotations, and were all related to eachother:

Unknown

I had one cellular component term with some annotations, and it was not related to the other terms:

Unknown-2

I’m not sure why some of the topGO GOterms show up in the annotation or not, but my guess is that topGO may consider the entire term “family tree” when performing enrichment in a way that’s different than the annotation.

Enrichment of hyper- vs. hypomethylated DML

I wanted to see if the genes with enriched GOterms contained more hyper- or hypomethylated DML. While we found a pretty even split of hyper- and hypomethylated DML, any differential enrichment of these DML may give us additional insight into the function of methylation. To do this, I took my dataframe with enriched GOterms and used it as a transcript index to filter my master DML list:

sigRes.allBPMethDiff <- unique(sigRes.allBPProduct %>% dplyr::select(., transcript) %>% left_join(., allDMLGOtermsFiltered, by = "transcript") %>% filter(., GOcat == "P") %>% dplyr::select(., geneID, chr, start, end, meth.diff)) #Isolate transcript column; join with DML list; remove all rows that aren't BP GOterms, select gene ID/chr/start/end/meth.diff columns 

Based on my filtering, I identified 16 unique DML in genes with enriched BP GOterms. Of these 16 DML, 13 were hypermethylated and 3 were hypomethylated. When I did the same thing with my CC GOterms, six of the seven unique DML were hypermethylated. There is definitely an enrichment bias for genes with hypermethylated DML.

Going forward

  1. Update methods
  2. Update results
  3. Revise discussion
  4. Revise introduction
  5. Identify journals for submission
  6. Format for submission
  7. Submit preprint to bioRXiv
  8. Submit paper for publication
  9. Report mc.cores issue to methylKit
  10. Perform randomization test
  11. Update mox handbook with R information
  12. Determine if larval DNA/RNA should be extracted

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/2ZjNfZE
via IFTTT

WGBS Analysis Part 36

Miscellaneous enrichment investigations

I have a two outstanding things I want to look at before I close the book on my enrichment analysis results.

Unannotated GOterms and potential nesting

When I was creating my annotation lists, I noticed that not all GOterms from topGO were present in the genome annotation I generated. I want to see if these terms were nested inside of other GOterms that were enriched, or if they were parent terms of other terms present in my annotation.

My biological process GOterms without matching annotations were GO:0001539 (cilium or flagellum-dependent cell motility), GO:0003341 (cilium movement), GO:0060285 (cilium dependent cell motility), GO:0060294 (cilium movement involved in cell motility), GO:0097722 (sperm motility), and GO:0001700 (embryonic development via the syncytial blastoderm). The terms GO:0002165 (instar larval or pupal development), GO:0009791 (post-embryonic development), and GO:0007391 (dorsal closure) had matching annotations for some transcripts but not others. I looked for parent and child terms on the QuickGO.

Several motility terms were related to eachother, including ones with and without annotations:

Screen Shot 2021-10-25 at 2 51 58 PM

I encountered a similar situation looking at two GOterm lineages for developmental terms:

Screen Shot 2021-10-25 at 3 21 45 PM

Most of the cellular component GOterms were missing matching annotations, and were all related to eachother:

Unknown

I had one cellular component term with some annotations, and it was not related to the other terms:

Unknown-2

I’m not sure why some of the topGO GOterms show up in the annotation or not, but my guess is that topGO may consider the entire term “family tree” when performing enrichment in a way that’s different than the annotation.

Enrichment of hyper- vs. hypomethylated DML

I wanted to see if the genes with enriched GOterms contained more hyper- or hypomethylated DML. While we found a pretty even split of hyper- and hypomethylated DML, any differential enrichment of these DML may give us additional insight into the function of methylation. To do this, I took my dataframe with enriched GOterms and used it as a transcript index to filter my master DML list:

sigRes.allBPMethDiff <- unique(sigRes.allBPProduct %>% dplyr::select(., transcript) %>% left_join(., allDMLGOtermsFiltered, by = "transcript") %>% filter(., GOcat == "P") %>% dplyr::select(., geneID, chr, start, end, meth.diff)) #Isolate transcript column; join with DML list; remove all rows that aren't BP GOterms, select gene ID/chr/start/end/meth.diff columns 

Based on my filtering, I identified 16 unique DML in genes with enriched BP GOterms. Of these 16 DML, 13 were hypermethylated and 3 were hypomethylated. When I did the same thing with my CC GOterms, six of the seven unique DML were hypermethylated. There is definitely an enrichment bias for genes with hypermethylated DML.

Going forward

  1. Update methods
  2. Update results
  3. Revise discussion
  4. Revise introduction
  5. Identify journals for submission
  6. Format for submission
  7. Submit preprint to bioRXiv
  8. Submit paper for publication
  9. Report mc.cores issue to methylKit
  10. Perform randomization test
  11. Update mox handbook with R information
  12. Determine if larval DNA/RNA should be extracted

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3nz0tdw
via IFTTT

WGBS Analysis Part 35

Gene librarian for genes with DML associated with enriched GOterms

Last week I finally matched my list of enriched GOterms with gene IDs and methylation difference information. This week, I spent a lot of time playing gene librarian and understanding the functions of each of these genes in the gonad or during embryogenesis, probable responses to low pH stress, and overlaps between other studies.

Methods

To separate my manual annotations and searches from tables I produced in R, I created this Excel spreadsheet. I searched for each of the gene products in publicly available DML lists from the four other ocean acidification and methylation in oyster studies: Downey-Wall et al. 2020, Venkataraman et al. 2020, Chandra Rajan et al. 2021, and Lim et al. 2020. I also looked at the C. gigas-specific reproduction and gene expression papers: Dheilly et al. 2012 and Broquard et al. 2021. Important note with these papers: they use the oyster_v9 genome, while I used the Roslin genome. Finally, I searched for papers on Google Scholar examining these genes in marine invertebrates, but preferably bivalves. For these searches, I looked for low pH responses and/or egg-specific studies. I went through this process for biological process and cellular component lists separately, but there was some overlap between these gene lists.

Biological processes: General functions and interesting notes

Overall, genes are mainly involved in developmental progression and regulation.

  • Dynein heavy chain 5, axonemal: Associated with sperm (flagellar and cillar) motility). Ddynein heavy chain 1, 8, 12, cytoplasmic dynein 1 heavy chain 1-like, dynein heavy chain 1, and cytoplasmic dynein 2 heavy chain 1-like contained DML in C. virginica gonad (Venkataraman et al. 2020). Higher expression of dynein proteins over the course of spermatogeneisis in C. gigas (Dheilly et al. 2012). Dynein does not have strict functions relating to sperm motility. Differential abundance of dynein proteins was found in shotgun proteomic characterization of C. gigas ctenidia (Timmins-Schiffman et al. 2014), and can help move materials for calcification in response to low pH stress in mantle tissue (De Wit et al. 2018). Unfertilized sea urchin eggs contain several dynein proteins (Pratt 1980, Porter 1988). With new genome, it’s possible that this annotation may be a cytoplasmic homologue, as we did not sequence any male DNA. Cytoplasmic dynein would be involved in organelle transport.
  • Unconventional myosin-VI: Motor protein. Decreased expression during gametogenesis, with higher expression in immature gonad (Dheilly et al. 2012). The paramyosin homologue is highly expressed in females over males, with declining expression as gametes mature (Broquard et al. 2021). Unconventional myosin-XVIIIa-like contained a DML in C. virginica gonad (Venkataraman et al. 2020). Unconventional myosins were also found to be important for embryonic development in urchins, specifically to regulate the onset of blastulation and gastrulation stages (Sirotkin 2000). Changes in methylation of this gene could impact progression of gametogenesis or embryogenesis.
  • Serine/threonine-protein kinase 36: Involved in signaling pathways. Higher expression of serine/threonine-protein kinases in mature female gonads (Dheilly et al. 2012), especially during sex determination processes (Broquard et al. 2021). Examination of the C. gigas kinome found several serine/threonine-protein kinases in eggs and embryos, with some gene expression changes in response to abiotic stress (Epelboin et al. 2016). Also found to be related to oocyte maturation in the king scallop (Pauletto et al. 2017). Several serine/threonine-protein kinases contained DML in C. virginica (Venkataraman et al. 2020), and one was differentially methylated in C. hongkongensis larvae. Methyalation of this gene could promote homeostasis by regulating gonad development processes.
  • Helicase domino: Exchanges phosphorylated histone for acetylated form, with chromatin remodeling impacting gene expression. Potentially involved in oogenesis. Outlier SNP loci were found in this gene in O. lurida, potentailly related to immune or stress response (Silliman 2019). Potential methylation regulation of chromatin structure.
  • Protein neuralized (neuralized-like protein): Involved in cell fate decisions. Part of the ubiquitin ligase family and broadly involved in protein ubiquitination processes (Hu 2005, Jiang 2011). Changes in protein abundance in Strongylocentrotus purpuratus egg and embryo related to early cleavage and initiation of gastrulation. Increased expression of this gene in Artemia sinica can suppress cell division and macromolecule synthesis (Jiang 2011), so methylation may be regulating gonad development.
  • Cytoplasmic aconitate hydratase (aconitase): Catalyzes part of the TCA cycle. Found in C. virginica eggs and trocophores, with enzyme activity increasing as development progressed (Black 1962). Associated with energy metabolism during gametogenesis in Pecten maximus (Boonmee 2016), and similarly associated with energy production in C. gigas during spermatogenesis (Kingtong et al. 2013). Higher expression in S. purpuratus populations more exposed to low pH conditions after one day of low pH exposure (embryos), but lower expression after seven days of low pH exposure (larvae) (Evans et al. 2017). Methylation may “prime” embryos for low pH exposure and dictate energy metabolism.
  • Serine-protein kinase ATM: Cell cycle checkpoint kinase, regulates downstream proteins, involved in cellular response to DNA damage. Gene contained DML in C. virginica gonad (Venkataraman et al. 2020). Associated with female gamete generation and development in Sinonovacula constricta. Mobilized under embroygenesis and during environmental stress, regulating cell-cycle progression (Epelboin et al. 2016). Involved in cellular stress response when C. gigas was exposed to low pH and As (Moreira et al. 2018).
  • Cation-independent mannose-6-phosphate receptor: Transport lysosomal enzymes to lysosome. Potential involvement in D. polymorpha contaminant (Leprêtre et al. 2020) and Marsupenaus japonicus bacteria immune responses.

Cellular components

  • Lots of similar genes to BP! Dynein heavy chain 5, unconventional myosin, and helicase domino had enriched CC GOterms. Helicase domino was associated with histone and protein acetyltransferase complexes, and dynein and myosin were associated with microtubule complexes and the cytoskeleton.
  • Cytoplasmic dynein 2 light intermediate chain 1-like: Forms a motor protein complex to transport items along microtubules in cilia and flagella. Similar findings to dynein heavy chain 5 (see above)
  • Kinesin-like protein KIF23: Helps with organelle transport and required for cytokinesis. Increased expression over the course of spermatogenesis, and higher expression in mature male and female gonads (Dheilly et al. 2012). Differential expression between mature gonad and stripped oocytes (Dheilly et al. 2012). Kinesin-like protein KIF12 and KIF15 contained DML in C. virginica gonad (Venkataraman et al. 2020). Changes in kinesin-like protein gene expression have been seen in larval S. purpuratus exposed to low pH (Padilla-Gamiño et al. 2013) and Pocillopora acuta exposed to heat stress (Poquita-Du et al. 2020), and these changes could be associated with reduced calcification. Also associated with immune response in C. gigas (Lorgeril et al. 2011).

Going forward

  1. Look into unannotated GOterms and potential nesting
  2. Determine if there’s a bias towards hyper- or hypomethylated genic DML with enriched functions
  3. Report mc.cores issue to methylKit
  4. Perform randomization test
  5. Update mox handbook with R information
  6. Determine if larval DNA/RNA should be extracted

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3CesOvL
via IFTTT