WordPress Management

I’ll keep this quick It’s my third lab notebook post of the day, so definitely want to be brief with all this! But basically, I’m taking over the Bitter Crab blog from Grace! I’ve got a bunch of vague ideas for where to go from here, with the goal of keeping it generally outreach-y rather than getting too tied up in what’s happening in the lab. After all, if people want that, they can just come here! I also already wrote my first blog post – on snow and Tanner crab hybridization! Anyways, this is largely to mark that a) I’m managing the blog now, and b) to just put down some ideas I had for future articles to write. They’re generally Tanner- or Hematodinium-centric, but branch out a bit into some other Alaskan crab fisheries. They are as follows: Tanner crab fisheries (Bering Sea, Kodiak, SE) Tanner crab mating…

from Aidan F. Coyle https://ift.tt/3zcqQe6

Manual Clustering 2

Manual Clustering, Continued Alright, some updates from last time: Rather than setting the number of modules independently for each individual crab/transcriptome, I specified a single cut height, which was used on the dendrogram for each crab to separate into modules. For crabs A-F (where three time points are present), this cut height was 1.8, and was chosen because it generally separated crabs into 5-8 modules, all of which appeared to contain single expression patterns. Crabs G, H, and I were part of the elevated-temperature treatment group, which only had 2 time points, and so that cut height just wasn’t adequate. As a result, for these crabs, I bumped up the cut height to 10. These changes can be seen in scripts 71-73. All are the same except for initial inputs and module naming but here’s one as an example I changed my name scheme for the manual clustering method. Previously,…

from Aidan F. Coyle https://ift.tt/2Sf4Grb

Using Linear Modeling on Hematodinium

Quarter Recap Before getting into the details, the GitHub repo for all this is available here! It also includes a more detailed write-up here So this quarter, I took Sarah Converse’s QERM 514 class on linear modeling. As the final project, we chose a data set to model. My choice was, naturally, some data on Hematodinium and Tanner crab – specifically, I chose the 2007-2012 Southeast Alaska Tanner crab survey data we obtained from Pam Jensen. For these surveys, each row equals one crab, with approximately 14,500 crab in our dataset total. A variety of physical and environmental factors are tracked – most notably, the following: Hematodinum infection status (infected or uninfected) Year Location code Day and time Pot depth Carapace width Sex Shell condition Black Mat disease infection status Missing legs code Of these, I treated year, carapace width, day, and pot depth as continuous, and all others as…

from Aidan F. Coyle https://ift.tt/3zcqNyW

Hawaii Gigas Methylation Analysis Part 19

Restarting functional annotation with blastx

TL;DR, I decided use standard blastx instead of DIAMOND blastx to annotate the new genome. More details on why I made that decision below.

A struggle recap

I initially started using DIAMOND blastx to annotate the new C. gigas genome, but ran into several issues. Since I wasn’t getting any useful output, Sam suggested I use the existing genome annotation in this discussion. I looked at the annotation release and FTP site and found the annotation provided by the NCBI. While the annotation has protein ID and product information, I have no Uniprot Accession codes, which means I can’t obtain GOterms.

At this point, Steven suggested I retrieve accession information through Uniprot mapping. Sam had a script I could modify to retrieve Uniprot Accession codes from the genomic GFF file. I modified my Jupyter notebook to go through these steps based on Sam’s code. I ran into a few perl module installation issues, which I detailed in this discussion. Once I troubleshooted the perl issues, I still wasn’t getting any matches! Sam spot-checked a few gene IDs on the Uniprot website and didn’t get matches either. Turns out the genome is new enough that the information is not on the Uniprot website!

I spot-checked the protein IDs from the protein annotation, and saw those were returning results. I then pivoted my strategy: use protein IDs to get Uniprot information, then find a different way to map protein IDs to gene IDs. To do this programmatically, I tried using the RefSeq annotation but encountered the same no-match result, which I detailed in a new discussion. At this point, Steven suggested I just use the web interface for Uniprot mapping, since my spot-check worked. When I tried uploading all of the protein IDs to the GUI, the interface was no longer available! Same when I tried just copying and pasting instead of uploading a file. Steven tried this as well, and ended up only getting a ~1000 of the ~60,000 entries mapping. Even then, they only mapped to themselves! So, blastx it is.

Moving on to standard blastx

I reworked this script for blastx instead of DIAMOND blastx. It wasn’t too difficult, as the syntax is mostly similar between the two programs. I started running this on mox, so hopefully it finishes in a day and I can use the output for enrichment analysis!

Exit script if any command fails

set -e

Program paths


Create database from Uniprot-SwissProt information

Speciy protein database

Save to srlab/yaamini/blastdbs

-in /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot_sprot.fasta
-dbtype prot
-out /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.blastdb

Run blastx

Output format 6 produces a standard BLAST tab-delimited file

-db /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.blastdb
-query /gscratch/scrubbed/yaaminiv/data/cgigas_uk_roslin_v1_genomic-mito.fa
-out 20210605-cgigas-roslin-mito-blastx.outfmt6
-outfmt 6
-evalue 1e-4
-max_target_seqs 1
-num_threads 28

Going forward

  1. Perform enrichment
  2. Update methods
  3. Update results
  4. Create figures
  5. Outline discussion
  6. Write discussion
  7. Write introduction
  8. Conduct randomization test with DSS
  9. Try EpiDiverse/snp for SNP extraction from WGBS data
  10. Run methylKit or randomization test on mox
  11. Transfer scripts used to a nextflow workflow

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3uXajqW

Data Wrangling – S.salar Gene Annotations from NCBI RefSeq GCF_000233375.1_ICSASG_v2_genomic.gff for Shelly

Shelly posted a GitHub Issue asking if I could create a file of S.salar genes with their UniProt annotations (e.g. gene name, UniProt accession, GO terms).

Here’s the approach I took:

  1. Use GFFutils to pull out only gene features, along with:
  • chromosome name
  • start position
  • end position
  • Dbxref attribute (which, in this case, is the NCBI gene ID)
  1. Submit the NCBI gene IDs to UniProt to map the NCBI gene IDs to UniProt accessions. Accomplished using the Perl batch submission script provided by UniProt.
  2. Parse out the stuff we were interested in.
  3. Join it all together!

All of this is documented in the Jupyter Notebook below:

Jupyter Notebook (GitHub):

Jupyter Notebook (NBviewer):

June 2021 Goals


May Goals Recap

Hawaii Gigas Methylation:

Gigas Gonad Methylation:


  • Establish a dissertation reading committee
  • Finish all work on the ocean acidification and reproduction review

June Goals

AKA rolling over May Goals that were rolled over from April…that were rolled over from March. One day I’ll learn how to set reasonable goals!

Hawaii Gigas Methylation:

  • Extract SNPs from bisulfite sequencing data with EpiDiverse
  • Identify significantly enriched GOterms associated with DML
  • Identify methylation islands and non-methylated regions
  • Finalize figures and statistical analyses
  • Have a complete initial manuscript!
  • Send completed manuscript to my dissertation reading committee no later than the end of the quarter!

Gigas Gonad Methylation:

  • Identify significantly enriched GOterms associated with DML
  • Identify methylation islands and non-methylated regions
  • Finalize figures and statistical analyses
  • Have a complete initial manuscript!
  • Send completed manuscript to my dissertation reading committee no later than the end of the quarter!
  • Decide if it’s worth extracting gonad RNA for integrated RNA-Seq and methylation analyses


  • Finish all work on the ocean acidification and reproduction review
  • Present at Chew Series seminar
  • Draft defense talk for feedback from the lab
  • Pass out for approximately six days

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3yUy3zb

Hawaii Gigas Methylation Analysis Part 17

Trying DSS for DML identification

Initially, I wanted to test-run DSS and ramwas. I looked into the manuals for both packages, and wasn’t convinced that ramwas was a good fit for my data. I could associate methylation information with genotype data, which seemed valuable. However, my genotype data came from BS-Snper analyses with my WGBS data, and looking for associations between WGBS and WGBS-derived genotype data didn’t make sense to me! Instead, I focused on using DSS to identify differentially methylated loci.

DSS background

DSS uses a Bayesian hierarchical model to estimate dispersions, then uses a Wald test for differential methylation, to calculate CpG methylation in a general experimental design. I was interested in using this package because I wanted to test pH and ploidy simultaneously, instead of using each variable as a covariate in my analysis! Coverage data (i.e. count data) are modeled using a beta-binomial distribution with an arcisne link function and log-normal priors, and the model is fit on the transformed data using the generalized least squares method. A Wald test is used at each CpG site for statistical testing. The test itself depends on sequencing depth and a dispersion parameter. This dispersion parameter is defined by a shrinkage estimator from the Bayesian hierarchical model. After conducting the Wald test, I can define DML based on p-value and FDR thresholds.

According to the package creators, this method is better than the Fisher’s test used by methylKit. With the Fisher’s test, read counts are added between samples in a treatment (think unite). By doing this, the test assumes that data in each sample are distributed in the same way, and obscures any variation between samples in a treatment. Based on the data I’ve seen from methylKit, the data do have similar distributions, but I have seen certain samples look like outliers in the PCA. Retaining the biological variation in each treatment with the dispersion parameters seems like a good step in the right direction.

Another thing DSS does while modeling is smoothing, which involves using information from nearby CpG sites to better inform methylation calculations at a specific CpG site. The manual made it sound like smoothing was always recommended to improve calculations, but the paper only suggested using smoothing when the data is “dense” and there are CpG sites closeby. This immediately made me think of differences between the vertebrate and invertebrate methylation model. It’s not likely that I’ll have many CpGs nearby to estimate methylation information, so I decided not to use any smoothing.

Oh, did I mention that DSS was designed to be less computationally intensive? Very important.

Installing packages on R Studio server

The first step of my analysis was installing packages on mox to use with the R Studio server! I needed to install DSS and its dependency bsseq. At first, I opened a build node and tried installing the packages that way, but I ran into continual errors. I then tried to install the packages from the R Studio server interface itself, and ran into similar issues. Turns out Aidan was having similar issues and posted this discussion. Sam mentioned that if we wanted to install packages to use on the server, we needed to create our own singularity container.

I followed Sam’s instructions in the handbook to create a singularity container. First I created a definition file:

Screen Shot 2021-05-25 at 2 35 52 PM

I then created the script r_packages_installs.R to install specific R packages, like methylKit, bsseq, and DSS:

Screen Shot 2021-05-25 at 2 36 14 PM

After logging into a build node, I loaded the singularity module and built the container rstudio-4.0.2.yrv-v1.0.sif. Finally, I modified my R Studio SLURM script, found here, to call my new singularity container! I think I ran into some errors later on because my singularity container was in my login node instead of somewhere wiht more storage, but that didn’t seem to be the sole cause. There might be something weird with my container or SLURM script, but it didn’t fully affect my ability to do the analysis.

Working through DSS

To work through this analysis, I relied heavily on the DSS manual! I pasted the code that ran on the R Studio server into this R Markdown script so it was easier to follow.

First, I modified my merged coverage files from bismark to fit DSS input requirements. Each sample needed a corresponding text file with chromosome, position, read coverage, and reads methylated, and required a header. For position, I used the start of the CpG dinucleotide.

#Create chr, pos, read cov, and reads meth columns for f in *cov do awk '{print $1"\t"$2"\t"$5+$6"\t"$5}' ${f} \ > ${f}.txt done #Add header for f in *txt do sed -i '1i chr\tpos\tN\tX' ${f} done #Rename files: zr3644_#.txt for f in *txt do [ -f ${f} ] || continue mv "${f}" "${f//_R1_val_1_val_1_val_1_bismark_bt2_pe..CpG_report.merged_CpG_evidence.cov/}" done 

To import multiple files into R Studio, I used a method with list2env I found two years ago:

coverageFiles <- list.files(pattern = "*txt") #Create a file list for all 24 files to import list2env(lapply(setNames(coverageFiles, make.names(gsub("", "", coverageFiles))), read.table, header = TRUE), envir = .GlobalEnv) #Import all coverage files into R global environment, using first line as header 

Once I had the data in R Studio, I created a BSseq object (this is why I needed the bsseq dependency). This object is just a large conglomeration of coverage information for each sample and sample ID:

BSobj <- makeBSseqData(dat = list(zr3644_1.txt, zr3644_2.txt, zr3644_3.txt, zr3644_4.txt, zr3644_5.txt, zr3644_6.txt, zr3644_7.txt, zr3644_8.txt, zr3644_9.txt, zr3644_10.txt, zr3644_11.txt, zr3644_12.txt, zr3644_13.txt, zr3644_14.txt, zr3644_15.txt, zr3644_16.txt, zr3644_17.txt, zr3644_18.txt, zr3644_19.txt, zr3644_20.txt, zr3644_21.txt, zr3644_22.txt, zr3644_23.txt, zr3644_24.txt), sampleNames = c("2H-1", "2H-2", "2H-3", "2H-4", "2H-5", "2H-6", "2L-1", "2L-2", "2L-3", "2L-4", "2L-5", "2L-6", "3H-1", "3H-2", "3H-3", "3H-4", "3H-5", "3H-6", "3L-1", "3L-2", "3L-3", "3L-4", "3L-5", "3L-6")) #Make BSseq object. dat = list of dataframes with coverage information. sampleNames = sample names. 

The BSseq object is what’s used when fitting the model! After creating the BSseq object, I defined the experimental design, making sure to match sample ID information in the row names with what I included in the BSseq object:

design <- data.frame("ploidy" = c(rep("D", times = 12), rep("T", times = 12)), "pH" = c(rep("H", times = 6), rep("L", times = 6), rep("H", times = 6), rep("L", times = 6))) #Create dataframe with experimental design information row.names(design) <- c("2H-1", "2H-2", "2H-3", "2H-4", "2H-5", "2H-6", "2L-1", "2L-2", "2L-3", "2L-4", "2L-5", "2L-6", "3H-1", "3H-2", "3H-3", "3H-4", "3H-5", "3H-6", "3L-1", "3L-2", "3L-3", "3L-4", "3L-5", "3L-6") #Set sampleID as rownames 

To fit the general linear model to the data, I used the DMLfit.multiFactor function, which allowed me to specify a model with ploidy, pH, and in the interaction between the two variables. After fitting the model, I checked the column names to 1) confirm that the model fit the variables I wanted and 2) see which condition was picked at the “treatment” condition (this usually occurs alphabetically, and luckily for me my treatments are L and T, which come after H and D):

DMLfit <- DMLfit.multiFactor(BSobj, design = design, formula = ~ploidy + pH + ploidy:pH) #Formula: intercept, additive effect of ploidy and pH, and interaction colnames(DMLfit$X) #Column names of the design matrix (ploidyT = ploidy triploid, pHL = pH low, ploidyT:pHL = interaction) 

For each coefficient, I extracted model fit information, then identified DML using a p-value < 0.05 and FDR < 0.01. This is roughly equivalent with a p-value < 0.05 and q-value < 0.01 I used with methylKit:

DMLtest.pH <- DMLtest.multiFactor(DMLfit, coef = "pHL") #Test the pH effect pH.DML <- subset(DMLtest.pH, subset = pvals < 0.05 & fdrs < 0.01) #Obtain loci with p-value < 0.05 and FDR < 0.01 

One important thing to note is that I couldn’t select a percent methylation difference with a generalized linear model. The model doesn’t produce methylation values to begin with, and using multiple regressions makes it more complicated to return these kinds of values (see the FAQ here).

Looking at DML in IGV

Next step: look at DML in IGV! I opened this IGV session and added in DSS DML BEDfile tracks I produced in this Jupyter notebook. When I looked at the DML, it was easier to understand why some DML were called over others. I definitely ended up in a confusing/existential place: if we don’t quite understand what methylation levels really mean, can I look at a DML in IGV and make a judgement call as to what looks like differential methylation? I decided to keep my DSS DML moving forward, even if I couldn’t understand why certain loci were identified as differentially methylated.

DML overlaps between methods

Based on the differences I saw with methylKit and DSS DML, I was interested in what overlaps existed between the different sets. In this Jupyter notebook, I used intersectBed to look at those overlaps.

Table 1. Number of DML common between contrasts and methods. 25 DML overlapped between all DSS lists.

DML List methylKit pH methylKit ploidy DSS pH DSS ploidy DSS ploidy:pH
methylKit pH 42 2 1 N/A N/A
methylKit ploidy 29 N/A 3 N/A
DSS pH 178 21 11
DSS ploidy 154 17
DSS ploidy:pH 53

I found it interesting that only 1 DML overlapped between the different pH lists, and there were 3 common between the ploidy lists. This is probably a testament to the different methods used to call differential methylation (Fisher’s exact test vs. Wald test).

DML overlaps with genomic features

For each DSS list, I used intersectBed to look at genomic locations of DML in this Jupyter notebook.

Table 2. DML distribution in genomic features for DSS lists

Genome Feature pH ploidy ploidy:pH
Total 178 154 53
Genes 123 145 48
Exon UTR 4 8 3
CDS 15 26 6
Introns 104 114 41
Upstream flanks 2 8 0
Downstream flanks 2 8 0
Intergenic regions 27 18 5
lncRNA 4 1 2
Transposable elements 86 66 14
Unique C/T SNPs 5 3 1

Even though the number of DML are different between the two methods, the genomic locations are basically the same: most DML are in introns, followed by transposable elements. One thing that is important to note is that the percent of DML that overlap with C/T SNPs is much less with DSS than methylKit (~2% vs. ~20%).

So, I guess I just have multiple DML lists now? I’ll probably move forward with each list separately for enrichment analysis, then make some sort of decision after that. I will also need to find a way to construct a randomization test with DSS! My guess is that I could get results from that randomization test sooner rather than later, since DSS was less computationally intensive than methylKit.

Going forward

  1. Perform enrichment
  2. Conduct randomization test with DSS
  3. Try EpiDiverse/snp for SNP extraction from WGBS data
  4. Run methylKit randomization test on mox
  5. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  6. Transfer scripts used to a nextflow workflow
  7. Update methods
  8. Update results
  9. Create figures

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/2RLvHlE

TWIP 17 – What Happens when Steven does not show up?

Hawaii Gigas Methylation Analysis Part 16

Reworking pH-DML

When writing up my methods, I realized that I incorrectly re-assigned treatments in methylKit when I was shuffling between ploidy and pH treatment assignments! Now that I was using min.per.group with unite, I couldn’t just re-assign treatments because the min.per.group would always be based off of the ploidy treatment information, not the pH treatment information. This could mean that I would call DML with less than eight samples per pH treatment! So, back to the code I go.

Returning to methylKit

I opened the R Studio server and my R script to recode the treatment re-assignment. I just needed to re-assign treatments the step before unite:

methylationInformationFilteredCov5T <- methylKit::reorganize(processedFilteredFilesCov5, sample.id = sampleMetadata$sampleID, treatment = sampleMetadata$pHTreatment) %>% methylKit::unite(., destrand = FALSE, min.per.group = 8L) #Re-assign treatment information to filtered and normalized 5x CpG loci, then unite 5x loci with coverage in at least 8/12 samples/treatment 

With this re-assignment, I had 5,086,421 CpGs with data in at least eight samples per pH treatment. I then used this new object to identify DML! After finishing up on mox, I used rsync to move my revised DML lists and data to gannet.

Unique C/T SNPs

The next thing I wanted to do was determine how many unique C/T SNPs were in my data in this Jupyter notebook. First, I used cat to combine all SNP lists:

#Combine C/T SNPs into one file !cat *CT-SNPs.tab >> all-CT-SNPs.tab 

Then, I used sort and uniq to pull out all the unique SNPs:

!sort all-CT-SNPs.tab \ | uniq \ > unique-CT-SNPs.tab 

When I looked at the output, I realized there were still duplicate C/T SNPs! The eighth column had information from the VCF file that was unique from each sample, so that was likely preventing me from pulling out the unique SNPs properly. Instead, I used awk to pull the first five columns out (chr, bp, ?, C, T), and used that as input into sort and uniq.

!awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5}' all-CT-SNPs.tab \ | sort \ | uniq \ > unique-CT-SNPs.tab 

That gave me 289,826 unique C/T SNPs in my data.

DML characterization

In this Jupyter notebook, I used my new pH-DML and unique C/T SNP lists to look at DML locations in the genome.

Table 1 Overlaps between DML and various genome features.

Genome Feature pH-DML ploidy-DML Common DML
Total 42 29 2
Genes 36 25 2
Exon UTR 5 0 0
CDS 7 7 1
Introns 24 18 1
Upstream flanks 0 0 0
Downstream flanks 4 1 0
Intergenic regions 2 3 0
lncRNA 3 0 0
Transposable elements 16 8 0
Unique C/T SNPs 6 5 0

Going forward

  1. Test-run DSS and ramwas
  2. Try EpiDiverse/snp for SNP extraction from WGBS data
  3. Run methylKit randomization test on mox
  4. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  5. Transfer scripts used to a nextflow workflow
  6. Update methods
  7. Update results
  8. Create figures

Please enable JavaScript to view the comments powered by Disqus.

from the responsible grad student https://ift.tt/3oEWjAG

TWIP 15 – Oh look! They are twins!