Figuring Out WGCNA, Part 1

Recap

A while back, we decided it'd be best for me to analyze the individual libraries with WGCNA. Given the complexity of WGCNA, I decided to first follow the WGCNA tutorials, then attempt to follow along with Yaamini's WGCNA script using the data for a single crab over time, and then scale it up to blocks of crab. The plan is to use cbai_transcriptomev2.0 (AKA cbai_hemat_transcriptomev2.0), as we'll group both hematodinium and bairdi genes together based on expression. That way, we might see some interesting interactions between parasite and host! Well, I did…most of that! After finishing the tutorials, I tried to run WGCNA on a single crab. I got most of the way through, but hit a wall when trying to relate modules to external traits – since I was tracking a single crab, I had no replication, and therefore couldn't progress further. The good news is…

from Aidan F. Coyle https://ift.tt/3dXMCbL
via IFTTT

TWIP 10: The transition

WGBS Analysis Part 18

Testing base methylKit code

Alright, it’s time to run methylKit on mox! Steven and Sam suggested I run my methylKit code in this discussion, so I’ll finish testing my code, make some modifications, and create a script to run on mox.

Percent difference for DML

The first thing I needed to make a decision about was the percent difference I wanted to use as a threshold for DML identification. In this discussion, I pointed out that some papers use a 25% difference, while others (including my C. virginica paper) use a 50% difference. Initially, I set up my code to include 25%, 50%, 75%, and 99% differences in methylation between treatments. After loading in my saved .RData, I ran getMethylDiff using objects generated from calculateDiffMeth with no additional modifications and just the overdispersion correction.

Using a 25% or 50% methylation difference, I got > 1000 DML when not applying any covariate matrix or overdispersion correction! The number of DML identified was in the 100s when using a 75% or 99% difference. When I used the output from calculateDiffMeth with the overdispersion correction, I identified 2119 DML that were 25% different, 564 that were 50% different, and 80 that were 75% different. Clearly, making the model more complicated reduces the number of DML identified. I suspect that with output from both a covariate matrix and an overdispersion correction, a 25% or 50% methylation difference wouldn't get me more than ~500 DML. I set up the DML identification code to use both a 25% and 50% methylation difference so I can compare the output from mox.

Testing randomization code

The main parts of my R Markdown script I still needed to test were the randomization trial and the subsequent statistical testing. I started by running a modified version of the randomization trial that did not include any overdispersion or covariate information for calculateDiffMeth:

for (i in 1:1000) { #For each iteration
  pHRandTreat <- sample(sampleMetadata$pHTreatment,
                        size = length(sampleMetadata$pHTreatment),
                        replace = FALSE) #Randomize treatment information
  pHRandDML <- methylKit::reorganize(processedFiles,
                                     sample.ids = sampleMetadata$sampleID,
                                     treatment = pHRandTreat) %>%
    methylKit::filterByCoverage(., lo.count = 5, lo.perc = NULL,
                                hi.count = NULL, hi.perc = 99.9) %>%
    methylKit::normalizeCoverage(.) %>%
    methylKit::unite(., destrand = FALSE) %>%
    methylKit::calculateDiffMeth(.) %>%
    methylKit::getMethylDiff(., difference = 50, qvalue = 0.01) %>%
    nrow() -> pHRandTest50[i]
  ##Reorganize, filter, normalize, and unite samples. Calculate diffMeth and obtain DML that are 50% different with a q-value of 0.01. Save the number of DML identified
}

I thought the code would finish running in a few hours, but since it was taking a while and I didn’t get any error messages, I interrupted the code chunk and commented it out. I then set up two randomization tests: one for a 25% methylation difference, and another for a 50% methylation difference:

pHRandTest25 <- NULL #Create an empty vector to store randomization trial results for 25% difference
for (i in 1:1000) { #For each iteration
  pHRandTreat <- sample(sampleMetadata$pHTreatment,
                        size = length(sampleMetadata$pHTreatment),
                        replace = FALSE) #Randomize treatment information
  pHRandDML <- methylKit::reorganize(processedFiles,
                                     sample.ids = sampleMetadata$sampleID,
                                     treatment = pHRandTreat) %>%
    methylKit::filterByCoverage(., lo.count = 5, lo.perc = NULL,
                                hi.count = NULL, hi.perc = 99.9) %>%
    methylKit::normalizeCoverage(.) %>%
    methylKit::unite(., destrand = FALSE) %>%
    methylKit::calculateDiffMeth(., covariates = covariateMetadata,
                                 overdispersion = "MN", test = "Chisq") %>%
    methylKit::getMethylDiff(., difference = 25, qvalue = 0.01) %>%
    nrow() -> pHRandTest25[i]
  ##Reorganize, filter, normalize, and unite samples. Calculate diffMeth and obtain DML that are 25% different with a q-value of 0.01. Save the number of DML identified
}
pHRandTest50 <- NULL #Create an empty vector to store randomization trial results for 50% difference
for (i in 1:1000) { #For each iteration
  pHRandTreat <- sample(sampleMetadata$pHTreatment,
                        size = length(sampleMetadata$pHTreatment),
                        replace = FALSE) #Randomize treatment information
  pHRandDML <- methylKit::reorganize(processedFiles,
                                     sample.ids = sampleMetadata$sampleID,
                                     treatment = pHRandTreat) %>%
    methylKit::filterByCoverage(., lo.count = 5, lo.perc = NULL,
                                hi.count = NULL, hi.perc = 99.9) %>%
    methylKit::normalizeCoverage(.) %>%
    methylKit::unite(., destrand = FALSE) %>%
    methylKit::calculateDiffMeth(., covariates = covariateMetadata,
                                 overdispersion = "MN", test = "Chisq") %>%
    methylKit::getMethylDiff(., difference = 50, qvalue = 0.01) %>%
    nrow() -> pHRandTest50[i]
  ##Reorganize, filter, normalize, and unite samples. Calculate diffMeth and obtain DML that are 50% different with a q-value of 0.01. Save the number of DML identified
}

The output of the randomization test is a vector with the number of DML identified in each trial. I needed a way to determine if the number of DML I identified with getMethylDiff was significantly greater than what may be identified by chance. I dug deep into my stats brain (you know, that part of my brain that barely turns on) and decided to run a one-tailed t-test:

pHRandTestResults50 <- t.test(pHRandTest50, alternative = "less",
                              mu = nrow(diffMethStatsTreatment50)) #Conduct a one-sample t-test to determine the probability of identifying more DML by chance. mu is the number of DML identified with a 50% methylation difference
pHRandTestResults50$statistic #t
pHRandTestResults50$parameter #df
pHRandTestResults50$p.value #P-value
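A one-sample t-test assumes the null counts are roughly normal, which randomization-trial counts often aren't. An alternative (not what the script above uses) is the empirical p-value: the fraction of randomized trials that produced at least as many DML as the real comparison. A minimal shell sketch, assuming the trial counts were exported one per line (file name and numbers below are hypothetical stand-ins):

```shell
# Toy null distribution of per-trial DML counts; a real file could be made in R
# with write(pHRandTest50, "null-counts.txt", ncolumns = 1)
printf '%s\n' 3 0 7 1 2 > null-counts.txt   # stand-in for 1000 trial counts
obs=5                                       # stand-in for the observed DML count
# Empirical p-value with the standard +1 correction: (trials >= observed + 1) / (N + 1)
awk -v obs="$obs" '$1 >= obs {ge++} END {print (ge + 1) / (NR + 1)}' null-counts.txt
# -> 0.333333
```

The same quantity is a one-liner in R as well, and it needs no distributional assumptions about the null counts.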

I also wrote code to create a histogram with the random distribution of DML identified, and include a vertical line for the number of DML obtained with getMethylDiff:

jpeg(filename = "../output/06-methylKit/rand-test/Random-Distribution-diff50.jpeg",
     height = 1000, width = 1000) #Save file with designated name
hist(pHRandTest50, plot = TRUE, main = "",
     xlab = "Number of DML (50% difference)") #Plot a histogram with randomization trial results
dev.off()

Now that I’ve tested my base code, I will work on separating female and indeterminate samples and creating the actual mox script!

Going forward

  1. Separate female and indeterminate samples
  2. Write a mox script
  3. Run R script on mox
  4. Update methods
  5. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  6. Revise methylKit study design: separate tests for indeterminate and female oysters
  7. Write results
  8. Identify genomic location of DML
  9. Determine if RNA should be extracted
  10. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/3dRSwv3
via IFTTT

Using C. bairdi transcriptome v4.0

Recap

Last time, I described aligning all libraries to the Hematodinium-specific transcriptome hemat_transcriptomev1.6 and running them through my pipeline (kallisto -> DESeq2 -> GO-MWU). I made two comparisons – Ambient Day 2 vs. Elevated Day 2 and Elevated Day 0 vs. Elevated Day 2 (reminder: Day 0 samples were taken when all crabs were still at ambient temp). Since then, I repeated the same process, but with a C. bairdi-specific transcriptome, cbai_transcriptomev4.0! I made the same two comparisons as before. Scripts are noted with a prefix in the 40s. Prior to aligning libraries, I first built an index – that process is shown within the scripts. Scripts available as follows:

  • Creating BLASTx index for cbai_transcriptomev4.0
  • Running kallisto
  • Running DESeq2
  • Obtaining GO terms
  • Preparing one of the inputs for GO-MWU
  • Preparing the other input for GO-MWU
  • Running GO-MWU

Again, two comparisons were made – Elevated Day 0 vs. Elevated…

from Aidan F. Coyle https://ift.tt/3fVpg9m
via IFTTT

WGBS Analysis Part 17

Identifying SNPs with BS-Snper

Now that I know how to use BS-Snper, I’m going to identify merged and individual SNPs in the Manchester data! I set up this Jupyter notebook to start the process.

The first thing I needed to do was merge BAM files with samtools merge:

%%bash
samtools merge \
/Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/merged-sorted-deduplicated.bam \
/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-gigas-oa-meth/output/bismark-roslin/zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam

Once I had the merged BAM, I looked at the header with samtools view:

Screen Shot 2021-04-07 at 9 18 28 AM

I used default parameters, except for --mincov 5, with the merged BAM to identify SNP variants in my dataset. Again, I saved all of my output to gannet since it was large!

#Defaults: --minhetfreq 0.1 --minhomfreq 0.85 --minquali 15 --maxcover 1000 --minread2 2 --errorate 0.02 --mapvalue 20
#Saving output to gannet
!perl /Users/Shared/Apps/BS-Snper-master/BS-Snper.pl \
--fa /Volumes/web/spartina/project-oyster-oa/data/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa \
--input /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/merged-sorted-deduplicated.bam \
--output /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/SNP-candidates.txt \
--methcg /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/CpG-meth-info.tab \
--methchg /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/CHG-meth-info.tab \
--methchh /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/CHH-meth-info.tab \
--mincover 5 \
> /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/SNP-results.vcf \
2> /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/merged.ERR.log

I used a loop to run BS-Snper on individual samples, making sure to set any variables the loop needed within the same cell. Before running the loop, I moved to this directory where the BAM files were located. After the loop finished running, I moved all the files to this directory for BS-Snper output.

%%bash
FILES=$(ls *deduplicated.sorted.bam)
echo ${FILES}
for file in ${FILES}
do
  NAME=$(echo ${file} | awk -F "." '{print $1}')
  echo ${NAME}
  perl /Users/Shared/Apps/BS-Snper-master/BS-Snper.pl \
  --fa /Volumes/web/spartina/project-oyster-oa/data/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa \
  --input ${NAME}.deduplicated.sorted.bam \
  --output ${NAME}.SNP-candidates.txt \
  --methcg ${NAME}.CpG-meth-info.tab \
  --methchg ${NAME}.CHG-meth-info.tab \
  --methchh ${NAME}.CHH-meth-info.tab \
  --mincover 5 \
  > ${NAME}.SNP-results.vcf 2> ${NAME}.ERR.log
done
mv *SNP-candidates.txt *meth-info.tab *SNP-results.vcf *ERR.log ../BS-Snper/
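Before pointing a loop like this at real data, a dry run over hypothetical filenames is a cheap way to confirm the awk call parses sample names as intended (it keeps everything before the first "."):

```shell
# Dry run with made-up filenames: verify the name-parsing and the output names
# the loop would produce, without invoking BS-Snper at all
FILES="zr0000_1_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam zr0000_2_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam"
for file in ${FILES}
do
  NAME=$(echo ${file} | awk -F "." '{print $1}')   # strip .deduplicated.sorted.bam
  echo "would write ${NAME}.SNP-results.vcf and ${NAME}.ERR.log"
done
```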

Going forward

  1. Write methods
  2. Obtain relatedness matrix and SNPs with EpiDiverse/snp
  3. Revise methylKit study design: separate tests for indeterminate and female oysters
  4. Run R script on mox
  5. Write results
  6. Identify genomic location of DML
  7. Determine if RNA should be extracted
  8. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/2Q1mmF2
via IFTTT

Hawaii Gigas Methylation Analysis Part 12

Identifying SNPs

Now that samtools is installed and I merged BAM files, I’m ready to identify SNPs with BS-Snper! I’m going to identify SNPs from the merged BAM file and from individual BAM files. Steven suggested that once I have SNP vcf files, I could use bedtools intersect to figure out where the SNPs overlap with the CG track and DML I’ll identify later, so I won’t need to filter SNPs once I identify them.

Merged SNPs

First, I decided to identify SNPs from the merged BAM file I created with samtools merge. Based on Mac’s suggestion, I ran BS-Snper with the default parameters. However, I set --mincov 5 instead of using the default coverage of 10 to be consistent with bismark and methylKit parameters.

#Defaults: --minhetfreq 0.1 --minhomfreq 0.85 --minquali 15 --maxcover 1000 --minread2 2 --errorate 0.02 --mapvalue 20
#Saving output to gannet
!perl /Users/Shared/Apps/BS-Snper-master/BS-Snper.pl \
--fa /Volumes/web/spartina/project-oyster-oa/data/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa \
--input /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/merged-sorted-deduplicated.bam \
--output /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/SNP-candidates.txt \
--methcg /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/CpG-meth-info.tab \
--methchg /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/CHG-meth-info.tab \
--methchh /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/CHH-meth-info.tab \
--mincover 5 \
> /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/SNP-results.vcf 2> merged.ERR.log

I wrote the output to this gannet folder instead of trying to deal with large files in my Github repository.
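A quick sanity check on a VCF like this is counting the variant records, which are the non-"#" lines, and the C-to-T calls — the ones that can mimic bisulfite conversion on the forward strand — which sit in the REF/ALT columns (fields 4 and 5 of the VCF spec). A toy stand-in (records below are hypothetical, not real BS-Snper output):

```shell
# Toy VCF standing in for SNP-results.vcf; real BS-Snper output uses the same
# "#"-prefixed header convention and tab-separated columns
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'NC_1\t5\t.\tC\tT\t30\tPASS\t.\n'
  printf 'NC_1\t9\t.\tG\tA\t12\tPASS\t.\n'
} > toy.vcf
grep -vc '^#' toy.vcf                                  # total variant records -> 2
awk -F '\t' '$4 == "C" && $5 == "T"' toy.vcf | wc -l   # forward-strand C-to-T records
```

On real output the complementary G-to-A calls matter too, since they reflect C/T SNPs on the opposite strand.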

Individual SNPs

Since I ran the script successfully, I moved on to SNP identification from individual files! I modified a loop from Mac to identify SNPs in my files. First, I changed my working directory to the one with the BAM files. Within the same code chunk, I needed to create variables for the file names, then use the modified file names within the BS-Snper loop:

%%bash
FILES=$(ls *deduplicated.sorted.bam)
echo ${FILES}
for file in ${FILES}
do
  NAME=$(echo ${file} | awk -F "." '{print $1}')
  echo ${NAME}
  perl /Users/Shared/Apps/BS-Snper-master/BS-Snper.pl \
  --fa /Volumes/web/spartina/project-oyster-oa/data/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa \
  --input ${NAME}.deduplicated.sorted.bam \
  --output ${NAME}.SNP-candidates.txt \
  --methcg ${NAME}.CpG-meth-info.tab \
  --methchg ${NAME}.CHG-meth-info.tab \
  --methchh ${NAME}.CHH-meth-info.tab \
  --mincover 5 \
  > ${NAME}.SNP-results.vcf 2> ${NAME}.ERR.log
done

Initially I specified --fa with a FASTA variable, but I kept running into an error! I then remembered that within Jupyter notebooks I had issues specifying variables within loops, so I used the absolute file path instead. Once I had all the files, I moved them to the BS-Snper directory on gannet.

# Move files to BS-Snper directory
!mv *SNP-candidates.txt *meth-info.tab *SNP-results.vcf *ERR.log ../BS-Snper/

Katherine suggested I merge the individual files with vcf-merge, but I don't think that's necessary if I can loop bedtools intersect over the individual SNP files to determine whether any individual SNPs overlap with DML. Now that I've identified SNPs for this dataset, I can do something similar with the Manchester data! I'm also going to try EpiDiverse/snp since it would give me a relatedness matrix in addition to SNP variants.
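bedtools intersect is the right tool at scale, but the overlap test it performs is simple enough to sketch. A toy version in awk (hypothetical coordinates; note that VCF POS is 1-based while BED intervals are 0-based and half-open, a conversion bedtools handles when given proper files):

```shell
# Toy inputs; real inputs would be per-sample SNP positions and a DML BED file
printf 'NC_001.1\t150\nNC_001.1\t900\n' > snps.txt   # chrom, position
printf 'NC_001.1\t100\t200\n'           > dml.bed    # chrom, start, end
# Load the DML intervals from the first file, then print each SNP that falls
# inside any interval (half-open: start <= pos < end)
awk 'NR == FNR {chr[NR] = $1; s[NR] = $2; e[NR] = $3; n = NR; next}
     {for (i = 1; i <= n; i++)
        if ($1 == chr[i] && $2 >= s[i] && $2 < e[i]) print $0}' dml.bed snps.txt
```

Here only the SNP at position 150 is reported, since 900 falls outside the single DML interval.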

Going forward

  1. Try EpiDiverse/snp for SNP extraction from WGBS data
  2. Run methylKit script on mox
  3. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  4. Test-run DSS and ramwas
  5. Transfer scripts used to a nextflow workflow
  6. Update methods
  7. Update results


from the responsible grad student https://ift.tt/2PCHWjo
via IFTTT

March 2021 Goals

sorry-cake

I had such grand plans for Pubathon 2021 last month that obviously did not work out the way I thought. I made some good progress on the Hawaii and Manchester manuscripts! After talking to Steven, we also decided that those manuscripts will be the last components of my formal dissertation, so any additional labwork or projects are now firmly on the backburner until June. Hopefully this makes my plans for April more manageable (but knowing that I haven't set reasonable goals since September 2016, the jury's still out on how this month will go).

March Goals Recap

I had some real LOFTY goals for last month.

Hawaii Gigas Methylation:

Gigas Gonad Methylation:

Other:

  • Reformatted and submitted the coral manuscript to MER and bioRxiv!
  • Continued working on the ocean acidification and reproduction review, but as a last priority
  • Submitted a peer review for Molecular Ecology
  • Had a great vacation 🙂

April Goals

AKA a bunch of stuff rolled over from March.

Hawaii Gigas Methylation:

  • Extract SNPs from bisulfite sequencing data with BS-SNPer and EpiDiverse
  • Compare BS-SNPer and EpiDiverse output and determine which approach is suitable
  • Complete preliminary assessment of DML with methylKit
  • Try identifying DML with DSS and ramwas
  • Compare ramwas and DSS DML output and determine which approach is suitable
  • Determine genomic location of DML
  • Identify significantly enriched GOterms associated with DML
  • Identify methylation islands and non-methylated regions
  • Have a complete initial manuscript!
  • Obtain WGS resequencing quote from Genewiz

Gigas Gonad Methylation:

  • Extract SNPs from bisulfite sequencing data with the method determined from Hawaii samples
  • Identify DML using methylKit
  • Determine genomic location of DML
  • Identify significantly enriched GOterms associated with DML
  • Identify methylation islands and non-methylated regions
  • Have a complete initial manuscript!
  • Decide if it’s worth extracting gonad RNA for integrated RNA-Seq and methylation analyses

Other:

  • Finish all work on the ocean acidification and reproduction review


from the responsible grad student https://ift.tt/31KQW8u
via IFTTT

Hawaii Gigas Methylation Analysis Part 10

Merging BAM files

The plan

Based on Pubathon conversations, my plan is to call SNPs two different ways. First, I’ll call SNPs for each individual sample. If any sample has a SNP, that will certainly affect DML calls. Katherine also suggested I merge samples and call SNPs. Loci that are SNPs across samples but not present in an individual sample can still change the meaning of a DML (that DML could be more associated with genetic differences than treatment). Both Mac and Katherine implied that default parameters would be more than enough to identify C/T SNPs for my goals, so I won’t mess around with them.

Headaches with samtools installation

Which returns me to the original problem I had trying to merge BAM files: library versions. Sam suggested I update the htslib library based on these instructions, and I was able to do that successfully! But after updating this library, I still had the same samtools installation error!

config.status: creating config.mk
dyld: Library not loaded: /usr/local/opt/mpfr/lib/libmpfr.4.dylib
  Referenced from: /usr/local/bin/gawk
  Reason: image not found
./config.status: line 879: 83237 Done(141)     eval sed \"\$ac_sed_extra\" "$ac_file_inputs"
               83238 Abort trap: 6        | $AWK -f "$ac_tmp/subs.awk" > $ac_tmp/out
config.status: error: could not create config.mk

Since the installation was referencing a library from gawk, Sam suggested I update gawk. I ran brew upgrade gawk and got an error suggesting that gawk wasn't installed on this machine:

Screen Shot 2021-03-30 at 11 47 14 AM

I decided to reinstall Homebrew just to be safe, and then ran brew install gawk. Once I had everything installed, I tried once more with my samtools installation and STILL got the same error!

Screen Shot 2021-03-30 at 1 32 40 PM

At this point, I started digging around for solutions specific to Homebrew interfering with samtools installations. I found this issue that suggested running brew install xz prior to configuration. I still encountered the same error! After reading this Stack Overflow question linked in the issue, I realized that installing xz was specific to the configuration error that user saw. Based on one of the suggestions, I ran brew search libmpfr (the library that wasn't loaded in my error above): Error: No formulae or casks found for "libmpfr".

Then, I ran brew search mpfr:

Warning: mpfr 4.1.0 is already installed and up-to-date.
To reinstall 4.1.0, run:
  brew reinstall mpfr

I reinstalled mpfr with brew reinstall mpfr to see if that would resolve the issue (it obviously did not). Since mpfr is part of Homebrew, I realized I should just try installing samtools with brew install samtools! AND THAT FINALLY WORKED!

Screen Shot 2021-03-30 at 2 00 09 PM

If anything, graduate school has taught me how to conduct very specific Google searches until I find someone else who has already solved my problem. Now that I had samtools installed, I could get to work.

Merging BAM files

Before I could identify SNPs across all samples, I needed to merge my BAM files! Based on the samtools merge documentation, I set up the following code in my Jupyter notebook:

%%bash
samtools merge \
/Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/merged-sorted-deduplicated.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_1_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_2_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_3_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_4_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_5_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_6_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_7_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_8_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_9_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_10_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_11_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_12_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_13_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_14_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_15_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_16_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_17_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_18_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_19_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_20_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_21_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_22_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_23_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
/Volumes/web/spartina/project-oyster-oa/Haws/bismark-2/zr3644_24_R1_val_1_val_1_val_1_bismark_bt2_pe.deduplicated.sorted.bam

I decided to save the output file (merged-sorted-deduplicated.bam) to gannet because there was no way that file would be small enough for Github. I contemplated adding the -r flag to attach an @RG tag to each alignment, but I thought I'd run the code with no options first to see what happens.

After a few hours (that’s what I get for not including a -threads specification), I had a merged file (found here). I looked at the output and saw there was no @RG tag (which makes sense because I didn’t specify where it should pull that information from). My next step is to run that file through BS-Snper to see if 1) it works and 2) the output is usable.

Going forward

  1. Try BS-Snper and EpiDiverse/snp for SNP extraction from WGBS data
  2. Run methylKit script on mox
  3. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  4. Test-run DSS and ramwas
  5. Transfer scripts used to a nextflow workflow
  6. Update methods
  7. Update results


from the responsible grad student https://ift.tt/2R7SHdY
via IFTTT

WGBS Analysis Part 15

Reviewing revised alignments

I got my Roslin genome-aligned data back! Last week the paper for this genome was published as well, so it'll be nice to have that resource while I work with this data. Steven noticed that they have their own version of the mitochondrial genome, which is different from the sequence I used. This shouldn't be a problem since mitochondrial DNA isn't methylated.

multiqc

I transferred all mox bismark output to this gannet folder. I didn’t run multiqc on mox, so I did that in the same folder. I then copied the HTML reports and multiqc output to this repo folder! Once I was done moving files around, I looked at the multiqc report.

Similar to what I saw with the Hawaii samples, alignment to the Roslin genome was slightly lower, ranging between 61.5% to 62.0%. Percent duplicate reads was also lower when aligning to the Roslin genome! Percent duplication is much higher for this dataset than the Hawaii one (22.6%-26.4%), but that’s probably because the histology DNA wasn’t high quality. I noticed that the samples with more M C’s/M Aligned had higher percent duplication, but overall percent aligned wasn’t higher for these samples.

Screen Shot 2021-03-30 at 11 02 30 AM

Figure 1. General alignment statistics for oyster_v9 genome (left) and Roslin genome (right)

IGV

Since my alignment statistics generally looked good, I moved onto a more important step: checking the alignments in IGV to make sure there isn't any methylation in the mitochondrial sequence! I opened a new IGV session and entered the genome URL, but it wouldn't load since I didn't have an index file. I posted this discussion to determine how to create a .fai file. After posting the discussion, I did a quick search, found samtools faidx, and decided to try samtools to create the index file. I returned to this discussion post where I had some issues running samtools, and successfully installed the latest version of htslib using these instructions. However, I ran into an error installing the latest samtools version! I downloaded the latest version on genefish, ran ./configure --prefix=/Users/Shared/Apps/samtools-1.12, and got the following error:

config.status: creating config.mk
dyld: Library not loaded: /usr/local/opt/mpfr/lib/libmpfr.4.dylib
  Referenced from: /usr/local/bin/gawk
  Reason: image not found
./config.status: line 879: 83237 Done(141)     eval sed \"\$ac_sed_extra\" "$ac_file_inputs"
               83238 Abort trap: 6        | $AWK -f "$ac_tmp/subs.awk" > $ac_tmp/out
config.status: error: could not create config.mk

Sam suggested I upgrade gawk using brew upgrade gawk. When I tried that, I got another error:

Screen Shot 2021-03-30 at 11 47 14 AM

Sam then suggested I install gawk before running samtools. While I was going down that rabbit hole, Steven commented on my original discussion post and said I should make an index file with IGV. IGV didn’t give me an option to create an index file when I added the genome URL, but I found this page in their documentation that has instructions to create a genome. Under Genomes > Create .genome File, I added in the FASTA URL and named the genome “Roslin-gigas-mito.genome.” It worked! I saved the genome and index file in this directory with my combined FASTA sequence. At last, I was able to load in sample bedgraphs and look at the mitochondrial sequence.
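For reference, the .fai index that samtools faidx would have produced is just a five-column table: sequence name, length, byte offset of the first base, bases per line, and bytes per line (newline included). It's simple enough to build by hand for a toy FASTA — an illustration only, not a replacement for samtools:

```shell
# Toy two-record FASTA (line lengths within a record must match except the last)
printf '>chr1\nACGT\nACG\n>chr2\nAA\n' > toy.fa
# Walk the file once, tracking cumulative byte position to compute each offset
awk 'BEGIN {OFS = "\t"}
  /^>/ {if (name != "") print name, len, off, lb, lw
        name = substr($1, 2); len = 0; lb = 0; lw = 0
        off = pos + length($0) + 1}
  !/^>/ {len += length($0)
         if (lb == 0) {lb = length($0); lw = length($0) + 1}}
  {pos += length($0) + 1}
  END {if (name != "") print name, len, off, lb, lw}' toy.fa > toy.fa.fai
cat toy.fa.fai
# -> chr1  7  6   4  5
#    chr2  2  21  2  3
```

That offset column is what lets IGV (or samtools) jump straight to any region without reading the whole FASTA.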

Screen Shot 2021-03-30 at 1 20 52 PM

Figure 2. Mitochondrial sequence in IGV

The mitochondrial sequence was predominantly unmethylated, but there were some methylated CpGs in individual samples. I posted this discussion to determine what happened. Steven suggested I look only at 5x coverage. When he mentioned this, I realized I already created 5x bedgraphs, and should have been using those from the beginning! I loaded the 5x bedgraphs into IGV and looked at the mitochondrial genome.

Screen Shot 2021-04-04 at 1 14 16 PM

Screen Shot 2021-04-04 at 1 14 32 PM

Screen Shot 2021-04-04 at 1 14 51 PM

Figures 3-5. Mitochondrial sequence in IGV using 5x bedgraphs

Looking at the full sequence with 5x bedgraphs is definitely an improvement, but there are two areas with very slight methylation. I don't see a lot of consistency between samples either, so I don't think it's likely any of these areas will be called as DML. In any case, these methylation traces may be a sign of something else going on. I could remove these locations from my dataset, bump up coverage to 10x to see if that reduces methylation, or try mapping to the Roslin mitochondrial sequence. I saved my IGV session and commented on the discussion to get input on how to proceed.
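If I go the 10x route, one option would be filtering the Bismark coverage files before regenerating bedgraphs. A sketch assuming the standard six-column .cov format (toy data below; column 5 plus column 6 is the total read count at a position):

```shell
# Toy Bismark coverage lines: chrom, start, end, %meth, count_meth, count_unmeth
printf 'chrMT\t10\t10\t50.0\t3\t3\nchrMT\t20\t20\t100.0\t12\t0\n' > toy.cov
# Keep only positions covered by at least 10 reads total (a 10x cutoff)
awk -F '\t' '($5 + $6) >= 10' toy.cov > toy10x.cov
cat toy10x.cov
```

Here only the second position survives, since 3 + 3 reads falls below the 10x threshold.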

Going forward

  1. Write methods
  2. Identify SNPs in WGBS data mapped to Roslin genome
  3. Revise methylKit study design: separate tests for indeterminate and female oysters
  4. Run R script on mox
  5. Write results
  6. Identify genomic location of DML
  7. Determine if RNA should be extracted
  8. Determine if larval DNA/RNA should be extracted


from the responsible grad student https://ift.tt/3cPiPCB
via IFTTT