Yaamini’s Notebook: DML Analysis Part 19

Overlaps with methylKit gene background

The next big step in the C. virginica project is to conduct a gene enrichment analysis (if necessary; we’ll get to this later). I started this notebook with that intention. Before I could get into gene enrichment or description, I needed to characterize overlaps between the gene background used in methylKit and various genome feature files. I started this small analysis in December, but didn’t follow through because I was focusing on my MEPS resubmission.

The gene background from methylKit is formed using unite. I took this background and saved it as a BED file in this R Markdown file. I used intersectBed to find the overlaps between the gene background and two sets of genome features: mRNA coding regions and transposable elements. Finally, I calculated the overlap proportions. My next step is to use these overlap proportions, along with the overlap proportions for DML and DMR, to conduct a proportion test.
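As a minimal sketch of the overlap proportion calculation (not the actual code from the R Markdown file), the proportion is the number of background loci that overlap a feature divided by the total number of background loci. The function name and file names below are hypothetical, and the overlap file is assumed to come from intersectBed with -u, which reports each background locus at most once.

```shell
# Hypothetical helper: fraction of background loci that overlap a feature.
#   $1 = BED file of all background loci (one locus per line)
#   $2 = BED file of overlapping loci, e.g. produced with (file names assumed):
#        intersectBed -u -a background.bed -b mRNA.bed > background-mRNA.bed
overlap_proportion() {
  total=$(wc -l < "$1")
  overlaps=$(wc -l < "$2")
  # Print NA for an empty background instead of dividing by zero
  awk -v o="$overlaps" -v t="$total" \
    'BEGIN { if (t == 0) print "NA"; else printf "%.4f\n", o / t }'
}
```

With those assumed file names, the call would look like: overlap_proportion background.bed background-mRNA.bed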

Going forward

  1. Conduct a proportion test with gene background and DML/DMR overlaps with genome feature files
  2. See how min_cov, alignment stringency, or SNPs affect clustering
  3. Determine if a formal gene enrichment is necessary
  4. If necessary, select the most appropriate gene enrichment method
  5. Describe functions of most interesting genes with DML and DMR
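For the proportion test in step 1, one possible form is a pooled two-proportion z statistic comparing the DML (or DMR) overlap proportion with the background overlap proportion. This is only a sketch under that assumption, not necessarily the test that will be used: R's prop.test, for example, reports a chi-squared statistic with a continuity correction by default, so its output will differ. The function name and argument order are hypothetical, and the function assumes non-zero totals and a pooled proportion strictly between 0 and 1.

```shell
# Hypothetical helper: pooled two-proportion z statistic.
#   $1 = DML overlapping the feature,            $2 = total DML
#   $3 = background loci overlapping the feature, $4 = total background loci
two_prop_z() {
  awk -v x1="$1" -v n1="$2" -v x2="$3" -v n2="$4" 'BEGIN {
    p1 = x1 / n1; p2 = x2 / n2
    p = (x1 + x2) / (n1 + n2)                   # pooled proportion under the null
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error of p1 - p2
    printf "%.4f\n", (p1 - p2) / se
  }'
}
```

A negative value means the feature is less represented among DML than in the background; a positive value means it is more represented.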


from the responsible grad student http://bit.ly/2QoBIhd
via IFTTT

Sam’s Notebook: Gene Prediction – HiSeqX Metagenomics from Geoduck Water Using MetaGeneMark on Mox

After assembling the metagenomic data yesterday, I needed to predict some genes. I did this using MetaGeneMark (v.3.38) and ran it on Mox.

Input FastA(2.2GB):

SBATCH script (text):

  #!/bin/bash
  ## Job Name
  #SBATCH --job-name=busco
  ## Allocation Definition
  #SBATCH --account=srlab
  #SBATCH --partition=srlab
  ## Resources
  ## Nodes
  #SBATCH --nodes=1
  ## Walltime (days-hours:minutes:seconds format)
  #SBATCH --time=4-00:00:00
  ## Memory per node
  #SBATCH --mem=500G
  ##turn on e-mail notification
  #SBATCH --mail-type=ALL
  #SBATCH --mail-user=samwhite@uw.edu
  ## Specify the working directory for this job
  #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190103_metagenomics_geo_metagenemark

  # Load Python Mox module for Python module availability
  module load intel-python3_2017

  # Load Open MPI module for parallel, multi-node processing
  module load icc_19-ompi_3.1.2

  # SegFault fix?
  export THREADS_DAEMON_MODEL=1

  # Document programs in PATH (primarily for program version ID)
  date >> system_path.log
  echo "" >> system_path.log
  echo "System PATH for $SLURM_JOB_ID" >> system_path.log
  echo "" >> system_path.log
  printf "%0.s-" {1..10} >> system_path.log
  echo ${PATH} | tr : \\n >> system_path.log

  # Variables
  gmhmmp=/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/gmhmmp
  mgm_mod=/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/MetaGeneMark_v1.mod
  assembly_fasta=/gscratch/scrubbed/samwhite/outputs/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa
  nuc_out=20190103-mgm-nucleotides.fa
  gff_out=20190103-mgm.gff3
  prot_out=20190103-mgm-proteins.fa

  # Run MetaGeneMark
  ## Specifying the following:
  ### -a : output predicted proteins
  ### -A : write predicted proteins to designated file
  ### -d : output predicted nucleotides
  ### -D : write predicted nucleotides to designated file
  ### -f 3 : Output format in GFF3
  ### -m : Model file (supplied with software)
  ### -o : write GFF3 to designated file
  ${gmhmmp} \
  -a \
  -A ${prot_out} \
  -d \
  -D ${nuc_out} \
  -f 3 \
  -m ${mgm_mod} \
  ${assembly_fasta} \
  -o ${gff_out}

This will output predicted genes, both nucleotides and proteins, as FastA files, and a GFF3 file.
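Once the job finishes, a quick sanity check is to count the records in each FastA output. The sketch below assumes the output files use standard FastA headers (lines starting with ">"); the function name is hypothetical.

```shell
# Hypothetical helper: count records in a FastA file by counting header lines.
#   $1 = path to a FastA file
count_fasta_records() {
  grep -c '^>' "$1"
}
```

Using the output names set in the script, count_fasta_records 20190103-mgm-proteins.fa and count_fasta_records 20190103-mgm-nucleotides.fa would be expected to report the same number if every predicted gene was written to both files.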