Yaamini’s Notebook: MethCompare Part 2

More genome feature tracks

After I looked over code and added analyses, approaches, and figures we could explore in the paper, I set out to refine genome feature tracks for M. capitata and P. acuta. Once we confirm we trust our analysis files, I can use these feature tracks to understand where methylation occurs in these species.

M. capitata CDS track

I started by examining the M. capitata CDS track further. I went back to this Jupyter notebook where I generated all the genome feature tracks, and quickly realized that I didn’t make the gene track correctly! When I used grep "gene", it matched the string anywhere in a line, so it also pulled lines with CDS and intron information and saved them to the gene track. I used grep "AUGUSTUS gene" instead, and used similar code for the rest of the M. capitata tracks.
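To illustrate the over-matching, here is a minimal sketch that compares the substring match against a filter on the feature-type column (column 3 of a GFF). The three records are made up for illustration and are not from the M. capitata annotation; the awk filter is an alternative I am assuming would work here, not the code used in the notebook.

```shell
#!/bin/bash
# Hypothetical three-line GFF; real records come from the genome annotation.
gff=$(mktemp)
printf 'scaf1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1\n' > "$gff"
printf 'scaf1\tAUGUSTUS\tCDS\t100\t300\t.\t+\t0\tParent=gene1.t1\n' >> "$gff"
printf 'scaf1\tAUGUSTUS\tintron\t301\t599\t.\t+\t.\tParent=gene1.t1\n' >> "$gff"

# Substring match: counts every line that mentions "gene" anywhere.
loose_count=$(grep -c "gene" "$gff")

# Column match: keeps only records whose third column is exactly "gene".
strict_count=$(awk -F'\t' '$3 == "gene"' "$gff" | wc -l | tr -d ' ')

echo "grep: ${loose_count} lines; awk: ${strict_count} lines"
rm -f "$gff"
```

In this toy file the CDS and intron records mention "gene" in their attribute column, which is enough for grep to keep them.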

Now to the task at hand: understanding the CDS track. Looking at the gene track using head, I could see that CDS were split up by introns. In that way, the CDS track is similar to an exon track.

Screen Shot 2020-04-07 at 10 44 36 AM

But looking at all the tracks in this IGV session, I don’t think the CDS track includes UTR.

Screen Shot 2020-04-06 at 2 48 24 PM

I posted this issue to get more clarity about the CDS track and see if we can derive UTR or exon information from the gene and CDS tracks. If not, then I don’t think I’ll be able to do comparisons between M. capitata and P. acuta.
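One possible way to probe for UTR-like sequence is to look for parts of each gene that fall outside its CDS span. This is only a sketch under assumptions: the BED-style inputs (chromosome, start, end, gene ID) and their contents are hypothetical, and it prints nothing if the gene coordinates simply match the CDS span.

```shell
#!/bin/bash
# Hypothetical BED-style inputs (chrom, start, end, gene ID); not real annotation data.
genes=$(mktemp)
cds=$(mktemp)
printf 'scaf1\t100\t900\tg1\n' > "$genes"
printf 'scaf1\t150\t300\tg1\n' > "$cds"
printf 'scaf1\t600\t850\tg1\n' >> "$cds"

# First pass (CDS file): record the lowest start and highest end per gene ID.
# Second pass (gene file): print any gene sequence lying outside that CDS span.
flanks=$(awk -F'\t' -v OFS='\t' '
  NR == FNR {
    if (!($4 in lo) || $2 + 0 < lo[$4]) lo[$4] = $2 + 0
    if (!($4 in hi) || $3 + 0 > hi[$4]) hi[$4] = $3 + 0
    next
  }
  ($4 in lo) {
    if ($2 + 0 < lo[$4]) print $1, $2, lo[$4], $4 "_left_flank"
    if ($3 + 0 > hi[$4]) print $1, hi[$4], $3, $4 "_right_flank"
  }
' "$cds" "$genes")

echo "$flanks"
flank_count=$(printf '%s\n' "$flanks" | wc -l | tr -d ' ')
rm -f "$genes" "$cds"
```

Strand is not tracked in this sketch, so which flank would be 5' or 3' is left open.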

Visualizing P. acuta tracks

The P. acuta tracks were in better shape: I have gene, transcript, exon, intron, and CDS information. Within exons, I have initial, internal, and terminal exon tracks that can help us answer a lot of questions about exon-specific methylation. I wanted to create an IGV session for all the P. acuta tracks. I downloaded the genome from the Google Drive and tried visualizing the intron track I generated in this notebook. In IGV, this track looks blank (even though I know from my files that there are introns on this scaffold):

Screen Shot 2020-04-06 at 9 58 09 PM

Screen Shot 2020-04-06 at 9 58 27 PM

I posted this issue to get some help. I thought maybe the file wasn’t sorted correctly, but since I pulled it directly from the genome I don’t think that would be the issue. We’ll see.
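One quick way to rule sort order in or out is to ask sort itself. A minimal sketch, assuming a tab-delimited BED-style track (chromosome, start, end); the file and its contents are hypothetical.

```shell
#!/bin/bash
# Hypothetical BED-style intron track, deliberately out of order.
track=$(mktemp)
printf 'scaffold2\t500\t700\n' > "$track"
printf 'scaffold1\t900\t950\n' >> "$track"
printf 'scaffold1\t100\t200\n' >> "$track"
tab=$(printf '\t')

# sort -c exits non-zero if the file is not already in the requested order:
# scaffold name first, then numeric start position.
if LC_ALL=C sort -c -t "$tab" -k1,1 -k2,2n "$track" 2>/dev/null; then
  status="sorted"
else
  status="unsorted"
fi

# Write a sorted copy either way so it can be loaded for comparison.
LC_ALL=C sort -t "$tab" -k1,1 -k2,2n "$track" > "${track}.sorted"
first_start=$(head -n 1 "${track}.sorted" | cut -f2)

echo "input was ${status}; first start after sorting: ${first_start}"
rm -f "$track" "${track}.sorted"
```

If the check reports "sorted", sort order can be crossed off the list of suspects.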

Going forward

  1. Create promoter, UTR, and intergenic region tracks for each species, depending on what information is available and what is possible
  2. Intersect all genome feature tracks with CG motif information
  3. Rerun the CpG characterization pipeline with full samples and incorporate new genome features
  4. Create concatenation files and figure out methylation island analysis


from the responsible grad student https://ift.tt/2XitXk0

Sam’s Notebook: Transcriptome Assessment – BUSCO Metazoa on C.bairdi MEGAN Transcriptome

I previously created a C.bairdi de novo transcriptome assembly with Trinity from the MEGAN6 taxonomic-specific reads for Arthropoda on 20200330 and decided to assess its “completeness” using BUSCO and the metazoa_odb9 database.

BUSCO was run with the --mode transcriptome option on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_busco_megan_transcriptome
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200407_cbai_busco_megan

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Document programs in PATH (primarily for program version ID)
{
  date
  echo ""
  echo "System PATH for $SLURM_JOB_ID"
  echo ""
  printf "%0.s-" {1..10}
  echo "${PATH}" | tr : \\n
} >> system_path.log

# Establish variables for more readable code
timestamp=$(date +%Y%m%d)
species="cbai"
prefix="${timestamp}.${species}"

## Input files and settings
base_name="${prefix}.megan"
busco_db=/gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9
transcriptome_fasta=/gscratch/srlab/sam/data/C_bairdi/transcriptomes/20200406.C_bairdi.megan.Trinity.fasta
augustus_species=fly
threads=28

## Save working directory
wd=$(pwd)

## Set program paths
augustus_bin=/gscratch/srlab/programs/Augustus-3.3.2/bin
augustus_scripts=/gscratch/srlab/programs/Augustus-3.3.2/scripts
blast_dir=/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/
busco=/gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py
hmm_dir=/gscratch/srlab/programs/hmmer-3.2.1/src/

## Augustus configs
augustus_dir=${wd}/augustus
augustus_config_dir=${augustus_dir}/config
augustus_orig_config_dir=/gscratch/srlab/programs/Augustus-3.3.2/config

## BUSCO configs
busco_config_default=/gscratch/srlab/programs/busco-v3/config/config.ini.default
busco_config_ini=${wd}/config.ini

# Export BUSCO config file location
export BUSCO_CONFIG_FILE="${busco_config_ini}"

# Export Augustus variable
export PATH="${augustus_bin}:$PATH"
export PATH="${augustus_scripts}:$PATH"
export AUGUSTUS_CONFIG_PATH="${augustus_config_dir}"

# Copy BUSCO config file
cp ${busco_config_default} "${busco_config_ini}"

# Make Augustus directory if it doesn't exist
if [ ! -d "${augustus_dir}" ]; then
  mkdir --parents "${augustus_dir}"
fi

# Copy Augustus config directory
cp --preserve -r ${augustus_orig_config_dir} "${augustus_dir}"

# Edit BUSCO config file
## Set paths to various programs
### The use of the % symbol sets the delimiter sed uses for arguments.
### Normally, the delimiter that most examples use is a slash "/".
### But, we need to expand the variables into a full path with slashes, which screws up sed.
### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^;cpu/ s/1/${threads}/" "${busco_config_ini}"
sed -i "/^tblastn_path/ s%tblastn_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^makeblastdb_path/ s%makeblastdb_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^augustus_path/ s%augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^etraining_path/ s%etraining_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^gff2gbSmallDNA_path/ s%gff2gbSmallDNA_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^new_species_path/ s%new_species_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^optimize_augustus_path/ s%optimize_augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^hmmsearch_path/ s%hmmsearch_path = /home/osboxes/BUSCOVM/hmmer/hmmer-3.1b2-linux-intel-ia32/binaries/%path = ${hmm_dir}%" "${busco_config_ini}"

# Run BUSCO/Augustus training
${busco} \
--in ${transcriptome_fasta} \
--out ${base_name} \
--lineage_path ${busco_db} \
--mode transcriptome \
--cpu ${threads} \
--long \
--species ${augustus_species} \
--tarzip \
--augustus_parameters='--progress=true'
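
The delimiter note in the script's comments can be seen in isolation with a toy file. This is a minimal sketch: the config key tool_path and both directories are hypothetical and are not taken from the BUSCO config.

```shell
#!/bin/bash
# Hypothetical one-line config; the key and paths are for illustration only.
cfg=$(mktemp)
echo "tool_path = /usr/bin/" > "$cfg"
new_dir=/opt/tools/bin/

# With "/" as the s-command delimiter, the slashes inside the expanded path
# would end the pattern early and sed would reject the expression.
# Using "%" as the delimiter avoids that.
# (Writing to a new file instead of using -i keeps this portable across sed versions.)
sed "/^tool_path/ s%tool_path = /usr/bin/%tool_path = ${new_dir}%" "$cfg" > "${cfg}.new"

result=$(cat "${cfg}.new")
echo "$result"
rm -f "$cfg" "${cfg}.new"
```

Any character that does not appear in the expanded variable would work in place of %, as the script's comment says.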