SAFS Orientation

Continuing to Analyze Hematodinium BLAST results

Again, this all refers to my project to locally BLAST Hematodinium transcriptome data against the Swiss-Prot database.

Spent most of the morning at SAFS orientation! Chatted with the new grad students, learned some info about the program and schedule – typical orientation stuff.

Given the limited time, made a surprising amount of progress today! Hit a substantial roadblock late last night – couldn’t figure out a way to scrub the two files (queries and matches) of whitespace and properly run a comparison function. Realized that it’d be optimal for both the query file and match file to be the same filetype, so made new .txt files. That completely fixed the problems and let me use diff to compare the two!

Diff suppressed all lines that were identical in both files, but still left the output a bit messy. Cleaned up the remaining data and got our results – 3486 of our 6348 queries had no match!
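For anyone following along, here is a minimal sketch of that comparison step done with comm instead of diff (the filenames are placeholders for my actual query and match lists):

# comm requires sorted input
sort queries.txt > queries_sorted.txt
sort matches.txt > matches_sorted.txt

# comm prints three columns: lines only in file 1, lines only in file 2,
# and lines in both. -23 suppresses the last two columns, leaving only
# the queries with no match.
comm -23 queries_sorted.txt matches_sorted.txt > no_match_queries.txt

# Count the queries with no match
wc -l no_match_queries.txt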

Next task: rerun the transcriptome through DIAMOND BLASTx! Got started on it, but I'm having some problems getting it to run. Ah well, good project for tomorrow.


Assembly Assessment – BUSCO C.bairdi Genome v1.01 on Mox

After creating a subset of cbai_genome_v1.0 containing only contigs ≥1000bp yesterday (subset named cbai_genome_v1.01), I wanted to generate BUSCO scores for cbai_genome_v1.01. This is primarily just to keep info consistent on our Genomic Resources wiki, as I don't expect these scores to differ at all from the cbai_genome_v1.0 BUSCO scores.

Analysis was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_genome_v1.01_busco
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=3-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200924_cbai_genome_v1.01_busco

###################################################################################

# Script to generate BUSCO score(s) for genome(s).

# These variables need to be set by user

## Save working directory
wd=$(pwd)

# Genomes directory
genomes_dir=/gscratch/srlab/sam/data/C_bairdi/genomes

# Genomes array
genomes_array=(
"${genomes_dir}"/cbai_genome_v1.01.fasta \
)

## Input files and settings
busco_db=/gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9
augustus_species=fly
threads=28

# Programs associative array
declare -A programs_array
programs_array=(
[busco]="/gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py"
)

## Set program paths
augustus_bin=/gscratch/srlab/programs/Augustus-3.3.2/bin
augustus_orig_config_dir=/gscratch/srlab/programs/Augustus-3.3.2/config
augustus_scripts=/gscratch/srlab/programs/Augustus-3.3.2/scripts
blast_dir=/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin/
hmm_dir=/gscratch/srlab/programs/hmmer-3.2.1/src/

# Export Augustus variable
export PATH="${augustus_bin}:$PATH"
export PATH="${augustus_scripts}:$PATH"

## BUSCO configs
busco_config_default=/gscratch/srlab/programs/busco-v3/config/config.ini.default
busco_config_ini=${wd}/config.ini

# Export BUSCO config file location
export BUSCO_CONFIG_FILE="${busco_config_ini}"

# Copy BUSCO config file
cp ${busco_config_default} "${busco_config_ini}"

# Edit BUSCO config file
## Set paths to various programs
### The use of the % symbol sets the delimiter sed uses for arguments.
### Normally, the delimiter that most examples use is a slash "/".
### But, we need to expand the variables into a full path with slashes, which screws up sed.
### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^;cpu/ s/1/${threads}/" "${busco_config_ini}"
sed -i "/^tblastn_path/ s%tblastn_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^makeblastdb_path/ s%makeblastdb_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^augustus_path/ s%augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^etraining_path/ s%etraining_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^gff2gbSmallDNA_path/ s%gff2gbSmallDNA_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^new_species_path/ s%new_species_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^optimize_augustus_path/ s%optimize_augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^hmmsearch_path/ s%hmmsearch_path = /home/osboxes/BUSCOVM/hmmer/hmmer-3.1b2-linux-intel-ia32/binaries/%path = ${hmm_dir}%" "${busco_config_ini}"

###################################################################################

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2

# SegFault fix?
export THREADS_DAEMON_MODEL=1

for genome in "${!genomes_array[@]}"
do
  # Remove path from genome using parameter substitution
  genome_name="${genomes_array[$genome]##*/}"

  ## Augustus config directories
  augustus_dir=${wd}/${genome_name}_augustus
  augustus_config_dir=${augustus_dir}/config

  export AUGUSTUS_CONFIG_PATH="${augustus_config_dir}"

  # Make Augustus directory if it doesn't exist
  if [ ! -d "${augustus_dir}" ]; then
    mkdir --parents "${augustus_dir}"
  fi

  # Copy Augustus config directory
  cp --preserve -r ${augustus_orig_config_dir} "${augustus_dir}"

  # Run BUSCO/Augustus training
  ${programs_array[busco]} \
  --in ${genomes_array[$genome]} \
  --out ${genome_name} \
  --lineage_path ${busco_db} \
  --mode genome \
  --cpu ${threads} \
  --long \
  --species ${augustus_species} \
  --tarzip \
  --augustus_parameters='--progress=true'

  # Capture FastA checksums for verification
  echo ""
  echo "Generating checksum for ${genome_name}"
  md5sum "${genomes_array[$genome]}" > "${genome_name}".checksum.md5
  echo "Finished generating checksum for ${genome_name}"
  echo ""
done

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# Capture program options
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${program}: "
  echo ""
  ${programs_array[$program]} --help
  echo ""
  echo ""
  echo "
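Once the script is saved on Mox, it gets submitted to the scheduler with sbatch (the script filename below is a guess at what I'd have called it):

sbatch 20200924_cbai_genome_v1.01_busco.sh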

Getting A Grep On Things

Analyzing Hematodinium BLAST results

Again, this all refers to my project to locally BLAST Hematodinium transcriptome data against the Swiss-Prot database.

Spent the morning at the SAFS Coffee Hour! Met most of the new grad students, learned about all the best bagel places, and got some info on how the semester would go. Overall, a great morning.

After that, I started work on analyzing my BLAST results. Here are the goals:

– Compare no_max_hsps to max_hsps blasts
– Count lines of output files
– Determine which queries didn't produce alignments
– Re-run using DIAMOND BLASTx
– Compare results of DIAMOND BLASTx and NCBI BLASTx
  – Speed
  – Number of matches
  – SPID matches
  – Anything else

It was fairly straightforward to count the output file lines, as I could just use wc -l. However, determining which queries didn't produce alignments proved more difficult. After quite a bit of trial and error (and more error), I figured out how to use grep to select just the query name. All queries begin with "TRINITY_", so I used the following command:

grep -o 'TRINITY[^ ]*' filename > query_list

If I understand it correctly, [^ ] matches any single character that is not a space, and the * repeats that zero or more times – so the match starts at "TRINITY" and runs until the next space in the line.
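A quick illustration with a made-up BLAST output line (the sequence ID and match fields are hypothetical):

echo "TRINITY_DN1000_c0_g1_i1 sp|P12345|EXMP_HUMAN 75.0 1E-30" | grep -o 'TRINITY[^ ]*'
# prints: TRINITY_DN1000_c0_g1_i1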

So at this point, I have two files – a list of queries and a list of subjects. Now I just need to scrub them both of whitespace (apparently necessary – the results differ when I run grep with -w) and cross-reference the two to isolate differences! Much easier said than done, but feeling fairly good. However, I'm unlikely to make much progress tomorrow, as I have orientation for most of the day.
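For the record, here is the kind of scrub I have in mind (a sketch using the filenames from above; I haven't confirmed this is exactly the cleanup these files need):

# Trim trailing whitespace from every line and drop any blank lines
sed -e 's/[[:space:]]*$//' -e '/^$/d' query_list > query_list_clean
sed -e 's/[[:space:]]*$//' -e '/^$/d' subject_list > subject_list_clean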


Data Wrangling – Subsetting cbai_genome_v1.0 Assembly with faidx

Previously assembled cbai_genome_v1.0.fasta with our Nanopore Q7 reads on 20200917 and noticed numerous sequences well below the 500bp minimum length that the assembler (Flye) was supposed to enforce. I created an Issue on the Flye GitHub page to find out why. The developer responded, determined it was an issue with the assembly polisher, and said that sequences <500bp could be safely ignored.

So, I’ve decided to subset the cbai_genome_v1.0.fasta to exclude all sequences <1000bp, as that seems like a more reasonable minimum length for potential genes. I did not run this in a Jupyter Notebook, due to the brevity of the commands. Here are the commands, using faidx:

>1kbp subsetting

faidx --size-range 1000,1000000000 cbai_genome_v1.0.fasta > cbai_genome_v1.01.fasta 

Index new FastA

faidx cbai_genome_v1.01.fasta

Sorting the index by sequence length (the second column of the .fai file) confirms the shortest retained contigs are 1000bp:

samb@mephisto:~/data/C_bairdi/genomes$ sort -nk2,2 cbai_genome_v1.01.fasta.fai | head
contig_4272	1000	15642836	60	61
contig_4503	1000	16422183	60	61
contig_4429	1001	16145927	60	61
contig_1038	1002	230201	60	61
contig_1691	1005	1716551	60	61
contig_2992	1005	7322005	60	61
contig_3284	1006	9674445	60	61
contig_1810	1008	2050977	60	61
contig_408	1008	15069716	60	61
contig_1616	1009	1549839	60	61

Subsetting looks like it worked.

Looking at sequence counts in FastAs:

samb@mephisto:~/data/C_bairdi/genomes$ for file in *.fasta; do grep --with-filename -c ">" $file; done
cbai_genome_v1.01.fasta:2431
cbai_genome_v1.0.fasta:3294

Probing Jupyter

Final Results (I Think) For Hematodinium BLAST

Again, this all refers to my project to locally BLAST Hematodinium transcriptome data against the Swiss-Prot database.

I got some help in the morning lab meeting and learned how to get Jupyter Notebooks to accept Unix commands! Just gotta start each cell with ! or %%bash. Once I learned that, things got a lot simpler. I spent a while screwing things up in Jupyter (kept accidentally running cells by pressing Shift+Enter), but eventually got all the code for the Hematodinium BLAST written out. Then I ran the BLAST, linked my new GitHub repo, and boom – final results produced!
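For anyone else new to this, a minimal illustration (the commands and paths are placeholders, not my actual analysis code). To run a single shell command, prefix it with !:

!blastx -version

To run a whole cell as bash, start the cell with the %%bash cell magic (it has to be the very first line of the cell):

%%bash
cd ~/projects/hematodinium
wc -l *.tab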

As before, I ran two separate blasts – one with a value of 1 for max_hsps, the other with no specified max_hsps value.

Now that I know how to use it, wow, Jupyter is pretty fantastic. Love how easy it is to go back and edit the code! Very impressed, excited to learn more.


He-Mat and the Masters of the Database

Success!

For some context, this post is a continuation of my entries from the past few days – I've been working to take Hematodinium transcriptome data and locally BLAST it against the Swiss-Prot protein database.

Well, I figured out a way to bring my files from WSL to Windows and link em in my notebook! My initial results can be seen here

Important caveat: The results above were produced following the same outline as the tutorial. Matches were made using max_target_seqs (set to 1). However, while investigating how to use blastx, I found info indicating that max_target_seqs just grabs the first sequence that meets the e-value threshold (set to 1E-20 here). To me, that seemed sub-optimal, so I reran blastx with a line added setting max_hsps to 1. Those results are linked here.
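For the record, the flags on that second run looked something like this sketch (file names, database name, and thread count are placeholders rather than my exact command):

blastx \
-query hemat_transcriptome.fasta \
-db swissprot \
-out hemat_blastx_max_hsps1.tab \
-evalue 1E-20 \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt 6 \
-num_threads 4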

I’m posting this mid-day just to get the data in. I’ll be spending the rest of the afternoon writing a biography for the Roberts Lab website!


The FFAR geoduck seed was…

The FFAR geoduck seed was planted yesterday (8-20-2020) at Thorndyke Bay. Two plots of 200 tubes each, with 2 geoduck seed per tube, were planted for the Control and Experimental groups, respectively. The intertidal height of both plots is about -1.0' MLLW. Planting was done following low tide, with divers physically placing the two seed into each 4″ mesh tube. Tubes are spaced about 12″ apart. Matt, the seed looks fantastic – very healthy!

Data on the live weight of the remaining seed following the plant, for the Control (N=23) and Experimental (N=30) groups, are attached in an Excel spreadsheet. The mean individual live weight of the Control and Experimental seed was 68 mg and 76 mg, respectively.

https://d.pr/f/E25ZTt

CG/CV transcriptomics

Helping Colleen with a comparative transcriptomics project on OsHV-infected and uninfected Pacific and eastern oysters, to build the skills and pipelines I'll need for my own RNA-Seq analysis later.

Revised WGCNA with GO-MWU analysis

https://github.com/eimd-2019/project-EWD-transcriptomics/blob/master/analyses/WGCNA/WGCNA.md

Yaamini’s Notebook: MethCompare Part 24

Another (and hopefully final) iteration of union bedgraph analyses

Last night, Shelly pointed out that there were different versions of the union bedgraphs in this issue. Steven originally created the union bedgraphs, but in one of our meetings we realized that there were duplicate entries in each file. Shelly removed those duplicates and saved new versions of the files. I knew about these versions, but had only used the older version, with duplicates, for my analysis! Even though my bedtools code was written such that overlaps with genome features are only listed once, the presence of duplicates would affect the proportions of highly methylated, moderately methylated, and lowly methylated loci. I changed the file path I used to source the union bedgraphs in this Jupyter notebook and reran the script. Then, I took that output and ran it through this R Markdown script to generate my figures and count information. As expected, there were no drastic changes to the methylation status stacked barplots and no change at all to the genomic location barplots. I updated the figures in the manuscript and started working on the results. Now that we have the correct input and all the code is (hopefully) refined, perhaps I won't have to run this again in the future…
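For reference, the deduplication itself can be as simple as the sketch below (the filename is hypothetical, and Shelly's actual approach may have differed):

# Keep only the first occurrence of each full line, preserving row order
awk '!seen[$0]++' union.bedgraph > union_dedup.bedgraph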

Going forward

  1. Look into a program for mCpG identification

