Kaitlyn’s Notebook: Uniprot Annotations and SQL Join

After BLASTing the oyster data against a Uniprot database, I joined the results with the Uniprot annotations from sr320 on SQL Share. The join was a left join on the protein ID:
sr.column1 = kr."Protein ID"
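To illustrate what a left join on the protein ID does, here is a minimal sketch using the coreutils join command instead of SQL. The file names, IDs, and values are hypothetical stand-ins, not the real tables, and it assumes the BLAST results are the left-hand table.

```shell
# Hypothetical stand-ins for the two tables (tab-separated, sorted by protein ID).
printf 'P001\t250\nP002\t75\nP003\t10\n' > blast_hits.tsv             # protein ID, expression value
printf 'P001\tHeat shock protein\nP003\tActin\n' > annotations.tsv   # protein ID, annotation

TAB="$(printf '\t')"

# -a 1 keeps rows from the first (left) file even when they have no match;
# -e NA together with -o fills in the missing annotation field.
join -t "$TAB" -a 1 -e NA -o 0,1.2,2.2 blast_hits.tsv annotations.tsv > joined.tsv

cat joined.tsv
```

Every row from the left-hand file survives the join; P002 has no annotation in this toy data, so its annotation column is filled with NA. In SQL the same row would come back with a NULL in that column.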


I’ve now condensed some of the information into a table that is easier to read quickly.


There are a lot of possible directions to go from here. The goal is to identify proteins that are highly expressed or that vary between treatments (23°C vs 29°C). Proteins that are highly expressed and/or do not vary between treatments could indicate functions essential to Pacific oysters, which I believe are not well understood. Right now I am just trying to form a more specific question to investigate using the data I have.

BWA Aligner on Hyak w/ Cod data.

Got a new test for Hyak: running BWA on some cod data. I’m doing this all in a terminal window, so I’ll copy the output here for posterity, as well as saving it to a text file.

First, I copied a bunch of files over with wget. These are fairly large files.

After I moved everything, I unpacked the reference genome and built the index from the supplied .fa file with bwa index. I did this on an interactive execute node, because it wouldn’t be very time-consuming.

[seanb80@n2049 cod]$ bwa index -p atl_cod_4_2017 -a bwtsw Gadus_morhua.gadMor1.dna.toplevel.fa > bwa_index.txt
[bwa_index] Pack FASTA... 7.67 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=1664229176, availableWord=129101124
[bwa_index] 531.38 seconds elapse.
[bwa_index] Update BWT... 3.67 sec
[bwa_index] Pack forward-only FASTA... 4.85 sec
[bwa_index] Construct SA from BWT and Occ... 188.46 sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa index -p atl_cod_4_2017 -a bwtsw Gadus_morhua.gadMor1.dna.toplevel.fa
[main] Real time: 736.824 sec; CPU: 736.036 sec

Then I used Picard to create a sequence dictionary:

[seanb80@n2049 cod]$ java -jar picard.jar CreateSequenceDictionary REFERENCE=Gadus_morhua.gadMor1.dna.toplevel.fa OUTPUT=Gadus_morhua.gadMor1.dna.toplevel.dict
[Fri May 05 16:17:45 UTC 2017] picard.sam.CreateSequenceDictionary REFERENCE=Gadus_morhua.gadMor1.dna.toplevel.fa OUTPUT=Gadus_morhua.gadMor1.dna.toplevel.dict    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri May 05 16:17:45 UTC 2017] Executing as seanb80@n2049.hyak.local on Linux 3.10.0-327.36.3.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-b15; Picard version: 2.9.1-SNAPSHOT
[Fri May 05 16:17:55 UTC 2017] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.16 minutes.

It looks like the next step will be the alignment itself, which will have to be submitted as a batch job via sbatch.
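As a rough sketch of what that sbatch job might look like: the snippet below writes a small Slurm script and syntax-checks it. Only the index prefix atl_cod_4_2017 comes from the bwa index run above. The read file names, the resource requests, the choice of bwa mem, and the paired-end layout are all assumptions for illustration, and any partition or account options the cluster requires are left out.

```shell
# Write a hypothetical Slurm batch script for the alignment step.
cat > bwa_align_cod.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=bwa_cod
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8        # placeholder core count
#SBATCH --time=04:00:00          # placeholder walltime
#SBATCH --mem=32G                # placeholder memory request
#SBATCH --output=bwa_cod_%j.out  # %j expands to the Slurm job ID

# Placeholder read files; substitute the actual FASTQ names.
READS_1=sample_R1.fastq.gz
READS_2=sample_R2.fastq.gz

# atl_cod_4_2017 is the index prefix built earlier with bwa index.
# bwa mem is assumed here; -t sets the number of threads.
bwa mem -t 8 atl_cod_4_2017 "$READS_1" "$READS_2" > sample.sam
EOF

# Check the script for shell syntax errors without running it.
bash -n bwa_align_cod.sh && echo "script OK"
```

Once the placeholders are filled in, the job would be submitted with sbatch bwa_align_cod.sh.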