Kaitlyn’s Notebook: UniProt Annotations and SQL Join

After running BLAST on the oyster data against a UniProt database, I joined the results with the UniProt annotations from sr320 on SQL Share.
I used a left join with this condition:
sr.column1 = kr."Protein ID"
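
As a minimal sketch of what a left join does, the shell snippet below joins two small made-up tab-separated tables with coreutils join. The file names, IDs, and values are hypothetical stand-ins, not the real SQL Share tables, and which table sits on the left is an assumption. Every row of the left table is kept, and rows without a matching annotation get NA.

```shell
# Toy left join with coreutils; all file names and values are made up.
export LC_ALL=C                      # same sort order for sort and join
TAB="$(printf '\t')"

# Left table: protein ID and a hypothetical expression value.
printf 'P2\t45\nP1\t120\nP3\t7\n' > kr_toy.tsv
# Right table: protein ID and a hypothetical annotation.
printf 'P3\tActin\nP1\tHeat shock protein\n' > sr_toy.tsv

# join requires both inputs sorted on the join field.
sort -t "$TAB" -k1,1 kr_toy.tsv > kr_sorted.tsv
sort -t "$TAB" -k1,1 sr_toy.tsv > sr_sorted.tsv

# -a 1 keeps unmatched rows from the left file (the "left" in left join);
# -e NA fills the missing annotation; -o selects the output columns.
join -t "$TAB" -a 1 -e NA -o 0,1.2,2.2 kr_sorted.tsv sr_sorted.tsv > joined_toy.tsv
cat joined_toy.tsv
```

P2 has no annotation in the toy right table, so it still appears in the output, with NA in the annotation column.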


I’ve now condensed some of the information into a table that is quicker to read:


There are many possible directions to go from here. The goal is to identify proteins that are highly expressed or that vary between treatments (23°C vs 29°C). Proteins that are highly expressed and/or do not vary between treatments could indicate functions that are essential for Pacific oysters, which I believe are not well understood. For now, I am trying to form a more specific question to investigate with the data I have.

BWA Aligner on Hyak w/ Cod data.

Got a new test for Hyak: running BWA on some cod data. I’m doing this all in a terminal window, so I’ll copy the output here for posterity and also save it to a text file.

First, I copied a bunch of files over with wget. An example is shown below. These are fairly large files.

file 1

file 2

file 3

file 4

reference genome

wget http://de.cyverse.org/dl/d/EC35A828-1A13-4B61-9CE7-67939C4E648B/GGTCAGTT_6.1_trimmed.fastq
--2017-05-05 08:27:05--  http://de.cyverse.org/dl/d/EC35A828-1A13-4B61-9CE7-67939C4E648B/GGTCAGTT_6.1_trimmed.fastq
Resolving de.cyverse.org (de.cyverse.org)...
Connecting to de.cyverse.org (de.cyverse.org)||:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://de.cyverse.org/dl/d/EC35A828-1A13-4B61-9CE7-67939C4E648B/GGTCAGTT_6.1_trimmed.fastq [following]
--2017-05-05 08:27:05--  https://de.cyverse.org/dl/d/EC35A828-1A13-4B61-9CE7-67939C4E648B/GGTCAGTT_6.1_trimmed.fastq
Connecting to de.cyverse.org (de.cyverse.org)||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘GGTCAGTT_6.1_trimmed.fastq’

    [            <=>                     ] 14,786,883,414 10.0MB/s 
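
Since there are several files to fetch, the downloads could be scripted. The snippet below is a minimal sketch, assuming the URLs are saved one per line in a text file; the function name download_all and the DRY_RUN switch are made up for illustration. The -c option asks wget to resume a partial download, although whether that works depends on the server.

```shell
# download_all LIST: fetch every URL listed in LIST (one per line).
# Set DRY_RUN=1 to print the wget commands instead of running them.
download_all() {
    while IFS= read -r url || [ -n "$url" ]; do
        # Skip blank lines and lines starting with '#'.
        case "$url" in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "wget -c $url"
        else
            wget -c "$url"
        fi
    done < "$1"
}

# Intended use (urls.txt is a hypothetical file name):
#   DRY_RUN=1 download_all urls.txt   # check the list first
#   download_all urls.txt             # then download for real
```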

After I moved everything, I unpacked the reference genome and built the BWA index from the supplied .fa file with bwa index. I did this on an interactive execute node, because it wouldn’t be very time consuming.

[seanb80@n2049 cod]$ bwa index -p atl_cod_4_2017 -a bwtsw Gadus_morhua.gadMor1.dna.toplevel.fa > bwa_index.txt
[bwa_index] Pack FASTA... 7.67 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=1664229176, availableWord=129101124
[BWTIncConstructFromPacked] 10 iterations done. 99999992 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999992 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999992 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999992 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999992 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 599999992 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 699999992 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 799999992 characters processed.
[BWTIncConstructFromPacked] 90 iterations done. 899999992 characters processed.
[BWTIncConstructFromPacked] 100 iterations done. 999911096 characters processed.
[BWTIncConstructFromPacked] 110 iterations done. 1092851208 characters processed.
[BWTIncConstructFromPacked] 120 iterations done. 1175452616 characters processed.
[BWTIncConstructFromPacked] 130 iterations done. 1248864968 characters processed.
[BWTIncConstructFromPacked] 140 iterations done. 1314110040 characters processed.
[BWTIncConstructFromPacked] 150 iterations done. 1372096008 characters processed.
[BWTIncConstructFromPacked] 160 iterations done. 1423630072 characters processed.
[BWTIncConstructFromPacked] 170 iterations done. 1469429672 characters processed.
[BWTIncConstructFromPacked] 180 iterations done. 1510132424 characters processed.
[BWTIncConstructFromPacked] 190 iterations done. 1546305128 characters processed.
[BWTIncConstructFromPacked] 200 iterations done. 1578451496 characters processed.
[BWTIncConstructFromPacked] 210 iterations done. 1607019224 characters processed.
[BWTIncConstructFromPacked] 220 iterations done. 1632406280 characters processed.
[BWTIncConstructFromPacked] 230 iterations done. 1654966296 characters processed.
[bwa_index] 531.38 seconds elapse.
[bwa_index] Update BWT... 3.67 sec
[bwa_index] Pack forward-only FASTA... 4.85 sec
[bwa_index] Construct SA from BWT and Occ... 188.46 sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa index -p atl_cod_4_2017 -a bwtsw Gadus_morhua.gadMor1.dna.toplevel.fa
[main] Real time: 736.824 sec; CPU: 736.036 sec
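
One caveat about the log file: the progress messages above showed up in the terminal even though the command used a > redirect, which suggests bwa writes them to stderr and that bwa_index.txt may have ended up empty. A minimal sketch for saving both streams while still seeing them on screen follows; the helper name run_logged is made up for illustration.

```shell
# run_logged LOGFILE CMD...: run CMD, showing its output in the terminal
# while also saving stdout and stderr together in LOGFILE.
run_logged() {
    logfile="$1"
    shift
    # 2>&1 merges stderr into stdout before the pipe, so tee sees both.
    "$@" 2>&1 | tee "$logfile"
}

# Intended use (not run here):
#   run_logged bwa_index.txt bwa index -p atl_cod_4_2017 -a bwtsw Gadus_morhua.gadMor1.dna.toplevel.fa
```

Note that the exit status of the pipeline is the one from tee, not from the wrapped command, so a failed run needs to be spotted from the log itself.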

Then I used Picard to create a sequence dictionary:

[seanb80@n2049 cod]$ java -jar picard.jar CreateSequenceDictionary REFERENCE=Gadus_morhua.gadMor1.dna.toplevel.fa OUTPUT=Gadus_morhua.gadMor1.dna.toplevel.dict
[Fri May 05 16:17:45 UTC 2017] picard.sam.CreateSequenceDictionary REFERENCE=Gadus_morhua.gadMor1.dna.toplevel.fa OUTPUT=Gadus_morhua.gadMor1.dna.toplevel.dict    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri May 05 16:17:45 UTC 2017] Executing as seanb80@n2049.hyak.local on Linux 3.10.0-327.36.3.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-b15; Picard version: 2.9.1-SNAPSHOT
[Fri May 05 16:17:55 UTC 2017] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.16 minutes.

It looks like the next step will be alignment, which will have to be submitted via sbatch.
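
A batch submission for that step might look roughly like the sketch below. It assumes the alignment is done with bwa mem and that the reads are single-end, neither of which is settled above. The job name, resource numbers, and output file names are placeholders, and any account or partition options that Hyak requires are left out.

```shell
# Write a hypothetical Slurm job script for the alignment step.
cat > bwa_align.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=bwa_cod
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --output=bwa_align_%j.log

# Align one trimmed read file against the index built earlier
# (prefix atl_cod_4_2017). bwa mem writes SAM to stdout, so it is
# redirected to a file; progress messages go to the job log.
bwa mem -t "$SLURM_CPUS_PER_TASK" atl_cod_4_2017 \
    GGTCAGTT_6.1_trimmed.fastq > GGTCAGTT_6.1.sam
EOF

# Submit with: sbatch bwa_align.sbatch
```

The heredoc delimiter is quoted so that $SLURM_CPUS_PER_TASK is written literally into the job script and only expanded when the job runs.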