Sam’s Notebook: Data Wrangling – FastA Splitting With faSplit

Steven posted an issue on GitHub regarding splitting a FastA file into multiple sequences. Specifically, he wanted a single, large FastA sequence (~89Mbp) split into smaller FastAs for BLASTing.

I downloaded the FastA he provided (https://d.pr/f/UlzHLR) and split the sequence into 2000bp chunks using the faSplit program (https://ift.tt/2Znl0nP):

    faSplit \
    size \
    20190731_faSplit_PGA-scaffold1_splits_2000bp/ \
    2000
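To illustrate what size-based splitting does, here is a minimal sketch in plain awk. It is not faSplit and says nothing about faSplit's internals or output naming; the function name `split_fasta_chunks` and the `_N` suffix on record names are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash

# split_fasta_chunks FASTA CHUNK_SIZE
# Hypothetical helper (not part of faSplit): prints every input sequence
# to stdout as consecutive records of at most CHUNK_SIZE bases.
split_fasta_chunks() {
  awk -v size="$2" '
    # Write the sequence collected so far as numbered chunks
    function emit(    i, n) {
      n = 0
      for (i = 1; i <= length(seq); i += size) {
        printf(">%s_%d\n%s\n", name, n, substr(seq, i, size))
        n++
      }
    }
    # Header line: flush the previous record, keep the first word as the name
    /^>/ { if (seq != "") emit(); name = substr($1, 2); seq = ""; next }
    # Sequence line: append to the current record
    { seq = seq $0 }
    END { if (seq != "") emit() }
  ' "$1"
}
```

Something like `split_fasta_chunks scaffold.fa 2000 > chunks.fa` would write all chunks to a single file (`scaffold.fa` is a placeholder name). Because the whole sequence is held in memory, this is only meant as an illustration.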

Sam’s Notebook: Data Summary – P.generosa Transcriptome Assemblies Stats

In our continuing quest to wrangle the geoduck transcriptome assemblies we have, I was tasked with compiling assembly stats for our various assemblies. The table below provides an overview of some stats for each of our assemblies. Links within the table go to the notebook entries for the various methods from which the data were gathered. In general:

  • Genes/Isoforms stats come directly from the Trinity assembly stats output file.
  • transdecoder_pep is a count of headers in the Transdecoder FastA output file, transdecoder_pep.
  • CD-Hit is a count of headers in the CD-Hit-est FastA output file.

Assembly | Genes | Isoforms | transdecoder_pep | CD-Hit
ctenidia | 216248 (https://ift.tt/2ZlNCh7) | 349773 | 72274 | 325783
gonad | 151263 | 198748 | 31706 | 189378
Juvenile (EPI 115) | 199765 | 320691 | 78149 | 297848
Juvenile (EPI 116) | 268476 | 434877 | 99089 | 408498
Juvenile (EPI 123) | 196131 | 303568 | 67398 | 284852
Juvenile (EPI 124) | 255277 | 421670 | 93285 | 395527
Larvae (EPI 99) | 249799 | 425165 | 77694 | 379210
MEANS | 219566 | 350642 | 74228 | 325871
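
The header counts and the MEANS row can be reproduced with standard tools. The sketch below is one way to do it; the function names `count_fasta_headers` and `column_mean` are hypothetical and are not taken from the notebook entries linked in the table.

```shell
#!/usr/bin/env bash

# count_fasta_headers FASTA
# Counts FastA records, i.e. lines that start with ">".
count_fasta_headers() {
  grep -c '^>' "$1"
}

# column_mean
# Reads one number per line on stdin and prints the mean,
# rounded to the nearest integer.
column_mean() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf("%.0f\n", sum / n) }'
}
```

For example, piping the seven Genes values from the table through `column_mean` (`printf '%s\n' 216248 151263 199765 268476 196131 255277 249799 | column_mean`) prints 219566, which matches the MEANS row.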

Sam’s Notebook: Transcriptome Compression – P.generosa Transcriptome Assemblies Using CD-Hit-est on Mox

In continued attempts to get a grasp on the geoduck transcriptome size, I decided to “compress” our various assemblies by clustering similar transcripts in each assembly into a single “representative” transcript, using CD-Hit-est. Settings used to run it were taken from the Trinity FAQ regarding “too many transcripts”.

A bash script was used to rsync files to Mox and then execute the SBATCH script.

Bash script (GitHub):

    #!/usr/bin/bash

    # Script to retrieve geoduck Trinity assemblies
    # Assemblies will be used in SBATCH script called at end of this script.
    # Script needs to be run within same directory as SBATCH script.

    # Exit if any command fails
    set -e

    # Set rsync remote path
    gannet="gannet:/volume2/web/Atumefaciens"
    owl="owl:/volume1/web/Athaliana"

    # Create array of directories for storing Trinity assemblies
    assembly_dirs_array=(
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/20180827_assembly
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/heart
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99
    )

    # Array of Trinity assemblies remote paths for rsync-ing
    assemblies_array=(
    20180827_trinity_geoduck_RNAseq/Trinity.fasta
    20190409_trinity_pgen_ctenidia_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_gonad_RNAseq/trinity_out_dir/Trinity.fasta
    20190215_trinity_geoduck_heart_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI115_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI116_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI123_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI124_RNAseq/trinity_out_dir/Trinity.fasta
    20190409_trinity_pgen_EPI99_RNAseq/trinity_out_dir/Trinity.fasta
    )

    # Retrieve FastA files via rsync
    for index in "${!assemblies_array[@]}"
    do
      # Remove everything after first slash
      assembly=$(echo "${assemblies_array[index]%%/*}")
      echo "Preparing to download ${assembly}..."

      # The 20180827 assembly lives on owl; all others live on gannet
      if [ "${assembly}" = "20180827_trinity_geoduck_RNAseq" ]; then
        echo "Now syncing ${assembly} to ${assembly_dirs_array[index]}"
        rsync \
        --archive \
        --progress \
        "${owl}/${assemblies_array[index]}" \
        "${assembly_dirs_array[index]}"
      else
        echo "Now syncing ${assembly} to ${assembly_dirs_array[index]}"
        rsync \
        --archive \
        --progress \
        "${gannet}/${assemblies_array[index]}" \
        "${assembly_dirs_array[index]}"
      fi
    done

    # Start SBATCH script to run CD-Hit on all transcriptome assemblies
    sbatch 20190729_cdhit-est_pgen_transcriptomes.sh
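
Both scripts in this entry rely on bash parameter expansion to pull names out of paths. The short sketch below shows the two forms in isolation, using one path from each array as sample input; the variable names `remote_path`, `local_dir`, `run_name` and `sample_name` are chosen here for illustration.

```shell
#!/usr/bin/env bash

remote_path="20190409_trinity_pgen_ctenidia_RNAseq/trinity_out_dir/Trinity.fasta"
local_dir="/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115"

# %%/* strips the longest suffix matching "/*":
# everything from the first slash onward is removed.
run_name="${remote_path%%/*}"

# ##*/ strips the longest prefix matching "*/":
# everything up to and including the last slash is removed.
sample_name="${local_dir##*/}"

echo "${run_name}"     # prints 20190409_trinity_pgen_ctenidia_RNAseq
echo "${sample_name}"  # prints EPI115
```

Since the expansion already yields a string, wrapping it in `$(echo ...)` as the scripts do is harmless but not required.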

SBATCH script (GitHub):

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=cdhit_pgen
    ## Allocation Definition
    #SBATCH --account=srlab
    #SBATCH --partition=srlab
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=5-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ##turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190729_cdhit-est_pgen_transcriptomes

    # This script is called by 20190729_cdhit_pgen_trinity_assemblies.sh.
    # That script uses rsync to transfer files to Mox via the login node.
    # This is required because Mox execute nodes don't have internet access.

    # Exit script if any command fails
    set -e

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo "${PATH}" | tr : \\n >> system_path.log

    # Set CPU threads
    threads=27

    # Program paths
    cd_hit_est="/gscratch/srlab/programs/cd-hit-v4.8.1-2019-0228/cd-hit-est"

    # Create assembly paths array
    assembly_dirs_array=(
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/20180827_assembly
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/heart
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124
    /gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99
    )

    # Run cd-hit-est on each assembly
    for index in "${!assembly_dirs_array[@]}"
    do
      # Store individual sample name by removing
      # everything up to and including the last slash in path
      sample_name=$(echo "${assembly_dirs_array[index]##*/}")

      # Run cd-hit-est
      "${cd_hit_est}" \
      -o "${sample_name}".cdhit \
      -c 0.98 \
      -i "${assembly_dirs_array[index]}"/Trinity.fasta \
      -p 1 \
      -d 0 \
      -b 3 \
      -T "${threads}" \
      -M 0
    done
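
CD-Hit writes a cluster report next to each output FastA, named with a `.clstr` suffix, in which every cluster starts with a `>Cluster N` line. As a rough cross-check on the amount of compression, the clusters can be counted as sketched below. The function names are hypothetical, and the cluster count should in principle equal the number of representative sequences in the matching `.cdhit` FastA.

```shell
#!/usr/bin/env bash

# count_clusters CLSTR_FILE
# Counts clusters in a CD-Hit .clstr report (lines starting with ">Cluster").
count_clusters() {
  grep -c '^>Cluster' "$1"
}

# summarize_cdhit CLSTR_FILE...
# Hypothetical helper: prints "<name without .clstr><TAB><cluster count>"
# for each report given on the command line.
summarize_cdhit() {
  local clstr
  for clstr in "$@"; do
    printf '%s\t%s\n' "${clstr%.clstr}" "$(count_clusters "${clstr}")"
  done
}
```

With the output names used in the SBATCH script, a call might look like `summarize_cdhit *.cdhit.clstr`.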

Shelly’s Notebook: Wed. Jul. 24, 2019 Pt. Whitney Juv. Geoduck resistance to stress plans

experimental plans


biological response measurement ideas

options for low pH treatment:

  1. piggy-back on Sam’s treatment: constant low pH 7.0 total scale
  2. variable diel low pH to simulate lagoon conditions

options for high temperature treatment:

  1. constant high temperature of 20C
  2. constant high temperature of 29C (this will likely kill them before 1 month)
  3. variable diel high temperature of 29C to simulate lagoon conditions

Space:

  • There is space available across from our current heath stacks, where the totes are, to put in another heath stack.
  • Matt should have space for 4 trays in some other heath stacks to house any left over animals if needed

NEXT STEPS:

  • begin heath stack set up directly across from heath stacks
    • test out heat control, flow, etc.

Equipment update

  • Thermometer for discrete temp measurements errored out. See Slack post. It may just be the probe, but need to test this.
  • Apex status:
    • Controller not lighting up at all when plugged into everything. Need to call support and get help troubleshooting this
    • Power strip lights up fine
    • Aqua buses:
      • how many can be linked to 1 Apex controller?
    • Probes:
      • ideally need 12 pH + temp sets
    • brought Controller back to UW to troubleshoot
  • Have 20 total heath trays plumbed for downwelling
  • Can use 1 micron screen mesh tray inserts with large grain sand (Matt has extra).
    • NEED to check:
      • # of good quality inserts we have (ideally 10)
      • # of inserts that need dividers added
        • can cut dividers from black plexiglass upstairs in the hatchery
  • Controlled heating source:
    • NEED 2 heat rods that can be plugged into the Apex
      • have these at UW, need to test
  • Flow:
    • if doing constant pH, NEED 4 pumps with adjustable flow if using same header conicals as Sam
    • if doing diel variable pH, use different conical and don’t need adjustable flow
      • extra pumps at hatchery (imagitarium secondary pumps used for circulation in broodstock experiment)
      • would need additional probe set
  • More equipment will free up when Sam is done by the end of August

Checked on animals


Size difference between H2T3 animals and H2T1 animals from cascade feeding.

In-flow hose was not going directly into the heath tray insert in H2T1.

Tried to break up algae and clear mesh to improve flow of food and shuffled trays around to new order:

  1. H2T1
  2. H2T5
  3. H2T7
  4. H2T3
  5. H2T6 (extras)
  6. H2T2 (low density)

*only swapped H2T3 with H2T1 because these trays showed the biggest size differences.

*Left H2T6 and H2T2 in same positions because they had a lot of die off and aren’t going to be part of the experiment.

Size data (photos) for all animals are here. Size ranges for all are about 2-6 mm. I will do individual measurements with ImageJ.

I think the survival and size data from larval rearing through now are too convoluted to draw conclusions about parental exposure.

Prior to pH x temp experiment, I can size select to start with animals of similar sizes across the board.

from shellytrigg https://ift.tt/2ynMaPM
via IFTTT

Laura’s Notebook: Oly RNA isolation – larvae

July 23, 2019

Homogenized and frozen larvae to prep for RNA isolation, and aliquoted larvae in ethanol for imaging/measuring at a later date. These larvae were collected during spring 2017, captured on screens upon release from mother, concentrated into microcentrifuge tubes, and placed immediately into a -80°C freezer to preserve. Larvae in the same tube are likely to be full siblings or half siblings; the number of larvae collected on that day / from that collection bucket is a good indication of whether more than one female released brood that day (we expect ~200k larvae/female).

First step is homogenization, which is particularly necessary due to larvae having shells that need to be broken up. As with the ctenidia samples, I used mortar+pestle and liquid nitrogen. I pre-portioned 1 mL of RNAzol into microcentrifuge tubes, and transferred up to 100 mg homogenized larvae into chilled RNAzol. NOTE: I tried to get as much tissue as possible up to 100 mg, since the larval shells likely comprise a significant percentage of the mass.

In total I have 83 larval samples. I isolated RNA from 14 of these samples in spring 2018 for the test QuantSeq round. A few resulted in too little RNA, 2 of which had sufficient larvae remaining for another attempt to isolate RNA. So, I homogenized a total of 70 samples into RNAzol in 10 batches (~4 days of work), with an additional 2 controls (just liquid nitrogen ground into mortar+pestle).

July 24, 2019

Did a first round of RNA isolation (n=12, batch #1) to test protocol. Followed RNAzol protocol as with the ctenidia samples, with two exceptions: 1) after addition of DEPC-treated water (0.2 mL), centrifuged at 16,000 rcf (not 12,000); 2) added 100 uL DEPC-treated water in last step to dissolve RNA. Samples in this 1st batch: ABCDEFG

Also of note: after precipitating the DNA/proteins, the supernatant retained a black/gray color. Then, after adding isopropanol, the RNA pellet was predominantly black in 10 of the 12 samples (the 2 samples with small, white pellets were X and Y). Then, upon dissolving RNA in the DEPC-treated water and vortexing for 5 minutes, there was a black substance that settled into the bottom of the tubes. A black substance occurred before, in 2018 when I did a preliminary round of RNA isolation from the same study’s Oly larvae (see notebook post); this didn’t seem to interfere with the successful QuantSeq run.

I quantified RNA concentration in this first batch, using 1 uL of the RNA solution, pulled from the surface of the solution to avoid the black substance. Measured RNA in all samples, at concentrations between 30 and 90 ng/uL.

Image of a few tubes showing that the RNA pellet included black substance: IMG_8758

Image of RNA dissolved in water, showing a black substance settled at the bottom of the tubes. IMG_8757

July 26, 2019

Big day of RNA isolation – did 4 batches of 12, but processed 2 batches simultaneously, offset by about 10 minutes. This was feasible because there is down time during reactions/centrifuging.

Batches #2 and #3 (run simultaneously, offset) followed the same protocol as batch #1 (above), with one exception: added 75 uL of DEPC-treated water in final step to dissolve RNA.

Batch #2 samples: 402, 412, 421, 431, 432, 442, 452, 461, 462, 482, 532, 542

Batch #3 samples: 403, 404, 472, 473, 483, 484, 491, 522, 533, 552, 562, 571

Batches #4 and #5 (run simultaneously, offset), followed same protocol as batches 2 & 3, with one exception: did not vortex homogenate prior to transferring 500 uL to new tubes with the 200 uL DEPC-treated water. The resulting RNA dissolved in water had significantly less black substance, likely due to this adjustment.

I quantified RNA isolated from all samples, shown in the table below, and the spreadsheet is saved in the repo. A handful of samples resulted in relatively poor yield (452, 461, 462, 472, 533, 552), and several samples had a substantial amount of black substance mixed into the RNA solution (404, 431, 432, 442, 443(ish), 461, 462, 472, 486(ish), 521(ish), 522(ish), 552). Some samples fell into both categories: 461, 462, 472, 552. I will process some of the remaining homogenate (250 uL) to try to get a cleaner batch of RNA.

Date larvae collected | Cohort | Treatment | TISSUE SAMPLE # | HOMOG. TUBE # | VOL RNAzol (mL) | MASS TISSUE (mg) | DATE HOMOG. | HOMOG. BATCH | RNA ISOLATION DATE | RNA ISOLATION BATCH | Total RNA volume remaining, uL (RNA + H2O) | [RNA] ng/uL | Amount of RNA (ng) | Volume needed for 500 ng RNA (uL) | Notes
5/24/17 | Dabob Bay | 10 Ambient | 14-A | 401 | 1 | 100 | 19-Jul | 1 | 24-Jul | 1 | 99 | 52.0 | 5,148 | 9.62 |
5/31/17 | Dabob Bay | 10 Ambient | 31-A | 402 | 1 | 10 | 20-Jul | 3 | 26-Jul | 2 | 74 | 140.0 | 10,360 | 3.57 |
6/19/17 | Dabob Bay | 10 Ambient | 75-A | 403 | 1 | 40 | 22-Jul | 5 | 26-Jul | 3 | 74 | 148.0 | 10,952 | 3.38 |
6/29/17 | Dabob Bay | 10 Ambient | 80-A | 404 | 1 | 110 | 22-Jul | 6 | 26-Jul | 3 | 74 | 95.2 | 7,045 | 5.25 |
5/26/17 | Dabob Bay | 10 Low | 23-A | 411 | 1 | 10 | 19-Jul | 1 | 24-Jul | 1 | 99 | 57.2 | 5,663 | 8.74 |
5/27/17 | Dabob Bay | 10 Low | 27-A | 412 | 1 | 10 | 20-Jul | 3 | 26-Jul | 2 | 74 | 60.8 | 4,499 | 8.22 |
6/10/17 | Dabob Bay | 10 Low | 58-A | 413 | 1 | 80 | 22-Jul | 7 | 26-Jul | 4 | 74 | 91.4 | 6,764 | 5.47 |
6/12/17 | Dabob Bay | 10 Low | 60-A | 414 | 1 | 20 | 23-Jul | 8 | 26-Jul | 4 | 74 | 136.0 | 10,064 | 3.68 |
6/12/17 | Dabob Bay | 6 Ambient | 59-A | 421 | 1 | 10 | 20-Jul | 2 | 26-Jul | 2 | 74 | 43.0 | 3,182 | 11.63 |
6/7/17 | Dabob Bay | 6 Low | 51-A | 431 | 1 | 20 | 20-Jul | 2 | 26-Jul | 2 | 74 | 43.4 | 3,212 | 11.52 |
6/17/17 | Dabob Bay | 6 Low | 72-A | 432 | 1 | 50 | 22-Jul | 4 | 26-Jul | 2 | 74 | 47.6 | 3,522 | 10.50 |
6/17/17 | Dabob Bay | 6 Low | 73-A | 433 | na | na | na | na | na | na | na | na | na | na |
6/19/17 | Dabob Bay | 6 Low | 74-A | 434 | 1 | 60 | 23-Jul | 9 | | 5 | 74 | 67.8 | 5,017 | 7.37 |
5/25/17 | Fidalgo Bay | 10 Ambient | 20-A | 441 | 1 | 70 | 19-Jul | 1 | 24-Jul | 1 | 99 | 46.0 | 4,554 | 10.87 |
6/3/17 | Fidalgo Bay | 10 Ambient | 38-A | 442 | 1 | 80 | 20-Jul | 3 | 26-Jul | 2 | 74 | LOW | | |
6/7/17 | Fidalgo Bay | 10 Ambient | 53-A | 443 | 1 | 60 | 22-Jul | 6 | 26-Jul | 4 | 74 | 58.4 | 4,322 | 8.56 |
6/14/17 | Fidalgo Bay | 10 Ambient | 63-A | 444 | 1 | 80 | 22-Jul | 7 | 26-Jul | 4 | 74 | 91.2 | 6,749 | 5.48 |
6/15/17 | Fidalgo Bay | 10 Ambient | 65-A | 445 | 1 | 40 | 23-Jul | 8 | 26-Jul | 5 | 74 | 132.0 | 9,768 | 3.79 |
5/24/17 | Fidalgo Bay | 10 Low | 16-A | 451 | 1 | 70 | 19-Jul | 1 | 24-Jul | 1 | 99 | 68.4 | 6,772 | 7.31 |
5/24/17 | Fidalgo Bay | 10 Low | 18-A | 452 | 1 | 80 | 20-Jul | 3 | 26-Jul | 2 | 74 | 37.8 | 2,797 | 13.23 | poor yield b/c black substance w/ RNA?
6/3/17 | Fidalgo Bay | 10 Low | 36-A | 453 | 1 | 80 | 23-Jul | 9 | 27-Jul | 6 | 74 | 85.2 | 6,305 | 5.87 |
5/26/17 | Fidalgo Bay | 6 Ambient | 22-A | 461 | 1 | 100 | 20-Jul | 2 | 26-Jul | 2 | 74 | 31.0 | 2,294 | 16.13 | poor yield b/c black substance w/ RNA?
5/29/17 | Fidalgo Bay | 6 Ambient | 29-A | 462 | 1 | 60 | 22-Jul | 4 | 26-Jul | 2 | 74 | 29.8 | 2,205 | 16.78 | poor yield b/c black substance w/ RNA?
5/25/17 | Fidalgo Bay | 6 Low | 19-A | 471 | 1 | 100 | 20-Jul | 2 | 24-Jul | 1 | 99 | 33.8 | 3,346 | 14.79 |
5/26/17 | Fidalgo Bay | 6 Low | 21-A | 472 | 1 | 70 | 22-Jul | 4 | 26-Jul | 3 | 74 | 8.3 | 613 | 60.39 | poor yield b/c black substance w/ RNA?
6/5/17 | Fidalgo Bay | 6 Low | 46-A | 473 | 1 | 50 | 22-Jul | 5 | 26-Jul | 3 | 74 | 120.0 | 8,880 | 4.17 |
6/5/17 | Fidalgo Bay | 6 Low | 47-A | 474 | 1 | 50 | 22-Jul | 6 | 26-Jul | 4 | 74 | 81.0 | 5,994 | 6.17 |
6/6/17 | Fidalgo Bay | 6 Low | 50-A | 475 | 1 | 80 | 22-Jul | 7 | 26-Jul | 4 | 74 | 110.0 | 8,140 | 4.55 |
6/10/17 | Fidalgo Bay | 6 Low | 54-A | 476 | 1 | 80 | 23-Jul | 8 | 26-Jul | 5 | 74 | 43.4 | 3,212 | 11.52 |
6/19/17 | Fidalgo Bay | 6 Low | 76-A | 477 | 1 | 40 | 23-Jul | 9 | 27-Jul | 6 | 74 | HIGH | | |
5/20/17 | Oyster Bay C1 | 10 Ambient | 02-A, 02-B | 481 | 1 | 40 | 19-Jul | 1 | 24-Jul | 1 | 99 | 64.4 | 6,376 | 7.76 |
5/20/17 | Oyster Bay C1 | 10 Ambient | 04-A, 04-B | 482 | 1 | 60 | 20-Jul | 3 | 26-Jul | 2 | 74 | 67.2 | 4,973 | 7.44 |
5/21/17 | Oyster Bay C1 | 10 Ambient | 03-A, 03-B | 483 | 1 | 110 | 22-Jul | 4 | 26-Jul | 3 | 74 | 95.8 | 7,089 | 5.22 |
5/23/17 | Oyster Bay C1 | 10 Ambient | 09-A | 484 | 1 | 40 | 22-Jul | 5 | 26-Jul | 3 | 74 | 66.2 | 4,899 | 7.55 |
6/1/17 | Oyster Bay C1 | 10 Ambient | 34-A | 485 | 1 | 20 | 22-Jul | 6 | 26-Jul | 4 | 74 | 156.0 | 11,544 | 3.21 |
6/3/17 | Oyster Bay C1 | 10 Ambient | 39-A | 486 | 1 | 90 | 22-Jul | 7 | 26-Jul | 4 | 74 | 41.0 | 3,034 | 12.20 |
6/3/17 | Oyster Bay C1 | 10 Ambient | 40-A | 487 | 1 | 30 | 23-Jul | 8 | 26-Jul | 5 | 74 | 112.0 | 8,288 | 4.46 |
6/4/17 | Oyster Bay C1 | 10 Ambient | 44-A | 488 | 1 | 70 | 23-Jul | 9 | 26-Jul | 5 | 74 | 57.2 | 4,233 | 8.74 |
6/6/17 | Oyster Bay C1 | 10 Ambient | 49-A | 489 | 1 | 10 | 23-Jul | 9 | 27-Jul | 6 | 74 | 57.2 | 4,233 | 8.74 |
6/14/17 | Oyster Bay C1 | 10 Ambient | 64-A | 490 | 1 | 70 | 22-Jul | 7 | 27-Jul | 6 | 74 | 58.6 | 4,336 | 8.53 |
6/15/17 | Oyster Bay C1 | 10 Ambient | 66-A | 491 | 1 | 20 | 22-Jul | 5 | 26-Jul | 3 | 74 | 126.0 | 9,324 | 3.97 |
7/6/17 | Oyster Bay C1 | 10 Ambient | 81-A | 492 | 1 | 70 | 22-Jul | 6 | 26-Jul | 5 | 74 | 122.0 | 9,028 | 4.10 |
5/21/17 | Oyster Bay C1 | 10 Low | 06-A, 06-B | 501 | 1 | <10 | 19-Jul | 1 | na | na | na | na | na | na |
5/23/17 | Oyster Bay C1 | 10 Low | 08-A | 502 | na | na | na | na | na | na | na | na | na | na |
5/26/17 | Oyster Bay C1 | 10 Low | 24-A | 503 | na | na | na | na | na | na | na | na | na | na |
5/27/17 | Oyster Bay C1 | 10 Low | 26-A | 504 | na | na | na | na | na | na | na | na | na | na |
5/31/17 | Oyster Bay C1 | 10 Low | 32-A | 505 | na | na | na | na | na | na | na | na | na | na |
6/14/17 | Oyster Bay C1 | 10 Low | 62-A | 506 | 1 | 80 | 20-Jul | 2 | 24-Jul | 1 | 99 | 63.8 | 6,316 | 7.84 |
6/15/17 | Oyster Bay C1 | 10 Low | 67-A | 507 | na | na | na | na | na | na | na | na | na | na |
6/24/17 | Oyster Bay C1 | 10 Low | 79-A | 508 | na | na | na | na | na | na | na | na | na | na |
5/23/17 | Oyster Bay C1 | 6 Ambient | 10-A | 511 | na | na | na | na | na | na | na | na | na | na |
6/3/17 | Oyster Bay C1 | 6 Ambient | 37-A | 512 | na | na | na | na | na | na | na | na | na | na |
6/5/17 | Oyster Bay C1 | 6 Ambient | 45-A | 513 | | 30 | 23-Jul | 10 | 26-Jul | 5 | 74 | 156.0 | 11,544 | 3.21 |
6/6/17 | Oyster Bay C1 | 6 Ambient | 48-A | 514 | na | na | na | na | na | na | na | na | na | na |
6/15/17 | Oyster Bay C1 | 6 Ambient | 69-A | 515 | na | na | na | na | na | na | na | na | na | na |
6/17/17 | Oyster Bay C1 | 6 Ambient | 71-A | 516 | na | na | na | na | na | na | na | na | na | na |
6/19/17 | Oyster Bay C1 | 6 Ambient | 77-A | 517 | na | na | na | na | na | na | na | na | na | na |
5/21/17 | Oyster Bay C1 | 6 Low | 01-A, 01-B- 01-C | 521 | 1 | 70 | 20-Jul | 2 | 24-Jul | 1 | 99 | 54.4 | 5,386 | 9.19 |
5/22/17 | Oyster Bay C1 | 6 Low | 07-A | 522 | 1 | 20 | 22-Jul | 5 | 26-Jul | 3 | 74 | 60.8 | 4,499 | 8.22 |
5/27/17 | Oyster Bay C1 | 6 Low | 25-A | 523 | 1 | 30 | 22-Jul | 6 | 26-Jul | 4 | 74 | 60.6 | 4,484 | 8.25 |
5/27/17 | Oyster Bay C1 | 6 Low | 28-A | 524 | 1 | 80 | 22-Jul | 7 | 26-Jul | 4 | 74 | 80.8 | 5,979 | 6.19 |
5/29/17 | Oyster Bay C1 | 6 Low | 30-A | 525 | 1 | 30 | 23-Jul | 8 | 26-Jul | 5 | 74 | 128.0 | 9,472 | 3.91 |
5/31/17 | Oyster Bay C1 | 6 Low | 33-A | 526 | 1 | 30 | 23-Jul | 9 | 27-Jul | 6 | 74 | 65.2 | 4,825 | 7.67 |
6/14/17 | Oyster Bay C1 | 6 Low | 61-A | 527 | 1 | 90 | 22-Jul | 6 | 27-Jul | 6 | 74 | 81.4 | 6,024 | 6.14 |
6/15/17 | Oyster Bay C1 | 6 Low | 68-A | 528 | 1 | 30 | 23-Jul | 8 | 26-Jul | 5 | 74 | 162.0 | 11,988 | 3.09 |
6/17/17 | Oyster Bay C1 | 6 Low | 70-A | 529 | 1 | 70 | 23-Jul | 9 | 26-Jul | 5 | 74 | 73.4 | 5,432 | 6.81 |
5/24/17 | Oyster Bay C2 | 10 Ambient | 17-A | 531 | 1 | 60 | 19-Jul | 1 | 24-Jul | 1 | 99 | 88.2 | 8,732 | 5.67 |
6/3/17 | Oyster Bay C2 | 10 Ambient | 42-A | 532 | 1 | 40 | 20-Jul | 3 | 26-Jul | 2 | 74 | 158.0 | 11,692 | 3.16 |
6/10/17 | Oyster Bay C2 | 10 Ambient | 56-A | 533 | 1 | <10 | 22-Jul | 4 | 26-Jul | 3 | 74 | 7.4 | 548 | 67.57 | poor yield b/c very little tissue (likely)
5/23/17 | Oyster Bay C2 | 10 Low | 12-A | 541 | 1 | 40 | 10-Jul | 1 | 24-Jul | 1 | 99 | 45.6 | 4,514 | 10.96 |
5/24/17 | Oyster Bay C2 | 10 Low | 13-A | 542 | 1 | 30 | 20-Jul | 3 | 26-Jul | 2 | 74 | 82.0 | 6,068 | 6.10 |
6/4/17 | Oyster Bay C2 | 10 Low | 43-A | 543 | 1 | 80 | 22-Jul | 7 | 27-Jul | 6 | 74 | 61.4 | 4,544 | 8.14 |
6/1/17 | Oyster Bay C2 | 6 Ambient | 35-A | 551 | 1 | 30 | 20-Jul | 2 | 24-Jul | 1 | 99 | 86.0 | 8,514 | 5.81 |
6/3/17 | Oyster Bay C2 | 6 Ambient | 41-A | 552 | 1 | 80 | 22-Jul | 5 | 26-Jul | 3 | 74 | 17.5 | 1,295 | 28.57 | poor yield b/c black substance w/ RNA?
6/10/17 | Oyster Bay C2 | 6 Ambient | 55-A | 553 | 1 | 30 | 23-Jul | 8 | 26-Jul | 4 | 74 | 200.0 | 14,800 | 2.50 |
6/20/17 | Oyster Bay C2 | 6 Ambient | 78-A | 554 | 1 | 30 | 23-Jul | 9 | 26-Jul | 5 | 74 | 156.0 | 11,544 | 3.21 |
5/21/17 | Oyster Bay C2 | 6 Low | 05-A | 561 | 1 | 40 | 20-Jul | 2 | 24-Jul | 1 | 99 | 43.4 | 4,297 | 11.52 |
5/23/17 | Oyster Bay C2 | 6 Low | 11-A | 562 | 1 | 90 | 22-Jul | 5 | 26-Jul | 3 | 74 | 106.0 | 7,844 | 4.72 |
5/24/17 | Oyster Bay C2 | 6 Low | 15-A | 563 | 1 | 50 | 22-Jul | 6 | 26-Jul | 4 | 74 | 84.6 | 6,260 | 5.91 |
6/7/17 | Oyster Bay C2 | 6 Low | 52-A | 564 | 1 | not recorded | 22-Jul | 7 | 27-Jul | 6 | 74 | HIGH | | |
6/10/17 | Oyster Bay C2 | 6 Low | 57-A | 565 | 1 | 10 | 23-Jul | 8 | 27-Jul | 6 | 74 | 31.8 | 2,353 | 15.72 |
NA | RNA Control | RNA Control | | 571 | 1 | 10 | 22-Jul | 4 | 26-Jul | 3 | 74 | LOW | NA | NA |
NA | RNA Control | RNA Control | | 572 | 1 | 10 | 23-Jul | 10 | 26-Jul | 5 | 74 | LOW | NA | NA |
NA | RNA Control | RNA Control | | 574 | NA | NA | NA | NA | 27-Jul | 6 | 74 | LOW | NA | NA |
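
The two derived columns in the table follow from the measured volume and concentration: amount of RNA (ng) is volume (uL) times concentration (ng/uL), and the volume containing 500 ng is 500 divided by the concentration. A minimal sketch of that arithmetic is below; the function name `rna_yield` is hypothetical, and the spreadsheet in the repo remains the source of record.

```shell
#!/usr/bin/env bash

# rna_yield VOLUME_UL CONC_NG_PER_UL
# Prints "<total RNA in ng><TAB><uL needed for 500 ng>".
# Non-numeric or non-positive concentrations (e.g. LOW, HIGH) print "NA<TAB>NA".
rna_yield() {
  awk -v vol="$1" -v conc="$2" 'BEGIN {
    if (conc + 0 <= 0) { printf("NA\tNA\n"); exit }
    printf("%.0f\t%.2f\n", vol * conc, 500 / conc)
  }'
}
```

For sample 401 (99 uL at 52.0 ng/uL), `rna_yield 99 52.0` gives 5148 ng and 9.62 uL, in agreement with the table.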

from The Shell Game https://ift.tt/2GCINc3

Shelly’s Notebook: Tues. Jul. 23, 2019 Salmon + sea lice methylomes and Oyster Proteomics

Oyster Proteomics

Salmon + sea lice methylomes

  • Still running TrimGalore! (probably will be around 30 hours to complete)
  • Prepared Bismark genomes on Mox:
    • Salmon genome prep
      • script location: /gscratch/srlab/strigg/jobs/BuildSalmo_BmrkGenome.sh
      • bismark genome location: /gscratch/srlab/strigg/data/Ssalar/GENOMES
    • Sea lice genome prep
      • script location: /gscratch/srlab/strigg/jobs/BuildCalig_BmrkGenome.sh
      • bismark genome location: /gscratch/srlab/strigg/data/Caligus/GENOMES
  • Determine bismark alignment settings to use

from shellytrigg https://ift.tt/2y8ReXS

Sam’s Notebook: Genome Annotation – Pgenerosa_v074 Hisat2 Transcript Isoform Index

Essentially, the steps below (which is what was done here) are needed to prepare files for use with Stringtie:

  1. Create GTF file (Gene Transfer Format; a stricter relative of GFF that is commonly used for transcript annotations) from input GFF file. Done with GFF utilities software (gffread).
  2. Identify splice sites and exons in newly-created GTF. Done with Hisat2 software.
  3. Create a Hisat2 reference index that utilizes the GTF. Done with Hisat2 software.

This was run on Mox.

The SBATCH script has a bunch of leftover extraneous steps that aren’t relevant to this step of the annotation process; specifically the FastQ manipulation steps. This is due to a copy/paste from a previous Hisat2 run that I neglected to edit out and I didn’t want to edit the script after I actually ran it, so have left it in here.

SBATCH script (GitHub):

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=oly_hisat2
    ## Allocation Definition
    #SBATCH --account=srlab
    #SBATCH --partition=srlab
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=25-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ##turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_hisat2-build_pgen_v074

    # Exit script if any command fails
    set -e

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo "${PATH}" | tr : \\n >> system_path.log

    threads=28
    genome_index_name="Pgenerosa_v074"

    # Paths to programs
    gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
    hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
    hisat2_build="${hisat2_dir}/hisat2-build"
    hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
    hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

    # Input/output files
    fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq"
    genome_dir="/gscratch/srlab/sam/data/P_generosa/genomes"
    genome_gff="${genome_dir}/Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff"
    exons="hisat2_exons.tab"
    genome_fasta="${genome_dir}/Pgenerosa_v074.fa"
    splice_sites="hisat2_splice_sites.tab"
    transcripts_gtf="Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gtf"

    ## Initialize arrays
    fastq_array_R1=()
    fastq_array_R2=()

    # Create array of fastq R1 files
    for fastq in "${fastq_dir}"/*R1*.gz
    do
      fastq_array_R1+=("${fastq}")
    done

    # Create array of fastq R2 files
    for fastq in "${fastq_dir}"/*R2*.gz
    do
      fastq_array_R2+=("${fastq}")
    done

    # Create array of sample names
    ## Uses parameter substitution to strip leading path from filename
    ## Uses awk to parse out sample name from filename
    for R1_fastq in "${fastq_dir}"/*R1*.gz
    do
      names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
    done

    # Create list of fastq files used in analysis
    ## Uses parameter substitution to strip leading path from filename
    for fastq in "${fastq_dir}"/*.gz
    do
      echo "${fastq#${fastq_dir}}" >> fastq.list.txt
    done

    # Create transcripts GTF from genome GFF
    "${gffread}" -T \
    "${genome_gff}" \
    -o "${transcripts_gtf}"

    # Create Hisat2 exons tab file
    "${hisat2_exons}" \
    "${transcripts_gtf}" \
    > "${exons}"

    # Create Hisat2 splice sites tab file
    "${hisat2_splice_sites}" \
    "${transcripts_gtf}" \
    > "${splice_sites}"

    # Build Hisat2 reference index using splice sites and exons
    "${hisat2_build}" \
    "${genome_fasta}" \
    "${genome_index_name}" \
    --exon "${exons}" \
    --ss "${splice_sites}" \
    -p "${threads}" \
    2> hisat2_build.err

    # Copy Hisat2 index files to my data directory
    rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"

Sam’s Notebook: Genome Annotation – Pgenerosa_v070 Hisat2 Transcript Isoform Index

This is the first step in getting transcript isoform annotations. The annotations and alignments that will be generated with Stringtie will be used to help us get a better grasp of what’s going on with our annotations of the different Panopea generosa genome assembly versions.

Essentially, the steps below (which is what was done here) are needed to prepare files for use with Stringtie:

  1. Create GTF file (Gene Transfer Format; a stricter relative of GFF that is commonly used for transcript annotations) from input GFF file. Done with GFF utilities software (gffread).
  2. Identify splice sites and exons in newly-created GTF. Done with Hisat2 software.
  3. Create a Hisat2 reference index that utilizes the GTF. Done with Hisat2 software.

This was run on Mox.

The SBATCH script has a bunch of leftover extraneous steps that aren’t relevant to this step of the annotation process; specifically the FastQ manipulation steps. This is due to a copy/paste from a previous Hisat2 run that I neglected to edit out and I didn’t want to edit the script after I actually ran it, so have left it in here.

SBATCH script (GitHub):

20190723_hisat2-build_pgen_v070.sh

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=oly_hisat2
    ## Allocation Definition
    #SBATCH --account=srlab
    #SBATCH --partition=srlab
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=25-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ##turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_hisat2-build_pgen_v070

    # Exit script if any command fails
    set -e

    # Load Python Mox module for Python module availability
    module load intel-python3_2017

    # Document programs in PATH (primarily for program version ID)
    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo "${PATH}" | tr : \\n >> system_path.log

    threads=28
    genome_index_name="Pgenerosa_v070"

    # Paths to programs
    gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
    hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
    hisat2_build="${hisat2_dir}/hisat2-build"
    hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
    hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

    # Input/output files
    fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq"
    genome_dir="/gscratch/srlab/sam/data/P_generosa/genomes"
    genome_gff="${genome_dir}/Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gff"
    exons="hisat2_exons.tab"
    genome_fasta="${genome_dir}/Pgenerosa_v070.fa"
    splice_sites="hisat2_splice_sites.tab"
    transcripts_gtf="Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gtf"

    ## Initialize arrays
    fastq_array_R1=()
    fastq_array_R2=()

    # Create array of fastq R1 files
    for fastq in "${fastq_dir}"/*R1*.gz
    do
      fastq_array_R1+=("${fastq}")
    done

    # Create array of fastq R2 files
    for fastq in "${fastq_dir}"/*R2*.gz
    do
      fastq_array_R2+=("${fastq}")
    done

    # Create array of sample names
    ## Uses parameter substitution to strip leading path from filename
    ## Uses awk to parse out sample name from filename
    for R1_fastq in "${fastq_dir}"/*R1*.gz
    do
      names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
    done

    # Create list of fastq files used in analysis
    ## Uses parameter substitution to strip leading path from filename
    for fastq in "${fastq_dir}"/*.gz
    do
      echo "${fastq#${fastq_dir}}" >> fastq.list.txt
    done

    # Create transcripts GTF from genome GFF
    "${gffread}" -T \
    "${genome_gff}" \
    -o "${transcripts_gtf}"

    # Create Hisat2 exons tab file
    "${hisat2_exons}" \
    "${transcripts_gtf}" \
    > "${exons}"

    # Create Hisat2 splice sites tab file
    "${hisat2_splice_sites}" \
    "${transcripts_gtf}" \
    > "${splice_sites}"

    # Build Hisat2 reference index using splice sites and exons
    "${hisat2_build}" \
    "${genome_fasta}" \
    "${genome_index_name}" \
    --exon "${exons}" \
    --ss "${splice_sites}" \
    -p "${threads}" \
    2> hisat2_build.err

    # Copy Hisat2 index files to my data directory
    rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"

RESULTS

Took about an hour to run:

Screencap of Hisat2 runtime

Output folder:

The Hisat2 index files are: *.ht2. These will be used with Stringtie for transcript isoform annotation.
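
Before pointing Stringtie-related alignments at the index, it may be worth confirming that the full set of index files was copied. The sketch below counts them; the function name `count_ht2_files` is hypothetical, and the expectation of eight files (`.1.ht2` through `.8.ht2`) is an assumption based on how hisat2-build normally names a standard (non-large) index, not something recorded in this entry.

```shell
#!/usr/bin/env bash

# count_ht2_files DIR INDEX_NAME
# Prints how many INDEX_NAME.*.ht2 files exist in DIR.
count_ht2_files() {
  local dir="$1" name="$2" count=0 f
  for f in "${dir}/${name}".*.ht2; do
    # An unmatched glob stays literal, so only count paths that exist
    if [ -e "${f}" ]; then
      count=$((count + 1))
    fi
  done
  echo "${count}"
}
```

A check might then read `[ "$(count_ht2_files "${genome_dir}" Pgenerosa_v070)" -eq 8 ] || echo "index looks incomplete"`, with `genome_dir` set as in the SBATCH script.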

from Sam’s Notebook https://ift.tt/2JMm9jt

Yaamini’s Notebook: Ecology of Infectious Marine Diseases TA Recap

Things I did at FHL that were not my research

For the past five weeks, I’ve been at Friday Harbor Laboratories as a Teaching Assistant for the Ecology of Infectious Marine Diseases course! I thought it would be good to recap what I did, since I was working but didn’t have time for my own research. In addition to finding lab equipment and helping out with fieldwork, I gave some lectures, helped students with a genomics project, and spearheaded the formal science communication section of the curriculum.

Teaching molecular methods

For my first lecture, I taught students about GitHub and basic Linux commands. I had students navigate to this GitHub repository and clone it to their machines. I then walked through the steps I laid out in this document. I had students learn to change their working directory and navigate through directories using the command line interface. I opened up a Terminal window and placed a Finder window under it so students could see how the commands I typed in the Terminal changed the directory structure and files. I also emphasized the use of tab-complete to make things easier and avoid typos in code. The next set of commands I walked through involved downloading FASTA files from web links. Looking back on it now, I probably should have taught them about checksums, but I think their brains may have exploded a bit. I had them create new directories, move FASTA files into directories, and remove extra files. Finally, I went through commands to explore files from the command line.
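
For a future version of that lesson, a checksum check after downloading could be as short as the sketch below. It assumes GNU coreutils' `sha256sum` is available (on macOS, `shasum -a 256` would be the usual stand-in), and the function name `verify_sha256` is hypothetical.

```shell
#!/usr/bin/env bash

# verify_sha256 FILE EXPECTED_HEX
# Succeeds (exit status 0) only if FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
  local actual
  actual=$(sha256sum "$1" | awk '{ print $1 }')
  [ "${actual}" = "$2" ]
}
```

Students would paste the digest published alongside a FASTA download as the second argument, for example `verify_sha256 sequences.fasta <published digest> && echo "download OK"`, where the file name is a placeholder.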

Later that day, I taught them how to BLAST from the command line! I wanted them to type the code themselves, but we were unable to download BLAST on the computers in the Computer Lab. I walked through the code in this document so they could try reviewing it themselves later. What proved more useful was visiting the UniProt Swiss-Prot database, where I taught them about the database and how to get GO terms.

At the end of our genomics block, I gave a lecture on my own work! Since Colleen focused on transcriptomics, I used my work as case studies on the use of proteomics and epigenetics to study how abiotic stressors affect organismal physiology. Students were really interested in specific methods, so I added in a lot of detail. I think those extra details may have prevented some students from seeing the big picture. I’ll need to work on that if I give that lecture again.

A large component of the course was working on projects. For the project examining NIX prevalence in Kalaloch Beach razor clams, I assisted students with DNA extractions and qPCR. I spent most of my time assisting with the transcriptomics project looking at eelgrass wasting disease host-pathogen interactions. I created this Jupyter notebook to merge transcriptomic data with BLAST output and UniProt Swiss-Prot annotations. I also streamlined isoforms into genes. In this R Markdown file, I formatted input files for gene enrichment with GO-MWU. I used the GO-MWU pipeline for eelgrass and pathogen transcriptomic data, but only found two enriched GO terms for eelgrass. I also created this R Markdown document to help students create heatmaps of differentially expressed genes. They adapted my code to make their own heatmaps for genes of interest. I think any molecular method is difficult to follow if you have little to no experience, or if you don’t have a great understanding of R or Linux. One thing that (I think) helped while I was teaching was to constantly remind students of the purpose of each step.

Science communication

The other part of the course I helped with was science communication practice. Students were required to write one blog post for a public audience and create a short talk about a disease for the class. I worked with each student on their blog post and provided targeted feedback when editing their initial drafts. I also had students give practice presentations to me so I could help them if they needed it. Most of my comments were directed at improving the organization of their pieces or talks. I took it upon myself to help them outline their final papers and reviewed their final presentations so they would not have similar issues.

Overall, TAing at FHL was a lot of work but really rewarding! I learned a lot about what it means to be a good instructor and I cannot wait to flex those skills soon.

from the responsible grad student https://ift.tt/2y1cqir
via IFTTT

Shelly’s Notebook: Mon. Jul. 22, 2019 Salmon sea lice methylomes

run fastqc and trim

  • I ended up getting this to run on Ostrich via this jupyter notebook:
  • The multiqc instance I installed on Ostrich via didn’t seem to work properly (errors in jupyter notebook output) so I ran this on emu and it worked
      scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.zip srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
      scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.html srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ multiqc .
      [INFO   ] multiqc : This is MultiQC v0.9
      [INFO   ] multiqc : Template : default
      [INFO   ] multiqc : Searching '.'
      [INFO   ] fastqc : Found 92 reports
      [INFO   ] multiqc : Report : multiqc_report.html
      [INFO   ] multiqc : Data : multiqc_data
      [INFO   ] multiqc : MultiQC complete
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rsync --archive --progress --verbose multiqc_data strigg@ostrich.fish.washington.edu:/Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722
      Password:
      building file list ...
      6 files to consider
      multiqc_data/
      multiqc_data/.multiqc.log
        6,880 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=4/6)
      multiqc_data/multiqc_fastqc.txt
        20,380 100% 19.44MB/s 0:00:00 (xfr#2, to-chk=3/6)
      multiqc_data/multiqc_general_stats.txt
        6,880 100% 6.56MB/s 0:00:00 (xfr#3, to-chk=2/6)
      multiqc_data/multiqc_report.html
        2,567,171 100% 18.14MB/s 0:00:00 (xfr#4, to-chk=1/6)
      multiqc_data/multiqc_sources.txt
        12,078 100% 87.37kB/s 0:00:00 (xfr#5, to-chk=0/6)
      sent 2,614,141 bytes received 500 bytes 747,040.29 bytes/sec
      total size is 2,613,389 speedup is 1.00
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.html
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.zip
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm -r multiqc_data/
  • Raw sequence FASTQC output folder:
  • Raw sequence MultiQC report (HTML):
  • TrimGalore! output folder (adapter trimmed FastQ files):
  • Adapter trimming MultiQC report (HTML):
  • TrimGalore hard trim output folder (first 5bp trimmed from 5’ of each read):
  • Hard trim MultiQC report (HTML):

copy the salmon genome and sea lice genome to mox

  • Steven shared the C. rogercresseyi genome in this CaligusLIFE Slack channel thread.
    • I downloaded it locally and then copied it to Mox:
        Shellys-MacBook-Pro:coverage Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Caligus/GENOMES
        Password:
        Enter passcode or select one of the following options:
        1. Duo Push to Android (XXX-XXX-0029)
        2. Phone call to Android (XXX-XXX-0029)
        Duo passcode or option [1-2]: 1
        building file list ...
        1 file to consider
        Caligus_rogercresseyi_Genome.fa
          515420160 100% 19.75MB/s 0:00:24 (xfer#1, to-check=0/1)
        sent 515483225 bytes received 42 bytes 14122829.23 bytes/sec
        total size is 515420160 speedup is 1.00
    • I also copied it to Gannett
        Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/Caligus
        building file list ...
        1 file to consider
        Caligus_rogercresseyi_Genome.fa
          515420160 100% 12.72MB/s 0:00:38 (xfer#1, to-check=0/1)
        sent 515483225 bytes received 42 bytes 13050209.29 bytes/sec
        total size is 515420160 speedup is 1.00
  • I downloaded the most recent RefSeq version of the S. salar genome on Gannett:
      ostrich:RefSeq strigg$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
      --2019-07-19 12:21:17-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
      => ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’
      Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::11
      Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
      Logging in as anonymous ... Logged in!
      ==> SYST ... done. ==> PWD ... done.
      ==> TYPE I ... done. ==> CWD (1) /genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2 ... done.
      ==> SIZE GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... 759073402
      ==> PASV ... done. ==> RETR GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... done.
      Length: 759073402 (724M) (unauthoritative)
      GCF_000233375.1_ICSASG_v2_genomic.fna.gz 100%[=================================================================================================================================================================>] 723.91M 631KB/s in 19m 53s
      2019-07-19 12:41:12 (621 KB/s) - ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’ saved [759073402]
    • then copied the S. salar RefSeq genome to Mox:
        Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/v2/RefSeq/GCF_000233375.1_ICSASG_v2_genomic.fna.gz strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Ssalar/GENOMES
        Password:
        Enter passcode or select one of the following options:
        1. Duo Push to Android (XXX-XXX-0029)
        2. Phone call to Android (XXX-XXX-0029)
        Duo passcode or option [1-2]: 1
        building file list ...
        1 file to consider
        GCF_000233375.1_ICSASG_v2_genomic.fna.gz
          759073402 100% 7.68MB/s 0:01:34 (xfer#1, to-check=0/1)
        sent 759166220 bytes received 42 bytes 7195888.74 bytes/sec
        total size is 759073402 speedup is 1.00

copy data from Gannett to owl

NEXT STEPS:

  1. Concatenate data from different lanes together (L001 + L002 for each sample)
  2. Transfer concatenated trimmed reads from Gannett to Mox
  3. Determine alignment parameters for:
    • subset of Calig. data
    • subset of Salmon data
  4. Run Bismark on all data
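
For step 1, lane concatenation could be sketched roughly as below. This is only a sketch under assumptions: the function name `concat_lanes` is hypothetical, and it assumes Illumina-style filenames containing `_L001_` and `_L002_`, which may not match the actual trimmed file names. Gzipped files can be joined with plain `cat`, since a concatenation of gzip files is itself a valid gzip file.

```shell
# Hypothetical helper, for illustration only.
# $1 = directory holding per-lane gzipped FastQ files
# $2 = directory to write the concatenated files to
concat_lanes() {
  in_dir="$1"
  out_dir="$2"
  mkdir -p "${out_dir}"
  for lane1 in "${in_dir}"/*_L001_*.gz; do
    # Skip the literal pattern if the glob matched nothing
    [ -e "${lane1}" ] || continue
    base="$(basename "${lane1}")"
    # Derive the matching lane 2 filename
    lane2="${in_dir}/$(printf '%s\n' "${base}" | sed 's/_L001_/_L002_/')"
    if [ ! -e "${lane2}" ]; then
      echo "No L002 partner for ${base}, skipping" >&2
      continue
    fi
    # Output filename drops the lane field
    merged="${out_dir}/$(printf '%s\n' "${base}" | sed 's/_L001_/_/')"
    cat "${lane1}" "${lane2}" > "${merged}"
  done
}

# Example usage (paths are placeholders):
# concat_lanes /path/to/trimmed_reads /path/to/concatenated_reads
```

Samples missing a lane 2 file are skipped with a warning rather than copied, so any gaps show up before the transfer to Mox.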

from shellytrigg https://ift.tt/2JNJh1f
via IFTTT