Adding Individual Libraries

DESeq2: First, I finished up my first quarter! Congratulations to me! Alright, I’ve made a lot of progress since last time. For the moment, building an index for transcriptome 3.0 is on the backburner – the top priority is just getting a list of differentially expressed genes. I was able to…

from Aidan F. Coyle https://ift.tt/34n40me
via IFTTT

Samples Received – Cockle Clam Gonad H and E Slides

Today we received the H & E-stained slides from the cockle clam gonad tissue blocks/cassettes we submitted on 20201201. Slides were added to Slide Case #5 – Rows 13 – 37 (Google Sheet).

All info has been added to:

from Sam’s Notebook https://ift.tt/3oZHVlA
via IFTTT

FastQC-MultiQC – M.magister MBD-BSseq Pool Test MiSeq Run on Mox

Earlier today we received the M.magister (C.magister; Dungeness crab) MiSeq data from Mac.

I ran FastQC and MultiQC on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20201211_mmag_fastqc_multiqc_mbd-bsseq_miseq
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201211_mmag_fastqc_multiqc_mbd-bsseq_miseq

### FastQC assessment of raw MiSeq sequencing test run for
### MBD-BSseq pool of M.magister samples from 20201202.

###################################################################################
# These variables need to be set by user

# FastQC output directory
output_dir=$(pwd)

# Set number of CPUs to use
threads=28

# Input/output files
checksums=fastq_checksums.md5
fastq_list=fastq_list.txt
raw_reads_dir=/gscratch/srlab/sam/data/C_magister/MBD-BSseq/

# Paths to programs
fastqc=/gscratch/srlab/programs/fastqc_v0.11.9/fastqc
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

# Programs associative array
declare -A programs_array
programs_array=(
[fastqc]="${fastqc}" \
[multiqc]="${multiqc}"
)

###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${raw_reads_dir}"CH*.fastq.gz .

# Populate array with FastQ files
fastq_array=(CH*.fastq.gz)

# Pass array contents to new variable
fastqc_list=$(echo "${fastq_array[*]}")

# Run FastQC
# NOTE: Do NOT quote ${fastqc_list}
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${output_dir} \
${fastqc_list}

# Create list of fastq files used in analysis
echo "${fastqc_list}" | tr " " "\n" >> ${fastq_list}

# Generate checksums for reference
while read -r line
do
  # Generate MD5 checksums for each input FastQ file
  echo "Generating MD5 checksum for ${line}."
  md5sum "${line}" >> "${checksums}"
  echo "Completed: MD5 checksum for ${line}."
  echo ""

  # Remove fastq files from working directory
  echo "Removing ${line} from directory"
  rm "${line}"
  echo "Removed ${line} from directory"
  echo ""
done < ${fastq_list}

# Run MultiQC
${programs_array[multiqc]} .

# Capture program options
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${program}: "
  echo ""
  # Handle samtools help menus
  if [[ "${program}" == "samtools_index" ]] \
  || [[ "${program}" == "samtools_sort" ]] \
  || [[ "${program}" == "samtools_view" ]]
  then
    ${programs_array[$program]}
  fi
  ${programs_array[$program]} -h
  echo ""
  echo ""
  echo "
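The "do NOT quote" note in the script above is worth illustrating. A minimal sketch (hypothetical filenames, run in bash) showing why the joined file list must be left unquoted when passed to FastQC:

```shell
#!/bin/bash
# Hypothetical filenames for illustration only.
fastq_array=(sample_A.fastq.gz sample_B.fastq.gz)

# Join the array into a single space-separated string, as the script does.
fastqc_list=$(echo "${fastq_array[*]}")

# Unquoted: the shell word-splits on spaces, so a program sees two filename arguments.
set -- ${fastqc_list}
echo "unquoted: $# arguments"   # unquoted: 2 arguments

# Quoted: one argument containing a space, i.e. a single (nonexistent) filename.
set -- "${fastqc_list}"
echo "quoted: $# arguments"     # quoted: 1 arguments
```

This only works safely because the FastQ filenames contain no spaces; filenames with spaces would need the array passed directly as `"${fastq_array[@]}"` instead.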

Data Received – M.magister MBD-BSseq Pool Test MiSeq Run

After creating M.magister (C.magister; Dungeness crab) MBD-BSseq libraries (on 20201124), I gave the pooled set of samples to Mac for a test sequencing run on the MiSeq on 20201202.

MiSeq data consisted of 76bp paired-end (PE) sequencing.

All files were downloaded to the C_magister folder on Owl (Synology server).

Have added files to our high-throughput sequencing database (Google Sheet):

Next up:

from Sam’s Notebook https://ift.tt/3gMdQD6
via IFTTT

Starting Kallisto

TRINITY Pipeline: Well, there’s good news and bad news. The good news is that I don’t actually need to run the TRINITY pipeline to create a reference transcriptome – we already have some reference transcriptomes assembled! The bad news is…dang, that’s a lot of work down the drain. As a…

from Aidan F. Coyle https://ift.tt/3n8o0k4
via IFTTT

Alignment – C.gigas RNAseq to GCF_000297895.1_oyster_v9 Genome Using STAR on Mox

Mac was getting some weird results when mapping some single cell RNAseq data to the C.gigas mitochondrial (mt) genome that she had, so she asked for some help mapping other C.gigas RNAseq data (GitHub Issue) to the C.gigas mt genome to see if someone else would get similar results.

Per Mac’s suggestion, I used STAR to perform an RNAseq alignment.

I used a genome FastA and transcriptome GTF file that she had previously provided in this GitHub Issue, so I don’t know much about their origination/history.

For RNAseq data, I used the only Roberts Lab C.gigas data I could find (see Nightingales (Google Sheet) for more info), which was surprisingly limited. I didn’t realize that we’ve performed so few RNAseq experiments with C.gigas.

I used the following files for the alignment:

RNAseq (FastQ):

Genome FastA (540MB):

Transcriptome GTF (380MB):

This was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20201208_cgig_STAR_RNAseq-to-NCBI-GCF_000297895.1_oyster_v9
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201208_cgig_STAR_RNAseq-to-NCBI-GCF_000297895.1_oyster_v9

### C.gigas RNAseq alignment to NCBI genome FastA file from Mac GCF_000297895.1_oyster_v9_genomic.fasta.
### Mackenzie Gavery asked for help to evaluate RNAseq read mappings to mt genome.

###################################################################################
# These variables need to be set by user

# Working directory
wd=$(pwd)

# Set number of CPUs to use
threads=28

# Initialize arrays
fastq_array=()

# Input/output files
fastq_checksums=fastq_checksums.md5
genome_fasta_checksum=genome_fasta_checksum.md5
gtf_checksum=gtf_checksum.md5
rnaseq_reads_dir=/gscratch/srlab/sam/data/C_gigas/RNAseq
gtf=/gscratch/srlab/sam/data/C_gigas/transcriptomes/GCF_000297895.1_oyster_v9_genomic.gtf.wl_keep_mito_v7.sorted.gtf
genome_dir=${wd}/genome_dir
genome_fasta=/gscratch/srlab/sam/data/C_gigas/genomes/GCF_000297895.1_oyster_v9_genomic.fasta

# Paths to programs
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc
samtools="/gscratch/srlab/programs/samtools-1.10/samtools"
star=/gscratch/srlab/programs/STAR-2.7.6a/bin/Linux_x86_64_static/STAR

# Programs associative array
declare -A programs_array
programs_array=(
[multiqc]="${multiqc}" \
[samtools_index]="${samtools} index" \
[samtools_sort]="${samtools} sort" \
[samtools_view]="${samtools} view" \
[star]="${star}"
)

###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Load GCC OMP compiler. Might/not be needed for STAR
module load gcc_8.2.1-ompi_4.0.2

# Make STAR genome directory
mkdir --parents ${genome_dir}

# Populate RNAseq array
fastq_array=(${rnaseq_reads_dir}/*.fastq)

# Comma separated list required for STAR mapping
# Uses tr to change spaces between elements to commas
fastq_list=$(tr ' ' ',' <<< "${fastq_array[@]}")

# Create STAR genome indexes
# Overhang value is set to "generic" 100bp -
# this value is unknown and is the suggested default in
# STAR documentation.
${programs_array[star]} \
--runThreadN ${threads} \
--runMode genomeGenerate \
--genomeDir ${genome_dir} \
--genomeFastaFiles ${genome_fasta} \
--sjdbGTFfile ${gtf} \
--sjdbOverhang 100 \
--genomeSAindexNbases 13

# Run STAR mapping
# Sets output to sorted BAM file
${programs_array[star]} \
--runThreadN ${threads} \
--genomeDir ${genome_dir} \
--outSAMtype BAM SortedByCoordinate \
--readFilesIn ${fastq_list}

# Index BAM output file
${programs_array[samtools_index]} \
Aligned.sortedByCoord.out.bam

# Extract mt alignments
# -h: includes header
${programs_array[samtools_view]} \
--threads ${threads} \
--write-index \
-h \
Aligned.sortedByCoord.out.bam NC_001276.1 \
-o Aligned.sortedByCoord.out.NC_001276.1.bam

# Generate checksums for reference
# Uses bash string substitution to replace commas with spaces
# NOTE: do NOT quote string substitution command
for fastq in ${fastq_list//,/ }
do
  # Generate MD5 checksums for each input FastQ file
  echo "Generating MD5 checksum for ${fastq}."
  md5sum "${fastq}" >> "${fastq_checksums}"
  echo "Completed: MD5 checksum for ${fastq}."
  echo ""
done

# Run MultiQC
${programs_array[multiqc]} .

# Generate checksums for genome FastA and GTF
echo "Generating MD5 checksum for ${genome_fasta}."
md5sum "${genome_fasta}" > "${genome_fasta_checksum}"
echo "Completed: MD5 checksum for ${genome_fasta}."
echo ""
echo "Generating MD5 checksum for ${gtf}."
md5sum "${gtf}" > "${gtf_checksum}"
echo "Completed: MD5 checksum for ${gtf}."
echo ""

# Capture program options
echo "Logging program options..."
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${program}: "
  echo ""
  # Handle samtools help menus
  if [[ "${program}" == "samtools_index" ]] \
  || [[ "${program}" == "samtools_sort" ]] \
  || [[ "${program}" == "samtools_view" ]]
  then
    ${programs_array[$program]}
  fi
  ${programs_array[$program]} -h
  echo ""
  echo ""
  echo "
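The comma-separated list construction in the script above (STAR's `--readFilesIn` wants commas, not spaces, between files) can be sketched in isolation. Filenames here are hypothetical:

```shell
#!/bin/bash
# Hypothetical FastQ filenames for illustration only.
fastq_array=(lib1.fastq lib2.fastq lib3.fastq)

# The herestring expands the array space-separated; tr swaps spaces for commas,
# producing the format STAR's --readFilesIn expects.
fastq_list=$(tr ' ' ',' <<< "${fastq_array[@]}")
echo "${fastq_list}"   # lib1.fastq,lib2.fastq,lib3.fastq
```

The matching reverse step appears later in the script, where `${fastq_list//,/ }` substitutes the commas back to spaces so the files can be iterated for checksumming.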

SRA Submission – Haws Lab C.gigas Ploidy pH WGBS

I submitted the 24 C.gigas diploid/triploid pH-treated WGBS sequence data sets we received on 20201205 to the NCBI Sequence Read Archive (SRA).

The samples were registered as part of the following BioProject:

This accession number is what should be referenced for any publications with these samples. This has been added to Nightingales (Google Sheet), the Roberts Lab NGS database.

Here is a table which contains individual sample accession numbers, in case they’re needed:

Sample_ID BioSample BioProject
2N_HI_05 SAMN17011249 PRJNA682817
2N_HI_08 SAMN17011255 PRJNA682817
2N_HI_09 SAMN17011256 PRJNA682817
2N_HI_10 SAMN17011257 PRJNA682817
2N_HI_11 SAMN17011258 PRJNA682817
2N_HI_12 SAMN17011259 PRJNA682817
2N_LOW_01 SAMN17011260 PRJNA682817
2N_LOW_02 SAMN17011261 PRJNA682817
2N_LOW_03 SAMN17011262 PRJNA682817
2N_LOW_04 SAMN17011239 PRJNA682817
2N_LOW_05 SAMN17011240 PRJNA682817
2N_LOW_06 SAMN17011241 PRJNA682817
3N_HI_02 SAMN17011242 PRJNA682817
3N_HI_03 SAMN17011243 PRJNA682817
3N_HI_05 SAMN17011244 PRJNA682817
3N_HI_08 SAMN17011245 PRJNA682817
3N_HI_10 SAMN17011246 PRJNA682817
3N_HI_11 SAMN17011247 PRJNA682817
3N_LOW_06 SAMN17011248 PRJNA682817
3N_LOW_07 SAMN17011250 PRJNA682817
3N_LOW_08 SAMN17011251 PRJNA682817
3N_LOW_10 SAMN17011252 PRJNA682817
3N_LOW_11 SAMN17011253 PRJNA682817
3N_LOW_12 SAMN17011254 PRJNA682817
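If the table above ever needs to be queried programmatically (e.g., to pull a BioSample accession for a manuscript), a small awk lookup works. This sketch embeds two hypothetical rows of the table as a TSV purely for illustration:

```shell
#!/bin/bash
# Build a throwaway two-row TSV mirroring the table's columns
# (Sample_ID, BioSample, BioProject). Values copied from the table above.
workdir=$(mktemp -d)
cd "${workdir}"
printf '2N_HI_05\tSAMN17011249\tPRJNA682817\n3N_LOW_12\tSAMN17011254\tPRJNA682817\n' > accessions.tsv

# Print the BioSample (column 2) for a requested Sample_ID (column 1).
awk -v id="3N_LOW_12" '$1 == id {print $2}' accessions.tsv   # SAMN17011254

# Clean up.
cd - > /dev/null && rm -r "${workdir}"
```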

from Sam’s Notebook https://ift.tt/3qD26rc
via IFTTT

Trimming – Haws Lab C.gigas Ploidy pH WGBS 10bp 5 and 3 Prime Ends Using fastp and MultiQC on Mox

Assuming that the 24 C.gigas ploidy pH WGBS data sets we received on 20201205 will be analyzed using Bismark, I decided to go ahead and trim the files according to the Bismark guidelines for libraries made with the ZymoResearch Pico MethylSeq Kit.

I trimmed the files using fastp.

Trimming removes adapters, plus 10bp from both the 5’ and 3’ ends of each read. The Bismark guidelines suggest that the user “probably should” trim in this fashion (as opposed to just trimming 10bp from the 5’ end).

The job was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20201206_cgig_fastp-10bp-5-3-prime_ploidy-pH-wgbs
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201206_cgig_fastp-10bp-5-3-prime_ploidy-pH-wgbs

### Fastp trimming of Haw's Lab ploidy pH WGBS.
### Trims adapters, 10bp from 5' and 3' ends of reads
### Trimming is performed according to recommendation for use with Bismark
### for libraries created using ZymoResearch Pico MethylSeq Kit:
### https://github.com/FelixKrueger/Bismark/blob/master/Docs/README.md#ix-notes-about-different-library-types-and-commercial-kits
### Expects input filenames to be in format: zr3644_3_R1.fq.gz

###################################################################################
# These variables need to be set by user

## Assign Variables

# Set number of CPUs to use
threads=27

# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
raw_reads_dir=/gscratch/srlab/sam/data/C_gigas/wgbs/
fastq_checksums=raw_fastq_checksums.md5

# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

## Initialize arrays
fastq_array_R1=()
fastq_array_R2=()
R1_names_array=()
R2_names_array=()

# Programs associative array
declare -A programs_array
programs_array=(
[fastp]="${fastp}" \
[multiqc]="${multiqc}"
)

###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${raw_reads_dir}"zr3644*.fq.gz .

# Create arrays of fastq R1 files and sample names
for fastq in *R1.fq.gz
do
  fastq_array_R1+=("${fastq}")
  R1_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[_.]"; OFS = "_"} {print $1, $2, $3}')")
done

# Create array of fastq R2 files
for fastq in *R2.fq.gz
do
  fastq_array_R2+=("${fastq}")
  R2_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[_.]"; OFS = "_"} {print $1, $2, $3}')")
done

# Run fastp on files
# Trim 10bp from 5' and 3' ends of each read
# Adds JSON report output for downstream usage by MultiQC
for index in "${!fastq_array_R1[@]}"
do
  R1_sample_name=$(echo "${R1_names_array[index]}")
  R2_sample_name=$(echo "${R2_names_array[index]}")
  ${fastp} \
  --in1 ${fastq_array_R1[index]} \
  --in2 ${fastq_array_R2[index]} \
  --detect_adapter_for_pe \
  --trim_front1 10 \
  --trim_front2 10 \
  --trim_tail1 10 \
  --trim_tail2 10 \
  --thread ${threads} \
  --html "${R1_sample_name}".fastp-trim."${timestamp}".report.html \
  --json "${R1_sample_name}".fastp-trim."${timestamp}".report.json \
  --out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
  --out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz

  # Generate md5 checksums for newly trimmed files
  {
    md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz
    md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
  } >> "${trimmed_checksums}"

  # Create list of fastq files used in analysis
  # Create MD5 checksum for reference
  echo "${fastq_array_R1[index]}" >> input.fastq.list.txt
  echo "${fastq_array_R2[index]}" >> input.fastq.list.txt
  md5sum "${fastq_array_R1[index]}" >> ${fastq_checksums}
  md5sum "${fastq_array_R2[index]}" >> ${fastq_checksums}

  # Remove original FastQ files
  rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done

# Run MultiQC
${multiqc} .

# Capture program options
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${program}: "
  echo ""
  # Handle samtools help menus
  if [[ "${program}" == "samtools_index" ]] \
  || [[ "${program}" == "samtools_sort" ]] \
  || [[ "${program}" == "samtools_view" ]]
  then
    ${programs_array[$program]}
  fi
  ${programs_array[$program]} -h
  echo ""
  echo ""
  echo "
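The sample-name extraction used in the script above (splitting filenames like `zr3644_3_R1.fq.gz` on underscores and dots, then rejoining the first three fields) can be sketched on its own, using the example filename from the script's comments:

```shell
#!/bin/bash
# Example filename in the format the script expects.
fastq="zr3644_3_R1.fq.gz"

# FS = "[_.]" splits on underscores AND dots: zr3644 / 3 / R1 / fq / gz.
# OFS = "_" rejoins the first three fields with underscores.
sample_name=$(echo "${fastq}" | awk 'BEGIN {FS = "[_.]"; OFS = "_"} {print $1, $2, $3}')
echo "${sample_name}"   # zr3644_3_R1
```

This is why the script warns about the expected filename format: a filename with extra underscores or dots before the read designation would yield a truncated or wrong sample name.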

FastQC-MultiQC – C.gigas Ploidy pH WGBS Raw Sequence Data from Haws Lab on Mox

Yesterday (20201205), we received the whole genome bisulfite sequencing (WGBS) data back from ZymoResearch for the 24 C.gigas diploid/triploid samples subjected to two different pH treatments (received from the Haws’ Lab on 20200820 and submitted to ZymoResearch on 20200824). As part of our standard sequencing data receipt pipeline, I needed to generate FastQC files for each sample.

FastQC was run on Mox.

Links to FastQC reports will be added to our NGS database spreadsheet, Nightingales (Google Sheet).

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20201206_cgig_fastqc_multiqc_ploidy-pH-wgbs
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201206_cgig_fastqc_multiqc_ploidy-pH-wgbs

### FastQC assessment of raw sequencing from Haw's Lab ploidy pH WGBS.

###################################################################################
# These variables need to be set by user

# FastQC output directory
output_dir=$(pwd)

# Set number of CPUs to use
threads=28

# Input/output files
checksums=fastq_checksums.md5
fastq_list=fastq_list.txt
raw_reads_dir=/gscratch/srlab/sam/data/C_gigas/wgbs/

# Paths to programs
fastqc=/gscratch/srlab/programs/fastqc_v0.11.9/fastqc
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

# Programs associative array
declare -A programs_array
programs_array=(
[fastqc]="${fastqc}" \
[multiqc]="${multiqc}"
)

###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${raw_reads_dir}"zr3644*.fq.gz .

# Populate array with FastQ files
fastq_array=(*.fq.gz)

# Pass array contents to new variable
fastqc_list=$(echo "${fastq_array[*]}")

# Run FastQC
# NOTE: Do NOT quote ${fastqc_list}
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${output_dir} \
${fastqc_list}

# Create list of fastq files used in analysis
echo "${fastqc_list}" | tr " " "\n" >> ${fastq_list}

# Generate checksums for reference
while read -r line
do
  # Generate MD5 checksums for each input FastQ file
  echo "Generating MD5 checksum for ${line}."
  md5sum "${line}" >> "${checksums}"
  echo "Completed: MD5 checksum for ${line}."
  echo ""

  # Remove fastq files from working directory
  echo "Removing ${line} from directory"
  rm "${line}"
  echo "Removed ${line} from directory"
  echo ""
done < ${fastq_list}

# Run MultiQC
${programs_array[multiqc]} .

# Capture program options
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${program}: "
  echo ""
  # Handle samtools help menus
  if [[ "${program}" == "samtools_index" ]] \
  || [[ "${program}" == "samtools_sort" ]] \
  || [[ "${program}" == "samtools_view" ]]
  then
    ${programs_array[$program]}
  fi
  ${programs_array[$program]} -h
  echo ""
  echo ""
  echo "

Data Received – C.gigas Diploid-Triploid pH Treatments Ctenidia WGBS from ZymoResearch

Today we received the whole genome bisulfite sequencing (WGBS) data from the 24 C.gigas diploid-triploid samples subjected to different pH treatments that were submitted 20200824 (https://ift.tt/2CYJaiG). The lengthy turnaround time was due to a bad lot of reagents, which forced Zymo to find a different manufacturer in order to generate the libraries.

Sequencing consisted of WGBS 150bp paired-end (PE) reads for each library. All files were downloaded to the C_gigas folder on Owl (Synology server). MD5 checksums were confirmed:

screencap of md5 checksum verification
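The checksum-confirmation step itself is simple enough to sketch. A minimal, self-contained example (hypothetical file; the real run compared the delivered FastQ files against the .md5 file ZymoResearch provided):

```shell
#!/bin/bash
# Work in a throwaway directory with a hypothetical data file.
workdir=$(mktemp -d)
cd "${workdir}"
echo "example data" > example.fq.gz

# Record the checksum, as a sequencing provider would.
md5sum example.fq.gz > checksums.md5

# md5sum -c re-reads each listed file and compares against the recorded
# checksum; it prints "OK" per file and exits non-zero on any mismatch.
md5sum -c checksums.md5   # example.fq.gz: OK

# Clean up.
cd - > /dev/null && rm -r "${workdir}"
```

Because `md5sum -c` exits non-zero on a mismatch, it also composes well with `set -e` in batch scripts: a corrupted download aborts the job immediately.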

Principal spreadsheet for this project was updated (Google Sheet):

Have added files to our high-throughput sequencing database (Google Sheet):

Next up:

  • FastQC
  • Submit to NCBI sequence read archive (SRA).
Zymo_ID Sample_ID Ploidy pH_treatment
zr3644_1 2N_HI_5 diploid high
zr3644_2 2N_HI_8 diploid high
zr3644_3 2N_HI_9 diploid high
zr3644_4 2N_HI_10 diploid high
zr3644_5 2N_HI_11 diploid high
zr3644_6 2N_HI_12 diploid high
zr3644_7 2N_LOW_1 diploid low
zr3644_8 2N_LOW_2 diploid low
zr3644_9 2N_LOW_3 diploid low
zr3644_10 2N_LOW_4 diploid low
zr3644_11 2N_LOW_5 diploid low
zr3644_12 2N_LOW_6 diploid low
zr3644_13 3N_HI_2 triploid high
zr3644_14 3N_HI_3 triploid high
zr3644_15 3N_HI_5 triploid high
zr3644_16 3N_HI_8 triploid high
zr3644_17 3N_HI_10 triploid high
zr3644_18 3N_HI_11 triploid high
zr3644_19 3N_LOW_6 triploid low
zr3644_20 3N_LOW_7 triploid low
zr3644_21 3N_LOW_8 triploid low
zr3644_22 3N_LOW_10 triploid low
zr3644_23 3N_LOW_11 triploid low
zr3644_24 3N_LOW_12 triploid low

from Sam’s Notebook https://ift.tt/2VF0hvw
via IFTTT