Sam’s Notebook: Data Wrangling – Rename Pgenerosa_v074 Files and Scaffolds

As part of continuing to organize files for a manuscript on our geoduck genome assembly/annotation, we decided to rename both the files and the scaffolds so that the naming is consistent and a bit easier to read (for humans and computers alike).

Currently, most of the GFF and BED files are named something like:

  • Panopea-generosa-vv0.74.a4.rRNA.gff3

A couple of other files (like the assembly FastA) have names like this:

  • Pgenerosa_v074.fa

The scaffolds within each of the files are named like so:

  • PGA_scaffold18__69_contigs__length_27737463

We want the filenames to look like this:

  • Panopea-generosa-v1.0
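
As a rough illustration of the filename change (not the notebook's actual commands), the sketch below maps both old naming styles onto the new prefix. The function name `new_name` is hypothetical, and it assumes that everything after the old version prefix (e.g. `.a4.rRNA.gff3`, `.fa`) is kept unchanged. The post only specifies the new prefix itself.

```shell
#!/bin/sh
# Hypothetical helper: print the new-style name for an old-style filename.
# Assumption: only the leading version prefix changes; the rest of the
# filename (annotation tag, feature type, extension) is left as-is.
new_name() {
  printf '%s\n' "$1" | sed -E \
    -e 's/^Panopea-generosa-vv0\.74/Panopea-generosa-v1.0/' \
    -e 's/^Pgenerosa_v074/Panopea-generosa-v1.0/'
}

# Dry run: print the old -> new mapping without touching any files.
for f in Panopea-generosa-vv0.74.a4.rRNA.gff3 Pgenerosa_v074.fa; do
  printf '%s -> %s\n' "$f" "$(new_name "$f")"
done
```

To apply this for real, the `printf` in the loop could be swapped for `mv -- "$f" "$(new_name "$f")"` once the dry-run output looks right.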

We want the scaffold names to look like this:

  • Scaffold_01
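
A minimal sketch of the scaffold renaming, assuming the original scaffold number is kept and zero-padded to two digits (so `PGA_scaffold18__...` would become `Scaffold_18`). The post does not state the exact mapping, so treat that as an assumption. The function name `rename_scaffolds` is hypothetical. It reads text on stdin, so the same filter could be used on FastA headers, GFF, or BED lines.

```shell
#!/bin/sh
# Hypothetical filter: rewrite old scaffold IDs to the new style.
# Assumption: the scaffold number is preserved and zero-padded to 2 digits.
rename_scaffolds() {
  sed -E \
    -e 's/PGA_scaffold([0-9])__[0-9]+_contigs__length_[0-9]+/Scaffold_0\1/g' \
    -e 's/PGA_scaffold([0-9]{2,})__[0-9]+_contigs__length_[0-9]+/Scaffold_\1/g'
}

# Example: a FastA header followed by a sequence line.
printf '>PGA_scaffold18__69_contigs__length_27737463\nACGT\n' | rename_scaffolds
```

The first expression handles single-digit scaffold numbers (adding the leading zero) and the second handles numbers that already have two or more digits. Writing the output to a new file and comparing line counts with the original is a cheap sanity check before replacing anything.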

I processed all of the necessary files and documented the process in the following Jupyter Notebook (GitHub):


#!/bin/bash
## Job Name
#SBATCH --job-name=c2c_l2
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=00-12:00:00
## Memory per node
#SBATCH --mem=100G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/1104c/



# Directories and programs
bismark_dir="/gscratch/srlab/programs/Bismark-0.21.0"
#bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
#samtools="/gscratch/srlab/programs/samtools-1.9/samtools"




source /gscratch/srlab/programs/scripts/paths.sh



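# For each deduplicated Bismark coverage file, strip the shared suffix to get
# the sample name, then run coverage2cytosine on that file. Output files are
# prefixed with the sample name, and --merge_CpG combines both strands of
# each CpG into a single entry.
find /gscratch/srlab/sr320/data/geoduck/cov_files/*_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz \
| xargs basename -s _R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz | xargs -I{} "${bismark_dir}/coverage2cytosine" \
--genome_folder /gscratch/srlab/sr320/data/geoduck/v074 \
-o {}_ \
--merge_CpG \
/gscratch/srlab/sr320/data/geoduck/cov_files/{}_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz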
