Sam’s Notebook: Data Wrangling – CpG OE Calculations on C.virginica Genes

In this GitHub Issue, Steven tasked me with processing ~90 FastA files containing gene sequences from C.virginica. He needed the Observed/Expected (O/E) ratio of CpGs determined for each FastA. He provided this example code and this link to all the files. Additionally, today, he tasked Kaitlyn with merging all of the output CpG O/E values for each sample into a single file, but I decided to tackle it anyway.
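
For readers unfamiliar with the metric, here is a minimal sketch of a CpG O/E calculation for a single sequence. It assumes one commonly used definition, (number of CpG / (number of C × number of G)) × (L² / (L − 1)), where L is the sequence length; the example code linked above may compute it differently. The function name `cpg_oe` and the test sequence are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not from the original notebook): CpG O/E for one sequence.
# Assumed formula: (CpG count / (C count * G count)) * (L^2 / (L - 1))
cpg_oe() {
  printf '%s\n' "$1" | awk '{
    seq = toupper($0)
    len = length(seq)
    # gsub returns the number of replacements, so replacing a pattern
    # with itself is a simple way to count occurrences.
    c  = gsub(/C/,  "C",  seq)
    g  = gsub(/G/,  "G",  seq)
    cg = gsub(/CG/, "CG", seq)
    # Avoid division by zero for sequences with no C, no G, or length < 2.
    if (c == 0 || g == 0 || len < 2) { print "NA"; next }
    printf "%.4f\n", (cg / (c * g)) * (len * len / (len - 1))
  }'
}

# Toy sequence: 3 C, 3 G, 2 CpG, length 8 -> (2/9) * (64/7)
cpg_oe "ACGTCGGC"   # prints 2.0317
```

A sequence lacking C or G returns `NA` rather than a ratio, which is a design choice of this sketch and not something the original analysis specifies.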

The CpG O/E determination was done in a Jupyter Notebook:

Interestingly, the processing (which relied on awk) specifically required gawk, due to the high number of output fields. The default awk implementation on the version of Ubuntu I was using was not gawk.
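
A quick way to check which awk implementation a system is using, and to fall back gracefully, might look like the sketch below. The helper name `pick_awk` is hypothetical, and `readlink -f` assumes GNU coreutils (as on Ubuntu), where `awk` is typically a symlink to the actual implementation.

```shell
#!/bin/bash
# Report which binary the `awk` command actually resolves to.
awk_path=$(readlink -f "$(command -v awk)")
echo "awk resolves to: ${awk_path}"

# Hypothetical helper: print "gawk" if it is installed, otherwise plain "awk".
pick_awk() {
  if command -v gawk >/dev/null 2>&1; then
    echo gawk
  else
    echo awk
  fi
}

# Use the selected implementation for subsequent processing.
AWK=$(pick_awk)
echo "1 2 3" | "${AWK}" '{print NF}'   # prints 3
```

Calling `gawk` by name in a script, rather than relying on whatever `awk` points to, makes the dependency explicit.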

The creation of a single file with all of the CpG O/E info is detailed in this bash script:

    #!/bin/bash

    ## Script to append sample-specific headers to each ID_CpG
    ## file and join all ID_CpG files.

    ## Run file from within this directory.

    # Temp file placeholder
    tmp=$(mktemp)

    # Create array of subdirectories.
    array=(*/)

    # Create column headers for ID_CpG files using sample name from directory name.
    for file in ${array[@]}
    do
      gene=$(echo ${file} | awk -F\[._] '{print $6"_"$7}')
      sed "1iID\t${gene}" ${file}ID_CpG > ${file}ID_CpG_labelled
    done

    # Create initial file for joining
    cp ${array[0]}ID_CpG_labelled ID_CpG_labelled_all

    # Loop through array and performs joins.
    for file in ${array[@]:1}
    do
      join \
      --nocheck-order \
      ID_CpG_labelled_all ${file}ID_CpG_labelled \
      | column -t \
      > ${tmp} \
      && mv ${tmp} ID_CpG_labelled_all
    done
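
To illustrate what the join step in the script above does, here is a toy example using two made-up labelled files. The file names, sample names, gene IDs, and values are all invented for illustration, and the `column -t` alignment step is left out so the raw `join` output is visible. It assumes GNU `join`, since `--nocheck-order` is a GNU coreutils option.

```shell
#!/bin/bash
# Toy data only: two labelled files sharing the same IDs in the same order.
workdir=$(mktemp -d)
printf 'ID\tsample_A\ngene1\t0.5\ngene2\t1.1\n' > "${workdir}/a_labelled"
printf 'ID\tsample_B\ngene1\t0.7\ngene2\t0.9\n' > "${workdir}/b_labelled"

# Join on the first field (the ID). --nocheck-order suppresses the
# complaint that the "ID" header line is out of sort order.
joined=$(join --nocheck-order "${workdir}/a_labelled" "${workdir}/b_labelled")
echo "${joined}"
# Output:
# ID sample_A sample_B
# gene1 0.5 0.7
# gene2 1.1 0.9

rm -r "${workdir}"
```

By default `join` keeps only IDs present in both inputs, so this approach relies on every ID_CpG file listing the same IDs in the same order.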