Sam’s Notebook: VCF Splitting with bcftools

0000-0002-2747-368X

Steven asked for some help trying to split a VCF in to individual VCF files.

VCF file (15GB): SNP.TRSdp5g95FnDNAmaf05.vcf.gz

Skip to the Results section if you don’t want to read through the tials and tribulations of getting this to work.

Here’s an overview of how I managed to get this to work and what didn’t work.

Figured out the VCF file needed to be sorted, bgzipped (part of htslib), and indexed with tabix, due to the following error when initially trying to process with VCF file using bcftools:

[W::vcf_parse] contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
Undefined tags in the header, cannot proceed in the sample subset mode.

So, I did that:

  • Sort and bgzip:
 cat SNP.TRSdp5g95FnDNAmaf05.vcf | \ awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' | \ bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz 
  • Index with tabix:
 tabix --preset vcf SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz 

This produced a separate file:

  • SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz.tbi.

It seems as though this file must exist in the same directory as the source VCF for it to be utilized, although no commands work directly with this index file.

Then, tried biostars solution, but produces an error

  #!/bin/bash for file in *.vcf.gz; do for sample in `bcftools query -l $file`; do bcftools view -c1 -Oz -s $sample -o ${file/.vcf*/.$sample.vcf.gz} $file done done  

Resulting error:

 [E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174 

And empty split VCF files…

Tried tabix on unsorted bgzipped file yields this error:

 [E::hts_idx_push] chromosome blocks not continuous 

Tried modified sort:

  cat SNP.TRSdp5g95FnDNAmaf05.vcf | \ awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1V -k2,2n"}' | \ bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz  

Produces this error:

 [E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174 

And empty split VCF files…

Changed to new version of “view” – trying “call” instead (it seems that bcftools view is deprecated?):

  #!/bin/bash for file in *.vcf.gz; do for sample in `bcftools query -l $file`; do bcftools call \ --consensus-caller \ --output-type z \ --threads 18 \ --samples $sample --output-file ${file/.vcf.gz/.$sample.vcf.gz} \ $file done done  

Still results in empty output files.

Based off of the repeated error about AN/AC counts, tried to fill AN/AC values…

  bcftools plugin fill-AN-AC SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz \ --output-type z \ --threads 18 \ --output SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz  

And, ran this code:

  #!/bin/bash for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do for sample in `bcftools query -l $file`; do bcftools call \ --consensus-caller \ --output-type z \ --threads 18 \ --samples $sample --output-file ${file/.vcf.gz/.$sample.vcf.gz} \ $file done done  

Still results in empty files…

Try original code again (expanded shortened arguments to improve readability):

  #!/bin/bash for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do for sample in `bcftools query -l $file`; do bcftools view \ --min-ac 1 \ --output-type z \ --samples $sample \ --output-file ${file/.vcf*/.$sample.vcf.gz} \ --threads 18 \ $file done done  

P.S. I realize the outermost for loop is not necessary, but it was faster/easier to just quickly modify the code from that Biostars solution.