call command can be used to compare variants identified from Sanger sequencing with those present within a VCF. The
call command uses BLAST to perform variant calling. This is a not a sophisticated method of variant calling, and is merely meant to facilitate in comparing Sanger sequencing with variants called in a VCF. Only homozygous calls are supported. Sanger sequences can be provided as AB1/ABI, FASTQ, or FASTA files. Sequences must be annotated with the sample name if you want to compare the genotype calls.
VCF-kit first tries to identify the sample from sequences using their filename. Because
ABI/AB1 files can only contain one sequence, you must specify their sample using the filename. To identify the sample, VCF-kit searches for the sample name after splitting the string using the
\W regular expression. For example:
The following Sanger sequence files (in AB1, fa, and fastq format) are all acceptable ways of specifying that the sample from the VCF is
DL238.AB1 DL238.fa DL238.fastq.gz DL238.20161205.fq.gz DL238_20161205.fq.gz DL238-20161205.fq.gz
In the first line,
DL238.AB1 is split into
AB1. Among the split pieces, the name of a sample must match exactly.
The examples below do not work:
You may have multiple samples within a single fastq or fasta file. In this case, you can use the description line to annotate each record:
> DL238 CGTCAAGGGGCACCGGATCG...
> @FCC1GJUACXX:6:1101:19686:24985#AAGGTCTA/2 |DL238| CGTCAAGGGGCACCGGATCG...
And multiple samples may be present within the same file:
> DL238 CGTCAAGGGGCACCGGATCG... > N2 CGTCAAGGCGCACCGGATCG... > MY25 CGTCAAGGCGCACCGGATCG...
vk call <seq> --ref=<reference> [--all-sites --vcf-targets <vcf>]
Perform variant calling using blast. Useful for validating variants using sanger sequencing.
- <seq> - Fasta (fa), Fastq (fq, fastq), or ab1 format, determined by file extension containing sanger reads.
- --ref - The reference genome used to perform calling.
- --all-sites - Outputs comparison of every site that aligns within the alignment.
- --vcf-sites - Only outputs sites present within the VCF.
The following columns are output when using the
- CHROM - chromosome
- POS - position
- reference - The base at the genomic position within the reference genome. Should always be identical to
- REF - The reference genotype within the VCF. Only listed when a variant is present.
- ALT - The alternative genotype within the VCF. Only listed when a variant is present.
- seq_gt - The Sanger genotype for the sample listed.
- vcf_gt - The VCF genotype for the sample listed. Homozygous calls are listed as a single base.
- sample - The sample being compared. Must be matched to the Sanger sequence and present within the VCF.
- variant_type - The type of variant observed as compared to the Sanger sequence: SNV, Insertion, or Deletion.
- classification - True/False Positive/Negative, assuming the Sanger sequence is correct.
- index - The index of the Sanger sequence, as aligned.
- alignment_start - The start of the blast alignment.
- alignment_end - The end of the blast alignment.
- strand - The strand aligned to (
- gaps - Total number of gaps in the alignment.
- mismatch - Total number of mistmatches in the alignment.
- bitscore - The bitscore of the alignment.
Note: Some columns removed from examples for conciseness.
Shows all calls from Sanger and VCF. Will output every position that was aligned to reference from Sanger sequence and corresponding VCF calls. Notice the absence of REF/ALT for sites not called in the VCF below.
vk call DL238_sanger.AB1 --ref=WBcel235 --all-sites illumina_sequencing.vcf.gz
Only show variants present in the VCF.
vk call DL238_sanger.AB1 --ref=WBcel235 --vcf-sites illumina_resequencing.vcf.gz