vk phylo fasta <vcf> [<region>]
vk phylo tree (nj|upgma) [--plot] <vcf> [<region>]

The phylo command can be used to generate dendrograms, tree files, or a fasta file of variants concatenated together (equivelent to a multiple sequence alignment) from a VCF. Tree files are generated in Newick format) with MUSCLE using UPGMA or neighbor-joining. VCF-kit can use the output tree file to generate a plot of the tree/phylogeny.

phylo can read a VCF directly or from stdin by using -.

Generate fasta-alignment from variant calls

The phylo fasta command generates a fasta file by concatenating all single-nucleotide variants from a VCF for each sample. The first allele of each genotype is used (for example, for diploid organisms with an A/T genotype, A would always be used). Heterozygous calls should be avoided when using the phylo command. Missing values are replaced with a N. The fasta sequence is constructed in memory - so it can be somewhat resource intensive for larger files.

vk phylo fasta <vcf>


>QG536 <-- Sample 1
>GXW1 <-- Sample 2
>DL200 <-- Sample 3

Generate fasta-alignment from stdin

You may be interested in filtering variants prior to generating a fasta file. The command below reads from stdin.

bcftools filter --set-GTs . --exclude 'FMT/DP < 20'  data/test.vcf.gz | vk phylo fasta -

Generating a tree/phylogeny

vk phylo tree can be used to generate a tree/phylogeny from a vcf file. This command uses a fasta file (identical to what is produced using vk phylo fasta), and uses [MUSCLE](https://en.wikipedia.org/wiki/MUSCLE_(alignment_software) to produce a tree file.

Generate a UPGMA tree

An unweighted pair group method with arithmetic-mean (UPGMA) tree can be constructed using the following command. Output is in newick format.

vk phylo tree upgma <vcf>

Operating on regions

The phylo command can be used on specific regions or chromosomes.

Operate on specific chromosome

vk phylo tree upgma <vcf> I

Operate on specific region

vk phylo tree upgma <vcf> I:1-10000

Generate a neighbor-joining tree

An neighbor-joining tree can be constructed using the following command. Output is in newick format.

vk phylo tree nj <vcf>

Generate fasta sequences from variant data. This is useful for generating phylogenetic trees from VCF files.

Output format

Output - Newick format

The phylo tree command sends output to stdout is newick format. Newick format can be used


Plot a phylogeny using R

The following script can be used to plot your phylogeny using R. You will need to install tidyverse, ape, ggmap, and phyloseq to use it. You can install them using :

biocLite(c('ape','phyloseq','ggmap'), suppressUpdates=TRUE)

tree <- ape::read.tree(paste0("treefile.newick"))

# Optionally set an outgroup.
# tree <- root(tree,outgroup = "outgroup", resolve.root = T)

treeSegs <- phyloseq::tree_layout(
                                ladderize = T

treeSegs$edgeDT <- treeSegs$edgeDT  %>% 
                   dplyr::mutate(edge.length = 
                                    ifelse(edge.length < 0, 0, edge.length)
                                 , xright = xleft + edge.length
edgeMap = aes(x = xleft, xend = xright, y = y, yend = y)
vertMap = aes(x = x, xend = x, y = vmin, yend = vmax)
labelMap <- aes(x = xright+0.0001, y = y, label = OTU)

ggplot(data = treeSegs$edgeDT) + geom_segment(edgeMap) + 
  geom_segment(vertMap, data = treeSegs$vertDT) +
  geom_text(labelMap, data = dplyr::filter(treeSegs$edgeDT, !is.na(OTU)), na.rm = TRUE, hjust = -0.05) +
  ggmap::theme_nothing() + 
  scale_x_continuous(limits = c(
  expand = c(0,0))

The above script will output something that looks like this:

phylogeny example in R

Plot a phylogeny in your web browser

phylo tree can be used to generate a plot of a phylogeny in your web browser by adding the --plot flag.

vk phylo tree nj --plot <vcf>

phylogeny example