I ran into an issue while doing some analysis for my dissertation. I’ve been working on comparing genetic distances between populations using a variety of molecular markers (mtDNA sequences, Y-STRs, and autosomal STRs). I wanted to generate several neighbor-joining trees to display the results, but I also wanted a way to test the statistical significance of the tree, or how accurate a representation of the underlying genetic distance data the tree actually was.
One way to do this is with bootstrapping, where thousands of pseudo-replicate data sets are generated from the original data (by resampling it with replacement and recalculating the tree each time). In the end you have a tree with internal branch values, showing how often each node turned up across the replicates. It’s a standard technique, and it is the method I used with my autosomal STR data. But the software I used to handle sequence data in particular (MEGA, PHYLIP) starts with the raw sequences and generates bootstrapped trees from that data. The trees created show each sequence on its own branch, rather than each population. With over 8,000 sequences in my data set, this type of analysis really wasn’t useful.
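For anyone who hasn't seen it spelled out, here is a rough sketch of just the resampling step of a bootstrap, assuming the data are stored as one list of per-locus values per population. The function name `bootstrap_replicates`, the population names, and the numbers are all made up for illustration; this is not how MEGA or PHYLIP implement it.

```python
import random

def bootstrap_replicates(genotypes, n_replicates, seed=0):
    """Yield pseudo-replicate data sets by resampling loci with replacement.

    genotypes maps a population name to a list of per-locus values;
    every list is assumed to have the same length.
    """
    rng = random.Random(seed)
    n_loci = len(next(iter(genotypes.values())))
    for _ in range(n_replicates):
        # Draw as many loci as the original has, allowing repeats.
        cols = [rng.randrange(n_loci) for _ in range(n_loci)]
        # Apply the same draw to every population so loci stay aligned.
        yield {pop: [values[c] for c in cols] for pop, values in genotypes.items()}

# Hypothetical per-locus values for three populations (invented numbers).
data = {
    "PopA": [0.10, 0.40, 0.30, 0.90],
    "PopB": [0.20, 0.50, 0.35, 0.80],
    "PopC": [0.60, 0.10, 0.70, 0.20],
}
replicates = list(bootstrap_replicates(data, n_replicates=1000))
```

Each replicate would then get its own distance matrix and tree, and the support for a node is the share of replicate trees in which that grouping appears.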
But last week I found TreeFit, a little Windows program that generates an overall R2 value by comparing the distances in the original genetic distance matrix with the distances fitted along the neighbor-joining tree. Basically, it appears comparable to the STRESS value used for multidimensional scaling (MDS), measuring how well the representation of the data (the NJ tree) matches the variation present in the original distance matrix. A perfect fit would generate an R2 value of 1.0, while anything above 0.90 is considered a good fit (or an accurate representation of the underlying data). Values less than 0.90 suggest that another graphical display method (such as MDS) might be a better choice, as not all data fit the hierarchical model on which the NJ algorithm is based.
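To make the idea concrete, here is a minimal sketch of an R2 calculation on observed versus tree-fitted distances. It assumes the usual coefficient of determination (one minus the residual sum of squares over the total sum of squares); TreeFit's exact formula may differ, and the function name `r_squared` and the distance values are invented for illustration.

```python
def r_squared(observed, fitted):
    """Return R^2 for fitted distances against observed distances.

    Both lists hold one value per population pair, in the same pair order.
    """
    if len(observed) != len(fitted) or len(observed) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    mean_obs = sum(observed) / len(observed)
    # Squared mismatch between the matrix and the tree.
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    # Total variation in the observed distances.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed distances are all identical")
    return 1.0 - ss_res / ss_tot

# Hypothetical distances for four populations (six pairs), invented numbers.
observed = [0.10, 0.25, 0.30, 0.22, 0.28, 0.12]
fitted = [0.11, 0.24, 0.31, 0.23, 0.27, 0.12]
print(round(r_squared(observed, fitted), 3))
```

If the tree reproduced every pairwise distance exactly, the residual term would be zero and the function would return 1.0.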
Using TreeFit, I got some reassurance that my NJ trees were accurate, and the statistical significance I needed to convince my committee that my data is not “merely descriptive.”
Technical specs:
- OS: Windows (runs fine on my XP virtual machine)
- Requires the Microsoft .NET Framework
- if this is not installed on your system (as it wasn’t on mine), it can be downloaded from the Windows update site
- Input file: any lower-left (lower-triangular) genetic distance matrix, meaning that this program works with ANY type of genetic data.
- Output: observed and fitted genetic distances (which can be plotted for a nice visual), plus an overall R2
- Reference: Kalinowski, ST (2009) How well do evolutionary trees describe genetic relationships between populations? Heredity (28 Jan 2009) doi: 10.1038/hdy.2008.136. (PDF available from the author’s publication page).


