I ran into an issue while doing some analysis for my dissertation. I’ve been comparing genetic distances between populations using a variety of molecular markers (mtDNA sequences, Y-STRs, and autosomal STRs). I wanted to generate several neighbor-joining (NJ) trees to display the results, but I also wanted a way to test the statistical support for each tree, that is, how accurately the tree represents the underlying genetic distance data.
One way to do this is with bootstrapping, where thousands of pseudo-replicate data sets are generated from the original data (by resampling sites or loci with replacement and recalculating the tree each time). In the end you have a tree with values on the internal branches, showing how often each grouping turned up across the replicates. It’s a standard technique, and is the method I used with my autosomal STR data. But the software I used to handle sequence data in particular (MEGA, PHYLIP) starts with the raw sequences and generates bootstrapped trees from that data. The trees created show each sequence on its own branch, rather than each population. With over 8,000 sequences in my data set, this type of analysis really wasn’t useful.
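To make the resampling step concrete, here is a minimal Python sketch. It is not what MEGA or PHYLIP do internally; the function names (bootstrap_replicates, clade_support) and the data layout (one list of per-locus values per population) are my own assumptions for illustration. Tree building itself is left out: each replicate would be passed to whatever distance and NJ routine you normally use.

```python
import random

def bootstrap_replicates(data, n_replicates, seed=None):
    """Resample loci (columns) with replacement.

    data: dict mapping population name -> list of per-locus values,
          all lists the same length (hypothetical layout).
    Returns a list of dicts with the same shape as `data`.
    """
    rng = random.Random(seed)
    n_loci = len(next(iter(data.values())))
    replicates = []
    for _ in range(n_replicates):
        # Draw n_loci column indices with replacement; the same columns
        # are used for every population so loci stay aligned.
        cols = [rng.randrange(n_loci) for _ in range(n_loci)]
        replicates.append({pop: [vals[c] for c in cols]
                           for pop, vals in data.items()})
    return replicates

def clade_support(replicate_clades, clade):
    """Fraction of replicate trees that contain `clade`.

    replicate_clades: one set of frozensets per replicate tree, each
                      frozenset holding the populations in one grouping.
    """
    clade = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return hits / len(replicate_clades)
```

The support value for a grouping is then just the share of replicate trees in which that grouping appears, which is the number usually printed on the internal branch.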
But last week I found TreeFit, a little Windows program that generates an overall R² value by comparing the original genetic distance matrix with the distances implied by the neighbor-joining tree. Basically, it appears comparable to the stress value used for multidimensional scaling (MDS), measuring how well the representation of the data (the NJ tree) matches the variation present in the original distance matrix. A perfect fit would generate an R² value of 1.0, while anything above 0.90 is considered a good fit (or an accurate representation of the underlying data). Values less than 0.90 suggest that another graphical display method (such as MDS) might be a better choice, as not all data fit the hierarchical model on which the NJ algorithm is based.
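As a rough illustration of what such an R² measures, here is a small Python sketch. I am assuming the usual coefficient-of-determination form (one minus the residual sum of squares over the total sum of squares, taken over all pairs of populations); TreeFit's exact calculation may differ in detail. The function name tree_fit_r2 and the toy matrices are made up for the example.

```python
def tree_fit_r2(observed, fitted):
    """R^2 between an observed distance matrix and tree-implied distances.

    observed, fitted: square symmetric matrices (lists of lists) in the
    same population order. Only pairs i < j are used.
    Assumes the standard 1 - SS_res / SS_tot definition.
    """
    n = len(observed)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    obs = [observed[i][j] for i, j in pairs]
    fit = [fitted[i][j] for i, j in pairs]
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - f) ** 2 for o, f in zip(obs, fit))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Toy example: three populations, one tree distance off by 1.
observed = [[0, 2, 4],
            [2, 0, 6],
            [4, 6, 0]]
fitted = [[0, 2, 5],
          [2, 0, 6],
          [5, 6, 0]]
print(tree_fit_r2(observed, fitted))    # 0.875
print(tree_fit_r2(observed, observed))  # 1.0
```

In the toy case the tree reproduces two of the three pairwise distances exactly and misses the third by one unit, so the fit drops below the 0.90 rule of thumb mentioned above.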
Using TreeFit, I got some reassurance that my NJ trees were accurate, and the quantitative measure of fit I needed to convince my committee that my data is not “merely descriptive.”
Technical specs: