Classification methods used in machine learning (e.g., artificial neural networks, decision trees, and k-nearest neighbor clustering) are rarely used with population genetic data. We compare different nonparametric machine learning techniques with parametric likelihood estimations commonly employed in population genetics for purposes of assigning individuals to their population of origin (“assignment tests”). Classifier accuracy was compared across simulated data sets representing different levels of population differentiation (low and high FST), number of loci surveyed (5 and 10), and allelic diversity (average of three or eight alleles per locus). Empirical data for the lake trout (Salvelinus namaycush) exhibiting levels of population differentiation comparable to those used in simulations were examined to further evaluate and compare classification methods. Classification error rates associated with artificial neural networks and likelihood estimators were lower for simulated data sets compared to k-nearest neighbor and decision tree classifiers over the entire range of parameters considered. Artificial neural networks only marginally outperformed the likelihood method for simulated data (0–2.8% lower error rates). The performance of each machine learning classifier improved relative to likelihood estimators for empirical data sets, suggesting an ability to “learn” and utilize properties of empirical genotypic arrays intrinsic to each population. Likelihood-based estimation methods provide a more accessible option for reliable assignment of individuals to the population of origin due to the intricacies in development and evaluation of artificial neural networks.
In recent years, characterization of highly polymorphic molecular markers such as mini- and microsatellites and development of novel methods of analysis have enabled researchers to extend investigations of ecological and evolutionary processes below the population level to the level of individuals (e.g., Bowcock et al. 1994; Estoup and Angers 1998; Jarne and Lagoda 1996). Analyses of individual-based genotypic information could substantially improve our understanding of evolutionary phenomena and contribute to effective management of natural populations (review in Bernatchez and Duchesne 2000). The use of individual-based methods remained largely unexplored in animal populations until recently due to a lack of highly polymorphic markers (Bernatchez and Duchesne 2000; Smouse and Chevillon 1998). Traditional analytical methods in population genetics rely almost exclusively on descriptors of genetic characterizations of populations (Bernatchez and Duchesne 2000) and not on individual genotypes.
“Assignment tests” are designed to determine population membership for individuals. One particular application based on a likelihood estimate (LE) was introduced by Paetkau et al. (1995; see also Vásquez-Domínguez et al. 2001) to assign an individual to the population of origin on the basis of multilocus genotype and expectations of observing this genotype in each potential source population. The LE approach can be implemented statistically in a Bayesian framework as a convenient way to evaluate hypotheses of plausible genealogical relationships (e.g., that an individual possesses an ancestor in another population) (Dawson and Belkhir 2001; Pritchard et al. 2000; Rannala and Mountain 1997). Other studies have evaluated the confidence of the assignment (Almudevar 2000) and characteristics of genotypic data (e.g., degree of population divergence, number of loci, number of individuals, number of alleles) that lead to more accurate population assignment (Bernatchez and Duchesne 2000; Cornuet et al. 1999; Haig et al. 1997; Shriver et al. 1997; Smouse and Chevillon 1998). Main statistical and conceptual differences between methods leading to the use of an assignment test are given in, for example, Cornuet et al. (1999) and Rosenberg et al. (2001). However, the relative power of those tests has not been fully appreciated and empirical comparisons are scarce (Eldridge et al. 2001). Assignment tests can also be considered as surrogates at the individual level (sensu Hansen et al. 2001a) for other statistical tools developed earlier, such as mixed-stock analysis (e.g., Pella and Masuda 2001; Pella and Milner 1987). Detailed theoretical comparisons of the interests and limitations of both methods are still lacking, but empirical studies have revealed correlations between outputs of methods (Knutsen et al. 2001; Potvin and Bernatchez 2001).
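The likelihood-based assignment described above can be sketched in a few lines. This is a minimal illustration, not the procedure of Paetkau et al. (1995) as published: it assumes Hardy-Weinberg proportions within populations and independence among loci, and it replaces alleles absent from a population sample with a small floor frequency. The function names, the `floor` parameter, the allele labels, and the allele frequencies are all hypothetical.

```python
import math

def genotype_log_likelihood(genotype, allele_freqs, floor=0.01):
    """Log-likelihood of a multilocus genotype in one population.

    genotype: list of (allele1, allele2) tuples, one per locus.
    allele_freqs: list of dicts (one per locus) mapping allele -> frequency.
    floor: assumed minimum frequency for alleles unseen in the sample
           (an illustrative choice, not taken from the source).
    """
    total = 0.0
    for (a1, a2), freqs in zip(genotype, allele_freqs):
        p = max(freqs.get(a1, 0.0), floor)
        q = max(freqs.get(a2, 0.0), floor)
        # Hardy-Weinberg expectation: p^2 for homozygotes, 2pq for heterozygotes.
        prob = p * p if a1 == a2 else 2.0 * p * q
        total += math.log(prob)  # loci assumed independent, so logs add
    return total

def assign(genotype, populations, floor=0.01):
    """Return the population with the highest log-likelihood, plus all scores."""
    scores = {
        name: genotype_log_likelihood(genotype, freqs, floor)
        for name, freqs in populations.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical allele frequencies for two populations at two loci.
populations = {
    "A": [{"101": 0.8, "103": 0.2}, {"200": 0.6, "204": 0.4}],
    "B": [{"101": 0.3, "103": 0.7}, {"200": 0.1, "204": 0.9}],
}
fish = [("101", "101"), ("200", "204")]
best, scores = assign(fish, populations)
print(best, scores)
```

With these made-up frequencies, the genotype is more probable in population A (0.64 × 0.48) than in population B (0.09 × 0.18), so the individual is assigned to A.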
Assignment tests have been widely used in different applications, including determination of degree of population differentiation or to establish the relationship among individuals within and among various taxonomic groupings (e.g., Bogdanowicz et al. 1997; Koskinen et al. 2001; Marshall et al. 2000; Müller 2000; Neraas and Spruell 2001; Nielsen et al. 2001b; Polzhien et al. 2000; Primmer et al. 1999; Roeder et al. 2001; Roques et al. 1999; Schulte-Hostedde et al. 2001; Sefc et al. 2000; Spidle et al. 2001; Vásquez-Domínguez et al. 2001), including hybrids (e.g., Beaumont et al. 2001; Congiu et al. 2001; Randi et al. 2001), introgressed individuals (e.g., Martinez et al. 2001; Randi and Lucchini 2002), and ecotypes (e.g., Taylor et al. 2000). Applications of assignment tests also include [human] forensics (e.g., Evett and Weir 1998; Primmer et al. 2000), identification and/or source of dispersers (e.g., Davies et al. 1999; Eldridge et al. 2001; Galbusera et al. 2000; Petersson et al. 2001; Tsutsui et al. 2001; Vasemägi et al. 2001), phylogeographical analyses (e.g., King et al. 2001; Zeisset and Beebee 2001), and the evaluation of the contribution of stocked individuals to natural populations (e.g., Fritzner et al. 2001; Hansen et al. 2000, 2001b) and of supportive breeding programs (Nielsen et al. 2001a; Olsen et al. 2000). Fish are among the organisms that have received considerable attention using such tools (see Hansen et al. [2001a] for a review). Moreover, these techniques are now used for profiles of traits outside the limited scope of population genetics (Thorrold et al. 2001).
Methods of classification vary widely based on several criteria (e.g., Jain et al. 2000) (Figure 1). Two basic classification processes are traditionally recognized in machine learning: supervised classifiers and unsupervised classifiers (Figure 1; e.g., Duda et al. 2000; Jain et al. 2000). Supervised classifiers represent a group of methods whereby individual assignment is made to predefined classes (i.e., populations of origin). In unsupervised classification, classes are unknown and are defined a posteriori on the basis of the degree of difference or similarity in attributes characterized from sampled individuals. Clustering methods (e.g., multidimensional scaling, principal component analysis) are examples of unsupervised classification.
Applications of assignment testing in population genetics first used supervised parametric likelihood-based approaches (Figure 1). Other machine learning classification methods are widely used in the physical and social sciences and in other biological disciplines (e.g., Boddy et al. 2000; Leung and Tran 2000; Manel et al. 1999; Raymer et al. 1997). Artificial neural networks (ANNs) are a popular technique used in machine learning (e.g., Boddy and Morris 1999; Duda et al. 2000; Lek and Guégan 2000; Ripley 1996). However, although their potential has been recognized (Hansen et al. 2001a), ANN methods have rarely been employed for population genetics applications (Aurelle 1999; Aurelle et al. 1999; Cornuet et al. 1996; Curtis et al. 2001; Giraudel et al. 2000; Grigull et al. 2001; Taylor et al. 1994; Whitler et al. 1994). Other popular classification methods in machine learning, such as decision trees (e.g., Bell 1996, 1999; Duda et al. 2000; Mitchell 1997) and k-nearest neighbor analysis (k-NN; e.g., Dasarathy 1991; Duda et al. 2000), have yet to be applied in population genetics (Figure 1). Moreover, there has not been a directed effort to compare machine learning methodologies with the likelihood-based procedures widely used in population genetics. Cornuet et al. (1996) compared the relative merits of ANNs to discriminant analysis in an empirical study involving different populations and subspecies of honeybee (Apis mellifera). However, they did not compare LE and ANN supervised classifiers. Aurelle (1999) applied both the approach of Rannala and Mountain (1997) (Figure 1) and ANN analysis to brown trout (Salmo trutta) microsatellite data; however, he did not provide a direct comparison of classification results or accuracies. Hansen et al. (2001a) briefly presented ANNs, but rejected their use without testing their ability to classify individuals.
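To make the k-NN idea concrete, the sketch below assigns an individual to the population most common among its k most similar reference genotypes. It is an illustration under stated assumptions rather than the classifier evaluated in this study: the allele-sharing distance, the value k = 3, the function names, and the genotypes are all hypothetical choices made for the example.

```python
from collections import Counter

def allele_sharing_distance(g1, g2):
    """Mean per-locus proportion of alleles NOT shared between two genotypes.

    g1, g2: lists of (allele1, allele2) tuples, one per locus, same locus order.
    Returns 0.0 for identical genotypes and 1.0 when no alleles are shared.
    """
    total = 0.0
    for (a1, a2), (b1, b2) in zip(g1, g2):
        # Multiset intersection counts shared allele copies (0, 1, or 2).
        shared = sum((Counter([a1, a2]) & Counter([b1, b2])).values())
        total += 1.0 - shared / 2.0
    return total / len(g1)

def knn_assign(query, training, k=3):
    """Majority vote among the k reference individuals nearest to `query`.

    training: list of (genotype, population_label) pairs of known origin.
    """
    ranked = sorted(
        training, key=lambda item: allele_sharing_distance(query, item[0])
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical reference individuals of known origin (two loci each).
training = [
    ([("101", "101"), ("200", "200")], "A"),
    ([("101", "103"), ("200", "204")], "A"),
    ([("101", "101"), ("200", "204")], "A"),
    ([("103", "103"), ("204", "204")], "B"),
    ([("103", "103"), ("200", "204")], "B"),
    ([("101", "103"), ("204", "204")], "B"),
]
query = [("101", "101"), ("200", "200")]
predicted = knn_assign(query, training, k=3)
print(predicted)
```

Unlike the likelihood approach, this classifier estimates no allele frequencies; it relies only on pairwise similarity to labeled individuals, which is what makes it nonparametric. An odd k avoids tied votes when there are two candidate populations.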
The objective of this article is to describe several of the more widely used machine learning classifiers that may have utility when used with empirical population genetics data. We compare likelihood-based “assignment tests” (Paetkau et al. 1995) with supervised machine learning classifiers including ANN, decision tree, and k-NN. Simulations were conducted to estimate and compare the assignment accuracy associated with different classifiers using ranges of parameter values (number of loci, allelic diversity, and interpopulation variance in allele frequency) typically encountered in natural populations. Comparative analyses were extended to empirical examples using lake trout (Salvelinus namaycush; Salmonidae).
Additional publication details
| Publication Subtype | Journal Article |
| Title | Comparisons of likelihood and machine learning methods of individual classification |
| Series title | Journal of Heredity |
| Contributing office(s) | Great Lakes Science Center |
| Online Only (Y/N) | N |
| Additional Online Files (Y/N) | N |