Comparisons of likelihood and machine learning methods of individual classification

Journal of Heredity

Abstract

Classification methods used in machine learning (e.g., artificial neural networks, decision trees, and k-nearest neighbor clustering) are rarely used with population genetic data. We compare different nonparametric machine learning techniques with parametric likelihood estimations commonly employed in population genetics for purposes of assigning individuals to their population of origin (“assignment tests”). Classifier accuracy was compared across simulated data sets representing different levels of population differentiation (low and high FST), number of loci surveyed (5 and 10), and allelic diversity (average of three or eight alleles per locus). Empirical data for the lake trout (Salvelinus namaycush) exhibiting levels of population differentiation comparable to those used in simulations were examined to further evaluate and compare classification methods. For simulated data sets, classification error rates associated with artificial neural networks and likelihood estimators were lower than those of k-nearest neighbor and decision tree classifiers over the entire range of parameters considered. Artificial neural networks only marginally outperformed the likelihood method for simulated data (0–2.8% lower error rates). The performance of each machine learning classifier improved relative to likelihood estimators for empirical data sets, suggesting an ability to “learn” and utilize properties of empirical genotypic arrays intrinsic to each population. Because developing and evaluating artificial neural networks is intricate, likelihood-based estimation methods provide a more accessible option for reliably assigning individuals to their population of origin.

In recent years, characterization of highly polymorphic molecular markers such as mini- and microsatellites and development of novel methods of analysis have enabled researchers to extend investigations of ecological and evolutionary processes below the population level to the level of individuals (e.g., Bowcock et al. 1994; Estoup and Angers 1998; Jarne and Lagoda 1996). Analyses of individual-based genotypic information could substantially improve our understanding of evolutionary phenomena and contribute to effective management of natural populations (review in Bernatchez and Duchesne 2000). The use of individual-based methods remained largely unexplored in animal populations until recently due to a lack of highly polymorphic markers (Bernatchez and Duchesne 2000; Smouse and Chevillon 1998). Traditional analytical methods in population genetics rely almost exclusively on descriptors of genetic characterizations of populations (Bernatchez and Duchesne 2000) and not on individual genotypes.

“Assignment tests” are designed to determine population membership for individuals. One particular application based on a likelihood estimate (LE) was introduced by Paetkau et al. (1995; see also Vásquez-Domínguez et al. 2001) to assign an individual to the population of origin on the basis of its multilocus genotype and the expectation of observing this genotype in each potential source population. The LE approach can be implemented statistically in a Bayesian framework as a convenient way to evaluate hypotheses of plausible genealogical relationships (e.g., that an individual possesses an ancestor in another population) (Dawson and Belkhir 2001; Pritchard et al. 2000; Rannala and Mountain 1997). Other studies have evaluated the confidence of the assignment (Almudevar 2000) and characteristics of genotypic data (e.g., degree of population divergence, number of loci, number of individuals, number of alleles) that lead to more accurate population assignment (Bernatchez and Duchesne 2000; Cornuet et al. 1999; Haig et al. 1997; Shriver et al. 1997; Smouse and Chevillon 1998). The main statistical and conceptual differences between the methods underlying assignment tests are given in, for example, Cornuet et al. (1999) and Rosenberg et al. (2001). However, the relative power of those tests has not been fully appreciated, and empirical comparisons are scarce (Eldridge et al. 2001). Assignment tests can also be considered as surrogates at the individual level (sensu Hansen et al. 2001a) for other statistical tools developed earlier, such as mixed-stock analysis (e.g., Pella and Masuda 2001; Pella and Milner 1987). Detailed theoretical comparisons of the merits and limitations of both methods are still lacking, but empirical studies have revealed correlations between the outputs of the methods (Knutsen et al. 2001; Potvin and Bernatchez 2001).
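The likelihood-based assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: it assumes Hardy-Weinberg proportions within each population and independence among loci, and the function names, the toy reference samples, and the `floor` frequency substituted for alleles absent from a reference sample are all hypothetical choices made for the example.

```python
import math
from collections import Counter

def allele_counts(sample):
    """Count alleles per locus; each individual is a list of (a1, a2) tuples."""
    n_loci = len(sample[0])
    counts = []
    for locus in range(n_loci):
        c = Counter()
        for individual in sample:
            c.update(individual[locus])  # adds both alleles of the genotype
        counts.append(c)
    return counts

def log_likelihood(individual, counts, floor=0.01):
    """Log-probability of a multilocus genotype under Hardy-Weinberg proportions.

    `floor` is an assumed stand-in frequency for alleles not observed in the
    reference sample, which avoids taking log(0).
    """
    total = 0.0
    for (a1, a2), c in zip(individual, counts):
        n = sum(c.values())
        p = max(c[a1] / n, floor)
        q = max(c[a2] / n, floor)
        # Homozygote: p^2; heterozygote: 2pq.
        total += math.log(p * p if a1 == a2 else 2 * p * q)
    return total

def assign(individual, reference):
    """Return the population with the highest log-likelihood, plus all scores."""
    scores = {pop: log_likelihood(individual, allele_counts(sample))
              for pop, sample in reference.items()}
    return max(scores, key=scores.get), scores

# Toy reference samples: two populations, two loci, three individuals each.
reference = {
    "A": [[(1, 1), (2, 2)], [(1, 1), (2, 3)], [(1, 2), (2, 2)]],
    "B": [[(3, 3), (4, 4)], [(3, 2), (4, 4)], [(3, 3), (4, 5)]],
}
best, scores = assign([(1, 1), (2, 2)], reference)
```

Here the query genotype carries alleles that are common in sample A and absent from sample B, so A receives the higher log-likelihood. In practice the treatment of unobserved alleles strongly affects results and should follow the chosen published method rather than the fixed floor assumed here.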

Assignment tests have been widely used in different applications, including determination of degree of population differentiation or to establish the relationship among individuals within and among various taxonomic groupings (e.g., Bogdanowicz et al. 1997; Koskinen et al. 2001; Marshall et al. 2000; Müller 2000; Neraas and Spruell 2001; Nielsen et al. 2001b; Polzhien et al. 2000; Primmer et al. 1999; Roeder et al. 2001; Roques et al. 1999; Schulte-Hostedde et al. 2001; Sefc et al. 2000; Spidle et al. 2001; Vásquez-Domínguez et al. 2001), including hybrids (e.g., Beaumont et al. 2001; Congiu et al. 2001; Randi et al. 2001), introgressed individuals (e.g., Martinez et al. 2001; Randi and Lucchini 2002), and ecotypes (e.g., Taylor et al. 2000). Applications of assignment tests also include [human] forensics (e.g., Evett and Weir 1998; Primmer et al. 2000), identification and/or source of dispersers (e.g., Davies et al. 1999; Eldridge et al. 2001; Galbusera et al. 2000; Petersson et al. 2001; Tsutsui et al. 2001; Vasemägi et al. 2001), phylogeographical analyses (e.g., King et al. 2001; Zeisset and Beebee 2001), and the evaluation of the contribution of stocked individuals to natural populations (e.g., Fritzner et al. 2001; Hansen et al. 2000, 2001b) and of supportive breeding programs (Nielsen et al. 2001a; Olsen et al. 2000). Fish are among the organisms that have received considerable attention using such tools (see Hansen et al. [2001a] for a review). Moreover, these techniques are now used for profiles of traits outside the limited scope of population genetics (Thorrold et al. 2001).

Methods of classification vary widely based on several criteria (e.g., Jain et al. 2000) (Figure 1). Two basic classification processes are traditionally recognized in machine learning: supervised classifiers and unsupervised classifiers (Figure 1; e.g., Duda et al. 2000; Jain et al. 2000). Supervised classifiers represent a group of methods whereby individual assignment is made to predefined classes (i.e., populations of origin). In unsupervised classification, classes are unknown and are defined a posteriori on the basis of the degree of difference or similarity in attributes characterized from sampled individuals. Clustering methods (e.g., multidimensional scaling, principal component analysis) are examples of unsupervised classification.

Applications of assignment testing in population genetics first used supervised parametric likelihood-based approaches (Figure 1). Other machine learning classification methods are widely used in the physical and social sciences and in other biological disciplines (e.g., Boddy et al. 2000; Leung and Tran 2000; Manel et al. 1999; Raymer et al. 1997). Artificial neural networks (ANNs) are a popular technique used in machine learning (e.g., Boddy and Morris 1999; Duda et al. 2000; Lek and Guégan 2000; Ripley 1996). However, while recognized (Hansen et al. 2001a), ANN methods rarely have been employed for population genetics applications (Aurelle 1999; Aurelle et al. 1999; Cornuet et al. 1996; Curtis et al. 2001; Giraudel et al. 2000; Grigull et al. 2001; Taylor et al. 1994; Whitler et al. 1994). Other popular classification methods in machine learning, such as decision trees (e.g., Bell 1996, 1999; Duda et al. 2000; Mitchell 1997) and k-nearest neighbor analysis (k-NN; e.g., Dasarathy 1991; Duda et al. 2000) have yet to be applied in population genetics (Figure 1). Moreover, there has not been a directed effort to compare machine learning methodologies with the likelihood-based procedures widely used in population genetics. Cornuet et al. (1996) compared the relative merits of ANNs to discriminant analysis in an empirical study involving different populations and subspecies of honeybee (Apis mellifera). However, they did not compare LE and ANN supervised classifiers. Aurelle (1999) used the approach of Rannala and Mountain (1997) (Figure 1) and ANN analysis using brown trout (Salmo trutta) microsatellite data; however, he did not provide a direct comparison of classification results or accuracies. Hansen et al. (2001a) briefly presented ANNs, but rejected their use without really testing their ability to classify individuals.

The objective of this article is to describe several of the more widely used machine learning classifiers that may have utility when used with empirical population genetics data. We compare likelihood-based “assignment tests” (Paetkau et al. 1995) with supervised machine learning classifiers including ANN, decision tree, and k-NN clustering. Simulations were conducted to estimate and compare the assignment accuracy associated with different classifiers using ranges of parameter values (number of loci, allelic diversity, and interpopulation variance in allele frequency) typically encountered in natural populations. Comparative analyses were extended to empirical examples using lake trout (Salvelinus namaycush; Salmonidae).
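To make the nonparametric side of this comparison concrete, a k-NN classifier for genotypes can be sketched as follows. This is an illustrative sketch under assumptions that the text above does not specify: genotypes are encoded as per-locus allele-count vectors, distance is measured as the sum of absolute differences, and k = 3. All function names and the toy data are hypothetical.

```python
from collections import Counter

def encode(individual, alleles_per_locus):
    """Encode a multilocus genotype as allele counts (0, 1, or 2 per allele)."""
    vec = []
    for (a1, a2), alleles in zip(individual, alleles_per_locus):
        vec.extend((a1 == a) + (a2 == a) for a in alleles)
    return vec

def knn_assign(query, training, alleles_per_locus, k=3):
    """Assign `query` to the majority population among its k nearest neighbors.

    `training` is a list of (genotype, population) pairs with known origin.
    """
    q = encode(query, alleles_per_locus)
    dists = []
    for genotype, pop in training:
        v = encode(genotype, alleles_per_locus)
        # Assumed metric: Manhattan distance between allele-count vectors.
        dists.append((sum(abs(x - y) for x, y in zip(q, v)), pop))
    dists.sort(key=lambda t: t[0])
    votes = Counter(pop for _, pop in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two populations, two loci.
alleles_per_locus = [[1, 2, 3], [2, 3, 4, 5]]
training = [
    ([(1, 1), (2, 2)], "A"), ([(1, 1), (2, 3)], "A"), ([(1, 2), (2, 2)], "A"),
    ([(3, 3), (4, 4)], "B"), ([(3, 2), (4, 4)], "B"), ([(3, 3), (4, 5)], "B"),
]
assigned = knn_assign([(1, 1), (2, 2)], training, alleles_per_locus)
```

Unlike the likelihood approach, this classifier estimates no allele frequencies and assumes no genetic model; it relies only on similarity to individuals of known origin, so its behavior depends on the encoding, the metric, and k.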

 

Publication type: Article
Publication Subtype: Journal Article
Title: Comparisons of likelihood and machine learning methods of individual classification
Series title: Journal of Heredity
DOI: 10.1093/jhered/93.4.260
Volume: 93
Issue: 4
Year Published: 2002
Language: English
Publisher: Oxford Journals
Contributing office(s): Great Lakes Science Center
Description: 10 p.
First page: 260
Last page: 269
Online Only (Y/N): N
Additional Online Files (Y/N): N