15 avril 2020

How to know if an individual specimen may be a member of a given reference population?

Let’s say you have one given specimen of unknown origin or species. Although you do not know precisely to which group this individual belongs, you have several hypotheses; e.g., you think that this individual might belong to populations A or B, but is certainly not a member of population C. In this post, we will illustrate some possible methods to assess the credibility of those hypotheses, depending on the data you were able to collect (or you are willing to use).

If you have only one continuous variable for both your unknown individual and your reference population sample, a simple yet efficient univariate method may be the so-called “probabilistic distance”.
If you have several continuous variables, one can think of various approaches. A basic one (frequently encountered within the literature in biological anthropology) makes use of principal component analyses and data ellipses. An alternative would be to use the concept of typicality probability.

Let’s start with the simplest case.

1 The univariate case

1.1 The “probabilistic distance”

1.1.1 Theory

The “probabilistic distance” (sometimes abbreviated dpx), or distance probabiliste in French, is basically a two sample t-test when one of the samples under comparison has only one individual. Within this framework, the null hypothesis \(\mathcal{H}_0\) might be rephrased as: “the unknown individual is a member of the reference population”. The use of this method was popularized by Bruno Maureille and coworkers; full theoretical details may be found in two of their publications [4,2].

The probabilistic distance assumes that the reference population sample is normally distributed. In this case, the final result is simply a p-value for the null hypothesis. It may be interpreted as the probability to find in the underlying reference population an individual as extreme as, or even more extreme than, the target individual under study. Consequently, a very low probabilistic distance (let’s say \(p = 0.0001\)) constitutes a strong evidence against the hypothesis that the unknown individual belongs to the reference population. Conversely, a probabilistic distance close to 1 means that the unknown individual is close to the average individual from the reference population, thus providing no evidence to discard the null hypothesis¹.

1.1.2 Practice with R

It is very easy to implement the formula of the probabilistic distance in a spreadsheet or in R. The detailed formulae can be found in [4]. However, the R package anthrostat offers an R function, dpx(), to facilitate the computations out of the box. Here is an example of its use:

library(anthrostat)
set.seed(2020) # to ensure replicability
## Generate (at random) a reference population sample of 15 indiv.:
ref <- rnorm(15, mean = 25, sd = 3)
## Distance probabiliste for an unknown individual x:
dpx(x = 20, ref)


[1] 0.1838366

The probabilistic distance is approximately equal to 0.18. Thus, for an unknown individual with a value \(x = 20\), there is no strong evidence that this individual might not be a member of the reference population sample (composed of 15 individuals distributed according to \(\mathcal{N}(25, 9)\)).

1.2 Alternatives

If the reference population sample is large enough, it might also be possible to apply any (robust) method of outlier detection to the whole data, i.e. the \(n+1\) values composed of the reference sample and the unknown individual. If the unknown individual is detected as an outlier within this sample, this may constitute an evidence against the hypothesis that the unknown individual is a member of the reference sample.

2 The multivariate case with one single reference sample

Let’s suppose now that we have several variables \(X_1, X_2, \cdots, X_p\) involved in the comparison of the unknown individual and the reference population sample.

2.1 PCA and data ellipses

A first possible approach would be to perform a PCA on the whole dataset². For this example, we will use an example from the Goldman Data Set, freely available online [1]. Our reference population sample will be composed of 29 male individuals from the Indian Knoll population sample (US-KT, 5500–3700 BP), and our target individual will be a male individual from Tigara population sample (US-AK, 750–300 BP). Thus, we hope to demonstrate that the target individual is not a member of the reference sample.

Six variables of interest will be considered here (various femoral, humeral and radial bone measurements). Their corresponding short codes in the Goldman Data Set are LFML, LHML, RTML, LHMLD, LFAPD, LFMLD respectively.

library(dplyr)
## Load dataset:
gds <- read.csv("http://web.utk.edu/~auerbach/Goldman.csv",
                dec = ".", na.strings = "",
                stringsAsFactors = TRUE,
                fileEncoding = "macintosh")
## Select the variables of interest and keep only male individuals:
gds <- subset(gds, Sex == "0")
gds <- select(gds, NOTE, LFML, LHML, RTML, LHMLD, LFAPD, LFMLD)
## Select a reference sample and a target individual:
ref <- subset(gds, NOTE == "Indian Knoll")
target <- gds[1, ]
## Merge them in one dataframe:
dat <- rbind(target, ref)
## Drop unused factor levels:
dat <- droplevels(dat)
## Summarize data:
summary(dat)


                      NOTE         LFML            LHML            RTML      
 Indian Knoll           :29   Min.   :404.0   Min.   :292.0   Min.   :337.5  
 Tigara - Point Hope, AK: 1   1st Qu.:424.8   1st Qu.:305.5   1st Qu.:355.5  
                              Median :441.0   Median :312.5   Median :367.0  
                              Mean   :439.9   Mean   :318.0   Mean   :365.8  
                              3rd Qu.:457.8   3rd Qu.:327.5   3rd Qu.:373.0  
                              Max.   :474.0   Max.   :354.0   Max.   :401.0  
                                              NA's   :1       NA's   :3      
     LHMLD           LFAPD           LFMLD      
 Min.   :16.09   Min.   :23.92   Min.   :20.66  
 1st Qu.:18.47   1st Qu.:25.66   1st Qu.:22.55  
 Median :19.10   Median :26.68   Median :23.58  
 Mean   :19.16   Mean   :26.81   Mean   :23.91  
 3rd Qu.:20.29   3rd Qu.:28.09   3rd Qu.:24.59  
 Max.   :24.30   Max.   :31.39   Max.   :29.77

There are only very few missing values here, so that we can simply ignore them and use the default method, i.e. the replacement by the column means³, and perform a PCA:

library(FactoMineR)
## Compute (silently) the PCA:
res_pca <- PCA(dat, quali.sup = 1, graph = FALSE)
## Display a PCA plot:
plot(res_pca, habillage = 1, graph.type = "classic")

Figure 1: First two principal axes of the PCA.

As it can be seen on figure 1, the null hypothesis that the target individual (in red) belongs to the reference population sample (in black) is rather unlikely (well, in this toy example, we know that it’s really false!). Adding a 95% data ellipse around the reference population sample is a reasonable way to better visualize this incompatibility:

library(car)
## Display a PCA plot:
plot(res_pca, habillage = 1, graph.type = "classic")
## Extract the principal coordinates of the reference sample
## (we remove the first line, since it corresponds to the target individual):
coor <- res_pca$ind$coor[-1, ]
## Add a 95% data ellipse:
dataEllipse(x = coor[, 1], # PC1
            y = coor[, 2], # PC2
            levels = 0.95, # 95% ellipse
            col = "black", add = TRUE)

Figure 2: 95% data ellipse for the reference population sample of Indian Knoll, computed and superimposed on the first two principal axes.

The figure 2 is eloquent, and brings an additional proof against the hypothesis that the target individual might be a member of Indian Knoll population sample: it is clearly not within the region of the 95% “most credible” individuals for this population sample. Some remarks, however:

This discussion must take into account the percentage of variance captured by the first two principal components (here, it is large enough: it is not mandatory to inspect the following principal components).
You can try with other confidence levels in order to get even an stronger evidence!
It is not sufficient to say that an individual may not be part of a reference sample: it is important to precise why (i.e., which variables or combination of variables observed on this individual are really unusual). Inspecting the correlation circle of the PCA should be instructive; applying univariate methods (as exposed in the previous section) on well-chosen variables should also help.

2.1.1 Note: a fast alternative with the R package `factoextra`

Producing manually the output in Figure 2 is quite cumbersome. It may be easier to use the R package factoextra. Using its function fviz_pca_ind(), one can directly display a pretty PCA plot with data ellipses, directly from the results returned by the PCA() function.

library(factoextra)
fviz_pca_ind(res_pca, habillage = 1, addEllipses = TRUE)

Figure 3: A similar output using the R package `factoextra`.

2.2 Alternatives

As for the univariate case, it might be possible to apply any multivariate outlier detection method to the whole dataset, including both the reference sample and the target individual.

3 The multivariate case with several one or several reference samples : typicality probability

3.1 Philosophy

The concept of typicality probability allows to get quantitative information about the hypothesis that an individual belongs to a reference population [3,5], in a way that differs from other methods. This method is particularly useful if there are several population samples as candidates: does the unknown individual belong to population A or population B ?

If the question can really be formulated like this, a common way to solve the problem is to apply a linear discriminant analysis. This will provide posterior probabilities \(p_A, p_B\) for the alternative “the individual comes from population A / population B”. Here, those two hypotheses are mutually exclusive and one of them is necessarily true, so that \(p_A + p_B = 1\).

This framework may be inconvenient in many situations: it makes sense only when there are no other possible reference population for the unknown individual: it is strictly impossible that the individual belong to a third population C. In concrete applications, this cannot always be completely excluded. The goal would be to get a quantitative assessment of the hypotheses “the individual may belong to pop A” / “the individual may belong to pop B”, taking into account that:

the individual may be incompatible with both populations A and B;
the measurements of the individual may be perfectly compatible with both A and B (so that none of the two hypotheses can really be discarded).

In other words, the respective credibility of both hypotheses (\(p_A, p_B\)) will not necessarily sum up to 1.

The concept of typicality probability offers an answer to this problem. It might be seen as a multivariate version of the probabilistic distance.

3.2 Mathematical details

How to compute the typicality probability for a given individual \(x\) and a reference sample? Full details can be found in the seminal article from Wilson [5]. Here is a short summary for the case where an individual is to be compared to one single reference sample:

It is assumed that the data within each group follow a multivariate normal distribution.
Compute the (squared) Mahalanobis distance between \(x\) and the centroid of the reference sample \(m\): \[ D^2 = (x - m)' V^{-1} (x - m)\] where \(V\) is the covariance matrix of the reference sample.
It can be proved that, if \(p\) is the number of variables and \(n\) the number of individuals in the reference sample: \[f_\text{obs} = \frac{n(n-p)}{p(n-1)(n+1)} D^2 \sim \mathcal{F}(p, n-p) \]
Thus, the typicality probability is defined as \(1 - F(f_\text{obs})\), where \(F\) is the distribution function of a Fisher distribution with \(p\) and \(n-p\) degrees of freedom.

From its definition, we can see that the typicality probability is the probability to find within the reference population an individual as extreme as, or even more extreme than, the target individual.

3.3 An example with R

Let’s retrieve the example from section 2.1, but we’ll add a second reference population sample: the foragers from Ipituaq (US-AK, 1500–1100 BP). One could expect the target individual to be closer to Ipituaq population than Indian Knoll population, but the target individual might also be far from the two populations.

First, we build a new dataset:

## Add a second population sample from the Goldman Data Set:
dat <- rbind(dat, subset(gds, NOTE == "Ipituaq - Point Hope, AK"))
## Drop unused factor levels, and remove all missing values:
dat <- droplevels(dat)
dat <- na.omit(dat)
## Summarize the data:
summary(dat)


                       NOTE         LFML            LHML            RTML      
 Indian Knoll            :25   Min.   :385.0   Min.   :275.0   Min.   :305.0  
 Tigara - Point Hope, AK : 1   1st Qu.:407.5   1st Qu.:301.0   1st Qu.:337.5  
 Ipituaq - Point Hope, AK:15   Median :429.0   Median :308.0   Median :360.0  
                               Mean   :429.3   Mean   :310.4   Mean   :354.2  
                               3rd Qu.:451.0   3rd Qu.:321.5   3rd Qu.:371.5  
                               Max.   :513.5   Max.   :352.5   Max.   :401.0  
     LHMLD           LFAPD           LFMLD      
 Min.   :16.09   Min.   :23.92   Min.   :20.66  
 1st Qu.:18.52   1st Qu.:26.06   1st Qu.:23.03  
 Median :19.53   Median :27.21   Median :24.64  
 Mean   :20.10   Mean   :27.45   Mean   :25.16  
 3rd Qu.:21.36   3rd Qu.:28.85   3rd Qu.:27.10  
 Max.   :25.91   Max.   :31.71   Max.   :29.96

The R function typprobClass() from the R package Morpho can do what we want: calculate the typicality probability of a given individual x for several groups.

library(Morpho)
## Isolate the target individual, which is in the first row:
indiv <- as.vector(dat[1, -1]) # do not consider the first column here
## Isolate the numeric data that constitute the reference sample,
## i.e. all the dataframe except the first row (target indiv)
## and the first column (grouping factor):
ref <- dat[-1, -1]
## For a greater ease of use below, keep apart the grouping factor:
pop <- dat[-1, "NOTE"]
## Compute typicality probabilities:
typprobClass(x = indiv, data = ref, groups = pop,
             method = "wilson")$probs


  Indian Knoll Ipituaq - Point Hope, AK
1  0.001216739                0.1980937

This suggests that, based on those six bone measurements, the target individual is certainly not a member of Indian Knoll (with a typicality probability as low as 0.001). However, his bone measurements might be compatible with those usually observed within the population of Ipituaq, since the typicality probability associated to this population sample is approximately 0.20.

References

[1]	Benjamin M. Auerbach and Christopher B. Ruff. Human body mass estimation: A comparison of morphometric and mechanical methods. American Journal of Physical Anthropology, 125(4):331--342, 2004. [ DOI ]
[2]	Bruno Maureille, Hélène Rougier, Francis Houët, and Bernard Vandermeersch. Les dents inférieures du néandertalien Regourdou 1 (site de Regourdou, commune de Montignac, Dordogne) : analyses métriques et comparatives. PALEO. Revue d'archéologie préhistorique, (13):183--200, 2001. http://journals.openedition.org/paleo/1043.
[3]	Yuji Mizoguchi. Typicality probabilities of Late Pleistocene human fossils from East Asia, Southeast Asia, and Australia: Implications for the Jomon population in Japan. Anthropological Science, 119(2):99--111, 2011. [ DOI ]
[4]	H. Scolan, F. Santos, A. M. Tillier, B. Maureille, and A. Quintard. Des nouveaux vestiges néanderthaliens à Las Pélénos (Monsempron-Libos, Lot-et-Garonne, France). Bulletins et mémoires de la Société d'anthropologie de Paris, 24(1-2):69--95, 2012. [ DOI ]
[5]	S. R. Wilson. On comparing fossil specimens with population samples. Journal of Human Evolution, 10(3):207--214, 1981. [ DOI ]

Footnotes:

In this case, one must not conclude that the unknown individual is a member of the reference population! It is simply an hypothesis compatible with the observed data, among other credible hypotheses.

This supposes that there is a moderate amount of missing values in the reference sample.

In real life applications, applying missing values imputation methods would be preferable all the same.