ADMIXTURE bar plot. David H. Alexander et al., Admixture 1.3. Software Manual (2020), https://dalexander.github.io/admixture/admixture-manual.pdf.
Two years after STRUCTURE was released, a group of researchers around population geneticist Noah Rosenberg—among them Jonathan Pritchard, one of the coauthors of STRUCTURE—used the program to analyze global genome-wide data from the Human Genome Diversity Panel.19 The resulting paper provides an example of how STRUCTURE derives population structure from individual ancestry coefficients. In the paper, the authors aimed to show that individuals from different present-day cultural and geographic groups—from ethnoreligious and ethnic groups (such as “Druze,” “Han,” or “Uyghur”) and nationalities (such as “Russian” or “Italian”) to regional subgroups of these nations (such as “Sardinian” or “Tuscan”)—show different patterns of allele frequencies. Thereby, they wanted to prove that these “predefined populations” correspond to human populations in a genetic sense.20
The analysis with STRUCTURE generally supported this assumption. However, what Rosenberg and his coauthors particularly emphasized was that the diagram produced through STRUCTURE identified these populations as composites of “six main genetic clusters, five of which correspond to major geographic regions.”21
The legend for the diagram explains that “each individual is represented by a thin vertical line, which is partitioned into K colored segments,” with K representing the number of clusters. It further explains that these colored segments “represent the individual’s estimated membership fractions in K clusters.” The black lines separate individuals of different populations, and these populations are labeled below the figure, with their regional affiliations above it.
Figure 2.3
Representation of the first genome-wide analysis of global human genetic diversity with STRUCTURE. From Noah A. Rosenberg et al., “Genetic Structure of Human Populations,” Science 298, no. 5602 (2002): 2382. Reprinted with permission from AAAS.
This last point is significant, since the developers of STRUCTURE stressed that their program allows the clustering of samples without any “population information,” that is, without relying on any information about the sampled individuals’ affiliations to cultural and geographic groups.22 Nevertheless, after the individual samples have been clustered independently of population information, this information is reintroduced in order to visually arrange and label the individual data according to their prior assignment to cultural and geographic groups. In the diagram, the vertical lines representing the ancestry of the individual samples become hardly distinguishable. Instead, the predefined cultural groups become the salient unit of analysis and emerge as horizontal bars with a characteristic pattern of colors divided by black lines. Since this kind of representation is the de facto standard for visualizing STRUCTURE results in any kind of inquiry into the structure of human genetic variation, this rearrangement and labeling seems to subvert one of the crucial innovations of STRUCTURE: the purely inductive method of analyzing human genetic diversity without the use of population information. Rather, this kind of visualization reinforces the idea that human genetic variation is essentially structured by genetically distinct populations that largely match notions of ethnicity, nation, and religion.
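The arrangement step described above can be made concrete with a minimal sketch. It assumes a matrix of membership fractions (called Q here, with one row per individual and one column per cluster) of the kind a clustering program might output. The numbers, the group labels "A" and "B," and the function names are invented for illustration and are not taken from Rosenberg et al. or from STRUCTURE itself. The point is only that the labels play no role in producing Q, yet they determine the order and grouping in which Q is displayed.

```python
from collections import defaultdict

# Hypothetical membership fractions: one row per individual, K = 3 columns.
# All values and labels below are invented for illustration only.
Q = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
    [0.20, 0.70, 0.10],
]
# Predefined group labels; the clustering that produced Q never saw them.
labels = ["A", "B", "A", "B"]


def arrange_by_label(q_rows, group_labels):
    """Reorder individuals so that members of one predefined group sit side by side."""
    order = sorted(range(len(q_rows)), key=lambda i: group_labels[i])
    return [group_labels[i] for i in order], [q_rows[i] for i in order]


def group_means(q_rows, group_labels):
    """Average the membership fractions within each predefined group."""
    buckets = defaultdict(list)
    for row, label in zip(q_rows, group_labels):
        buckets[label].append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in buckets.items()
    }
```

Calling `arrange_by_label(Q, labels)` yields the rows in the order in which a labeled bar plot would draw them, and `group_means(Q, labels)` yields the per-group color pattern that the eye picks out from such a plot. Neither step changes the estimated fractions; both only change what the reader is invited to see.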
A single STRUCTURE bar plot appears to show no hierarchical or temporal order. Rather, the rainbow-colored diagram represents an ahistorical snapshot of human population genetic diversity, in which a patchwork of genetically reified cultural groups of varying scale comes to stand together on the same plane. But analyses with STRUCTURE usually do not stop at a single plot. Typically, such analyses proceed according to the principle of trial and error. First, one assumes that there are two clusters and lets the program compute how much of the ancestry of every individual “comes” from each cluster. Then one proceeds with three, four, five clusters and so on, until the optimal number of clusters is identified. Determining this “optimal number” of clusters consists in assessing how well the results of the different analyses fit the data. The inventors of ADMIXTURE, for instance, suggested that the choice of K (the number of clusters) should be guided by knowledge about the history of the predefined groups that are being analyzed.23 This strategy might be adequate in cases where ADMIXTURE is used for its originally stated purpose in medical genetics, that is, to mine for genetic variants associated with disease. Historical interpretations of ADMIXTURE analyses, however, are ultimately turned into circular arguments by the strategy of determining K based on historical knowledge.
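The trial-and-error search over K can be sketched in a few lines. The sketch assumes that each run has been scored by a measure of fit, here the cross-validation error that ADMIXTURE can report when run with its `--cv` option, and that the logs contain lines of the form `CV error (K=3): 0.4783`. That line format is an assumption based on the software manual, the error values below are invented, and the function names are hypothetical. Choosing the K with the lowest error is one common convention, not the only way in which K is chosen in practice.

```python
import re

# Assumed log line format, e.g. "CV error (K=3): 0.4783".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text):
    """Collect a {K: cross-validation error} mapping from concatenated log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Invented values for illustration; real runs would supply their own logs.
example_log = """\
CV error (K=2): 0.4819
CV error (K=3): 0.4783
CV error (K=4): 0.4824
"""
```

With these invented values, `best_k(parse_cv_errors(example_log))` selects K = 3. Nothing in this procedure says that three populations ever existed. It only says that, among the models tried, the one with three clusters predicted held-out genotypes best.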
The way Rosenberg and colleagues interpreted their bar charts exemplifies the role that the choice of the number of clusters may have in terms of conceiving of those clusters as “ancestral populations” that actually existed at some point in the past. One interpretation in the paper was derived from a single bar plot of five clusters.24 The authors suggested that this was basically congruent with a continental pattern, since the colors of the clusters align with the geographical designations at the top of the diagram: Africa is mainly orange, Europe mainly blue, East Asia pink, Oceania green, and America purple, whereas the Middle East, Central and South Asia emerge as mixes of their geographical neighbors. This interpretation blurs statistical clusters with the idea of distinct and relatively homogenous groups of a clear continental origin. It basically suggests that clusters can be understood in the sense of continental groups that align with common notions of “race.”
The second interpretation arose from all bar plots in combination. The authors observed that “at K = 2 the clusters were anchored by Africa and America, regions separated by a relatively large genetic distance.” They went on to state that “each increase in K split one of the clusters obtained with the previous value. At K = 5, clusters corresponded largely to major geographic regions.”25 Thus, by looking at several bar plots in succession, they arrived at a narrative of subsequent splitting along the lines of a human evolutionary history that is essentially shaped by the divergence of continental groups. These are the “ancestral populations” that—in line with the dominant model of admixture exemplified by the demographic history of African Americans—are supposed to have undergone a long period of divergence, followed by a relatively short period of convergence that created the current genetic mosaics.
Such simple interpretations of STRUCTURE bar plots seem to be common enough to have induced a group of data scientists and geneticists to publish a paper entitled “A Tutorial on How Not to Over-Interpret STRUCTURE and ADMIXTURE Bar Plots” in 2018.26 The authors simulated populations with specific patterns of allele frequencies, as they would theoretically form based on three historical scenarios, and then subjected these simulated populations to analysis with ADMIXTURE.27
The top row of figure 2.4 shows three different demographic scenarios involving four simulated populations represented by different colors (dark blue, black, dark green, and dark magenta). The lower part of the figure consists of the ADMIXTURE analyses for each of the respective simulated populations. The clusters to which the individuals of the simulated populations were thereby assigned are represented by lighter shades of the colors that are used to distinguish the populations in the upper part of the illustration (blue, light green, magenta).
Figure 2.4
“Three scenarios that give indistinguishable ADMIXTURE results.” Daniel J. Lawson, Lucy van Dorp, and Daniel Falush, “A Tutorial on How Not to Over-Interpret STRUCTURE and ADMIXTURE Bar Plots,” Nature Communications 9, no. 3258 (2018): 3.
This color scheme visually anticipates and emphasizes the controversial assumption that the simulated populations are mixtures of different pure ancestral populations.28 However, the illustration is actually intended to demonstrate that ADMIXTURE constructs nearly indistinguishable bar plots from the different scenarios. The first scenario represents a history qualitatively similar to the one that is assumed for African Americans, but with three instead of two “ancestral populations.” In the second scenario, we see admixture between a known population (P1) and an unknown population giving rise to a mixed population (P2) as well as two populations (P3 and P4) not involved in any kind of mixing. The third scenario does not involve processes of admixture at all.
For the first scenario, an interpretation that assumes the African American admixture model, and therefore interprets the patterns as representing the proportions to which each genome was inherited from different “ancestral populations,” seems adequate. However, this interpretation is misleading for the second and third scenarios. The simulations show that the patterns of an ADMIXTURE analysis might reflect the length of time that each population has evolved independently from the others, rather than the proportion of ancestry resulting from admixture. As the authors of the paper point out, these distortions occur because the algorithm attempts to find the combination of clusters and admixture proportions in the data that best supports a simple admixture model, regardless of whether that model is historically accurate.29
As we will see, such difficulties are exacerbated with the use of aDNA in ADMIXTURE analyses. Suffice it to say here that groups that contain fewer samples are likely to be represented as mixes of populations with a greater sample size, rather than being assigned to their own “ancestral population.” Even if the aDNA sample is older than the separation date of the modern populations with which it is compared, the ancient sample is typically represented as an admixture of the modern populations.30
Enter Ancient DNA: Mosaic and Trees
While, as we have seen, there is a certain tendency to correlate genetic clusters with “the five continents,” which gives the impression that human genetic variation neatly corresponds to these regional divisions, ADMIXTURE analyses also have a subversive potential. When one experiments with ADMIXTURE, it seems that the more fine-grained the analyses become, the more obscure the population structures get. In other words: the larger the number of clusters, the more “previously pure populations appear as admixed.” This tendency of “bursting populations” is reversed in the tree structure. An example of this can be found on a blog that features an ADMIXTURE analysis moving from K = 1 up to K = 15.31
In figure 2.5, the clusters, in this case fifteen, are not visualized in the way typical for such colored bar plots produced by STRUCTURE or ADMIXTURE (as shown in our figure 2.3), but we can see the 139 smaller populations that were studied and to which the 2,230 individuals whose DNA was analyzed belong. It is indeed a beautiful mosaic that evokes a cheerful picture of humankind in all colors of the rainbow. Nevertheless, the “ancestral populations” of which the current populations and individuals are taken to be mixtures are still inferred (and indicated by the fifteen different colors in the numbered key on the right side of the figure). This becomes apparent when the same data are used to generate a phylogenetic tree for the fifteen “original” or “ancestral” components or populations identified by ADMIXTURE. In this process, the admixture disappears, and we return to a diagram that creates a hierarchical order from the “Sub-Saharan Africans” to the “Siberians.” Instead of a human mosaic, in figure 2.6 we again see the diaspora in which the fifteen populations seem to have differentiated without converging. This is represented in the typical form with a first branch that separates “Africa” from “the rest,” and therefore not only ascribes originality to the continent and its people, but also marks them as the first “other.”32
Figure 2.5
Cluster diagram with K = 15. Dienekes Pontikos, “Human Genetic Variation: The First ? Components,” Dienekes’ Anthropology Blog, December 15, 2010, http://dienekes.blogspot.com/2010/12/human-genetic-variation-first.html.
Figure 2.6
Dendrogram of hierarchical clustering of the fifteen components. Dienekes Pontikos, “Human Genetic Variation: The First ? Components,” Dienekes’ Anthropology Blog, December 15, 2010, http://dienekes.blogspot.com/2010/12/human-genetic-variation-first.html.
It seems, therefore, that while admixture has become the center of attention, underneath it still lurks the conception of originally pure populations (each marked by a single color in figure 2.5) hierarchically ordered in a tree. The ancestral component called “Palaeoafrican” in the tree actually refers to the “Pygmies and San” (sic), which raises the question of how the entry of actual ancient DNA into these models affected diagrams of human diversity.33
With technical developments, such as next-generation sequencing and the increasing accessibility of ancient DNA in both quantity and time depth, the potential for aDNA research is said to have exploded from limited individual ancient genomes to the scale of population genomics.34 In this process, aDNA data are integrated into the models discussed in the previous section. However, as different numbers of individual DNA samples available across populations can misrepresent the relative frequencies of the ancestry components, clustering methods are susceptible to sampling biases. This is even more problematic when recent and ancient DNA samples are compared, because the former are generally overrepresented in relation to the latter. In fact, aDNA samples from different times and places may be pooled to increase the number of samples.35 Of greater concern to us is the issue that such clustering analyses do not rely on explicit population models and assume a defined number of ancestral populations. They therefore might not fit complex admixture histories, which become increasingly probable with the integration of aDNA data.
Figure 2.7, taken from a paper on “the genetics of the African continent,” shows such clustering analyses of modern DNA, on the one hand, and aDNA, on the other. The cluster diagrams of modern and ancient “African genomes”—among other visualizations—stand for complicated population histories within the continent and genetic exchange with Eurasia. At the same time, data so far seem to suggest an isolation-by-distance pattern for deep population history (hunter-gatherers [HG] before farming) and present-day hunter-gatherer groups.37 In a way then, today’s hunter-gatherers are seen as conserving the deep-past genetic order in the form of a tree spread across a continent, creating a simultaneity of the nonsimultaneous in their coexistence with more strongly admixed populations. To conceptualize this, we might think of an image such as an ancient tree covered by entangled mistletoe. However, the authors of the paper remind us:
Figure 2.7
Clustering analyses of modern DNA (left) and aDNA (right).36 Mário Vicente and Carina M. Schlebusch, “African Population History: An Ancient DNA Perspective,” Current Opinion in Genetics & Development 62 (2020): 10.
From European modern-day and subsequent aDNA studies, we have learnt that isolation-by-distance patterns can mask several large-scale movements and replacements in the distant past, especially if no un-admixed present day groups are remaining. Thus the deep history of Africa remains to be clarified, but aDNA studies already started to contribute inferences.38
This suggests that while the existence of unadmixed populations tends to be pushed further back in time, they remain the vanishing point: there once was a tree of unadmixed populations. On the basis of genomic data, the paper posits that the Khoe-San population can be considered the first to branch off from “the rest of us.” But the authors, Mário Vicente and Carina Schlebusch, again caution: “While hierarchical bifurcating tree models provide a good estimation of general relatedness between groups, it is certainly a simplified representation of human history, and more complex models incorporating gene-flow, migration and deep population structure need to be considered in future research.”39
Indeed, aDNA studies seem to be undermining the tree more and more, and at ever deeper segments. It has been suggested that some of the most groundbreaking and surprising findings of the last seven years have to do with ancient genetic introgression events, from extinct lineages to those lines that lead to modern human populations, as well as with evidence of ancient admixture between different archaic lineages.40 Neanderthals are estimated to have contributed 1.5–2.1 percent to present-day non-African genomes, and Denisovans seem to have contributed to the genomes of modern Papuans, Melanesians, Aboriginal Australians, and other Southeast Asian islanders, as well as, to a lesser degree, those found in mainland East and South Asia.41 Further research has suggested possible deep-rooting admixture by an ancient “ghost population”—genetic traces of unknown ancestors—indicating “ghost-modern admixture” in West Africa as well as large-scale Neanderthal admixture in African populations.42 So it turns out that even humans as a species cannot be thought of as a “pure entity.”
With new technologies, increasing data, and the so-called aDNA revolution, the once predominant model of human evolution as a tree on a map—a distant common origin with subsequent population splits and evolution of populations in relative isolation—seems to have tumbled. Intricate reticulate models now appear more likely, with complex migration and admixture histories not only for modern humans, but also for archaic humans—a history that involves interaction between as well as among populations of modern and archaic humans.
However, it should be noted that the idea that archaic and modern humans interbred whenever they met, which is driven by increasing reports of admixture, is contested.43 Furthermore, and more central to our concern, the diagrams that express this new knowledge of interaction between modern and archaic populations still tend to closely resemble trees. They suggest that admixture and gene flow are mostly imagined as having taken place in one event of short duration—an event that now creates a link between two branches and that is often referred to with terms such as “introgression event,” “admixture pulse,” and “episodic migrations.” Interbreeding thus appears as the great exception to the normal course of evolution-in-isolation. This is clearly evoked by a diagram from a 2014 paper that visualizes these short events, pulses, or episodes as darts between the branches of an overall tree structure:
However, we again find a discrepancy between text and image when the authors of the paper conclude: