The Firelight Niche: From Sign to Speech

If protolanguages began as largely gestural systems, why and how did vocalization become so important?
By: Ronald J. Planer and Kim Sterelny

In their book “From Signal to Symbol,” Ronald Planer and Kim Sterelny propose a novel theory of language: that modern language is the product of a long series of increasingly rich protolanguages evolving over the last two million years. Arguing that language and cognition coevolved, they give a central role to archaeological evidence and attempt to infer cognitive capacities on the basis of that evidence, which they link in turn to communicative capacities.

If protolanguages began as largely gestural systems, Planer and Sterelny ask in the excerpt from the book featured below, why and how did vocalization become so important? They meet that challenge through the idea of a “firelight niche” and the changed social and physical environments that came with the control of fire. The term is adapted from a phrase used by anthropologist Polly Wiessner in a 2014 article analyzing the fireside conversations of the Ju/’hoan (!Kung) Bushmen of southern Africa. In their view, selection for something like wordless singing and laughter led to improved vocal control. These behaviors helped to ease tensions and strengthen affiliative bonds as hominin social life became more complex and intense. With more vocal control available, the vocal channel offered various efficiencies, which were particularly salient at the fireside, in the firelight niche.
–The Editors

It is tempting to think that prior to the evolution of speech our mouths were more or less unoccupied, free to be recruited into communicative service. But reflection on the amount of time other great apes spend chewing per day tells a very different story.

Chimpanzees and orangutans, it is estimated, spend around 7 hours per day feeding, while gorillas spend some 8.8 hours per day on this activity. These numbers are all the more impressive given that great apes rise with the sun and go to bed at sunset, implying an effective day length of about 12 hours in low-latitude regions. In contrast, according to the most recent report from the Bureau of Labor Statistics, modern humans spend a little over an hour a day engaged in eating and drinking.

This article is adapted from Ronald Planer and Kim Sterelny’s book “From Signal to Symbol: The Evolution of Language.”

Anthropologist Richard Wrangham cites these contrasts between humans and great apes in feeding time in developing his hypothesis that humans are now, and have long been, dependent on processed foods. Indeed, in his view, humans are obligate consumers of cooked foods. (This is very plausible. As Wrangham nicely explains, “raw foodists” are in general quite unhealthy. One striking illustration of this is that many female raw foodists stop menstruating.) But he extends this back to our ancestor Homo erectus, arguing that the erectines regularly consumed cooked foods. As evidence, he cites erectines’ more gracile or slender craniofacial anatomy, less powerful jaws, and smaller teeth, together with other anatomical changes, for example a shorter gut and ribcage.

In 2012, Karina Fonseca-Azevedo and Suzana Herculano-Houzel tested this idea. Based on brain and body size, they estimated the amount of time various hominin species would have had to spend feeding, given a diet of only raw food. Their findings: Erectines would have had to spend 9 hours per day feeding, and later hominins 10. As the authors explain, it is a struggle for other apes to feed for 8 hours per day and still meet the other demands imposed by their lifeways (e.g., grooming, rest). For example, in periods of heavy rainfall, gorillas will halt eating temporarily, but if the rain continues for more than 2 hours, they resume feeding and make up for lost time by resting less on those days. This suggests that it would have been all but impossible for erectines to chew for 9 hours a day, given the much greater complexity of their social and economic worlds. These time costs might be reduced by eating more calorie-dense foods — specifically, meat. However, it is very unlikely that meat alone could reduce feeding times to feasible levels, given the unreliability of this resource and the fact that it takes a large time and energy investment to obtain.

Perhaps unexpectedly, this excursion into the prehistory of cooking is relevant to the evolution of speech. First, a morphology ideal for chewing and swallowing tough food is not ideal for controlled articulate speech, and vice versa. Making food softer and easier to fragment eases constraints on the mouth, jaw, and vocal tract imposed by the need to chew through tough food. Once food is processed, there is less of a tradeoff between a system optimized for chewing and swallowing and one optimized for speaking, though there is still some tradeoff.

In particular, the low position of our larynx is notorious for the increased risk of choking it poses, a cost that should not be underestimated: As of 2017, choking was the fourth leading cause of unintentional death by injury in the United States. This remains true despite widespread cultural awareness of choking risks and most people’s familiarity with the Heimlich maneuver. The risks to children are especially acute. It seems quite likely that cooked food poses a considerably lower risk of choking than does raw food, as it more easily fragments into smaller pieces. Perhaps that is even true of raw but pounded food. In any case, the longer the feeding time, the longer the exposure to the risk of choking, and cooking certainly reduces feeding time very significantly. Second, and in our view most importantly, cooking implies some control of fire, and we’ll argue in a moment that fire changes the social environment in ways that favor greater reliance on talking. Third and most obviously, if cooking dramatically reduces feeding time, it frees the mouth and its associated equipment for other activities. These might well include laughter, song, and even the use of the teeth to grip and hold soft materials as well as speech or its prequels.

The question remains, how much time was gained, and how critical was that time? That is hard to estimate: It depends both on the kinds of food processed and the type of processing — cooking versus pounding and/or grinding. Let’s begin with the supposition that early hominins mostly ate plants, as do many ethnographically known foragers.

If one recent study holds for those hominins — note that the study was performed on mice — pounding and grinding food before eating would cut feeding time roughly in half (raw, processed plant food is twice as energy-efficient as raw, unprocessed plant food). Cooking would cut the time by three-fourths. So processing plants without cooking suggests 4.5 hours of feeding time per day for erectines and 5 for later Homo. (Recall the estimates of Fonseca-Azevedo and Herculano-Houzel for eating times given a diet of raw, unprocessed foods: 9 hours per day spent feeding for erectines; 10 hours per day for later, bigger-brained hominins.) That would still be a lot of time spent chewing. However, if these hominins routinely cooked their food, then erectines and more recent hominins would feed for less than 3 hours per day. If meat were a substantial proportion of their diet, these estimates fall again, toward a figure that resembles modern feeding times. The bottom line: Any systematic processing saves significant time (given the huge mechanical advantage of tools over teeth). Cooking saves even more time, especially for those hominins eating largely plant-based diets.
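The arithmetic behind these estimates can be laid out in a few lines. The sketch below is only a restatement of the figures quoted above, not a model from the book: the function and variable names are hypothetical, and the two scaling factors (one half for pounding or grinding, one quarter for cooking) simply assume that the rough mouse-study ratios carry over to hominins, which the text itself flags as uncertain.

```python
# Baseline feeding times (hours per day) on a raw, unprocessed diet, as quoted
# in the text from Fonseca-Azevedo and Herculano-Houzel.
RAW_HOURS = {"erectines": 9.0, "later Homo": 10.0}

# Fraction of the raw feeding time that remains after each kind of processing.
# These are illustrative assumptions restating the passage's rough figures
# (drawn from a mouse study), not measured hominin values.
REMAINING_FRACTION = {
    "raw, unprocessed": 1.0,
    "pounded or ground": 0.5,  # feeding time cut roughly in half
    "cooked": 0.25,            # feeding time cut by three-fourths
}


def feeding_hours(raw_hours, fraction):
    """Scale a raw-diet feeding time by the fraction that remains."""
    return raw_hours * fraction


def feeding_table():
    """Return estimated feeding hours for each group and processing method."""
    return {
        group: {
            method: feeding_hours(hours, fraction)
            for method, fraction in REMAINING_FRACTION.items()
        }
        for group, hours in RAW_HOURS.items()
    }


if __name__ == "__main__":
    for group, row in feeding_table().items():
        for method, hours in row.items():
            print(f"{group:>10} | {method:<18} | {hours:.2f} h/day")
```

Under these assumptions the cooked-diet figures come out at 2.25 and 2.5 hours per day, which is where the "less than 3 hours" estimate comes from. A substantial meat component would lower every row further, but the passage gives no ratio for that, so it is left out of the sketch.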

How significant for the evolution of speech was the availability of this extra time? Again, that is difficult to estimate. If hominin communication was brief and staccato, long feeding times would not inhibit the use of the vocal channel. If communication involved extended bouts, then the time freed looks more important. This issue becomes central in the next section, where we focus on the role of laughter, song, and talk in social cohesion. These are not brief and staccato uses of the mouth, and we suggest they have ancient roots, with the erectines. If and to the extent that such communication was central to social bonding and defusing conflict, the relaxation of the feeding-time constraint mattered.

Laughter and Song

Just as we argue in an early chapter of our book that the cognitive capacities that made elaborated gesture possible evolved to control skilled action, we believe the capacities that made speech possible — precise top-down control of vocalization, and the capacities to attend, recognize, and match others’ vocalizations — primarily evolved for song, or song’s evolutionary antecedents. They were then co-opted in the evolutionary transformation of multimodal communication, as the vocal component became more prominent and its role changed.

It is indeed possible that the breath control on which laughter now depends evolved for singing (or speech). But we find the suggestion that laughter was an early and important mechanism of social bonding plausible. For as the anthropologist Robin Dunbar, whose work we lean on for our own purposes here, points out, we share laughter with the other great apes. However, our laughter differs from theirs in an important way, for our laughter is associated exclusively with exhaling. In contrast, great apes repeatedly inhale and exhale while laughing, not unlike panting. This difference is due to differences in breath control. Dunbar suggests that the release of endorphins associated with human laughter is likely a response to the bodily stress we experience during laughter (specifically, stress to the diaphragm and the chest muscles, coupled with temporary oxygen deprivation). In triggering the release of endorphins, the neuroendocrine effects of laughter resemble those of grooming, and so in many contexts laughter would have the same effect as grooming. This is important, as laughter is more efficient. Only the individual who is being groomed tends to produce endorphins, whereas laughter produces endorphins in all who laugh.

If laughter bonds more efficiently than grooming, singing is more efficient still. The idea that song was a precursor to human speech has a long history; Darwin himself proposed such a view. But it has typically been developed in the context of sexual selection, with males singing to impress females. Dunbar’s account is distinctive in suggesting that singing functioned primarily to promote group cohesion. (To the best of our knowledge, this theory dates back to 1993.) As Dunbar explains, singing also triggers the release of endorphins. But in addition, when performed in a group context, it tends to produce feelings of belongingness. Moreover, as the ethnographic record shows, people frequently sing in groups much larger than three; indeed, an entire band might sing together. Our view differs from both Dunbar’s and Darwin’s in that we think that while singing is a precursor to speech, it is not a precursor to language. Rather, singing helps explain the transition from gesture dominance to speech dominance.

The idea that (wordless) singing is a precursor to speech has important explanatory advantages. First, as pointed out by more or less everybody, singing and speech depend on many of the same features of our vocal tract and breath control. Hence, if there is reason to suspect singing evolved earlier in the hominin lineage, its evolution explains, or partially explains, the evolution of those features in our line. Second, in sharp contrast to laughing, singing is a prototypically voluntary activity. When we sing, we in general intend to do so, and we are always exercising some degree of top-down control over our voices. Thus, the demands of singing can help to explain the elaboration of the neocortical pathway involved in human speech. Third, to produce the feelings of belongingness to a group, it is important that individuals coordinate their singing. It must be experienced as a joint activity. While this might take the form of different parties singing different parts (imagine a kind of call-and-response theme), the simplest format would be for people to simply sing in unison. But that requires the ability to match others’ vocalizations — in other words, vocal imitation, and, more generally, intentional listening to others.

Hominins who can sing, albeit wordlessly, and who can use a reasonably rich, largely gestural protolanguage, though with some vocal elements, are poised to make a transition to a primarily vocal system, if selection favors such a transition. To the extent that earlier systems are multimodal, they are already primed with the realization that sound can carry or modulate meaning. They have voluntary control over sound production that suffices for reasonably precise and extended sequences. They have the cognitive tools to recognize, respond to, and learn to reproduce others’ sound sequences. It is true that such a shift surrenders the advantages of iconicity. But gestures in regular use become stylized and conventionalized; even chimp gestures do. Once gestural protolanguages are well established, their stability does not depend on iconicity, though presumably it remained and remains an advantage in coining new signs.

The Firelight Niche

At some stage in our evolutionary history there was a transition to a primarily vocal mode. After all, talking offers some obvious advantages over signing. It allows us to freely use our hands for other tasks while still communicating. That is particularly important when we are coordinating effort in some collective manual task, like shifting a heavy or awkward object through a cluttered environment. It enables us to communicate over longer distances and in the dark. It is immediately attention-grabbing. It is less physically demanding, and it allows us to fully visually attend to and act on our environment while listening and talking. It is thus no surprise that humans everywhere primarily use speech rather than gesture to linguistically communicate unless forced to do otherwise. (In hunting, for example, silence is often important.) However, it is not enough to merely list those advantages. We need to detail a context in which those advantages were salient enough to drive a transition. We have already hinted at our pick for that context: the control of fire.

The control of fire provided a source of light after the sun went down. Light is the ordinary means by which animals set their daily routines. In great apes, the sleep-wakefulness cycle is controlled by the secretion of melatonin by the pineal gland, which induces sleep. The visual perception of light inhibits the production of melatonin in the pineal gland, and hence promotes wakefulness. Based on humans’ current sleep needs, the control of fire would have on average extended the effective day length for hominins by as much as four hours in the tropics and even longer in winter in more northerly latitudes. But firelight is both dim and spatially confined: a patch of relative brightness in a dark world. That limits the productive uses of time. After food was cooked, and a few tools were made or repaired, there would not have been much left to do other than socialize. Fire reduces the opportunity costs of social intercourse, while making it less optional. Without the extremely expensive option of everyone building their own fire, enjoying light and warmth enforces proximity. Cooking fires can be a small patch of hot embers, but light and warmth require more substantial fires.

Almost certainly, our ancestors came to control fire only gradually. Fire would have been a regular and naturally occurring part of some hominin habitats. Fongoli chimpanzees inhabit a more open and arid environment than is typical for their species. These chimps are intrigued by fire, show an understanding of its dynamics on the landscape, and target burned areas for foraging after fire passes. Early hominins likely acted similarly. This suggests that the first stage of hominin fire use was increased opportunistic use of wildfire. They might have actively followed wildfires, exploiting the presence of naturally cooked plant and animal foods, as researchers have suggested. They may have spread existing fire, as some raptors do. Animals fleeing from fire might have been hunted. Food might have been dropped into small, lingering pockets of fire or hot ash to cook it or cook it further. Natural cooking very likely scaffolded more intentional cooking.

Increased opportunistic use of fire would have led to an enhanced understanding of fire. Harvesting wildfire and keeping it going would have been a natural next step. Fire maintenance is no trivial feat, however, and it is likely that it took hominins a long time to master this skill. Indeed, as Terrence Twomey has persuasively argued, fire maintenance imposes a range of cooperative, cognitive, and motivational challenges. To keep a fire going, we must constantly provision it. A working knowledge of what can be used as fuel, and where it can be sourced, is thus required. Hominins would have had to detach from the present, putting on hold current desires, to tend to this need. Gathering fuel would have been difficult and risky during the night. So it would often be important to collect fuel well before it was needed. Moreover, fire must be protected from rain and wind. Fire would have been particularly vulnerable when being transported from one location to another. The demands of fire on hominins would have been even more intense prior to the discovery of ignition technology. For one could not simply make fire on an as-needed basis, or recover it when it had accidentally gone out.

However, as Haim Ofek has noted, fire also creates a natural opportunity for the division of labor. The less mobile or skilled can still locate firewood, bring it to camp, and tend an existing fire. Once a fire is in place and established, prior to the control of ignition it provides a focal point in the landscape and a reason (and perhaps a beacon) to return before nightfall. If Wrangham and others are right to argue that cooking is an ancient, erectine innovation, then a partial control of fire was one of the factors, together with reproductive cooperation, that rewarded central-place foraging. Moreover, fire-keeping is necessarily fallible, so if hominins really did have stable access to fire during this period but without ignition capacity, that suggests a network of firekeepers willing to exchange fire with one another. If residential groups were mutually suspicious in the way chimp bands are, this form of cooperation would be difficult to establish. Importantly though, once established it is intrinsically stable. Reciprocation is direct; help is cheap to give but very valuable to receive; and the flow of benefit is symmetrical. More ambitious uses of fire, as a tool for landscape management, are more cognitively demanding still.

The prehistory of ignition technology is probably quite complex, with this technology being discovered at different times in different places. It is also possible that it was lost only to be rediscovered at a later date. For example, in his recent review of the current state of the cooking hypothesis, Richard Wrangham acknowledges that there are surprisingly recent Neanderthal sites (within the last 200,000 years) with no evidence of fire. As always, we must be cautious: Sites are never entirely preserved. But especially when residential groups were thinly scattered without regular connection, technological loss would be no surprise. Suffice it to say that our ancestors likely controlled fire for an extended period of time before ignition technology was widely possessed, though perhaps imperfectly and with losses. Once the advance to ignition technology was made, fire use would presumably have begun to resemble its role in modern forager society.

In our view, the control of fire was a major — if not the major — driver of the gesture-speech transition. We are not the first ones to link the control of fire with language evolution. Dunbar and Gowlett, for example, argue that human language was created by the fireside, using the increased budget of social time the control of fire made possible. Dunbar sees language as the final development in hominins’ strategy for meeting the demands of social life. It allowed us to bond over the telling of stories and jokes. According to him, language evolved out of nighttime singing “by the very short additional step of mapping meaning onto sound.” Short step indeed! We do not agree.

As we argue in the book, quite sophisticated communicative capacities were needed to build erectine and heidelbergensian social worlds (referring to Homo heidelbergensis, the last common ancestor of humans and Neanderthals, arriving on the scene around 750,000 years ago). That includes the firelight world. As with other aspects of the middle Pleistocene economy, the control of fire is demanding. It poses a range of coordination and cooperation problems. Solving these problems would not have required modern language. But a communication system more powerful than anything we see in other apes was probably necessary. Roles needed to be delegated (who will watch the fire? who will retrieve the fuel? who will carry the embers with us?), plans made (shall we gather fuel now or later? what should we burn?), real-time instructions given (add this there! no, don’t add that yet!). These hominins probably needed cultural learning supported by communication to transmit the requisite natural history knowledge of fire and fuel. We doubt that our ancestors could have controlled fire, or managed other aspects of their economic life, without some significant progress toward language.

In the firelight niche, the effective communicative universe would have been initially much constrained. The visual world is smaller, less varied, less well illuminated, and fewer joint activities are in train. Objects that were present to hand (e.g., food items, tools) would have been available as obvious referents of communicative acts. Objects or persons located elsewhere and elsewhen (e.g., the stranger one saw today) would not. For those already fluent in displaced reference, objects, kinds, and places outside the shared visual field were feasible targets. But displaced reference was presumably not one of the first linguistic features to evolve, and it is difficult to imagine it evolving de novo around the fire.

However, while the fireside niche is an unpromising environment for inventing language, for those already competently using a multimodal but primarily gestural protolanguage, that niche would reward a shift toward greater reliance on talking. For example, if voice added redundancy to gesture in a mixed-modality system, the importance of voice could now increase, selecting for more readily distinguished vocal accompaniments to specific gestures. Likewise, many competencies that evolved in the use of gesture would extend smoothly to a more vocal system. Turn-taking, for example, need not evolve anew. To the extent that enhanced working and semantic memory, enhanced theory of mind, and the capacity to understand displaced reference had already evolved to support gesture, these could also support a more vocal system. The control of fire then becomes an explanation for the gesture-speech transition rather than for the evolution of language itself.

As Dunbar notes in passing, “gesture is difficult to make out across the half-light of the fireside, but spoken language carries far beyond from one hearth to the next.” Vocalization is very clearly superior in the dim light of the fireside. This is especially true if these fires were small and in the open air. Fire maintenance in the absence of ignition technology may well have been associated with smaller fires, as this would have conserved fuel, an important consideration if they have to be permanently alight. Dunbar is right that under such conditions “gesture is difficult to make out.” Difficult enough that one might expect vocal elements to start to play an increasingly significant role in communication: perhaps to attract attention, perhaps to make displays easier to recognize and individuate. Those sitting right next to the fire would be able to converse using gesture, but it would be very hard for an individual at the periphery to be understood, especially if the visual world was not just smaller but more crowded, with more distant individuals partially occluded by closer ones. They might use vocalization more. It is also true, as Dunbar notes, that vocalization would allow communication between parties at different hearths. We suspect that the keeping of multiple fires might be a relatively recent habit, made possible by the control of ignition, given that without the control of ignition, fires must be maintained longer, increasing demands on fuel, but perhaps we are wrong. In any case, vocalization might still be used to communicate with an individual who has temporarily wandered away from the fire, to retrieve more fuel or to relieve themselves; here vocalization might serve as a kind of contact call. The more general advantages of speech over sign are valid in the fireside context: Hominins could talk or listen while they were using their hands for other activities.

All this said, while gesture is more difficult to make out in firelight, it is not impossible. A gesturing agent can increase the amplitude of her signs; repeat them; slow them down. That is important, because it shows that the transition to a more vocal mode could be gradual and undisruptive. An established, mostly gestural protolanguage can still be used by firelight, just less efficiently. Firelight conditions are not optimal for gestural communication, but they certainly do not render it useless. We still gesture around the campfire at night. That allows gestural and vocal elements to work together in a transition. In order for arbitrary sounds to inherit meaning from co-produced gestures, those gestures must remain visible. Once an increased tendency to vocalize developed around the fire, perhaps at first paired with a sign that it helped disambiguate, increased vocalization could carry over into daytime hours, further reinforcing any gesture-vocal pairing and serving as a bridge to a more purely vocal system.

In sum, we think Dunbar and Gowlett are certainly right to draw attention to the firelight niche and its importance. It is difficult to overstate the consequences of the control of fire on hominin bodies, brains, and social lives. But unlike Dunbar and Gowlett, we call upon the control of fire to explain the gesture-speech transition, rather than the initial appearance of language.

The Archaeological Evidence

We would be in trouble if the archaeological and genetic evidence of the control of fire, and the paleoanthropological dates for the capacities for speech, were wildly out of sync. Fortunately, these two fit together quite nicely. That said, dating the control of fire is not easy. In open-air environments, it can be very hard to distinguish naturally occurring fire from anthropogenic fire (fire created or maintained by hominins). To attempt to distinguish these, we must look to the broader archaeological context.

Artifacts or cut-marked bones in the same strata as burned material are suggestive of anthropogenic fire. And if any of these items are burned, we can infer their co-presence with fire. However, these items might have been discarded by hominins at an earlier time and burned by wildfire only later. Hence, it is necessary to look at the distribution and ratio of burned to nonburned items on the landscape. If, for example, only a small portion of on-site flakes or bones show signs of burning, then that suggests a small, contained fire, which in turn suggests anthropogenic fire. Micromorphological clues, such as the magnetic properties of burned sediments, can also provide information about the temperature and duration of fire. This is important, as anthropogenic fires tend to burn hotter for longer in a single spot. The interpretation of such contextual clues is complex, and there is room for disagreement. Hence, many proposed anthropogenic fire sites remain contentious.

Other clues allow us to infer anthropogenic fire with greater certainty. Flints whose functionality depends on hafting provide strong evidence of anthropogenic fire, for hafting usually requires adhesives, which are made through the controlled application of heat. Direct evidence of such adhesives is stronger evidence still. However, these technologies doubtless took time to develop even after the control of fire, and hence would not be present at the earliest sites containing domestic fires. So early claims of anthropogenic fire are apt to remain the most contentious. Moreover, direct traces of burning are few and fragile, creating a significant preservation bias. Stone tools persist for far longer than charcoal in most environments. Anthropogenic fire would have also occurred at lower frequencies than toolmaking, creating an additional sampling bias. With early control of fire, the absence of evidence is certainly not evidence of absence. Bearing this in mind, what does the fire record look like?

Cross section of burned Olea europaea subsp. oleaster (wild olive) specimen. The presence of burned seeds, wood, and flint at the Acheulian site of Gesher Benot Ya’aqov in Israel is suggestive of the control of fire by humans nearly 790,000 years ago.

A standard interpretation of the record runs as follows. To the extent that the earliest fire sites (more than 1 million years old) reflect hominin use of fire, this use was opportunistic. These hominins might have exploited naturally occurring fires, and even done so regularly, but they probably did not have the capacity to harvest, move, and keep naturally occurring fires going for indefinite stretches of time. The change we see in the record between 1 million and 500,000 years ago may well signal the arrival of control. The earliest hominins during this period were probably more novice than expert when it came to navigating the challenges of the control of fire. Thus, we would expect a gradual intensification of the role of fire in the lifeways of mid-Pleistocene Homo. The evidence of fire use at Gesher Benot Ya’aqov suggests hominins who were at the very least proficient keepers of fire. Finally, the trends we see in the fire record from 400,000 years ago onward presumably reflect some combination of ignition technology and the more regular survival of these more recent traces (though it is possible that enhanced fire maintenance contributed as well).

Given this timeline, our account of the gesture-speech transition predicts that mid-Pleistocene Homo experienced selection for enhanced vocal communication. In particular, we would expect to see changes in Homo heidelbergensis. How does this prediction fare? In assessing it, we have two main sources of evidence to go on: fossil and genetic. There is mixed opinion regarding the potential for fossil evidence to tell us about the evolution of speech. Historically, proposed indicators have included cranial shape and dimensions, the size of the hypoglossal canal, the angle of the base of the skull, the morphology of the hyoid bone, and the width of the thoracic vertebral canal. Virtually all contemporary discussion revolves around the last two, the others having been dismissed as too unreliable.

Among other things, the hyoid bone supports the larynx. It is widely accepted that the Neanderthal hyoid is essentially humanlike. This suggests that the hyoid had already evolved into its modern form in heidelbergensians, dating from 600,000 to 200,000 years ago. Fossil remains from the Sima de los Huesos site in Spain are consistent with this. Two heidelbergensian hyoid bones have been recovered at this site and both show an essentially modern morphology. Some doubt whether the hyoid bone provides much information about speech abilities; in particular, it has been criticized as an indicator of the location of the larynx in the vocal tract. Skeptics point out that there is no change in hyoid morphology over the course of human development, despite its pronounced descent. However, it is uncontentious that the structure of all Neanderthal and heidelbergensian hyoids reveals the loss of the air sacs early hominins shared with Pan, a clear step in the direction of modern speech anatomy.

An upgrade in speech capacities predicts corresponding changes in auditory apparatus; there is little point in producing a range of new vocalizations unless these sounds can be discriminated by hearers. In this area evidence is limited, but the evidence we do have paints a clear picture. Fossil reconstructions of the external and middle ear in Homo heidelbergensis, again from the Sima de los Huesos site in Spain, suggest essentially modern auditory capacities. In particular, the reconstructed anatomy of the hearing canal implies increased sensitivity around 4 kHz compared to chimpanzees and other great apes. This inference is further substantiated by evidence from the Middle Paleolithic allowing us to compare Neanderthal and modern human ear ossicles (middle ear bones). This evidence shows that the morphological properties of the former fell within the modern human range. Thus, it would appear that auditory capacities capable of registering modern speech were in place before the split between Neanderthals and humans, and inherited by both lineages from an older ancestor.

What about the genetic evidence? DNA comparisons between Neanderthals and modern humans have revealed a striking degree of similarity. Researchers have emphasized one important general point: given the size of the human genome, there are rather few differences among sapiens, Neanderthals, and Denisovans. We are genetically very similar to our Neanderthal and Denisovan siblings and, by inference, to our common heidelbergensis ancestor.

Much attention has been paid to the gene designated FOXP2 in particular. It was once thought that FOXP2 was critical to human syntactic capacities, and hence it was dubbed the “language gene.” The more recent understanding is that it makes possible precise vocal control. The coding region of FOXP2 is identical in humans and Neanderthals. However, there is a difference in one of the upstream binding sites. It might be tempting to think that this difference is relevant to speech. But as many as 10 percent of modern Africans show the archaic form of the gene at this position, and so, clearly, this mutation is not required for modern speech.

In sum, while the genomes of modern humans and Neanderthals are strikingly similar, a number of changes have occurred in our line since our split with the Neanderthals some half a million years ago. Of these changes, a small number may be implicated in the development of human speech capacities. But on the whole, there is little to suggest a major architectural difference between human and Neanderthal vocal tracts or neuroanatomical structures relevant to speech. Rather, these differences appear more like “finishing touches” to an already sophisticated system supporting speech production and perception capacities, a conclusion further supported by the fossil evidence mentioned above. Moreover, differences in sapiens speech compared to that of Neanderthals need not mean that Neanderthal capacities were limited or impaired compared to those of sapiens. All things considered, the evidence suggests that the capacities for speech were present in the mid-Pleistocene ancestor of humans and Neanderthals. The capacity for speech was probably available to these hominins, the heidelbergensians, as they settled into the fireside niche.

Thus, though the dates are far from iron-clad, and the evidence remains fragmentary and open to interpretation, both the archaeological and genetic evidence fits with our account of the gesture-speech transition. Are we to conclude from this that Neanderthals, and before them heidelbergensians, possessed an essentially modern form of spoken language? Not quite. Rather, it shows that they very likely possessed the requisite vocal and auditory machinery to support such a communication system. They may well have had the requisite cognitive capacities too. They were probably language-ready and speech-ready. But in our view, the full suite of features that mark out modern language as a distinct type of communication system relative to earlier protolanguages probably did not emerge until sometime in the last 200,000 years. Modern language, after all, depends on features of the social environment, not just on individual cognitive or structural machinery.

Ronald Planer carried out postdoctoral research at the Australian National University from 2015 to 2020 in the School of Philosophy and is currently a Research Affiliate of the Australian Research Council’s Centre of Excellence for the Dynamics of Language and Postdoctoral Fellow in the Evolution of Language in the School of Languages and Linguistics at Melbourne University.

Kim Sterelny is Professor of Philosophy at the Australian National University. He is the coauthor of “Language and Reality: An Introduction to the Philosophy of Language” and the author of, among other books, “The Evolved Apprentice: How Evolution Made Humans Unique.”

Planer and Sterelny are the authors of “From Signal to Symbol,” from which this article is adapted.
