One standout feature of human language is our ability to reference external objects and events with socially learned symbols, or words. Exploring the phylogenetic origins of this capacity is therefore key to a comprehensive understanding of the evolution of language. While non-human primates can produce vocalizations that refer to external objects in the environment, it is generally accepted that their acoustic structure is fixed and a product of arousal states. Indeed, it has been argued that the apparent lack of flexible control over the structure of referential vocalizations represents a key discontinuity with language. Here, we demonstrate vocal learning in the acoustic structure of referential food grunts in captive chimpanzees. We found that, following the integration of two groups of adult chimpanzees, the acoustic structure of referential food grunts produced for a specific food converged over 3 years. Acoustic convergence arose independently of preference for the food, and social network analyses indicated that it occurred only after strong affiliative relationships were established between the original subgroups. We argue that these data represent the first evidence of non-human animals actively modifying and socially learning the structure of a meaningful referential vocalization from conspecifics. Our findings indicate that primate referential call structure is not simply determined by arousal and that the socially learned nature of referential words in humans likely has ancient evolutionary origins.