In 2022, SIGTYP is hosting a shared task on Prediction of Cognate Reflexes. In historical-comparative linguistics, scholars typically assemble words from related languages into cognate sets. In contrast to the notion of cognates in didactics and synchronic NLP applications, cognate words (the members of a cognate set) are commonly assumed to share a common origin regardless of their meaning. In addition, cognate sets should not contain borrowed words. Cognate words typically show so-called regular sound correspondences. This means that one can define a mapping between the phoneme systems of the individual languages. Thus, English t typically corresponds to German ts (compare ten vs. zehn), and English d corresponds to German t (compare dove vs. Taube). The mappings often depend on contextual conditions and may differ depending on the position in which a sound occurs in a word. Because sound correspondences are regular, linguists can often predict fairly well how the cognate counterpart of a word in one language might sound in another language. However, prediction by linguists rarely takes only one language pair into account. The more reflexes (counterparts) a cognate set has in different languages, the easier it is to predict reflexes in individual languages.
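To make the idea of a position-sensitive mapping concrete, here is a minimal sketch in Python. It is not part of the shared task materials: the rule table, the two position labels ("initial" and "other"), and the function name are assumptions chosen for illustration, and the table only encodes the two English-German examples mentioned above. Vowels and all other sounds are passed through unchanged, so the output is not a real German form.

```python
# Toy correspondence table: (source segment, position) -> target segment.
# Positions are "initial" or "other"; all entries are illustrative only.
TOY_RULES = {
    ("t", "initial"): "ts",  # English ten ~ German zehn
    ("d", "initial"): "t",   # English dove ~ German Taube
}


def predict_segments(segments, rules):
    """Map each source segment to a target segment.

    Segments without a matching rule are copied unchanged.
    """
    predicted = []
    for index, segment in enumerate(segments):
        position = "initial" if index == 0 else "other"
        predicted.append(rules.get((segment, position), segment))
    return predicted


# A segmented English word, with one list item per sound unit.
print(predict_segments(["t", "ɛ", "n"], TOY_RULES))
```

A realistic system would need much richer contexts than word position and would draw on reflexes from several languages at once, as described above.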
Our data is taken from the Lexibank repository, which offers wordlists from 100 standardized datasets (List et al. 2021). In the repository, a larger collection of datasets comes with cognate sets provided by experts and with phonetic transcriptions that were standardized by the Lexibank team. Our development data, which users should use to design and test their models, consists of 10 CLDF datasets of varying size, language family, and time depth.
The expected prediction result for a given reflex is a list of phonetic transcription symbols (we segment all words in our CLDF datasets into sound units). This prediction can be directly compared against the attested form, which was removed from the data when training the model.
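As an illustration of how a predicted segment list could be compared against the attested form, the following sketch computes an edit distance over sound units rather than characters. The choice of metric and the function names are assumptions made for this example; the text above does not state which measure is used for evaluation.

```python
def edit_distance(predicted, attested):
    """Levenshtein distance between two lists of sound segments."""
    # previous[j] holds the distance between the predicted prefix
    # processed so far and the first j attested segments.
    previous = list(range(len(attested) + 1))
    for i, pred_seg in enumerate(predicted, start=1):
        current = [i]
        for j, att_seg in enumerate(attested, start=1):
            cost = 0 if pred_seg == att_seg else 1
            current.append(min(
                previous[j] + 1,         # delete a predicted segment
                current[j - 1] + 1,      # insert an attested segment
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted, attested):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(predicted), len(attested))
    return edit_distance(predicted, attested) / longest if longest else 0.0


# Multi-character units such as "ts" and "eː" each count as one segment.
print(edit_distance(["ts", "e", "n"], ["ts", "eː", "n"]))
```

Working on lists of segments rather than raw strings means that an affricate or a long vowel counts as a single unit, which matches the segmented representation described above.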
Participants will be invited to describe their system in a paper for the SIGTYP workshop proceedings. The task organizers will write an overview paper that describes the task, summarizes the different approaches taken, and analyzes their results.
Training data release: 21 February 2022
Test data release: 11 April 2022
Submissions due: 25 April 2022
System descriptions due: 13 May 2022
Camera-ready due: 20 May 2022
Please contact https://github.com/LinguList if you have any questions.