SIGTYP 2021 Shared Task
In 2021, SIGTYP is hosting a shared task on predicting language IDs from speech. Although language ID is a fundamental speech and language processing task, it remains challenging, especially beyond the small set of languages that past evaluations have focused on. Further, for many low-resource and endangered languages, only single-speaker recordings may be available, which calls for language ID systems that are invariant to domain and speaker.
We selected 16 languages from across the world, some of which share phonological features, and others in which these features have been lost or gained due to language contact. With them we perform what we call robust language ID: systems will be trained on largely single-speaker speech from one domain, but evaluated on data from other domains, recorded from different speakers under different recording conditions, mimicking more realistic low-resource scenarios.
For training models, we provide participants with speech data from the CMU Wilderness Dataset, which contains read speech from the Bible in 699 languages, usually recorded from a single speaker. This training data will be released in the form of derived MFCCs; please contact the organizers if you want to use other features.
The evaluation will be conducted on data from different sources, in particular data from the Common Voice project, several OpenSLR corpora (SLR24, SLR35, SLR36, SLR64, SLR66, SLR79), and the Paradisec collection, testing systems’ capacity to generalize to new domains, new speakers, and new recording settings. We will also use these data sources to give participants validation data in all 16 languages to test their systems.
Please see the README in our data release below for more details about the specific languages and exact data size.
Participants will be invited to describe their system in a paper for the SIGTYP workshop proceedings. The task organizers will write an overview paper that describes the task, summarizes the different approaches taken, and analyzes their results.
Important Links
↣ Download data! Google Drive or OneDrive
Important Dates
Training Data Release: ↣ 1 February 2021
Test Data Release: ↣ 15 March 2021
Submissions Due: ↣ 31 March 2021 (AoE)
Notification: ↣ 15 April 2021
Camera-ready Due: ↣ 26 April 2021
Workshop: ↣ 10 June 2021
Subtasks
The shared task will consist of two settings (subtasks):
1) Constrained: only provided training data can be employed.
2) Unconstrained: training data can be extended with any publicly available source of information (e.g. additional speech or typological features).
Use of Common Voice or corpora hosted on OpenSLR for training is disallowed for all submissions.
Languages
ISO 639-3 code | Wilderness code | Language name | Genus | Family | Macroarea | # Training Utts |
---|---|---|---|---|---|---|
kab | KABCEB | Kabyle | Berber | Afro-Asiatic | Africa | 4000 |
ind | INZTSI | Indonesian | Malayo-Sumbawan | Austronesian | Papunesia | 4000 |
sun | SUNIBS | Sundanese | Malayo-Sumbawan | Austronesian | Papunesia | 4000 |
jav | JAVNRF | Javanese | Javanese | Austronesian | Papunesia | 4000 |
eus | EUSEAB | Euskara | Basque | Basque | Eurasia | 4000 |
tam | TCVWTC | Tamil | Southern Dravidian | Dravidian | Eurasia | 4000 |
kan | ERVWTC | Kannada | Southern Dravidian | Dravidian | Eurasia | 4000 |
tel | TCWWTC | Telugu | South-Central Dravidian | Dravidian | Eurasia | 4000 |
hin | HNDSKV | Hindi | Indic | Indo-European | Eurasia | 4000 |
por | PORARA | Portuguese | Romance | Indo-European | Eurasia | 4000 |
rus | RUSS76 | Russian | Slavic | Indo-European | Eurasia | 4000 |
eng | EN1NIV | English | Germanic | Indo-European | Eurasia | 4000 |
mar | MARWTC | Marathi | Indic | Indo-European | Eurasia | 4000 |
tha | THATSV | Thai | Kam-Tai | Tai-Kadai | Eurasia | 4000 |
iba | IBATIV | Iban | Malayo-Sumbawan | Austronesian | Papunesia | 4000 |
cnh | CNHBSM | Chin, Hakha | Kuki-Chin | Sino-Tibetan | Eurasia | 4000 |
* Note: the family and genus for Hakha Chin were initially listed incorrectly. Please follow this table.
Submission
Submissions should be emailed to the organizers by the end of the day on 31 March 2021, Anywhere on Earth (AoE).
Submissions should follow the format of the training and validation label files, with tab-separated file ids and ISO 639-3 codes.
Files should be named as {team name}_{unconstrained/constrained} to indicate the subtask.
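The format and naming rules above can be sketched in Python as follows. This is only an illustrative sketch, not an official submission tool: the function names, the column order (file id first, then code), the absence of a header row, and the absence of a file extension are all assumptions, so please check them against the provided training and validation label files.

```python
from pathlib import Path
from typing import Mapping

# ISO 639-3 codes of the 16 task languages, taken from the Languages table.
TASK_LANGUAGES = {
    "kab", "ind", "sun", "jav", "eus", "tam", "kan", "tel",
    "hin", "por", "rus", "eng", "mar", "tha", "iba", "cnh",
}
SUBTASKS = {"constrained", "unconstrained"}


def submission_name(team_name: str, subtask: str) -> str:
    """Build a file name of the form {team name}_{unconstrained/constrained}."""
    if subtask not in SUBTASKS:
        raise ValueError(f"unknown subtask: {subtask!r}")
    return f"{team_name}_{subtask}"


def format_submission(predictions: Mapping[str, str]) -> str:
    """Render predictions as tab-separated 'file id <TAB> ISO 639-3 code' lines.

    Assumption: one prediction per line, file id first, no header row.
    """
    unknown = sorted({c for c in predictions.values() if c not in TASK_LANGUAGES})
    if unknown:
        raise ValueError(f"codes outside the 16 task languages: {unknown}")
    # Sort by file id so the output is deterministic.
    return "".join(f"{fid}\t{code}\n" for fid, code in sorted(predictions.items()))


def write_submission(predictions: Mapping[str, str], team_name: str,
                     subtask: str, out_dir: str = ".") -> Path:
    """Write the label file into out_dir and return its path."""
    path = Path(out_dir) / submission_name(team_name, subtask)
    path.write_text(format_submission(predictions), encoding="utf-8")
    return path
```

For example, `write_submission({"utt_0001": "kab"}, "myteam", "constrained")` would create a file named `myteam_constrained` in the current directory; the file id `utt_0001` and the team name are hypothetical placeholders.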
Description Papers
Papers describing shared task submissions should consist of 4 to 8 pages of content plus additional pages of references, formatted according to the NAACL 2021 format guidelines. For shared task paper submissions, it is not necessary to blind the team name and authors. Accepted papers will be published online in the NAACL 2021 proceedings and presented virtually at the SIGTYP workshop at NAACL 2021. Writeups should be submitted through softconf (link to come), and are due by 31 March 2021, 11:59 pm (UTC-12).
Organizers
Elizabeth Salesky | Ekaterina Vylomova | Sabrina Mielke | Gabriella Lapesa | Edoardo Ponti |
---|---|---|---|---|
Elena Klyachko | Oleg Serikov | Ritesh Kumar | Ryan Cotterell | Badr Abdullah |
Contact
Please contact sigtyp AT gmail DOT com if you have any questions.