SIGTYP 2021 Shared Task


In 2021, SIGTYP is hosting a shared task on predicting language IDs from speech. Although language ID is a fundamental speech and language processing task, it remains challenging, especially beyond the small set of languages that past evaluations have focused on. Further, for many low-resource and endangered languages, only single-speaker recordings may be available, which calls for language ID systems that are invariant to domain and speaker.

We selected 16 languages from across the world, some of which share phonological features, and others where these have been lost or gained due to language contact, to perform what we call robust language ID: systems will be trained on largely single-speaker speech from one domain, but evaluated on data in other domains recorded from speakers under different recording circumstances, mimicking more realistic low-resource scenarios.

For training models, we provide participants with speech data from the CMU Wilderness Dataset, which contains read speech from the Bible in 699 languages, though usually recorded from a single speaker. This training data will be released in the form of derived MFCCs. Please contact the organizers if you want to use other features.

The evaluation will be conducted on data from different sources, in particular data from the Common Voice project, several OpenSLR corpora (SLR24, SLR35, SLR36, SLR64, SLR66, SLR79), and the Paradisec collection, testing systems’ capacity to generalize to new domains, new speakers, and new recording settings. We will also use these data sources to give participants validation data in all 16 languages to test their systems.

Please see the README in our data release below for more details about the specific languages and exact data size.

Participants will be invited to describe their system in a paper for the SIGTYP workshop proceedings. The task organizers will write an overview paper that describes the task, summarizes the different approaches taken, and analyzes their results.

Important Links

  ↣  Download data!    Google Drive or OneDrive

  ↣  Register for the Task!  

Important Dates

  Training data Release: ↣  1 February 2021
  Test data Release: ↣  15 March 2021
  Submissions Due: ↣  31 March 2021 (AoE)
  Notification: ↣  15 April 2021
  Camera-ready Due: ↣  26 April 2021
  Workshop: ↣  10 June 2021

Subtasks

The shared task will consist of two settings (subtasks):
1) Constrained: only provided training data can be employed.
2) Unconstrained: training data can be extended with any publicly available source of information (e.g. additional speech or typological features).
Use of Common Voice or corpora hosted on OpenSLR for training is disallowed for all submissions.

Languages

ISO 639-3 code  Wilderness code  Language name  Genus                    Family         Macroarea  # Training Utts
kab             KABCEB           Kabyle         Berber                   Afro-Asiatic   Africa     4000
ind             INZTSI           Indonesian     Malayo-Sumbawan          Austronesian   Papunesia  4000
sun             SUNIBS           Sundanese      Malayo-Sumbawan          Austronesian   Papunesia  4000
jav             JAVNRF           Javanese       Javanese                 Austronesian   Papunesia  4000
eus             EUSEAB           Euskara        Basque                   Basque         Eurasia    4000
tam             TCVWTC           Tamil          Southern Dravidian       Dravidian      Eurasia    4000
kan             ERVWTC           Kannada        Southern Dravidian       Dravidian      Eurasia    4000
tel             TCWWTC           Telugu         South-Central Dravidian  Dravidian      Eurasia    4000
hin             HNDSKV           Hindi          Indic                    Indo-European  Eurasia    4000
por             PORARA           Portuguese     Romance                  Indo-European  Eurasia    4000
rus             RUSS76           Russian        Slavic                   Indo-European  Eurasia    4000
eng             EN1NIV           English        Germanic                 Indo-European  Eurasia    4000
mar             MARWTC           Marathi        Indic                    Indo-European  Eurasia    4000
tha             THATSV           Thai           Kam-Tai                  Tai-Kadai      Eurasia    4000
iba             IBATIV           Iban           Malayo-Sumbawan          Austronesian   Papunesia  4000
cnh             CNHBSM           Chin, Hakha    Kuki-Chin                Sino-Tibetan   Eurasia    4000

* Note: the family and genus for Hakha Chin were initially listed incorrectly. Please follow this table.

Submission

Submissions should be emailed to the organizers by the end of the day on 31 March 2021, anywhere on Earth (AoE).
Submissions should follow the format of the training and validation label files, with tab-separated file ids and ISO 639-3 codes.
Files should be named as {team name}_{unconstrained/constrained} to indicate the subtask.
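
As an illustration of the format described above, here is a minimal Python sketch that writes a predictions file. It is not an official tool. The function names, the column order (file id first, then ISO 639-3 code), the absence of a header row, and the absence of a file extension are all assumptions; the training and validation label files in the data release are the authoritative reference.

```python
from pathlib import Path

# The 16 ISO 639-3 codes listed in the Languages table.
TASK_LANGUAGES = {
    "kab", "ind", "sun", "jav", "eus", "tam", "kan", "tel",
    "hin", "por", "rus", "eng", "mar", "tha", "iba", "cnh",
}

SUBTASKS = ("constrained", "unconstrained")


def submission_filename(team_name, subtask):
    """Build the {team name}_{unconstrained/constrained} file name (hypothetical helper)."""
    if subtask not in SUBTASKS:
        raise ValueError(f"subtask must be one of {SUBTASKS}, got {subtask!r}")
    return f"{team_name}_{subtask}"


def write_submission(predictions, team_name, subtask, out_dir="."):
    """Write one 'file_id<TAB>iso_code' line per prediction and return the path.

    `predictions` maps file ids to predicted ISO 639-3 codes.
    Assumption: no header row and no file extension.
    """
    for file_id, code in predictions.items():
        if code not in TASK_LANGUAGES:
            raise ValueError(f"{file_id}: {code!r} is not one of the 16 task languages")
    path = Path(out_dir) / submission_filename(team_name, subtask)
    with open(path, "w", encoding="utf-8", newline="") as f:
        for file_id, code in predictions.items():
            f.write(f"{file_id}\t{code}\n")
    return path
```

For example, write_submission({"utt_0001": "kab"}, "myteam", "constrained") would create a file named myteam_constrained in the current directory; the file id and team name here are made up.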

Description Papers

Papers describing shared task submissions should consist of 4 to 8 pages of content plus additional pages of references, formatted according to the NAACL 2021 format guidelines. Shared task papers do not need to be anonymized: team names and authors may be shown. Accepted papers will be published online in the NAACL 2021 proceedings and will be presented virtually at the SIGTYP workshop at NAACL 2021. Writeups should be submitted through softconf (link to come), and are due by 31 March 2021, 11:59 pm [UTC-12h].

Organizers

Elizabeth Salesky, Ekaterina Vylomova, Sabrina Mielke, Gabriella Lapesa, Edoardo Ponti,
Elena Klyachko, Oleg Serikov, Ritesh Kumar, Ryan Cotterell, Badr Abdullah

Contact

    Please contact sigtyp AT gmail DOT com if you have any questions