SIGTYP -- 2021 Shared Task

In 2021, SIGTYP is hosting a shared task on predicting language IDs from speech. Language ID is a fundamental speech and language processing task, yet it remains challenging, especially beyond the small set of languages that past evaluations have focused on. Further, for many low-resource and endangered languages, only single-speaker recordings may be available, which calls for language ID systems that are invariant to domain and speaker.

We selected 16 languages from across the world. Some share phonological features; in others, such features have been lost or gained through language contact. With these languages we set up what we call robust language ID: systems will be trained on largely single-speaker speech from one domain, but evaluated on data from other domains, recorded from different speakers under different recording conditions, mimicking more realistic low-resource scenarios.

For training models, we provide participants with speech data from the CMU Wilderness Dataset, which contains read speech from the Bible in 699 languages, though each language is usually recorded from a single speaker. This training data will be released in the form of derived MFCCs; please contact the organizers if you want to use other features.

The evaluation will be conducted on data from different sources, in particular the Common Voice project, several OpenSLR corpora (SLR24, SLR35, SLR36, SLR64, SLR66, SLR79), and the Paradisec collection. This tests systems’ capacity to generalize to new domains, new speakers, and new recording settings. We will also draw on these data sources to give participants validation data in all 16 languages for testing their systems.

Please see the README in our data release below for more details about the specific languages and exact data size.

Participants will be invited to describe their system in a paper for the SIGTYP workshop proceedings. The task organizers will write an overview paper that describes the task, summarizes the different approaches taken, and analyzes their results.

Important Links

↣ Download data! Google Drive or OneDrive

↣ Register for the Task!

Important Dates

  Training data Release: ↣ 1 February 2021
  Test data Release: ↣ 15 March 2021
  Submissions Due: ↣ 31 March 2021 (AoE)
  Notification: ↣ 15 April 2021
  Camera-ready Due: ↣ 26 April 2021
  Workshop: ↣ 10 June 2021

Subtasks

The shared task will consist of two settings (subtasks):
1) Constrained: only provided training data can be employed.
2) Unconstrained: training data can be extended with any publicly available source of information (e.g. additional speech or typological features).
Use of Common Voice or corpora hosted on OpenSLR for training is disallowed for all submissions.

Languages

ISO 639-3 code	Wilderness code	Language name	Genus	Family	Macroarea	# Training utterances
kab	KABCEB	Kabyle	Berber	Afro-Asiatic	Africa	4000
ind	INZTSI	Indonesian	Malayo-Sumbawan	Austronesian	Papunesia	4000
sun	SUNIBS	Sundanese	Malayo-Sumbawan	Austronesian	Papunesia	4000
jav	JAVNRF	Javanese	Javanese	Austronesian	Papunesia	4000
eus	EUSEAB	Euskara	Basque	Basque	Eurasia	4000
tam	TCVWTC	Tamil	Southern Dravidian	Dravidian	Eurasia	4000
kan	ERVWTC	Kannada	Southern Dravidian	Dravidian	Eurasia	4000
tel	TCWWTC	Telugu	South-Central Dravidian	Dravidian	Eurasia	4000
hin	HNDSKV	Hindi	Indic	Indo-European	Eurasia	4000
por	PORARA	Portuguese	Romance	Indo-European	Eurasia	4000
rus	RUSS76	Russian	Slavic	Indo-European	Eurasia	4000
eng	EN1NIV	English	Germanic	Indo-European	Eurasia	4000
mar	MARWTC	Marathi	Indic	Indo-European	Eurasia	4000
tha	THATSV	Thai	Kam-Tai	Tai-Kadai	Eurasia	4000
iba	IBATIV	Iban	Malayo-Sumbawan	Austronesian	Papunesia	4000
cnh	CNHBSM	Chin, Hakha	Kuki-Chin	Sino-Tibetan	Eurasia	4000

* Note: the family and genus for Hakha Chin were initially listed incorrectly. Please follow this table.

Submission

Submissions should be emailed to the organizers by the end of the day on 31 March 2021, Anywhere on Earth (AoE).
Submissions should follow the format of the training and validation label files, with tab-separated file ids and ISO 639-3 codes.
Files should be named as {team name}_{unconstrained/constrained} to indicate the subtask.
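
As an illustration of the format described above, here is a minimal Python sketch that writes a predictions file. The function name write_submission, its parameters, and the example file ids are hypothetical and not part of the task release; the sketch also assumes one prediction per line with no header row and adds no file extension, since neither is specified here. Check the training and validation label files in the data release for the exact layout.

```python
import csv
from pathlib import Path

# ISO 639-3 codes of the 16 task languages, copied from the Languages table.
TASK_LANGUAGES = {
    "kab", "ind", "sun", "jav", "eus", "tam", "kan", "tel",
    "hin", "por", "rus", "eng", "mar", "tha", "iba", "cnh",
}


def write_submission(predictions, team_name, subtask, out_dir="."):
    """Write a {file id: ISO 639-3 code} mapping as a tab-separated file.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if subtask not in ("constrained", "unconstrained"):
        raise ValueError(f"unknown subtask: {subtask!r}")
    unknown = sorted(set(predictions.values()) - TASK_LANGUAGES)
    if unknown:
        raise ValueError(f"codes outside the task languages: {unknown}")
    # Assumption: no file extension, because the naming rule does not give one.
    path = Path(out_dir) / f"{team_name}_{subtask}"
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Sort by file id so the output is deterministic.
        for file_id, code in sorted(predictions.items()):
            writer.writerow([file_id, code])
    return path
```

For example, write_submission({"utt1": "eng", "utt2": "kab"}, "myteam", "constrained") would create a file named myteam_constrained with one tab-separated prediction per line. Rejecting codes outside the 16 task languages catches simple labeling mistakes before the file is emailed.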

Description Papers

Papers describing shared task submissions should consist of 4 to 8 pages of content plus additional pages of references, formatted according to the NAACL 2021 format guidelines. For shared task paper submissions, it is not necessary to blind the team name and authors. Accepted papers will be published online in the NAACL 2021 proceedings and will be presented virtually at the SIGTYP workshop at NAACL 2021. Writeups should be submitted through Softconf (link to come), and are due by 31 March 2021, 11:59 pm [UTC-12h].

Organizers

Elizabeth Salesky	Ekaterina Vylomova	Sabrina Mielke	Gabriella Lapesa	Edoardo Ponti
Elena Klyachko	Oleg Serikov	Ritesh Kumar	Ryan Cotterell	Badr Abdullah

Contact

Please contact sigtyp AT gmail DOT com if you have any questions.