SIGTYP 2021 — JUNE 10th — ONLINE

The schedule for different time zones is also available here.

The workshop is (largely) single-track this year: a single Zoom call runs for 24 hours, and each module is repeated. The schedule was designed so that everyone can attend the modules they are interested in (where interest was expressed in our Google doc) at a reasonable time, and it shows the module times in many popular time zones. Note that the session number does not indicate different content: Morphology 1 and Morphology 2 have the same content but different attendees. Authors will likely be present at only one of the sessions, namely the one in their time zone. During each session, we will play each pre-recorded talk and then run a live Q&A (time slots for all talks are provided below). If you cannot attend a session you are presenting in, we will collect questions and post them in the corresponding RocketChat channel.
Please visit RocketChat for Zoom and Gather.town links!




  Starts: 06:00 PM     Ends: 06:15 PM

By Edoardo Ponti

Opening remarks: general comments about SIGTYP development, SIGTYP2021 submissions, shared task, etc.


 Morphology Module 


  Session 1   Starts: 06:15 PM     Ends: 08:30 PM
    (Moderators: Ryan Cotterell, Eleanor Chodroff)

  Session 2   Starts: 10:30 PM     Ends: 01:00 AM
    (Moderators: Khuyagbaatar Batsuren, Ekaterina Vylomova, Ryan Cotterell)

  Session 3   Starts: 11:00 AM     Ends: 01:30 PM
    (Moderators: Gabriella Lapesa, Edoardo Ponti, Ryan Cotterell, Josef Valvoda)

   Registration: https://forms.gle/FfwgGsKzYaYtRobR6


Keynote Talk by David Yarowsky: Bible-based Morphology, Typology and NLP in 1000+ Languages (1 hour)

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

David Yarowsky is Professor of computer science and a member of the Center for Language and Speech Processing at Johns Hopkins University. David’s research focuses on word sense disambiguation, minimally supervised induction algorithms in NLP, and multilingual natural language processing. He earned his bachelor’s (’87) in computer science at Harvard University and his master’s (’93) and Ph.D. (’96) in computer and information science at the University of Pennsylvania.
In this talk, David will discuss an approach to universal morphology, the UniMorph schema and a corresponding corpus as well as his work on massively multilingual NLP.

Slides  BiliBili  RocketChat  David's Website

Inferring Morphological Complexity from Syntactic Dependency Networks: A Test

By Guglielmo Inglese and Luca Brigada Villa

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

Research in linguistic typology has shown that languages do not fall into the neat morphological types (synthetic vs. analytic) postulated in the 19th century. Instead, analytic and synthetic must be viewed as two poles of a continuum, and languages may mix analytic and synthetic strategies to different degrees. Unfortunately, empirical studies that offer a more fine-grained morphological classification of languages along these parameters remain few. In this paper, we build upon previous research by Liu & Xu (2011) and investigate the possibility of inferring information on morphological complexity from syntactic dependency networks.

BiliBili  RocketChat  Paper

Information-Theoretic Characterization of Morphological Fusion

By Neil Rathi, Michael Hahn and Richard Futrell

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

Traditionally, morphological typology divides synthetic languages into two broad groups (e.g. von Schlegel, 1808; von Humboldt, 1843). Agglutinative languages, such as Turkish, segment morphemes into independent features which can be easily split. On the other hand, fusional languages, such as Latin, “fuse” morphemes together phonologically (Bickel and Nichols, 2013). At the same time, there has long been recognition that the categories “agglutinative” and “fusional” are best thought of as a matter of degree, with Greenberg (1954) developing an “index of agglutination” metric for languages. Here, we propose an information-theoretic definition of the fusion of any given form in a language, which naturally delivers a graded measure of the degree of fusion. We use a sequence-to-sequence model to empirically verify that our measure captures typical linguistic classifications.

BiliBili  RocketChat  Paper

Morph Call: Probing Morphosyntactic Content of Multilingual Transformers

By Vladislav Mikhailov, Oleg Serikov and Ekaterina Artemova

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

The outstanding performance of transformer-based language models on a great variety of NLP and NLU tasks has stimulated interest in exploring their inner workings. Recent research has primarily focused on higher-level and complex linguistic phenomena such as syntax, semantics, world knowledge and common sense. The majority of these studies are anglocentric, and little is known regarding other languages, specifically their morphosyntactic properties. To this end, our work presents Morph Call, a suite of 46 probing tasks for four Indo-European languages with different morphology: Russian, French, English and German. We propose a new type of probing task based on the detection of guided sentence perturbations. We use a combination of neuron-, layer- and representation-level introspection techniques to analyze the morphosyntactic content of four multilingual transformers, including their understudied distilled versions. We also examine how fine-tuning on a POS-tagging task affects the probing performance.

BiliBili  RocketChat  Paper

Measuring Prefixation and Suffixation in the Languages of the World

By Harald Hammarström

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

It has long been recognized that suffixing is more common than prefixing in the languages of the world. More detailed statistics on this tendency are needed to sharpen proposed explanations for it. The classic approach to gathering data on the prefix/suffix preference is for a human to read grammatical descriptions (948 languages), which is time-consuming and involves discretization judgments. In this paper we explore two machine-driven approaches to prefix and suffix statistics which are crude approximations, but have advantages in terms of time and replicability. The first simply searches a large collection of grammatical descriptions for occurrences of the terms ‘prefix’ and ‘suffix’ (4,287 languages). The second counts substrings from raw text data in a way that indirectly reflects prefixation and suffixation (1,030 languages, using New Testament translations). The three approaches largely agree in their measurements, but there are important theoretical and practical differences. In all measurements there is an overall preference for suffixation, albeit sometimes only a slight one, at ratios ranging between 0.51 and 0.68.
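
As a rough illustration of the second, text-based approach (a sketch of the intuition only, not Hammarström's actual counting procedure), one can compare how much token mass the most frequent word-initial versus word-final character n-grams account for:

    # Illustrative sketch, not the paper's method: a corpus whose frequent
    # word-final n-grams cover more tokens than its frequent word-initial
    # n-grams is, crudely, more suffixing than prefixing.
    from collections import Counter

    def affix_mass(words, n=3, top=50):
        prefixes = Counter(w[:n] for w in words if len(w) > n)
        suffixes = Counter(w[-n:] for w in words if len(w) > n)
        total = sum(prefixes.values())  # equals sum(suffixes.values())
        top_pre = sum(c for _, c in prefixes.most_common(top))
        top_suf = sum(c for _, c in suffixes.most_common(top))
        return top_pre / total, top_suf / total

    words = "the dogs walked slowly towards the painted houses".split()
    pre, suf = affix_mass(words, n=2)
    print(f"suffixation ratio: {suf / (pre + suf):.2f}")  # > 0.5 suggests suffixing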

BiliBili  RocketChat  Paper

Predicting and Explaining French Grammatical Gender

By Saumya Sahai and Dravyansh Sharma

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

Grammatical gender may be determined by semantics, orthography, or phonology, or it may even be arbitrary. Identifying patterns in the factors that govern noun genders can be useful for language learners and for understanding innate linguistic sources of gender bias. Traditional manual rule-based approaches may be replaced by more accurate and scalable, but harder-to-interpret, computational approaches to predicting gender from typological information. In this work, we propose interpretable gender classification models for French, which obtain the best of both worlds. We present high-accuracy neural approaches, augmented by a novel global-surrogate-based approach for explaining predictions. We introduce ‘auxiliary attributes’ to provide tunable explanation complexity.
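
As a toy illustration of how orthographic endings carry a predictive signal (a simple character n-gram baseline with a made-up word list, not the authors' neural or surrogate models):

    # Hypothetical baseline: predict French noun gender from character
    # n-grams; classic ending cues (-eau -> m, -tion/-té -> f) do the work.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    nouns  = ["maison", "voiture", "liberté", "nation",
              "château", "bureau", "fromage", "garage"]
    gender = ["f", "f", "f", "f", "m", "m", "m", "m"]

    clf = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(2, 3)),
        LogisticRegression(),
    ).fit(nouns, gender)

    print(clf.predict(["bateau", "formation"]))  # endings drive the predictions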

BiliBili  RocketChat  Paper

Quantitative Detection of Cognacy in the Predictive Structure of Inflection Classes: Romance Verbal Conjugations Against the Broader Typological Variation

By Borja Herce and Balthasar Bickel

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

In recent years, Information Theory (with its core notion of entropy) has provided the theoretical background for a lot of empirical research on inflectional systems, and has inspired various metrics to capture (different aspects of) their complexity. So far, however, entropy-based metrics have chiefly been used to assess synchronic states. Here we explore their potential for capturing patterns in language change and phylogenetic relatedness. Specifically, we probe different aspects of an inflectional system for their stability within one language family, Romance, and for the degree to which they distinguish this family from unrelated and less closely related languages. Based on most metrics, Romance appears to differ from the control sample in the mean, the variance, or both. The difference in variance is particularly interesting because it might point to differences in relative diachronic stability and serve as a phylogenetic signal of relatedness.

BiliBili  RocketChat  Paper


 Low-Resource Languages Module 


  Session 1   Starts: 01:00 AM     Ends: 03:30 AM
    (Moderators: Khuyagbaatar Batsuren, Ekaterina Vylomova, Ryan Cotterell)

  Session 2   Starts: 05:30 AM     Ends: 08:00 AM
    (Moderator: Ryan Cotterell)

  Session 3   Starts: 01:30 PM     Ends: 04:00 PM
    (Moderators: Edoardo Ponti, Irene Nikkarinen, Ryan Cotterell, Josef Valvoda)

   Registration: https://forms.gle/XJUJdvfgsxtHEkTn7


Keynote Talk by Miryam de Lhoneux: Low-resource NLP: Lessons from Dependency Parsing

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

Miryam de Lhoneux is a Postdoctoral Researcher at Uppsala University, KU Leuven and the University of Copenhagen. Miryam's research interests are centered around three main themes in the domain of AI and language: syntactic parsing, typology and interpretability. She finds syntactic parsing an exciting area because it allows exploring interesting linguistic phenomena while working on a system that is central to NLP and useful to many applications.
This talk is about low-resource NLP, with a primary focus on dependency parsing. I first argue that the Universal Dependencies (UD) dataset is ahead of the game when it comes to typologically diverse NLP and that we can use it to investigate important questions in multilingual NLP. We can ask whether techniques such as transfer learning or the use of typological information can mitigate the gap in language technology between low and high-resource languages. I review studies using these two techniques in dependency parsing and conclude that transfer learning works surprisingly well for related languages but that our current methods do not work well for low-resource languages which do not have a related high-resource language. I suggest that the use of typological information is underexploited and is a promising research line. I finally discuss work on low-resource NLP beyond dependency parsing, namely, our participation in the machine translation shared task for indigenous languages of the Americas. This task turns out to be even harder than expected. I conclude the talk by suggesting that there is still a lot of work to do for typologically diverse NLP and by highlighting recent community efforts which are building new datasets and are slowly making it possible to put multilinguality at the core of NLP.

Slides  BiliBili  RocketChat  Miryam's Website

Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation

By Zhong Zhou and Alexander Waibel

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

We translate a closed text that is known in advance into a severely low-resource language by leveraging massive source parallelism. In other words, given a text in 124 source languages, we translate it into a severely low-resource language using only ∼1,000 lines of low-resource data without any external help. Firstly, we propose a systematic method to rank and choose source languages that are close to the low-resource language. We call the linguistic definition of language family Family of Origin (FAMO), and we call the empirical definition of higher-ranked languages using our metrics Family of Choice (FAMC). Secondly, we build an Iteratively Pretrained Multilingual Order-preserving Lexiconized Transformer (IPML) to train on ∼1,000 lines (∼3.5%) of low-resource data. To translate named entities well, we build a massive lexicon table for 2,939 Bible named entities in 124 source languages, including many that occur only once, covering more than 66 severely low-resource languages. Moreover, we build a novel method of combining translations from different source languages into one. Using English as a hypothetical low-resource language, we get a +23.9 BLEU increase over a multilingual baseline and a +10.3 BLEU increase over our asymmetric baseline on the Bible dataset. We get a 42.8 BLEU score for Portuguese-English translation on the medical EMEA dataset. We also obtain good results for a real severely low-resource Mayan language, Eastern Pokomchi.

BiliBili  RocketChat  Paper

Towards Figurative Language Generation in Afrikaans

By Imke van Heerden and Anil Bas (Extended Abstract)

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

This paper presents an LSTM-based approach to figurative language generation, an important step towards creative text generation in Afrikaans. Due to the scarcity of resources (in comparison to resource-rich languages), we train the proposed network on a single literary novel. This follows the same approach as Van Heerden and Bas (2021); however, we explicitly focus and expand on fully automatic text generation, centring on figurative language in particular. The proposed model generates phrases that contain compellingly novel figures of speech such as metaphor, simile and personification.

BiliBili  RocketChat  Paper

Graph Convolutional Network for Swahili News Classification

By Alexandros Kastanos and Tyler Martin

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

In this work, we demonstrate the ability of a Text Graph Convolutional Network (Text GCN) to surpass traditional natural language processing benchmarks on the task of semi-supervised Swahili news categorisation. Our experiments highlight the more severely label-restricted setting often facing low-resourced African languages. We build on this finding by presenting a memory-efficient variant of Text GCN which replaces the naive one-hot node representation with a bag-of-words representation.
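
A minimal sketch of the memory argument (the corpus here is made up, and the Text GCN itself is not shown): document nodes can carry sparse bag-of-words features instead of an identity matrix whose width grows with the number of nodes.

    # Sketch: one-hot node features form an N x N identity matrix, while a
    # bag-of-words matrix is N x |V|, with |V| fixed by the vocabulary.
    from scipy.sparse import identity
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["habari za leo", "timu ilishinda mechi", "habari za michezo"]

    one_hot = identity(len(docs), format="csr")    # N x N
    bow = CountVectorizer().fit_transform(docs)    # N x |V|, sparse

    print(one_hot.shape, bow.shape)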

BiliBili  RocketChat  Paper

Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language

By Hala Mulki and Bilal Ghanem

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

Misogyny is a type of hate speech that disparages a person or group having a female gender identity; it is typically defined as hatred of or contempt for women. Online misogyny has become an increasing worry for Arab women, who experience gender-based online abuse on a daily basis. Such abuse can be expressed through several misogynistic behaviors which reinforce and justify the underestimation of women, male superiority, sexual abuse, mistreatment, and violence against women. Automatic misogyny detection systems can assist in the prohibition of anti-women Arabic toxic content. Developing such systems is hindered by the lack of Arabic misogyny benchmark datasets. In this work, we introduce an Arabic Levantine Twitter dataset for misogynistic language (Let-Mi), the first benchmark dataset for Arabic misogyny. The proposed dataset consists of 6,550 tweets annotated either as neutral (misogyny-free) or as one of seven misogyny categories: discredit, dominance, cursing/damning, sexual harassment, stereotyping and objectification, derailing, and the threat of violence. We further provide a detailed review of the dataset creation and annotation phases. The consistency of the annotations was assessed through inter-rater agreement measures. Moreover, Let-Mi was used as an evaluation dataset for binary, multi-class, and target classification tasks conducted with several state-of-the-art machine learning systems along with a Multi-Task Learning (MTL) configuration. The obtained results indicate that the performances achieved by these systems are consistent with state-of-the-art results for languages other than Arabic, and that employing MTL improves the performance of the misogyny/target classification tasks.
Our dataset is available at https://github.com/bilalghanem/let-mi

BiliBili  RocketChat  Paper

Improving Cross-Lingual Sentiment Analysis via Conditional Language Adversarial Adaptation

By Hemanth Kandula and Bonan Min

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

Sentiment analysis has come a long way for high-resource languages due to the availability of large annotated corpora. However, it still suffers from a lack of training data for low-resource languages. To tackle this problem, we propose the Conditional Language Adversarial Network (CLAN), an end-to-end neural architecture for cross-lingual sentiment analysis without cross-lingual supervision. CLAN differs from prior work in that it allows the adversarial training to be conditioned on both the learned features and the sentiment prediction, increasing the discriminativity of the learned representations in the cross-lingual setting. Experimental results demonstrate that CLAN outperforms previous methods on the multilingual multi-domain Amazon review dataset.
Our source code is released at https://github.com/hemanthkandula/clan

BiliBili  RocketChat  Paper

Multilingual Slot and Intent Detection (xSID) with Cross-lingual Auxiliary Tasks

By Rob van der Goot, Ibrahim Sharaf, Aizhan Imankulova, Ahmet Üstün, Marija Stepanović, Alan Ramponi, Siti Oryza Khairunnisa, Mamoru Komachi and Barbara Plank

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

Digital assistants are becoming an integral part of everyday life. However, commercial digital assistants are only available for a limited set of languages (as of March 2020, between 8 and around 20 languages). Because of this, a vast number of people cannot use these devices in their native tongue. In this work, we focus on two core tasks within the digital assistant pipeline: intent classification and slot detection. Intent classification recovers the goal of the utterance, whereas slot detection identifies important properties regarding this goal. Besides introducing a novel cross-lingual dataset for these tasks, covering 13 languages, we evaluate a variety of models: 1) multilingually pre-trained transformer-based models; 2) these models supplemented with auxiliary tasks, to evaluate whether multi-task learning can be beneficial; and 3) annotation transfer with neural machine translation.

BiliBili  RocketChat  Paper

Unsupervised Self-Training for Unsupervised Cross-Lingual Transfer

By Akshat Gupta, Sai Krishna Rallabandi and Alan W Black

  S1:   ----   ┊┊  S2:   ----   ┊┊  S3:   ----

Labelled data is scarce, especially for low-resource languages, which creates a need for unsupervised methods for natural language processing tasks. In this paper, we introduce a general framework called Unsupervised Self-Training, capable of unsupervised cross-lingual transfer. We apply the proposed framework to a two-class sentiment analysis problem on code-switched data. We use the power of pre-trained BERT models for initialization and fine-tune them in an unsupervised manner, using only pseudo labels produced by zero-shot predictions. We test our algorithm on multiple code-switched languages. Our unsupervised models compete well with their supervised counterparts, coming within 1-7% (weighted F1 scores) of supervised models trained for the same two-class problem.
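
The core loop of such self-training might look as follows (a schematic sketch: a toy scikit-learn classifier on random vectors stands in for the pre-trained BERT model and the code-switched corpus, and the confidence threshold is illustrative):

    # Pseudo-label self-training loop (schematic, with stand-in data/model).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 16))            # stand-in for unlabelled utterances

    # Stand-in for zero-shot predictions from a pre-trained model:
    model = LogisticRegression().fit(X[:20], [0, 1] * 10)

    for step in range(5):
        probs = model.predict_proba(X)
        keep = probs.max(axis=1) > 0.8        # keep only confident pseudo labels
        pseudo_y = probs.argmax(axis=1)[keep]
        if keep.sum() < 2 or len(set(pseudo_y)) < 2:
            break                             # need both classes to refit
        model = LogisticRegression().fit(X[keep], pseudo_y)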

BiliBili  RocketChat  Paper


 Typological Knowledge in NLP Module 


  Session 1   Starts: 03:30 AM     Ends: 05:30 AM
   (Moderators: Alexey Sorokin, Ekaterina Vylomova)

  Session 2   Starts: 04:30 PM     Ends: 06:00 PM
    (Moderators: Gabriella Lapesa, Elizabeth Salesky)

   Registration: https://forms.gle/maPPcL88aAaYJp3A6


Keynote Talk by Johannes Bjerva: Typological Feature Prediction and Blinding for Cross-Lingual NLP

  S1:   ----   ┊┊  S2:   ----

Johannes Bjerva is a tenure-track Assistant Professor affiliated with the Database and Web-technologies (DW) research group at the Department of Computer Science, Aalborg University. His research generally deals with under-resourced languages, which he approaches by combining linguistic typology with parameter sharing via multilingual and multitask learning. For the past few years, he has investigated computational typology and the use of deep learning techniques to answer typological research questions.
I will discuss the usefulness of typological features in NLP, with a focus on low-resource settings. On the one hand, typological features from databases such as the World Atlas of Language Structures (WALS) seem promising for cross-lingual NLP, as annotations for useful aspects of language exist even for very low-resource languages. Furthermore, missing features in WALS can be predicted with relatively high success, and this has been the focus of much recent work (e.g. in the SIGTYP 2020 shared task). When it comes to the application of these features, however, previous work has found only minor benefits from using typological information in actual NLP modelling. In recent work (EACL 2021), we hypothesised that these minor gains might stem from the fact that a model trained in a cross-lingual setting picks up on typological cues from the input data, thus overshadowing the utility of explicitly using such features. We verify this hypothesis by blinding a model to typological information and investigating how cross-lingual sharing and performance are impacted. While this sheds some light on the matter, the question of how best to use typological information in NLP seems to remain open.

Slides  BiliBili  RocketChat  Johannes' Website

Improving the Performance of UDify with Linguistic Typology Knowledge

By Chinmay Choudhary

  S1:   ----   ┊┊  S2:   ----

UDify is a state-of-the-art language-agnostic dependency parser trained on a polyglot corpus of 75 languages. This multilingual modeling enables the model to generalize over unknown and lesser-known languages, leading to improved performance on low-resource languages. In this work we use the linguistic typology knowledge available in the URIEL database to improve the cross-lingual transfer ability of UDify even further.

BiliBili  RocketChat  Paper

FrameNet and Typology

By Michael Ellsworth, Collin Baker and Miriam R. L. Petruck

  S1:   ----   ┊┊  S2:   ----

FrameNet and the Multilingual FrameNet project have produced multilingual semantic annotations of parallel texts that yield extremely fine-grained typological insights. Moreover, frame semantic annotation of a wide cross-section of languages would provide information on the limits of Frame Semantics (Fillmore 1982, Fillmore 1985). Multilingual semantic annotation offers critical input for research on linguistic diversity and recurrent patterns in computational typology. Drawing on results from FrameNet annotation of parallel texts, this paper proposes frame semantic annotation as a new component to complement the state of the art in computational semantic typology.

BiliBili  RocketChat  Paper

Exploring Linguistic Typology Features in Multilingual Machine Translation

By Oscar Moreno and Arturo Oncevay

  S1:   ----   ┊┊  S2:   ----

We explore whether linguistic typology features can impact multilingual machine translation performance (many-to-English) by using initial pseudo-tokens and factored language-level embeddings. With 20 languages from different families or groups, we observed that the features “Order of Subject (S), Object (O) and Verb (V)”, “Position of Negative Word with respect to S-O-V” and “Prefixing vs. Suffixing in Inflectional Morphology” provided slight improvements on low-resource language pairs, despite not surpassing the average performance across all languages.

BiliBili  RocketChat  Paper

A Universal Dependencies Corpora Maintenance Methodology Using Downstream Application

By Ran Iwamoto, Hiroshi Kanayama, Alexandre Rademaker and Takuya Ohko

  S1:   ----   ┊┊  S2:   ----

This paper investigates updates of Universal Dependencies (UD) treebanks in 23 languages and their impact on a downstream application. Numerous people are involved in updating UD’s annotation guidelines and treebanks in various languages. However, it is not easy to verify whether updated resources maintain universality with other language resources. Thus, the validity and consistency of multilingual corpora should be tested through application tasks involving syntactic structures with PoS tags, dependency labels, and universal features. We apply syntactic parsers trained on UD treebanks from multiple versions (2.0 to 2.7) to a clause-level sentiment extractor. We then analyze the relationships between the attachment scores of the dependency parsers and performance on the application task. For future UD development, we show examples of outputs that differ depending on the version.

BiliBili  RocketChat  Paper

A Look to Languages through the Glass of BPE Compression

By Ximena Gutierrez-Vasques, Tanja Samardzic and Christian Bentz

  S1:   ----   ┊┊  S2:   ----

One of the predominant methods for subword tokenization is Byte-Pair Encoding (BPE). It was originally a data compression technique based on replacing the most common pair of consecutive bytes with a new symbol. When applied to text, each iteration merges two adjacent symbols; this can be seen as a process of going from characters to subwords through iterations.
Regardless of the language, the first merge operations tend to have the strongest impact on the compression of texts, i.e., they capture very frequent patterns that lead to a reduction of redundancy and an increase in text entropy. However, the natural language properties that allow this compression are rarely analyzed: do all languages get compressed in the same way through BPE merge operations? We hypothesize that the type of recurrent pattern captured in each merge depends on the typology and even the orthography and other corpus-related phenomena. For instance, for some languages this compression might be related to frequent affixes or regular inflectional morphs, while for others it might be related to more idiosyncratic, irregular patterns or even to orthographic redundancies.
We propose a novel way to quantify this, inspired by the notion of morphological productivity.
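
For readers unfamiliar with the procedure, here is a minimal character-level sketch of the merge loop (illustrative only: real BPE implementations operate on word-frequency tables and do not merge across word boundaries):

    # Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
    from collections import Counter

    def bpe_merges(text, num_merges=5):
        tokens = list(text)                   # start from characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            (a, b), freq = pairs.most_common(1)[0]
            if freq < 2:                      # nothing frequent left to compress
                break
            merges.append(a + b)
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(tokens[i]); i += 1
            tokens = out
        return merges, tokens

    merges, _ = bpe_merges("lower lowest slower lowly")
    print(merges)                             # early merges capture frequent 'low'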

BiliBili  RocketChat  Paper

Improving Access to Untranscribed Speech by Leveraging Spoken Term Detection and Self-supervised Learning of Speech Representations

By Nay San, Martijn Bartelds and Dan Jurafsky

  S1:   ----   ┊┊  S2:   ----

We summarise findings from our recent work showing that a large self-supervised model trained only on English speech provides a noise-robust and speaker-invariant feature extraction method that can be used for a speech information retrieval task with unrelated low resource target languages. A qualitative error analysis also revealed that the majority of the retrieval errors could be attributed to the differences in phonological inventories between English and the evaluation languages. With a longer-term aim of leveraging typological information to better adapt such models for the target languages, we also report on work in progress which examines the phonetic information encoded in these representations.

BiliBili  RocketChat  Paper


 Linguistic Typology Module 


  Session 1   Starts: 08:30 PM     Ends: 10:30 PM
    (Moderators: Ryan Cotterell, Eleanor Chodroff)

  Session 2   Starts: 08:00 AM     Ends: 10:00 AM
    (Moderator: Gabriella Lapesa)

    Registration: https://forms.gle/bAL1KXPDWHjMx7D66


Keynote Talk by Claire Bowern: Universals: Some key questions for phonetic typology

  S1:   ----   ┊┊  S2:   ----

Claire Bowern is Professor of linguistics at Yale University. Claire is a historical linguist whose research is centered around language change and language documentation in Indigenous Australia. Since 2015 she has served as the vice president of the Endangered Language Fund. While her work touches many areas, the overarching question is how to characterize the nature of language change. Language change involves a complex interplay of universal properties of language acquisition and production with community-specific social factors; her research program looks at how to study this so that we understand both the micro-changes in progress and the macro-change that leads to language families. She works with speakers of endangered languages and with archival sound and print materials, and she uses computational and phylogenetic methods.
The SIGTYP 2021 theme of "Universals and Diversity" raises many interesting questions for comparative phonetics. In this talk, I unpack and discuss some of the theoretical background to how universal thinking in typology might apply to phonetic corpora, particularly for computational approaches to investigating speech across large-scale datasets. Because of the way that auditory language is produced, phonetics involves extensive interaction among psychological, neurological, physiological, social, linguistic, situational, and signal-processing factors. That is, speech begins in the mind and brain, but is shaped by the body, and encodes linguistic and social meaning. Moreover, language is an evolutionary system, with universal properties shared by all such systems. Following a discussion of universals and variation, I suggest some key questions that this approach opens up, around the phylogenetics of speech, the marking of social cues, and the role of phonological contrast in structuring phonetic variation.

Slides  BiliBili  RocketChat  Claire's Website

OTEANN: Estimating the Transparency of Orthographies with an Artificial Neural Network

By Xavier Marjou

  S1:   ----   ┊┊  S2:   ----

To transcribe spoken language into a written medium, most alphabets provide an unambiguous sound-to-letter rule. However, some writing systems have distanced themselves from this simple concept, and little work exists in Natural Language Processing (NLP) on measuring such distance. In this study, we use an Artificial Neural Network (ANN) model to evaluate the transparency of the mapping between written words and their pronunciation, hence its name: Orthographic Transparency Estimation with an ANN (OTEANN). Based on datasets derived from Wikimedia dictionaries, we trained and tested this model to score the percentage of false predictions in phoneme-to-grapheme and grapheme-to-phoneme translation tasks. The scores obtained on 17 orthographies were in line with the estimates of other studies. Interestingly, the model also provided insight into typical mistakes made by learners who only consider the phonemic rule in reading and writing.

BiliBili  RocketChat  Paper

Modeling Linguistic Typology - A Probabilistic Graphical Models Approach

By Xia (Robin) Lu

  S1:   ----   ┊┊  S2:   ----

In this paper, we propose using probabilistic graphical models as a new theoretical and computational framework for studying linguistic typology. The graphical structure of such a model represents a meta-language consisting of linguistic variables and the relationships between them, while the parameters associated with each variable can be used to infer the strength of those relationships. Such models can also be used to predict feature values of new languages. Besides providing better solutions to existing problems in linguistic typology, this framework opens up many new research topics that can help us gain further insight into linguistic typology.

BiliBili  RocketChat  Paper

On the Universality of Lexical Concepts

By Bradley Hauer and Grzegorz Kondrak

  S1:   ----   ┊┊  S2:   ----

We posit that lexicalized concepts are universal, and thus can be annotated cross-linguistically in parallel corpora. This is one of the implications of a novel theory that formalizes the relationship between words and senses in both monolingual and multilingual settings. The theory is based on a unifying treatment of the notions of synonymy and translational equivalence as different aspects of the relation of sameness of meaning within and across languages.

BiliBili  RocketChat  Paper

Subword Geometry: Picturing Word Shapes

By Olga Sozinova and Tanja Samardzic

  S1:   ----   ┊┊  S2:   ----

In this work in progress, we are investigating the structural properties of subwords in 20 languages by extracting word shapes, i.e. sequences of subword lengths.
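
A one-line illustration of the idea (the segmentations here are invented for the example):

    # Word shapes: map each word's subword segmentation to subword lengths.
    segmented = [["un", "believ", "able"], ["dog", "s"], ["the"]]
    shapes = [[len(piece) for piece in word] for word in segmented]
    print(shapes)  # [[2, 6, 4], [3, 1], [3]]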

BiliBili  RocketChat  Paper

Plugins for Structurally Varied Languages in XMG Framework

By Valeria Generalova

N/A

This paper suggests an XMG-based design for metagrammatical classes storing language-specific information in a multilingual grammar engineering project. It also presents a method for reusing information from WALS. The principal contributions are a hierarchy of features and a modular architecture for feature structures.

BiliBili  RocketChat  Paper


 Shared Task Module 


  Session 1   Starts: 10:00 AM     Ends: 11:00 AM
    (Moderators: Elizabeth Salesky, Sabrina Mielke, Badr Abdullah)

Shared Task 2021 Overview

By Elizabeth Salesky, Badr Abdullah, and Sabrina Mielke

  S1:   ----

While language identification is a fundamental speech and language processing task, for many languages and language families it remains challenging. For many low-resource and endangered languages this is in part due to resource availability: where larger datasets exist, they may be single-speaker or cover different domains than the desired application scenarios, creating a need for domain- and speaker-invariant language identification systems. This year’s shared task on robust spoken language identification sought to investigate just this scenario: systems were to be trained on largely single-speaker speech from one domain, but evaluated on data from other domains recorded from speakers under different recording circumstances, mimicking realistic low-resource scenarios. We find that domain and speaker mismatch proves very challenging for current methods, which can perform above 95% accuracy in-domain; domain adaptation can address this to some degree, but these conditions merit further investigation to make spoken language identification accessible in many scenarios.

BiliBili  RocketChat  Paper

Language ID Prediction from Speech Using Self-Attentive Pooling and 1D-Convolutions

By Roman Bedyakin and Nikolay Mikhaylovskiy

  S1:   ----

This memo describes the NTR-TSU submission for the SIGTYP 2021 Shared Task on predicting language IDs from speech. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) pipeline. For many low-resource and endangered languages, only single-speaker recordings may be available, creating a need for domain- and speaker-invariant language ID systems. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer shows promising results for the language identification task.

BiliBili  RocketChat  Paper

A ResNet-50-based Convolutional Neural Network Model for Language ID Identification from Speech Recordings

By Giuseppe G.A. Celano

  S1:   ----

This paper describes the model built for the SIGTYP 2021 Shared Task aimed at identifying 18 typologically different languages from speech recordings. Mel-frequency cepstral coefficients derived from audio files are transformed into spectrograms, which are then fed into a ResNet-50-based CNN architecture. The final model achieved validation and test accuracies of 0.73 and 0.53, respectively.

BiliBili  RocketChat  Paper

Anlirika: An LSTM–CNN Flow Twister for Spoken Language Identification

By Andreas Scherbakov, Liam Whittle, Ritesh Kumar, Siddharth Singh, Matthew Coleman and Ekaterina Vylomova

  S1:   ----

The paper presents Anlirika’s submission to the SIGTYP 2021 Shared Task on Robust Spoken Language Identification. The task aims at building a robust system that generalizes well across different domains and speakers. The training data is limited to a single domain, with predominantly a single speaker per language, while the validation and test samples are drawn from diverse datasets and multiple speakers. We experiment with a neural system comprising a combination of dense, convolutional, and recurrent layers designed to generalize better and obtain speaker-invariant representations. We demonstrate that the task in its constrained form (without making use of external data or augmenting the training set with samples from the validation set) is still challenging. Our best system, trained on data augmented with validation samples, achieves 29.9% accuracy on the test data.

Slides  BiliBili  RocketChat  Paper


 Socialization Module 


By SIGTYP2021 Organizing Committee

TBA

By SIGTYP2021 Organizing Committee

TBA



By SIGTYP2021 Organizing Committee

Closing Remarks and SIGTYP 2022 Announcement!