SIGTYP 2022 — July 14th — 508-Tahuya/Hybrid
We kindly invite everyone to join the virtual part of SIGTYP 2022! Below you may explore papers, slides, and recorded talks. All discussions are happening in our Rocket.Chat. Each paper is provided with its own discussion channel. On our Rocket.Chat and Google Group you may also find a single Zoom link that will be used during the day of the workshop.
The SIGTYP 2022 proceedings are now available here and on the ACL Anthology website.
The Best Paper Award goes to "Typological Word Order Correlations with Logistic Brownian Motion". Congratulations!
Time zone: America/Los_Angeles (Seattle)
By the SIGTYP 2022 Organizing Committee
Opening remarks: the SIGTYP 2022 workshop, SIGTYP development, MRL and FieldMatters 2022! Slides are available here.
✻ Keynote Talk ✻
Kristen Howell is a data scientist at LivePerson Inc. in Seattle, Washington. Her research interests range from grammar engineering and grammar inference to conversational NLP. Throughout this research, the common thread is multilingual NLP across typologically diverse languages. Kristen received her PhD from the University of Washington in 2020, where she engaged with typological literature to develop technology for automatically generating grammars for local languages. Recent work at LivePerson has focused on multilingual NLP, leveraging deep learning techniques for conversational AI.
Abstract: In this talk Kristen will describe the benefits of implemented grammars as well as the challenges involved in creating them. She will present an inference system that can be used to automatically generate such grammars on the basis of interlinear glossed text (IGT) corpora. The inference system, called BASIL (Building Analyses from Syntactic Inference in Local Languages), leverages typologically informed heuristics to infer syntactic and morphological information from linguistic corpora and to select analyses that model the language. She will engage with the question of whether and to what extent typological features are apparent in IGT data, and how effectively grammars generated with these features can model human language.
✻ Multilingual Representations (Long Talks) ✻
By Andrea Gregor De Varda and Roberto Zamparelli
The present work constitutes an attempt to investigate the relational structures learnt by mBERT, a multilingual transformer-based network, with respect to different cross-linguistic regularities proposed in the fields of theoretical and quantitative linguistics. We pursued this objective by relying on a zero-shot transfer experiment, evaluating the model's ability to generalize its native task to artificial languages that could either respect or violate some proposed language universal, and comparing its performance to the output of BERT, a monolingual model with an identical configuration. We created four artificial corpora through a Probabilistic Context-Free Grammar by manipulating the distribution of tokens and the structure of their dependency relations. We showed that while both models were favoured by a Zipfian distribution of the tokens and by the presence of head-dependency type structures, the multilingual transformer network exhibited a stronger reliance on hierarchical cues compared to its monolingual counterpart.
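The corpus construction described here lends itself to a compact illustration. Below is a minimal sketch of sampling sentences from a toy Probabilistic Context-Free Grammar; the grammar, tokens, and probabilities are invented for illustration and are not those used in the paper:

```python
import random

# Toy PCFG: each nonterminal maps to a list of (expansion, probability).
# The grammar and probabilities are illustrative only.
PCFG = {
    "S":   [(["NP", "VP"], 1.0)],
    "NP":  [(["Det", "N"], 0.7), (["N"], 0.3)],
    "VP":  [(["V", "NP"], 0.6), (["V"], 0.4)],
    "Det": [(["the"], 0.6), (["a"], 0.4)],
    "N":   [(["dog"], 0.5), (["cat"], 0.5)],
    "V":   [(["sees"], 0.5), (["chases"], 0.5)],
}

def expand(symbol):
    """Recursively sample a derivation for `symbol`."""
    if symbol not in PCFG:  # terminal symbol
        return [symbol]
    expansions, probs = zip(*PCFG[symbol])
    choice = random.choices(expansions, weights=probs, k=1)[0]
    return [tok for sym in choice for tok in expand(sym)]

corpus = [" ".join(expand("S")) for _ in range(5)]
print("\n".join(corpus))
```

Manipulating the rule probabilities and the branching structure of such a grammar is what lets one control the token distribution (e.g., Zipfian or not) and the dependency structure of the resulting artificial corpora.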
By Yulia Otmakhova, Karin Verspoor and Jey Han Lau
Though there has recently been increased interest in how pre-trained language models encode different linguistic features, there is still a lack of systematic comparison between languages with different morphology and syntax. In this paper, using BERT as an example of a pre-trained model, we compare how three typologically different languages (English, Korean, and Russian) encode morphology and syntax features across different layers. In particular, we contrast languages which differ in a particular aspect, such as flexibility of word order, head directionality, morphological type, presence of grammatical gender, and morphological richness, across four different tasks.
Coffee break, chats, linguistic trivia
✻ Typology (Short Talks) ✻
10:10 – 10:22 Word-order Typology in Multilingual BERT: A Case Study in Subordinate-Clause Detection
By Dmitry Nikolaev and Sebastian Pado
The capabilities and limitations of BERT and similar models are still unclear when it comes to learning syntactic abstractions, in particular across languages. In this paper, we use the task of subordinate-clause detection within and across languages to probe these properties. We show that this task is deceptively simple, with easy gains offset by a long tail of harder cases, and that BERT's zero-shot performance is dominated by word-order effects, mirroring the SVO/VSO/SOV typology.
By Sihan Chen, Richard Futrell and Kyle Mahowald
Using data from Nintemann et al. (2020), we explore the variability in complexity and informativity across spatial demonstrative systems using spatial deictic lexicons from 223 languages. We argue from an information-theoretic perspective (Shannon, 1948) that spatial deictic lexicons are efficient in communication, balancing informativity and complexity. Specifically, we find that under an appropriate choice of cost function and need probability over meanings, among all the 21,146 theoretically possible spatial deictic lexicons, those adopted by real languages lie near an efficient frontier. Moreover, we find that the conditions that the need probability and the cost function need to satisfy are consistent with the cognitive science literature regarding the source-goal asymmetry. We also show that the data are better explained by introducing a notion of systematicity, which is not currently accounted for in Information Bottleneck approaches to linguistic efficiency.
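For readers unfamiliar with the framework, the following toy sketch shows the kind of quantities involved: an expected communicative cost under a need probability, and a complexity measure, computed for two hypothetical demonstrative lexicons. All meanings, probabilities, costs, and lexicons are invented; the paper's actual cost function, need probabilities, and complexity measure differ:

```python
# Toy meaning space with hypothetical need probabilities and cost function.
meanings = ["here", "there", "yonder"]
need = {"here": 0.5, "there": 0.3, "yonder": 0.2}

def cost(m, m_hat):
    """Hypothetical 0/1 cost: 0 for exact communication, 1 otherwise."""
    return 0.0 if m == m_hat else 1.0

def expected_cost(lexicon):
    """A lexicon maps each word to the set of meanings it covers.
    Assume the listener guesses uniformly among a word's meanings."""
    total = 0.0
    for word, covered in lexicon.items():
        for m in covered:
            for m_hat in covered:
                total += need[m] * cost(m, m_hat) / len(covered)
    return total

def complexity(lexicon):
    """Crude complexity proxy: number of distinct words."""
    return len(lexicon)

# A fully distinguishing lexicon vs. a coarser two-term one:
fine   = {"w1": ["here"], "w2": ["there"], "w3": ["yonder"]}
coarse = {"w1": ["here"], "w2": ["there", "yonder"]}
for lex in (fine, coarse):
    print(complexity(lex), expected_cost(lex))
```

The fine lexicon is more complex but costless; the coarse one trades accuracy for simplicity. Plotting many such lexicons in the (complexity, cost) plane is what reveals the efficient frontier the paper refers to.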
By Luigi Talamo
We describe a methodology for extracting word order patterns with greater accuracy from texts automatically annotated with parsers trained on Universal Dependencies (UD). We use the methodology to quantify the word order entropy of determiners, quantifiers and numerals in ten Indo-European languages, using UD-parsed texts from a parallel corpus of prosaic texts. Our results suggest that the combination of different UD annotation layers, such as UD relations, universal parts of speech and lemmata, and the introduction of language-specific lists of closed-category lemmata have the twofold effect of improving the quality of the analysis and unveiling hidden areas of variability in word order patterns.
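As a rough illustration of the kind of computation involved, the sketch below counts determiner-noun orders in a CoNLL-U file via the DEPREL layer and computes their Shannon entropy. The file path is hypothetical, and the paper's actual methodology combines further layers (universal parts of speech, lemmata) and language-specific lemma lists:

```python
import math
from collections import Counter

def det_noun_orders(conllu_path):
    """Count determiner-noun orders in a CoNLL-U file by comparing the
    position of each `det` dependent with that of its head.
    (Assumes the file ends with a blank line, as CoNLL-U files do.)"""
    counts = Counter()
    with open(conllu_path, encoding="utf-8") as f:
        sent = []
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                sent.append(line.split("\t"))
            elif sent:
                for tok in sent:
                    if not tok[0].isdigit():  # skip multiword tokens / empty nodes
                        continue
                    if tok[7] == "det":       # DEPREL column
                        order = "Det-N" if int(tok[0]) < int(tok[6]) else "N-Det"
                        counts[order] += 1
                sent = []
    return counts

def entropy(counts):
    """Shannon entropy (in bits) of an order distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# counts = det_noun_orders("en_ewt-ud-train.conllu")  # hypothetical path
# print(counts, entropy(counts))
```

An entropy near 0 indicates a rigid order; values near 1 bit indicate free variation between the two orders.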
By Temuulen Khishigsuren, Gábor Bella, Thomas Brochhagen, Daariimaa Marav, Fausto Giunchiglia and Khuyagbaatar Batsuren
Metonymy is regarded by most linguists as a universal cognitive phenomenon, especially since the emergence of the theory of conceptual mappings. However, the field data backing up claims of universality has not been large enough so far to provide conclusive evidence. We introduce a large-scale analysis of metonymy based on a lexical corpus of over 20 thousand metonymy instances from 189 languages and 69 genera. No prior study, to our knowledge, is based on linguistic coverage as broad as ours. Drawing on corpus analysis, evidence of universality is found at three levels: systematic metonymy in general, particular metonymy patterns, and specific metonymy concepts.
By Kai Hartung, Gerhard Jäger, Sören Gröttrup and Munir Georges
In this study we address the question of to what extent syntactic word-order traits of different languages have evolved under correlation, and whether such dependencies hold universally across all languages or are restricted to specific language families. To do so, we use logistic Brownian motion under a Bayesian framework to model trait evolution for 768 languages from 34 language families. We test for trait correlations both in single families and universally over all families. Separate models reveal no universal correlation patterns, and a Bayes factor analysis of models over all covered families also strongly indicates lineage-specific correlation patterns rather than universal dependencies.
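As we read it, logistic Brownian motion here means a latent Brownian motion mapped through the logistic function, so that a binary word-order trait has a probability in (0, 1) at each point in time. A minimal simulation sketch with purely illustrative parameters (the paper's Bayesian inference over such processes is not reproduced here):

```python
import math
import random

def simulate_logistic_bm(t_max, dt=0.01, sigma=1.0, z0=0.0, seed=0):
    """Simulate a latent Brownian motion z(t) and map it through the
    logistic function to a trait probability p(t) in (0, 1).
    All parameters are illustrative, not the paper's estimates."""
    rng = random.Random(seed)
    z, trajectory = z0, []
    for _ in range(int(t_max / dt)):
        z += rng.gauss(0.0, sigma * math.sqrt(dt))  # Brownian increment
        p = 1.0 / (1.0 + math.exp(-z))              # logistic link
        trajectory.append(p)
    return trajectory

traj = simulate_logistic_bm(t_max=5.0)
print(traj[-1])  # probability that the binary trait is in state 1 at t_max
```

Correlated evolution of two traits would correspond to correlated increments of two such latent processes, which is what the Bayesian model compares against independent evolution.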
✻ Keynote Talk ✻
Graham Neubig is an associate professor at the Language Technologies Institute of Carnegie Mellon University. His research focuses on multilingual natural language processing, natural language interfaces to computers, and machine learning methods for NLP, with the final goal of every person in the world being able to communicate with each other, and with computers, in their own language. He also contributes to making NLP research more accessible through open publishing of research papers, advanced NLP course materials and video lectures, and open-source software, all of which are available on his web site.
Abstract: In NLP, there has been a very welcome recent trend towards focusing on models that benefit under-resourced languages, where it is hard to effectively train models due to the lack of annotated or unannotated data. In this talk, Graham will discuss some work that attacks this problem not by creating better models or training algorithms for existing data, but by unlocking data sources that were not available previously, namely undigitized text and lexicons. For the former, Graham will discuss work to digitize undigitized text through optical character recognition and post-correction. For the latter, he will discuss methods to adapt pre-trained multilingual models using lexicon-based data augmentation methods.
Lunch, discussions, etc.
✻ Shared Task Session ✻
By Johann-Mattis List, Ekaterina Vylomova, Robert Forkel, Nathan Hill and Ryan Cotterell
This study describes the structure and the results of the SIGTYP 2022 shared task on the prediction of cognate reflexes from multilingual wordlists. We asked participants to submit systems that would predict words in individual languages with the help of cognate words from related languages. Training and surprise data were based on standardized multilingual wordlists from several language families. Four teams submitted a total of eight systems, including both neural and non-neural systems, as well as systems adjusted to the task and systems using more general settings. While all systems showed a rather promising performance, reflecting the overwhelming regularity of sound change, the best performance throughout was achieved by a system based on convolutional networks originally designed for image restoration.
By Gerhard Jäger
In Jäger (2019) a computational framework was defined to start from parallel word lists of related languages and infer the corresponding vocabulary of the shared proto-language. The SIGTYP 2022 Shared Task is closely related. The main difference is that what is to be reconstructed is not the proto-form but an unknown word from an extant language. The system described here is a re-implementation of the tools used in the mentioned paper, adapted to the current task.
By Christo Kirov, Richard Sproat and Alexander Gutkin
The SIGTYP 2022 shared task concerns the problem of word reflex generation in a target language, given cognate words from a subset of related languages. We present two systems to tackle this problem, covering two very different modeling approaches. The first model extends transformer-based encoder-decoder sequence-to-sequence modeling, by encoding all available input cognates in parallel, and having the decoder attend to the resulting joint representation during inference. The second approach takes inspiration from the field of image restoration, where models are tasked with recovering pixels in an image that have been masked out. For reflex generation, the missing reflexes are treated as “masked pixels” in an “image” which is a representation of an entire cognate set across a language family. As in the image restoration case, cognate restoration is performed with a convolutional network.
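To make the second approach concrete, the sketch below shows one hypothetical way to encode a cognate set as a languages-by-positions "image" with the unknown reflex masked out. The sound inventory, the data, and the encoding are invented; the paper's actual representation and network differ:

```python
import numpy as np

# Hypothetical sound inventory and cognate set.
sounds = ["<pad>", "<mask>", "a", "o", "t", "d", "r"]
idx = {s: i for i, s in enumerate(sounds)}

cognate_set = {
    "lang_A": ["t", "a", "r"],
    "lang_B": ["d", "o", "r"],
    "lang_C": None,            # unknown reflex to be restored
}

max_len = 3
rows = []
for lang, word in cognate_set.items():
    if word is None:
        rows.append([idx["<mask>"]] * max_len)  # masked "pixels"
    else:
        rows.append([idx[s] for s in word] + [idx["<pad>"]] * (max_len - len(word)))

image = np.array(rows)  # shape: (languages, alignment positions)
print(image)
# A convolutional network is then trained to fill in the masked row
# from the surrounding rows, as in image inpainting.
```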
By Giuseppe G. A. Celano
This paper presents the transformer model built to participate in the SIGTYP 2022 Shared Task on the Prediction of Cognate Reflexes. It consists of an encoder-decoder architecture with a multi-head attention mechanism. Its output is concatenated with the one-hot encoding of the language label of an input character sequence to predict a target character sequence. The results show that the transformer outperforms the baseline rule-based system only partially.
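A minimal sketch of the label-concatenation idea in PyTorch, with illustrative dimensions and random stand-in data (the paper's actual architecture and hyperparameters are not reproduced here):

```python
import torch
import torch.nn as nn

# Illustrative sizes: character vocabulary, number of languages, model width.
n_chars, n_langs, d_model = 64, 10, 128

embed = nn.Embedding(n_chars, d_model)
transformer = nn.Transformer(d_model=d_model, batch_first=True)
# The output classifier sees the decoder states concatenated with the
# one-hot language label of the input sequence.
out_proj = nn.Linear(d_model + n_langs, n_chars)

src = torch.randint(0, n_chars, (2, 12))  # batch of input character sequences
tgt = torch.randint(0, n_chars, (2, 9))   # target character sequences
lang = torch.tensor([3, 7])               # language labels of the inputs

dec_out = transformer(embed(src), embed(tgt))                # (2, 9, d_model)
lang_onehot = nn.functional.one_hot(lang, n_langs).float()   # (2, n_langs)
lang_onehot = lang_onehot.unsqueeze(1).expand(-1, tgt.size(1), -1)
logits = out_proj(torch.cat([dec_out, lang_onehot], dim=-1))  # (2, 9, n_chars)
print(logits.shape)
```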
By Tiago Tresoldi
This work describes an implementation of the “extended alignment” model for cognate reflex prediction submitted to the “SIGTYP 2022 Shared Task on the Prediction of Cognate Reflexes”. Similarly to List et al. (2022a), the technique involves an automatic extension of sequence alignments with multilayered vectors that encode informational tiers both on site-specific traits, such as sound classes and distinctive features, and on contextual and suprasegmental ones, conveyed by cross-site referrals and replication. The method makes it possible to recast cognate reflex prediction as a classification problem, with models trained on a parallel corpus of cognate sets. A model using random forests is trained and evaluated on the shared task for reflex prediction, and the experimental results are presented and discussed along with some differences from other implementations.
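A toy sketch of the resulting classification setup with scikit-learn, using invented integer-coded tier features (the paper's actual tiers, encoding, and training data differ):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical multitier features for aligned sites. Each row holds the
# integer-coded features of one alignment site across the known languages;
# each label is the target-language sound at that site.
X_train = [
    # [sound_A, class_A, sound_B, class_B]
    [1, 0, 1, 0],
    [2, 1, 3, 1],
    [4, 2, 4, 2],
]
y_train = ["t", "a", "r"]  # target-language sounds to predict

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Predict the reflex sound at a new aligned site.
print(clf.predict([[2, 1, 3, 1]]))  # -> ['a']
```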
✻ Linguistic Trivia ✻
Guess the Language from Songs. Linguistic quizzes.
You may also use the initial slides.
Coffee break, discussions, etc.
✻ Keynote Talk ✻
Isabel is a PhD student at Stanford in the Natural Language Processing group, advised by Dan Jurafsky. Her main research focuses on exploring the linguistic basis of computational language methods. She likes to focus on how language is both a discrete symbolic system and a system of continuous gradations, and on exploring the limits of how large neural models can encompass this combination.
She is very interested in looking at the behavior of large language models in multilingual settings, and in analyzing the ways in which languages and dialects co-occur and interfere in single models.
Abstract: The development of successful language models has provided us with an exciting test bed: we have learners that can model the language data they are given, and we can watch them do it. In this talk we will go over two sets of experiments that examine language representation and learning in language models, and discuss what we can learn from them. First, we’ll look at subjecthood (whether an argument is the subject or the object of a sentence) in multilingual language models. By probing the embedding space, we show how a discrete feature like subjecthood can be encoded in a continuous space, affected but not fully determined by prototype effects, and how these properties interact when a feature is shared universally across many languages. Second, we approach questions of inductive learning biases and the abstract universals that underlie language by pretraining models on non-linguistic data and observing their language acquisition. Insofar as computational models of cognition act as hypothesis generators for inspiring and guiding our research into understanding language, language models are a very exciting tool to work with and understand.
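A minimal sketch of the probing setup behind the first set of experiments, with random vectors standing in for the contextual embeddings of arguments (everything here is illustrative; the actual experiments probe mBERT embeddings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-ins for argument embeddings; 1 = subject, 0 = object.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))
y = rng.integers(0, 2, size=200)

# A linear probe trained on part of the data, evaluated on the rest.
probe = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print(probe.score(X[150:], y[150:]))  # probe accuracy on held-out items

# The probe's continuous decision function is what makes it possible to
# study gradient prototype effects on an ostensibly discrete feature
# like subjecthood.
```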
✻ Databases and Corpora ✻
By Qingxia Guo, Nathaniel Imel and Shane Steinert-Threlkeld
This paper introduces a database for crosslinguistic modal semantics. The purpose of this database is to (1) enable ongoing consolidation of modal semantic typological knowledge into a repository according to uniform data standards and to (2) provide data for investigations in crosslinguistic modal semantic theory and experiments explaining such theories. We describe the kind of semantic variation that the database aims to record, the format of the data, and a current snapshot of the database, emphasizing access and contribution to the database in light of the goals above. We release the database at https://clmbr.shane.st/modal-typology.
By Chiara Zanchi, Silvia Luraghi and Claudia Roberta Combei
This paper describes an ongoing endeavor to construct the Pavia Verbs Database (PaVeDa), an open-access typological resource that builds upon previous work on verb argument structure, in particular the Valency Patterns Leipzig (ValPaL) project (Hartmann et al., 2013). The PaVeDa database features four major innovations as compared to the ValPaL database: (i) it includes data from ancient languages, enabling diachronic research; (ii) it expands the language sample to language families that are not represented in the ValPaL; (iii) it is linked to external corpora that are used as sources of usage-based examples of stored patterns; (iv) it introduces a new cross-linguistic layer of annotation for valency patterns which allows for contrastive data visualization.
By Jonne Sälevä and Constantine Lignos
We present ParaNames, a Wikidata-derived multilingual parallel name resource consisting of names for approximately 14 million entities spanning over 400 languages. ParaNames is useful for multilingual language processing, both for defining name translation tasks and as supplementary data for other tasks. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English.
By the SIGTYP 2022 Organizing Committee
The Best Paper Award goes to "Typological Word Order Correlations with Logistic Brownian Motion". Congratulations!
Stay with us for SIGTYP 2023!
THANK YOU ALL!