Time zone: America/Seattle
By SIGTYP2022 Organizing Committee
Opening remarks: general comments about SIGTYP development, SIGTYP2022 submissions, shared task, etc.
✻ Keynote Talk ✻
Kristen Howell is a data scientist at LivePerson Inc. in Seattle, Washington. Her research interests range from grammar engineering and grammar inference to conversational NLP. Throughout this research, the common thread is multilingual NLP across typologically diverse languages. Kristen received her PhD from the University of Washington in 2020, where she engaged with typological literature to develop technology for automatically generating grammars for local languages. Recent work at LivePerson has focused on multilingual NLP, leveraging deep learning techniques for conversational AI.
In this talk Kristen will describe the benefit of implemented grammars as well as the challenges involved in creating them. She will present an inference system that can be used to automatically generate such grammars on the basis of interlinear glossed text (IGT) corpora. The inference system, called BASIL (Building Analyses from Syntactic Inference in Local Languages), leverages typologically informed heuristics to infer syntactic and morphological information from linguistic corpora and to select analyses that model the language. She will engage with the question of whether and to what extent typological features are apparent in IGT data and how effectively grammars generated with these features can model human language.
✻ Multilingual Representations (Long Talks) ✻
By Andrea Gregor De Varda and Roberto Zamparelli
The present work constitutes an attempt to investigate the relational structures learnt by mBERT, a multilingual transformer-based network, with respect to different cross-linguistic regularities proposed in the fields of theoretical and quantitative linguistics. We pursued this objective by relying on a zero-shot transfer experiment, evaluating the model's ability to generalize its native task to artificial languages that could either respect or violate some proposed language universal, and comparing its performance to the output of BERT, a monolingual model with an identical configuration. We created four artificial corpora through a Probabilistic Context-Free Grammar by manipulating the distribution of tokens and the structure of their dependency relations. We showed that while both models were favoured by a Zipfian distribution of the tokens and by the presence of head-dependency type structures, the multilingual transformer network exhibited a stronger reliance on hierarchical cues compared to its monolingual counterpart.
By Yulia Otmakhova, Karin Verspoor and Jey Han Lau
Though there has recently been increased interest in how pre-trained language models encode different linguistic features, there is still a lack of systematic comparison between languages with different morphology and syntax. In this paper, using BERT as an example of a pre-trained model, we compare how three typologically different languages (English, Korean, and Russian) encode morphology and syntax features across different layers. In particular, we contrast languages which differ in a particular aspect, such as flexibility of word order, head directionality, morphological type, presence of grammatical gender, and morphological richness, across four different tasks.
Coffee break, chats, linguistic trivia
✻ Typology (Short Talks) ✻
By Dmitry Nikolaev and Sebastian Pado
The capabilities and limitations of BERT and similar models are still unclear when it comes to learning syntactic abstractions, in particular across languages. In this paper, we use the task of subordinate-clause detection within and across languages to probe these properties. We show that this task is deceptively simple, with easy gains offset by a long tail of harder cases, and that BERT's zero-shot performance is dominated by word-order effects, mirroring the SVO/VSO/SOV typology.
By Sihan Chen, Richard Futrell and Kyle Mahowald
Using data from Nintemann et al. (2020), we explore the variability in complexity and informativity across spatial demonstrative systems using spatial deictic lexicons from 223 languages. We argue from an information-theoretic perspective (Shannon, 1948) that spatial deictic lexicons are efficient in communication, balancing informativity and complexity. Specifically, we find that under an appropriate choice of cost function and need probability over meanings, among all the 21146 theoretically possible spatial deictic lexicons, those adopted by real languages lie near an efficient frontier. Moreover, we find that the conditions that the need probability and the cost function need to satisfy are consistent with the cognitive science literature regarding the source-goal asymmetry. We also show that the data are better explained by introducing a notion of systematicity, which is not currently accounted for in Information Bottleneck approaches to linguistic efficiency.
By Luigi Talamo
We describe a methodology to extract word order patterns with finer accuracy from texts automatically annotated with Universal Dependency (UD) trained parsers. We use the methodology to quantify the word order entropy of determiners, quantifiers and numerals in ten Indo-European languages, using UD-parsed texts from a parallel corpus of prosaic texts. Our results suggest that the combination of different UD annotation layers, such as UD Relations, Universal Parts of Speech and lemma, and the introduction of language-specific lists of closed-category lemmata have the two-fold effect of improving the quality of analysis and unveiling hidden areas of variability in word order patterns.
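As an illustration of the entropy measure mentioned in this abstract, the following minimal sketch computes the Shannon entropy of observed orderings. It is not the author's pipeline: the function name `word_order_entropy`, the labels "pre" and "post" (dependent before or after its head), and the toy counts are all assumptions made for the example.

```python
import math
from collections import Counter


def word_order_entropy(orders):
    """Shannon entropy, in bits, of a sequence of word order labels."""
    counts = Counter(orders)
    total = sum(counts.values())
    # Sum p * log2(p) over the observed labels, then negate.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical observations for one dependent type (toy data, not from the paper):
# the dependent precedes its head three times and follows it once.
observations = ["pre", "pre", "pre", "post"]
entropy = word_order_entropy(observations)
print(f"{entropy:.3f} bits")
```

A value of 0 means the order is fixed in the sample, while 1 bit means two orders are equally frequent.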
By Temuulen Khishigsuren, Gábor Bella, Thomas Brochhagen, Daariimaa Marav, Fausto Giunchiglia and Khuyagbaatar Batsuren
Metonymy is regarded by most linguists as a universal cognitive phenomenon, especially since the emergence of the theory of conceptual mappings. However, the field data backing up claims of universality has not been large enough so far to provide conclusive evidence. We introduce a large-scale analysis of metonymy based on a lexical corpus of over 20 thousand metonymy instances from 189 languages and 69 genera. No prior study, to our knowledge, is based on linguistic coverage as broad as ours. Drawing on corpus analysis, evidence of universality is found at three levels: systematic metonymy in general, particular metonymy patterns, and specific metonymy concepts.
By Kai Hartung, Gerhard Jäger, Sören Gröttrup and Munir Georges
In this study we address the question to what extent syntactic word-order traits of different languages have evolved under correlation and whether such dependencies can be found universally across all languages or are restricted to specific language families. To do so, we use logistic Brownian Motion under a Bayesian framework to model the trait evolution for 768 languages from 34 language families. We test for trait correlations both in single families and universally over all families. Separate models reveal no universal correlation patterns, and Bayes Factor analysis of models over all covered families also strongly indicates lineage-specific correlation patterns instead of universal dependencies.
A bit of ice breaking :-)
Lunch, discussions, etc.
✻ Shared Task Session ✻
By Johann-Mattis List, Ekaterina Vylomova, Robert Forkel, Nathan Hill and Ryan Cotterell
This study describes the structure and the results of the SIGTYP 2022 shared task on the prediction of cognate reflexes from multilingual wordlists. We asked participants to submit systems that would predict words in individual languages with the help of cognate words from related languages. Training and surprise data were based on standardized multilingual wordlists from several language families. Four teams submitted a total of eight systems, including both neural and non-neural systems, as well as systems adjusted to the task and systems using more general settings. While all systems showed a rather promising performance, reflecting the overwhelming regularity of sound change, the best performance throughout was achieved by a system based on convolutional networks originally designed for image restoration.
By Gerhard Jäger
In Jäger (2019) a computational framework was defined to start from parallel word lists of related languages and infer the corresponding vocabulary of the shared proto-language. The SIGTYP 2022 Shared Task is closely related. The main difference is that what is to be reconstructed is not the proto-form but an unknown word from an extant language. The system described here is a re-implementation of the tools used in the mentioned paper, adapted to the current task.
By Christo Kirov, Richard Sproat and Alexander Gutkin
The SIGTYP 2022 shared task concerns the problem of word reflex generation in a target language, given cognate words from a subset of related languages. We present two systems to tackle this problem, covering two very different modeling approaches. The first model extends transformer-based encoder-decoder sequence-to-sequence modeling, by encoding all available input cognates in parallel, and having the decoder attend to the resulting joint representation during inference. The second approach takes inspiration from the field of image restoration, where models are tasked with recovering pixels in an image that have been masked out. For reflex generation, the missing reflexes are treated as “masked pixels” in an “image” which is a representation of an entire cognate set across a language family. As in the image restoration case, cognate restoration is performed with a convolutional network.
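To make the "masked pixels" analogy concrete, here is a minimal sketch of how a cognate set might be laid out as a grid with the target language masked. It only illustrates the data layout, not the authors' model or their actual encoding: the function `cognate_grid`, the `MASK` and `PAD` symbols, the language names and the word forms are all hypothetical.

```python
MASK = "?"  # hypothetical symbol for cells the model would have to restore
PAD = "-"   # hypothetical padding symbol for forms shorter than the grid width


def cognate_grid(cognates, languages, target):
    """Arrange one cognate set as rows (languages) by columns (segments).

    The row of the target language, and of any language without a known
    form, is filled with MASK, like masked pixels in an image.
    """
    known = [form for lang, form in cognates.items() if lang != target]
    width = max((len(form) for form in known), default=0)
    grid = []
    for lang in languages:
        if lang == target or lang not in cognates:
            grid.append([MASK] * width)
        else:
            form = cognates[lang]
            grid.append(list(form) + [PAD] * (width - len(form)))
    return grid


# Toy cognate set with invented forms for three invented languages.
grid = cognate_grid({"A": "pata", "B": "pat"}, ["A", "B", "C"], target="C")
for row in grid:
    print(" ".join(row))
```

In an image-restoration setup, a grid of this kind would be converted to numeric form and passed to a network trained to fill in the masked row.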
By Giuseppe G. A. Celano
This paper presents the transformer model built to participate in the SIGTYP 2022 Shared Task on the Prediction of Cognate Reflexes. It consists of an encoder-decoder architecture with a multi-head attention mechanism. Its output is concatenated with the one-hot encoding of the language label of an input character sequence to predict a target character sequence. The results show that the transformer outperforms the baseline rule-based system only partially.
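The concatenation step described in this abstract can be sketched as follows. This is an assumption-laden illustration, not the author's code: the helper names `one_hot` and `with_language_label`, the feature values and the language list are hypothetical, and plain Python lists stand in for the tensors a real model would use.

```python
def one_hot(index, size):
    """Return a vector of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def with_language_label(features, language, languages):
    """Append the one-hot encoding of `language` to a feature vector."""
    label = one_hot(languages.index(language), len(languages))
    return list(features) + label


# Hypothetical model output for one sequence, plus a toy language inventory.
languages = ["lang_a", "lang_b", "lang_c"]
features = [0.2, -0.7]
combined = with_language_label(features, "lang_b", languages)
print(combined)
```

The combined vector carries both the sequence representation and the identity of the language it came from, which a downstream layer can use when predicting the target sequence.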
By Tiago Tresoldi
This work describes an implementation of the “extended alignment” model for cognate reflex prediction submitted to the “SIGTYP 2022 Shared Task on the Prediction of Cognate Reflexes”. Similarly to List et al. (2022a), the technique involves an automatic extension of sequence alignments with multilayered vectors that encode informational tiers on both site-specific traits, such as sound classes and distinctive features, and contextual and suprasegmental ones, conveyed by cross-site referrals and replication. The method makes it possible to treat cognate reflex prediction as a classification problem, with models trained using a parallel corpus of cognate sets. A model using random forests is trained and evaluated on the shared task for reflex prediction, and the experimental results are presented and discussed along with some differences to other implementations.
Coffee break, discussions, etc.
✻ Keynote Talk ✻
Isabel is a PhD student at Stanford in the Natural Language Processing group, advised by Dan Jurafsky. Her main research focuses on exploring the linguistic basis of computational language methods. She likes to focus on how language is both a discrete symbolic system and a system of continuous gradations, and on exploring the limits of how large neural models can encompass this combination.
She is very interested in looking at the behavior of large language models in multilingual settings, and analyzing the ways in which languages and dialects co-occur and interfere in single models.
Looking beyond a single language, or to non-linguistic forms of data, can yield new insights into linguistic representation and use in language models. This talk will explore this theme in two threads: Firstly, what can we learn from passing non-linguistic data through language models? From natural modalities like music to controlled synthetic parentheses languages, we can use datasets with different underlying structures to explore knowledge in language model transfer learning. Knowing the structures in this data lets us understand if and how different features are acquired and generalized in language model training. Secondly, we will look at how typologically-aware analysis can help us understand joint multilingual representation in language models, with experiments that focus on agenthood and case in different languages in multilingual models. The typological diversity of agenthood gives us a handle into understanding how representations can be shared and also separated between languages. Examining language models at the points where diverse data differs – and systematically knowing the ways in which data differs – offers a useful window into how linguistic knowledge is represented in language models.
✻ Databases and Corpora ✻
By Qingxia Guo, Nathaniel Imel and Shane Steinert-Threlkeld
This paper introduces a database for crosslinguistic modal semantics. The purpose of this database is to (1) enable ongoing consolidation of modal semantic typological knowledge into a repository according to uniform data standards and to (2) provide data for investigations in crosslinguistic modal semantic theory and experiments explaining such theories. We describe the kind of semantic variation that the database aims to record, the format of the data, and a current snapshot of the database, emphasizing access and contribution to the database in light of the goals above. We release the database at https://clmbr.shane.st/modal-typology.
By Chiara Zanchi, Silvia Luraghi and Claudia Roberta Combei
This paper describes an ongoing endeavor to construct the Pavia Verbs Database (PaVeDa), an open-access typological resource that builds upon previous work on verb argument structure, in particular the Valency Patterns Leipzig (ValPaL) project (Hartmann et al., 2013). The PaVeDa database features four major innovations as compared to the ValPaL database: (i) it includes data from ancient languages enabling diachronic research; (ii) it expands the language sample to language families that are not represented in the ValPaL; (iii) it is linked to external corpora that are used as sources of usage-based examples of stored patterns; (iv) it introduces a new cross-linguistic layer of annotation for valency patterns which allows for contrastive data visualization.
By Jonne Sälevä and Constantine Lignos
We present ParaNames, a Wikidata-derived multilingual parallel name resource consisting of names for approximately 14 million entities spanning over 400 languages. ParaNames is useful for multilingual language processing, both in defining name translation tasks and as supplementary data for other tasks. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English.
By SIGTYP2022 Organizing Committee
Best Paper Award, Closing Remarks, and Future SIGTYP Events Announcement!