Olga Zamaraeva: Typologically-driven Modeling of wh-Questions in a Grammar Engineering Framework
Studying language typology and studying syntactic structure formally are both ways to learn about the range of variation in human languages. These two ways are often pursued separately from each other. Furthermore, assembling the complex and fragmented hypotheses about different syntactic phenomena along multiple typological dimensions becomes intractable without computational aid. In response to these issues, the Grammar Matrix grammar engineering framework combines typology and syntactic theory within a computational paradigm. As such, it offers a robust scaffolding for testing linguistic hypotheses in interaction and with respect to a clear area of applicability. In this talk, I will present my recent work on modeling the syntactic structure of constituent (wh-)questions across a typologically attested range, within the Grammar Matrix framework. The presented system of syntactic analyses is associated with grammar artifacts that can parse and generate sentences, which allowed me to rigorously test the analyses on test suites from diverse languages. The grammars can be extended directly in the future to cover more phenomena and more lexical items. Generally, the Grammar Matrix framework is intended to create implemented grammars for many languages of the world, particularly for endangered languages. In computational linguistics, formalized syntactic representations produced by such grammars play a crucial role in creating annotations that are used to evaluate NLP system performance and that could also be used to augment training data in low-resource settings. Such grammars have also proven useful in applications such as grammar coaching, and advancing this line of research can contribute to educational and revitalization efforts.
The talk comprises four parts (one hour in total), with a Q&A session after each: 1) Introduction (focusing on NLP and language variation) 2) Computational syntax with HPSG 3) Assembling typologically diverse analyses 4) Future directions of research
Jon Rawski: Typology Emerges from Computability
Typology, from the ancient Sanskrit grammarians through to Alexander von Humboldt, is known to require two databases: an "encyclopedia of categories" and an "encyclopedia of types". The mathematical study of computable functions gives a rich encyclopedia of categories, and processes in natural language a rich encyclopedia of types. This talk will connect the two, especially in morphology and phonology. Jon will: 1) overview classes of string-to-string functions (polyregular, regular, rational and subsequential); 2) use them to determine the scope and limits of linguistic processes; 3) analytically connect them to classes of transducers (and acceptors using algebraic semirings); 4) show their usefulness for Seq2Seq interpretability experiments, and implications for ML in NLP generally.
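To make the lowest rung of that function hierarchy concrete: a subsequential function is one computable deterministically, left to right, with only a bounded output buffer. Below is a toy sketch of my own (not code from the talk) implementing word-final obstruent devoicing, a classic subsequential process in phonology, in exactly this buffered style:

```python
# Word-final obstruent devoicing (cf. German "Rad" -> "Rat") computed as a
# subsequential function: one left-to-right pass with a one-symbol buffer
# whose output is delayed until we know whether more input follows.
DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s"}

def final_devoicing(word: str) -> str:
    out = []
    pending = ""  # buffered symbol; its realization depends on what follows
    for ch in word:
        out.append(pending)  # previous symbol was not word-final: emit as-is
        pending = ch
    # end of input: the buffered symbol is word-final, so devoice it
    out.append(DEVOICE.get(pending, pending))
    return "".join(out)
```

Because the buffer is bounded, the map is subsequential; a process requiring unbounded lookahead (e.g. one conditioned on an arbitrarily distant trigger to the right) would fall outside this class.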
Tiago Pimentel: An Informative Exploration of the Lexicon
During my PhD I've been exploring the lexicon through the lens of information theory. In this talk, I'll give an overview of results detailing the distribution of information in words (are initial or final positions more informative?), and cross-linguistic compensations (if a language has more information per character, are its words shorter?). I'll also present two new information-theoretic operationalisations (of systematicity and lexical ambiguity) which allow us to analyse computational linguistics questions through corpus analyses -- relying only on natural (unsupervised) data.
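To make the compensation question concrete, here is a rough sketch (my own toy operationalisation, using unigram character entropy, not the estimators from the actual studies) that scores a corpus by information per character and mean word length; the compensation hypothesis predicts these two quantities trade off across languages:

```python
import math
from collections import Counter

def bits_per_char(corpus):
    """Average unigram surprisal (bits) per character: a crude
    estimate of a language's information density."""
    chars = [c for word in corpus for c in word]
    counts = Counter(chars)
    total = len(chars)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def mean_word_length(corpus):
    """Average word length in characters."""
    return sum(len(word) for word in corpus) / len(corpus)
```

Under the hypothesis, a corpus with higher `bits_per_char` should tend to have lower `mean_word_length`, so that the information per word stays roughly constant.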
Maria Ryskina: Informal Romanization Across Languages and Scripts
Informal romanization is an idiosyncratic way of typing non-Latin-script languages in Latin alphabet, commonly used in online communication. Although the character substitution choices vary between users, they are typically grounded in shared notions of visual and phonetic similarity between characters. In this talk, I will focus on the task of converting such romanized text into its native orthography and present experimental results for Russian, Arabic, and Kannada, highlighting the differences specific to writing systems. I will also show how similarity-encoding inductive bias helps in the absence of parallel data, present comparative error analysis for unsupervised finite-state and seq2seq models for this task, and explore how the combinations of the two model classes can leverage their different strengths.
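As a toy illustration of how shared visual and phonetic similarity can ground deromanization (the similarity table below is invented for illustration and is not taken from the talk's models), each Latin character is scored against candidate Cyrillic sources and decoded greedily:

```python
# Hypothetical similarity scores from Latin characters to candidate
# Cyrillic sources (phonetic or visual); real systems learn such weights.
SIMILARITY = {
    "p": {"п": 0.8, "р": 0.4},  # phonetic п vs. visually similar р
    "r": {"р": 0.9},
    "i": {"и": 0.9},
    "v": {"в": 0.9},
    "e": {"е": 0.9, "э": 0.5},
    "t": {"т": 0.9},
}

def deromanize(text: str) -> str:
    """Greedily map each Latin character to its best-scoring Cyrillic
    candidate; unknown characters pass through unchanged."""
    out = []
    for ch in text:
        candidates = SIMILARITY.get(ch, {ch: 1.0})
        out.append(max(candidates, key=candidates.get))
    return "".join(out)
```

A real system replaces the greedy per-character choice with sequence-level inference (e.g. a WFST or seq2seq decoder), which is what allows context to disambiguate competing candidates.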
Shruti Rijhwani: Cross-Lingual Entity Linking for Low-Resource Languages
Entity linking is the task of associating a named entity with its corresponding entry in a structured knowledge base (such as Wikipedia or Freebase). While entity linking systems for languages such as English and Spanish are well-developed, the performance of these methods on low-resource languages is significantly worse.
In this talk, I first discuss existing methods for cross-lingual entity linking and the associated challenges of adapting them to low-resource languages. Then, I present a suite of methods developed for entity linking that do not rely on resources in the target language. The success of our proposed methods is demonstrated with experiments on multiple languages, including extremely low-resource languages such as Tigrinya, Oromo, and Lao. Additionally, this talk will show how information from entity linking can be used with state-of-the-art neural models to improve low-resource named entity recognition.
David Inman: Conceptual Interdependence in Language Description, Typology, and NLP: Examples from Nuuchahnulth
The fields of language description, typology, and NLP can be and typically are pursued independently. However, approaching these from a perspective of interdependence reveals that methodologies in one can often answer or refine questions in another. Focusing on the example of coordination structures in Nuuchahnulth, a Wakashan language of British Columbia, I will walk through the connection among traditional linguistic fields and NLP, how these can inform each other, and why NLP researchers should be interested.
David Inman's research is centered on Indigenous American languages, their linguistic properties, history, and typological profile. His doctoral research utilized computational tools to document properties of Nuuchahnulth, a Wakashan language spoken in Canada, and he continues investigating the challenges to syntactic theory that this language presents. At the University of Zurich, he is developing typological questionnaires targeting areal patterns in the Americas, and investigating how these overlap to produce areas of historically intense linguistic contact.
Tuhin Chakrabarty: NeuroSymbolic methods for creative text generation
Recent neural models have led to important progress in natural language generation (NLG) tasks. While pre-trained models have facilitated advances in many areas of text generation, the field of creative language generation, especially figurative language, is relatively unexplored. There are important challenges that need to be addressed, such as the scarcity of training data and the inherent need for common sense and connotative knowledge required for modeling these tasks. In this talk, I will present some of my recent work on neurosymbolic methods for controllable creative text generation, focusing on various types of figurative language (e.g. metaphor, simile, sarcasm). Additionally, I will discuss how we can borrow from theoretically grounded concepts of figurative language and use these inductive biases to make our generations closer to human writing.
Sabrina Mielke: Fair Comparisons for Generative Language Models -- with a bit of Information Theory
How can we fairly compare the performance of generative models on multiple languages? We will see how to use probabilistic and information theory-based measures, first to evaluate (monolingual) open-vocabulary language models by total bits, and then to ponder the meaning of “information” and how to use it to compare machine translation models. In both cases, we get only a little glimpse at what might make languages easier or harder for models, but deviating from the polished conference talk, I will recount how I spent half a year on a super-fancy model that yielded essentially the same conclusions as a simple averaging step... The rest of the talk will be dedicated to work on actually building new open-vocabulary language models, and on evaluating and ameliorating such models' gender bias in morphologically rich languages.
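The "total bits" evaluation can be sketched as follows (a minimal paraphrase of the general idea, not the talk's actual code): because every model assigns probability to the same fixed held-out text, summing per-token negative log-likelihoods and converting to bits yields a number that is comparable across models with different tokenizations:

```python
import math

def total_bits(token_nll_nats):
    """Total bits a model assigns to a fixed held-out text: sum of
    per-token negative log-likelihoods (in nats), converted to bits.
    Since the text is fixed, the total is comparable across models
    that tokenize it differently (characters, subwords, words)."""
    return sum(token_nll_nats) / math.log(2)

def bits_per_character(token_nll_nats, num_chars):
    """Normalize by the text's character count for a
    length-independent figure of merit."""
    return total_bits(token_nll_nats) / num_chars
```

Per-token perplexity, by contrast, is not comparable across tokenizations, since a finer segmentation spreads the same total probability mass over more tokens.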
Sabrina is a PhD student at the Johns Hopkins University and a part-time research intern at HuggingFace, researching open-vocabulary language models for segmentation and tokenization. She has published and co-organized workshops and shared tasks on these topics as well as on morphology and typological analysis in ACL, NAACL, EMNLP, LREC, and AAAI. You can find her on Twitter at @sjmielke, reminiscing about a time when formal language theory played a bigger role in NLP.
Richard Futrell: Investigating Information-Theoretic Influences on the Order of Elements in Natural Language
Why is human language the way it is? I claim that human languages can be modeled as codes that maximize information transfer subject to constraints on the process of language production and comprehension. I use this efficiency-based framework to formulate quantitative theories of the order of words, phrases, and morphemes, aiming to explain the typological universals documented by linguists as well as the statistical distribution of orders in massively cross-linguistic corpus studies. I present results about Greenbergian word order correlations, adjective order in English, and the order of verbal dependents in Hindi.
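One common operationalization in this efficiency literature is dependency length, the linear distance between a word and its syntactic head; a minimal sketch (my own illustration, not Futrell's models):

```python
def total_dependency_length(heads):
    """Sum of linear distances between each word and its head.
    heads[i] is the 1-based position of word i's head; 0 marks the
    root. On efficiency-based accounts, orders with smaller totals
    are easier to produce and comprehend."""
    return sum(abs(i - (h - 1)) for i, h in enumerate(heads) if h != 0)
```

For example, "the dog barked" with heads `[2, 3, 0]` (determiner attached to the noun, noun to the verb) has total dependency length 2; comparing such totals across attested and counterfactual orders is one way to test whether grammars minimize dependency length.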
Richard Futrell is an Assistant Professor in the Department of Language Science at the University of California, Irvine. His research focuses on language processing in humans and machines.
Eleanor Chodroff: Structure in Cross-linguistic Phonetic Realization
A central goal of linguistic study is to understand the range and limits of cross-linguistic variation. Cross-linguistic phonetic variation is no exception to this pursuit: previous research has provided some insight into expected universal tendencies, but access to relevant and large-scale speech data has only recently become feasible. In this talk, I focus on structure in cross-linguistic phonetic variation that may reflect a universal tendency for uniformity in the phonetic realisation of a shared feature. I present case studies from cross-talker variation within a language, and then insight from cross-linguistic meta-analyses and larger-scale corpus studies.
Eleanor Chodroff is a Lecturer in Phonetics and Phonology at the University of York. She received her PhD in Cognitive Science from Johns Hopkins University in 2017 and did a post-doc at Northwestern University in Linguistics working on speech prosody. Her research focuses on the phonetics–phonology interface, cross-talker and cross-linguistic phonetic variation, speech prosody, and speech perception.
Amit Moryossef: Including Signed Languages in NLP
Signed languages are the primary means of communication for many deaf and hard-of-hearing individuals. Since signed languages exhibit all the fundamental linguistic properties of natural language, I believe that tools and theories of Natural Language Processing (NLP) are crucial to their modeling. However, existing research in Sign Language Processing (SLP) seldom attempts to explore and leverage the linguistic organization of signed languages. In this talk, I discuss the linguistic properties of signed languages and the current open questions and challenges in modeling them, and present my current research on addressing those challenges.
Duygu Ataman: Machine Translation of Morphologically-Rich Languages: a Survey and Open Challenges
Morphologically-rich languages challenge neural machine translation (NMT) models with extremely sparse vocabularies where atomic treatment of surface forms is unrealistic. This problem is typically addressed by either pre-processing words into subword units or performing translation directly at the level of characters. The former is based on word segmentation algorithms optimized using corpus-level statistics with no regard to the translation task. The latter approach has shown significant benefits for translating morphologically-rich languages, although practical applications are still limited due to increased requirements in terms of model capacity. In this talk, we present an overview of recent approaches to NMT developed for translating morphologically-rich languages and open challenges related to their future deployment.
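The first family of approaches, subword pre-processing, is typified by byte-pair encoding (BPE); the sketch below (a simplified illustration, not a production implementation) learns merges purely from corpus-level pair frequencies, with no regard to the translation objective, which is exactly the limitation the abstract notes:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules: repeatedly merge the most frequent
    adjacent symbol pair in the corpus. Frequencies alone drive the
    segmentation; the translation task plays no role."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for w in corpus:  # apply the merge in place
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges
```

On a toy corpus like "low", "lower", "lowest", the first merges fuse the shared stem characters, which is why BPE captures frequent morphs reasonably well but can split rare inflected forms arbitrarily.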
Duygu Ataman holds a bachelor's and a master's degree in electrical engineering and computer science from Middle East Technical University, and KU Leuven, respectively. She completed her Ph.D. in computer science in 2019 at the University of Trento under the supervision of Marcello Federico. In her doctoral research she studied unsupervised learning of morphology, from the aspects of linguistics, cognitive science and statistics, and designed a purely statistical formulation of it within the Bayesian framework, which could be implemented in decoders of neural machine translation models in order to generate better translations in morphologically-rich languages. During her Ph.D. she was also a visiting student at the School of Informatics, University of Edinburgh advised by Dr. Alexandra Birch, and an applied scientist intern at Amazon Alexa Research. After recently completing her post-doctoral research and studies at the Institute of Computational Linguistics, University of Zürich, she will soon join New York University's Courant Institute as an assistant professor and faculty fellow.
Ekaterina Vylomova: UniMorph and Morphological Inflection Task: Past, Present, and Future
In the 1960s, Hockett proposed a set of essential properties that are unique to human language such as displacement, productivity, duality of patterning, and learnability. Regardless of the language we use, these features allow us to produce new utterances and infer their meanings. Still, languages differ in the way they express meanings, or as Jakobson put it, “Languages differ essentially in what they must convey and not in what they may convey”. From a typological point of view, it is crucial to describe and understand the limits of cross-linguistic variation. In this talk, I will focus on cross-lingual annotation and regularities in inflectional morphology. More specifically, I will discuss the UniMorph project, an attempt to create a universal (cross-lingual) annotation schema, with morphosyntactic features that would occupy an intermediate position between the descriptive categories and comparative concepts. UniMorph allows an inflected word from any language to be defined by its lexical meaning, typically carried by the lemma, and a bundle of universal morphological features defined by the schema. Since 2016, the UniMorph database has been gradually developed and updated with new languages, and SIGMORPHON shared tasks served as a platform to compare computational models of inflectional morphology. During 2016–2021, the shared tasks made it possible to explore the data-driven systems’ ability to learn declension and conjugation paradigms as well as to evaluate how well they generalize across typologically diverse languages. This is especially important, since elaboration of formal techniques of cross-language generalization and prediction of universal entities across related languages should provide a new potential to the modeling of under-resourced and endangered languages. In the second part of the talk, I will outline certain challenges we faced while converting the language-specific features into UniMorph (such as case compounding).
In addition, I will also discuss typical errors made by the majority of the systems, e.g. incorrectly predicted instances due to allomorphy, form variation, misspelled words, and looping effects. Finally, I will provide case studies for Russian, Tibetan, and Nen.
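To illustrate the UniMorph data model (the feature tags below follow the published schema; the lookup code itself is only an illustrative sketch): every inflected form is indexed by a lemma plus a bundle of universal features, and a reinflection system learns to predict the form from that key for unseen paradigm cells:

```python
from typing import Optional

# Each inflected form is indexed by (lemma, UniMorph feature bundle).
# Tags such as "V;PST" (past-tense verb) come from the UniMorph schema.
PARADIGMS = {
    ("walk", "V;PST"): "walked",
    ("walk", "V;V.PTCP;PRS"): "walking",
    ("walk", "V;PRS;3;SG"): "walks",
}

def inflect(lemma: str, features: str) -> Optional[str]:
    """Look up the surface form for a lemma + feature bundle; a
    shared-task system must instead *generate* the form, including
    for cells absent from its training data."""
    return PARADIGMS.get((lemma, features))
```

The language-specific conversion challenges mentioned above arise precisely at this indexing step: a descriptive category (e.g. a compound case) must be mapped onto the universal feature inventory without losing distinctions.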
Ekaterina Vylomova is a Lecturer and a Postdoctoral Fellow at the University of Melbourne. Her research is focused on compositionality modelling for morphology, models of inflectional and derivational morphology, linguistic typology, diachronic language models, and neural machine translation. She co-organized SIGTYP 2019 – 2021 workshops and shared tasks and the SIGMORPHON 2017 – 2021 shared tasks on morphological reinflection.
Adina Williams: How Strongly does Grammatical Gender Correlate with the Lexical Semantics of Nouns?
Since at least Ferdinand de Saussure, linguists have aimed to understand the strength and substance of the relationship between word meaning and word form. In this talk, I present several works that explore one particular aspect of this long-standing research program: grammatical gender. In particular, this presentation asks the following question: is there a statistically significant relationship between the morphological gender of a noun and its lexical meaning? I will present three recent studies that answer this question in the affirmative. These works measure the strength of the correlation between grammatical gender and several operationalizations of lexical meaning (using collocations and word embeddings). They also explore the relationship between meaning and orthographic form, uncovering related correlations for other grammatical systems (such as declension class). These works highlight how technical advancements in multilingual NLP tools and increasing availability of large text corpora can shed light on some of the most enduring questions about the nature of language.
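The underlying statistical question, how much information grammatical gender carries about meaning, can be framed as mutual information between two categorical variables; a minimal sketch (my own illustration with made-up labels, not the papers' estimators, which additionally correct for sampling bias):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of mutual information (in bits) between two
    categorical variables, given a list of (x, y) observations --
    e.g. (grammatical gender, semantic class of a noun)."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )
```

Zero bits means gender tells us nothing about the meaning variable; the studies' positive finding corresponds to a small but statistically significant nonzero value.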
Adina is a Research Scientist at Facebook AI Research in NYC. Her main research goal is to strengthen connections between linguistics, cognitive science, and natural language processing. Towards that end, she brings insights about human language to bear on training, evaluating, and debiasing ML-based NLP systems, and applies tools from NLP to uncover new facts about human language.
Kyle Gorman: On "Massively Multilingual" Natural Language Processing
Early work in speech & language processing was critiqued for an overwhelming focus on English (and a few other regionally hegemonic languages). In part, this reflected resource limitations of the time. In the first half of this talk, I will discuss various ways in which speech & language processing technologies can be said to be "monolingual" or "multilingual". I will identify several distinct tendencies pushing the field towards greater multilinguality and note some tensions between these various tendencies. In the second half of the talk I will discuss some of the work out of my lab exploiting free, massively multilingual data extracted from Wiktionary, a free online dictionary. These resources include UniMorph, a collection of morphological paradigms, and WikiPron, a collection of pronunciation dictionaries. I will discuss how these data are collected and vetted, and their use in a series of recent shared tasks hosted by special interest groups of the Association for Computational Linguistics.
Kyle Mahowald: “Deep” Subjecthood: Classifying Grammatical Subjects and Objects across Languages
What do contextual embedding models know about grammatical subjects and objects, and how does that knowledge vary typologically? To explore that question, I will present a variety of results, probing both humans and machines using a bespoke subject/object classification task. In the first part of the talk, I will show that type-level embeddings can explain a large part of the variance in whether a given noun is a subject, but that there are cases in which contextual models play a crucial role. In the second part of the talk, I explore subject/object classification in Multilingual BERT on both transitive and intransitive sentences, across languages that vary in morphosyntactic alignment. In particular, I explore how a classifier trained on transitive subjects and objects classifies held-out intransitive subjects, comparing the model performance within and across nominative/accusative and ergative/absolutive languages. I consider the implications of these results for linguistic theories of subjecthood.
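A probing setup of the kind described can be sketched as follows (a toy stand-in: hand-made two-dimensional vectors instead of Multilingual BERT states, and a minimal logistic-regression probe trained by stochastic gradient descent):

```python
import math

def train_probe(X, y, epochs=500, lr=0.5):
    """Train a minimal logistic-regression probe: frozen vector
    'embeddings' X in, binary subject (1) / object (0) labels out.
    Returns a predict function. The representations are never
    updated -- only the probe's weights are."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g

    def predict(x):
        return int(sum(wj * xj for wj, xj in zip(w, x)) + b > 0)

    return predict
```

The alignment experiments described above correspond to changing what the probe is trained and tested on: fit it on transitive subjects and objects, then ask how it labels held-out intransitive subjects in nominative/accusative versus ergative/absolutive languages.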