SIGTYP 2026 — March 29th — Rabat/Hybrid
The in-person part of the workshop will take place in SALLE LE LIXUS (Level -1).
We kindly invite everyone to join the virtual part of SIGTYP 2026! Below you may explore papers, slides, and recorded talks. On the Google Group you may also find the single Zoom link that will be used throughout the day of the workshop.
The SIGTYP 2026 proceedings are now available here.
Our colleague Vilém (Vilda) Zouhar has kindly agreed to help us with the in-person part and will be assisting you during the workshop!
Time zone: Rabat, Morocco
By SIGTYP2026 Organizing Committee
Opening remarks: the SIGTYP 2026 workshop, SIGTYP development and MRL!
✻ Session 1 ✻
By Johannes Laurmaa
Automatically generating grammatically correct sentences in case-marking languages is hard because nominal case inflection depends on context. In template-based generation, placeholders must be inflected to the right case before insertion, otherwise the result is ungrammatical. We formalise this case selection problem for template slots and present a practical, data-driven solution designed for morphologically rich, case-marking languages, and apply it to Finnish. We automatically derive training instances from raw text via morphological analysis, and fine-tune transformer encoders to predict a distribution over 14 grammatical cases, with and without lemma conditioning. The predicted case is then realized by a morphological generator at deployment. On a held-out test set in the lemma-conditioned setting, our model attains 89.1% precision, 81.1% recall, and 84.2% F1, with recall@3 of 93.3% (macro averages). The probability outputs support abstention and top-$k$ suggestion user interfaces, enabling robust, lightweight template filling for production use in multiple domains, such as customer messaging. The pipeline assumes only access to raw text plus a morphological analyzer and generator, and can be applied to other languages with productive case systems.
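As a rough illustration of the abstention and top-$k$ suggestion mechanism the abstract describes, the sketch below ranks a predicted probability distribution over the 14 Finnish cases and either returns the top suggestions or abstains. The function names, the confidence threshold, and the example distribution are our own illustrative choices, not taken from the paper.

```python
# Illustrative sketch only: the threshold (0.5) and all names are assumptions,
# not the paper's actual implementation.

FINNISH_CASES = [
    "nominative", "genitive", "partitive", "essive", "translative",
    "inessive", "elative", "illative", "adessive", "ablative",
    "allative", "abessive", "comitative", "instructive",
]

def suggest_cases(probs, k=3, abstain_below=0.5):
    """Return up to k (case, prob) suggestions, or [] to abstain."""
    ranked = sorted(zip(FINNISH_CASES, probs), key=lambda cp: cp[1], reverse=True)
    if ranked[0][1] < abstain_below:
        return []  # abstain: defer to a human rather than guess
    return ranked[:k]

# A made-up distribution that is confident about the genitive:
probs = [0.02, 0.80, 0.05] + [0.13 / 11] * 11
print(suggest_cases(probs))
```

A uniform distribution (maximum probability 1/14) would fall below the threshold and trigger abstention, which is the behaviour a production template-filling UI would want.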
✻ Keynote Talk ✻
Abstract: Human languages exhibit striking variation. At the same time, certain linguistic patterns crop up again and again, while others seem to be extremely rare. What these tantalising observations tell us about human language is one of the most contentious questions in linguistics. Do similarities between languages reflect accidents of history? A special capacity for language in humans? More general features of the human mind? Do they reflect hard-and-fast constraints on the space of possible languages? Or soft biases that influence learning and usage? Traditionally, linguists have argued for one or another of these answers based on limited sources of evidence. For example, it is common to base claims about universality on small samples of languages, case studies of how a handful of languages change over time, or examples of how individual languages are learned. In this talk, I use two case studies to highlight how behavioral experiments, targeting diverse participant populations, can be used to bring crucial empirical evidence to bear on how language is shaped (or not!) by the human linguistic and cognitive system. In the first case study, I summarise a series of experiments targeting universals of nominal word order. In the second, I describe recent experimental work on cross-linguistic trends in pronoun systems. I end by discussing how I think language models can bring additional sources of converging evidence to bear on the question of what cognitive mechanisms underlie these language universals.
Bio: Jennifer Culbertson is Professor in the Department of Linguistics and English Language at the University of Edinburgh. She is a founding member of the Centre for Language Evolution, with her research focusing on how typological universals are shaped by properties of human cognition. She is best known for her work investigating universals of word order and morphological categories using the experimental method of Artificial Language Learning.
Coffee break, chats, networking
✻ Session 2 (Chair: Priya Rani) ✻
By Badal Nyalang
We present MeiteiRoBERTa, the first publicly available monolingual RoBERTa-based language model for Meitei (Manipuri), a low-resource language spoken by over 1.8 million people in Northeast India. Trained from scratch on 76 million words of Meitei text in Bengali script, our model achieves a perplexity of 65.89, representing a 5.2× improvement over multilingual baselines BERT (341.56) and MuRIL (355.65). Through comprehensive evaluation on perplexity, tokenization efficiency, and semantic representation quality, we demonstrate that domain-specific pre-training significantly outperforms general-purpose multilingual models for low-resource languages. Our model exhibits superior semantic understanding with 0.769 similarity separation compared to 0.035 for mBERT and near-zero for MuRIL, despite MuRIL's better tokenization efficiency (fertility: 3.29 vs. 4.65). We publicly release the model, training code, and datasets to accelerate NLP research for Meitei and other underrepresented Northeast Indian languages.
By Petr Kocharov and Lilit Kharatyan
The paper presents a prototype of a web app designed to automatically generate verb valency lexica from the Universal Dependencies (UD) treebanks. It offers an overview of the structure of the app, its core functionality, and functional extensions designed to handle treebank-specific features. Finally, the paper highlights the limitations of the prototype and the potential for its further development.
By York Hay Ng, Aditya Khan, Xiang Lu, Matteo Salloum, Michael Zhou, Phuong H. Hoang, Seza Dogruoz and En-Shiun Annie Lee
Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. First, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data. Second, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. Across multiple zero-shot transfer benchmarks, we demonstrate that our representations significantly improve transfer performance when the distance type is relevant to the task, while our composite distance yields gains in most tasks.
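The abstract mentions hyperbolic embeddings for genealogical distance. A common way to realise this is the standard geodesic distance in the Poincaré ball, sketched below; whether the paper uses exactly this formulation is our assumption, and the embeddings here are placeholders.

```python
import math

def poincare_distance(u, v):
    """Standard geodesic distance in the Poincaré ball:
    d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2))).
    Points must lie strictly inside the unit ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu2 = sum(a * a for a in u)
    nv2 = sum(b * b for b in v)
    return math.acosh(1 + 2 * diff2 / ((1 - nu2) * (1 - nv2)))

# Distances grow rapidly toward the boundary, which is what makes the
# geometry a natural fit for tree-like genealogical structure:
print(poincare_distance([0.0, 0.0], [0.5, 0.0]))
print(poincare_distance([0.0, 0.0], [0.9, 0.0]))
```

The key property for genealogy is that the space near the boundary has exponentially more "room" than Euclidean space, so deep family trees embed with low distortion.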
✻ Session 3: Findings (Chair: Priya Rani) ✻
By Rayyan Merchant and Kevin Tang
As a digraphic language, Persian utilizes two written standards: Perso-Arabic in Afghanistan and Iran, and Tajik-Cyrillic in Tajikistan. Despite the significant similarity between the dialects of each country, script differences prevent simple one-to-one mapping, hindering written communication and interaction between Tajikistan and its Persian-speaking "siblings". To overcome this, previously published efforts have investigated machine transliteration models to convert between the two scripts. Unfortunately, most efforts did not use datasets other than those they created, limiting these models to certain domains of text such as archaic poetry or word lists. Since a truly usable transliteration system must handle varied domains, such models lack the versatility required for real-world usage. The contrast in domain between data also obscures the task's true difficulty. We present a new state-of-the-art sequence-to-sequence model for Tajik-Farsi transliteration trained across all available datasets, and present two datasets of our own. Our results across domains provide clearer understanding of the task, and set comprehensive comparable leading benchmarks. Overall, our model achieves chrF++ and Normalized CER scores of 87.91 and 0.05 from Farsi to Tajik and 92.28 and 0.04 from Tajik to Farsi. Our model, data, and code are available at https://anonymous.4open.science/r/ParsTranslit-FB30/.
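The normalized CER metric reported above is usually computed as character-level edit distance divided by reference length; the sketch below shows that common definition, though the paper's exact normalization may differ.

```python
# Illustrative sketch of character error rate (CER):
# edit distance between hypothesis and reference, normalized by
# reference length. This is the common definition, assumed here.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

print(cer("салом", "салом"))  # identical strings -> 0.0
```

On this definition, the paper's scores of 0.04-0.05 mean roughly one character error per 20-25 reference characters.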
By Sanjeev Kumar, Preethi Jyothi and Pushpak Bhattacharyya
Multilingual models are widely used for machine translation (MT). However, their effectiveness for extremely low-resource languages (ELRLs) depends critically on how related languages are incorporated during fine-tuning. In this work, we study the role of language mixing directionality, linguistic relatedness, and script compatibility in ELRL translation. We propose SrcMix, a simple source-side mixing strategy that combines related ELRLs during fine-tuning while constraining the decoder to a single target language. Compared to its target-side counterpart TgtMix, SrcMix improves performance by +3 ChrF++ and +5 BLEU in high-resource to ELRL translations, and by +5 ChrF++ and +12 BLEU in mid-resource to ELRL translations. We also release the first Angika MT dataset and provide a systematic comparison of LLM (Aya-101) and NMT (mT5-Large) models under ELRL settings, highlighting the importance of directional mixing and linguistic compatibility.
Lunch break, chats, networking
Join us for LINGUISTIC TRIVIA!
✻ Session 4 ✻
By Julius Steuer, Toshiki Nakai, Andrew Thomas Dyer, Luigi Talamo and Annemarie Verkerk
The uniform information density (UID) hypothesis postulates that linguistic units are distributed in a text in such a way that the variance around an average information density is minimized. The relationship between information density and information status (IS) is so far underexplored. In this ongoing work, we project IS annotations on the English section of the CIEP+ corpus (Verkerk & Talamo 2024) to parallel sections in other languages. We then use the projected annotations to evaluate the relationship between IS and information content in a typologically diverse sample of languages. Our preliminary findings indicate that there is an effect of information status on information density, with the directionality of the effect depending on language and part of speech.
By Andrew Thomas Dyer
It is common in cognitive computational linguistics to use language model surprisal as a measure of the information content of units in language production. From here, it is tempting to then apply this to information structure and status, considering surprising mentions to be "new" and unsurprising ones to be "given", providing us with a ready-made continuous metric of information givenness/newness. To see if this conflation is appropriate, we perform regression experiments to see if language model surprisal is actually well predicted by information status as manually annotated, and if so, if this effect is separable from more trivial linguistic information such as parts of speech and word frequency. We find that information status alone is at best a very weak predictor of surprisal, and that surprisal can be much better predicted by the effect of parts of speech, which are highly correlated with both information status and surprisal; and word frequency. We conclude that surprisal should not be used as a continuous representation of information status by itself.
By Jonathan Hus and Antonios Anastasopoulos
Linguistic reference material is a trove of information that can be utilized for the analysis of languages. The material, in the form of grammar books and sketches, has been used for machine translation, but it can also be used for language analysis. Retrieval Augmented Generation (RAG) has been demonstrated to improve large language model (LLM) capabilities by incorporating external reference material into the generation process. In this paper, we investigate the use of grammar books and RAG techniques to identify language features. We use Grambank for feature definition and ground truth values, and we evaluate on five typologically diverse low-resource languages. We demonstrate that this approach can effectively make use of reference material.
By Antoine Taroni, Ludovic Moncla and Frederique Laforest
We model translation as an Information Bottleneck (IB) optimization problem, treating source sentences as stimuli and target sentences as compressed meanings. Using bitexts of a French novel with its English, German, and Serbian translations, we extract and align spatial terms. We substitute human similarity judgments with the cosine similarity of contextual embeddings. Compared to random alignments, real translations lie closer to the IB optimal frontier, suggesting relative efficiency with respect to this objective.
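The similarity proxy the abstract describes is plain cosine similarity between contextual embedding vectors; a minimal self-contained version is sketched below (the toy vectors stand in for real contextual embeddings, which the paper would obtain from a language model).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (||u|| * ||v||).
    Used here as a stand-in for human similarity judgments between
    aligned terms, as the abstract describes."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# Toy vectors: identical directions score 1.0, orthogonal ones 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In the IB framing, this similarity matrix plays the role that perceptual similarity judgments play in classic efficient-communication analyses of, e.g., color naming.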
Coffee break, chats, networking
✻ Keynote Talk ✻
Abstract: Cross-language variation in semantic categories (e.g. word meanings) has been explained in terms of information-theoretic principles. Central to such accounts is a prior distribution over meanings that need to be conveyed. It has often been assumed for convenience that this distribution is the same for different speech communities, but loosening that assumption allows us to connect information-theoretic explanations to classic proposals about the relation of language and culture.
Bio: Terry Regier is a cognitive scientist and linguist whose research investigates language, meaning, and cognition. He is best known for his work exploring word meanings across languages, examining how word meanings reflect and sometimes shape thought and perception. Regier's lab integrates computational modeling, cross-linguistic data, and behavioral experiments to study universals and variation in semantic domains such as color, kinship, number, and spatial relations. His research contributes to understanding how languages evolve, and how language and cognition are influenced by cultural diversity.
By SIGTYP2026 Organizing Committee
Stay with us for SIGTYP 2027!
THANK YOU ALL!
