
SIGTYP 2023 — May 6th — Dubrovnik/Hybrid


We kindly invite everyone to join the virtual part of SIGTYP 2023! Below you may explore papers, slides, and recorded talks. All discussions are happening in our Rocket.Chat. Each paper is provided with its own discussion channel. On our Rocket.Chat and Google Group you may also find a single Zoom link that will be used during the day of the workshop.
The SIGTYP 2023 proceedings are now available here.


Time zone: Croatia/Dubrovnik

By the SIGTYP 2023 Organizing Committee

Opening remarks: the SIGTYP 2023 workshop, SIGTYP development and MRL!


 Keynote Talk 


Ella Rabinovich: Semantic typologists have proposed (and empirically supported, across many domains) that the more two concepts are colexified across languages, the more similar those two concepts are. As such, crosslinguistic patterns of colexification can be used to deduce pairwise similarity among concepts, yielding a universal semantic similarity space for a domain (e.g., Berlin and Kay, 1969; Levinson et al., 2003). In this talk she will show that insights from semantic typology in the domain of indefinite pronouns are suggestive of challenges in their second-language (L2) English acquisition. She will also present a corpus-based analysis of L2 English indefinite pronouns and results of an automatic approach to detecting L2 semantic infelicities, stemming from these challenges.
Bio: Ella is a researcher at the IBM Research Labs in Haifa, Israel. Her research focuses on computational approaches to the study of various aspects of bilingualism. She explores the unique properties of translated texts and productions of advanced non-native speakers, covering a wide range of syntactic and lexical phenomena and applying state-of-the-art (supervised and unsupervised) machine learning and natural language processing techniques. Additional topics she is interested in include computational social science, information retrieval, and argumentation mining. She completed her Ph.D. at the Department of Computer Science, University of Haifa (Israel) under the supervision of Prof. Shuly Wintner.

Slides   Ella's Website   Discuss ❯❯


 Cross-Lingual Transfer  


By Fred Philippy, Siwen Guo and Shohreh Haddadan

Prior research has investigated the impact of various linguistic features on cross-lingual transfer performance. In this study, we investigate the manner in which this effect can be mapped onto the representation space. While past studies have focused on the impact on cross-lingual alignment in multilingual language models during fine-tuning, this study examines the absolute evolution of the respective language representation spaces produced by MLLMs. We place a specific emphasis on the role of linguistic characteristics and investigate their inter-correlation with the impact on representation spaces and cross-lingual transfer performance. Additionally, this paper provides preliminary evidence of how these findings can be leveraged to enhance transfer to linguistically distant languages.

Slides   Paper   Discuss ❯❯

By Marcell Richard Fekete and Johannes Bjerva

Transformer-based language models (LMs) offer superior performance in a wide range of NLP tasks compared to previous paradigms. However, the vast majority of the world's languages do not have adequate training data available for monolingual LMs (Joshi et al., 2020). While the use of multilingual LMs might address this data imbalance, there is evidence that multilingual LMs struggle when it comes to model adaptation to resource-poor languages (Wu and Dredze, 2020), or to languages which have typological characteristics unseen by the LM (Üstün et al., 2022). Other approaches aim to adapt monolingual LMs to resource-poor languages that are related to the model language. However, there are conflicting findings regarding whether language relatedness correlates with successful adaptation (de Vries et al., 2021), or not (Ács et al., 2021). With gradual LM adaptation, the approach presented in this extended abstract, we add to the research direction of monolingual LM adaptation. Instead of direct adaptation to a target language, we propose adaptation in stages, first adapting to one or more intermediate languages before the final adaptation step. Inspired by principles of curriculum learning (Bengio et al., 2009), we search for an ideal ordering of languages that can result in improved LM performance on the target language. We follow evidence that typological similarity might correlate with the success of cross-lingual transfer (Pires et al., 2019; Üstün et al., 2022; de Vries et al., 2021), as we believe the success of this transfer is essential for successful model adaptation. Thus we order languages based on their relative typological similarity. In our approach, we quantify typological similarity using structural vectors derived from counts of dependency links (Bjerva et al., 2019), as such fine-grained measures can give a more accurate picture of the typological characteristics of languages (Ponti et al., 2019). We believe that gradual LM adaptation may lead to improved LM performance on a range of resource-poor and typologically diverse languages. Additionally, it enables future research to evaluate the correlation between the success of cross-lingual transfer and various typological similarity measures.
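
As a concrete illustration of the ordering step, here is a minimal sketch of ranking intermediate languages by typological similarity, assuming each language is summarised by a vector of relative dependency-link frequencies; it is not the authors' implementation, and the language codes and numbers are invented.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def adaptation_schedule(target, intermediates, vectors):
    """Order intermediate languages by increasing typological similarity to the target.

    `vectors` maps a language code to a vector of relative dependency-link
    frequencies (an illustrative stand-in for the structural vectors mentioned
    above); the LM would then be adapted to each language in turn.
    """
    return sorted(intermediates, key=lambda lang: cosine(vectors[lang], vectors[target]))

# Hypothetical example: adapting an English LM to Faroese via related languages.
vectors = {
    "eng": np.array([0.31, 0.22, 0.12, 0.35]),
    "nld": np.array([0.29, 0.24, 0.13, 0.34]),
    "dan": np.array([0.27, 0.25, 0.15, 0.33]),
    "fao": np.array([0.24, 0.27, 0.17, 0.32]),
}
print(adaptation_schedule("fao", ["dan", "nld"], vectors))  # ['nld', 'dan']
```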

Slides   Paper   Discuss ❯❯

By Sepideh Mollanorozy, Marc Tanti and Malvina Nissim

The success of cross-lingual transfer learning for POS tagging has been shown to depend strongly, among other factors, on the (typological and/or genetic) similarity between the low-resource language used for testing and the language(s) used in pre-training or to fine-tune the model. We further unpack this finding in two directions by zooming in on a single language, namely Persian. First, still focusing on POS tagging, we run an in-depth analysis of the behaviour of Persian with respect to closely related languages and languages that appear to benefit from cross-lingual transfer with Persian. To do so, we also use the World Atlas of Language Structures to determine which properties are shared between Persian and the other languages included in the experiments. Based on our results, Persian seems to be a reasonable potential source language for the low-resource languages Kurmanji and Tagalog on other tasks as well. Second, we test whether previous findings also hold on a task other than POS tagging to pull apart the benefit of language similarity and the specific task for which such benefit has been shown to hold. We gather sentiment analysis datasets for 31 target languages and through a series of cross-lingual experiments analyse which languages most benefit from Persian as the source. The set of languages that benefit from Persian has very little overlap across the two tasks, suggesting a strong task-dependent component in the usefulness of language similarity in cross-lingual transfer.

Slides   Paper   Discuss ❯❯

By Charlotte Pouw, Nora Hollenstein and Lisa Beinborn

When humans read a text, their eye movements are influenced by the structural complexity of the input sentences. This cognitive phenomenon holds across languages and recent studies indicate that multilingual language models utilize structural similarities between languages to facilitate cross-lingual transfer. We use sentence-level eye-tracking patterns as a cognitive indicator for structural complexity and show that the multilingual model XLM-RoBERTa can successfully predict varied patterns for 13 typologically diverse languages, despite being fine-tuned only on English data. We quantify the sensitivity of the model to structural complexity and distinguish a range of complexity characteristics. Our results indicate that the model develops a meaningful bias towards sentence length but also integrates cross-lingual differences. We conduct a control experiment with randomized word order and find that the model seems to additionally capture more complex structural information.

Slides   Paper   Discuss ❯❯


Coffee break, chats, linguistic trivia


 Multilinguality 


By Ibraheem Muhammad Moosa, Mahmud Elahi Akhter and Ashfia Binte Habib

As there is a scarcity of large representative corpora for most languages, it is important for Multilingual Language Models (MLLM) to extract the most out of existing corpora. In this regard, script diversity presents a challenge to MLLMs by reducing lexical overlap among closely related languages. Therefore, transliterating closely related languages that use different writing scripts to a common script may improve the downstream task performance of MLLMs. In this paper, we pretrain two ALBERT models to empirically measure the effect of transliteration on MLLMs. We specifically focus on the Indo-Aryan language family, which has the highest script diversity in the world. Afterward, we evaluate our models on the IndicGLUE benchmark. We perform the Mann-Whitney U test to rigorously verify whether the effect of transliteration is significant or not. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity (CLRS) of the models using centered kernel alignment (CKA) on parallel sentences of eight languages from the FLORES-101 dataset. We find that the hidden representations of the transliteration-based model have higher and more stable CLRS scores. Our code is available at GitHub and Hugging Face Hub.
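
For reference, the sketch below shows the standard linear variant of CKA applied to two matrices of sentence representations; how sentence vectors are pooled from the models is not shown here, and the inputs are simply assumed to be one vector per parallel sentence.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation matrices.

    X, Y: (n_sentences, dim) representations of the same parallel sentences
    in two languages; the dimensionalities may differ.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
print(round(linear_cka(X, X), 3))                            # identical spaces: 1.0
print(round(linear_cka(X, rng.normal(size=(1000, 32))), 3))  # unrelated spaces: low
```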

Slides   Paper   Discuss ❯❯

By Isabel Papadimitriou, Kezia Lopez and Dan Jurafsky

While multilingual language models can improve NLP performance on low-resource languages by leveraging higher-resource languages, they also reduce average performance on all languages (the "curse of multilinguality"). Here we show another problem with multilingual models: grammatical structures in higher-resource languages bleed into lower-resource languages, a phenomenon we call grammatical structure bias. We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models: testing their preference for two carefully-chosen variable grammatical structures (optional pronoun-drop in Spanish and optional Subject-Verb ordering in Greek). We find that multilingual BERT is biased toward the English-like setting (explicit pronouns and Subject-Verb-Object ordering) and against the default Spanish and Greek settings, as compared to our monolingual control language model. With our case studies, we hope to bring to light the fine-grained ways in which multilingual models can be biased, and encourage more linguistically-aware fluency evaluation.
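
As a rough illustration of how such structural preferences can be probed, the sketch below scores minimal pairs with a pseudo-log-likelihood under multilingual BERT; this is a common fluency proxy for masked language models, not necessarily the scoring procedure used in the paper, and the example sentences are invented.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def pseudo_log_likelihood(sentence, model, tokenizer):
    """Sum of log-probabilities of each token when it is masked in turn."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(input_ids) - 1):          # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
# Spanish minimal pair: explicit subject pronoun vs. pro-drop.
print(pseudo_log_likelihood("Ella canta muy bien.", mlm, tok))
print(pseudo_log_likelihood("Canta muy bien.", mlm, tok))
```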

Slides   Paper   Discuss ❯❯

By Doreen Osmelak and Shuly Wintner

When multilingual speakers engage in a conversation, they inevitably introduce code-switching (CS), i.e., the mixing of more than one language between and within utterances. CS is still an understudied phenomenon, especially in the written medium, and relatively few computational resources for studying it are available. We describe a corpus of German-English code-switching in social media interactions. We focus on some challenges in annotating CS, especially due to words whose language ID cannot be easily determined. We introduce a novel schema for such word-level annotation, with which we manually annotated a subset of the corpus. We then trained classifiers to predict and identify switches, and applied them to the remainder of the corpus. Thereby, we created a large-scale corpus of German-English mixed utterances with precise indications of CS points.

Slides   Paper   Discuss ❯❯

By Frederic Blum and Johann-Mattis List

Sound correspondence patterns form the basis of cognate detection and phonological reconstruction in historical language comparison. Methods for the automatic inference of correspondence patterns from phonetically aligned cognate sets have been proposed, but their application to multilingual wordlists requires extremely well annotated datasets. Since annotation is tedious and time consuming, it would be desirable to find ways to improve aligned cognate data automatically. Taking inspiration from trimming techniques in evolutionary biology, which improve alignments by excluding problematic sites, we propose a workflow that trims phonetic alignments in comparative linguistics prior to the inference of correspondence patterns. Testing these techniques on a large standardized collection of ten datasets with expert annotations from different language families, we find that the best trimming technique substantially improves the overall consistency of the alignments, showing a clear increase in the proportion of frequent correspondence patterns and words exhibiting regular cognate relations.
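
To illustrate the general idea of trimming, the sketch below simply drops alignment sites dominated by gaps; this gap-threshold heuristic is only a stand-in for the trimming strategies evaluated in the paper, and the cognate set is invented.

```python
def trim_alignment(alignment, max_gap_ratio=0.5):
    """Drop alignment sites (columns) whose proportion of gaps exceeds a threshold.

    `alignment` is a list of equally long phonetic alignments, one per language,
    with "-" marking a gap.
    """
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    keep = [col for col in range(n_cols)
            if sum(row[col] == "-" for row in alignment) / n_rows <= max_gap_ratio]
    return [[row[col] for col in keep] for row in alignment]

# Toy cognate set aligned sound by sound; the third site (two gaps out of three
# rows) is trimmed away before correspondence patterns would be inferred.
aligned = [["t", "o", "-", "n"],
           ["t", "u", "-", "n"],
           ["d", "o", "r", "n"]]
print(trim_alignment(aligned))
```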

Slides   Paper   Discuss ❯❯

By Badr M. Abdullah, Mohammed Maqsood Shaik and Dietrich Klakow

Self-supervision has emerged as an effective paradigm for learning representations of spoken language from raw audio without explicit labels or transcriptions. Self-supervised speech models, such as wav2vec 2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021), have shown significant promise in improving the performance across different speech processing tasks. One of the main advantages of self-supervised speech models is that they can be pre-trained on a large sample of languages (Conneau et al., 2020; Babu et al.,2022), which facilitates cross-lingual transfer for low-resource languages (San et al., 2021). State-of-the-art self-supervised speech models include a quantization module that transforms the continuous acoustic input into a sequence of discrete units. One of the key questions in this area is whether the discrete representations learned via self-supervision are language-specific or language-universal. In other words, we ask: do the discrete units learned by a multilingual speech model represent the same speech sounds across languages or do they differ based on the specific language being spoken? From the practical perspective, this question has important implications for the development of speech models that can generalize across languages, particularly for low-resource languages. Furthermore, examining the level of linguistic abstraction in speech models that lack symbolic supervision is also relevant to the field of human language acquisition (Dupoux, 2018).

Slides   Paper   Discuss ❯❯

By Tomasz Limisiewicz, Dan Malkin and Gabriel Stanovsky

Multilingual models have been widely used for the cross-lingual transfer to low-resource languages. However, the performance on these languages is hindered by their under-representation in the pretraining data. To alleviate this problem, we propose a novel multilingual training technique based on teacher-student knowledge distillation. In this setting, we utilize monolingual teacher models optimized for their language. We use those teachers along with balanced (sub-sampled) data to distill the teachers' knowledge into a single multilingual student. Our method outperforms standard training methods in low-resource languages and retains performance on high-resource languages while using the same amount of data. If applied widely, our approach can increase the representation of low-resource languages in NLP systems.
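
A minimal sketch of the kind of distillation objective this setting implies, assuming the standard temperature-scaled KL formulation and that each batch is scored by the monolingual teacher for its language; this is not the authors' exact training code, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student output distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# Toy batch: 4 examples from one language, scored by that language's teacher.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(float(loss))
```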

Slides   Paper   Discuss ❯❯

By Simran Khanuja, Sebastian Ruder and Partha Talukdar

In order for NLP technology to be widely applicable, fair, and useful, it needs to serve a diverse set of speakers across the world's languages, be equitable, i.e., not unduly biased towards any particular language, and be inclusive of all users, particularly in low-resource settings where compute constraints are common. In this paper, we propose an evaluation paradigm that assesses NLP technologies across all three dimensions. While diversity and inclusion have received attention in recent literature, equity is currently unexplored. We propose to address this gap using the Gini coefficient, a well-established metric used for estimating societal wealth inequality. Using our paradigm, we highlight the distressed state of current technologies for Indian (IN) languages (a linguistically large and diverse set, with a varied speaker population), across all three dimensions. To improve upon these metrics, we demonstrate the importance of region-specific choices in model building and dataset creation, and more importantly, propose a novel, generalisable approach to optimal resource allocation during fine-tuning. Finally, we discuss steps to mitigate these biases and encourage the community to employ multi-faceted evaluation when building linguistically diverse and equitable technologies.
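
For reference, the Gini coefficient over per-language scores can be computed as in the sketch below; the per-language accuracies are hypothetical and the paper's exact formulation may differ.

```python
import numpy as np

def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

# Hypothetical per-language accuracies: equitable vs. skewed across languages.
print(round(gini([0.80, 0.80, 0.80, 0.80]), 3))   # 0.0
print(round(gini([0.95, 0.90, 0.40, 0.10]), 3))   # substantially higher
```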

Slides   Paper   Discuss ❯❯

By Uri Berger, Lea Frermann, Gabriel Stanovsky and Omri Abend

We present a large, multilingual study into how vision constrains linguistic choice, covering four languages and five linguistic properties, such as verb transitivity or use of numerals. We propose a novel method that leverages existing corpora of images with captions written by native speakers, and apply it to nine corpora, comprising 600k images and 3M captions. We study the relation between visual input and linguistic choices by training classifiers to predict the probability of expressing a property from raw images, and find evidence supporting the claim that linguistic properties are constrained by visual context across languages. We complement this investigation with a corpus study, taking the test case of numerals. Specifically, we use existing annotations (number or type of objects) to investigate the effect of different visual conditions on the use of numeral expressions in captions, and show that similar patterns emerge across languages. Our methods and findings both confirm and extend existing research in the cognitive literature. We additionally discuss possible applications for language generation.

Slides   Paper   Discuss ❯❯


Lunch, discussions, linguistic trivia


 Keynote Talk 


Natalia Levshina: Different languages use different linguistic cues to express “who did what to whom”, helping the addressee to identify Subject and Object. These cues include case marking, agreement, semantics, and word order. Previous research has revealed that different cues can be correlated (Greenberg 1966; Sinnemäki 2010; Levshina 2021). For example, some languages express the roles with case (Latin, Czech) and relatively flexible word order, while others (English, Mandarin) use rigid word order and have no nominal case markers. Some of the differences between the languages have been explained by sociolinguistic factors, such as population size and a high proportion of L2 (non-native) users, which can lead to grammatical simplification – in particular, to loss of case (Lupyan & Dale 2010; McWhorter 2011; Trudgill 2011; Bentz & Winter 2013; Koplenig 2019).
The aim of my talk is to investigate the relations between the linguistic and sociolinguistic variables and confirm or refute the previous hypotheses about the typological correlations and causal links. To obtain the linguistic variables, I use typological databases, such as the World Atlas of Language Structures (WALS) by Dryer & Haspelmath (2013), and corpus data: large web-based corpora of online news (Goldhahn et al. 2012) annotated with Universal Dependencies; Universal Dependencies corpora (Zeman et al. 2022); the parallel corpus of Bible translations (Mayer & Cysouw 2014) and word order data inferred from this corpus (Östling 2015). These sources provide information about four variables that help to understand “who did what to whom”: 1) the entropy of Subject and Object order based on the probabilities of Subject-Object (SO) and Object-Subject (OS) orders in the corpora; 2) whether the forms of Subject and Object are the same or distinct thanks to case flagging; 3) the position of the lexical verb in a transitive clause: final or non-final, and 4) Mutual Information of the grammatical roles and lexemes, which approximates semantic tightness of a language (Hawkins 1986). To get the sociolinguistic data, I use the information about the population size and L2 speaker proportions from the Ethnologue database, as well as the datasets from Koplenig (2019) and Sinnemäki & Di Garbo (2018). The relationships between all these variables are studied with the help of correlational and causal analyses (Pearl 2000), which involve different types of generalized mixed models and the Fast Causal Inference algorithm (Zhang 2008). The genealogical and geographic dependencies between the languages are modelled as random effects in generalized mixed-effects models.
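
As a small illustration of the first variable, the entropy of Subject/Object order can be computed directly from corpus counts of SO and OS clauses; the counts below are invented.

```python
from math import log2

def word_order_entropy(n_so, n_os):
    """Entropy (in bits) of Subject-Object vs. Object-Subject order.

    0 bits corresponds to a fully rigid order; 1 bit means SO and OS are
    equally frequent.
    """
    total = n_so + n_os
    entropy = 0.0
    for count in (n_so, n_os):
        p = count / total
        if p > 0:
            entropy -= p * log2(p)
    return entropy

# Hypothetical counts: a rigid-order language vs. a flexible-order one.
print(round(word_order_entropy(9800, 200), 3))    # close to 0
print(round(word_order_entropy(5600, 4400), 3))   # close to 1
```
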
Bio: Natalia Levshina is a linguist working at the Max Planck Institute for Psycholinguistics in Nijmegen. Her main research interests are linguistic typology, corpora, and cognitive and functional linguistics. After obtaining her PhD at the University of Leuven in 2011, she worked in Jena, Marburg, Louvain-la-Neuve and Leipzig, where she obtained her habilitation in 2019. She recently published the book “Communicative Efficiency: Language Structure and Use” (Cambridge University Press, 2022), in which she formulates the main principles of communicatively efficient linguistic behaviour and shows how these principles can explain why human languages are the way they are. Natalia is also the author of the best-selling statistical manual “How to Do Linguistics with R” (Benjamins, 2015).

Slides   Natalia's Webpage   Discuss ❯❯


 Linguistic Complexity  


By Julius Steuer, Johann-Mattis List, Badr M. Abdullah and Dietrich Klakow

We present a cross-linguistic study of vowel harmony that aims to quantify this phenomenon using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have heavily relied on inflected word-forms in the analysis of vowel harmony. In contrast, we train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists offering a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate that the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.
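
As a toy illustration of predictability-based harmonicity, the sketch below scores a word list with a vowel-bigram model instead of a neural PLM; the lexica are invented, and the paper's actual measure is estimated with phoneme-level language models.

```python
from collections import Counter
from math import log2

VOWELS = set("aeiouy")

def mean_vowel_surprisal(lexicon):
    """Mean surprisal (bits) of a vowel given the preceding vowel in the same word.

    The more harmonic a lexicon, the more predictable its vowel sequences,
    and hence the lower this value.
    """
    bigrams, contexts = Counter(), Counter()
    for word in lexicon:
        vowels = [seg for seg in word if seg in VOWELS]
        for prev, cur in zip(vowels, vowels[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    total = sum(-log2(n / contexts[prev]) * n for (prev, _), n in bigrams.items())
    return total / sum(bigrams.values())

# Toy lexica: vowels agree within a word (harmonic) vs. vary freely (mixed).
harmonic = ["kalan", "karam", "kelen", "kerem", "tulu", "turu", "tili", "tiri"]
mixed = ["kalen", "karam", "kelan", "kerem", "tuli", "turu", "tilu", "tiri"]
print(mean_vowel_surprisal(harmonic), mean_vowel_surprisal(mixed))  # 0.0 vs. higher
```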

Slides   Paper   Discuss ❯❯

By Deniz Özyıldız, Ciyang Qing, Floris Roelofsen, Maribel Romero and Wataru Uegaki

We introduce a cross-linguistic database for attitude predicates, which references their combinatorial (syntactic) and semantic properties. Our data allows assessment of cross-linguistic generalizations about attitude predicates as well as discovery of new typological/cross-linguistic patterns. This paper motivates the empirical and theoretical issues that our database will help to address, describes the sample predicates and the properties that it references, and explains our design and methodological choices. Two case studies illustrate how the database can be used to assess the validity of cross-linguistic generalizations.

Slides   Paper   Discuss ❯❯

By Andrew Thomas Dyer

In this replication study of previous research into dependency length minimisation (DLM), we pilot a new parallel multilingual parsed corpus to examine whether previous findings are upheld when controlling for variation in domain and sentence content between languages. We follow the approach of previous research in comparing the dependency lengths of observed sentences in a multilingual corpus to a variety of baselines: permutations of the sentences, either random or according to some fixed schema. We go on to compare DLM with the intervener complexity measure (ICM), an alternative measure of syntactic complexity. Our findings uphold both dependency length and intervener complexity minimisation in all languages under investigation. We also find a markedly lesser extent of dependency length minimisation in verb-final languages, and the same for the intervener complexity measure. We conclude that dependency length and intervener complexity minimisation as universals are upheld when controlling for domain and content variation, but that further research is needed into the asymmetry between verb-final and other languages in this regard.
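
To make the comparison concrete, here is a minimal sketch of observed dependency length versus a random-reordering baseline for a single sentence; the toy tree is invented, and the paper's baselines and corpora are considerably richer.

```python
import random

def dependency_length(heads):
    """Sum of |dependent - head| distances; heads[i] is the 1-based head of token i+1, 0 = root."""
    return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)

def random_baseline(heads, n_samples=1000, seed=0):
    """Mean dependency length over random reorderings of the same tree."""
    rng = random.Random(seed)
    n = len(heads)
    total = 0.0
    for _ in range(n_samples):
        order = list(range(1, n + 1))
        rng.shuffle(order)
        position = {token: pos + 1 for pos, token in enumerate(order)}
        total += sum(abs(position[i + 1] - position[h])
                     for i, h in enumerate(heads) if h != 0)
    return total / n_samples

# Toy sentence "the cat sat on the mat" with an invented dependency tree:
# the observed length falls clearly below the random baseline, illustrating DLM.
heads = [2, 3, 0, 3, 6, 4]
print(dependency_length(heads), round(random_baseline(heads), 1))
```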

Slides   Paper   Discuss ❯❯


Coffee, discussions


 Shared Task Session 


By Priya Rani, Koustava Goswami, Adrian Doyle, Theodorus Fransen, Bernardo Stearns and John P. McCrae

This paper describes the structure and findings of the SIGTYP 2023 shared task on cognate and derivative detection for low-resourced languages, broken down into a supervised and an unsupervised sub-task. Participants were asked to submit their final predictions on the test data. A total of nine teams registered for the shared task, of which seven registered for both sub-tasks. Only two participants ended up submitting system descriptions, with only one submitting systems for both sub-tasks. While all systems show rather promising performance, none surpassed the baseline score on the supervised sub-task. However, the system submitted for the unsupervised sub-task outperforms the baseline score.

Slides   Paper   Discuss ❯❯

By Tomasz Limisiewicz

In this work, I present the ÚFAL submission for the supervised task of detecting cognates and derivatives. Cognates are word pairs in different languages sharing a common origin in earlier attested forms of an ancestral language, while derivatives come directly from another language. For the task, I developed a gradient boosted tree classifier trained on linguistic and statistical features. The solution ranked first of the two submitted systems, with an 87% F1 score on the test split. This write-up gives an insight into the system and shows the importance of using linguistic features and character-level statistics for the task.
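
A minimal sketch of the general recipe (a gradient boosted tree classifier over simple character-level statistics of a word pair); the features, training pairs, and labels below are illustrative and are not the actual feature set of the submitted system.

```python
from difflib import SequenceMatcher
from sklearn.ensemble import GradientBoostingClassifier

def pair_features(word_a, word_b):
    # Illustrative character-level statistics for a word pair.
    return [abs(len(word_a) - len(word_b)),
            SequenceMatcher(None, word_a, word_b).ratio(),
            float(word_a[0] == word_b[0])]

# Hypothetical training pairs labelled 1 (related) or 0 (unrelated).
pairs = [("night", "nacht", 1), ("water", "wasser", 1), ("hound", "hund", 1),
         ("dog", "arbre", 0), ("tree", "pes", 0), ("stone", "kalb", 0)]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
print(clf.predict([pair_features("milk", "milch")]))
```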

Slides   Paper   Discuss ❯❯

By Liviu P. Dinu, Ioan-Bogdan Iordache and Ana Sabina Uban

The identification of cognates and derivatives is a fundamental process in historical linguistics, on which any further research is based. In this paper we present our contribution to the SIGTYP 2023 Shared Task on cognate and derivative detection. We propose a multi-lingual solution based on features extracted from the alignment of the orthographic and phonetic representations of the words.

Slides   Paper   Discuss ❯❯


Coffee break, discussions, etc.


 Syntax and Morphology 


By Hannah J. Haynie, Damián Blasi, Hedvig Skirgård, Simon J. Greenhill, Quentin D. Atkinson and Russell D. Gray

Of approximately 7,000 languages around the world, only a handful have abundant computational resources. Extending the reach of language technologies to diverse, less-resourced languages is important for tackling the challenges of digital equity and inclusion. Here we introduce the Grambank typological database as a resource to support such efforts. To date, work that uses typological data to extend computational research to less-resourced languages has relied on cross-linguistic morphosyntax datasets that are sparsely populated, use categorical coding that can be difficult to interpret, and introduce redundant information across features. Grambank presents similar information (e.g. word order, grammatical relation marking, constructions like interrogatives and negation), but is designed to avoid several disadvantages of legacy typological resources. Grambank’s 195 features encode basic information about morphology and syntax for 2,467 languages. 83% of these languages are annotated for at least 100 features. By implementing binary coding for most features and curating the dataset to avoid logical dependencies, Grambank presents information in a user-friendly format for computational applications. The scale, completeness, reliability, format, and documentation of Grambank make it a useful resource for linguistically-informed models, cross-lingual NLP, and research targeting less-resourced languages.

Slides   Paper   Discuss ❯❯

By Coleman Haley, Edoardo M. Ponti and Sharon Goldwater

In morphology, a distinction is commonly drawn between inflection and derivation. However, a precise definition of this distinction which captures the way the terms are used across languages remains elusive within linguistic theory, typically being based on subjective tests. In this study, we present 4 quantitative measures which use the statistics of a raw text corpus in a language to estimate how much and how variably a morphological construction changes aspects of the lexical entry, specifically, the word's form and the word's semantic and syntactic properties (as operationalised by distributional word embeddings). Based on a sample of 26 languages, we find that we can reconstruct 90% of the classification of constructions into inflection and derivation in UniMorph using our 4 measures, providing large-scale cross-linguistic evidence that the concepts of inflection and derivation are associated with measurable signatures in terms of form and distribution that behave consistently across a variety of languages. Critically, our measures and models are entirely language-agnostic, yet perform well across all languages studied. We find that while there is a high degree of consistency in the use of the terms inflection and derivation in terms of our measures, there are still many constructions near the model's decision boundary between the two categories, indicating a gradient, rather than categorical, distinction.

Slides   Paper   Discuss ❯❯

By Andreas Shcherbakov and Kat Vylomova

Generalization to novel forms and feature combinations is the key to efficient learning. Recently, Goldman et al. (2022) demonstrated that contemporary neural approaches to morphological inflection still struggle to generalize to unseen words and feature combinations, even in agglutinative languages. In this paper, we argue that the use of morphological segmentation in inflection modeling allows decomposing the problem into sub-problems of substantially smaller search space. We suggest that morphological segments may be globally topologically sorted according to their grammatical categories within a given language. Our experiments demonstrate that such segmentation provides all the necessary information for better generalization, especially in agglutinative languages.

Slides   Paper   Discuss ❯❯

By Chinmay Choudhary and Colm O’Riordan


Slides   Paper   Discuss ❯❯

By Luca Brigada Villa and Martina Giarda

In this paper we test the parsing performance of a multilingual parser on Old English data using different sets of languages, alone and combined with the target language, to train the models. We compare the results obtained by the models and analyse in more depth the annotation of some peculiar syntactic constructions of the target language, providing plausible linguistic explanations for the errors made even by the best performing models.

Slides   Paper   Discuss ❯❯

By Diego Alves, Božo Bekavac, Daniel Zeman and Marko Tadić

This article presents a comparative analysis of four different syntactic typological approaches applied to 20 different languages to determine the most effective one to be used for the improvement of dependency parsing results via corpora combination. We evaluated these strategies by calculating the correlation between the language distances and the empirical LAS results obtained when languages were combined in pairs. From the results, it was possible to observe that the best method is based on the extraction of word order patterns which happen inside subtrees of the syntactic structure of the sentences.
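
The evaluation described here amounts to correlating pairwise language distances with the LAS obtained from combined training, as in the minimal sketch below; all numbers are invented.

```python
from scipy.stats import spearmanr

# Hypothetical values: the typological distance of each language pair and the
# LAS obtained when that pair's treebanks are combined for parser training.
distances = [0.12, 0.35, 0.48, 0.22, 0.61]
las_scores = [84.1, 79.5, 76.2, 82.3, 73.8]

rho, p_value = spearmanr(distances, las_scores)
print(rho, p_value)   # strongly negative correlation in this toy example
```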

Slides   Paper   Discuss ❯❯


By the SIGTYP 2023 Organizing Committee

Stay with us for SIGTYP 2024!
THANK YOU ALL!