Word Embedding Evaluation for Ancient and Historical Languages

In 2024, SIGTYP is hosting a Word Embedding Evaluation for Ancient and Historical Languages. Since the rise of word embeddings, their evaluation has been considered a challenging task and has sparked considerable debate about the optimal approach. The two major strategies that researchers have developed over the years are intrinsic and extrinsic evaluation. The first amounts to solving specially designed problems, such as semantic proportions or “odd one out”, or comparing word/sentence similarity scores produced by a model with human judgements. The second focuses on solving downstream NLP tasks, such as sentiment analysis or question answering, probing word or sentence representations in real-world applications.
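
Intrinsic evaluation of this kind is easy to illustrate in code. The minimal sketch below uses gensim's KeyedVectors; the file name and the English word lists are purely illustrative assumptions, not part of the shared task data.

    from gensim.models import KeyedVectors

    # Load pre-trained vectors in word2vec text format (hypothetical file name).
    wv = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)

    # Semantic proportion (analogy): king - man + woman ~= ?
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

    # "Odd one out": find the word that does not belong in the set.
    print(wv.doesnt_match(["breakfast", "lunch", "dinner", "chair"]))

    # Word similarity score; over a whole dataset, such scores are usually
    # correlated with human judgements (e.g. via Spearman's rho).
    print(wv.similarity("cat", "dog"))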

In recent years, sets of downstream tasks called benchmarks have become a popular, if not the default, method for evaluating general-purpose word and sentence embeddings. Starting with decaNLP (McCann et al., 2018) and SentEval (Conneau & Kiela, 2018), multitask benchmarks for NLU have kept appearing and improving every year. However, even the largest multilingual benchmarks, such as XGLUE, XTREME, XTREME-R or XTREME-UP (Hu et al., 2020; Liang et al., 2020; Ruder et al., 2021, 2023), only include modern languages. When it comes to ancient and historical languages, scholars mostly adapt or translate intrinsic evaluation datasets from modern languages or create their own diagnostic tests. We argue that there is a need for a universal evaluation benchmark for embeddings learned from ancient and historical language data, and we view this shared task as a proving ground for it.

The shared task involves solving the following problems for 12+ ancient and historical languages that belong to 4 language families and use 6 different scripts.

Subtasks

  • A. Constrained
    • 1. POS-tagging
    • 2. Full morphological annotation
    • 3. Lemmatisation
  • B. Unconstrained
    • 1. POS-tagging
    • 2. Full morphological annotation
    • 3. Lemmatisation
    • 4. Filling gaps (word-level, character-level)

For subtask A, participants are not allowed to use any additional data; however, they may reduce and balance the provided training datasets as they see fit. For subtask B, participants may use any additional data in any language, including pre-trained embeddings and LLMs.
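
As an illustration of what an unconstrained approach to the gap-filling subtask (B.4) might look like, the sketch below queries a pre-trained masked language model through the HuggingFace transformers pipeline; the checkpoint name and the example sentence are assumptions for illustration only, not an official baseline.

    from transformers import pipeline

    # Load a pre-trained masked language model (illustrative checkpoint only).
    fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

    # Word-level gap filling: predict the most likely tokens for the masked slot.
    masked_sentence = "The scribe copied the [MASK] by hand."
    for prediction in fill(masked_sentence, top_k=3):
        print(prediction["token_str"], round(prediction["score"], 3))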

Participants will be invited to describe their systems in a paper for the SIGTYP workshop proceedings. The task organizers will write an overview paper that describes the task, summarizes the different approaches taken, and analyzes their results.

Important Links

  ↣  Shared Task GitHub Page (provides all the details and regular updates)!  

  ↣  Register for the Task!  

Important Dates

  Training Data Release: ↣  5 November 2023
  Test Data Release: ↣  2 January 2024
  Submissions Due: ↣  8 January 2024
  Notifications of Results: ↣  13 January 2024
  System Descriptions Due: ↣  20 January 2024
  Notifications of Acceptance: ↣  27 January 2024
  Camera-ready Due: ↣  3 February 2024
  Video Recordings Due: ↣  15 March 2024
  Workshop: ↣  21/22 March 2024

Task Organizers

Oksana Dereza, Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway
Priya Rani, SFI Centre for Research and Training in AI, Data Science Institute, University of Galway
Atul Kr. Ojha, Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway
Adrian Doyle, Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway
Pádraic Moran, School of Languages, Literatures and Cultures, Moore Institute, University of Galway
John P. McCrae, Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway

Contact

    Please contact oksana.dereza@insight-centre.org or priya.rani@insight-centre.org if you have any questions.