MRL 2025 Shared Task on Multilingual Physical Reasoning Datasets

Overview

Many languages lack culturally specific evaluation datasets created by members of the language community themselves. The MRL 2025 Shared Task invites participants, for example researchers who are native speakers of a non-English language, to contribute a manually annotated physical commonsense reasoning evaluation dataset for their language(s). The format will be similar to PIQA, a physical commonsense reasoning benchmark where each example consists of a prompt with two candidate completions ("solutions"). We aim to collaboratively construct a multilingual physical reasoning benchmark with broad language coverage and culturally specific examples for different languages.

All authors of accepted submissions will have the option to be included as authors of the resulting benchmark paper. For inclusion in the final paper, we may require slight modifications to the accepted datasets where necessary, and/or manual curation of a small number of additional examples.

To express interest, we encourage you to fill out this optional form: https://forms.gle/zxhpCfL6wvBzb15e7. We will host an optional FAQ meeting in early/mid August to answer questions, or you can contact the organizers at: mrl2025-workshop@googlegroups.com.

Call for Submissions

The MRL 2025 Shared Task accepts submissions of non-English PIQA-style datasets with accompanying dataset description papers. The submission deadline is September 15, 2025 (see Important Dates below).

Submission link (Google form): https://forms.gle/NYZnaxakspSWwPsW6

Format: Paper PDF, with the dataset as supplementary material. The dataset should be a .tsv file with at least the following columns: prompt, solution0, solution1, label (0 or 1, indicating which solution is correct). The two completions ("solutions") in each pair should be as similar as possible, differing in only one or two words. A column with English translations may be helpful, but is *not* required. See examples in Examples and Suggestions below.
Languages: We welcome contributions for any language other than English. We particularly encourage submissions of datasets in under-resourced languages.
Number of Items: Each submission should contain at least 100 original examples (i.e. not translated from the original PIQA dataset). We encourage authors to create larger datasets if possible. Authors are welcome to include translated PIQA examples, but these do not count towards the 100 original items, and they must be annotated as English PIQA translations in some way.
Paper Page Count: The submitted paper should be at most eight pages in the EMNLP template, but we welcome much shorter papers. For example, datasets that are entirely manually constructed may require only a few paragraphs of description (e.g. any heuristics or methods for writing sentences and solutions).
Reporting: The paper should report enough detail such that a speaker of your language(s) would reasonably be able to reconstruct a comparable dataset. Importantly: How did you create your items? For example, how were prompts selected? How many native speakers checked each example (minimum 1)?
Evaluation: Evaluations of existing models on the submitted datasets are welcome, but *not* required. We will run evaluations once submissions are accepted.
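
As a pre-submission sanity check, the format requirements above can be sketched as a small validator. This is a minimal, unofficial sketch, not a tool provided by the organizers: the function name validate_tsv and its min_items parameter are hypothetical, and it assumes a tab-separated file with a header row and no quoted fields. It counts every row toward the minimum, so it does not distinguish original items from translated PIQA items, which the call leaves authors free to annotate in their own way.

```python
import csv
import io

# Column names taken from the Format requirement above.
REQUIRED_COLUMNS = ("prompt", "solution0", "solution1", "label")


def validate_tsv(handle, min_items=100):
    """Return a list of human-readable problems found in a PIQA-style TSV.

    Hypothetical helper: `handle` is any text file object, `min_items`
    mirrors the 100-item minimum but counts ALL rows, translated or not.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    count = 0
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        count += 1
        for col in ("prompt", "solution0", "solution1"):
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        if (row["label"] or "").strip() not in {"0", "1"}:
            problems.append(f"line {line_no}: label must be 0 or 1")
    if count < min_items:
        problems.append(f"only {count} items; at least {min_items} are expected")
    return problems
```

To check a file on disk, open it with encoding="utf-8" and newline="" and pass the file object to validate_tsv; an empty list means no problems were detected by these particular checks.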

Examples and Suggestions

Here, we provide examples and suggestions to help authors construct high-quality PIQA-style datasets:

Use items of variable length. Try not to include too many short items, as they may be too easy for larger models.
The two candidate solutions should be as similar as possible, only differing in one or two words (see examples below).
We encourage authors to include culturally relevant examples for their language(s), including items that may not be easily translatable into English.
Example items:

{"prompt": "What's the best material for a DIY walking stick?",
"solution0": "A discarded tree branch.",
"solution1": "A discarded lead pipe.",
"label": 0}

{"prompt": "How to make Peanut Butter Rice Crispy Fantasy fudge crunch at home:",
"solution0": "Mix 3 cups granulated sugar, 3/4 cup margarine, and 2/3 cup evaporated milk in a large, heavy saucepan over medium heat, stirring to dissolve sugar. Bring mixture to a full boil for 5 minutes, stirring constantly. Remove from heat and stir in 12 ounce semi sweet chocolate chips until melted and thoroughly combined. Beat in 1 7 ounce jar of marshmallow creme, 1 1/2 cup Rice Crispies, 1/3 cup chunky peanut butter and 1 teaspoon vanilla extract. Transfer fudge to Greased 8 x 13\" pan and let cool before cutting into squares.",
"solution1": "Mix 3 cups granulated sugar, 3/4 cup margarine, and 2/3 cup evaporated milk in a large, heavy saucepan over medium heat, stirring to dissolve sugar. Bring mixture to a full boil for 5 minutes, stirring constantly. Remove from heat and stir in 12 ounce semi sweet chocolate chips until melted and thoroughly combined. Beat in 1 7 ounce jar of marshmallow creme, 1 1/2 cup Rice Crispies, 1/3 cup chunky peanut butter and 1 teaspoon vanilla extract and 5 eggs. Transfer fudge to Greased 8 x 13\" pan and let cool before cutting into squares.",
"label": 0}

{"prompt": "After staining wood, you should",
"solution0": "allow it to sit for several hours so the stain can dry.",
"solution1": "allow it to sit for several months so the stain can dry.",
"label": 0}

{"prompt": "When a light metal cup falls off a counter,",
"solution0": "it will shatter after hitting the ground.",
"solution1": "it will bounce after hitting the ground.",
"label": 1}
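
The suggestion that the two solutions differ in only one or two words can be checked roughly with a word-level diff. The following is a minimal sketch under stated assumptions: the helper name differing_words is hypothetical, and it splits on whitespace, which is not appropriate for languages written without spaces between words (a language-specific tokenizer would be needed there). It is meant for flagging pairs for manual review, not as an acceptance criterion.

```python
from difflib import SequenceMatcher


def differing_words(solution0, solution1):
    """Count whitespace-separated tokens that differ between two solutions.

    Hypothetical helper: a replaced, inserted, or deleted run of tokens
    contributes the length of the longer side of that run.
    """
    a, b = solution0.split(), solution1.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += max(i2 - i1, j2 - j1)
    return changed


# Two of the example pairs above.
print(differing_words("A discarded tree branch.", "A discarded lead pipe."))  # 2
print(differing_words(
    "it will shatter after hitting the ground.",
    "it will bounce after hitting the ground.",
))  # 1
```

Pairs that score well above two are not necessarily wrong, but they may be worth a second look to confirm that the solutions are as similar as the task allows.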

Important Dates

June 2025: Call for papers released.
September 15, 2025: Shared task submission deadline.
October 1, 2025: Decision notification.
November 5-9, 2025: MRL workshop at EMNLP 2025.
November 2025 through early 2026: Organizers will work with the authors to prepare the compiled dataset and benchmark paper for publication.

Contact Us

Email: mrl2025-workshop@googlegroups.com