XOR-TyDi

Cross-lingual Open-Retrieval Question Answering

What is XOR-TyDi QA?

XOR-TyDi QA brings together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations retrieved from multilingual document collections. There are three sub-tasks: XOR-Retrieve, XOR-English Span, and XOR-Full.



XOR-TyDi QA is meant to be an academic resource and has significant limitations. Please read our detailed datasheet before considering it for any practical application.


Getting Started

XOR-TyDi QA is distributed under the CC BY-SA 4.0 license. The training, development and test sets can be downloaded below.


For XOR-Retrieve and XOR-English Span:

For XOR-Full:


A more comprehensive summary of the data, available resources, baseline models, and evaluation is included in our GitHub repository README, which is linked below.

Getting Started Guide

Evaluation and Submission to Leaderboard

To evaluate your models on the three tasks, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use python evals/eval_xor_[task_name].py --data_file <path_to_input_data> --pred_file <path_to_predictions>.
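For example, assuming the retrieval script follows the naming pattern above as evals/eval_xor_retrieve.py, a concrete call with placeholder file names would be python evals/eval_xor_retrieve.py --data_file xor_dev_data.jsonl --pred_file my_predictions.json; substitute the task name and your local file paths for the other tasks.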

To submit your models and evaluate them on the official test sets, please read our submission guide hosted on GitHub.

Submission Guide

Additional resources

To facilitate future research in cross-lingual open-domain QA and related areas, we also release additional resources such as 30k human-translated questions in 7 languages and a GoldParagraph reading comprehension dataset. We are also happy to share the preprocessed Wikipedia databases for 7 languages, the annotation interface, and the document collections seen by our annotators upon request.

Data formats: please see our README for the format details of the translation data. The GoldParagraph zip file contains its own README, which describes the file formats and the differences between the data files with span answers only and those that additionally include boolean or long answers alongside short answers.

Have Questions?

Ask us questions at akari@cs.washington.edu

XOR-TyDi v1.1 Leaderboard

Task 1: XOR-Retrieve

XOR-Retrieve is a cross-lingual retrieval task where a question is written in a target language (e.g., Japanese) and a system is required to retrieve English paragraphs that answer the question. The scores are macro-averaged over the 7 target languages.
Although black-box systems (e.g., Google Translate) are effective, we encourage the community to use white-box systems so that all experimental details can be understood. Systems that use external black-box APIs are highlighted in gray and ranked separately in the "Systems using external APIs" table for reference.


Metrics: R@5kt, R@2kt (recall, computed as the fraction of questions for which the minimal answer is contained in the top 5,000 / 2,000 tokens of retrieved text).
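As an illustration only (a minimal sketch, not the official evals/ script), recall at k tokens might be computed as follows, assuming predictions maps each question id to a ranked list of retrieved paragraph strings and answers maps the same id to its acceptable minimal answer strings:

def recall_at_k_tokens(predictions, answers, k=5000):
    # Count questions whose minimal answer appears within the top-k retrieved tokens.
    hits = 0
    for qid, paragraphs in predictions.items():
        # Concatenate the ranked paragraphs and keep only the first k whitespace tokens.
        top_k_text = " ".join(" ".join(paragraphs).split()[:k])
        if any(ans in top_k_text for ans in answers[qid]):
            hits += 1
    return hits / len(predictions)

# Toy usage; the official score additionally macro-averages over the 7 languages.
preds = {"q1": ["Mount Fuji is 3,776 meters tall.", "It is the highest peak in Japan."]}
golds = {"q1": ["3,776 meters"]}
print(recall_at_k_tokens(preds, golds, k=2000))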

Rank | Date | Model | Institution | R@5kt | R@2kt
1 | October 28, 2022 | PrimeQA (DrDecr-large with PLAID + Colbert V2) | IBM Research AI | 74.7 | 69.2
2 | June 10, 2022 | Quick | Microsoft STCA | 72.0 | 65.6
3 | August 22, 2022 | PrimeQA (DrDecr with PLAID + Colbert V2) | IBM Research AI | 71.9 | 65.8
4 | June 21, 2022 | LEXA | Huawei Noah's Ark lab | 70.5 | 65.1
5 | February 11, 2022 | DrDecr | IBM Research AI | 70.3 | 63.0
6 | March 14, 2022 | Sentri 2.0 base | Huawei Noah's Ark lab | 64.6 | 58.5
7 | January 7, 2022 | Contrastive Context-aware Pretraining Model (CCP) | Microsoft STCA | 63.0 | 54.8
8 | August 26, 2021 | Single Encoder Retriever (Sentri) | Huawei Noah's Ark lab | 61.0 | 52.7
9 | October 7, 2021 | Single Encoder Retriever (Sentri, resubmission) | Huawei Noah's Ark lab | 60.7 | 55.5
10 | June 19, 2021 | GAAMA (ColBERT ensemble with xlm-r + UW Translate) | IBM Research AI, NY | 59.9 | 52.8
11 | April 11, 2021 | DPR + Vanilla Transformer MT | University of Washington, AI2, Google, UT Austin | 50.0 | 42.7
12 | April 11, 2021 | Multilingual DPR | University of Washington, AI2, Google, UT Austin | 48.0 | 38.8

*Systems using external APIs

Rank | Date | Model | Institution | R@5kt | R@2kt
1 | June 18, 2021 | GAAMA (ColBERT Ensemble with IBM NMT + Google MT) | IBM Research AI, NY | 71.4 | 65.0
2 | January 7, 2022 | DrDecr | IBM Research AI, NY | 70.1 | 62.4
3 | April 11, 2021 | DPR + Google Translate | University of Washington, AI2, Google, UT Austin | 67.2 | 59.3
4 | April 11, 2021 | Path Retriever + Google Translate | University of Washington, AI2, Google, UT Austin | 61.7 | 58.2

Task 2: XOR-English Span

XOR-English Span is a cross-lingual open-retrieval QA task where a question is written in a target language (e.g., Japanese) and a system is required to output a short answer in English. The scores are macro-averaged over the 7 target languages.
As in the XOR-Retrieve tables, systems that use external black-box APIs are highlighted in gray and ranked separately in the "Systems using external APIs" table for reference.


Metrics: F1, EM over the annotated answer’s token set following prior work (Rajpurkar et al., 2016).
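For reference, a minimal sketch of SQuAD-style exact match and token-level F1 (this is not the official evals/ script, which also applies answer normalization such as lowercasing and punctuation/article removal):

from collections import Counter

def exact_match(prediction, gold):
    # 1.0 if the predicted answer string matches the gold answer exactly, else 0.0.
    return float(prediction.strip() == gold.strip())

def token_f1(prediction, gold):
    # Harmonic mean of token-level precision and recall between prediction and gold.
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("3,776 meters", "3,776 meters"), token_f1("about 3,776 meters", "3,776 meters"))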

Rank | Date | Model | Institution | F1 | EM
1 | July 27, 2021 | GAAMA (XLM-R) With UW MT + pure multilingual MRC | IBM Research AI, NY | 22.7 | 16.5
2 | April 11, 2021 | DPR + Vanilla Transformer | University of Washington, AI2, Google, UT Austin | 20.5 | 15.7
3 | April 11, 2021 | Multilingual DPR | University of Washington, AI2, Google, UT Austin | 17.2 | 12.3

*Systems using external APIs

Rank | Date | Model | Institution | F1 | EM
1 | July 27, 2021 | GAAMA (XLM-R) with IBM & Google MT | IBM Research AI, NY | 35.6 | 28.0
2 | April 11, 2021 | DPR + Google Translate | University of Washington, AI2, Google, UT Austin | 32.9 | 25.3

Task 3: XOR-Full

XOR-Full is a cross-lingual open-retrieval QA task where a question is written in the target language (e.g., Japanese) and a system is required to output a short answer in that same target language. The scores are macro-averaged over the 7 target languages.
As in the tables above, systems that use external black-box APIs are highlighted in gray and ranked separately in the "Systems using external APIs" table for reference.


Metrics: F1, EM, BLEU over the annotated answer's token set.
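F1 and EM follow the same token-level definitions as in XOR-English Span above; for BLEU, a minimal sketch using the sacrebleu library is shown below (an assumption: the official evals/ script may use different, language-specific tokenization, e.g., for Japanese):

import sacrebleu  # third-party library; pip install sacrebleu

def answer_bleu(prediction, gold):
    # sentence_bleu takes one hypothesis string and a list of reference strings
    # and returns a BLEUScore object whose .score field is on a 0-100 scale.
    return sacrebleu.sentence_bleu(prediction, [gold]).score

print(answer_bleu("3 776 metres", "3 776 metres"))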

Rank | Date | Model | Institution | F1 | EM | BLEU
1 | March 22, 2022 | Sentri + MFiD base | Huawei Noah's Ark lab | 46.2 | 39.0 | 33.7
2 | July 26, 2021 | CORA | University of Washington, AI2 | 43.5 | 33.5 | 31.1
3 | July 14, 2021 | Single Encoder Retriever (Sentri) | Huawei Noah's Ark lab | 20.1 | 13.5 | 20.1
4 | April 11, 2021 | DPR + Vanilla Transformer MT + BM25 (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 9.5 | 6.0 | 8.9
5 | April 11, 2021 | DPR + Vanilla Transformer MT (English Wikipedia only) | University of Washington, AI2, Google, UT Austin | 7.2 | 3.4 | 6.3

*Systems using external APIs

Rank | Date | Model | Institution | F1 | EM | BLEU
1 | April 11, 2021 | DPR + Google Translate (English Wikipedia only) | University of Washington, AI2, Google, UT Austin | 19.5 | 12.0 | 15.7
2 | April 11, 2021 | DPR + Google Translate + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 18.7 | 12.1 | 16.8
3 | April 11, 2021 | Multilingual DPR + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 15.7 | 10.0 | 13.9
4 | April 11, 2021 | DPR + Vanilla Transformer MT + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 13.7 | 8.7 | 12.0
5 | April 11, 2021 | Google Custom Search (L Wikipedia only) | University of Washington, AI2, Google, UT Austin | 11.3 | 7.5 | 9.7