XOR-TyDi QA brings together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations that are retrieved from multilingual document collections. There are three sub-tasks: XOR-Retrieve, XOR-EnglishSpan, and XOR-Full.
XOR-TyDi QA is meant to be an academic resource and has significant limitations. Please read our detailed datasheet before considering it for any practical application.
XOR-TyDi QA is distributed under the CC BY-SA 4.0 license. The training, development, and test sets can be downloaded below.
For XOR-Retrieve and XOR-EnglishSpan:
For XOR-Full:
A more comprehensive summary of the data, available resources, baseline models, and evaluation is included in our GitHub repository README, linked below.
Getting Started Guide

To evaluate your models on the three tasks, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script takes as input. To run the evaluation, use:

python evals/eval_xor_[task_name].py --data_file <path_to_input_data> --pred_file <path_to_predictions>
Submission Guide

To submit your models and evaluate them on the official test sets, please read our submission guide hosted on GitHub.

To facilitate future research in cross-lingual open-domain QA and related areas, we also release additional resources such as 30k human-translated questions in 7 languages and a GoldParagraph reading comprehension dataset. We are also happy to share, upon request, the preprocessed Wikipedia databases for the 7 languages, the annotation interface, and the document collections seen by our annotators.
Data formats: please see our README for the format of the translation data. The GoldParagraph zip file contains its own README, which describes the file formats and the differences between the files with span answers only and those that additionally include boolean or long answers alongside short answers.
Ask us questions at akari@cs.washington.edu.
XOR-Retrieve is a cross-lingual retrieval task in which a question is written in a target language (e.g., Japanese) and a system is required to retrieve English paragraphs that answer it. Scores are macro-averaged over the 7 target languages. Although black-box systems (e.g., Google Translate) are effective, we encourage the community to use white-box systems so that all experimental details can be understood. Systems that use external black-box APIs are highlighted in gray and ranked separately in the "Systems using external APIs" table for reference.
Metrics: R@5kt and R@2kt, the fraction of questions for which the minimal answer is contained in the top 5,000 / 2,000 retrieved tokens.
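As a rough illustration of this metric, the sketch below computes a token-budget recall and macro-averages it over languages. It is our own minimal reimplementation under assumed data structures (whitespace tokenization, a per-language list of examples with ranked paragraphs and gold minimal answers); the official scorer is the evals/eval_xor_[task_name].py script referenced above.

```python
# Minimal sketch of R@kt (recall within a budget of k retrieved tokens),
# macro-averaged over languages. Tokenization and data layout are assumptions;
# the official evals/eval_xor_[task_name].py script is authoritative.
from typing import Dict, List


def hit_at_k_tokens(ranked_paragraphs: List[str], answers: List[str], k: int) -> bool:
    """True if any gold minimal answer string appears within the first k
    whitespace tokens of the concatenated, rank-ordered paragraphs."""
    tokens: List[str] = []
    for paragraph in ranked_paragraphs:
        tokens.extend(paragraph.split())
        if len(tokens) >= k:
            break
    window = " ".join(tokens[:k])
    return any(answer in window for answer in answers)


def macro_recall_at_k_tokens(data: Dict[str, List[dict]], k: int = 5000) -> float:
    """`data` maps a language code to examples, each a dict with
    'retrieved' (ranked paragraphs) and 'answers' (gold minimal answers)."""
    per_language = []
    for examples in data.values():
        hits = [hit_at_k_tokens(ex["retrieved"], ex["answers"], k) for ex in examples]
        per_language.append(sum(hits) / len(hits))
    return 100.0 * sum(per_language) / len(per_language)  # percentage, as in the tables
```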
| Rank | Date | Model | Team | R@5kt | R@2kt |
|---|---|---|---|---|---|
| 1 | October 28, 2022 | PrimeQA (DrDecr-large with PLAID + Colbert V2) | IBM Research AI | 74.7 | 69.2 |
| 2 | June 10, 2022 | Quick | Microsoft STCA | 72.0 | 65.6 |
| 3 | August 22, 2022 | PrimeQA (DrDecr with PLAID + Colbert V2) | IBM Research AI | 71.9 | 65.8 |
| 4 | June 21, 2022 | LEXA | Huawei Noah's Ark lab | 70.5 | 65.1 |
| 5 | February 11, 2022 | DrDecr | IBM Research AI | 70.3 | 63.0 |
| 6 | March 14, 2022 | Sentri 2.0 base | Huawei Noah's Ark lab | 64.6 | 58.5 |
| 7 | January 7, 2022 | Contrastive Context-aware Pretraining Model (CCP) | Microsoft STCA | 63.0 | 54.8 |
| 8 | August 26, 2021 | Single Encoder Retriever (Sentri) | Huawei Noah's Ark lab | 61.0 | 52.7 |
| 9 | October 7, 2021 | Single Encoder Retriever (Sentri, resubmission) | Huawei Noah's Ark lab | 60.7 | 55.5 |
| 10 | June 19, 2021 | GAAMA (ColBERT ensemble with xlm-r + UW Translate) | IBM Research AI, NY | 59.9 | 52.8 |
| 11 | April 11, 2021 | DPR + Vanilla Transformer MT | University of Washington, AI2, Google, UT Austin | 50.0 | 42.7 |
| 12 | April 11, 2021 | Multilingual DPR | University of Washington, AI2, Google, UT Austin | 48.0 | 38.8 |
Systems using external APIs:

| Rank | Date | Model | Team | R@5kt | R@2kt |
|---|---|---|---|---|---|
| 1 | June 18, 2021 | GAAMA (ColBERT Ensemble with IBM NMT + Google MT) | IBM Research AI, NY | 71.4 | 65.0 |
| 2 | January 7, 2022 | DrDecr | IBM Research AI, NY | 70.1 | 62.4 |
| 3 | April 11, 2021 | DPR + Google Translate | University of Washington, AI2, Google, UT Austin | 67.2 | 59.3 |
| 4 | April 11, 2021 | Path Retriever + Google Translate | University of Washington, AI2, Google, UT Austin | 61.7 | 58.2 |
XOR-EnglishSpan is a cross-lingual open-retrieval QA task in which a question is written in a target language (e.g., Japanese) and a system is required to output a short answer in English. Scores are macro-averaged over the 7 target languages. As in the XOR-Retrieve tables, systems that use external black-box APIs are highlighted in gray and ranked separately in the "Systems using external APIs" table for reference.
Metrics: F1 and EM over the annotated answer's token set, following prior work (Rajpurkar et al., 2016).
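The sketch below shows the standard SQuAD-style computation of these two metrics for a single prediction. It is a simplified reimplementation, not the official script: the answer normalization here is lowercasing plus whitespace tokenization only, whereas the SQuAD scorer also strips punctuation and English articles.

```python
# Simplified sketch of SQuAD-style EM and token-level F1 (Rajpurkar et al., 2016).
# Assumption: normalization is lowercasing + whitespace splitting only; the
# official scorer also removes punctuation and articles before comparing.
from collections import Counter
from typing import List


def _normalize(text: str) -> List[str]:
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    return float(_normalize(prediction) == _normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = _normalize(prediction), _normalize(gold)
    overlap = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```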
| Rank | Date | Model | Team | F1 | EM |
|---|---|---|---|---|---|
| 1 | July 27, 2021 | GAAMA (XLM-R) with UW MT + pure multilingual MRC | IBM Research AI, NY | 22.7 | 16.5 |
| 2 | April 11, 2021 | DPR + Vanilla Transformer | University of Washington, AI2, Google, UT Austin | 20.5 | 15.7 |
| 3 | April 11, 2021 | Multilingual DPR | University of Washington, AI2, Google, UT Austin | 17.2 | 12.3 |
Systems using external APIs:

| Rank | Date | Model | Team | F1 | EM |
|---|---|---|---|---|---|
| 1 | July 27, 2021 | GAAMA (XLM-R) with IBM & Google MT | IBM Research AI, NY | 35.6 | 28.0 |
| 2 | April 11, 2021 | DPR + Google Translate | University of Washington, AI2, Google, UT Austin | 32.9 | 25.3 |
XOR-Full is a cross-lingual open-retrieval QA task in which a question is written in a target language (e.g., Japanese) and a system is required to output a short answer in that target language. Scores are macro-averaged over the 7 target languages. As in the tables above, systems that use external black-box APIs are highlighted in gray and ranked separately in the "Systems using external APIs" table for reference.
Metrics: F1, EM, and BLEU over the annotated answer's token set.
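For BLEU, a standard choice is the sacrebleu package, as sketched below. The single-reference setup and default tokenizer here are our assumptions (language-specific tokenization matters for answers in scripts such as Japanese), and the official evaluation script remains authoritative.

```python
# Sketch of corpus-level BLEU over predicted answers using sacrebleu.
# Assumptions: one gold answer per question and sacrebleu's default tokenizer;
# the official XOR-Full scorer may handle references and tokenization differently.
import sacrebleu

predictions = ["1947", "tokyo"]    # hypothetical system answers
references = [["1947", "kyoto"]]   # one reference stream, aligned with predictions

bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU: {bleu.score:.1f}")
```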
| Rank | Date | Model | Team | F1 | EM | BLEU |
|---|---|---|---|---|---|---|
| 1 | March 22, 2022 | Sentri + MFiD base | Huawei Noah's Ark lab | 46.2 | 39.0 | 33.7 |
| 2 | July 26, 2021 | CORA | University of Washington, AI2 | 43.5 | 33.5 | 31.1 |
| 3 | July 14, 2021 | Single Encoder Retriever (Sentri) | Huawei Noah's Ark lab | 20.1 | 13.5 | 20.1 |
| 4 | April 11, 2021 | DPR + Vanilla Transformer MT + BM25 (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 9.5 | 6.0 | 8.9 |
| 5 | April 11, 2021 | DPR + Vanilla Transformer MT (English Wikipedia only) | University of Washington, AI2, Google, UT Austin | 7.2 | 3.4 | 6.3 |
Systems using external APIs:

| Rank | Date | Model | Team | F1 | EM | BLEU |
|---|---|---|---|---|---|---|
| 1 | April 11, 2021 | DPR + Google Translate (English Wikipedia only) | University of Washington, AI2, Google, UT Austin | 19.5 | 12.0 | 15.7 |
| 2 | April 11, 2021 | DPR + Google Translate + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 18.7 | 12.1 | 16.8 |
| 3 | April 11, 2021 | Multilingual DPR + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 15.7 | 10.0 | 13.9 |
| 4 | April 11, 2021 | DPR + Vanilla Transformer MT + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 13.7 | 8.7 | 12.0 |
| 5 | April 11, 2021 | Google Custom Search (L Wikipedia only) | University of Washington, AI2, Google, UT Austin | 11.3 | 7.5 | 9.7 |