XOR-TyDi

Cross-lingual Open-Retrieval Question Answering

What is XOR-TyDi QA?

XOR-TyDi QA brings together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations retrieved from multilingual document collections. There are three sub-tasks: XOR-Retrieve, XOR-English Span, and XOR-Full.



XOR-TyDi QA is meant to be an academic resource and has significant limitations. Please read our detailed datasheet before considering it for any practical application.


Getting Started

XOR-TyDi QA is distributed under the CC BY-SA 4.0 license. The training, development and test sets can be downloaded below.


For XOR-Retrieve and XOR-English Span:

For XOR-Full:


A more comprehensive summary of the data, available resources, baseline models, and evaluation is included in our GitHub repository README, which is linked below.

Getting Started Guide

Evaluation and Submission to Leaderboard

To evaluate your models on the three tasks, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use python evals/eval_xor_[task_name].py --data_file <path_to_input_data> --pred_file <path_to_predictions>.
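For example, assuming the retrieval script follows the naming pattern above as evals/eval_xor_retrieve.py, a concrete call with placeholder file names would be python evals/eval_xor_retrieve.py --data_file xor_dev_data.jsonl --pred_file my_predictions.json; substitute the task name and your local file paths for the other tasks.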

To submit your models and evaluate them on the official test sets, please read our submission guide hosted on GitHub.

Submission Guide

Additional resources

To facilitate future research in cross-lingual open-domain QA and related areas, we also release additional resources such as 30k human-translated questions in 7 languages and a GoldParagraph reading comprehension dataset. We are also happy to share the preprocessed Wikipedia databases for 7 languages, the annotation interface, and the document collections seen by our annotators upon request.

Data formats: please see our README for the format details of the translation data. The GoldParagraph zip file contains its own README, which describes the file formats and the differences between the data files with span answers only and those that additionally include boolean or long answers alongside short answers.

Have Questions?

Ask us questions at akari@cs.washington.edu

XOR-TyDi v1.1 Leaderboard

Task 1: XOR-Retrieve

XOR-Retrieve is a cross-lingual retrieval task where a question is written in a target language (e.g., Japanese) and a system is required to retrieve English paragraphs that answer the question. The scores are macro-averaged over the 7 target languages.
Although black-box systems (e.g., Google Translate) are effective, we encourage the community to use white-box systems so that all experimental details can be understood. Systems that use external black-box APIs are highlighted in gray and ranked separately in the "Systems using external APIs" table for reference.


Metrics: R@5kt, R@2kt (recall, computed as the fraction of questions for which the minimal answer is contained in the top 5,000 / 2,000 tokens of retrieved text).
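As an illustration only (a minimal sketch, not the official evals/ script), recall at k tokens might be computed as follows, assuming predictions maps each question id to a ranked list of retrieved paragraph strings and answers maps the same id to its acceptable minimal answer strings:

def recall_at_k_tokens(predictions, answers, k=5000):
    # Count questions whose minimal answer appears within the top-k retrieved tokens.
    hits = 0
    for qid, paragraphs in predictions.items():
        # Concatenate the ranked paragraphs and keep only the first k whitespace tokens.
        top_k_text = " ".join(" ".join(paragraphs).split()[:k])
        if any(ans in top_k_text for ans in answers[qid]):
            hits += 1
    return hits / len(predictions)

# Toy usage; the official score additionally macro-averages over the 7 languages.
preds = {"q1": ["Mount Fuji is 3,776 meters tall.", "It is the highest peak in Japan."]}
golds = {"q1": ["3,776 meters"]}
print(recall_at_k_tokens(preds, golds, k=2000))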

Rank | Date | Model | Institution | R@5kt | R@2kt
1 | October 28, 2022 | PrimeQA (DrDecr-large with PLAID + Colbert V2) | IBM Research AI | 74.7 | 69.2
2 | June 10, 2022 | Quick | Microsoft STCA | 72.0 | 65.6
3 | August 22, 2022 | PrimeQA (DrDecr with PLAID + Colbert V2) | IBM Research AI | 71.9 | 65.8
4 | June 21, 2022 | LEXA | Huawei Noah's Ark lab | 70.5 | 65.1
5 | February 11, 2022 | DrDecr | IBM Research AI | 70.3 | 63.0
6 | March 14, 2022 | Sentri 2.0 base | Huawei Noah's Ark lab | 64.6 | 58.5
7 | January 7, 2022 | Contrastive Context-aware Pretraining Model (CCP) | Microsoft STCA | 63.0 | 54.8
8 | August 26, 2021 | Single Encoder Retriever (Sentri) | Huawei Noah's Ark lab | 61.0 | 52.7
9 | October 7, 2021 | Single Encoder Retriever (Sentri, resubmission) | Huawei Noah's Ark lab | 60.7 | 55.5
10 | June 19, 2021 | GAAMA (ColBERT ensemble with xlm-r + UW Translate) | IBM Research AI, NY | 59.9 | 52.8
11 | April 11, 2021 | DPR + Vanilla Transformer MT | University of Washington, AI2, Google, UT Austin | 50.0 | 42.7
12 | April 11, 2021 | Multilingual DPR | University of Washington, AI2, Google, UT Austin | 48.0 | 38.8

*Systems using external APIs

Rank | Date | Model | Institution | R@5kt | R@2kt
1 | June 18, 2021 | GAAMA (ColBERT Ensemble with IBM NMT + Google MT) | IBM Research AI, NY | 71.4 | 65.0
2 | January 7, 2022 | DrDecr | IBM Research AI, NY | 70.1 | 62.4
3 | April 11, 2021 | DPR + Google Translate | University of Washington, AI2, Google, UT Austin | 67.2 | 59.3
4 | April 11, 2021 | Path Retriever + Google Translate | University of Washington, AI2, Google, UT Austin | 61.7 | 58.2

Task 2: XOR-English Span

XOR-English Span is a cross-lingual open-retrieval QA task where a question is written in a target language (e.g., Japanese) and a system is required to output a short answer in English. The scores are macro-averaged over the 7 target languages.
As in the XOR-Retrieve tables, systems that use external black-box APIs are highlighted in gray and ranked separately in the "Systems using external APIs" table for reference.


Metrics: F1, EM over the annotated answer’s token set following prior work (Rajpurkar et al., 2016).
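For reference, a minimal sketch of SQuAD-style exact match and token-level F1 (this is not the official evals/ script, which also applies answer normalization such as lowercasing and punctuation/article removal):

from collections import Counter

def exact_match(prediction, gold):
    # 1.0 if the predicted answer string matches the gold answer exactly, else 0.0.
    return float(prediction.strip() == gold.strip())

def token_f1(prediction, gold):
    # Harmonic mean of token-level precision and recall between prediction and gold.
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("3,776 meters", "3,776 meters"), token_f1("about 3,776 meters", "3,776 meters"))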

Rank | Date | Model | Institution | F1 | EM
1 | July 27, 2021 | GAAMA (XLM-R) With UW MT + pure multilingual MRC | IBM Research AI, NY | 22.7 | 16.5
2 | April 11, 2021 | DPR + Vanilla Transformer | University of Washington, AI2, Google, UT Austin | 20.5 | 15.7
3 | April 11, 2021 | Multilingual DPR | University of Washington, AI2, Google, UT Austin | 17.2 | 12.3

*Systems using external APIs

Rank | Date | Model | Institution | F1 | EM
1 | July 27, 2021 | GAAMA (XLM-R) with IBM & Google MT | IBM Research AI, NY | 35.6 | 28.0
2 | April 11, 2021 | DPR + Google Translate | University of Washington, AI2, Google, UT Austin | 32.9 | 25.3

Task 3: XOR-Full

XOR-Full is a cross-lingual open-retrieval QA task where a question is written in the target language (e.g., Japanese) and a system is required to output a short answer in that same target language. The scores are macro-averaged over the 7 target languages.
As in the tables above, systems that use external black-box APIs are highlighted in gray and ranked separately in the "Systems using external APIs" table for reference.


Metrics: F1, EM, BLEU over the annotated answer's token set.
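F1 and EM follow the same token-level definitions as in XOR-English Span above; for BLEU, a minimal sketch using the sacrebleu library is shown below (an assumption: the official evals/ script may use different, language-specific tokenization, e.g., for Japanese):

import sacrebleu  # third-party library; pip install sacrebleu

def answer_bleu(prediction, gold):
    # sentence_bleu takes one hypothesis string and a list of reference strings
    # and returns a BLEUScore object whose .score field is on a 0-100 scale.
    return sacrebleu.sentence_bleu(prediction, [gold]).score

print(answer_bleu("3 776 metres", "3 776 metres"))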

Rank | Date | Model | Institution | F1 | EM | BLEU
1 | March 22, 2022 | Sentri + MFiD base | Huawei Noah's Ark lab | 46.2 | 39.0 | 33.7
2 | July 26, 2021 | CORA | University of Washington, AI2 | 43.5 | 33.5 | 31.1
3 | July 14, 2021 | Single Encoder Retriever (Sentri) | Huawei Noah's Ark lab | 20.1 | 13.5 | 20.1
4 | April 11, 2021 | DPR + Vanilla Transformer MT + BM25 (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 9.5 | 6.0 | 8.9
5 | April 11, 2021 | DPR + Vanilla Transformer MT (English Wikipedia only) | University of Washington, AI2, Google, UT Austin | 7.2 | 3.4 | 6.3

*Systems using external APIs

Rank | Date | Model | Institution | F1 | EM | BLEU
1 | April 11, 2021 | DPR + Google Translate (English Wikipedia only) | University of Washington, AI2, Google, UT Austin | 19.5 | 12.0 | 15.7
2 | April 11, 2021 | DPR + Google Translate + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 18.7 | 12.1 | 16.8
3 | April 11, 2021 | Multilingual DPR + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 15.7 | 10.0 | 13.9
4 | April 11, 2021 | DPR + Vanilla Transformer MT + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 13.7 | 8.7 | 12.0
5 | April 11, 2021 | Google Custom Search (L Wikipedia only) | University of Washington, AI2, Google, UT Austin | 11.3 | 7.5 | 9.7