XOR-TyDi QA brings together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations that are retrieved from multilingual document collections. There are three sub-tasks: XOR-Retrieve, XOR-English Span, and XOR-Full.
XOR-TyDi QA is meant to be an academic resource and has significant limitations. Please read our detailed datasheet before considering it for any practical application.
XOR-TyDi QA is distributed under the CC BY-SA 4.0 license. The training, development and test sets can be downloaded below.
For XOR-Retrieve and XOR-English Span:
For XOR-Full:
A more comprehensive summary of the data, available resources, baseline models, and evaluation is included in our Github repository README, which is linked below.
Getting Started Guide

To evaluate your models on the three tasks, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use `python evals/eval_xor_[task_name].py --data_file <path_to_input_data> --pred_file <path_to_predictions>`.
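As an illustration only, an evaluation run driven from Python might look like the sketch below. The prediction-file schema, task name, and paths are hypothetical; the sample prediction file and README in the repository are authoritative.

```python
# Minimal sketch of writing predictions and invoking the evaluation script.
# NOTE: the prediction schema, task name, and paths below are hypothetical
# illustrations; the sample prediction file in the repository is authoritative.
import json
import subprocess

# Hypothetical predictions: one record per question.
predictions = [
    {"id": "example-question-id", "lang": "ja", "answers": ["example answer"]},
]
with open("predictions.json", "w") as f:
    json.dump(predictions, f, ensure_ascii=False)

subprocess.run(
    [
        "python", "evals/eval_xor_full.py",        # substitute the task name you are evaluating
        "--data_file", "path/to/dev_data.jsonl",   # hypothetical path to the input data
        "--pred_file", "predictions.json",
    ],
    check=True,
)
```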
To submit your models and evaluate them on the official test sets, please read our submission guide hosted on Github.
To facilitate future research in cross-lingual open-domain QA and related areas, we also release additional resources, such as 30k human-translated questions in 7 languages and a GoldParagraph reading comprehension dataset. We are also happy to share the preprocessed Wikipedia databases for the 7 languages, the annotation interface, and the document collections seen by our annotators upon request.
Details of the formats: please see our README for the format of the translation data. The GoldParagraph zip file contains its own README, which describes the file formats and the differences between the files with span answers only and those that also include boolean or long answers in addition to short answers.
Ask us questions at akari@cs.washington.edu
XOR-Retrieve is a cross-lingual retrieval task in which a question is written in a target language (e.g., Japanese) and a system is required to retrieve English paragraphs that answer the question. Scores are macro-averaged over the 7 target languages. Although black-box systems (e.g., Google Translate) are effective, we encourage the community to use white-box systems so that all experimental details can be understood. Systems using external black-box APIs are highlighted in gray and ranked in the separate "Systems using external APIs" table for reference.
Metrics: R@5kt, R@2kt (recall at 5,000 / 2,000 tokens: the fraction of questions for which the minimal answer is contained in the top 5,000 / 2,000 retrieved tokens).
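As a rough illustration only (not the official implementation; use the evaluation script above for reporting), R@kt can be sketched by checking whether a gold minimal answer string appears within the first k tokens of the retrieved paragraphs and then macro-averaging the per-language recalls. The input schema and whitespace tokenization here are simplifying assumptions:

```python
# Rough sketch of R@kt: the fraction of questions whose minimal answer appears
# within the first k (whitespace) tokens of the retrieved paragraphs,
# macro-averaged over languages. Schema and tokenization are assumptions;
# the official evaluation script is authoritative.
from collections import defaultdict

def recall_at_k_tokens(examples, k=5000):
    """examples: list of dicts with 'lang', 'answers' (gold answer strings),
    and 'retrieved' (ranked paragraph strings) -- a hypothetical schema."""
    per_lang = defaultdict(list)
    for ex in examples:
        budget = " ".join(" ".join(ex["retrieved"]).split()[:k])
        per_lang[ex["lang"]].append(any(ans in budget for ans in ex["answers"]))
    lang_recalls = [sum(hits) / len(hits) for hits in per_lang.values()]
    return 100.0 * sum(lang_recalls) / len(lang_recalls)
```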
| Rank | Date | Model | Organization | R@5kt | R@2kt |
|---|---|---|---|---|---|
| 1 | October 28, 2022 | PrimeQA (DrDecr-large with PLAID + Colbert V2) | IBM Research AI | 74.7 | 69.2 |
| 2 | June 10, 2022 | Quick | Microsoft STCA | 72.0 | 65.6 |
| 3 | August 22, 2022 | PrimeQA (DrDecr with PLAID + Colbert V2) | IBM Research AI | 71.9 | 65.8 |
| 4 | June 21, 2022 | LEXA | Huawei Noah's Ark lab | 70.5 | 65.1 |
| 5 | February 11, 2022 | DrDecr | IBM Research AI | 70.3 | 63.0 |
| 6 | March 14, 2022 | Sentri 2.0 base | Huawei Noah's Ark lab | 64.6 | 58.5 |
| 7 | January 7, 2022 | Contrastive Context-aware Pretraining Model (CCP) | Microsoft STCA | 63.0 | 54.8 |
| 8 | August 26, 2021 | Single Encoder Retriever (Sentri) | Huawei Noah's Ark lab | 61.0 | 52.7 |
| 9 | October 7, 2021 | Single Encoder Retriever (Sentri, resubmission) | Huawei Noah's Ark lab | 60.7 | 55.5 |
| 10 | June 19, 2021 | GAAMA (ColBERT ensemble with xlm-r + UW Translate) | IBM Research AI, NY | 59.9 | 52.8 |
| 11 | April 11, 2021 | DPR + Vanilla Transformer MT | University of Washington, AI2, Google, UT Austin | 50.0 | 42.7 |
| 12 | April 11, 2021 | Multilingual DPR | University of Washington, AI2, Google, UT Austin | 48.0 | 38.8 |
Systems using external APIs:

| Rank | Date | Model | Organization | R@5kt | R@2kt |
|---|---|---|---|---|---|
| 1 | June 18, 2021 | GAAMA (ColBERT Ensemble with IBM NMT + Google MT) | IBM Research AI, NY | 71.4 | 65.0 |
| 2 | January 7, 2022 | DrDecr | IBM Research AI, NY | 70.1 | 62.4 |
| 3 | April 11, 2021 | DPR + Google Translate | University of Washington, AI2, Google, UT Austin | 67.2 | 59.3 |
| 4 | April 11, 2021 | Path Retriever + Google Translate | University of Washington, AI2, Google, UT Austin | 61.7 | 58.2 |
XOR-English Span is a cross-lingual retrieval task in which a question is written in a target language (e.g., Japanese) and a system is required to output a short answer in English. Scores are macro-averaged over the 7 target languages. As in the XOR-Retrieve tables, systems using external black-box APIs are highlighted in gray and ranked in the "Systems using external APIs" table for reference.
Metrics: F1 and EM over the annotated answer's token set, following prior work (Rajpurkar et al., 2016).
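For reference, SQuAD-style EM and token-level F1 (Rajpurkar et al., 2016) can be sketched as below; answer normalization and aggregation details follow the official evaluation script linked above, so this is illustrative only:

```python
# Sketch of SQuAD-style EM and token-level F1 (Rajpurkar et al., 2016).
# Uses the common SQuAD normalization (lowercase, strip punctuation and articles);
# the official evaluation script is authoritative.
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```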
| Rank | Date | Model | Organization | F1 | EM |
|---|---|---|---|---|---|
| 1 | July 27, 2021 | GAAMA (XLM-R) with UW MT + pure multilingual MRC | IBM Research AI, NY | 22.7 | 16.5 |
| 2 | April 11, 2021 | DPR + Vanilla Transformer | University of Washington, AI2, Google, UT Austin | 20.5 | 15.7 |
| 3 | April 11, 2021 | Multilingual DPR | University of Washington, AI2, Google, UT Austin | 17.2 | 12.3 |
Systems using external APIs:

| Rank | Date | Model | Organization | F1 | EM |
|---|---|---|---|---|---|
| 1 | July 27, 2021 | GAAMA (XLM-R) with IBM & Google MT | IBM Research AI, NY | 35.6 | 28.0 |
| 2 | April 11, 2021 | DPR + Google Translate | University of Washington, AI2, Google, UT Austin | 32.9 | 25.3 |
XOR-Full is a cross-lingual retrieval task in which a question is written in the target language (e.g., Japanese) and a system is required to output a short answer in the target language. Scores are macro-averaged over the 7 target languages. As in the tables above, systems using external black-box APIs are highlighted in gray and ranked in the "Systems using external APIs" table for reference.
Metrics: F1, EM, and BLEU over the annotated answer's token set.
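EM and F1 follow the same token-level recipe sketched in the XOR-English Span section, applied here to in-language answers; BLEU over the short answers can be sketched with sacrebleu as below. In-language answers generally require language-specific tokenization (omitted in this sketch), the input schema is an assumption, and the official scores are additionally macro-averaged over the 7 languages, so the official script remains authoritative:

```python
# Sketch of per-question BLEU over short answers, averaged across questions.
# Language-specific tokenization (e.g., for Japanese) is omitted here; the schema
# is a hypothetical illustration and the official script is authoritative.
import sacrebleu

def answer_bleu(predictions, references):
    """predictions: predicted answer strings;
    references: list of lists of gold answer strings (hypothetical schema)."""
    scores = [
        sacrebleu.sentence_bleu(pred, refs).score
        for pred, refs in zip(predictions, references)
    ]
    return sum(scores) / len(scores)
```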
| Rank | Date | Model | Organization | F1 | EM | BLEU |
|---|---|---|---|---|---|---|
| 1 | March 22, 2022 | Sentri + MFiD base | Huawei Noah's Ark lab | 46.2 | 39.0 | 33.7 |
| 2 | July 26, 2021 | CORA | University of Washington, AI2 | 43.5 | 33.5 | 31.1 |
| 3 | July 14, 2021 | Single Encoder Retriever (Sentri) | Huawei Noah's Ark lab | 20.1 | 13.5 | 20.1 |
| 4 | April 11, 2021 | DPR + Vanilla Transformer MT + BM25 (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 9.5 | 6.0 | 8.9 |
| 5 | April 11, 2021 | DPR + Vanilla Transformer MT (English Wikipedia only) | University of Washington, AI2, Google, UT Austin | 7.2 | 3.4 | 6.3 |
Systems using external APIs:

| Rank | Date | Model | Organization | F1 | EM | BLEU |
|---|---|---|---|---|---|---|
| 1 | April 11, 2021 | DPR + Google Translate (English Wikipedia only) | University of Washington, AI2, Google, UT Austin | 19.5 | 12.0 | 15.7 |
| 2 | April 11, 2021 | DPR + Google Translate + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 18.7 | 12.1 | 16.8 |
| 3 | April 11, 2021 | Multilingual DPR + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 15.7 | 10.0 | 13.9 |
| 4 | April 11, 2021 | DPR + Vanilla Transformer MT + Google Custom Search (English Wikipedia + L Wikipedia) | University of Washington, AI2, Google, UT Austin | 13.7 | 8.7 | 12.0 |
| 5 | April 11, 2021 | Google Custom Search (L Wikipedia only) | University of Washington, AI2, Google, UT Austin | 11.3 | 7.5 | 9.7 |