What is HotpotQA?
HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. It was collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.
For more details about HotpotQA, please refer to our EMNLP 2018 paper (see the BibTeX entry below).
Getting started
HotpotQA is distributed under a CC BY-SA 4.0 License. The training and development sets can be downloaded below.
A more comprehensive guide to data download, preprocessing, baseline model training, and evaluation is available in our GitHub repository, linked below.
Once you have built your model, you can evaluate its performance with the evaluation script we provide below by running: python hotpot_evaluate_v1.py <path_to_prediction> <path_to_gold>
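To make the EM and F1 numbers reported by the evaluation script easier to interpret, here is a minimal sketch of how answer-level exact match and token-level F1 are commonly computed for extractive question answering. It assumes SQuAD-style normalization (lowercasing, removing punctuation and English articles); the function names are illustrative, and the official hotpot_evaluate_v1.py remains the authoritative implementation and may handle special cases differently.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are then the mean of these per-question values, which is why F1 is always at least as high as EM in the leaderboards.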
To submit your models and evaluate them on the official test sets, please read our submission guide hosted on Codalab.
We also release the processed Wikipedia used in the process of creating HotpotQA (also under a CC BY-SA 4.0 License). It serves both as the corpus for the fullwiki setting in our evaluation and, we hope, as a standalone resource for future research involving processed Wikipedia text. The documentation for this corpus is linked below.
Stay connected!
Join our Google group to receive updates or initiate discussions about HotpotQA!
If you use HotpotQA in your research, please cite our paper with the following BibTeX entry:
@inproceedings{yang2018hotpotqa,
  title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
  year={2018}
}
Rank | Date | Model | Affiliation / Reference | Ans EM | Ans F1 | Sup EM | Sup F1 | Joint EM | Joint F1
---|---|---|---|---|---|---|---|---|---
1 | Dec 1, 2019 | HGN-large (single model) | Anonymous | 69.22 | 82.19 | 62.76 | 88.47 | 47.11 | 74.21
2 | Oct 18, 2019 | C2F Reader (single model) | Joint Laboratory of HIT and iFLYTEK Research | 67.98 | 81.24 | 60.81 | 87.63 | 44.67 | 72.73
3 | Nov 19, 2019 | SAE-large (single model) | JD AI Research (Tu, Huang et al., AAAI 2020) | 66.92 | 79.62 | 61.53 | 86.86 | 45.36 | 71.45
4 | Sep 27, 2019 | HGN (single model) | Microsoft Dynamics 365 AI Research (Fang et al., 2019) | 66.07 | 79.36 | 60.33 | 87.33 | 43.57 | 71.03
5 | Jul 29, 2019 | TAP 2 (ensemble) | | 66.64 | 79.82 | 57.21 | 86.69 | 41.21 | 70.65
6 | Oct 1, 2019 | EPS + BERT(wwm) (single model) | Anonymous | 65.79 | 79.05 | 58.50 | 86.26 | 42.47 | 70.48
7 | Jul 29, 2019 | TAP 2 (single model) | | 64.99 | 78.59 | 55.47 | 85.57 | 39.77 | 69.12
8 | May 31, 2019 | EPS + BERT(large) (single model) | Anonymous | 63.29 | 76.36 | 58.25 | 85.60 | 41.39 | 67.92
9 | Aug 31, 2019 | SAE (single model) | JD AI Research (Tu, Huang et al., AAAI 2020) | 60.36 | 73.58 | 56.93 | 84.63 | 38.81 | 64.96
10 | Jun 13, 2019 | P-BERT (single model) | Anonymous | 61.18 | 74.16 | 51.38 | 82.76 | 35.42 | 63.79
11 | Sep 16, 2019 | LQR-net 2 + BERT-Base (single model) | Anonymous | 60.20 | 73.78 | 56.21 | 84.09 | 36.56 | 63.68
12 | Apr 11, 2019 | EPS + BERT (single model) | Anonymous | 60.13 | 73.31 | 52.55 | 83.20 | 35.40 | 63.41
13 | May 16, 2019 | PIPE (single model) | Anonymous | 59.77 | 72.77 | 52.53 | 82.82 | 35.54 | 62.92
14 | Dec 1, 2019 | SEval (single model) | Anonymous | 61.87 | 74.37 | 45.73 | 80.50 | 33.32 | 62.73
15 | Jun 8, 2019 | TAP (single model) | | 58.63 | 71.48 | 46.84 | 82.98 | 32.03 | 61.90
16 | Aug 14, 2019 | SAQA (single model) | Anonymous | 55.07 | 70.22 | 57.62 | 84.19 | 35.94 | 61.72
17 | Sep 2, 2019 | MKGN (single model) | Anonymous | 57.09 | 70.69 | 54.26 | 83.54 | 35.59 | 61.69
18 | Apr 19, 2019 | GRN + BERT (single model) | Anonymous | 55.12 | 68.98 | 52.55 | 84.06 | 32.88 | 60.31
19 | Jun 19, 2019 | LQR-net + BERT-Base (single model) | Anonymous | 57.20 | 70.66 | 50.20 | 82.42 | 31.18 | 59.99
20 | Apr 22, 2019 | DFGN (single model) | Shanghai Jiao Tong University & ByteDance AI Lab (Xiao, Qu, Qiu et al. ACL19) | 56.31 | 69.69 | 51.50 | 81.62 | 33.62 | 59.82
21 | Nov 21, 2018 | QFE (single model) | NTT Media Intelligence Laboratories (Nishida et al., ACL'19) | 53.86 | 68.06 | 57.75 | 84.49 | 34.63 | 59.61
22 | Apr 17, 2019 | LQR-net (ensemble) | Anonymous | 55.19 | 69.55 | 47.15 | 82.42 | 28.42 | 58.86
23 | Mar 4, 2019 | GRN (single model) | Anonymous | 52.92 | 66.71 | 52.37 | 84.11 | 31.77 | 58.47
24 | Mar 1, 2019 | DFGN + BERT (single model) | Anonymous | 55.17 | 68.49 | 49.85 | 81.06 | 31.87 | 58.23
25 | Mar 4, 2019 | BERT Plus (single model) | CIS Lab | 55.84 | 69.76 | 42.88 | 80.74 | 27.13 | 58.23
26 | May 18, 2019 | KGNN (single model) | Tsinghua University (Ye et al., 2019) | 50.81 | 65.75 | 38.74 | 76.79 | 22.40 | 52.82
27 | Oct 10, 2018 | Baseline Model (single model) | Carnegie Mellon University, Stanford University, & Université de Montréal (Yang, Qi, Zhang, et al. 2018) | 45.60 | 59.02 | 20.32 | 64.49 | 10.83 | 40.16
- | Sep 24, 2019 | ChainEx (single model) | UT Austin (Chen et al., 2019) | 61.20 | 74.11 | N/A | N/A | N/A | N/A
- | Feb 27, 2019 | DecompRC (single model) | University of Washington (Min et al., ACL'18) | 55.20 | 69.63 | N/A | N/A | N/A | N/A
- | Apr 2, 2019 | MatrixRC (single model) | Anonymous | 47.07 | 60.75 | N/A | N/A | N/A | N/A
Rank | Date | Model | Affiliation / Reference | Ans EM | Ans F1 | Sup EM | Sup F1 | Joint EM | Joint F1
---|---|---|---|---|---|---|---|---|---
1 | Nov 6, 2019 | Robustly Finetuned Graph-based Recurrent Retriever (single model) | Anonymous | 60.04 | 72.96 | 49.08 | 76.41 | 35.35 | 61.18
2 | Dec 1, 2019 | HGN-large + SemanticRetrievalMRS IR (single model) | Anonymous | 57.85 | 69.93 | 51.01 | 76.82 | 37.17 | 60.74
3 | Oct 7, 2019 | HGN + SemanticRetrievalMRS IR (single model) | Microsoft Dynamics 365 AI Research (Fang et al., 2019) | 56.71 | 69.16 | 49.97 | 76.39 | 35.63 | 59.86
4 | Sep 20, 2019 | Graph-based Recurrent Retriever (single model) | Anonymous | 56.04 | 68.87 | 44.14 | 73.03 | 29.18 | 55.31
5 | Sep 28, 2019 | MIR+EPS+BERT (single model) | Anonymous | 52.86 | 64.79 | 42.75 | 72.00 | 31.19 | 54.75
6 | Sep 21, 2019 | Transformer-XH (single model) | Anonymous | 48.95 | 60.75 | 41.66 | 70.01 | 27.13 | 49.57
7 | May 15, 2019 | SemanticRetrievalMRS (single model) | UNC-NLP (Nie et al., EMNLP'2019) | 45.32 | 57.34 | 38.67 | 70.83 | 25.14 | 47.60
8 | Jul 31, 2019 | Entity-centric BERT Pipeline (single model) | Anonymous | 41.82 | 53.09 | 26.26 | 57.29 | 17.01 | 39.18
9 | May 21, 2019 | GoldEn Retriever (single model) | Stanford University (Qi et al., EMNLP-IJCNLP 2019) | 37.92 | 48.58 | 30.69 | 64.24 | 18.04 | 39.13
10 | Aug 14, 2019 | PR-Bert (single model) | KingSoft AI Lab | 43.33 | 53.79 | 21.90 | 59.63 | 14.50 | 39.11
11 | Dec 4, 2019 | SAFSr-Bert (single model) | Anonymous | 39.35 | 51.40 | 24.21 | 58.54 | 13.34 | 37.00
12 | Feb 21, 2019 | Cognitive Graph QA (single model) | Tsinghua KEG & Alibaba DAMO Academy (Ding et al., ACL'19) | 37.12 | 48.87 | 22.82 | 57.69 | 12.42 | 34.92
13 | Mar 5, 2019 | MUPPET (single model) | Technion (Feldman and El-Yaniv, ACL'19) | 30.61 | 40.26 | 16.65 | 47.33 | 10.85 | 27.01
14 | Apr 7, 2019 | GRN + BERT (single model) | Anonymous | 29.87 | 39.14 | 13.16 | 49.67 | 8.26 | 25.84
15 | May 20, 2019 | Entity-centric IR (single model) | Anonymous | 35.36 | 46.26 | 0.06 | 43.16 | 0.02 | 25.47
16 | May 19, 2019 | KGNN (single model) | Tsinghua University (Ye et al., 2019) | 27.65 | 37.19 | 12.65 | 47.19 | 7.03 | 24.66
17 | Aug 16, 2019 | SAQA (single model) | Anonymous | 28.44 | 38.62 | 14.69 | 47.17 | 8.62 | 24.49
18 | Mar 4, 2019 | GRN (single model) | Anonymous | 27.34 | 36.48 | 12.23 | 48.75 | 7.40 | 23.55
19 | Nov 25, 2018 | QFE (single model) | NTT Media Intelligence Laboratories (Nishida et al., ACL'19) | 28.66 | 38.06 | 14.20 | 44.35 | 8.69 | 23.10
20 | Nov 29, 2019 | SAFSr_model (single model) | Anonymous | 28.91 | 39.14 | 8.03 | 40.55 | 4.06 | 20.90
21 | Oct 12, 2018 | Baseline Model (single model) | Carnegie Mellon University, Stanford University, & Université de Montréal (Yang, Qi, Zhang, et al. 2018) | 23.95 | 32.89 | 3.86 | 37.71 | 1.85 | 16.15
- | May 19, 2019 | TPReasoner w/o BERT (single model) | Anonymous | 36.04 | 47.43 | N/A | N/A | N/A | N/A
- | Feb 28, 2019 | DecompRC (single model) | University of Washington (Min et al., ACL'18) | 30.00 | 40.65 | N/A | N/A | N/A | N/A
- | Mar 3, 2019 | MultiQA (single model) | Anonymous | 30.73 | 40.23 | N/A | N/A | N/A | N/A
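
The Ans, Sup, and Joint columns in the leaderboards above report answer, supporting-fact, and joint metrics. As a rough illustration of how the joint scores relate to the other two, the sketch below follows the definition given in the HotpotQA paper as we understand it: for each question, joint EM is the product of the two EM values, and joint F1 is computed from the products of the two precisions and the two recalls. The function and argument names are hypothetical; the official evaluation script is the authoritative reference.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall, defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def joint_scores(ans_em, ans_prec, ans_recall, sup_em, sup_prec, sup_recall):
    """Combine per-question answer and supporting-fact scores into joint scores.

    All inputs are per-question values in [0, 1].
    """
    joint_em = ans_em * sup_em            # 1.0 only if both parts match exactly
    joint_prec = ans_prec * sup_prec      # joint precision
    joint_recall = ans_recall * sup_recall  # joint recall
    return {"joint_em": joint_em, "joint_f1": f1_from(joint_prec, joint_recall)}


# Example: a correct answer with one spurious supporting fact.
example = joint_scores(1.0, 1.0, 1.0, 0.0, 0.5, 1.0)
```

Because the joint scores are computed per question and then averaged, a leaderboard entry's Joint F1 is generally not the product of its reported Ans F1 and Sup F1.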