HotpotQA

A Dataset for Diverse, Explainable Multi-hop Question Answering

What is HotpotQA?

HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. It was collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

For more details about HotpotQA, please refer to our EMNLP 2018 paper, cited in the BibTeX entry further down this page.

Getting started

HotpotQA is distributed under a CC BY-SA 4.0 License. The training and development sets can be downloaded below.

A more comprehensive guide to data download, preprocessing, baseline model training, and evaluation is included in our GitHub repository, linked below.

Once you have built your model, you can evaluate its performance with the evaluation script we provide below by running:

    python hotpot_evaluate_v1.py <path_to_prediction> <path_to_gold>
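The following is a minimal sketch of how a prediction file for <path_to_prediction> might be assembled. It assumes the file is a single JSON object with an "answer" map (question ID to answer string) and an "sp" map (question ID to a list of [paragraph title, sentence index] pairs). These key names, the helper functions build_predictions and write_predictions, and the toy question ID are assumptions for illustration; please check the GitHub repository for the authoritative format.

```python
import json


def build_predictions(answers, supporting_facts):
    """Merge per-question outputs into one prediction dictionary.

    answers: dict mapping question ID -> predicted answer string.
    supporting_facts: dict mapping question ID -> iterable of
        (paragraph_title, sentence_index) pairs.
    """
    # Every question should have both an answer and supporting facts.
    mismatched = set(answers) ^ set(supporting_facts)
    if mismatched:
        raise ValueError(f"IDs missing from one input: {sorted(mismatched)}")
    return {
        # Assumed top-level keys: "answer" and "sp".
        "answer": dict(answers),
        "sp": {
            qid: [[title, int(index)] for title, index in facts]
            for qid, facts in supporting_facts.items()
        },
    }


def write_predictions(predictions, path):
    """Serialize the prediction dictionary as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(predictions, handle, ensure_ascii=False)


# Toy example with a made-up question ID and paragraph titles.
example = build_predictions(
    {"q1": "yes"},
    {"q1": [("Example Title", 0), ("Another Title", 2)]},
)
print(json.dumps(example, indent=2))
```

Calling write_predictions(example, "pred.json") would then produce a file that could be passed as the first argument to the evaluation script, provided the assumed format matches.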

To submit your models and evaluate them on the official test sets, please read our submission guide hosted on Codalab.

We also release the processed Wikipedia used in the process of creating HotpotQA (also under a CC BY-SA 4.0 License). It serves as the corpus for the fullwiki setting in our evaluation, and we hope it will also be useful as a standalone resource for future research involving processed Wikipedia text. The documentation for this corpus is linked below.

Stay connected!

Join our Google group to receive updates or initiate discussions about HotpotQA!

If you use HotpotQA in your research, please cite our paper with the following BibTeX entry:

@inproceedings{yang2018hotpotqa,
  title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
  year={2018}
}
Leaderboard (Distractor Setting)
In the distractor setting, a question-answering system reads 10 paragraphs to provide an answer (Ans) to a question. It must also justify its answer with supporting facts (Sup).
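The Ans, Sup, and Joint scores reported in the leaderboards can be illustrated with a simplified sketch. It assumes that answers are compared after light normalization (lowercasing, removing punctuation and articles), that supporting facts are compared as sets of (title, sentence index) pairs, and that the joint score multiplies the two EM values and builds an F1 from the products of the precisions and recalls. These are assumptions about the evaluation, all function names are illustrative, and the provided evaluation script remains the reference.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_from(precision, recall):
    """Harmonic mean of precision and recall, 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def answer_scores(prediction, gold):
    """Return (em, precision, recall, f1) for one answer string."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    em = float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        # No shared tokens: all zeros, unless both sides are empty.
        return em, em, em, em
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return em, precision, recall, f1_from(precision, recall)


def support_scores(pred_facts, gold_facts):
    """Return (em, precision, recall, f1) over sets of (title, index)."""
    pred = {tuple(fact) for fact in pred_facts}
    gold = {tuple(fact) for fact in gold_facts}
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return float(pred == gold), precision, recall, f1_from(precision, recall)


def joint_scores(ans, sup):
    """Assumed joint metric: product of EMs, F1 of multiplied P and R."""
    ans_em, ans_p, ans_r, _ = ans
    sup_em, sup_p, sup_r, _ = sup
    return ans_em * sup_em, f1_from(ans_p * sup_p, ans_r * sup_r)


# Toy example with made-up strings and paragraph titles.
ans = answer_scores("The Eiffel Tower", "Eiffel Tower")
sup = support_scores([["A", 0], ["B", 1]], [["A", 0], ["B", 2]])
print(ans, sup, joint_scores(ans, sup))
```

Under these assumptions, a system is rewarded in the Joint columns only when it gets both the answer and its justification right, which is why entries that do not predict supporting facts show N/A for Sup and Joint.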
Rank | Date | Model | Institution (reference) | Ans EM | Ans F1 | Sup EM | Sup F1 | Joint EM | Joint F1
1 | Sep 6, 2020 | SpiderNet-large (single model) | Kingsoft AI Lab | 70.15 | 83.02 | 63.82 | 88.85 | 47.54 | 74.88
2 | Dec 1, 2019 | HGN-large (single model) | Anonymous | 69.22 | 82.19 | 62.76 | 88.47 | 47.11 | 74.21
3 | Jun 10, 2020 | BFR-Graph (single model) | Anonymous | 70.06 | 82.20 | 61.33 | 88.41 | 45.92 | 74.13
4 | May 11, 2020 | GSAN-large (single model) | Anonymous | 68.57 | 81.62 | 62.36 | 88.73 | 46.06 | 73.89
5 | May 28, 2020 | ETC-large (single model) | Anonymous | 68.12 | 81.18 | 63.25 | 89.09 | 46.40 | 73.62
6 | May 28, 2020 | Longformer (single model) | Anonymous | 68.00 | 81.25 | 63.09 | 88.34 | 45.91 | 73.16
7 | Oct 18, 2019 | C2F Reader (single model) | Joint Laboratory of HIT and iFLYTEK Research (Shao, Cui et al. 2020) | 67.98 | 81.24 | 60.81 | 87.63 | 44.67 | 72.73
8 | Jun 15, 2020 | AMGN (single model) | Anonymous | 68.03 | 81.17 | 61.70 | 87.43 | 44.86 | 72.40
9 | Nov 19, 2019 | SAE-large (single model) | JD AI Research (Tu, Huang et al., AAAI 2020) | 66.92 | 79.62 | 61.53 | 86.86 | 45.36 | 71.45
10 | Sep 27, 2019 | HGN (single model) | Microsoft Dynamics 365 AI Research (Fang et al., 2019) | 66.07 | 79.36 | 60.33 | 87.33 | 43.57 | 71.03
11 | Aug 19, 2020 | SpiderNet-Base (single model) | Anonymous | 66.38 | 79.53 | 60.35 | 86.90 | 43.83 | 70.90
12 | Jul 29, 2019 | TAP 2 (ensemble) | IBM Research AI & IISc | 66.64 | 79.82 | 57.21 | 86.69 | 41.21 | 70.65
13 | Oct 1, 2019 | EPS + BERT(wwm) (single model) | Anonymous | 65.79 | 79.05 | 58.50 | 86.26 | 42.47 | 70.48
14 | Jul 29, 2019 | TAP 2 (single model) | IBM Research AI & IISc | 64.99 | 78.59 | 55.47 | 85.57 | 39.77 | 69.12
15 | May 31, 2019 | EPS + BERT(large) (single model) | Anonymous | 63.29 | 76.36 | 58.25 | 85.60 | 41.39 | 67.92
16 | Jul 24, 2020 | GAR+BERT (single model) | York University | 60.66 | 74.67 | 57.05 | 87.02 | 37.85 | 66.65
17 | May 11, 2020 | GSAN-base (single model) | Anonymous | 61.25 | 74.74 | 57.74 | 86.28 | 39.56 | 66.62
18 | Aug 31, 2019 | SAE (single model) | JD AI Research (Tu, Huang et al., AAAI 2020) | 60.36 | 73.58 | 56.93 | 84.63 | 38.81 | 64.96
19 | Jun 13, 2019 | P-BERT (single model) | Anonymous | 61.18 | 74.16 | 51.38 | 82.76 | 35.42 | 63.79
20 | Sep 16, 2019 | LQR-net 2 + BERT-Base (single model) | Anonymous | 60.20 | 73.78 | 56.21 | 84.09 | 36.56 | 63.68
21 | Apr 11, 2019 | EPS + BERT (single model) | Anonymous | 60.13 | 73.31 | 52.55 | 83.20 | 35.40 | 63.41
22 | May 16, 2019 | PIPE (single model) | Anonymous | 59.77 | 72.77 | 52.53 | 82.82 | 35.54 | 62.92
23 | Dec 1, 2019 | SEval (single model) | Anonymous | 61.87 | 74.37 | 45.73 | 80.50 | 33.32 | 62.73
24 | Jul 30, 2020 | GAR (single model) | York University | 54.04 | 69.56 | 57.06 | 86.42 | 34.80 | 62.03
25 | Jun 8, 2019 | TAP (single model) |  | 58.63 | 71.48 | 46.84 | 82.98 | 32.03 | 61.90
26 | Aug 14, 2019 | SAQA (single model) | Anonymous | 55.07 | 70.22 | 57.62 | 84.19 | 35.94 | 61.72
27 | Sep 2, 2019 | MKGN (single model) | Anonymous | 57.09 | 70.69 | 54.26 | 83.54 | 35.59 | 61.69
28 | Apr 19, 2019 | GRN + BERT (single model) | Anonymous | 55.12 | 68.98 | 52.55 | 84.06 | 32.88 | 60.31
29 | Jun 19, 2019 | LQR-net + BERT-Base (single model) | Anonymous | 57.20 | 70.66 | 50.20 | 82.42 | 31.18 | 59.99
30 | Apr 22, 2019 | DFGN (single model) | Shanghai Jiao Tong University & ByteDance AI Lab (Xiao, Qu, Qiu et al. ACL19) | 56.31 | 69.69 | 51.50 | 81.62 | 33.62 | 59.82
31 | Nov 21, 2018 | QFE (single model) | NTT Media Intelligence Laboratories (Nishida et al., ACL'19) | 53.86 | 68.06 | 57.75 | 84.49 | 34.63 | 59.61
32 | Jun 3, 2020 | IRC (single model) | Anonymous | 58.56 | 72.53 | 36.67 | 79.35 | 23.30 | 59.22
33 | Apr 17, 2019 | LQR-net (ensemble) | Anonymous | 55.19 | 69.55 | 47.15 | 82.42 | 28.42 | 58.86
34 | Mar 4, 2019 | GRN (single model) | Anonymous | 52.92 | 66.71 | 52.37 | 84.11 | 31.77 | 58.47
35 | Mar 1, 2019 | DFGN + BERT (single model) | Anonymous | 55.17 | 68.49 | 49.85 | 81.06 | 31.87 | 58.23
36 | Mar 4, 2019 | BERT Plus (single model) | CIS Lab | 55.84 | 69.76 | 42.88 | 80.74 | 27.13 | 58.23
37 | May 18, 2019 | KGNN (single model) | Tsinghua University (Ye et al., 2019) | 50.81 | 65.75 | 38.74 | 76.79 | 22.40 | 52.82
38 | Oct 10, 2018 | Baseline Model (single model) | Carnegie Mellon University, Stanford University, & Universite de Montreal (Yang, Qi, Zhang, et al. 2018) | 45.60 | 59.02 | 20.32 | 64.49 | 10.83 | 40.16
- | Feb 3, 2020 | Unsupervised Decomposition (single model) | Facebook AI Research, New York University & University College London (Unsupervised Question Decomposition for Question Answering) | 66.33 | 79.34 | N/A | N/A | N/A | N/A
- | Sep 24, 2019 | ChainEx (single model) | UT Austin (Chen et al., 2019) | 61.20 | 74.11 | N/A | N/A | N/A | N/A
- | Aug 23, 2020 | GAR+BERT without supporting facts (single model) | York University | 56.78 | 70.93 | N/A | N/A | N/A | N/A
- | Feb 27, 2019 | DecompRC (single model) | University of Washington (Min et al., ACL'18) | 55.20 | 69.63 | N/A | N/A | N/A | N/A
- | Aug 4, 2020 | GAR without supporting facts (single model) | York University | 52.61 | 68.17 | N/A | N/A | N/A | N/A
- | Apr 2, 2019 | MatrixRC (single model) | Anonymous | 47.07 | 60.75 | N/A | N/A | N/A | N/A
Leaderboard (Fullwiki Setting)
In the fullwiki setting, a question-answering system must find the answer to a question within the entirety of Wikipedia. As in the distractor setting, systems are evaluated on the accuracy of their answers (Ans) and the quality of the supporting facts they use to justify them (Sup).
Rank | Date | Model | Institution (reference) | Ans EM | Ans F1 | Sup EM | Sup F1 | Joint EM | Joint F1
1 | Sep 7, 2020 | EBS-SH (single model) | Samsung SDS AI Research | 65.53 | 78.61 | 55.90 | 83.13 | 40.91 | 68.94
2 | Aug 3, 2020 | IRRR (single model) | Stanford University & Samsung Research | 65.71 | 78.19 | 55.93 | 82.05 | 42.14 | 68.59
3 | Sep 10, 2020 | Anonymous (single model) | Anonymous | 65.05 | 78.02 | 55.35 | 82.69 | 40.51 | 68.37
4 | Aug 6, 2020 | SDS-NET_Hotpotver1.1 (single model) | AI Advanced Research Lab | 64.94 | 78.18 | 54.49 | 82.48 | 39.44 | 68.10
5 | Aug 28, 2020 | SDS-NET_Hotpotver1.1 (ensemble) | AI Advanced Research Lab | 65.26 | 78.27 | 54.22 | 82.21 | 40.02 | 68.08
6 | Aug 26, 2020 | Recursive Dense Retriever (single model) | Anonymous | 62.28 | 75.29 | 57.46 | 80.86 | 41.78 | 66.55
7 | May 21, 2020 | Step-by-Step Retriever (single model) | Joint Laboratory of HIT and iFLYTEK Research | 62.95 | 75.43 | 54.61 | 80.00 | 40.36 | 66.22
8 | Jun 9, 2020 | HopRetriever-V1 (single model) | anonymous | 60.83 | 73.93 | 53.07 | 79.26 | 38.00 | 63.91
9 | May 21, 2020 | DDRQA (single model) | Georgia Institute of Technology & Peking University (Yuyu, Ping et al. 2020) | 62.53 | 75.91 | 51.01 | 78.86 | 36.04 | 63.88
10 | Jul 6, 2020 | SDS-NET Hotpotver1.0 (single model) | AI Advanced Research Lab | 64.29 | 77.23 | 51.12 | 78.57 | 36.29 | 63.75
11 | Mar 6, 2020 | DR model large (single model) | Anonymous | 62.01 | 75.32 | 49.88 | 77.77 | 35.44 | 62.95
12 | Feb 11, 2020 | HGN-albert + SemanticRetrievalMRS IR (single model) | Anonymous | 59.74 | 71.41 | 51.03 | 77.37 | 37.92 | 62.26
13 | Nov 6, 2019 | Robustly Fine-tuned Graph-based Recurrent Retriever (single model) | Salesforce Research & University of Washington (Asai et al., ICLR 2020) | 60.04 | 72.96 | 49.08 | 76.41 | 35.35 | 61.18
14 | Dec 1, 2019 | HGN-large + SemanticRetrievalMRS IR (single model) | Anonymous | 57.85 | 69.93 | 51.01 | 76.82 | 37.17 | 60.74
15 | Oct 7, 2019 | HGN + SemanticRetrievalMRS IR (single model) | Microsoft Dynamics 365 AI Research (Fang et al., 2019) | 56.71 | 69.16 | 49.97 | 76.39 | 35.63 | 59.86
16 | Jul 27, 2020 | SAFSR model (single model) | Anonymous | 58.89 | 71.60 | 48.03 | 75.69 | 34.46 | 59.84
17 | Feb 13, 2020 | DR model (single model) | Anonymous | 58.82 | 71.68 | 41.55 | 72.54 | 29.34 | 56.82
18 | Dec 8, 2019 | Quark + SemanticRetrievalMRS IR (single model) | Anonymous | 55.50 | 67.51 | 45.64 | 72.95 | 32.89 | 56.23
19 | Sep 20, 2019 | Graph-based Recurrent Retriever (single model) | Anonymous | 56.04 | 68.87 | 44.14 | 73.03 | 29.18 | 55.31
20 | Sep 28, 2019 | MIR+EPS+BERT (single model) | Anonymous | 52.86 | 64.79 | 42.75 | 72.00 | 31.19 | 54.75
21 | Feb 4, 2020 | Transformer-XH-final(BERT-base) (single model) | University of Maryland, Microsoft AI & Research (Zhao et al. ICLR 2020) | 51.60 | 64.07 | 40.91 | 71.42 | 26.14 | 51.29
22 | Sep 21, 2019 | Transformer-XH (single model) | Anonymous | 48.95 | 60.75 | 41.66 | 70.01 | 27.13 | 49.57
23 | May 15, 2019 | SemanticRetrievalMRS (single model) | UNC-NLP (Nie et al., EMNLP'2019) | 45.32 | 57.34 | 38.67 | 70.83 | 25.14 | 47.60
24 | Feb 21, 2020 | DrKIT (single model) | Carnegie Mellon University, Google Research (Dhingra et al, ICLR 2020) | 42.13 | 51.72 | 37.05 | 59.84 | 24.69 | 42.88
25 | Jul 31, 2019 | Entity-centric BERT Pipeline (single model) | Anonymous | 41.82 | 53.09 | 26.26 | 57.29 | 17.01 | 39.18
26 | May 21, 2019 | GoldEn Retriever (single model) | Stanford University (Qi et al., EMNLP-IJCNLP 2019) | 37.92 | 48.58 | 30.69 | 64.24 | 18.04 | 39.13
27 | Aug 14, 2019 | PR-Bert (single model) | KingSoft AI Lab | 43.33 | 53.79 | 21.90 | 59.63 | 14.50 | 39.11
28 | Dec 4, 2019 | SAFSr-Bert (single model) | Anonymous | 39.35 | 51.40 | 24.21 | 58.54 | 13.34 | 37.00
29 | Feb 21, 2019 | Cognitive Graph QA (single model) | Tsinghua KEG & Alibaba DAMO Academy (Ding et al., ACL'19) | 37.12 | 48.87 | 22.82 | 57.69 | 12.42 | 34.92
30 | Mar 5, 2019 | MUPPET (single model) | Technion (Feldman and El-Yaniv, ACL'19) | 30.61 | 40.26 | 16.65 | 47.33 | 10.85 | 27.01
31 | Apr 7, 2019 | GRN + BERT (single model) | Anonymous | 29.87 | 39.14 | 13.16 | 49.67 | 8.26 | 25.84
32 | May 20, 2019 | Entity-centric IR (single model) | Anonymous | 35.36 | 46.26 | 0.06 | 43.16 | 0.02 | 25.47
33 | May 19, 2019 | KGNN (single model) | Tsinghua University (Ye et al., 2019) | 27.65 | 37.19 | 12.65 | 47.19 | 7.03 | 24.66
34 | Aug 16, 2019 | SAQA (single model) | Anonymous | 28.44 | 38.62 | 14.69 | 47.17 | 8.62 | 24.49
35 | Mar 4, 2019 | GRN (single model) | Anonymous | 27.34 | 36.48 | 12.23 | 48.75 | 7.40 | 23.55
36 | Nov 25, 2018 | QFE (single model) | NTT Media Intelligence Laboratories (Nishida et al., ACL'19) | 28.66 | 38.06 | 14.20 | 44.35 | 8.69 | 23.10
37 | Nov 29, 2019 | SAFSr_model (single model) | Anonymous | 28.91 | 39.14 | 8.03 | 40.55 | 4.06 | 20.90
38 | Oct 12, 2018 | Baseline Model (single model) | Carnegie Mellon University, Stanford University, & Universite de Montreal (Yang, Qi, Zhang, et al. 2018) | 23.95 | 32.89 | 3.86 | 37.71 | 1.85 | 16.15
- | May 19, 2019 | TPReasoner w/o BERT (single model) | Anonymous | 36.04 | 47.43 | N/A | N/A | N/A | N/A
- | Feb 28, 2019 | DecompRC (single model) | University of Washington (Min et al., ACL'18) | 30.00 | 40.65 | N/A | N/A | N/A | N/A
- | Mar 3, 2019 | MultiQA (single model) | Anonymous | 30.73 | 40.23 | N/A | N/A | N/A | N/A