HotpotQA

A Dataset for Diverse, Explainable Multi-hop Question Answering

What is HotpotQA?

HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. It was collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

For more details about HotpotQA, please refer to our EMNLP 2018 paper, "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering."

If you work on open-domain multi-hop question answering, you might also be interested in BeerQA, a more recent dataset published by one of our authors (Peng Qi). It features open-domain questions that may require a varying number of reasoning hops to answer, and it includes HotpotQA as one of its parts.

Getting started

HotpotQA is distributed under a CC BY-SA 4.0 License. The training and development sets can be downloaded below.

A more comprehensive guide to data download, preprocessing, baseline model training, and evaluation is included in our GitHub repository, linked below.

Once you have built your model, you can use the evaluation script we provide below to evaluate model performance by running:

python hotpot_evaluate_v1.py <path_to_prediction> <path_to_gold>
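As a rough sketch of how the file passed as <path_to_prediction> might be assembled: the snippet below assumes the prediction file is a single JSON object with an "answer" map (question id to answer string) and an "sp" map (question id to a list of [article title, sentence index] pairs). That layout is an assumption here and should be checked against the GitHub repository before use. The helper name build_prediction, the question id "q1", and the article titles are made up for illustration.

```python
import json


def build_prediction(answers, supporting_facts):
    """Bundle per-question answers and supporting facts into one dict.

    Assumed layout (verify against the repository):
      {"answer": {qid: str}, "sp": {qid: [[title, sent_idx], ...]}}
    """
    # Every question id should have both an answer and supporting facts.
    mismatched = set(answers) ^ set(supporting_facts)
    if mismatched:
        raise ValueError(f"ids missing an answer or sp: {sorted(mismatched)}")
    return {
        "answer": dict(answers),
        "sp": {
            qid: [[title, int(idx)] for title, idx in facts]
            for qid, facts in supporting_facts.items()
        },
    }


# Hypothetical example with one question.
pred = build_prediction(
    {"q1": "yes"},
    {"q1": [("Some Article", 0), ("Another Article", 2)]},
)

# Write this string to the file you pass as <path_to_prediction>.
serialized = json.dumps(pred)
```

Checking that every question id carries both an answer and supporting facts before serializing makes it easier to catch incomplete predictions before they reach the evaluation script.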

To submit your models and evaluate them on the official test sets, please read our submission guide hosted on CodaLab.

We also release the processed Wikipedia used in creating HotpotQA (also under a CC BY-SA 4.0 License). It serves as the corpus for the fullwiki setting in our evaluation and, we hope, as a standalone resource for future research involving processed Wikipedia text. The link to the documentation for this corpus is below.

Stay connected!

Join our Google group to receive updates or initiate discussions about HotpotQA!

If you use HotpotQA in your research, please cite our paper with the following BibTeX entry:

@inproceedings{yang2018hotpotqa,
  title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
  year={2018}
}
Leaderboard (Distractor Setting)
In the distractor setting, a question-answering system reads 10 paragraphs to provide an answer (Ans) to a question. It must also justify this answer with supporting facts (Sup).
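The Ans, Sup, and Joint columns of the leaderboard report exact match (EM) and F1. The following is a minimal sketch of how such scores can be computed for a single question, not the official evaluation script. It assumes answers are compared as lowercased whitespace tokens (the official script may normalize text further), that supporting facts are (title, sentence index) pairs, and that the joint score multiplies the answer and supporting-fact precision and recall before taking F1. The function names are illustrative.

```python
from collections import Counter


def answer_scores(prediction, gold):
    """Return (EM, precision, recall, F1) for an answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    em = float(pred_tokens == gold_tokens)
    # Multiset intersection counts tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return em, 0.0, 0.0, 0.0
    p = overlap / len(pred_tokens)
    r = overlap / len(gold_tokens)
    return em, p, r, 2 * p * r / (p + r)


def support_scores(pred_facts, gold_facts):
    """Return (EM, precision, recall, F1) over sets of (title, index) facts."""
    pred = set(map(tuple, pred_facts))
    gold = set(map(tuple, gold_facts))
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return float(pred == gold), p, r, f1


def joint_scores(ans, sup):
    """Combine answer and supporting-fact scores into (joint EM, joint F1)."""
    a_em, a_p, a_r, _ = ans
    s_em, s_p, s_r, _ = sup
    p, r = a_p * s_p, a_r * s_r
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return a_em * s_em, f1


# Hypothetical single-question example.
ans = answer_scores("Barack Obama", "Obama")
sup = support_scores([("A", 0), ("B", 1)], [("A", 0), ("B", 1), ("B", 2)])
joint = joint_scores(ans, sup)
```

Under these assumptions, a system only earns joint EM on a question when both the answer and the full set of supporting facts match exactly, which is why the Joint EM column is lower than either Ans EM or Sup EM. Leaderboard numbers are percentages averaged over all questions.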
Columns: Model, Code, Ans (EM, F1), Sup (EM, F1), Joint (EM, F1)
1
Aug 7, 2023
Beam Retrieval (single model)
BUPT & Tencent
(Zhang, Zhang, Zhang, et al. 2023)
72.69 85.04 66.25 90.09 50.53 77.54
2
Jul 7, 2022
PipNet (single model)
Tencent Cloud Xiaowei
72.26 84.86 63.71 89.41 48.76 76.95
3
Jun 27, 2022
Smoothing R3 (single model)
Fudan University & Huawei Poisson Lab
Rethinking Label Smoothing on Multi-hop Question Answering
72.07 84.34 65.44 89.55 49.73 76.69
4
Jan 28, 2022
FE2H on ALBERT (single model)
Nanjing University
From Easy to Hard: Two-stage Selector and Reader for Multi-hop Question Answering
71.89 84.44 64.98 89.14 50.04 76.54
5
May 16, 2022
R3 (single model)
Fudan University & Huawei Poisson Lab
Rethinking Label Smoothing on Multi-hop Question Answering
71.27 83.57 65.25 88.98 49.81 76.02
6
May 28, 2021
SAE+ (single model)
JD AI Research
70.74 83.61 63.70 88.95 48.15 75.72
7
Jul 12, 2021
S2G+EGA (single model)
Shanghai Jiao Tong University
70.92 83.44 63.86 88.68 48.76 75.47
8
Feb 27, 2021
S2G+ (single model)
Shanghai Jiao Tong University
70.72 83.53 64.30 88.72 48.60 75.45
9
Jan 11, 2021
AMGN+ (single model)
Anonymous
70.53 83.37 63.57 88.83 47.77 75.24
10
Mar 23, 2022
RD Model (single model)

70.35 82.86 63.57 88.81 47.96 75.17
11
Feb 14, 2022
FE2H on ELECTRA (single model)
Anonymous
69.54 82.69 64.78 88.71 48.46 74.90
12
Sep 6, 2020
SpiderNet-large (single model)
Kingsoft AI Lab
70.15 83.02 63.82 88.85 47.54 74.88
13
Feb 25, 2023
GIT (single model)
KAIST
70.07 82.86 62.59 88.53 47.22 74.84
14
Feb 20, 2021
S2G+ (single model)
Anonymous
69.38 82.17 64.30 88.72 48.00 74.36
15
Dec 30, 2021
AnonymousS (single model)
Anonymous
69.66 82.42 62.99 87.85 47.84 74.27
16
Nov 23, 2020
Anonymous (single model)
Anonymous
70.24 82.36 62.26 88.46 46.81 74.27
17
Dec 1, 2019
HGN-large (single model)
Anonymous
69.22 82.19 62.76 88.47 47.11 74.21
18
Nov 15, 2020
AMGN (single model)
Anonymous
69.89 82.79 62.67 88.12 46.59 74.20
19
Dec 15, 2021
BoSe (single model)
Anonymous
69.66 82.43 62.52 87.73 47.52 74.18
20
Jun 10, 2020
BFR-Graph (single model)
Anonymous
70.06 82.20 61.33 88.41 45.92 74.13
21
Apr 9, 2021
KIFGraph (single model)
LAB
69.53 82.42 61.79 87.98 46.49 74.12
22
Dec 14, 2021
Anonymous (single model)
Anonymous
69.43 82.47 61.85 87.59 46.57 73.93
23
May 11, 2020
GSAN-large (single model)
Anonymous
68.57 81.62 62.36 88.73 46.06 73.89
24
Sep 14, 2021
GIT (single model)
KAIST
69.12 82.01 62.05 88.19 46.50 73.87
25
Oct 6, 2020
FFReader-large (single model)
Kyoto University
(Alkhaldi et al., 2021)
68.89 82.16 62.10 88.42 45.61 73.78
26
May 28, 2020
ETC-large (single model)
Anonymous
68.12 81.18 63.25 89.09 46.40 73.62
27
May 28, 2020
Longformer (single model)
Anonymous
68.00 81.25 63.09 88.34 45.91 73.16
28
May 24, 2021
RealFormer (single model)
Anonymous
67.41 80.59 63.38 89.00 46.14 73.13
29
Apr 15, 2022
EGF Reader-large (single model)
Anonymous
68.10 80.96 62.60 88.20 46.15 72.96
30
Oct 18, 2019
C2F Reader (single model)
Joint Laboratory of HIT and iFLYTEK Research
(Shao, Cui et al. 2020)
67.98 81.24 60.81 87.63 44.67 72.73
31
Feb 11, 2021
Text-CAN large (single model)
Usyd NLP
67.53 80.80 61.62 86.95 45.75 72.52
32
Jun 15, 2020
SEGraph (single model)
Anonymous
68.03 81.17 61.70 87.43 44.86 72.40
33
Jan 24, 2021
S2G-large (single model)
Anonymous
67.34 80.24 62.66 87.61 45.80 72.26
34
Jun 29, 2021
()

67.44 80.27 60.08 86.16 44.69 71.46

Jun 30, 2021
() (single model)
Anonymous
67.44 80.27 60.08 86.16 44.69 71.46
36
Nov 19, 2019
SAE-large (single model)
JD AI Research
Tu, Huang et al., AAAI 2020
66.92 79.62 61.53 86.86 45.36 71.45
37
Sep 27, 2019
HGN (single model)
Microsoft Dynamics 365 AI Research
Fang et al., 2019
66.07 79.36 60.33 87.33 43.57 71.03
38
Aug 19, 2020
SpiderNet-Base (single model)
Anonymous
66.38 79.53 60.35 86.90 43.83 70.90
39
Jul 29, 2019
TAP 2 (ensemble)
IBM Research AI & IISc
66.64 79.82 57.21 86.69 41.21 70.65
40
Oct 1, 2019
EPS + BERT(wwm) (single model)
Anonymous
65.79 79.05 58.50 86.26 42.47 70.48
41
Mar 2, 2021
S2G-base (single model)
Anonymous
63.72 77.02 61.33 87.19 43.74 69.51
42
Feb 24, 2021
BDR+JNM (single model)
Anonymous
65.13 77.96 56.85 85.03 41.91 69.12
43
Jul 29, 2019
TAP 2 (single model)
IBM Research AI & IISc
64.99 78.59 55.47 85.57 39.77 69.12
44
Dec 3, 2020
AnonymousK (single model)
Anonymous
63.63 77.15 57.00 86.17 40.04 68.75
45
May 5, 2021
GAR-BERT (single model)
York University
62.67 76.35 59.50 87.98 40.64 68.74
46
May 31, 2019
EPS + BERT(large) (single model)
Anonymous
63.29 76.36 58.25 85.60 41.39 67.92
47
Jul 30, 2020
()

60.66 74.67 57.05 87.02 37.85 66.65
48
May 11, 2020
GSAN-base (single model)
Anonymous
61.25 74.74 57.74 86.28 39.56 66.62
49
Feb 12, 2021
Text-CAN (single model)
Usyd NLP
60.17 73.99 58.33 85.75 39.31 65.95
50
Aug 31, 2019
SAE (single model)
JD AI Research
Tu, Huang et al., AAAI 2020
60.36 73.58 56.93 84.63 38.81 64.96
51
Mar 13, 2021
GAR (single model)
York University
56.61 71.40 58.36 87.27 36.79 64.01

Mar 15, 2021
()

56.61 71.40 58.36 87.27 36.79 64.01
53
Jun 13, 2019
P-BERT (single model)
Anonymous
61.18 74.16 51.38 82.76 35.42 63.79
54
Sep 16, 2019
LQR-net 2 + BERT-Base (single model)
Anonymous
60.20 73.78 56.21 84.09 36.56 63.68
55
Apr 11, 2019
EPS + BERT (single model)
Anonymous
60.13 73.31 52.55 83.20 35.40 63.41
56
May 16, 2019
PIPE (single model)
Anonymous
59.77 72.77 52.53 82.82 35.54 62.92
57
Dec 1, 2019
SEval (single model)
Anonymous
61.87 74.37 45.73 80.50 33.32 62.73
58
Jun 8, 2019
TAP (single model)

58.63 71.48 46.84 82.98 32.03 61.90
59
Aug 14, 2019
SAQA (single model)
Anonymous
55.07 70.22 57.62 84.19 35.94 61.72
60
Sep 2, 2019
MKGN (single model)
Anonymous
57.09 70.69 54.26 83.54 35.59 61.69
61
Apr 19, 2019
GRN + BERT (single model)
Anonymous
55.12 68.98 52.55 84.06 32.88 60.31
62
Jun 19, 2019
LQR-net + BERT-Base (single model)
Anonymous
57.20 70.66 50.20 82.42 31.18 59.99
63
Apr 22, 2019
DFGN (single model)
Shanghai Jiao Tong University & ByteDance AI Lab
(Xiao, Qu, Qiu et al. ACL19)
56.31 69.69 51.50 81.62 33.62 59.82
64
Nov 21, 2018
QFE (single model)
NTT Media Intelligence Laboratories
(Nishida et al., ACL'19)
53.86 68.06 57.75 84.49 34.63 59.61
65
Jun 3, 2020
IRC (single model)
NTT Media Intelligence Laboratories
(Nishida et al., 2021)
58.54 72.67 36.56 79.53 23.57 59.43
66
Apr 17, 2019
LQR-net (ensemble)
Anonymous
55.19 69.55 47.15 82.42 28.42 58.86
67
Mar 4, 2019
GRN (single model)
Anonymous
52.92 66.71 52.37 84.11 31.77 58.47
68
Mar 1, 2019
DFGN + BERT (single model)
Anonymous
55.17 68.49 49.85 81.06 31.87 58.23
69
Mar 4, 2019
BERT Plus (single model)
CIS Lab
55.84 69.76 42.88 80.74 27.13 58.23
70
May 18, 2019
KGNN (single model)
Tsinghua University
(Ye et al., 2019)
50.81 65.75 38.74 76.79 22.40 52.82
71
Jul 14, 2021
RoBERTa-L Two-step Model (single model)
Anonymous
67.61 80.36 1.10 64.01 0.76 52.50
72
Mar 13, 2021
GAR-NOSF (single model)
York University
56.20 71.17 9.37 54.76 6.25 41.42

Mar 15, 2021
()

56.20 71.17 9.37 54.76 6.25 41.42
74
Aug 24, 2020
()

56.78 70.93 8.35 53.77 5.23 40.89
75
Oct 10, 2018
Baseline Model (single model)
Carnegie Mellon University, Stanford University, & Université de Montréal
(Yang, Qi, Zhang, et al. 2018)
45.60 59.02 20.32 64.49 10.83 40.16
76
Aug 24, 2020
()

52.61 68.17 9.00 53.62 5.76 39.25
-
Feb 3, 2020
Unsupervised Decomposition (single model)
Facebook AI Research, New York University & University College London
Perez et al. EMNLP 2020
66.33 79.34 N/A N/A N/A N/A
-
Sep 24, 2019
ChainEx (single model)
UT Austin
(Chen et al., 2019)
61.20 74.11 N/A N/A N/A N/A
-
Feb 27, 2019
DecompRC (single model)
University of Washington
(Min et al., ACL'18)
55.20 69.63 N/A N/A N/A N/A
Leaderboard (Fullwiki Setting)
In the fullwiki setting, a question-answering system must find the answer to a question within the whole of Wikipedia. As in the distractor setting, systems are evaluated on the accuracy of their answers (Ans) and the quality of the supporting facts they use to justify them (Sup).
Columns: Model, Code, Ans (EM, F1), Sup (EM, F1), Joint (EM, F1)
1
May 10, 2021
AISO (single model)
Institute of Computing Technology, Chinese Academy of Sciences
(Zhu, Pang et al., EMNLP 2021)
67.46 80.52 61.17 86.02 44.87 72.00
2
Jan 31, 2023
Chain-of-Skills (single model)
Carnegie Mellon University, Microsoft Research and UIUC
Ma et al. ACL 2023
67.38 80.14 61.25 85.31 45.65 71.65
3
Feb 1, 2021
TPRR (single model)
Huawei Poisson Lab & Parallel Distributed Computing Lab
66.95 79.50 59.43 84.25 44.37 70.83
4
Jan 15, 2021
HopRetriever + Sp-search (single model)
Huawei Noah's Ark Lab & Huawei Cloud
(Li, Li, Shang, et al. 2020)
67.13 79.91 57.38 83.52 43.20 70.61
5
Dec 1, 2020
EBS-Large (single model)
Samsung SDS AI Research
66.18 79.32 57.29 83.98 41.95 70.04
6
Dec 18, 2020
HopRetriever (single model)
Huawei Noah's Ark Lab
67.13 79.91 57.23 82.59 43.10 69.84
7
Nov 30, 2020
IRRR+ (single model)
Stanford University & Samsung Research
(Qi, Lee, Sido, and Manning. 2020)
66.33 79.10 56.92 83.24 42.75 69.60
8
Dec 31, 2020
Anonymous (single model)
Anonymous
65.68 78.49 58.24 83.31 43.44 69.54
9
Sep 7, 2020
EBS-SH (single model)
Samsung SDS AI Research
65.53 78.61 55.90 83.13 40.91 68.94
10
Aug 3, 2020
IRRR (single model)
Stanford University & Samsung Research
(Qi, Lee, Sido, and Manning. 2020)
65.71 78.19 55.93 82.05 42.14 68.59
11
Oct 27, 2020
Anonymous (single model)
Anonymous
65.21 78.02 56.61 82.44 42.26 68.54
12
Sep 10, 2020
Anonymous (single model)
Anonymous
65.05 78.02 55.35 82.69 40.51 68.37
13
Aug 6, 2020
Anonymous (single model)
Anonymous
64.94 78.18 54.49 82.48 39.44 68.10
14
Aug 28, 2020
Anonymous (ensemble)
Anonymous
65.26 78.27 54.22 82.21 40.02 68.08
15
Oct 29, 2020
HopRetriever-V2 (single model)
anonymous
64.83 77.81 56.08 81.79 40.95 67.75
16
May 13, 2021
Anonymous (single model)
Anonymous
62.90 75.82 57.71 81.26 42.18 67.08
17
Dec 4, 2021
AFSGraph-retriever (single model)
Anonymous
64.55 77.79 55.65 81.23 41.05 66.98
18
May 19, 2021
Anonymous (single model)
Anonymous
62.67 75.51 57.54 80.93 42.03 66.87
19
Aug 26, 2020
Recursive Dense Retriever (single model)
Facebook AI & UCSB & UMass
Xiong, Li et al., ICLR 2021
62.28 75.29 57.46 80.86 41.78 66.55
20
May 21, 2020
Step-by-Step Retriever (single model)
Joint Laboratory of HIT and iFLYTEK Research
62.95 75.43 54.61 80.00 40.36 66.22
21
Nov 28, 2020
Anonymous (single model)
Anonymous
61.79 74.71 53.51 80.05 38.43 64.45
22
Jun 9, 2020
HopRetriever-V1 (single model)
anonymous
60.83 73.93 53.07 79.26 38.00 63.91
23
May 21, 2020
DDRQA (single model)
Georgia Institute of Technology & Peking University
(Yuyu, Ping et al. 2020)
62.53 75.91 51.01 78.86 36.04 63.88
24
Jul 6, 2020
Anonymous (single model)
Anonymous
64.29 77.23 51.12 78.57 36.29 63.75
25
Mar 6, 2020
DR model large (single model)
Anonymous
62.01 75.32 49.88 77.77 35.44 62.95
26
Nov 24, 2021
()

61.71 74.57 50.04 77.16 36.77 62.92

Nov 24, 2021
HopAns (single model)
ptf
61.71 74.57 50.04 77.16 36.77 62.92
28
Nov 21, 2020
Anonymous (single model)
Anonymous
60.44 73.22 52.01 77.05 37.98 62.86
29
Nov 15, 2021
Multi-dimensional-AFSGraph (single model)
Anonymous
61.53 74.61 50.33 77.24 36.21 62.44
30
Feb 11, 2020
HGN-albert + SemanticRetrievalMRS IR (single model)
Anonymous
59.74 71.41 51.03 77.37 37.92 62.26
31
Aug 19, 2021
Tree-shaped-cluster (single model)
Anonymous
60.31 73.14 49.87 76.83 35.85 61.73
32
Feb 6, 2021
AFSgraph (single model)
Anonymous
60.08 72.97 49.96 76.85 35.89 61.66
33
Nov 6, 2019
Robustly Fine-tuned Graph-based Recurrent Retriever (single model)
Salesforce Research & University of Washington
(Asai et al., ICLR 2020)
60.04 72.96 49.08 76.41 35.35 61.18
34
Oct 4, 2020
AFSgraph model (single model)
Anonymous
60.06 72.97 48.49 75.94 35.03 60.90
35
Dec 1, 2019
HGN-large + SemanticRetrievalMRS IR (single model)
Anonymous
57.85 69.93 51.01 76.82 37.17 60.74
36
Jan 24, 2021
DPR-recurrent (single model)
Anonymous
59.79 72.65 47.95 74.89 34.54 60.23
37
Jan 19, 2021
RoBERTa-DenseRetriever (single model)
Anonymous
59.60 72.43 47.87 74.79 34.53 60.05
38
Oct 7, 2019
HGN + SemanticRetrievalMRS IR (single model)
Microsoft Dynamics 365 AI Research
Fang et al., 2019
56.71 69.16 49.97 76.39 35.63 59.86
39
Jul 27, 2020
()

58.89 71.60 48.03 75.69 34.46 59.84
40
Jan 21, 2021
GraphRR-Fast (single model)
Anonymous
58.21 70.86 42.91 71.30 30.95 56.85
41
Feb 13, 2020
DR model (single model)
Anonymous
58.82 71.68 41.55 72.54 29.34 56.82
42
Dec 8, 2019
Quark + SemanticRetrievalMRS IR (single model)
Allen Institute for AI and Indian Institute of Technology
A Simple Yet Strong Pipeline for HotpotQA
55.50 67.51 45.64 72.95 32.89 56.23
43
May 6, 2021
GAR-BERT (single model)
York University
52.28 64.84 49.00 74.73 33.00 56.10
44
Sep 20, 2019
Graph-based Recurrent Retriever (single model)
Anonymous
56.04 68.87 44.14 73.03 29.18 55.31
45
Sep 28, 2019
MIR+EPS+BERT (single model)
Anonymous
52.86 64.79 42.75 72.00 31.19 54.75
46
Mar 14, 2021
GAR (single model)
York University
48.22 61.33 48.34 73.89 30.61 52.95
47
Feb 4, 2020
Transformer-XH-final(BERT-base) (single model)
University of Maryland, Microsoft AI & Research
(Zhao et al. ICLR 2020)
51.60 64.07 40.91 71.42 26.14 51.29
48
Sep 21, 2019
Transformer-XH (single model)
Anonymous
48.95 60.75 41.66 70.01 27.13 49.57
49
May 15, 2019
SemanticRetrievalMRS (single model)
UNC-NLP
(Nie et al., EMNLP'2019)
45.32 57.34 38.67 70.83 25.14 47.60
50
Nov 28, 2020
()

43.22 54.35 38.62 63.61 25.37 44.88
51
Feb 21, 2020
DrKIT (single model)
Carnegie Mellon University, Google Research
(Dhingra et al, ICLR 2020)
42.13 51.72 37.05 59.84 24.69 42.88
52
Nov 28, 2020
()

38.94 50.72 38.29 62.19 23.33 41.77
53
Jul 31, 2019
Entity-centric BERT Pipeline (single model)
Anonymous
41.82 53.09 26.26 57.29 17.01 39.18
54
May 21, 2019
GoldEn Retriever (single model)
Stanford University
(Qi et al., EMNLP-IJCNLP 2019)
37.92 48.58 30.69 64.24 18.04 39.13
55
Aug 14, 2019
PR-Bert (single model)
KingSoft AI Lab
43.33 53.79 21.90 59.63 14.50 39.11
56
Dec 4, 2019
SAFSr-Bert (single model)
Anonymous
39.35 51.40 24.21 58.54 13.34 37.00
57
Feb 21, 2019
Cognitive Graph QA (single model)
Tsinghua KEG & Alibaba DAMO Academy
(Ding et al., ACL'19)
37.12 48.87 22.82 57.69 12.42 34.92
58
Mar 14, 2021
GAR-NOSF (single model)
York University
47.50 60.62 7.62 44.79 4.88 33.36
59
Apr 12, 2021
IKFGraph (single model)
anonymous
35.82 45.33 15.97 51.20 11.46 30.38
60
Jul 8, 2022
AnonymousQ (single model)
Anonymous
36.85 45.95 15.25 46.76 11.54 29.07
61
May 15, 2023
HGN Model-reproduce (single model)
Peking University
33.51 42.69 15.59 49.32 10.95 28.40
62
Mar 5, 2019
MUPPET (single model)
Technion
(Feldman and El-Yaniv, ACL'19)
30.61 40.26 16.65 47.33 10.85 27.01
63
Apr 7, 2019
GRN + BERT (single model)
Anonymous
29.87 39.14 13.16 49.67 8.26 25.84
64
May 20, 2019
Entity-centric IR (single model)
Anonymous
35.36 46.26 0.06 43.16 0.02 25.47
65
May 19, 2019
KGNN (single model)
Tsinghua University
(Ye et al., 2019)
27.65 37.19 12.65 47.19 7.03 24.66
66
Aug 16, 2019
SAQA (single model)
Anonymous
28.44 38.62 14.69 47.17 8.62 24.49
67
Mar 4, 2019
GRN (single model)
Anonymous
27.34 36.48 12.23 48.75 7.40 23.55
68
Nov 25, 2018
QFE (single model)
NTT Media Intelligence Laboratories
(Nishida et al., ACL'19)
28.66 38.06 14.20 44.35 8.69 23.10
69
Nov 29, 2019
SAFSr_model (single model)
Anonymous
28.91 39.14 8.03 40.55 4.06 20.90
70
Oct 12, 2018
Baseline Model (single model)
Carnegie Mellon University, Stanford University, & Université de Montréal
(Yang, Qi, Zhang, et al. 2018)
23.95 32.89 3.86 37.71 1.85 16.15
71
Jan 30, 2021
graph-recurrent-retriever+roberta-base w. S/R-pretraining (single model)
Anonymous
58.13 70.96 0.00 0.00 0.00 0.00
72
Mar 1, 2019
()

30.00 40.65 0.00 0.00 0.00 0.00
-
Dec 13, 2022
()

58.05 71.08 N/A N/A N/A N/A
-
May 19, 2019
TPReasoner w/o BERT (single model)
Anonymous
36.04 47.43 N/A N/A N/A N/A
-
Mar 3, 2019
MultiQA (single model)
Anonymous
30.73 40.23 N/A N/A N/A N/A