What is HotpotQA?

HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. It was collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

For more details about HotpotQA, please refer to our EMNLP 2018 paper, "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering".

If you work on open-domain multi-hop question answering, you might also be interested in BeerQA, a more recent dataset published by one of our authors (Peng Qi). It features open-domain questions that may require a varying number of reasoning hops to answer, and it includes HotpotQA as one of its parts.

Getting started

HotpotQA is distributed under a CC BY-SA 4.0 License. The training and development sets can be downloaded below.

A more comprehensive summary about data download, preprocessing, baseline model training, and evaluation is included in our GitHub repository, and linked below.

Once you have built your model, you can use the evaluation script we provide below to evaluate model performance by running:

python hotpot_evaluate_v1.py <path_to_prediction> <path_to_gold>
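As a rough illustration of what the prediction file passed to the evaluation script might look like, here is a minimal Python sketch. It assumes the layout is a single JSON object with an "answer" map (question ID to answer string) and an "sp" map (question ID to a list of [article title, sentence index] pairs); please confirm this against the GitHub repository before relying on it. The helper names, the question ID, and the article titles are hypothetical and exist only for illustration.

```python
import json


def build_prediction(answers, supporting_facts):
    """Assemble a prediction object (assumed layout, see the note above).

    answers:          {question_id: answer string}
    supporting_facts: {question_id: [(article_title, sentence_index), ...]}
    """
    return {
        "answer": {qid: str(ans) for qid, ans in answers.items()},
        "sp": {
            qid: [[title, int(sent_idx)] for title, sent_idx in facts]
            for qid, facts in supporting_facts.items()
        },
    }


def write_prediction(path, prediction):
    """Serialize the prediction object to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(prediction, f)


# Hypothetical question ID and article titles, for illustration only.
example = build_prediction(
    answers={"hypothetical-id-1": "yes"},
    supporting_facts={
        "hypothetical-id-1": [("Example Article", 0), ("Another Article", 2)]
    },
)
print(json.dumps(example, indent=2))
```

Calling write_prediction("pred.json", example) would then produce a file whose path can be given as <path_to_prediction>.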

To submit your models and evaluate them on the official test sets, please read our submission guide hosted on Codalab.

We also release the processed Wikipedia used to create HotpotQA (also under a CC BY-SA 4.0 License). It serves as the corpus for the fullwiki setting in our evaluation and, we hope, as a standalone resource for future research involving processed Wikipedia text. The documentation for this corpus is linked below.

Stay connected!

Join our Google group to receive updates or initiate discussions about HotpotQA!

If you use HotpotQA in your research, please cite our paper with the following BibTeX entry:

@inproceedings{yang2018hotpotqa,
  title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
  year={2018}
}

Leaderboard (Distractor Setting)

In the distractor setting, a question-answering system reads 10 paragraphs to provide an answer (Ans) to a question. It must also justify its answer with supporting facts (Sup).
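To make the Ans, Sup, and Joint columns concrete, the following Python sketch shows one plausible way to score a single question. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles), treats supporting facts as sets of (title, sentence index) pairs, and forms the joint precision and recall as products of the answer and supporting-fact values. It is a simplification (for example, it has no special handling of yes/no answers), not the official evaluation script, and all function names are hypothetical. Leaderboard numbers would correspond to averages of such per-question scores over the test set.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_precision_recall(pred, gold):
    """Token-overlap precision and recall between two answer strings."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0
    return overlap / len(pred_tokens), overlap / len(gold_tokens)


def sup_precision_recall(pred_facts, gold_facts):
    """Set-overlap precision and recall over (title, sentence index) pairs."""
    pred, gold = set(map(tuple, pred_facts)), set(map(tuple, gold_facts))
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def score(pred_answer, gold_answer, pred_facts, gold_facts):
    """Per-question Ans, Sup, and Joint EM/F1 under the stated assumptions."""
    ans_em = float(normalize(pred_answer) == normalize(gold_answer))
    ans_p, ans_r = answer_precision_recall(pred_answer, gold_answer)
    sup_em = float(set(map(tuple, pred_facts)) == set(map(tuple, gold_facts)))
    sup_p, sup_r = sup_precision_recall(pred_facts, gold_facts)
    return {
        "ans_em": ans_em,
        "ans_f1": f1(ans_p, ans_r),
        "sup_em": sup_em,
        "sup_f1": f1(sup_p, sup_r),
        "joint_em": ans_em * sup_em,
        "joint_f1": f1(ans_p * sup_p, ans_r * sup_r),
    }


# Hypothetical prediction: partially correct answer, one of two facts correct.
result = score("Barack Obama", "Obama", [["A", 0], ["B", 1]], [["A", 0], ["B", 2]])
print(result)
```

Under this scheme a system only earns joint credit on a question when both its answer and its supporting facts overlap with the gold annotations, which is why the Joint columns are always the lowest in the table.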

Rank / Date	Model	Code	Ans EM	Ans F₁	Sup EM	Sup F₁	Joint EM	Joint F₁
1 Aug 7, 2023	Beam Retrieval (single model) BUPT & Tencent (Zhang, Zhang, Zhang, et al. 2023)		72.69	85.04	66.25	90.09	50.53	77.54
2 Jul 7, 2022	PipNet (single model) Tencent Cloud Xiaowei		72.26	84.86	63.71	89.41	48.76	76.95
3 Jun 27, 2022	Smoothing R3 (single model) Fudan University & Huawei Poisson Lab Rethinking Label Smoothing on Multi-hop Question Answering		72.07	84.34	65.44	89.55	49.73	76.69
4 Jan 28, 2022	FE2H on ALBERT (single model) Nanjing University From Easy to Hard: Two-stage Selector and Reader for Multi-hop Question Answering		71.89	84.44	64.98	89.14	50.04	76.54
5 May 16, 2022	R3 (single model) Fudan University & Huawei Poisson Lab Rethinking Label Smoothing on Multi-hop Question Answering		71.27	83.57	65.25	88.98	49.81	76.02
6 May 28, 2021	SAE+ (single model) JD AI Research		70.74	83.61	63.70	88.95	48.15	75.72
7 Jul 12, 2021	S2G+EGA (single model) Shanghai Jiao Tong University		70.92	83.44	63.86	88.68	48.76	75.47
8 Feb 27, 2021	S2G+ (single model) Shanghai Jiao Tong University		70.72	83.53	64.30	88.72	48.60	75.45
9 Jan 11, 2021	AMGN+ (single model) Anonymous		70.53	83.37	63.57	88.83	47.77	75.24
10 Mar 23, 2022	RD Model (single model)		70.35	82.86	63.57	88.81	47.96	75.17
11 Feb 14, 2022	FE2H on ELECTRA (single model) Anonymous		69.54	82.69	64.78	88.71	48.46	74.90
12 Sep 6, 2020	SpiderNet-large (single model) Kingsoft AI Lab		70.15	83.02	63.82	88.85	47.54	74.88
13 Feb 25, 2023	GIT (single model) KAIST		70.07	82.86	62.59	88.53	47.22	74.84
14 Feb 20, 2021	S2G+ (single model) Anonymous		69.38	82.17	64.30	88.72	48.00	74.36
15 Dec 30, 2021	AnonymousS (single model) Anonymous		69.66	82.42	62.99	87.85	47.84	74.27
16 Nov 23, 2020	Anonymous (single model) Anonymous		70.24	82.36	62.26	88.46	46.81	74.27
17 Dec 1, 2019	HGN-large (single model) Anonymous		69.22	82.19	62.76	88.47	47.11	74.21
18 Nov 15, 2020	AMGN (single model) Anonymous		69.89	82.79	62.67	88.12	46.59	74.20
19 Dec 15, 2021	BoSe (single model) Anonymous		69.66	82.43	62.52	87.73	47.52	74.18
20 Jun 10, 2020	BFR-Graph (single model) Anonymous		70.06	82.20	61.33	88.41	45.92	74.13
21 Apr 9, 2021	KIFGraph (single model) LAB		69.53	82.42	61.79	87.98	46.49	74.12
22 Dec 14, 2021	Anonymous (single model) Anonymous		69.43	82.47	61.85	87.59	46.57	73.93
23 May 11, 2020	GSAN-large (single model) Anonymous		68.57	81.62	62.36	88.73	46.06	73.89
24 Sep 14, 2021	GIT (single model) KAIST		69.12	82.01	62.05	88.19	46.50	73.87
25 Oct 6, 2020	FFReader-large (single model) Kyoto University (Alkhaldi et al., 2021)		68.89	82.16	62.10	88.42	45.61	73.78
26 May 28, 2020	ETC-large (single model) Anonymous		68.12	81.18	63.25	89.09	46.40	73.62
27 May 28, 2020	Longformer (single model) Anonymous		68.00	81.25	63.09	88.34	45.91	73.16
28 May 24, 2021	RealFormer (single model) Anonymous		67.41	80.59	63.38	89.00	46.14	73.13
29 Apr 15, 2022	EGF Reader-large (single model) Anonymous		68.10	80.96	62.60	88.20	46.15	72.96
30 Oct 18, 2019	C2F Reader (single model) Joint Laboratory of HIT and iFLYTEK Research (Shao, Cui et al. 2020)		67.98	81.24	60.81	87.63	44.67	72.73
31 Feb 11, 2021	Text-CAN large (single model) Usyd NLP		67.53	80.80	61.62	86.95	45.75	72.52
32 Jun 15, 2020	SEGraph (single model) Anonymous		68.03	81.17	61.70	87.43	44.86	72.40
33 Jan 24, 2021	S2G-large (single model) Anonymous		67.34	80.24	62.66	87.61	45.80	72.26
34 Jun 29, 2021	()		67.44	80.27	60.08	86.16	44.69	71.46
Jun 30, 2021	() (single model) Anonymous		67.44	80.27	60.08	86.16	44.69	71.46
36 Nov 19, 2019	SAE-large (single model) JD AI Research Tu, Huang et al., AAAI 2020		66.92	79.62	61.53	86.86	45.36	71.45
37 Sep 27, 2019	HGN (single model) Microsoft Dynamics 365 AI Research Fang et al., 2019		66.07	79.36	60.33	87.33	43.57	71.03
38 Aug 19, 2020	SpiderNet-Base (single model) Anonymous		66.38	79.53	60.35	86.90	43.83	70.90
39 Jul 29, 2019	TAP 2 (ensemble) IBM Research AI & IISc		66.64	79.82	57.21	86.69	41.21	70.65
40 Oct 1, 2019	EPS + BERT(wwm) (single model) Anonymous		65.79	79.05	58.50	86.26	42.47	70.48
41 Mar 2, 2021	S2G-base (single model) Anonymous		63.72	77.02	61.33	87.19	43.74	69.51
42 Feb 24, 2021	BDR+JNM (single model) Anonymous		65.13	77.96	56.85	85.03	41.91	69.12
43 Jul 29, 2019	TAP 2 (single model) IBM Research AI & IISc		64.99	78.59	55.47	85.57	39.77	69.12
44 Dec 3, 2020	AnonymousK (single model) Anonymous		63.63	77.15	57.00	86.17	40.04	68.75
45 May 5, 2021	GAR-BERT (single model) York University		62.67	76.35	59.50	87.98	40.64	68.74
46 May 31, 2019	EPS + BERT(large) (single model) Anonymous		63.29	76.36	58.25	85.60	41.39	67.92
47 Jul 30, 2020	()		60.66	74.67	57.05	87.02	37.85	66.65
48 May 11, 2020	GSAN-base (single model) Anonymous		61.25	74.74	57.74	86.28	39.56	66.62
49 Feb 12, 2021	Text-CAN (single model) Usyd NLP		60.17	73.99	58.33	85.75	39.31	65.95
50 Aug 31, 2019	SAE (single model) JD AI Research Tu, Huang et al., AAAI 2020		60.36	73.58	56.93	84.63	38.81	64.96
51 Mar 13, 2021	GAR (single model) York University		56.61	71.40	58.36	87.27	36.79	64.01
Mar 15, 2021	()		56.61	71.40	58.36	87.27	36.79	64.01
53 Jun 13, 2019	P-BERT (single model) Anonymous		61.18	74.16	51.38	82.76	35.42	63.79
54 Sep 16, 2019	LQR-net 2 + BERT-Base (single model) Anonymous		60.20	73.78	56.21	84.09	36.56	63.68
55 Apr 11, 2019	EPS + BERT (single model) Anonymous		60.13	73.31	52.55	83.20	35.40	63.41
56 May 16, 2019	PIPE (single model) Anonymous		59.77	72.77	52.53	82.82	35.54	62.92
57 Dec 1, 2019	SEval (single model) Anonymous		61.87	74.37	45.73	80.50	33.32	62.73
58 Jun 8, 2019	TAP (single model)		58.63	71.48	46.84	82.98	32.03	61.90
59 Aug 14, 2019	SAQA (single model) Anonymous		55.07	70.22	57.62	84.19	35.94	61.72
60 Sep 2, 2019	MKGN (single model) Anonymous		57.09	70.69	54.26	83.54	35.59	61.69
61 Apr 19, 2019	GRN + BERT (single model) Anonymous		55.12	68.98	52.55	84.06	32.88	60.31
62 Jun 19, 2019	LQR-net + BERT-Base (single model) Anonymous		57.20	70.66	50.20	82.42	31.18	59.99
63 Apr 22, 2019	DFGN (single model) Shanghai Jiao Tong University & ByteDance AI Lab (Xiao, Qu, Qiu et al. ACL19)		56.31	69.69	51.50	81.62	33.62	59.82
64 Nov 21, 2018	QFE (single model) NTT Media Intelligence Laboratories (Nishida et al., ACL'19)		53.86	68.06	57.75	84.49	34.63	59.61
65 Jun 3, 2020	IRC (single model) NTT Media Intelligence Laboratories (Nishida et al., 2021)		58.54	72.67	36.56	79.53	23.57	59.43
66 Apr 17, 2019	LQR-net (ensemble) Anonymous		55.19	69.55	47.15	82.42	28.42	58.86
67 Mar 4, 2019	GRN (single model) Anonymous		52.92	66.71	52.37	84.11	31.77	58.47
68 Mar 1, 2019	DFGN + BERT (single model) Anonymous		55.17	68.49	49.85	81.06	31.87	58.23
69 Mar 4, 2019	BERT Plus (single model) CIS Lab		55.84	69.76	42.88	80.74	27.13	58.23
70 May 18, 2019	KGNN (single model) Tsinghua University (Ye et al., 2019)		50.81	65.75	38.74	76.79	22.40	52.82
71 Jul 14, 2021	RoBERTa-L Two-step Model (single model) Anonymous		67.61	80.36	1.10	64.01	0.76	52.50
72 Mar 13, 2021	GAR-NOSF (single model) York University		56.20	71.17	9.37	54.76	6.25	41.42
Mar 15, 2021	()		56.20	71.17	9.37	54.76	6.25	41.42
74 Aug 24, 2020	()		56.78	70.93	8.35	53.77	5.23	40.89
75 Oct 10, 2018	Baseline Model (single model) Carnegie Mellon University, Stanford University, & Université de Montréal (Yang, Qi, Zhang, et al. 2018)		45.60	59.02	20.32	64.49	10.83	40.16
76 Aug 24, 2020	()		52.61	68.17	9.00	53.62	5.76	39.25
- Feb 3, 2020	Unsupervised Decomposition (single model) Facebook AI Research, New York University & University College London Perez et al. EMNLP 2020		66.33	79.34	N/A	N/A	N/A	N/A
- Sep 24, 2019	ChainEx (single model) UT Austin (Chen et al., 2019)		61.20	74.11	N/A	N/A	N/A	N/A
- Feb 27, 2019	DecompRC (single model) University of Washington (Min et al., ACL'18)		55.20	69.63	N/A	N/A	N/A	N/A

Leaderboard (Fullwiki Setting)

In the fullwiki setting, a question-answering system must find the answer to a question within the entirety of Wikipedia. As in the distractor setting, systems are evaluated on the accuracy of their answers (Ans) and the quality of the supporting facts they use to justify them (Sup).

Rank / Date	Model	Code	Ans EM	Ans F₁	Sup EM	Sup F₁	Joint EM	Joint F₁
1 May 10, 2021	AISO (single model) Institute of Computing Technology, Chinese Academy of Sciences (Zhu, Pang et al., EMNLP 2021)		67.46	80.52	61.17	86.02	44.87	72.00
2 Jan 31, 2023	Chain-of-Skills (single model) Carnegie Mellon University, Microsoft Research and UIUC Ma et al. ACL 2023		67.38	80.14	61.25	85.31	45.65	71.65
3 Feb 1, 2021	TPRR (single model) Huawei Poisson Lab & Parallel Distributed Computing Lab		66.95	79.50	59.43	84.25	44.37	70.83
4 Jan 15, 2021	HopRetriever + Sp-search (single model) Huawei Noah's Ark Lab & Huawei Cloud (Li, Li, Shang, et al. 2020)		67.13	79.91	57.38	83.52	43.20	70.61
5 Dec 1, 2020	EBS-Large (single model) Samsung SDS AI Research		66.18	79.32	57.29	83.98	41.95	70.04
6 Dec 18, 2020	HopRetriever (single model) Huawei Noah's Ark Lab		67.13	79.91	57.23	82.59	43.10	69.84
7 Nov 30, 2020	IRRR+ (single model) Stanford University & Samsung Research (Qi, Lee, Sido, and Manning. 2020)		66.33	79.10	56.92	83.24	42.75	69.60
8 Dec 31, 2020	Anonymous (single model) Anonymous		65.68	78.49	58.24	83.31	43.44	69.54
9 Sep 7, 2020	EBS-SH (single model) Samsung SDS AI Research		65.53	78.61	55.90	83.13	40.91	68.94
10 Aug 3, 2020	IRRR (single model) Stanford University & Samsung Research (Qi, Lee, Sido, and Manning. 2020)		65.71	78.19	55.93	82.05	42.14	68.59
11 Oct 27, 2020	Anonymous (single model) Anonymous		65.21	78.02	56.61	82.44	42.26	68.54
12 Sep 10, 2020	Anonymous (single model) Anonymous		65.05	78.02	55.35	82.69	40.51	68.37
13 Aug 6, 2020	Anonymous (single model) Anonymous		64.94	78.18	54.49	82.48	39.44	68.10
14 Aug 28, 2020	Anonymous (ensemble) Anonymous		65.26	78.27	54.22	82.21	40.02	68.08
15 Oct 29, 2020	HopRetriever-V2 (single model) anonymous		64.83	77.81	56.08	81.79	40.95	67.75
16 May 13, 2021	Anonymous (single model) Anonymous		62.90	75.82	57.71	81.26	42.18	67.08
17 Dec 4, 2021	AFSGraph-retriever (single model) Anonymous		64.55	77.79	55.65	81.23	41.05	66.98
18 May 19, 2021	Anonymous (single model) Anonymous		62.67	75.51	57.54	80.93	42.03	66.87
19 Aug 26, 2020	Recursive Dense Retriever (single model) Facebook AI & UCSB & UMass Xiong, Li et al., ICLR 2021		62.28	75.29	57.46	80.86	41.78	66.55
20 May 21, 2020	Step-by-Step Retriever (single model) Joint Laboratory of HIT and iFLYTEK Research		62.95	75.43	54.61	80.00	40.36	66.22
21 Nov 28, 2020	Anonymous (single model) Anonymous		61.79	74.71	53.51	80.05	38.43	64.45
22 Jun 9, 2020	HopRetriever-V1 (single model) anonymous		60.83	73.93	53.07	79.26	38.00	63.91
23 May 21, 2020	DDRQA (single model) Georgia Institute of Technology & Peking University (Yuyu, Ping et al. 2020)		62.53	75.91	51.01	78.86	36.04	63.88
24 Jul 6, 2020	Anonymous (single model) Anonymous		64.29	77.23	51.12	78.57	36.29	63.75
25 Mar 6, 2020	DR model large (single model) Anonymous		62.01	75.32	49.88	77.77	35.44	62.95
26 Nov 24, 2021	()		61.71	74.57	50.04	77.16	36.77	62.92
Nov 24, 2021	HopAns (single model) ptf		61.71	74.57	50.04	77.16	36.77	62.92
28 Nov 21, 2020	Anonymous (single model) Anonymous		60.44	73.22	52.01	77.05	37.98	62.86
29 Nov 15, 2021	Multi-dimensional-AFSGraph (single model) Anonymous		61.53	74.61	50.33	77.24	36.21	62.44
30 Feb 11, 2020	HGN-albert + SemanticRetrievalMRS IR (single model) Anonymous		59.74	71.41	51.03	77.37	37.92	62.26
31 Aug 19, 2021	Tree-shaped-cluster (single model) Anonymous		60.31	73.14	49.87	76.83	35.85	61.73
32 Feb 6, 2021	AFSgraph (single model) Anonymous		60.08	72.97	49.96	76.85	35.89	61.66
33 Nov 6, 2019	Robustly Fine-tuned Graph-based Recurrent Retriever (single model) Salesforce Research & University of Washington (Asai et al., ICLR 2020)		60.04	72.96	49.08	76.41	35.35	61.18
34 Oct 4, 2020	AFSgraph model (single model) Anonymous		60.06	72.97	48.49	75.94	35.03	60.90
35 Dec 1, 2019	HGN-large + SemanticRetrievalMRS IR (single model) Anonymous		57.85	69.93	51.01	76.82	37.17	60.74
36 Jan 24, 2021	DPR-recurrent (single model) Anonymous		59.79	72.65	47.95	74.89	34.54	60.23
37 Jan 19, 2021	RoBERTa-DenseRetriever (single model) Anonymous		59.60	72.43	47.87	74.79	34.53	60.05
38 Oct 7, 2019	HGN + SemanticRetrievalMRS IR (single model) Microsoft Dynamics 365 AI Research Fang et al., 2019		56.71	69.16	49.97	76.39	35.63	59.86
39 Jul 27, 2020	()		58.89	71.60	48.03	75.69	34.46	59.84
40 Jan 21, 2021	GraphRR-Fast (single model) Anonymous		58.21	70.86	42.91	71.30	30.95	56.85
41 Feb 13, 2020	DR model (single model) Anonymous		58.82	71.68	41.55	72.54	29.34	56.82
42 Dec 8, 2019	Quark + SemanticRetrievalMRS IR (single model) Allen Institute for AI and Indian Institute of Technology A Simple Yet Strong Pipeline for HotpotQA		55.50	67.51	45.64	72.95	32.89	56.23
43 May 6, 2021	GAR-BERT (single model) York University		52.28	64.84	49.00	74.73	33.00	56.10
44 Sep 20, 2019	Graph-based Recurrent Retriever (single model) Anonymous		56.04	68.87	44.14	73.03	29.18	55.31
45 Sep 28, 2019	MIR+EPS+BERT (single model) Anonymous		52.86	64.79	42.75	72.00	31.19	54.75
46 Mar 14, 2021	GAR (single model) York University		48.22	61.33	48.34	73.89	30.61	52.95
47 Feb 4, 2020	Transformer-XH-final(BERT-base) (single model) University of Maryland, Microsoft AI & Research (Zhao et al. ICLR 2020)		51.60	64.07	40.91	71.42	26.14	51.29
48 Sep 21, 2019	Transformer-XH (single model) Anonymous		48.95	60.75	41.66	70.01	27.13	49.57
49 May 15, 2019	SemanticRetrievalMRS (single model) UNC-NLP (Nie et al., EMNLP'2019)		45.32	57.34	38.67	70.83	25.14	47.60
50 Nov 28, 2020	()		43.22	54.35	38.62	63.61	25.37	44.88
51 Feb 21, 2020	DrKIT (single model) Carnegie Mellon University, Google Research (Dhingra et al, ICLR 2020)		42.13	51.72	37.05	59.84	24.69	42.88
52 Nov 28, 2020	()		38.94	50.72	38.29	62.19	23.33	41.77
53 Jul 31, 2019	Entity-centric BERT Pipeline (single model) Anonymous		41.82	53.09	26.26	57.29	17.01	39.18
54 May 21, 2019	GoldEn Retriever (single model) Stanford University (Qi et al., EMNLP-IJCNLP 2019)		37.92	48.58	30.69	64.24	18.04	39.13
55 Aug 14, 2019	PR-Bert (single model) KingSoft AI Lab		43.33	53.79	21.90	59.63	14.50	39.11
56 Dec 4, 2019	SAFSr-Bert (single model) Anonymous		39.35	51.40	24.21	58.54	13.34	37.00
57 Feb 21, 2019	Cognitive Graph QA (single model) Tsinghua KEG & Alibaba DAMO Academy (Ding et al., ACL'19)		37.12	48.87	22.82	57.69	12.42	34.92
58 Mar 14, 2021	GAR-NOSF (single model) York University		47.50	60.62	7.62	44.79	4.88	33.36
59 Apr 12, 2021	IKFGraph (single model) anonymous		35.82	45.33	15.97	51.20	11.46	30.38
60 Jul 8, 2022	AnonymousQ (single model) Anonymous		36.85	45.95	15.25	46.76	11.54	29.07
Feb 12, 2024	()		36.85	45.95	15.25	46.76	11.54	29.07
62 May 15, 2023	HGN Model-reproduce (single model) Peking University		33.51	42.69	15.59	49.32	10.95	28.40
63 Mar 5, 2019	MUPPET (single model) Technion (Feldman and El-Yaniv, ACL'19)		30.61	40.26	16.65	47.33	10.85	27.01
64 Apr 7, 2019	GRN + BERT (single model) Anonymous		29.87	39.14	13.16	49.67	8.26	25.84
65 May 20, 2019	Entity-centric IR (single model) Anonymous		35.36	46.26	0.06	43.16	0.02	25.47
66 May 19, 2019	KGNN (single model) Tsinghua University (Ye et al., 2019)		27.65	37.19	12.65	47.19	7.03	24.66
67 Aug 16, 2019	SAQA (single model) Anonymous		28.44	38.62	14.69	47.17	8.62	24.49
68 Mar 4, 2019	GRN (single model) Anonymous		27.34	36.48	12.23	48.75	7.40	23.55
69 Nov 25, 2018	QFE (single model) NTT Media Intelligence Laboratories (Nishida et al., ACL'19)		28.66	38.06	14.20	44.35	8.69	23.10
70 Nov 29, 2019	SAFSr_model (single model) Anonymous		28.91	39.14	8.03	40.55	4.06	20.90
71 Oct 12, 2018	Baseline Model (single model) Carnegie Mellon University, Stanford University, & Université de Montréal (Yang, Qi, Zhang, et al. 2018)		23.95	32.89	3.86	37.71	1.85	16.15
72 Nov 26, 2023	()		7.35	12.14	0.00	7.84	0.00	1.11
73 Jan 30, 2021	graph-recurrent-retriever+roberta-base w. S/R-pretraining (single model) Anonymous		58.13	70.96	0.00	0.00	0.00	0.00
74 Mar 1, 2019	()		30.00	40.65	0.00	0.00	0.00	0.00
75 Jun 25, 2024	Mistral multi hop with very large sources (single model) Anonymous		7.98	22.14	0.00	0.00	0.00	0.00
- Dec 13, 2022	()		58.05	71.08	N/A	N/A	N/A	N/A
- May 19, 2019	TPReasoner w/o BERT (single model) Anonymous		36.04	47.43	N/A	N/A	N/A	N/A
- Mar 3, 2019	MultiQA (single model) Anonymous		30.73	40.23	N/A	N/A	N/A	N/A