HotpotQA

A Dataset for Diverse, Explainable Multi-hop Question Answering

Preprocessed Wikipedia for HotpotQA

To build HotpotQA, we downloaded the English Wikipedia dump of October 1, 2017 from Wikimedia, and preprocessed it with (our own fork of) WikiExtractor to extract plain text with hyperlinks, followed by Stanford CoreNLP (version 3.8.0) for tokenization and sentence segmentation.

The processed Wikipedia can be downloaded here (BZip2 format, 7,413,895,794 bytes, MD5[1] 62b8027b5803173d4383669d8d162509, distributed under a CC BY-SA 4.0 License). When decompressed[2], this results in a directory named enwiki-20171001-pages-meta-current-withlinks-processed with numerous subdirectories, each containing a few .bz2 files (it is strongly suggested that you read these files programmatically, e.g. through the bz2 package in Python, instead of decompressing them, which would take up a lot of disk space). Each of these .bz2 files contains a few processed Wikipedia pages in JSON format. Each JSON object is on its own line and has the following format:

{
    "id": 12,
    "url": "https://en.wikipedia.org/wiki?curid=12",
    "title": "Anarchism",
    "text": [["Anarchism"], ["Anarchism is a <a href=\"political%20philosophy\">political philosophy</a> that advocates <a href=\"self-governance\">self-governed</a> societies based on voluntary institutions.", ...], ... ],
    "charoffset": [[[[0, 9]]], [[[0, 9], [10, 12], [13, 14], [15, 48], [48, 57], [58, 68], [68, 72], [73, 77], [78, 87], [88, 114], [114, 127], [127, 131], [132, 141], [142, 147], [148, 150], [151, 160], [161, 173], [173, 174]], ...], ...]
}
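
As a minimal sketch of reading these files programmatically (the path below is only illustrative; actual subdirectory and file names will differ), each line can be decompressed on the fly and parsed as one JSON object:

import bz2
import json

# Illustrative path; each subdirectory of the processed dump holds several such .bz2 files
path = "enwiki-20171001-pages-meta-current-withlinks-processed/AA/wiki_00.bz2"

with bz2.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)           # one processed Wikipedia page per line
        print(page["id"], page["title"])  # e.g. 12 Anarchism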

Here, the fields are as follows:

  • id is the unique numerical ID of the article.
  • url is the URL of the actual Wikipedia article.
  • title is the title of that article, which also serves as its textual identifier (case-insensitive).
  • text is a list of lists of strings representing the plaintext of the original Wikipedia article, where each list of strings corresponds to a paragraph and each string to a sentence in that paragraph. Note that hyperlinks are preserved as HTML tags directly in these sentences; each hyperlink points to its target page by that page's textual identifier (in the href attribute), escaped with standard URL encoding for special characters (e.g. a space becomes %20, as seen in the example). All sentences in a paragraph can be joined without separators to recover the original paragraph (Python example: "".join(sentences)).
  • charoffset has a structure parallel to text, where each sentence string corresponds to a list of tuples, each tuple giving the 0-based start (inclusive) and end (exclusive) offsets of a token with respect to the beginning of the paragraph (Python example to recover a token: paragraph_str[st:en]). All XML tags (hyperlinks) are whole tokens, so they can be removed to easily recover the plaintext.
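
To make this structure concrete, here is a small sketch (assuming page holds one parsed JSON object as above; the tag-filtering heuristic is ours for illustration, not part of the release) that reconstructs each paragraph, recovers its tokens from charoffset, and drops the hyperlink tags:

# `page` is one parsed JSON object from the dump (see the reading sketch above)
for para_sents, para_offsets in zip(page["text"], page["charoffset"]):
    paragraph_str = "".join(para_sents)          # sentences join back into the paragraph
    tokens = []
    for sent_offsets in para_offsets:            # one list of [start, end) pairs per sentence
        for st, en in sent_offsets:
            tokens.append(paragraph_str[st:en])  # offsets are relative to the paragraph start
    # Hyperlink tags are whole tokens; this simple filter drops them to leave plaintext tokens
    plain_tokens = [t for t in tokens if not (t.startswith("<a ") or t == "</a>")]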

NEW (Jan. 14, 2019) We also release the introductory paragraphs used in HotpotQA for the fullwiki setting, including both a plaintext version and a version with the original Wikipedia hyperlinks. The archive can be downloaded here (BZip2 format, 1,553,565,403 bytes, MD5 01edf64cd120ecc03a2745352779514c, distributed under a CC BY-SA 4.0 License). Its format is similar to that of the file above, but with two additional fields, text_with_links and charoffset_with_links, which contain the raw text and character offsets with the hyperlinks kept in the text. When indexing these paragraphs for information retrieval, we use the text field without the hyperlinks. Each of these paragraphs is the first paragraph in its respective Wikipedia document that contains more than 50 characters after hyperlinks are stripped (note that this differs from what we described previously; sorry about the confusion!).
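
The following is only a rough sketch of that selection rule (the intro_paragraph helper and the regular expression are ours for illustration; the exact code used to produce the release may differ):

import re

# Simple pattern for the <a href="..."> and </a> tags kept in the processed text
TAG_RE = re.compile(r"</?a[^>]*>")

def intro_paragraph(page):
    # Return the first paragraph whose hyperlink-stripped text exceeds 50 characters
    for para_sents in page["text"]:
        stripped = TAG_RE.sub("", "".join(para_sents))
        if len(stripped) > 50:
            return stripped
    return None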

Related Links

  • The original Wikipedia dump used to collect HotpotQA (BZip2 format, 27,991,685,710 bytes, MD5 4daab7303ea57b6f40cd139b3a640d0d, distributed under the original CC BY-SA 3.0 Unported License from Wikipedia);
  • Our own fork of Attardi's WikiExtractor, in which we fixed an issue with expanding convert templates so that text spans such as "123 km" are rendered properly (instead of being omitted altogether).

  1. To compute the MD5 checksum on a Unix system, run md5sum <bz2_file_you_downloaded>

  2. On a Unix system, this can be done with tar -xjvf <bz2_file_you_downloaded>