Wikipedia hoaxes
The master webpage for this project is hosted at the University of Maryland:
http://cs.umd.edu/~srijan/hoax/
Dataset information
Wikipedia has over 35,000,000 articles in over 290 languages. However, not all
the articles are genuine. Hoax articles are purely fabricated articles that were
created to mislead people.
In the paper cited below we study all actual and wrongly suspected hoaxes ever
identified in the English version of Wikipedia. Most of them have been permanently
deleted from Wikipedia's version history, so we had access to them only under a
non-disclosure agreement. Therefore we are unable to publish the full dataset we
work with in the paper. Instead, we publish a smaller dataset of hoaxes that are
also publicly available (on websites such as Speedy Deletion Wiki or Deletionpedia),
alongside an equally sized set of non-hoaxes.
This dataset contains 64 publicly available hoax articles, each of which has
the following properties:
- the article was patrolled,
- the article was not deleted for at least one month, and
- the article has at least 100 page views per day.
It also contains a set of 64 non-hoax articles that have the same three properties
as the hoax articles. In addition, these non-hoax articles are selected such that
(i) for each hoax, there is a non-hoax article that was created on the same day, and
(ii) the two sets have similar appearance features (see Section 6 of the paper).
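The selection rules above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the Article record, its field names, and the reading of "one month" as 30 days are assumptions made for illustration, and criterion (ii), the similarity of appearance features, is not modelled here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    # Hypothetical record; the released dataset does not ship these fields.
    title: str
    created: date
    patrolled: bool
    days_survived: int   # days the article stayed on Wikipedia undeleted
    daily_views: float   # average page views per day


def meets_criteria(article, min_days=30, min_views=100):
    """Check the three listed properties (thresholds are assumptions)."""
    return (
        article.patrolled
        and article.days_survived >= min_days
        and article.daily_views >= min_views
    )


def match_by_creation_day(hoaxes, candidates):
    """Pair each hoax with one unused non-hoax created on the same day."""
    by_day = {}
    for candidate in candidates:
        by_day.setdefault(candidate.created, []).append(candidate)
    pairs = []
    for hoax in hoaxes:
        pool = by_day.get(hoax.created, [])
        if pool:
            # Take the first remaining candidate so it is not reused.
            pairs.append((hoax, pool.pop(0)))
    return pairs
```

Filtering both sets with meets_criteria before calling match_by_creation_day keeps the two sets the same size, since every hoax is paired with at most one non-hoax.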
Sources (citations)
Files
The dataset contains four folders:
- hoax_markup_cleaned:
This folder has 64 files, one for each hoax article.
Each file is the raw content of the article, along with all its markup
(templates, hyperlinks, images, tables).
- nonhoax_markup_cleaned:
This is the corresponding set for the non-hoax articles.
- hoax_pretty_parsed:
This folder has 64 files, one for each hoax article.
Each file is processed from its version in the hoax_markup_cleaned folder
to make the article look like a Wikipedia article when viewed in a browser.
- nonhoax_pretty_parsed:
This is the corresponding set for the non-hoax articles.
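As a rough sketch of how the folder layout above might be read into labelled examples, the following Python function walks the hoax and non-hoax folders of one kind. It assumes the archive has been extracted so that the four folders sit directly under a root directory, and that the files can be decoded as UTF-8; neither is stated on this page. The function name load_articles and the 1/0 labels are illustrative choices.

```python
from pathlib import Path


def load_articles(root, kind="markup_cleaned"):
    """Return (file_name, text, label) tuples; label 1 = hoax, 0 = non-hoax.

    `kind` selects the folder suffix: "markup_cleaned" or "pretty_parsed".
    """
    root = Path(root)
    examples = []
    for label, prefix in ((1, "hoax"), (0, "nonhoax")):
        folder = root / f"{prefix}_{kind}"
        for path in sorted(folder.iterdir()):
            if path.is_file():
                # Replace undecodable bytes rather than failing (assumed UTF-8).
                text = path.read_text(encoding="utf-8", errors="replace")
                examples.append((path.name, text, label))
    return examples
```

If the layout matches these assumptions, load_articles("wiki-hoaxes") would yield 128 tuples, 64 with each label.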
| File | Description | Size |
| wiki-hoaxes.zip | Content of hoax and non-hoax articles | 1.0 MB |