How to extract information from a Wikipedia infobox?

The wrong way: trying to parse HTML

Use (cURL/jQuery/file_get_contents/requests/wget/more jQuery) to fetch the HTML of the article, then use a DOM parser to extract table.infobox tr[3] td, or use a regex.

This is actually a really bad idea most of the time. Wikipedia’s HTML code is not particularly parsing-friendly (especially infoboxes, which are a system of hand-written templates), the exact structure changes from infobox to infobox, and the structure of any given infobox might change over time. You might also miss out on some features that would otherwise be available, such as internationalization.

The other wrong way: trying to parse wikitext

At a glance, the wikitext of some articles looks like it’s a pretty straightforward representation of the infobox:

{{ Infobox Foo
| param1 = bar
| param2 = 123
...

In reality, that’s not the case. Templates are “recursive” so you might run into stuff like param1 = {{convert|10|km|mi}}; template parameters might contain complex wikitext or HTML markup; some parameters might be missing from the article wikitext and fetched by the template from a subpage or other data repository. Just finding out where a parameter starts and ends might not be a simple business if it contains other templates which have their own parameters.
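To make the nesting problem concrete, here is a small sketch in Python. Both function names are made up for this illustration. The first one splits a template body on every pipe character; the second only splits where the pipe is not inside a nested {{...}} or [[...]]. Even the second one is a toy: it ignores triple braces, comments, nowiki sections and tables, so it should not be mistaken for a real wikitext parser.

```python
def naive_split(template_body):
    """Split on every pipe, ignoring nesting (the tempting but wrong approach)."""
    return [part.strip() for part in template_body.split("|")]


def nested_split(template_body):
    """Split on pipes only at nesting depth 0, tracking {{...}} and [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(template_body):
        pair = template_body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif template_body[i] == "|" and depth == 0:
            # A top-level pipe ends the current parameter.
            parts.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(template_body[i])
            i += 1
    parts.append("".join(current).strip())
    return parts


body = "Infobox Foo | param1 = {{convert|10|km|mi}} | param2 = 123"
print(naive_split(body))   # the nested template is cut into pieces
print(nested_split(body))  # the nested template stays inside param1
```

The naive version breaks the {{convert}} call into separate fragments, which is exactly the kind of silent corruption a regex-based approach tends to produce.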

The ideal way: using a structured data source

There are various projects to provide the information contained in Wikipedia infoboxes in a structured form; the two large ones are Wikidata and DBpedia.

Wikidata is a project to build a knowledge base containing structured data; it is maintained by the same global movement that built Wikipedia, and information is gradually being moved over from Wikipedia. This is a manual process, so not all information in Wikipedia is available via Wikidata; on the other hand, there is a lot of information that’s in Wikidata but not in Wikipedia. You can find the Wikidata page of an article and see what information it contains by following the Wikidata item link in the left-hand toolbar on the article page. Programmatically, you can access Wikidata information using the wbgetentities API module (sandbox, explanation of concepts), e.g. wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Albert_Einstein. There is also a SPARQL endpoint, database dumps, and clients in PHP, Java and Python.
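As a rough sketch of the programmatic route, the Python below builds a wbgetentities URL and pulls the values of one property out of a response. The helper names are invented for this example, and the sample response is hand-written and trimmed to the few fields the code reads (Q937 and P569 are the item and the date-of-birth property for Albert Einstein); a real response contains far more. Fetching the URL is left out so the snippet runs without network access.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"


def wbgetentities_url(title, site="enwiki"):
    """Build a wbgetentities request URL for one article title."""
    query = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "props": "claims",
        "format": "json",
    }
    return API_URL + "?" + urlencode(query)


def claim_values(response, property_id):
    """Collect the raw values of one property from a wbgetentities response."""
    values = []
    for entity in response.get("entities", {}).values():
        for claim in entity.get("claims", {}).get(property_id, []):
            snak = claim.get("mainsnak", {})
            # "novalue" and "somevalue" snaks carry no datavalue.
            if snak.get("snaktype") == "value":
                values.append(snak["datavalue"]["value"])
    return values


# Hand-written sample in the shape of a response, not real API output.
sample_response = {
    "entities": {
        "Q937": {
            "claims": {
                "P569": [
                    {
                        "mainsnak": {
                            "snaktype": "value",
                            "property": "P569",
                            "datavalue": {
                                "type": "time",
                                "value": {"time": "+1879-03-14T00:00:00Z", "precision": 11},
                            },
                        }
                    }
                ]
            }
        }
    }
}

print(wbgetentities_url("Albert_Einstein"))
print(claim_values(sample_response, "P569"))
```

Values come back keyed by property IDs rather than human-readable names, so in practice you would also need to look up which property corresponds to the infobox field you care about.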

DBpedia is a project to harvest Wikipedia infobox information by automated means and publish it in a structured form. You can find the DBpedia page for a Wikipedia article by going to http://dbpedia.org/page/<Wikipedia article name>, e.g. http://dbpedia.org/page/Albert_Einstein. It offers many data formats, dumps, a SPARQL endpoint and various other things.
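For the SPARQL route, a query that lists everything DBpedia knows about one article could be put together roughly like this. Treat the details as assumptions rather than documentation: the function names are made up, the endpoint address https://dbpedia.org/sparql and the query/format request parameters reflect how the public endpoint is commonly used, and the simple title-to-URI mapping will not be right for every article name. The snippet only builds the request URL; it does not send it.

```python
from urllib.parse import quote, urlencode

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"  # assumed public endpoint


def resource_uri(article_name):
    """Map an article name to its DBpedia resource URI (simplified)."""
    return "http://dbpedia.org/resource/" + quote(article_name.replace(" ", "_"))


def properties_query(article_name, limit=50):
    """Build a SPARQL query listing property/value pairs for one resource."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{resource_uri(article_name)}> ?property ?value "
        f"}} LIMIT {int(limit)}"
    )


def sparql_url(query):
    """Build a GET URL asking the endpoint for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )


query = properties_query("Albert Einstein", limit=10)
print(query)
print(sparql_url(query))
```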

The wrong ways done right

If the information you need is not available via Wikidata or DBpedia, there are still semi-structured ways of extracting data from infoboxes. For HTML-based extraction you can use Wikipedia’s REST content API (e.g. https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein) which returns a richer, more semantic HTML than the one used on normal article pages, and preserves in it some information about template structure.
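Here is a sketch of what using that template information might look like, with only the Python standard library. It assumes the HTML marks template output with typeof="mw:Transclusion" and stores the template name and parameters as JSON in a data-mw attribute, which is how the Parsoid output is structured as far as I know. The class name is made up, and the sample HTML is hand-written for the example, not copied from a real API response.

```python
import json
from html.parser import HTMLParser


class TransclusionCollector(HTMLParser):
    """Collect template names and parameters from data-mw attributes."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        types = (attributes.get("typeof") or "").split()
        if "mw:Transclusion" not in types or not attributes.get("data-mw"):
            return
        for part in json.loads(attributes["data-mw"]).get("parts", []):
            # Parts can also be plain strings (text between templates).
            if isinstance(part, dict) and "template" in part:
                template = part["template"]
                name = template["target"]["wt"].strip()
                params = {
                    key: value.get("wt", "").strip()
                    for key, value in template.get("params", {}).items()
                }
                self.templates.append((name, params))


# Hand-written sample imitating the assumed markup.
sample_html = (
    '<table class="infobox" typeof="mw:Transclusion" about="#mwt1" '
    'data-mw=\'{"parts":[{"template":{"target":{"wt":"Infobox Foo\\n",'
    '"href":"./Template:Infobox_Foo"},"params":{"param1":{"wt":"bar"},'
    '"param2":{"wt":"123"}},"i":0}}]}\'></table>'
)

collector = TransclusionCollector()
collector.feed(sample_html)
print(collector.templates)
```

Parameter values are still raw wikitext here, so nested templates show up unexpanded; the rendered values live in the table cells themselves.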

Alternatively, you might start from wikitext and parse it into a syntax tree using the simpler, client-side mwparserfromhell Python module (docs) or the more powerful parsoid-jsapi which interacts with the Wikipedia REST content service.
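A minimal sketch of the mwparserfromhell approach could look like the following. The function name and the "name starts with infobox" heuristic are my own choices for illustration, and the library is a third-party dependency that has to be installed separately. Fetching the wikitext is left out; the sample reuses the made-up infobox from earlier.

```python
def infobox_params(wikitext, name_prefix="infobox"):
    """Return the parameters of the first template whose name starts with name_prefix."""
    # Third-party dependency (pip install mwparserfromhell), imported lazily
    # so that the rest of a script still loads without it.
    import mwparserfromhell

    for template in mwparserfromhell.parse(wikitext).filter_templates():
        if str(template.name).strip().lower().startswith(name_prefix):
            # Values stay as wikitext, so nested templates are kept verbatim.
            return {
                str(param.name).strip(): str(param.value).strip()
                for param in template.params
            }
    return {}


wikitext = """{{ Infobox Foo
| param1 = {{convert|10|km|mi}}
| param2 = 123
}}"""

# infobox_params(wikitext) maps parameter names to their raw wikitext values.
```

Because the parser understands nesting, the {{convert}} call ends up inside the value of param1 instead of being split apart. It does not expand templates, though, so anything the infobox template computes or fetches from elsewhere is still missing.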

A higher-level Python library which tries to extract infobox contents from wikitext is wptools.
