Querying Open Data Communities Data with SPARQL

Question

There are two main options for retrieving the data you want. In some cases, it is possible to query the data using a publicly available SPARQL endpoint. This is probably the most convenient approach, and the one to take unless there’s some definite reason that you need the data locally. There are limitations to this approach, however, and in those cases, it makes sense to download the dataset and query against it locally. I’ll describe the remote endpoint solution first, and then the solution using local queries. The limitations on the SPARQL endpoint (e.g., hard timeouts) mean that the first approach isn’t sufficient for this particular task, so the specific answer to this question is the second approach.

I wasn’t familiar with these particular datasets and ontologies before this question, so the first approach also walks though the “getting familiar with the data” process.

Using the SPARQL endpoint

There is a Open Data Communities SPARQL endpoint against which you can run queries and get some data out. I haven’t looked at this data before, so rather than just posting the final answer, I’ll walk through the process that I used to figure out what sort of query to run.

One of the pages you linked to, B72 1NB, mentions that the resource has type PostcodeUnit, which has the URI

http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit

Based on this, the first thing I tried was a SPARQL query to try to retrieve some postcode units, so I used the following query in the endpoint above. (If you copy and paste it in there, you’ll need to remove any leading space before SELECT. I had to do that, anyhow.)

SELECT * WHERE { 
  ?postcodeUnit a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit>
}
LIMIT 10

SPARQL results

in the endpoint linked above. (The LIMIT helps ensure that the results come back in a timely manner, and that we’re not asking the server to do too much.) This produces results like

--------------------------------------------------------------
| postcodeUnit                                               |
==============================================================
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY256SA> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY209DR> |
--------------------------------------------------------------

The B72 1NB page shows its lsoa as Birmingham 006C. The IRI for the lsoa property is (and you can see this in the data you downloaded)

http://opendatacommunities.org/def/geography#lsoa

so we extend the SPARQL query to

SELECT * WHERE { 
  ?postcodeUnit
    a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit> ;
    <http://opendatacommunities.org/def/geography#lsoa> ?lsoa .
}
LIMIT 10

SPARQL results

The results are like this:

-----------------------------------------------------------------------------------------------------------------------------
| postcodeUnit                                               | lsoa                                                         |
=============================================================================================================================
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> | <http://opendatacommunities.org/id/geography/lsoa/E01029309> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> | <http://opendatacommunities.org/id/geography/lsoa/E01029706> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> | <http://opendatacommunities.org/id/geography/lsoa/E01018373> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> | <http://opendatacommunities.org/id/geography/lsoa/E01014172> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> | <http://opendatacommunities.org/id/geography/lsoa/E01018514> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> | <http://opendatacommunities.org/id/geography/lsoa/E01029175> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  | <http://opendatacommunities.org/id/geography/lsoa/E01014204> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> | <http://opendatacommunities.org/id/geography/lsoa/E01029225> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SW65TP>  | <http://opendatacommunities.org/id/geography/lsoa/E01001950> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF15AX>  | <http://opendatacommunities.org/id/geography/lsoa/E01014155> |
-----------------------------------------------------------------------------------------------------------------------------

You can use prefixes in your query if you want to make it a bit more readable and concise:

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
SELECT * WHERE { 
  ?postcodeUnit
    a pc:PostcodeUnit ;
    geo:lsoa ?lsoa .
}
LIMIT 10

SPARQL results

The results will be the same, of course. At the bottom of each of those results pages, you can download the results in a number of other formats. One of the formats is CSV, and you might have luck importing that directly into a spreadsheet (you said you wanted to use the data in Excel).

Discussion in the comments pointed out that the sheer number of PostcodeUnits makes the result set very large. The UK Postcodes dataset contains four types of resources, in order of increasing size: Postcode Units, Postcode Sectors, Postcode Districts, and Postcode Areas. There are 1686911, 10833, 2087, and 120 resources of these types, respectively. As I understand the clarification in the comments, the idea is to associate these with Lower Layer Super Output Areas (LSOAs), e.g., Birmingham 006C. Individual Postcode Units are associated with LSOAs, but the higher level postcode regions are not. Each Postcode Unit is within its sector, district, and area. For instance, TA21 9HB is within TA, TA21 9, and TA21. Using this information, we can ask for postcode units and their corresponding district (or sector, or area), as well as their LSOA, and report just the district and the LSOA, ignoring the unit itself. For instance:

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
SELECT DISTINCT ?district ?lsoa 
WHERE { 
  ?postcodeunit a pc:PostcodeUnit ;
                geo:lsoa ?lsoa ;
                sr:within ?district .
  ?district a pc:PostcodeDistrict .
}
LIMIT 10

SPARQL results

Now, there are 34378 LSOAs in the dataset, so there’s still lots of data to be selected, and trying to pull down the text results for all distinct losa/district mappings still results in a timeout. In fact, since every LSOA is associated (I expect) with some district, there are probably as many results in the output as there are LSOAs.

It looks like this is the point where we start to hit response size limits and timeouts for the SPARQL endpoint, and need to start accessing the data locally. The postcode data alone is 5.6 GB though, so this isn’t a wonderful solution.

But, if you’re willing to take a representative LSOA for each district, we can use SPARQL subqueries to pull these out, as in the following query which first retrieves all the postcode districts, and then for each one, finds a single LSOA that some postcode unit in the district has. I don’t know whether this is an acceptable result, but you end up with an LSOA for each district, and the results are small enough (there are 2087 rows, the same as the number of districts) that they can be pulled down in any of the results formats (including CSV).

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
SELECT ?region ?lsoa 
WHERE { 
  {
    SELECT ?region WHERE { 
      ?region a pc:PostcodeDistrict .
    }
  }

  {
    SELECT ?lsoa WHERE { 
      ?postcodeunit a pc:PostcodeUnit ;
                    geo:lsoa ?lsoa ;
                    sr:within ?region .
    }
    LIMIT 1 
  }
}

SPARQL results

Using TDB locally

There are limitations to using the SPARQL endpoint such as the timeouts encountered above. In these situations, it’s not too hard to download the data and get it into a Jena TDB store and to query using tdbquery. The UK postcodes page has the download link for zipped n-triples. After downloading this data, (and having Apache Jena 2.10 installed), I ran (on a Unix system):

$ tdbloader2 --loc tdb dataset_data_postcodes_20130506183000.nt

where tdb is a local directory I make to contain TDB’s indexes. Loading the data takes a while (1125 seconds here), as does indexing. Once everything is loaded up, I stored the following query in a file named postcodes.sparql, and ran the query with

$ tdbquery --loc tdb --results CSV --query postcodes.sparql > unit_lsoa.csv

to generate results in CSV format, stored in the file unit_lsoa.csv. Here are the first few lines:

$ head -5 unit_lsoa.csv 
postcodeUnit,lsoa
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AE,http://opendatacommunities.org/id/geography/lsoa/E01023667
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AG,http://opendatacommunities.org/id/geography/lsoa/E01023741
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AJ,http://opendatacommunities.org/id/geography/lsoa/E01023741
http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AR,http://opendatacommunities.org/id/geography/lsoa/E01023684

Now, there were 1686911 defined postcode units, so I initially expected that there would be the same number of lines in unit_lsoa.csv. However, there are about 200,000 fewer. (wc -l prints the number of lines in a file.)

$ wc -l unit_lsoa.csv 
1440143 unit_lsoa.csv

As it turns out, some of the postcode units do not have associated LSOAs. I checked this by running the query

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
SELECT * WHERE { 
  ?postcodeUnit
    a pc:PostcodeUnit .
    FILTER NOT EXISTS { ?postcodeUnit geo:lsoa ?lsoa }
}

stored in the file postcodes_without_lsoa.sparql:

$ tdbquery --loc tdb \
    --results CSV \
    --query postcodes_without_lsoa.sparql > unit_without_lsoa.csv

Sure enough, there are about 200,000 lines in unit_without_lsoa.csv:

$ wc -l unit_without_lsoa.csv
246770 unit_without_lsoa.csv

The sum of 1440143 and 246770 is 1686913 which is exactly the number of postcodes (plus 2 lines for the headers in each CSV file). Mission accomplished!

Using the SPARQL endpoint

Using TDB locally

More Related Contents:

Leave a Comment Cancel reply