countculture

Open data and all that

Posts Tagged ‘linked data’

Introducing OpenCharities: Opening up the Charities Register

A couple of weeks ago I needed a list of all the charities in the UK and their registration numbers so that I could try to match them up to the local council spending data OpenlyLocal is aggregating and trying to make sense of. A fairly simple request, you’d think, especially in this new world of transparency and open data, and for a dataset that’s uncontentious.

Well, you’d be wrong. There’s nothing at data.gov.uk, nothing at CKAN, and nothing on the Charity Commission website; in fact you can’t even see the whole register on the website, just the first 500 results of any search or category. Here’s what the Charity Commission says on its website (NB: the extract below is truncated):

The Commission can provide an electronic copy in discharge of its duty to provide a legible copy of publicly available information if the person requesting the copy is happy to receive it in that form. There is no obligation on the Commission to provide a copy in this form…

The Commission will not provide an electronic copy of any material subject to Crown copyright or to Crown database right unless it is satisfied… that the Requestor intends to re-use the information in an appropriate manner.

Hmmm. Time for Twitter to come to the rescue, to check that some other independently minded person hadn’t already solved the problem. Nothing, but I did get pointed to this request for the data to be unlocked, with a very recent response from the Charity Commission essentially saying, “Nope, we ain’t going to release it”:

For resource reasons we are not able to display the entire Register of Charities. Searches are therefore limited to 500 results… We cannot allow full access to all the data, held on the register, as there are limitations on the use of data extracted from the Register… However, we are happy to consider granting access to our records on receipt of a written request to the Departmental Record Officer

OK, so it seems as though they have no intention of making this data available anytime soon (I don’t actually buy that there are intellectual property or data privacy issues with making basic information about charities available, and if there really are, that needs to be changed, pronto), so it was time for some screen-scraping. It turns out to be a pretty difficult website to scrape, because it requires both cookies and JavaScript to work properly.

Try turning off both in your browser and see how far you get; that will also give you an idea of how difficult the site is to use if you have accessibility issues. And check out their poor excuse for an accessibility statement, i.e. tough luck.

Still, there’s usually a way, even if it does mean some pretty tortuous routes, and like the similarly inaccessible Birmingham City Council website, this is just the sort of challenge that stubborn so-and-sos like me won’t give up on.

The way to get the info turned out to be the geographical search (the other routes relied on JavaScript), and although it was still problematic, it was doable. So now we have an open data register of charities, incorporated into OpenlyLocal and tied in to the spending data being published by councils.
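
For the curious, the core of the approach looks something like the sketch below, using the Mechanize gem, which keeps cookies across requests (the thing the Charity Commission site insists on). This is a minimal, hypothetical version, not the actual scraper: the search path, parameter names and page structure are illustrative guesses.

    require 'mechanize'

    # Mechanize maintains a cookie jar across requests, which gets us past the
    # site's cookie requirement without needing a JavaScript-capable browser.
    agent = Mechanize.new

    # Warm up the session on the home page so the site sets its cookies.
    agent.get('http://www.charitycommission.gov.uk/')

    # Then page through the geographical search results.
    # NB: the path and parameter names below are made up for illustration.
    page = agent.get('http://www.charitycommission.gov.uk/geographical_search',
                     'region' => 'London', 'page' => '1')
    page.search('table.results tr').each do |row|
      # pull the charity number and name out of each result row here
    end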

[Screenshot: a charity appearing as a supplier to a local authority on OpenlyLocal]

And because this sort of thing is so easy once you’ve got it in a database (Charity Commission take note), there are a couple of bonuses.

First, it was relatively easy to knock up a quick and very simple Sinatra application, OpenCharities:

[Screenshot: Open Charities :: Opening up the UK Charities Register]

If there’s any interest I’ll add more features, but for now it’s the simplest of things: a web application with a unique URL for every charity based on its charity number, with the basic information for each charity available as data (XML, JSON and RDF). It’s also searchable, and sortable by most recent income and spending, and for linked-data people there are dereferenceable resource URIs.
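
The heart of such an app really is tiny. Here’s a minimal sketch in the spirit of OpenCharities, not its actual source: one route per charity number for the HTML page, plus a .json variant for the data. The in-memory CHARITIES hash and its fields are stand-ins for the real database.

    require 'sinatra'
    require 'json'

    # A stand-in for the real database of ~150,000 charities.
    CHARITIES = {
      '1234567' => { title: 'Example Charity', income: 10_000, spending: 8_000 }
    }

    # The same record as data. Defined first so it isn't swallowed by the
    # more general route below.
    get '/charities/:charity_number.json' do
      charity = CHARITIES[params[:charity_number]] or halt 404
      content_type :json
      charity.merge(charity_number: params[:charity_number]).to_json
    end

    # Human-readable page: one unique URL per charity number.
    get '/charities/:charity_number' do
      charity = CHARITIES[params[:charity_number]] or halt 404
      "<h1>#{charity[:title]}</h1>"
    end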

This is very much an alpha application: the design is very basic, and it’s possible that a few charities are missing, for two reasons. One, the Charity Commission site kept timing out (I think I managed to pick up all of those, and any stragglers should get picked up when I periodically re-run the scraper); and two, there appears to be a bug in the Charity Commission website whereby, when a search returns between 10 and 13 entries, only 10 are shown and there is no way of seeing the additional ones. As a benchmark, there are currently 150,422 charities in the OpenCharities database.

It’s also worth mentioning that due to inconsistencies with the page structure, the income/spending data for some of the biggest charities is not yet in the system. I’ve worked out a fix, and the entries will be gradually updated, but only as they are re-scraped.

The second bonus is that the entire database is available to download and reuse (under an open, share-alike attribution licence). It’s a compressed CSV file, weighing in at just under 20MB, and should probably only be attempted by those familiar with manipulating large datasets (don’t try opening it in your spreadsheet, for example). I’m also in the process of importing it into Google Fusion Tables (it’s still churning away in the background) and will post a link when it’s done.
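
If you do want to work with the dump programmatically, something like the sketch below streams it a row at a time rather than loading the whole thing into memory. I’m assuming a gzipped file and guessing at the filename and column names; check the header row of the real download.

    require 'zlib'
    require 'csv'

    # Stream the compressed register row by row; nothing is held in memory.
    Zlib::GzipReader.open('open_charities.csv.gz') do |gz|
      CSV.new(gz, headers: true).each do |row|
        # Column names are assumptions; inspect row.headers for the real ones.
        puts "#{row['charity_number']}: #{row['title']}"
      end
    end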

Now, back to that spending data.

Written by countculture

September 6, 2010 at 1:15 pm

A Local Spending Data wish… granted

The very wonderful Stuart Harrison (aka pezholio), webmaster at Lichfield District Council, blogged yesterday with some thoughts about the publication of spending data following a local spending data workshop in Birmingham. Sadly I wasn’t able to attend this, but Stuart gives a very comprehensive account, and like all his posts it’s well worth reading.

In it he made an important observation about those at the workshop who were pushing for linked data from the beginning, and wished for a solution. First, the observation:

There did seem to be a bit of resistance to the linked data approach, mainly because agreeing standards seems to be a long, drawn out process, which is counter to the JFDI approach of publishing local data… I also recognise that there are difficulties in both publishing the data and also working with it… As we learned from the local elections project, often local authorities don’t even have people who are competent in HTML, let alone RDF, SPARQL etc.

He’s not wrong there. As someone who’s been publishing linked data for some time, and who conceived and ran the Open Election Data project Stuart refers to, working with numerous councils to help them publish linked data, I’m probably as aware of the issues as anyone (ironically, and I think significantly, none of the councils involved in the local government e-standards body, and now pushing so hard for linked data, has actually published any linked data itself).

That’s not to knock linked data, just to be realistic about the issues and hurdles that need to be overcome (see the report for a full breakdown). To expect all the councils to solve all these problems at the same time as extracting the data from their systems, removing data relating to non-suppliers (e.g. foster parents), and including information from other systems (e.g. supplier data, which may be on procurement systems), and all by January, is unrealistic at best, and could undermine the whole process.

So what’s to be done? I think the sensible thing, particularly in these straitened times, is to concentrate on getting the raw data out, and as much of it as possible, and to come down hard on those councils who publish it badly (e.g. by locking it up in PDFs or giving it a closed licence) or who wilfully ignore the guidance (it’s worrying how many councils publishing data at the moment don’t even include the transaction ID or the date of the transaction, never mind supplier details).

Beyond that we should take the approach the web always has, and which is the reason for its success: a decentralised, messy variety of implementations and solutions that allows a rich ecosystem to develop, with government helping solve bottlenecks and structural problems rather than trying to impose highly centralised solutions to problems that are already being solved elsewhere.

Yes, I’d love it if councils were able to publish the data fully marked up, in a variety of forms (not just linked data, but also XML and JSON), but the ugly truth is that not a single council has so far even published its list of categories, never mind matched it to a recognised standard (CIPFA BVACOP, COFOG, or that used in their submissions to the CLG), still less done anything like linked data. So there’s a long way to go, and in the meantime we’re going to need some tools and cheap commodity services to bridge the gap.

[In a perfect world, maybe councils would develop some open-source tools to help them publish the data, perhaps using something like Adrian Short's Armchair Auditor code as the basis (this is a project that took a single council, Windsor & Maidenhead, and added a web interface to the figures). However, when many councils don't even have competent HTML skills in-house (having outsourced much of it), this is only going to happen at a handful of councils at best, unless considerable investment is made.]

Stuart had been thinking along similar lines, and made a suggestion, almost a wish in fact:

I think the way forward is a centralised approach, with authorities publishing CSVs in a standard format on their website and some kind of system picking up these CSVs (say, on a monthly basis) and converting this data to a linked data format (as well as publishing in vanilla XML, JSON and CSV format).

He then expanded on the idea, talking about a single URL for each transaction, standard identifiers, and “a human-readable summary of the data, together with links to the actual data in RDF, XML, CSV and JSON”. I’m a bit iffy about that ‘centralised approach’ phrase (the web is all about decentralisation), but I do think there’s an opportunity to help both the community and councils by solving some of these problems.

And that’s exactly what we’ve done at OpenlyLocal: adding the data from all the councils who’ve published their spending data, acting as a central repository, generating the URLs, and connecting the data to other datasets and identifiers (councils with SNAC IDs, companies with Companies House numbers). We’ve even extracted data from those councils who unhelpfully try to lock up their data in PDFs.
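
To give a flavour of the bridging work involved, here’s a rough sketch, not OpenlyLocal’s actual code, of the normalisation step for a single council’s CSV. The column names are guesses; in practice every council names and formats them differently, which is a large part of the problem.

    require 'csv'
    require 'date'

    Transaction = Struct.new(:transaction_id, :date, :supplier, :amount)

    # Turn one council's published spending CSV into clean transaction records.
    def parse_spending_csv(io)
      CSV.new(io, headers: true).map do |row|
        Transaction.new(
          row['Transaction ID'],                        # often missing entirely
          row['Date'] ? Date.parse(row['Date']) : nil,  # date formats vary wildly
          row['Supplier Name'],
          row['Amount'].to_s.delete('£,').to_f          # strip currency noise
        )
      end
    end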

There are, at the time of writing, 52,443 financial transactions from 9 councils in the OpenlyLocal database. And that’s not all; there are also the following features:

  • Each transaction is tied to a supplier record for the council, and increasingly these are linked to company info (including their company number), or other councils (there’s a lot of money being transferred between councils), and users can add information about the supplier if we haven’t matched it up.
  • Every transaction, supplier and company has a permanent unique URL and is available as XML and JSON (see the sketch after this list).
  • We’ve sorted out some of the date issues (adding a date fuzziness field for those councils who don’t specify when in the month or quarter a transaction relates to).
  • Transactions are linked to the URL from which the file was downloaded (and usually the line number too, though obviously this is not possible if we’ve had to extract it from a PDF), meaning anyone else can recreate the dataset should they want to.
  • There’s an increasing amount of analysis, showing ordinary users the spending by month, and the biggest suppliers and transactions, for example.
  • The whole spending dataset is available as a single, zipped CSV file to download for anyone else to use.
  • It’s all open data.
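
As an example of the second point, pulling a record as data takes only a few lines. The URL pattern below is an assumption for illustration; check the site for the real paths.

    require 'net/http'
    require 'json'

    # Fetch a single transaction record as JSON (hypothetical URL pattern).
    url = URI('http://openlylocal.com/financial_transactions/1234.json')
    transaction = JSON.parse(Net::HTTP.get(url))
    puts transaction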

There are a couple of features Stuart mentions that we haven’t yet implemented, for good reason.

First, we’re not yet publishing it as linked data, for the simple reason that the vocabulary hasn’t yet been defined, nor even the standards on which it will be based. When this is done, we’ll add this as a representation.

And although we use standard identifiers such as SNAC IDs for councils (and wards) on OpenlyLocal, the URL structure Stuart mentions is not yet practical, partly because SNAC IDs don’t cover all authorities (they don’t include the GLA or other public bodies, for example), and partly because only a tiny fraction of councils are publishing their internal transaction IDs.

Also, we haven’t yet implemented comments on the transactions, for the simple reason that distributed comment systems such as Disqus are JavaScript-based and thus problematic for those with accessibility issues, while site-specific ones don’t allow the conversation to be carried on elsewhere (we think we might have a solution to this, but it’s at an early stage, and we’d be interested to hear other ideas).

But all in all, we reckon we’re pretty much there with Stuart’s wish list, and would hope that councils can get on with extracting the raw data, publishing it in an open, machine-readable format (such as CSV), and then move to linked data as their resources allow.

Written by countculture

August 3, 2010 at 7:45 am

The Audit Commission, open data and the quest for relevance

[Note: I'm writing this post in a personal capacity, not on behalf of the Local Public Data Panel, on which I sit]

Alternative heading: Audit Commission under threat tries to divert government open data agenda. Fails to get it.

A week or so ago saw the release of a report from the Audit Commission, The Truth Is Out There, which “looks at how the public sector can improve information made available to the public”.

The timing of this is interesting, coming shortly before the Government’s landmark announcement about opening up data and using it to restructure government, and after a series of Conservative-leaning events or announcements making a similar case, albeit framed slightly differently.

Given all this, I’m guessing the Audit Commission is a tough place to be right now. Local authorities have long complained about the burden it puts on them, the Conservatives have made it plain they see it as a problem rather than a solution so far as efficiency goes, and even the government is scaling back its desire to have targets for everything.

You might hope, then, that the paper would show a realisation by the Commission that if it doesn’t change its perspective it will become at best irrelevant, and at worst a roadblock to open data, increased transparency, efficiency and genuine change.

First it’s worth pointing out some background: in short, it’s a typical government body — all focused on process rather than delivery. And its response to the changing landscape of open data, the move from a web of documents to a web of data, and the potential to engage with data directly rather than through the medium of dry official reports?

Actually it’s what you’d expect: there’s a fair bit of social-media blah-blah-blah — Facebook, US open data initiatives, MySociety/FixMyStreet, etc; there’s a bit about transparency that doesn’t actually say much; and then there’s a lot of justification for why there needs to be an Audit Commission-type body, which manages both to include jargon (RQP) and to avoid talking about the real problems preventing this.

What are these?

  • Structural problems — although the net financial benefit to government as a whole will be significant, this will be achieved by stripping out existing wasteful processes, duplication and intermediary organisations. The idea that a local authority should supply the same dataset to three different bodies in three different formats and three different ways is ludicrous, particularly when those bodies then spend even more time reworking the data to allow matching with other datasets.

    This is just unnecessary gunk that’s gumming up the works, and the truth is the Audit Commission is one of those problem bodies.

  • Technical/contractual problems — it’s not always easy for legacy systems to expose data, and even where it is, the nature of public-sector IT procurement means that it’s going to cost. Ultimately we need to change how government does IT, but in the meantime we need to make sure the money comes from the vast savings to be made by removing the gunk. This means overcoming silos, which is no easy task.
  • Identifier problems – being able to uniquely identify bodies, areas, categories, etc. Anyone who’s ever done any playing around with government data knows this is one of the central frustrations and blockers when combining data. Is this local authority/ward/police authority/company the same as that one? What do we mean by ‘primary school’ spending, and can we match it against this figure from central government? Some of these questions are hard to answer, but they are made much harder when organisations don’t use common, public identifiers (a toy illustration follows this list).
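
To make that last point concrete, here’s a toy sketch of what you’re reduced to when no common identifier is published: normalising free-text organisation names and hoping for the best. It’s illustrative only, and exactly the kind of guesswork that shared, public identifiers would make unnecessary.

    # Crude name canonicalisation: the same council appears under many spellings.
    def normalise(name)
      name.downcase
          .gsub(/\bcc\b/, 'city council')  # expand a common abbreviation
          .gsub(/[^a-z ]/, ' ')            # drop punctuation
          .squeeze(' ')
          .strip
    end

    normalise('Birmingham City Council')   # => "birmingham city council"
    normalise('BIRMINGHAM CC')             # => "birmingham city council"
    normalise('Birmingham City Council.')  # => "birmingham city council"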

Astonishingly, the Audit Commission paper doesn’t really cover these issues (and doesn’t even mention the issue of identifiers, perhaps because it’s so bad at them). Is this because the Commission hasn’t really understood the issues, or because the paper is more about trying to make it seem relevant in a changing world? Either way, it’s got problems, and given its current attitude it doesn’t seem in a position to address them itself.

Written by countculture

March 9, 2010 at 3:59 pm

Introducing the Open Election Project: tiny steps towards opening local elections

Update: The Open Election Data project is now live at http://openelectiondata.org.

Here’s a fact that will surprise many: There’s no central or open record of the results of local elections in the UK.

Want to look back at how people voted in your local council elections over the past 10 years? Tough. Want to compare turnout between different areas, and different periods? Sorry, no can do. Want an easy way to see how close the election was last time, and how much your vote might make a difference? Forget it.

It surprised and faintly horrified me (perhaps I’m easily shocked). Go to the Electoral Commission’s website and you’ll see they quickly pass the buck… to the BBC, who just show records of seat numbers, not voting records.

In fact, there is an unofficial database of the election results — held by Plymouth University, and this is what they do (remember we’re in the year 2010 now):

We collect them and then enter them manually into our database. This process is begun in February where we assess which local authority wards are due up for election, continues during March and April when we collect electorates and candidate names and then following the elections in May (or June in some years) we contact the local authorities for their official results.

Not surprisingly, the database is commercial (I guess they have to pay for all that manual work somehow), though they do receive some support from the Electoral Commission, which means that as far as democracy, open analysis and the public record go, it might as well not exist.

There are, of course, records of local election results on local authority websites, but accessible/comparable/reusable they ain’t, nor are they easy to find; and they are in so many different formats that getting the information out of them is near impossible, except manually.

So in the spirit of scratching your own itch (I’d like to put the information on OpenlyLocal.com, and I’m sure lots of other bodies would too, from the BBC to the national press), I came up with a grandiose title and a simple plan: the Open Election Data project, an umbrella project to help local authorities publish their election results in an open, reusable, common format.

I had the idea at the end of the first meeting of the Local Public Data Panel, of which I’m a member and which is tasked with finding ways of opening up local public data. I then did an impromptu session at the UK Gov Barcamp on January 23, and got a great response. After that I had meetings and discussions with all sorts of people: local govt types, local authority CMS suppliers, council webmasters, returning officers and standards organisations. Finally, it was discussed at the 2nd Local Public Data Panel meeting this week, and endorsed there.

So how does it work? Well, the basic idea is that instead of councils writing their election results web pages using some arbitrary HTML (or worse, using PDFs), they use HTML with machine-readable information added into it using something called RDFa, which is already used by many organisations for this purpose (e.g. for the government’s consultations).

This means that pretty much any competent Content Management System should be able to use the technique to expose the data, while still allowing councils to style it as they wish. Here, for example, is what one of Lichfield District Council’s election results pages currently looks like:

[Screenshot: a Lichfield District Council election results page as published today]

And this is what it looks like after it’s had RDFa (and a few more bits of information) added:

[Screenshot: the same results page after RDFa and extra information have been added]

As you can see, apart from the extra info there appears to be no change for the user. The difference, however, is that if you point a machine capable of understanding RDFa at the page, you can slurp up the results, and bingo: suddenly you’ve got an election results database for free, putting local elections on a par with national ones for the first time.
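
To see what that ‘point a machine at it’ step looks like, here’s a sketch using the rdf-rdfa gem. The markup fragment and vocabulary are made up for illustration (the real project will use its own agreed terms); the point is that an RDFa-aware parser hands back the data as triples, with no site-specific scraping code at all.

    require 'rdf/rdfa'  # gem install rdf-rdfa

    # A made-up results fragment: the human-visible HTML carries
    # machine-readable properties via RDFa attributes.
    html = <<-HTML
      <div xmlns:ex="http://example.org/election#"
           xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
           about="#ward-boley-park">
        <span property="ex:wardName">Boley Park</span>
        <span property="ex:turnout" datatype="xsd:integer">1542</span>
      </div>
    HTML

    RDF::RDFa::Reader.new(html, base_uri: 'http://example.org/results') do |reader|
      reader.each_statement do |statement|
        puts [statement.subject, statement.predicate, statement.object].join(' ')
      end
    end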

So where do things go from here?

I’m also presenting this at localgovcamp tomorrow (March 4), and we hope to have some draft local authority election results pages in the weeks shortly afterwards. Although the focus is on getting as many councils as possible to implement this by the local elections on May 6, there’s nothing to stop them using it on existing pages, and indeed we’d encourage them to, so they can get a feel for it and expose those earlier results too. I’m also discussing setting up a Community of Practice to enable council webmasters to discuss implementation.

Finally, many thanks to those who have helped me draw up the various RDFa stuff and/or helped with the underlying idea: especially Jeni Tennison, Paul Davidson from LeGSB, Stuart Harrison of Lichfield District Council, Tim Allen of the LGA, and many more.

Written by countculture

March 3, 2010 at 11:24 pm

David Eaves’ Three Laws of Open Government Data

Mentioned David Eaves’ Three Laws of Open Government Data at yesterday’s excellent Talk About Local unconference, and had a few people asking me what they were and where to find them. So here they are (from http://eaves.ca/2009/09/30/three-law-of-open-government-data/):

The Three Laws of Open Government Data:

  1. If it can’t be spidered or indexed, it doesn’t exist
  2. If it isn’t available in open and machine readable format, it can’t engage
  3. If a legal framework doesn’t allow it to be repurposed, it doesn’t empower

There are also a few other useful links in the comments.

P.S. I don’t know David, but I really like the conciseness of these.

Written by countculture

October 4, 2009 at 10:05 am

Opening Up Local Government Information: APPSI Presentation

Just got back from doing a presentation to the Advisory Panel on Public Sector Information (APPSI).

Though there’s a bit about OpenlyLocal.com (the site I run that opens up local government data and makes it accessible), most of it is about the present and future of local government data, and the obstacles that need to be overcome.

The presentation seemed to go down reasonably well, despite PowerPoint messing up the formatting (it was created in Keynote), and I’m hoping we can start to get some traction on opening up local government data.

I’m embedding it below and making it available under a Creative Commons non-commercial share-alike licence. Comments welcome:

[Embedded slides: Opening Up Local Government Information]

Written by countculture

September 17, 2009 at 4:39 pm
