I had a fantastic response to the launch of OpenCharities — my little side project to open up the Charity Commission’s Register of Charities — from individuals, from organisations representing the third sector, and from charities themselves.
There were also a few questions:
- Could we pull out and expose via the API more info about the charities, especially the financial history?
- How often would OpenCharities be updated and what about new charities added after we’d scraped the register?
- Was there any possibility that we could add additional information from sources other than the Charity Register?
So, over the past week or so, we’ve been busy trying to answer those questions the best we could, mainly by just trying to get on and solve them.
First, additional info. After a terrifically illuminating meeting with Karl and David from NCVO, I had a much better idea of how the charity sector is structured, and what sort of information would be useful to people.
So the first thing I did was to rewrite the scraper and parser to pull in a lot more information, particularly the past five years’ income and spending and, for bigger charities, the breakdown of that income and spending. (I also pulled in the remaining charities that had been missed the first time around, including removed charities.) Here’s what the NSPCC’s entry, for example, looks like now:
We are also now getting the list of trustees, and links to the accounts and Summary Information Returns, as there’s all sorts of goodness locked up in those PDFs.
However, while we were running through all these charities, we wondered if any of them had social networking info easily available (i.e. on their front page). It turns out some of the bigger ones did, and so we visited their sites and pulled out that info (it’s fairly easy to look for links for twitter/facebook/youtube etc on a home page). Here’s an example of the social networking info, again for the NSPCC.
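For anyone wanting to try something similar, here is a minimal sketch of looking for social networking links on a home page. It is not the code OpenCharities runs: the `SOCIAL_HOSTS` table, the `SocialLinkFinder` class and the `find_social_links` function are all names made up for illustration, and it assumes you have already fetched the page's HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical table of hostnames we care about; extend as needed.
SOCIAL_HOSTS = {
    "twitter.com": "twitter",
    "facebook.com": "facebook",
    "youtube.com": "youtube",
}


class SocialLinkFinder(HTMLParser):
    """Collects the first link found for each known social network."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Relative links have no hostname, so they fall through harmlessly.
        host = (urlparse(href).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        network = SOCIAL_HOSTS.get(host)
        if network and network not in self.found:
            self.found[network] = href


def find_social_links(html):
    """Return a dict of network name -> first matching link in the HTML."""
    finder = SocialLinkFinder()
    finder.feed(html)
    return finder.found
```

Keeping only the first link per network is a deliberate simplification; a real home page may link to several accounts, or to "share this" widgets that would need filtering out.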
[Incidentally, doing this threw up some errors in the Charity Register, most commonly websites that are listed as http://http://some.charity.org.uk, which in itself shows the benefit of opening up the data. All we need now is a way of communicating that to the Charity Commission.]
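The doubled-scheme error is easy enough to repair before visiting a site. This is just a sketch of one way to do it (the function name `fix_doubled_scheme` is made up, and it assumes the only problem is a repeated `http://` or `https://` prefix):

```python
import re

# One or more scheme prefixes followed by a final scheme prefix.
_SCHEME_RUN = re.compile(r"^((?:https?://)+)(https?://)", re.IGNORECASE)


def fix_doubled_scheme(url):
    """Collapse a repeated scheme prefix, keeping only the last one."""
    return _SCHEME_RUN.sub(r"\2", url.strip())
```

A URL with a single scheme passes through unchanged, so it is safe to run over every website in the register.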
We also (after way too many hours wasted messing around with cookies and hidden form fields) figured out how to get the list of charities recently added, with the result that we can check every night for new charities added in the past 24 hours, and add those to the database.
This means not only can we keep OpenCharities up to date, it also means we can offer an RSS feed of the latest charities. And if that’s updated a bit too frequently for you (some days there are over 20 charities added), you can always restrict it to a given search term, e.g. http://OpenCharities/charities.rss?term=children for those charities with children in the title.
Finally, we’ve been looking at what other datasets we could link with the register, and I thought a good one might be the list of grants given out by the various National Lottery funding bodies (which fortunately had already been scraped by the very talented Julian Todd using ScraperWiki).
Then it was a fairly simple matter of tying together the recipients with the register, and voila, you have something like this:
Note, at the time of writing, the import and match of the data is still going on, but should be finished by the end of today.
We’ll also add some simple functionality to show payments from local councils that are being published in the local council spending data. The information’s already in the database (and is actually shown on the OpenlyLocal page for the charity); I just haven’t got around to displaying it on OpenCharities yet. Expect that to appear in the next day or so.
A couple of weeks ago Will Perrin and I, along with some feedback from the Local Public Data Panel on which we sit, came up with some guidelines for publishing local spending data. They were a first draft, based on a request by Camden council for some guidance, in light of the announcement that councils will have to start publishing details of spending over £500.
Now I’ve got strong opinions about standards: they should be developed from real world problems, by the people using them and should make life easier, not more difficult. It slightly concerned me that in this case I wasn’t actually using any of the spending data – mainly because I hadn’t got around to adding it in to OpenlyLocal yet.
This week, I remedied this, and pulled in the data from those authorities that had published their local spending data – Windsor & Maidenhead, the GLA and the London Borough of Richmond upon Thames. Now there’s a couple of sites (including Adrian Short’s Armchair Auditor, which focuses on spending categories) already pulling the Windsor & Maidenhead data but as far as I’m aware they don’t include the other two authorities, and this adds a different dimension to things, as you want to be able to compare the suppliers across authorities.
First, a few pages from OpenlyLocal showing how I’ve approached it (bear in mind they’re a very rough first draft, and I’m concentrating on the data rather than the presentation). You can see the biggest suppliers to a council right there on the council’s main page (e.g. Windsor & Maidenhead, GLA, Richmond):
Clicking through to more info gets you a paginated view of all suppliers (in Windsor & Maidenhead’s case there are over 2800 so far):
Clicking any of these will give you the details for that supplier, including all the payments to them:
And clicking on the amount will give you a page just with the transaction details, so it can be emailed to others.
But we’re getting ahead of ourselves. The first job is to import the data from the CSV files into a database and this was where the first problems occurred. Not in the CSV format, which is not a problem, but in the consistency of the data.
Take Windsor & Maidenhead (you should just be able to open these files in any spreadsheet program). Look at each data set in turn and you find that there’s very little consistency – the earliest sets don’t have any dates and aggregate across a whole quarter (but do helpfully have the internal Supplier ID as well as the supplier name). Later sets have the transaction date (although in one the US date format is used, which could catch out those not looking at them manually), but omit supplier ID and cost centre.
On the GLA figures, there’s a similar story, with the type of data and the names used to describe it changing seemingly randomly between data sets. Some of the 2009 ones do have transaction dates, but the 2010 ones generally don’t, and the supplier field has different names, from Supplier to Supplier Name to Vendor.
This is not to criticise those bodies – it’s difficult to produce consistent data if you’re making the rules up as you go along (and given there weren’t any established precedents that’s what they were doing), and doing much of it by hand. Also, they are doing it first and helping us understand where the problems lie (and where they don’t). In short, they are failing forward – getting on with it so they can make mistakes from which they (and crucially others) can learn.
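To make that concrete, here is a rough sketch of how an importer might cope with the shifting column names and date formats. It is not the OpenlyLocal import code. Only the three supplier headings (Supplier, Supplier Name, Vendor) come from the files described above; the `date` column name, the function names and the idea of passing an explicit date format per file are my assumptions for the example.

```python
import csv
import io
from datetime import datetime

# The three supplier headings seen in the GLA files, mapped to one name.
HEADER_ALIASES = {
    "supplier": "supplier",
    "supplier name": "supplier",
    "vendor": "supplier",
}


def normalise_header(name):
    """Lower-case and trim a column heading, then apply any known alias."""
    key = name.strip().lower()
    return HEADER_ALIASES.get(key, key)


def read_spending_rows(csv_text, date_format=None, date_column="date"):
    """Yield rows as dicts with normalised headings and ISO dates.

    date_format is given per file (e.g. "%d/%m/%Y" or "%m/%d/%Y") rather
    than guessed, because 03/04/2010 is valid in both UK and US formats.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for raw in reader:
        row = {
            normalise_header(key): (value or "").strip()
            for key, value in raw.items()
            if key is not None  # ignore cells beyond the header row
        }
        if date_format and row.get(date_column):
            parsed = datetime.strptime(row[date_column], date_format)
            row[date_column] = parsed.date().isoformat()
        yield row
```

The point of making the date format an explicit argument is exactly the US-format trap mentioned above: an ambiguous date will parse without error either way, so somebody has to look at each file and say which it is.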
But who are these suppliers?
The bigger problem, as I’ve said before, is being able to identify the suppliers, and this becomes particularly acute when you want to compare across bodies (who may name the same company or body slightly differently). Ideally (as we put in the first draft of the proposals), we would have the company number (when we’re talking about a company, at any rate), but we recognised that many accounts systems simply won’t have this information, and so we do need some information that helps us identify them.
Why do we want to know this information? For the same reason we want any ID (you might as well ask why Companies House issues Company Numbers and requires all companies to put that number on their correspondence) – to positively identify something without messing around with how someone has decided to write the name.
With the help of the excellent Companies Open House I’ve had a go at matching the names to company numbers, but it’s only been partially successful. When it is, you can do things like this (showing spend with other councils on a supplier’s page):
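To give a flavour of why name matching is only ever partially successful, here is a simplified sketch of the general approach: reduce both names to a canonical form and look for an exact match. This is an illustration, not how Companies Open House or OpenlyLocal actually does it; the suffix table and function names are invented for the example.

```python
import re

# Assumed table of legal-form words folded to a single spelling.
SUFFIXES = {"limited": "ltd", "ltd": "ltd", "plc": "plc", "llp": "llp"}


def normalise_company_name(name):
    """Lower-case, spell out '&', drop punctuation, fold legal suffixes."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return " ".join(SUFFIXES.get(word, word) for word in name.split())


def match_suppliers(supplier_names, companies):
    """Map each supplier name to a company number, or None if unmatched.

    companies is a dict of registered company name -> company number.
    """
    index = {normalise_company_name(n): num for n, num in companies.items()}
    return {s: index.get(normalise_company_name(s)) for s in supplier_names}
```

Anything this doesn't catch (trading names, abbreviations, misspellings) comes back as `None`, which is precisely the gap a company number in the published data would close.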
It’s also going to allow me to pull in other information about the company, from Companies House and elsewhere. For other bodies (i.e. those without a company number), we’re going to have to find another way of identifying them, and that’s next on the list to tackle.
Thoughts on those spending data guidelines
In general I still think they’re fairly good, and most of the shortcomings have been identified in the comments, or emailed to us (we didn’t explicitly state that the data should be available under an open licence such as the one at data.gov.uk, and we definitely should have done). However, adding this data to OpenlyLocal (as well as providing a useful database for the community) has crystallised some thoughts:
- Identification of the bodies is essential, and I think we were right to make this a key point, but it’s likely we will need to have the government provide a lookup table between VAT numbers and Company Numbers.
- Speaking of Government datasets, there’s no way of finding out the ancestry of a company – what its parent company is, what its subsidiaries are, and that’s essential if we’re to properly make use of this information, and similar information released by the government. Companies House bizarrely doesn’t hold this information, but the Office For National Statistics does, and it’s called the Inter Departmental Business Register. Although this contains a lot of information provided in confidence for statistical reasons, the relationships between companies aren’t confidential (the information just isn’t stored in one place), so it would be perfectly feasible to release this information.
- We should probably be explicit whether the figures should include VAT (I think the Windsor & Maidenhead ones don’t include it, but the GLA imply that theirs might).
- Categorisation is going to be a tricky one to solve, as can be seen from the raw data for Windsor & Maidenhead – for example the Children’s Services Directorate is written as both Childrens Services & Children’s Services, and it’s not clear how this, or the subcategories, ties into standard classifications for government spending, making comparison across authorities tricky.
- I wonder what would be the downside to publishing the description details, even, potentially, the invoice itself. It’s probably FOI-able, after all.
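On the categorisation point in the list above, the Childrens Services / Children’s Services kind of inconsistency can at least be spotted mechanically. A small sketch (the function names are mine, and it only handles apostrophes, spacing and case, not the harder problem of mapping to standard classifications):

```python
from collections import defaultdict


def category_key(label):
    """Strip straight and curly apostrophes, collapse spaces, ignore case."""
    cleaned = label.replace("'", "").replace("\u2019", "")
    return " ".join(cleaned.split()).casefold()


def group_variants(labels):
    """Return only those keys that were written in more than one way."""
    groups = defaultdict(set)
    for label in labels:
        groups[category_key(label)].add(label)
    return {key: sorted(v) for key, v in groups.items() if len(v) > 1}
```

Running `group_variants` over the directorate column of a data set gives a list of labels that probably mean the same thing, which is a starting point for a human to tidy up.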
As ever, comments welcome, and of course all the data is available through the API under an open licence.
Earlier today I gave a presentation at the Open Knowledge Conference on open local data, OpenlyLocal and the Open Election Data project. It was a slight update of the talk I gave to the Manchester Social Media Cafe earlier in the month, and one of the key additions was a simple idea I added on the final page, which was about where we should go from here.
I’d been using the idea in conversation for the past few months (and I’m sure I didn’t invent it), but it seemed to resonate with the audience, so I thought it was worth repeating as a short blog post, and it’s this:
When dealing with government, with organizations, with public officials, with outsourcing companies we need to develop the meme:
Are you an enabler or a blocker?
It’s a blunt and somewhat unsophisticated weapon, but in the past few months of doing the Open Election Data project, it seems to have been far more effective than any other I’ve tried: better than appealing to the public good, better than engaging on an intellectual level, better than asking for it nicely, better even than talking about potential savings.
Maybe it’s because, as someone suggested to me after the first meeting of the UK government’s Local Public Data Panel on which I sit, civil servants and other public officials only do things because there’s a benefit to them (or a downside if they don’t). [I'm not sure they're any different than most people working in the private sector in this respect, by the way.] I don’t know, and I don’t really care. What I do care about is getting things done, and this seems to be working for me.
So, I offer it out there, not as an original idea (I’m sure it isn’t), but as a suggestion, both as a way of engaging with public bodies and as a method of dealing with problems.
When you come across people or organisations, give them the option: do you want to be an enabler or a blocker? If you’re an enabler, great, let’s see how we can make this work; if you’re a blocker, fine also: now we know, and we’ll just go around you and get on with it anyway.