countculture

Open data and all that

A Local Spending Data wish… granted

The very wonderful Stuart Harrison (aka pezholio), webmaster at Lichfield District Council, blogged yesterday with some thoughts about the publication of spending data following a local spending data workshop in Birmingham. Sadly I wasn’t able to attend this, but Stuart gives a very comprehensive account, and like all his posts it’s well worth reading.

In it he made an important observation about those at the workshop who were pushing for linked data from the beginning, and went on to wish for a solution. First, the observation:

There did seem to be a bit of resistance to the linked data approach, mainly because agreeing standards seems to be a long, drawn out process, which is counter to the JFDI approach of publishing local data… I also recognise that there are difficulties in both publishing the data and also working with it… As we learned from the local elections project, often local authorities don’t even have people who are competent in HTML, let alone RDF, SPARQL etc.

He’s not wrong there. As someone who’s been publishing linked data for some time, and who conceived and ran the Open Election Data project Stuart refers to, working with numerous councils to help them publish linked data, I’m probably as aware of the issues as anyone (ironically, and I think significantly, none of the councils involved in the local government e-standards body, and now pushing so hard for linked data, has actually published any linked data themselves).

That’s not to knock linked data – just to be realistic about the issues and hurdles that need to be overcome (see the report for a full breakdown), and that to expect all the councils to solve all these problems at the same time as extracting the data from their systems, removing data relating to non-suppliers (e.g. foster parents), and including information from other systems (e.g. supplier data, which may be on procurement systems), and all by January, is unrealistic at best, and could undermine the whole process.

So what’s to be done? I think the sensible thing, particularly in these straitened times, is to concentrate on getting the raw data out, and as much of it as possible, and come down hard on those councils who publish it badly (e.g. by locking it up in PDFs or giving it a closed licence), or who wilfully ignore the guidance (it’s worrying how many councils publishing data at the moment don’t even include the transaction ID or date of the transaction, never mind supplier details).

Beyond that we should take the approach the web has always done, and which is the reason for its success: a decentralised, messy variety of implementations and solutions that allows a rich eco-system to develop, with government helping solve bottlenecks and structural problems rather than trying to impose highly centralised solutions that are already being solved elsewhere.

Yes, I’d love it if the councils were able to publish the data fully marked up, in a variety of forms (not just linked data, but also XML and JSON), but the ugly truth is that not a single council has so far even published their list of categories, never mind matched it up to a recognised standard (CIPFA BVACOP, COFOG or that used in their submissions to the CLG), still less done anything like linked data. So there’s a long way to go, and in the meantime we’re going to need some tools and cheap commodity services to bridge the gap.

[In a perfect world, maybe councils would develop some open-source tools to help them publish the data, perhaps using something like Adrian Short’s Armchair Auditor code as the basis (this is a project that took a single council, Windsor & Maidenhead, and added a web interface to the figures). However, when many councils don’t even have competent HTML skills (having outsourced much of it), this is only going to happen at a handful of councils at best, unless considerable investment is made.]

Stuart had been thinking along similar lines, and made a suggestion, almost a wish in fact:

I think the way forward is a centralised approach, with authorities publishing CSVs in a standard format on their website and some kind of system picking up these CSVs (say, on a monthly basis) and converting this data to a linked data format (as well as publishing in vanilla XML, JSON and CSV format).

He then expanded on the idea, talking about a single URL for each transaction, standard identifiers, “a human-readable summary of the data, together with links to the actual data in RDF, XML, CSV and JSON”. I’m a bit iffy about that ‘centralised approach’ phrase (the web is all about decentralisation), but I do think there’s an opportunity to help both the community and councils by solving some of these problems.
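
To make the pipeline Stuart describes a little more concrete, here is a minimal sketch in Python (standard library only) of taking a council's CSV and re-publishing the same rows as JSON and XML. The column names and sample rows are assumptions made up for illustration; they are not an agreed standard, and this is not the code OpenlyLocal runs.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data; real council files use varying headings.
SAMPLE_CSV = """transaction_id,date,supplier,amount
1001,2010-06-14,Acme Paving Ltd,1250.00
1002,2010-06-20,Example Catering,310.50
"""


def read_transactions(csv_text):
    """Parse CSV text into a list of dicts, one per transaction."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def to_json(transactions):
    """Serialise the transactions as a JSON array of objects."""
    return json.dumps(transactions, indent=2)


def to_xml(transactions):
    """Serialise the transactions as simple XML, one element per field."""
    root = ET.Element("transactions")
    for row in transactions:
        item = ET.SubElement(root, "transaction")
        for field, value in row.items():
            ET.SubElement(item, field).text = value
    return ET.tostring(root, encoding="unicode")


transactions = read_transactions(SAMPLE_CSV)
json_output = to_json(transactions)
xml_output = to_xml(transactions)
```

The point of the sketch is that once the data exists as CSV, the other "vanilla" formats are cheap to generate mechanically; a linked data representation would additionally need an agreed vocabulary.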

And that’s exactly what we’ve done at OpenlyLocal, adding the data from all the councils who’ve published their spending data, acting as a central repository, generating the URLs, and connecting the data together to other datasets and identifiers (councils with Snac IDs, companies with Companies House numbers). We’ve even extracted data from those councils who unhelpfully try to lock up their data as PDFs.

There are at time of writing 52,443 financial transactions from 9 councils in the OpenlyLocal database. And that’s not all. There are also the following features:

  • Each transaction is tied to a supplier record for the council, and increasingly these are linked to company info (including their company number), or other councils (there’s a lot of money being transferred between councils), and users can add information about the supplier if we haven’t matched it up.
  • Every transaction, supplier and company has a permanent unique URL and is available as XML and JSON.
  • We’ve sorted out some of the date issues (adding a date fuzziness field for those councils who don’t specify when in the month or quarter a transaction relates to).
  • Transactions are linked to the URL from which the file was downloaded (and usually the line number too, though obviously this is not possible if we’ve had to extract it from a PDF), meaning anyone else can recreate the dataset should they want to.
  • There’s an increasing amount of analysis, showing ordinary users spending by month, biggest suppliers and transactions, for example.
  • The whole spending dataset is available as a single, zipped CSV file to download for anyone else to use.
  • It’s all open data.
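
As an illustration of the date fuzziness and provenance points in the list above, the sketch below shows one way such a record could be represented. The field names, the mid-month convention and the example URL are all assumptions for the sake of the example, not OpenlyLocal's actual schema.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Transaction:
    value: float
    best_guess_date: date       # the date we store when the exact day is unknown
    date_fuzziness: int         # max days the real date may differ from best_guess_date
    source_url: str             # file the row was downloaded from
    source_line: Optional[int]  # None when the data had to be extracted from a PDF


def month_to_fuzzy_date(year, month):
    """Return (mid-month date, fuzziness in days) for a transaction known only by month."""
    days_in_month = calendar.monthrange(year, month)[1]
    midpoint = date(year, month, (days_in_month + 1) // 2)
    # The real date could be anywhere in the month, so the fuzziness is the
    # larger of the distances from the midpoint to the first and last day.
    fuzziness = max(midpoint.day - 1, days_in_month - midpoint.day)
    return midpoint, fuzziness


best_guess, fuzziness = month_to_fuzzy_date(2010, 6)
txn = Transaction(
    value=1250.0,
    best_guess_date=best_guess,
    date_fuzziness=fuzziness,
    source_url="http://example.gov.uk/spending/june-2010.csv",  # hypothetical URL
    source_line=42,
)
```

Storing the source URL and line number alongside each record is what lets someone else trace a figure back to the published file and rebuild the dataset independently.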

There are a couple of features Stuart mentions that we haven’t yet implemented, for good reason.

First, we’re not yet publishing it as linked data, for the simple reason that the vocabulary hasn’t yet been defined, nor even the standards on which it will be based. When this is done, we’ll add this as a representation.

And although we use standard identifiers such as SNAC ids for councils (and wards) on OpenlyLocal, the URL structure Stuart mentions is not yet practical, in part because SNAC ids don’t cover all authorities (they don’t include the GLA, or other public bodies, for example), and in part because only a tiny fraction of councils are publishing their internal transaction ids.

Also, we haven’t yet implemented comments on the transactions, for the simple reason that distributed comment systems such as Disqus are JavaScript-based and thus are problematic for those with accessibility issues, and site-specific ones don’t allow the conversation to be carried on elsewhere (we think we might have a solution to this, but it’s at an early stage, and we’d be interested to hear other ideas).

But all in all, we reckon we’re pretty much there with Stuart’s wish list, and would hope that councils can get on with extracting the raw data, publishing it in an open, machine-readable format (such as CSV), and then move to linked data as their resources allow.

Written by countculture

August 3, 2010 at 7:45 am

25 Responses

  1. Hi,

    I’m a local councillor and cabinet member. Came across your post by accident when looking for info about how other councils are doing this. Not a techy at all; really don’t understand what the problem with pdfs is. That’s how my council are going to put our data up. You seem very opposed to this – can you let me know why? Your post seems to advocate CSV information on websites – so that’s just Excel, right? So what’s the difference in turning an Excel page into a PDF and putting it up?

    For info – we’re a smallish council facing 40% cuts this year. We have one web author in the comms department, not sure they have any technical expertise, and run our site on a CMS. Anything we needed to do beyond the ordinary would be the subject of a budget bid – and, to be honest, new spend isn’t really getting approved right now. Have no idea at all how to judge what the right thing to do is.

    Thanks for any thoughts.

    Cllr AB

    August 3, 2010 at 8:07 am

    • Great post – thanks for helping answer a lot of the questions I have about linked data and how difficult it will be for councils, with small web teams, to implement. The JFDI approach is right and councils are likely to procrastinate if you give them too many things to think about!

      Cllr AB – PDFs are great to read and print, but can’t be read by machines. When developers make an application it needs to be able to automatically interrogate the data. PDFs make this almost impossible. CSV is a format that works and that Excel can output, but it’s not “Excel” (.xls). This is important because some people don’t use Microsoft and/or the Microsoft (.xls) format causes problems for developers writing applications.

      It does not take any specialist skill to upload a CSV file. There is no harm in publishing in PDF as well, but don’t make it the only format you use.

      B

      August 3, 2010 at 10:46 am

  2. […] This post was mentioned on Twitter by Dave Briggs, Chris Taggart. Chris Taggart said: A Local Spending Data wish… granted: The very wonderful Stuart Harrison (aka pezholio), webmaster at Lichfield Dis… http://bit.ly/a7XYC4 […]

  3. Cllr AB

    Thanks for the comment. The problem is that PDFs are designed to be viewed on a computer screen or printed out, and even then are very problematic for people with accessibility issues.

    Getting the ‘data’ out of them is a slow, painful, problematic manual process that often results in errors. So when you say you’re intending to ‘put our data up’, you’re actually saying we’ll put it in a format that makes the data very difficult to get.

    A CSV file on the other hand, is designed as a format for storing data, which is why it’s so easily read by and imported into Excel and other spreadsheet programs, Google docs, database programs and websites like OpenlyLocal and visualisation sites such as Many Eyes.

    As you say, it’s easy to get CSV out of Excel, so why not do that? I can’t see it would take any more time or expertise to do this than make a PDF out of the same data, nor should it take any more time to put it on your website.

    Hope this helps,
    Chris

    countculture

    August 3, 2010 at 10:37 am

  4. Hello Chris you have raised some good points.

    “That’s not to knock linked data – just to be realistic about the issues and hurdles that need to be overcome (see the report for a full breakdown), and that to expect all the councils to solve all these problems at the same time as extracting the data from their systems, removing data relating to non-suppliers (e.g. foster parents), and including information from other systems (e.g. supplier data, which may be on procurement systems), and all by January, is unrealistic at best, and could undermine the whole process.”

    Over 90 Local Government Councils use Agresso Business World (ABW) from Unit 4 Business Software and we have provided them with the ability to extract and store as open linked data. The data extract from RB Windsor & Maidenhead was created using Agresso Business World. In fact all users of Agresso Business World have the ability to extract and store as open linked data with no technical knowledge required. The extracted data is stored in a SPARQL compliant endpoint and available for querying. From a Unit 4 perspective at least the January deadline is easily achievable if you are using Agresso Business World…

    {Link to Unit 4 press release}

    I agree that there will be differences in how Councils store their data. Currently we build the ontology on the fly as the data is extracted. The key here is for Councils to have a common approach to storing supplier information, for example ProClass, http://websites.uk-plc.net/Coding_International_ltd/Proclass-31509.htm. Either way it is relatively simple to mash data and even update at a later date. Part of the open nature of the exercise is the feedback loop which will coerce data exporters into providing better data. For “better”, read “common structure, easily compared”!

    Whilst extracting as CSV will allow the Councils to tick the box it is not really in the spirit of the exercise. Open linked data is relatively simple and the tools are being provided.

    Hope this is of interest & use.

    Anwen Robinson

    August 4, 2010 at 8:09 am

    • Anwen
      Not sure the best way of advancing the debate is pasting in the text of a press release or sales brochure. Also the Windsor and Maidenhead data is not (as far as I’m aware) being published as linked data, only as CSV files which are not only missing much of the key data but are inconsistent, with both the content and field names changing from file to file. So not a great ad for your system, if this indeed is being used to produce these files.
      C

      countculture

      August 4, 2010 at 10:07 am

  5. Cllr AB – as others have pointed out – once you’ve got a publishable set of data you have to click a button to make a PDF and you have to click another button to make a CSV. Not many councils are publishing their expenditure data yet, but of those that do many are publishing a CSV and a PDF at the same time, so the machines can read the CSV and, if humans want to, they can read the PDF.

    Here’s the GLA’s as an example.

    Ingrid Koehler

    August 4, 2010 at 8:34 am

    • Thanks B, countculture and Ingrid. So I just need to tell our people to put it up as a CSV file, I get that, though I don’t really (sorry) know why – surely the main use of it is for people to read it? Can’t really understand why it needs to be machine-readable…

      Next question: is there a recommended standard for the headings to use etc? I see you’ve mentioned that Windsor council use different field headings and so on: they are the council that are always used to beat us with when a DCLG official writes a letter telling us to do this thing, so if even they are not getting it right, who is?

      Cllr AB

      August 5, 2010 at 11:57 am

      • Cllr AB
        Why data rather than something you can only read? Well, think about when your finance department send their reports to DCLG as Excel spreadsheets – why do they do it as a spreadsheet rather than a PDF (or even on paper)? When you get residents to fill in a form on your website that integrates with your back office systems, why do that rather than just sending an email which needs to be retyped? It’s because when the information is available as data rather than just human-readable text we can do a lot more with it, and remove extra human work which does nothing but add cost and time.
        Re recommended headings, I expect these to be sorted in the next month or so, but in the meantime by all means get your dept to contact me. I’m actually on holiday at the moment so won’t respond immediately but happy to help them out when I get back.
        C

        countculture

        August 5, 2010 at 1:05 pm

      • Hi Cllr AB – I hope you’re checking responses! Although there wasn’t a suggested format for what to publish – there is now.

        Here’s a consultation on what to publish. This is to help us achieve consistency and greater transparency on council expenditure data.

        Ingrid Koehler

        August 27, 2010 at 8:12 am

  6. Correct, RBWM is not as yet in linked format, but we have just made a utility available to enable our customers to publish in linked RDF format – free of charge. The purpose of linking the press release was to prove we are very serious about this – it got good coverage in Public Technology last week.

    The value of linked data (and indeed of any open data) is limited by the lack of standards strictly defining its meaning. BUT I totally agree that if people wait for standards nothing will happen. Even when some sort of standard evolves by which people will describe their data it becomes too difficult and expensive to implement it. The whole beauty of Open Linked Data is that it does facilitate the “Just do it” (I’ll drop the F) approach to publishing data.

    Publishing data as CSV is much better than pdf files as people have said. We have made it possible for our customers to convert CSV to RDF/XML as they can with xls output from Agresso browser screens. The RDF/XML format will have made the data more accessible but, of course, will not have addressed the standards / naming issue any more than it was addressed in the CSV file.

    So … I accept your comments … we are looking to make a difference and to be part of the collaborative engagement and we are always prepared for constructive feedback.

    Anwen Robinson

    August 4, 2010 at 11:19 am

    • Anwen
      It’s not just that it’s not in linked format, but the CSV is all over the place. I’d start with getting that corrected first

      countculture

      August 4, 2010 at 11:40 am

  7. Please expect a comprehensive response from Richard Murray – an expert in these areas and much more able to have a detailed technical debate than myself 🙂

    Anwen Robinson

    August 4, 2010 at 2:58 pm

  8. I am aware Richard has posted but it is yet to appear ………

    Anwen Robinson

    August 4, 2010 at 4:54 pm

    • I know Chris is on holiday at the moment, so he won’t have got round to moderating it. (I’m not even sure this will appear!)

      Stuart Harrison

      August 5, 2010 at 9:36 am

    • There are no pending comments in the moderation queue, so perhaps he’d like to have another go. Would be good if he kept it short 😉
      C

      countculture

      August 5, 2010 at 10:27 am

  9. I think the best advancement of debate is to debate using accurate facts and the fact is Unit 4 have provided a way for all users of Agresso Business World (ABW) to produce open linked data. How long is it before users take advantage of this functionality? That is a different debate…

    You mentioned in your post that the January deadline was “unrealistic at best” and I disagree. The issue is the will to extract rather than the technology used to extract. How simple is it to generate a table using Excel!? However there are issues with extracting responsibly as not all suppliers are suppliers in the accepted sense of a commercial entity.

    The data provided by RB Windsor & Maidenhead is extracted from ABW using standard Browser functionality. That does not mean that it is coherent. However the functionality is available to users to extract this information.

    There is already a Local Government-owned procurement classification standard known as ProClass:

    http://websites.uk-plc.net/ProClass/ProClass-FAQs-33242.htm

    “Q. Why was it introduced?

    A. In simple terms, to provide a more meaningful breakdown on council expenditure using a sensible hierarchy of headings. This would also support the comparison of third party expenditure information on a like for like basis between UK councils, subregions and regions using one common standard. This now means that initiatives designed to reduce expenditure can be compared and monitored between council.”

    If two councils adhered to this for their procurement would we not be a lot closer to our goal of comparing open linked data?

    I wrote the open linked data extract routines for ABW and have consistently drawn a blank on what to extract because the community at large get hung up on ontologies and vocabularies. That isn’t a dig but an observation. Sometimes the first pass needs to be “dirty” because it is the catalyst for further development. I understand that they are important but the data is still meaningful without a ratified ontology which has taken ages to pass through a forum. Get the data out there and let the people using it debate the pros and cons! Then we can repeatedly refine what we extract. If it isn’t good enough they will shout until it is.

    I have taken the liberty of downloading the spending.csv file from OpenlyLocal and put it through our converter with default settings. The resultant RDF of 897822 statements has been uploaded to one of our development stores and is available for SPARQL querying at:

    http://ec2-75-101-194-147.compute-1.amazonaws.com:8081/sparql/

    I have SPARQL queries available if required?

    Now OpenlyLocal is available as open linked data and can be queried. Let’s see what people can do with it…

    Dick Murray

    August 5, 2010 at 10:59 am

  10. Just what I was going to say Dick 🙂

    Anwen Robinson

    August 6, 2010 at 1:52 am

  11. For authorities & wards without SNACs, why not assign a temporary, unique value (beginning with “Z”, say, or “99”)?

  12. […] A Local Spending Data wish… granted « countculture (tags: blog data local opendata uk) […]

  13. […] A Local Spending Data wish… granted « countculture And that’s exactly what we’ve done at OpenlyLocal, adding the data from all the councils who’ve published their spending data, acting as a central repository, generating the URLs, and connecting the data together to other datasets and identifiers (councils with Snac IDs, companies with Companies House numbers). We’ve even extracted data from those councils who unhelpfully try to lock up their data as PDFs. (tags: opendata blog local data comment money) […]

  14. […] the charities in the UK and their registration numbers so that I could try to match them up to the local council spending data OpenlyLocal is aggregating and trying to make sense of. A fairly simple request, you’d think, especially in this new world of transparency and open […]

  15. […] really had any concrete situations or data to deal with, but from speaking to councils, and actually using the data it became clear the something much firmer was […]

  16. […] OpenlyLocal started pulling in council spending data, it’s niggled at me that it’s only half the story. Yes, as more and more data is […]

