countculture

Open data and all that

How to help build the UK’s open planning database: writing scrapers


This post is by Andrew Speakman, who’s coordinating OpenlyLocal’s planning application work.

As Chris wrote in his last post announcing OpenlyLocal’s progress in building an open database of planning applications, we can handle the importing from the main planning systems ourselves, but if we’re really going to cover the whole country we’re going to need the community’s help. I’m going to be coordinating this effort, so I thought it would be useful to explain how we’re going to do it (you can contact me at planning@openlylocal.com).

First, we’re going to use the excellent ScraperWiki as the main platform for writing external scrapers. It supports Python, Ruby and PHP, and has worked well for similar schemes. It also means the scraper is openly available and we can see it in action. We will then use the ScraperWiki API to upload the data regularly into OpenlyLocal.
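As a rough illustration of that last step, here is how a scraper’s data can be pulled over the classic ScraperWiki datastore API (Python 2, as used on ScraperWiki at the time). The scraper name and query are invented for the example, and the exact endpoint parameters should be checked against the ScraperWiki docs rather than taken on trust:

    import json
    import urllib
    import urllib2

    # Query a scraper's SQLite datastore over the ScraperWiki web API and
    # get the matching rows back as a list of JSON dictionaries.
    params = urllib.urlencode({
        'format': 'jsondict',
        'name': 'nuneaton_and_bedworth_planning',  # hypothetical scraper name
        'query': "select * from swdata where date_scraped >= '2012-03-28'",
    })
    url = 'https://api.scraperwiki.com/api/1.0/datastore/sqlite?' + params
    applications = json.load(urllib2.urlopen(url))
    print '%d applications fetched' % len(applications)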

Second, we’re going to break the job into manageable chunks by focusing on target groups of councils, and just to sweeten things – as if building a national open database of planning applications wasn’t enough 😉 – we’re going to offer small bounties (£75) for successful scrapers for these councils.

We have some particular requirements designed to keep the system maintainable and do things the right way, but few of them are set in stone, so feel free to respond with suggestions if you want to do things differently.

For example, the scraper should keep itself current (running on a daily basis), but also behave nicely (not putting an excessive load on ScraperWiki or the target website by trying to get too much data in one go). In addition, we propose that the scrapers should operate by updating current applications daily and also making inroads into the backlog by gathering a batch of previous applications on each run.

We have set up three example scrapers that operate in the way we expect: Brent (Ruby), Nuneaton and Bedworth (Python) and East Sussex (Python). These scrapers perform four operations (sketched in code after this list), as follows:

  1. Create new database records for any new applications that have appeared on the site since the last run, and store the identifiers (uid and url).
  2. Create new database records for a batch of missing older applications, and store the identifiers (uid and url). Currently the scrapers are set up to work backwards from the earliest stored application towards a target date in the past.
  3. Update the most current applications by collecting and saving the full application details. At the moment the scrapers update the details of all applications from the past 60 days.
  4. Update the full application details of a batch of older applications where the uid and url have been collected (as above) but the application details are missing. At the moment the scrapers work backwards from the earliest “empty” application towards a target date in the past.
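To make that concrete, here is a minimal Python sketch of a daily run performing those four operations against the classic ScraperWiki datastore. The search URL, the batch size, the backlog target and both parse_* helpers are hypothetical placeholders – every council site needs its own versions – so treat this as a shape to copy, not working code for any particular authority:

    import datetime
    import scraperwiki

    SEARCH_URL = 'http://planning.example.gov.uk/search?day=%s'  # hypothetical
    BACKLOG_TARGET = '2007-01-01'  # how far back to chip away at the backlog
    BATCH = 25                     # modest batches keep the load polite

    def parse_search_results(html):
        """Site-specific: yield {'uid': ..., 'url': ..., 'start_date': ...} stubs."""
        raise NotImplementedError

    def parse_application_page(html):
        """Site-specific: return the full details dict for one application."""
        raise NotImplementedError

    def save(record):
        # 'uid' is the primary key, so re-saving a uid updates the row in place
        scraperwiki.sqlite.save(unique_keys=['uid'], data=record)

    def scrape_day(day):
        # Operations 1 and 2: store identifier stubs for one day's applications
        for stub in parse_search_results(scraperwiki.scrape(SEARCH_URL % day)):
            save(stub)

    def update_details(rows):
        # Operations 3 and 4: fetch and save the full details for each stub
        for row in rows:
            details = parse_application_page(scraperwiki.scrape(row['url']))
            details.update(uid=row['uid'], url=row['url'],
                           date_scraped=datetime.datetime.utcnow().isoformat())
            save(details)

    today = datetime.date.today()

    # 1. New applications that have appeared since the last run
    scrape_day(today)

    # 2. A batch of older applications, working back towards BACKLOG_TARGET
    earliest = scraperwiki.sqlite.select(
        "min(start_date) as d from swdata")[0]['d'] or str(today)
    day = datetime.datetime.strptime(earliest, '%Y-%m-%d').date()
    for _ in range(BATCH):
        day -= datetime.timedelta(days=1)
        if str(day) < BACKLOG_TARGET:
            break
        scrape_day(day)

    # 3. Refresh the details of all applications from the past 60 days
    update_details(scraperwiki.sqlite.select(
        "uid, url from swdata where start_date >= '%s'"
        % (today - datetime.timedelta(days=60))))

    # 4. Fill in a batch of older 'empty' records (identifiers but no details)
    update_details(scraperwiki.sqlite.select(
        "uid, url from swdata where date_scraped is null "
        "order by start_date desc limit %d" % BATCH))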

The data fields to be gathered for each planning application are defined in this shared Google spreadsheet. Not all the fields will be available on every site, but we want all those that are there.

Note the following:

  • The minimal valid set of fields for an application is: ‘uid’, ‘description’, ‘address’, ‘start_date’ and ‘date_scraped’
  • The ‘uid’ is the database primary key field
  • All dates (except ‘date_scraped’) should be stored in ISO 8601 format (e.g. 2012-03-29)
  • The ‘start_date’ field is set to the earlier of the ‘date_received’ and ‘date_validated’ fields, or to whichever one is available
  • The ‘date_scraped’ field is a date/time (RFC 3339) set to the current time when the full application details are updated. It should be indexed.
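For illustration, a record containing just the minimal valid field set might look like this (all values are invented):

    # A minimal valid record under the rules above; values are made up.
    application = {
        'uid': '12/00123/FUL',                   # the site's own reference number
        'description': 'Two-storey rear extension',
        'address': '1 High Street, Anytown AB1 2CD',
        'start_date': '2012-03-01',              # ISO 8601 date
        'date_scraped': '2012-03-29T09:10:00Z',  # RFC 3339 date/time, indexed
    }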

So how do you get started? Here’s a list of 10 non-standard authorities that you can choose from: Aberdeen, Aberdeenshire, Ashfield, Bath, Calderdale, Carmarthenshire, Consett, Crawley, Elmbridge and Flintshire. Have a look at the sites and then let me know if you want to reserve one and how long you think it will take to write your scraper.

Happy scraping.

21 Responses


  1. I’ll have a look at converting my existing scraper for the (distinctly non-standard and annoying) site of Broxbourne Council to this standard. The current scraper is here:

    https://scraperwiki.com/scrapers/broxbourne_planning_applications/

    Tom Hughes

    March 29, 2012 at 9:10 am

    • Tom

      I would be very interested to see how you get on, as there are about 8 other sites like this (built by Civica?) which I have already had a go at and found equally annoying and impenetrable. I did manage to crack two of them (Harrow and Wrexham), so let me know when you are ready and we can swap notes.

      Andrew

      March 29, 2012 at 7:46 pm

      • Think I’ve got it all running now, and repopulated with all the data from the start of 2007 using the new schema.

        The biggest pain of course is the lack of any direct URLs for applications…

        Tom Hughes

        March 30, 2012 at 4:12 pm

  2. Hi All

    The missing link to the spreadsheet in the main article is:

    https://docs.google.com/spreadsheet/ccc?key=0AhOqra7su40fdGdVbDRWYkxGbnhsTkFMTjBBSi1oTHc

    Look for the “Scraper field names” tab at the bottom.

    Andrew

    March 29, 2012 at 7:50 pm

  3. Quick question – if a Council were to release the data in a machine-readable format, what would be the minimum data set you would require as a starter?

    adamjennison111

    May 8, 2012 at 2:55 pm

    • The mandatory data fields are tagged as such in the “Scraper field names” section of the shared spreadsheet. However, the list is fairly minimal and we’d hope for more detail if possible.

      Andrew

      May 11, 2012 at 7:52 am

      • Thanks Andrew, it was only after posting that I RTFM’d and spotted the spreadsheet!
        🙂
        I ask because I work in Hull City Council and am looking at Open Data.

        I find it incredible that the planning systems that are open, such as ours and our neighbour’s – the East Riding of Yorkshire – are so poor when it comes to providing data that is machine readable. Not even an RSS feed!?

        I am looking to wrap some of our systems in APIs that can provide data in a better format, and in this case I would suggest that even a little, with a link to the system’s content, will be better than nothing.

        I started to write a scraper for the Idox system (open, on ScraperWiki) as I want to create a hyperlocal blog, but found the OpenlyLocal site and paused to see what was happening.

        I would rather that the data, which is already in the public sphere, be open for all – machines, not just eyes!

        I am looking to use persistent URIs and a datastore to hold the information, but we are in the early stages of planning our open data thoughts, so any comments are greedily accepted – it would be good to be led by developer needs rather than some internal ‘direction’.
        🙂

        adamjennison111

        May 11, 2012 at 10:15 am

  4. I really want Canterbury to be supported, and might have a bit of time to look at writing a scraper. According to your spreadsheet this is “AcolNet” style.

    So – is anyone else working on this, and if not, would it be useful to try and get a single scraper working against multiple AcolNet sites?

    James Berry

    May 8, 2012 at 3:59 pm

  5. …and is it worth using the planningalerts googlecode repository as a starting point?

    http://code.google.com/p/planningalerts/source/browse/#svn%2Ftrunk%2Fpython_scrapers%253Fstate%253Dclosed

    James Berry

    May 8, 2012 at 4:11 pm

  6. I would really like to help out with Stoke-on-Trent City Council’s town planning alerts, so at least they’d be usable, but everything is running off sessions. It’s so badly done, it’s a joke.
    http://www.planning.stoke.gov.uk/dataonlineplanning/

  7. […] reading the article they, Openlylocal, might just have solved it. They are also asking for help, writing Scrapers, rather than read here what it’s all about, go to the site and read for your self, but above […]

  8. Are there any plans to include, for example, the New Forest National Park Authority? The ‘New Forest’ site now being scraped is the district council’s.

    Chunter

    September 17, 2012 at 4:26 am

