countculture

Open data and all that

How to help build the UK’s open planning database: writing scrapers


This post is by Andrew Speakman, who’s coordinating OpenlyLocal’s planning application work.

As Chris wrote in his last post announcing OpenlyLocal’s progress in building an open database of planning applications, we can handle the importing from the main planning systems ourselves, but if we’re really going to cover the whole country we’re going to need the community’s help. I’m going to be coordinating this effort, so I thought it would be useful to explain how we’re going to do it (you can contact me at planning@openlylocal.com).

First, we’re going to use the excellent ScraperWiki as the main platform for writing external scrapers. It supports Python, Ruby and PHP, and has worked well for similar schemes. It also means the scraper is openly available and we can see it in action. We will then use the ScraperWiki API to upload the data regularly into OpenlyLocal.
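As a rough illustration of that last step, here is how a scraper’s data can be pulled over the classic ScraperWiki datastore API (Python 2, as used on ScraperWiki at the time). The scraper name and query are invented for the example, and the exact endpoint parameters should be checked against the ScraperWiki docs rather than taken on trust:

    import json
    import urllib
    import urllib2

    # Query a scraper's SQLite datastore over the ScraperWiki web API and
    # get the matching rows back as a list of JSON dictionaries.
    params = urllib.urlencode({
        'format': 'jsondict',
        'name': 'nuneaton_and_bedworth_planning',  # hypothetical scraper name
        'query': "select * from swdata where date_scraped >= '2012-03-28'",
    })
    url = 'https://api.scraperwiki.com/api/1.0/datastore/sqlite?' + params
    applications = json.load(urllib2.urlopen(url))
    print '%d applications fetched' % len(applications)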

Second, we’re going to break the job into manageable chunks by focusing on target groups of councils, and just to sweeten things – as if building a national open database of planning applications wasn’t enough 😉 – we’re going to offer small bounties (£75) for successful scrapers for these councils.

We have some particular requirements designed to keep the system maintainable and do things the right way, but few of them are set in stone, so feel free to respond with suggestions if you want to do things differently.

For example, the scraper should keep itself current (running on a daily basis), but also behave nicely (not putting an excessive load on ScraperWiki or the target website by trying to get too much data in one go). In addition, we propose that the scrapers should operate by updating current applications daily and also making inroads into the backlog by gathering a batch of previous applications on each run.

We have set up three example scrapers that operate in the way we expect: Brent (Ruby), Nuneaton and Bedworth (Python) and East Sussex (Python). These scrapers perform four operations (sketched in code after this list), as follows:

  1. Create new database records for any new applications that have appeared on the site since the last run, and store the identifiers (uid and url).
  2. Create new database records for a batch of missing older applications, and store the identifiers (uid and url). Currently the scrapers are set up to work backwards from the earliest stored application towards a target date in the past.
  3. Update the most current applications by collecting and saving the full application details. At the moment the scrapers update the details of all applications from the past 60 days.
  4. Update the full application details of a batch of older applications where the uid and url have been collected (as above) but the application details are missing. At the moment the scrapers work backwards from the earliest “empty” application towards a target date in the past.
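To make that concrete, here is a minimal Python sketch of a daily run performing those four operations against the classic ScraperWiki datastore. The search URL, the batch size, the backlog target and both parse_* helpers are hypothetical placeholders – every council site needs its own versions – so treat this as a shape to copy, not working code for any particular authority:

    import datetime
    import scraperwiki

    SEARCH_URL = 'http://planning.example.gov.uk/search?day=%s'  # hypothetical
    BACKLOG_TARGET = '2007-01-01'  # how far back to chip away at the backlog
    BATCH = 25                     # modest batches keep the load polite

    def parse_search_results(html):
        """Site-specific: yield {'uid': ..., 'url': ..., 'start_date': ...} stubs."""
        raise NotImplementedError

    def parse_application_page(html):
        """Site-specific: return the full details dict for one application."""
        raise NotImplementedError

    def save(record):
        # 'uid' is the primary key, so re-saving a uid updates the row in place
        scraperwiki.sqlite.save(unique_keys=['uid'], data=record)

    def scrape_day(day):
        # Operations 1 and 2: store identifier stubs for one day's applications
        for stub in parse_search_results(scraperwiki.scrape(SEARCH_URL % day)):
            save(stub)

    def update_details(rows):
        # Operations 3 and 4: fetch and save the full details for each stub
        for row in rows:
            details = parse_application_page(scraperwiki.scrape(row['url']))
            details.update(uid=row['uid'], url=row['url'],
                           date_scraped=datetime.datetime.utcnow().isoformat())
            save(details)

    today = datetime.date.today()

    # 1. New applications that have appeared since the last run
    scrape_day(today)

    # 2. A batch of older applications, working back towards BACKLOG_TARGET
    earliest = scraperwiki.sqlite.select(
        "min(start_date) as d from swdata")[0]['d'] or str(today)
    day = datetime.datetime.strptime(earliest, '%Y-%m-%d').date()
    for _ in range(BATCH):
        day -= datetime.timedelta(days=1)
        if str(day) < BACKLOG_TARGET:
            break
        scrape_day(day)

    # 3. Refresh the details of all applications from the past 60 days
    update_details(scraperwiki.sqlite.select(
        "uid, url from swdata where start_date >= '%s'"
        % (today - datetime.timedelta(days=60))))

    # 4. Fill in a batch of older 'empty' records (identifiers but no details)
    update_details(scraperwiki.sqlite.select(
        "uid, url from swdata where date_scraped is null "
        "order by start_date desc limit %d" % BATCH))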

The data fields to be gathered for each planning application are defined in this shared Google spreadsheet. Not all the fields will be available on every site, but we want all those that are there.

Note the following:

  • The minimal valid set of fields for an application is: ‘uid’, ‘description’, ‘address’, ‘start_date’ and ‘date_scraped’
  • The ‘uid’ is the database primary key field
  • All dates (except ‘date_scraped’) should be stored in ISO 8601 format (e.g. 2012-03-29)
  • The ‘start_date’ field is set to the earlier of the ‘date_received’ and ‘date_validated’ fields, or to whichever one is available
  • The ‘date_scraped’ field is a date/time (RFC 3339) set to the current time when the full application details are updated. It should be indexed.
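For illustration, a record containing just the minimal valid field set might look like this (all values are invented):

    # A minimal valid record under the rules above; values are made up.
    application = {
        'uid': '12/00123/FUL',                   # the site's own reference number
        'description': 'Two-storey rear extension',
        'address': '1 High Street, Anytown AB1 2CD',
        'start_date': '2012-03-01',              # ISO 8601 date
        'date_scraped': '2012-03-29T09:10:00Z',  # RFC 3339 date/time, indexed
    }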

So how do you get started? Here’s a list of 10 non-standard authorities that you can choose from: Aberdeen, Aberdeenshire, Ashfield, Bath, Calderdale, Carmarthenshire, Consett, Crawley, Elmbridge and Flintshire. Have a look at the sites and then let me know if you want to reserve one and how long you think it will take to write your scraper.

Happy scraping.

21 Responses


  1. I’ll have a look at converting my existing scraper for the (distinctly non-standard and annoying) site of Broxbourne Council to this standard. The current scraper is here:

    https://scraperwiki.com/scrapers/broxbourne_planning_applications/

    Tom Hughes

    March 29, 2012 at 9:10 am

    • Tom

      I would be very interested to see how you get on, as there are about 8 other sites like this (built by Civica?) which I have already had a go at and found equally annoying and impenetrable. I did manage to crack two of them (Harrow and Wrexham), so let me know when you are ready and we can swap notes.

      Andrew

      March 29, 2012 at 7:46 pm

      • Think I’ve got it all running now, and repopulated with all the data from the start of 2007 using the new schema.

        The biggest pain of course is the lack of any direct URLs for applications…

        Tom Hughes

        March 30, 2012 at 4:12 pm

  2. Hi All

    The missing link to the spreadsheet in the main article is:

    https://docs.google.com/spreadsheet/ccc?key=0AhOqra7su40fdGdVbDRWYkxGbnhsTkFMTjBBSi1oTHc

    Look for the “Scraper field names” tab at the bottom.

    Andrew

    March 29, 2012 at 7:50 pm

  3. Quick question – if a Council were to release the data in a machine-readable format, what would be the minimum data set you would require as a starter?

    adamjennison111

    May 8, 2012 at 2:55 pm

    • The mandatory data fields are tagged as such in the “Scraper field names” section of the shared spreadsheet. However, the list is fairly minimal and we’d hope for more detail if possible.

      Andrew

      May 11, 2012 at 7:52 am

      • Thanks Andrew, it was only after posting that I RTFM’d and spotted the spreadsheet!
        🙂
        I ask because I work in Hull City Council and am looking at Open Data.

        I find it incredible that the planning systems that are open, such as ours and our neighbour’s – the East Riding of Yorkshire – are so poor when it comes to providing data that is machine readable. Not even an RSS feed!?

        I am looking to wrap some of our systems in APIs that can provide data in a better format, and in this case I would suggest that even a little, with a link to the system’s content, will be better than nothing.

        I started to write a scraper for the Idox system (open, on ScraperWiki) as I want to create a hyperlocal blog, but found the OpenlyLocal site and paused to see what was happening.

        I would rather that the data, which is already in the public sphere, be open for all – machines, not just eyes!

        I am looking to use persistent URIs and a datastore to hold the information, but we are in the early stages of planning our open data thoughts, so any comments are greedily accepted – it would be good to be led by developer needs rather than some internal ‘direction’.
        🙂

        adamjennison111

        May 11, 2012 at 10:15 am

  4. I really want Canterbury to be supported, and might have a bit of time to look at writing a scraper. According to your spreadsheet this is “AcolNet” style.

    So – is anyone else working on this, and if not, would it be useful to try and get a single scraper working against multiple AcolNet sites?

    James Berry

    May 8, 2012 at 3:59 pm

  5. …and is it worth using the planningalerts googlecode repository as a starting point?

    http://code.google.com/p/planningalerts/source/browse/#svn%2Ftrunk%2Fpython_scrapers%253Fstate%253Dclosed

    James Berry

    May 8, 2012 at 4:11 pm

  6. I would really like to help out with Stoke-on-Trent City Council’s town planning alerts, so at least they’d be usable, but everything is running off sessions. It’s so badly done, it’s a joke.
    http://www.planning.stoke.gov.uk/dataonlineplanning/

  7. […] reading the article they, Openlylocal, might just have solved it. They are also asking for help, writing Scrapers, rather than read here what it’s all about, go to the site and read for your self, but above […]

  8. Are there any plans to include, for example, the New Forest National Park Authority? The ‘New Forest’ site now being scraped is the district council’s.

    Chunter

    September 17, 2012 at 4:26 am

