When is a dataset not a dataset? The hackday project that crowdsourced data.gov.uk

Dr Ian McDonald | 12:31 UK time, Thursday, 22 April 2010

Tom Morris and other participants at the end of the hackday

When is a dataset not a dataset? How many of the 3,241 datasets now listed as part of data.gov.uk are easy to open up and play with? How many are tables for computers to analyse, instead of PDF reports for people to read?

The filled a Channel 4 office with journalists and developers on the final Friday in January. Our aim was to tell new stories with open data. Attendees already had form - the BBC's Open Secrets blogger Martin Rosenbaum, and data journalism teams from the Times, the Guardian, and the FT. judged our attempts in his role as head of hosts , alongside mySociety boss Tom Steinberg. They to my team's analysis of Tory candidates. But another project promised to shed light on public data in the UK.

Tom Morris was part of a team that looked into the quality of data.gov.uk. Although data.gov.uk advertises itself as a database of open datasets, many of the entries are . He built a prototype format checker that invites people to go through datasets and record the file format. You can listen to him explaining the checker to me and to the hackday, or reuse under the .

On Wednesday February 3rd, he put a completed quality checker online. By that Thursday, the crowd had gone through data.gov.uk and marked up all of the datasets.

Tom posted his initial breakdown to the data.gov.uk community on March 20th:

HTML - 252
XML - 5
Word - 4
RTF - 1
OpenOffice - 1
Something odd - 85
JSON - 9
Nothing there! - 190
CSV - 12
Multiple formats - 1211
PDF - 468
RDF - 10
Excel - 408
TOTAL - 2656
Sadly, this is over-optimistic. I've manually checked some of the data that has been categorised as JSON and RDF. Most of it is not actually correctly categorised - either people clicked, say, 'RDF' when they meant to click 'PDF', or they have seen an RSS or Atom feed and categorised it as RDF. What this admittedly imperfect dataset is basically saying is that the vast majority of the 'data' on data.gov.uk is not actually machine-readable data but human-readable documents.
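For readers who want to play with Tom's breakdown themselves, here is a minimal Python sketch that tallies the counts above into rough groups. The counts are copied from his post. The grouping into "documents", "structured" and "unclear" is an assumption made here for illustration and is not Tom's own classification. In particular, Excel is counted as structured, which is generous, and "Multiple formats" is left as unclear because the breakdown does not say which formats those entries contain.

```python
# Counts copied from Tom's posted breakdown of data.gov.uk entries.
counts = {
    "HTML": 252, "XML": 5, "Word": 4, "RTF": 1, "OpenOffice": 1,
    "Something odd": 85, "JSON": 9, "Nothing there!": 190, "CSV": 12,
    "Multiple formats": 1211, "PDF": 468, "RDF": 10, "Excel": 408,
}

# ASSUMPTION: this grouping is illustrative only, not part of the original post.
# Anything not listed here (including "Multiple formats") counts as "unclear".
GROUPS = {
    "documents": {"HTML", "PDF", "Word", "RTF"},
    "structured": {"XML", "JSON", "CSV", "RDF", "Excel"},
}


def summarise(format_counts):
    """Return the overall total and a tally of entries per group."""
    total = sum(format_counts.values())
    tallies = {"documents": 0, "structured": 0, "unclear": 0}
    for fmt, n in format_counts.items():
        group = next(
            (name for name, members in GROUPS.items() if fmt in members),
            "unclear",
        )
        tallies[group] += n
    return total, tallies


total, tallies = summarise(counts)
for group, n in tallies.items():
    # Print each group's count and its share of all entries.
    print(f"{group:<11}{n:>5}  {n / total:6.1%}")
```

On these counts the clearly structured formats come to 444 of the 2,656 entries, and Tom's caveat about miscategorised JSON and RDF suggests that even this figure is an upper bound.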

He will be at the this weekend, where he will speak about and might do the analysis, which he told me was the most important part. When done, it will be very interesting indeed to read it.
