The Simple Joys of Web-Scale Identifiers

Michael Smethurst | 13:26 UK time, Wednesday, 25 June 2008

<aside>Second post of the day is quite a record for me but this one isn't about microformats so you can probably look away now...</aside>

Bob Dylan with his MusicBrainz identifier

This post is partly a response to and partly the result of conversations with Matthew Wood, Chris Sizemore and John O'Donovan on our recent jaunt to . Now I think that most of our department would agree with Tom. After all we've been having these conversations for a few years now and when it comes to URL design .

When you're building anything it's always good to admit that cleverer people than you or I (or even ) came before. In the case of the web gave us and HTTP is stateless. It's the whole beauty of the web: everyone, everywhere gets the same thing from the same place. The moment you pick a fight with this design you're probably gonna get beat.

Which is not to say that people haven't picked this fight. Many websites (including the BBC) use to preserve state across requests. So but when you make that choice you need to be aware that all your user activity will remain uncaptured by the web - no browsability, no Google goodness, no benefit to your organisation (beyond the obvious) and no caching.

So, like I say I agree with the but I'd like to try to add a fifth: if possible don't reinvent other people's web identifiers. By web identifiers I mean those fragments of URLs that uniquely identify a resource within a domain. So in the case of the entry for The Fall () that'll be d5da1841-9bc8-4813-9f89-11098090148e.

The last time we updated the /music site we made this mistake (kind of unavoidable at the time). Even though we linked our data to MusicBrainz we minted new identifiers for artists. So The Fall became /music/artist/jb9x/ where jb9x was the identifier. But jb9x doesn't exist anywhere outside of /music. We'll (hopefully) .

When we first the big attraction was twofold:

  • stable web-scale identifiers
  • - no separate deals to reuse data in APIs etc

So when the next version of /music goes live you'll see: /music/artists/d5da1841-9bc8-4813-9f89-11098090148e and the world will hopefully be a slightly better place.
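As a rough sketch of what reusing the identifier means in practice, here is how a /music path could be built straight from a MusicBrainz identifier. The path pattern is the one quoted above; the helper name `music_artist_path` and the validation step are assumptions made for illustration, not a description of how /music is actually built.

```python
import uuid


def music_artist_path(mbid: str) -> str:
    """Build a /music/artists/ path from a MusicBrainz identifier.

    Hypothetical helper: only the path pattern comes from the post.
    """
    # uuid.UUID raises ValueError for anything that is not a well-formed
    # UUID, so a locally minted identifier such as "jb9x" is rejected.
    canonical = str(uuid.UUID(mbid))  # normalised to lowercase with hyphens
    return f"/music/artists/{canonical}"


print(music_artist_path("d5da1841-9bc8-4813-9f89-11098090148e"))
```

The point of the design is that nothing is minted locally: the only part of the path the site owns is the `/music/artists/` prefix.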

Now I can already hear my old mentor saying:

Michael noooo! URIs are just identifiers for resources. They shouldn't reflect the taxonomy of the site. The resource should define its relationships to other resources not the URI. Call them anything you like but just keep them stable.

With which I also mostly agree but - if bbc.co.uk/programmes tagged content with the same vocabulary as we'd be able to cross-promote news stories from programmes and programmes from news stories by sharing APIs not databases. Tie this into personalisation and the power goes exponential. Read six articles on reconstruction in Iraq? Then you might like this Panorama programme.

But if the vocabulary used to tag programmes and news was web-scale then , , etc (or someone in between) could start to aggregate stories around a shared sense of topic. This is what Chris' recent post on using Wikipedia / DBpedia as a controlled vocabulary begins to hint at. It's like or except the terms returned are web native or web-scale identifiers if you will.

So what's the practical benefit? Well, because the new /music URLs will be based on MusicBrainz identifiers and because /music will be interlinked with /programmes and because the speaks in MusicBrainz identifiers can spend a weekend at making something that takes your Last.fm user name, extracts your favourite artists, ties them to /music and recommends BBC programmes. Which is a .
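To make that weekend hack a bit more concrete, here is a minimal sketch of the join it relies on. Everything in it is invented for illustration: no Last.fm or BBC API is called, the programme identifiers and all but one of the artist identifiers are placeholders, and `recommend` is a made-up function name. The only thing it tries to show is that two data sets keyed on the same artist identifier can be joined without any lookup table in between.

```python
# Artist identifiers. The first is the one for The Fall quoted in the post;
# the others are placeholders, not real MusicBrainz identifiers.
THE_FALL = "d5da1841-9bc8-4813-9f89-11098090148e"
PLACEHOLDER_A = "00000000-0000-0000-0000-000000000001"
PLACEHOLDER_B = "00000000-0000-0000-0000-000000000002"

# What a listener's favourite artists might look like once reduced to ids.
favourite_artists = [THE_FALL, PLACEHOLDER_A]

# Hypothetical data: programme identifier -> artist identifiers played.
programme_artists = {
    "b0000001": {THE_FALL},
    "b0000002": {PLACEHOLDER_B},
}


def recommend(favourites, programmes):
    """Return /programmes paths whose artists overlap the favourites."""
    wanted = set(favourites)
    return sorted(
        f"/programmes/{pid}"
        for pid, artists in programmes.items()
        if artists & wanted  # non-empty intersection means a shared artist
    )


print(recommend(favourite_artists, programme_artists))
```

If each side had minted its own artist identifiers, the `artists & wanted` line would need a mapping table that somebody has to build and maintain.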

Taking another example for those who wish to stalk Tom Scott. His blog is at which is also his OpenID, you'll find his delicious account at , his tweets at and if you want to hire him he's at on LinkedIn. So derivadow is a web-scale identifier for Tom. It's not as strong or as powerful as a set of RDF linked URIs but if you wanna aggregate Tom-ness it's a pretty good starting point. Sadly I can't find him anywhere on Last.fm but that's possibly a godsend.

The obvious question is if web-scale identifiers are so good why did the BBC mint its own for programmes? After all the b00c4wxm used in /programmes and iPlayer is a BBC invention. And the answer is there were no suitable identifiers out there. I'd like to think that if Program(me)Brainz existed with stable identifiers we'd have put in the work to use those instead. But it didn't so we couldn't... But now we have stable identifiers out there on the web free to use for anyone. It would be good for example to see these identifiers adopted by . Time will tell.

One argument against all this is that web-scale identifiers are often kinda ugly. After all if Last.fm gets away with why do we need d5da1841-9bc8-4813-9f89-11098090148e? The answer is ambiguity. MusicBrainz has . Which one(s) does the BBC play? Probably none actually but you get the point. If we want to be exact in what we point to we need to handle ambiguity. In general we follow 3 commandments:

  1. URLs should be human readable
  2. URLs should be hackable
  3. URLs should persistently point to one concept

And the greatest of these is persistence. If you can't maintain stable URLs per concept don't even bother with 1 and 2. There are others that argue that . If resolving ambiguity is not important to your business then I'd agree but if you need to differentiate stuff with the same label you need unique identifiers - better yet web-scale identifiers.
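A small sketch of the ambiguity problem, using invented records: the artist names are just examples and the identifiers are generated on the spot with `uuid5` so the example is repeatable. They are not real MusicBrainz identifiers and `demo_id` is a made-up helper.

```python
import uuid
from collections import defaultdict

# Namespace used only to mint repeatable fake identifiers for this example.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def demo_id(seed: str) -> str:
    """Return a stable, fake identifier for an invented record."""
    return str(uuid.uuid5(DEMO_NAMESPACE, seed))


# Two different acts share a label; a third has a label of its own.
ARTISTS = [
    {"id": demo_id("aurora-a"), "name": "Aurora"},
    {"id": demo_id("aurora-b"), "name": "Aurora"},
    {"id": demo_id("the-fall"), "name": "The Fall"},
]

by_label = defaultdict(list)  # label -> every identifier carrying it
by_id = {}                    # identifier -> exactly one record
for artist in ARTISTS:
    by_label[artist["name"]].append(artist["id"])
    by_id[artist["id"]] = artist["name"]

# A label can point at several things; an identifier points at one.
print(len(by_label["Aurora"]), len(by_id))
```

A URL built from the label would have to pick one of the two acts (or show a disambiguation page); a URL built from the identifier never has to choose.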

Now I guess the people would say do this properly in with etc and we will do. But for hackers without PhDs the possibility of instant interoperability and quick mesh-ups is irresistible. Obviously you'll still need to establish equivalency between and but luckily that's where the people have done some of our work for us. And they're damn nice people to boot.

So I guess what I'm saying echoes Tom. Cleverer people than us have come up with ways to attach web-scale identifiers to content so why waste time reinventing. Whilst the BBC or *insert your organisation here* should own their data (whilst hopefully making it free - as in beer; as in speech) we don't have to own our identifiers. If we choose to use the power of web-scale identifiers we free our content to fly and . It's not exactly profound but it does feel like a small breakthrough to an .

Comments

  • Comment number 1.

    But of course, your programme identifiers (which humorously have recently all started with b00b!) aren't web-scale. Nobody else (e.g. ITV, Channel 4, HBO etc) can use your identifiers without risk of overlap.

    Perhaps the BBC should start a Program(me)Brainz system itself, or sponsor MusicBrainz to do so. You need someone who can be responsible for the identifiers and ensuring that two organisations don't use the same identifiers for different program(me)s...

  • Comment number 2.

    I actually believe that the 3 Asimov rules, I mean semantic standards, are of decreasing importance:

    1. URLs should be human readable
    2. URLs should be hackable
    3. URLs should persistently point to one concept

    If 1 and 3 are not both possible, then 1 takes priority.

    Wikipedia does this quite well. If we use a similarly musical example, 'Queen' by itself will bring up a disambiguation list. 'Queen (Band)' will bring up the artist and 'Queen (Chess)' will bring up the playing piece. I know where I'm going to end up. If someone told me article 'sdkjh221' was about the band, there's no way to know until I visit it.

    And let's say there were 2 bands called 'Queen' one would be 'Queen (1980s)' and the other would be 'Queen (2000s)'.

    My lecturer in my database normalisation classes always made sure we understood that you rarely need to assign a random identity to an object. Unfortunately real life doesn't work out that way, as you have found. But experience has taught me that the manipulation of the original unique identifier is better than making up your own.

    Whilst inter-site compatibility is very important and a very useful tool, without some sort of consensus, there will be anarchy. Eventually the strongest will survive, and I'm willing to bet the ones that will survive are the ones that people can read.

    I hope this makes some sense. It's a very abstract world and my explanation isn't very conclusive, certainly anecdotal. But I hope you can see that there are solutions out there that work, you just need a bit of imagination.

  • Comment number 3.

    hey ed, i think you (and maybe michael originally?) have misinterpreted what the Beeb's programme episode identifiers actually are?

    the unique identifier isn't: b00c53g4...

    it's: /programmes/b00c53g4

    thus it's web-scale and guaranteed unique.

    i think we're getting confused because the MusicBrainz IDs, both internal and external, are GUIDs...

    but URIs are just as unique, and arguably more web-scale, than GUIDs...

    "Nobody else (e.g. ITV, Channel 4, HBO etc) can use your identifiers without risk of overlap." -- yes they can, just keep the BBC namespace at the beginning. the whole URI is the identifier.

    mattcopp, i agree with you that Wikipedia has URLs to die for. luckily and fortuitously for them, their "business" process ensures this: consider the workflow of how wiki pages are created and named...

    but disagree with: "you just need a bit of imagination" -- imagination is at a surplus, believe me... what you actually need is a wiki-style URL naming scheme enforced inside an organisation, like Wikipedia lucked into, and so far the BBC hasn't been able to enforce a wiki-style unique naming scheme on its programme titles, nor has it been able to curate disambiguation pages (or afford the server load to automate such beasts -- and automated naming conventions lead to abstract IDs anyway... "let's say there were 2 bands called 'Queen' one would be 'Queen (1980s)' and the other would be 'Queen (2000s)'." what if there were 2 Queens in the 1980s? what would we call the 2nd? whatever we called it, a human would need to be involved, or else an automated process would start to name things in less-than-human-readable ways: Queen1, Queen2, Queen3, etc?)...

    nor has MusicBrainz, for that matter, thus the GUIDs...

    "let's say there were 2 bands called 'Queen' one would be 'Queen (1980s)' and the other would be 'Queen (2000s)'." -- fine, but some human being actually needs to name these things in that way by hand, or else some pretty clever software must do so -- but then that would depend on human-entered metadata in the system somewhere in the chain, no?

    plus many "episodes" simply don't have titles at all, much less unique ones.

    anyway, human-readable is a delight. persistent is a drop-dead requirement. Wikipedia lucked into both, because it has lots of free labor that literally has to give each page a unique name, by hand.

  • Comment number 4.

    @mattcopp + @onpause Wikipedia's URLs are lovely to look at, nice-ish to type and swallowed whole by Google but they're not persistent. They move as concepts are disambiguated. Somewhere in my head lurks a figure of 5% but since I don't know if that's per month, per year or forever it's a pretty useless figure. But even if that's the total movement over all time given the size of Wikipedia that's a fair bit of shift. And obviously cool URIs don't change:

  • Comment number 5.

    @ed + @onpause I suspect I've made myself misunderstood - I was certainly not claiming that b00c4wxm is globally unique.

    The only true web-scale identifiers are URLs. They're how we uniquely identify and locate resources. I think we're all agreed there? But that's not quite what I'm talking about here.

    In retrospect maybe this post should have been called 'Approximations of web-scale identifiers' but that's a bit of a mouthful. The bit I'm interested in is the fragment of the URL that's *almost* good enough to carry identity. So in the Panorama episode example a Google search for b00c4wxm returns 14 results and they're all about that episode on /programmes and iPlayer. Now we know that that ID isn't globally unique and it's certainly not as powerful as a URL but it clearly carries some degree of unambiguous meaning.

    The same with derivadow. Again not globally unique. We could all register on a social network that Tom's not joined yet (cough) as derivadow. Now a Google search for derivadow gives a fair few results but all the ones I'm seeing are our Tom. A search for 'Tom Scott' on the other hand brings back judges, musicians and mystery buyers. So while neither b00c4wxm (which doesn't start with b00b) nor derivadow is a true web-scale identifier they are pretty damn good approximations.

  • Comment number 6.

    I should be able to type:

    www.bbc.co.uk/programmes/rubbadubbers

    and not have to remember:

    www.bbc.co.uk/programmes/b0072hw8

    If there is another programme called rubbadubbers then there should be disambiguation info at the readable url.

    The BBC partly encourages hackable urls. Why not extend that power to programmes and customers?

  • Comment number 7.

    @ritchielee there's a task on our board to do just that. Should be there soon. In the meantime:

    /programmes/a-z/rubbadubbers

    does the job

  • Comment number 8.

    yes, you are right, academic research confirms 5% drift over Wikipedia's lifetime.

    you know me, though, always looking on the bright side: i say hooray, 95% remained persistent! ... ;-)

    i'm sure MusicBrainz fares even better than 95%, but surely not 100% persistence?

    plus, Wikipedia *URLs* actually remain persistent, as in dereferenceable -- almost no 404s... it's the *resource* "behind them" that changed, which is actually what you are getting at, no? i agree that this is unfortunate, but they sometimes surmount this via redirects and implied redirects via disambiguation pages. but it's inconsistent and non-machine readable, alas.

    Wikipedia has changed the *meaning* of 5% of its persistent URLs. so they still point to something, but what they point to has changed.

    much thanks, as that's an important point to make, no pun intended.

  • Comment number 9.

    I find the Wikipedia example interesting for two reasons that have nothing to do with each other:

    1. The disambiguation aspect has been part of HTTP for a long time now via the 300 status code. Unfortunately because the main entry point to the web has been the browser for years now those kind of features have never really been implemented. They'd work well for automated processing.

    2. English speakers tend to forget that the rest of the world has to endure ugly URLs no matter what, due to encoding issues. So the appeal of "readable URLs", although large in audience, is limited in its scope.

  • Comment number 10.

    A follow-up, the 300 status code doesn't actually offer you a choice between multiple resources but between multiple representations of the same resource. That, of course, may not be what one wants.

  • Comment number 11.

    Ok, I find this an interesting topic. Much could be read and written about it, and is. Thinking of the internet in database form (or RESTful) is quite a handy analogy.

    I had to re-read your post to see what you were trying to achieve Michael, and now I see the purpose of using unique identifiers.

    Musicians being creative types, database structure does not spring into their minds when they choose their names. That is why there are 16 very similar Auroras. Wikipedia and Last.fm do get very convoluted about that subject, especially Wikipedia as it deals with much more than just artists.

    Random and unique identifiers are quite a handy way of differentiating between them, and really the only sensible way. There are problems with the length and with it not being human readable, but this is not the BBC's fault, it's MusicBrainz who are trying to logicalize a very large and overlapping field.

    I understand why you use MusicBrainz and not Wikipedia, because MusicBrainz is a consistent source.

    I would ask though that when you are redirected to the artist page from the MusicBrainz style URL, you redirect the browser to the BBC's identifier for that artist, rather than just dumping us at /artist as the ting tings link above does.

  • Comment number 12.

    > URLs should be human readable

    I really don't understand how you consider the following to be human readable:
    /music/artists/d5da1841-9bc8-4813-9f89-11098090148e

    URLs should certainly contain the name of the artist / song that it's describing.

    Otherwise the URL of this blog post should be:
    /blogs/radiolabs/2008/06/6fa26370-491f-11dd-ae16-0800200c9a66.shtml
    Right? Otherwise you could run into some instance where there are TWO posts with the same name of "the_simple_joys_of_webscale_id" !!!!

    Edge cases aren't worth trashing friendliness, hackability and usability.

    99% of the artists won't have collision problems. So why would you design a system for 1% of your data set?


