Last week ReadWriteWeb asked: “Is Linked Data Gaining Acceptance?” Our answer: definitely yes. Projects like DBpedia, a community effort to structure the information from Wikipedia and provide it as Linked Open Data, have come a long way and work really well. For example, you can search for all scientists born in Zürich, Switzerland.
But you don’t have to stop there! Because the data is linked, you can open up the definition of “Zürich” and find much more related information. This is the beauty of Linked Open Data, and something we feel you should know about for the power it gives to visualization creators and data journalists. That’s why we have decided to start a series on Linked Open Data. With this first post I’ll introduce you to some core concepts in a gentle, non-technical way…
The Semantic Web, Linked Data and Open Data
Back in 2001 Tim Berners-Lee and his collaborators published a seminal article called “The Semantic Web” in which they presented their idea of “a new form of Web content that is meaningful to computers [and] will unleash a revolution of new possibilities”. In the last few years, the idea has gained traction and technologies have become available to build parts of this vision. Unfortunately, getting started is not so easy, because there are many concepts with slightly varying names and minute differences in meaning, as well as several technologies with cryptic names. So let’s start with some definitions.
First up is the term Semantic Web. The Semantic Web describes the vision that machines will some day be able to understand the meaning (“semantics”) of information on the Internet, and be able to “perform tasks automatically and locate related information on behalf of the user” (Wikipedia). What is important to understand is that this term describes an amalgam of concepts and technologies (similar to “Web 2.0”) and not a single technology.
One technological concept that is part of the Semantic Web vision is Linked Data, which describes “a method of publishing structured data, so that it can be interlinked and become more useful” (Wikipedia). The above-mentioned example shows the power of this: instead of giving our software a meaningless (at least to a machine) string as an input, we give it an object with a URI (Zürich) and define this object as being, amongst others, of type populated place.
The meaning of “a populated place” in this case is clearly defined, so that others can look up what it means exactly and also use this definition themselves. This way, if someone uses “a populated place”, everyone talks about the same thing. Also, if we take a look at the definition of label, it says that it is “a human-readable name for the subject”.
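To make the idea of typed, URI-identified objects concrete, here is a minimal sketch in plain Python. The DBpedia and W3C URIs are real ones; the tiny in-memory triple list and the `objects` helper are made up purely for illustration:

```python
# A minimal sketch of Linked Data in plain Python. The DBpedia and W3C
# URIs are real; the in-memory triple list and the helper function are
# illustrative only.

# Every statement is a (subject, predicate, object) triple.
triples = [
    ("http://dbpedia.org/resource/Zürich",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",   # "is of type"
     "http://dbpedia.org/ontology/PopulatedPlace"),
    ("http://dbpedia.org/resource/Zürich",
     "http://www.w3.org/2000/01/rdf-schema#label",        # human-readable name
     "Zürich"),
]

def objects(subject, predicate):
    """Return all objects stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Because the predicate URIs mean the same thing everywhere, any program
# can ask the same question of any Linked Data source.
print(objects("http://dbpedia.org/resource/Zürich",
              "http://www.w3.org/2000/01/rdf-schema#label"))
```

The point is that “type” and “label” are not ad-hoc column names but globally agreed-upon URIs, so the same lookup works against any source that uses them.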
The description of “a populated place” is part of a vocabulary that has been defined in an ontology. What’s interesting is that this ontology can be defined by anyone. This allows for the creation of ontologies for special areas of interest, such as the “friend of a friend (FOAF)” or the “hCard” vocabularies, which were created by individuals or small groups and have proven useful to their community. Because of the distributed nature of these ontologies, they can be formed bottom-up, which spares us from creating The One Global Ontology, a gargantuan task.
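In practice, a vocabulary is just a namespace of URIs, so mixing several vocabularies in one description comes down to expanding prefixed names. A rough sketch, assuming a made-up `ex:` namespace alongside the real FOAF and RDF ones:

```python
# Sketch: vocabularies are namespaces of URIs, so anyone can define one
# and mix it with existing ones. FOAF and RDF are real vocabularies; the
# "ex:" namespace is invented here for illustration.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf":  "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex":   "http://example.org/vocab/",   # hypothetical custom vocabulary
}

def expand(curie):
    """Expand a prefixed name like 'foaf:name' into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

# A description can freely combine terms from several vocabularies.
person = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
]
expanded = [(expand(s), expand(p), expand(o) if ":" in o else o)
            for s, p, o in person]
```

Anyone can publish such a namespace, which is exactly why these vocabularies can grow bottom-up.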
Linked Data by itself doesn’t have to be publicly available; it can just as well be used privately. So we need one more definition: Open Data. It describes “a philosophy and practice requiring that certain data be freely available to everyone, without restrictions from copyright, patents or other mechanisms of control” (Wikipedia). This is similar in spirit to other movements like Open Source Software, and there is work being done to create licenses that clarify the usage terms of the data (e.g. Open Definition and the Open Data Commons).
Finally, to describe data that is both open and linked, there’s the combination of the two: Linked Open Data. This is the data we, as visualization creators, want, because it has clear license terms and is easily linkable with other data sets. To put these terms in relation to each other, I created the following graphic; in the world of all data, only the blue areas are open to the public, with the dark blue being open and linked.
Democratic governments have always had to make the data they produce transparent to their citizens. However, many do so using proprietary file formats like Excel or machine-unfriendly documents like PDFs, or “hide” the data by distributing it over many government sites, thus making it (unintentionally) hard to find. This is all Open Data, because people can look at and use it.
Luckily, there is a new trend to make data truly open, not just legally and as a matter of form. Sites like data.gov have started to provide Open Data in a central, searchable catalog, often with the option of accessing the data through APIs. This makes the data a lot easier to consume, as it doesn’t have to be transformed, combined and prepared before a program can use it. With this central catalog in place, they have been able to go a step further and start transforming the data into a huge Linked Open Data set that is accessible to everyone. The graphic below shows the size of the Linked Open Data web at the end of 2010: each bubble is a website that you can access through Linked Open Data technologies in much the same way you would normally access a database.
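This database-like access usually happens through a SPARQL endpoint. Here is a sketch in Python that only builds the request URL; DBpedia’s endpoint and the `dbo`/`dbr`/`rdfs` prefixes are real, but the exact query shape is one plausible way to ask for scientists born in Zürich, and the network call itself is left out:

```python
from urllib.parse import urlencode

# Sketch: querying a SPARQL endpoint much like a database. The endpoint
# URL and prefixes are DBpedia's real ones; the query below is just one
# plausible way to ask for scientists born in Zürich.
query = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?scientist ?name WHERE {
  ?scientist a dbo:Scientist ;
             dbo:birthPlace dbr:Zürich ;
             rdfs:label ?name .
  FILTER (lang(?name) = "en")
}
LIMIT 10
"""

endpoint = "https://dbpedia.org/sparql"
# Build the GET request URL; fetching it (e.g. with urllib.request.urlopen)
# would return the matching rows as JSON.
url = endpoint + "?" + urlencode({"query": query, "format": "application/json"})
```

The same pattern works against any of the bubbles in the graphic, because they all speak the same query language.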
To get some perspective on these different ways of publishing data, Berners-Lee suggested a 5-star system to describe the accessibility quality of data sets to emphasize that “the Semantic Web isn’t just about putting data on the web”, but doing so in ways that allow machines to understand the meaning of the data. The LiDRC Lab has taken Berners-Lee’s proposal and prepared it using examples and annotations. Go and have a look at the Linked Open Data star scheme by example, it’s a good read.
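For quick reference, the five levels can be captured in a small lookup table; the wording below paraphrases Berners-Lee’s proposal rather than quoting it:

```python
# Berners-Lee's 5-star scheme for open data, paraphrased as a lookup table.
STARS = {
    1: "available on the web, in any format, under an open licence",
    2: "available as machine-readable structured data (e.g. Excel)",
    3: "as 2, but in a non-proprietary format (e.g. CSV)",
    4: "as 3, plus open W3C standards (RDF, SPARQL) and URIs to identify things",
    5: "as 4, plus links from your data to other people's data for context",
}

def rating(stars):
    """Render a star level with its description."""
    return "★" * stars + " " + STARS[stars]

print(rating(5))
```

Note that only the fourth and fifth stars involve Linked Data at all; the first three are about openness and format.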
What the system does not take into account, however, is the quality of the data itself. As with everything on the Internet, remember that even if you get your hands on a well-published Linked Open Data set, it may be incomplete, taken out of context or badly curated. Bad content in, bad content out still applies. This problem is especially acute for Linked Open Data at the moment, because everyone is just starting out with creating the ontologies and links and there is no way to do this overnight, so incompleteness will probably prevail for a while.
The Value for Visualization Creators
The basis of all visualizations is content, and the availability of Open Data certainly helps visualization creators and data journalists to find data that lets them support and discover the stories they want to tell. In his TED talks in 2009 and 2010, which we covered before, Berners-Lee gave several exciting examples of what can be accomplished if data is open to everyone, be it people uncovering discrimination based on race like in Zanesville or communities getting together to improve maps of Haiti to make crisis help possible.
These examples show what people are able to achieve if they have access to the data. Here are some more reasons why I think (Linked) Open Data has a lot of value for our community:
- Because the data is freely accessible, it is verifiable by others: they can access and judge the source.
- Others can access the same data set and create alternative works. Mashable magazine proposed the wonderful idea of “forking” a visualization, lowering the bar for creating alternatives and opening up the discourse. Platforms like Many Eyes, and even more so the New York Times Census Explorer, already go in this direction by showing the data and the visualizations others create, letting readers experiment and publish their own versions.
- By linking up the data, we gain access to many more “databases” and discover connections we wouldn’t otherwise have thought of.
- Because the knowledge about its relationships is already in the data, more people can work with it without requiring knowledge of tools to combine different data sets.
- The data becomes embeddable: not everyone has to keep a copy of a data set, they can link to it. This ensures that the data is always up-to-date.
- The data becomes better documented. If, for example, you use a national coordinate system like the UK’s northing, you can link to its definition and thereby make it easier for others to convert the coordinates to other systems.
I’m sure there are many more benefits of (Linked) Open Data and hope I could spark your imagination. These are exciting developments and I’m looking forward to seeing all the interesting (Linked) Open Data visualizations the visualization community comes up with!