
if not rdf, then what?
Post date 03.28.2006, 11:35 AM

posted by jdwilbur

I posted about RDF and the difficulty the web development community has had fully adopting RDF and ontologies as a method of metadata organization. I said that one of the reasons was the relative complexity of RDF and the cost of generating useful metadata (as opposed to just enough information to solve the current problem). Simon St. Laurent has a nice redux of the matter. I won't try to duplicate that, but I do want to explain some of the details about RDF. Though I made a case for how complex RDF is when used to create fully relational data sets, I didn't do a very good job of explaining how simple RDF is in principle. RDF proponents believe they are building the future. I'm not entirely convinced, but I want to take a close look at RDF before I consider other solutions.

RDF seems overwhelming, but in the inimitable words of Squire Patsy, "It's only a model!" A model, in this case, that can represent digital and real things and their relationships. The promise of RDF is that it can describe everything using a combination of unique identifiers, properties, and property values.

Unique Identifiers
The heart of RDF is the unique identifier. Your name is a unique identifier, but only as long as there is no one else in the room who answers to [your-name-here]. This, clearly, is not a good way to create a universal identification system. Your social security number is a unique identifier in this country, but it doesn't signify much in China, and the system is not extensible (we'd run out of numbers if we tried to SSN the Chinese). Your email address is a unique identifier on the Internet, and it works pretty well as one. A Uniform Resource Identifier (URI) is a little more extensible, and, since it's longer than an email address, can provide more information. You can use a URI to identify something even if it can't be retrieved through the web. A product at Amazon.com, for example, could have a unique URI, even though you still need a truck to bring it to you.

Properties
If we look at objects in the real world, they have physical properties, like size, color, and hardness. An example: my kitchen table. It's a three-dimensional object, so it has height, width, and length. It's made of wood, and it has been stained. It also has informational properties: the date I purchased it, the person who sold it to me, the area of the country it came from, the level of personal attachment I have for the thing. Each of these properties can be put into RDF by linking it to a schema that defines the property in a normative fashion. It'll make a little more sense when I give an example. But for that to happen I need to describe...

Property Values
Property values are the names, numbers, and dates that make properties make sense. My kitchen table is 78" long x 28" wide x 34" tall, dark-walnut stained, and soft (as wood goes). I bought it in February, 2002 from Joe Komenda, and I'm never going to part with it (even though it isn't really NYC apartment sized). Property values are the easy part of the metadata. Associating property values with properties, and properties with normative schemas: that's when things get tricky.

Here's the example I promised (bound in an XML format):

<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:kt="http://www.jdwilbur.fake/furniture#"
xmlns:geom2d="http://nurl.org/0/geom2d/1.0/"
xmlns:map="http://nurl.org/0/geography/map/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/">

<rdf:Description rdf:about="http://www.jdwilbur.fake/furniture/kitchen-table">
  <kt:height>34</kt:height>
  <kt:width>28</kt:width>
  <kt:length>78</kt:length>
  <kt:price>150</kt:price>
  <kt:month>February</kt:month>
  <kt:year>2002</kt:year>
  <dc:coverage>
    <geom2d:Point>
      <map:srs rdf:resource="http://nurl.org/0/geography/SRSCatalog/wgs84" />
      <geom2d:x>-123.817</geom2d:x>
      <geom2d:y>46.183</geom2d:y>
    </geom2d:Point>
  </dc:coverage>
  <kt:seller rdf:resource="http://www.komenda.fake/Joseph%20Komenda#" />
  <kt:sellit>Never ever ever</kt:sellit>
</rdf:Description>
</rdf:RDF>

http://www.jdwilbur.fake/furniture/kitchen-table: The URI of my kitchen table
kt:height: The property height from my schema defined here: http://www.jdwilbur.fake/furniture#
34: The property value that tells me how tall my table is. I would infer from the schema that the value is in inches, not millimeters or light years.

For the purposes of this example, I've made up my own fake schema (which would be a bunch of lines of XML similar to the example above) and included three real ones: Dublin Core (dc), Geomap 2d (geom2d) for mapping coordinates, and map to relate the coordinates to physical locations. My schema, kt (which stands for kitchen table), includes some special properties like seller and sellit. The seller, Joe Komenda, has his own URI (it appears after rdf:resource). The other properties are fairly standard, but have a specific meaning in my personal context. The only other tricky part is the geographic coordinates, because I'm using three different schemas to define a geographic point. (It's just an example taken from mapbureau. It could resolve to the middle of the Pacific Ocean for all I know.)
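
The identifier / property / value breakdown can be sketched in a few lines of Python. This is only an illustration, not a real RDF parser: it uses the standard library's generic XML parser on a trimmed copy of the kitchen-table description, handles only flat properties (no nested nodes such as the geographic point), and the function name triples is made up for this sketch.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A trimmed copy of the kitchen-table description (flat properties only).
DOC = """<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:kt="http://www.jdwilbur.fake/furniture#">
  <rdf:Description rdf:about="http://www.jdwilbur.fake/furniture/kitchen-table">
    <kt:height>34</kt:height>
    <kt:length>78</kt:length>
    <kt:seller rdf:resource="http://www.komenda.fake/Joseph%20Komenda#"/>
  </rdf:Description>
</rdf:RDF>
"""


def triples(xml_text):
    """Return (identifier, property, value) tuples for flat rdf:Description blocks.

    Hypothetical helper for illustration; nested descriptions are not handled.
    """
    root = ET.fromstring(xml_text)
    result = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree spells qualified names as "{namespace}local";
            # RDF joins namespace and local name into a single property URI.
            namespace, local = prop.tag[1:].split("}")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            # A property value is either another URI or a plain literal.
            value = resource if resource is not None else (prop.text or "").strip()
            result.append((subject, namespace + local, value))
    return result


for identifier, prop, value in triples(DOC):
    print(identifier, prop, value)
```

Each printed line is one statement about the table: the table's URI, the full URI of the property (schema namespace plus property name), and the value. The seller's value is itself a URI, which is how one description points at another.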

The obvious point here is that writing RDF is hard. We need automated tools to help us compose in this syntax, which is convoluted but requires perfection to work. Humans are not perfect; RDF is not our language. RDF also requires front-loading: developing schemas, choosing terms and URIs, and finding prior art so that terms can be reused. We need tools to help us manage that aspect. And we need applications that demand RDF. Currently, the demand for RDF is low because it mostly serves to maintain the richness of a data set for some future application, not the ones I work with every day.

So if RDF, syntactically difficult but conceptually easy, cannot get adopted, what is the alternative? The web API. A wide variety of new web applications and services are accompanied by an API. It seems like you can hardly be part of Web 2.0 without one. What does the API have that RDF doesn't? Simplicity. Familiarity. You cannot interact with an API unless you follow the rules. Fine. Same with RDF. But the rules of an API fall into the familiar realm of setting parameters, calling previously named functions, and following the documentation. This is like a caffeinated beverage for developers: they instinctively know how to consume it. More than that, APIs mean that people can innovate on an interface level, even if they don't have serious coding chops. I've seen the Google API implemented in twenty minutes. This is a more fluid way to develop, one that feels more comfortable even if it sacrifices information richness. We'll get to RDF one day, maybe in Web 3.5, but until then we will take small steps towards data sharing and interoperability with APIs.
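
To make the contrast concrete, here is a minimal sketch of what "setting parameters" against a web API tends to look like. The endpoint https://api.example.com/furniture/search and the parameter names material, max_price, and format are all hypothetical, invented for this illustration; no real service is implied, and the code only builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/furniture/search"


def build_request(material, max_price):
    """Compose a query URL the way a typical key/value web API expects."""
    # Hypothetical parameter names; a real API's documentation would list its own.
    params = {"material": material, "max_price": max_price, "format": "xml"}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_request("walnut", 150)
print(url)
```

There are no namespaces, schemas, or URIs to choose up front: the developer reads the documentation, fills in a few named parameters, and gets data back. That convenience is also the trade-off, since the meaning of each parameter lives in the documentation rather than in the data itself.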

Posted by jdwilbur on March 28, 2006 11:35 AM
tags: RDF, api, data, dublin_core, interoperability, property, schema, syntax, uri, value, web_2.0, xml

comments (7):



andrew s. on March 28, 2006 12:57 PM:

after this last paragraph is probably a good place to segue into mentioning microformats, from which i'm still trying to separate out the interesting aspects from the trivial.



Jesse Wilbur on March 28, 2006 04:19 PM:

Andrew,

I don't know much about microformats, but thanks for the link to the microformats blog. My impression is that microformats are an organic pathway blooming just under our digital feet. I like that, and I like the small pieces loosely joined aspect of it. I wonder how it will scale up in enterprise situations?



Roger Sperberg on March 29, 2006 01:12 PM:

I think that automated tools are the solution, in much the same way that Adobe Illustrator was the solution to creating art in Postscript.

On the other hand, topic maps are the flip side of RDF -- easy to model, comparable capabilities, less widely deployed, fewer choices, not as web-oriented.

In one large publishing project I work on, we flipflopped from TM to RDF and back to TM over the course of two years. In the end, the difficulties you raise about complexity and modeling ontologies drove us away from RDF, with the knowledge that we could likely convert anything we needed back again -- if needed.

What mattered most now was being able to look at the data and make sense of it. And we could do that with Topic Maps.



virginia kuhn on March 29, 2006 10:05 PM:

I am so glad to see this discussion here. I am only starting to get my feet wet in this area as I look for a way to catalogue and retrieve large video data sets. I've been looking into Dublin Core and kibitzing with a friend who is a metadata specialist at Cornell. It seems to me that if the Institute is going to be a leader in both electronic (academic) publishing and if Sophie projects are going to be persistent and exert influence, then the front loading of meta data is crucial.



bowerbird on March 30, 2006 02:26 PM:

meta-data is useful to catalogers, sure,
and even to people who know what they want.

but for users who want "the right book" to
_materialize_ in front of their very eyes,
meta-data ain't gonna do 'em much good...

for that, _collaborative_filtering_ is key.

and making it happen will be 100 times cheaper
than taking some meta-data r.d.f. approach...

-bowerbird



K.G. Schneider on March 31, 2006 10:50 AM:

I have a lot of experience with locally-generated metadata, for the portal I maintain in my day job. On the one hand, I stringently question the assumption that metadata is only useful for known-item searches. Good metadata can enhance the findability of items through several means, and collaborative filtering ain't there yet, just like we don't have robots doing our laundry. But I do not question the cost issues associated with good metadata. It's a balancing act and one that I take very seriously.



bowerbird on March 31, 2006 01:40 PM:

k.g. said:
> I stringently question the assumption that
> metadata is only useful for known-item searches.

i didn't say "only".

and it wasn't an "assumption", it was an assertion.

now, if you want to "stringently" question it, fine,
but how about giving us just a hint of reasoning?

and since one of the first things you might mention is
the "serendipity" of finding books "on the same shelf",
let me nip that one in the bud right away by saying that
in the cyberlibrary, there might be _thousands_ of books
"on the same shelf". what a user _really_ wants to know
is which _three_ of those books they'll be glad they read.
i don't see how _any_ system that fails to take into account
their individual preferences will ever be able to deliver that.

as to whether collaborative filtering "ain't here yet",
some people will argue with that. i agree with you,
that it's not here yet, but it's also the case that r.d.f.
"ain't here yet" either. so issues of cost _are_ relevant.

the variable cost of r.d.f. goes up as you scale, while
the variable cost of collaborative filtering goes down.
when we're talking about the hundreds of millions
of items in the cyberlibrary, that's a vital difference.

-bowerbird
