why google and yahoo love wikipedia 12.29.2005, 3:16 PM
posted by lisa lynch
From Dan Cohen's excellent Digital Humanities Blog comes a discussion of the Wikipedia story that Cohen claims no one seems to be writing about -- namely, the question of why Google and Yahoo give so much free server space and bandwidth to Wikipedia. Cohen points out that there's more going on here than just the open source ethos of these tech companies: in fact, the two companies are becoming increasingly dependent on Wikipedia as a resource, both as something to repackage for commercial use (in sites such as Answers.com), and as a major component in the programming of search algorithms. Cohen writes:
Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let's say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just "George Bush" or "Bush." Using data-mining algorithms such as Yahoo's remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., "Berlin Wall" and "Barbara" vs. "September 11" and "Laura"). You would still run into some disambiguation problems ("Saddam Hussein," "Iraq," "Dick Cheney" would show up a lot for both), but this method is actually quite a powerful start to document categorization.
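To make Cohen's example concrete, here is a minimal Python sketch of the idea. It is not Yahoo's Term Extraction service; a plain word count stands in for it, and the two short "entries" are made-up placeholders for the real Wikipedia articles. The function names (tokenize, distinctive_terms, classify) and the stopword list are likewise assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "was", "is",
             "his", "he", "with", "for", "on", "as", "by", "at"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens longer than one character."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) > 1 and w not in STOPWORDS]


def distinctive_terms(own_text, other_text, top_n=10):
    """Return the most frequent terms in own_text that never appear in other_text."""
    own_counts = Counter(tokenize(own_text))
    other_terms = set(tokenize(other_text))
    unique = Counter({t: c for t, c in own_counts.items() if t not in other_terms})
    return {term for term, _ in unique.most_common(top_n)}


def classify(document, profiles):
    """Pick the label whose distinctive terms overlap most with the document.

    Returns None when no profile matches or when the top scores are tied.
    """
    words = set(tokenize(document))
    scores = {label: len(words & terms) for label, terms in profiles.items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)


# Made-up stand-ins for the two Wikipedia entries (assumption, not real article text).
senior_entry = ("George H. W. Bush was president when the Berlin Wall fell. "
                "His wife is Barbara. Saddam Hussein and Iraq figured in the Gulf War. "
                "Dick Cheney served as defense secretary.")
junior_entry = ("George W. Bush was president during September 11. "
                "His wife is Laura. Saddam Hussein and Iraq figured in the Iraq War. "
                "Dick Cheney served as vice president.")

profiles = {
    "senior": distinctive_terms(senior_entry, junior_entry),
    "junior": distinctive_terms(junior_entry, senior_entry),
}

documents = [
    "Bush spoke about the fall of the Berlin Wall with Barbara at his side",
    "Bush addressed the nation after September 11 alongside Laura",
    "Bush discussed Iraq and Saddam Hussein with Dick Cheney",
]
for doc in documents:
    print(classify(doc, profiles), "<-", doc)
```

On this toy data the first two documents come out as "senior" and "junior", while the third, which mentions only names shared by both entries, gets no label at all. That is the disambiguation problem Cohen flags: terms like "Iraq" and "Dick Cheney" carry no signal because they appear in both profiles.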
Cohen's observation is a valuable reminder that all the discussion of Wikipedia's accuracy and usefulness as an academic tool only skims the surface of how and why the open-source encyclopedia is reshaping the way knowledge is made and accessed. Ultimately, the question of whether or not Wikipedia should be used in the classroom might be less important than whether -- or how -- it is used in the boardroom, by companies whose function is to repackage, reorganize and return "the people's knowledge" to the people at a tidy profit.