microsoft steps up book digitization
10.17.2006, 11:20 AM
posted by ben vershbow
Back in June, Microsoft struck deals with the University of California and the University of Toronto to scan titles from their nearly 50 million (combined) books into its Windows Live Book Search service. Today, the Guardian reports that they've forged a new alliance with Cornell and are going to step up their scanning efforts ahead of a launch of the search portal sometime early next year. Microsoft will focus on public domain works, but is also courting publishers to submit in-copyright books.
Making these books searchable online is a great thing, but I'm worried by the implications of big corporations building proprietary databases of public domain works. At the very least, we'll need some sort of federated book search engine that can leap the walls of these competing services, matching text queries to texts in Google, Microsoft and the Open Content Alliance (which to my understanding is mostly Microsoft anyway).
But more important, we should get to work with OCR scanners and start extracting the texts to build our own databases. Even when they make the files available, as Google is starting to do, they're giving them to us not as fully functioning digital texts (searchable, remixable), but as strings of snapshots of the scanned pages. That's because they're trying to keep control of the cultural DNA scanned from these books -- that's the value added to their search service.
But the public domain ought to be a public trust, a cultural infrastructure that is free to all. In the absence of some competing not-for-profit effort, we should at least start thinking about how we as stakeholders can demand better access to these public domain works. Microsoft and Google are free to scan them, and it's good that someone has finally kickstarted a serious digitization campaign. It's our job to hold them accountable, and to make sure that the public domain doesn't get redefined as the semi-public domain.
K.G. Schneider on October 17, 2006 5:02 PM:
Imagine my Shock and Awe when I was blocked from viewing an edition of a 19th-century book in Google Books because the introduction, published a couple of decades ago, was still in copyright.
I'm with you, Ben, but the big schools seem to be tumbling one by one, and they aren't paying much attention to the few objections from the peanut gallery. I thought OCA was going to gallop in and save the day; that's apparently not the case.
Gary Frost on October 17, 2006 5:28 PM:
Maybe this is common knowledge...but I am not aware that OCR is capable of dealing with the variety of fonts in older books. Are there developments in letter recognition software that would figure into such massive image processing?
In all these developments, I only hope that the libraries hold onto the books themselves. We went through whole generations of imaging and discard with microfilm. It's an issue in a culture like our own with a preference for new copy over old original.
JulietS on October 18, 2006 1:23 PM:
The many scanning and "digitization" projects are a necessary first step towards "fully functioning digital texts (searchable, remixable)" just as the Human Genome Project was a necessary first step towards comprehensive understanding of human genetics. But producing scans and raw OCR is a long way from that "searchable, remixable" goal. While modern OCR is almost perfectly accurate on modern material, it simply doesn't do as well with the vagaries of older books: strange fonts, uneven inking, foxing, blotches, etc. Its output seems to do well enough for searching on a large scale, but it is very painful for a human to read.
Production of those "searchable, remixable" texts, that will allow the grand uses envisioned by so many luminaries, will require extensive human intervention. Perhaps someday natural language processing will become good enough to, for example, distinguish between "he" and "be" (a very common OCR misread that can be incorrect in either direction) in all cases. Until then, people will have to carefully examine each page. And this is expensive.
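As a rough illustration of why that examination is labor-intensive, here is a minimal Python sketch that only flags words a proofreader would need to check. It does not decide which reading is right. The pair table and the function name `flag_confusables` are assumptions made for this example; only the "he"/"be" pair comes from the comment above.

```python
import re

# Hypothetical confusion table: each entry maps a word to the word
# OCR may have mistaken it for. Only "he"/"be" is taken from the text.
CONFUSABLE = {"he": "be", "be": "he"}

def flag_confusables(text, pairs=CONFUSABLE):
    """Return (offset, word, alternative) for every word needing a human check."""
    flagged = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group(0)
        alternative = pairs.get(word.lower())
        if alternative is not None:
            flagged.append((match.start(), word, alternative))
    return flagged

sample = "It may he that be was late."
for offset, word, alternative in flag_confusables(sample):
    print(f"offset {offset}: '{word}' (could be '{alternative}')")
```

Every hit still has to be resolved by reading the sentence, which is exactly the human step described above.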
Distributed Proofreaders (DP), at www.pgdp.net, is one approach to this problem. DP uses the time of many volunteers, in a peer production model, to proofread public domain texts. Thus far DP has produced over 9200 of those searchable, remixable texts. That is a drop in the bucket compared to the quantity that has been scanned and remains to be scanned. But, nonetheless, a start. And one that is essentially free because the site is run entirely by volunteers.
Also, it should be remembered that under current US law, scans of public domain material are themselves also public domain. No creative effort goes into them. So all of the scanning initiatives that allow access to their scans are, in fact, doing the public domain a big favor. Others, like DP, can then work from the base that they provide.
Finally, in our experience at DP, missing/bad pages are a significant problem at some of the scanning projects. I would hate to be a library relying on scans given the quality issues.
bowerbird on October 18, 2006 4:48 PM:
ok, ben, this is a _much_ better take.
i never got around to saying "wrong" to
you and jesse the last time this came up.
and, for a while there, i really thought
i'd be able to knock you out, because
the library at the university of michigan
announced last month that it's gonna be
making the o.c.r. results publicly available
from its scanning collaboration with google.
(and i saw no mention of _that_ on this blog.
didn't fit in your world-view, so you ignored it?)
indeed, i wanted to blow trumpets for umichigan,
maybe bestow upon them a brand new award --
entitled "the first annual 'michael hart' award",
signified by a big-ass gaudy trophy, of course --
given yearly to the best hero of the public domain.
one glitch is that umichigan text is only available
on a page-by-page basis. so if you want the text
for a whole book, you've gotta scrape each page...
but hey, you can write a program to do that, so
i did, and i was jazzed by this new development.
i even wrote a series of posts to "bookpeople"
-- (search for "feedback to umichigan" for it) --
walking people through the steps of scraping
one book's o.c.r. from umichigan in order to
turn it into a high-powered electronic-book,
which entails fixing scannos and formatting.
but alas, in scraping the umichigan o.c.r. text,
i was sad to find that it's _significantly_flawed_.
and i do mean _flawed_.
and i do mean _significantly_...
and you can see this for yourself, on an example page.
first the image:
and then the text:
first, notice the _paragraphing_ has been lost;
there are no blank lines between the paragraphs,
and no indentation of each paragraph's first line,
so the text on each page is one big paragraph...
you'll also see the end-of-line _hyphenates_ have
lost their hyphens, so they need to be replaced...
furthermore, _em-dashes_ have been lost as well.
and -- even worse -- the _quote-marks_ were lost.
all told, this adds up to serious fundamental breakage.
it's easier to re-do the o.c.r. -- right -- than do the work
of fixing all the problems with this umichigan o.c.r. text.
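
(editorial aside: to give a sense of what repairing just one of these problems involves, here is a rough python sketch for the lost end-of-line hyphens. it assumes a word list is available; the function name `rejoin_split_words` and the `vocabulary` argument are made up for this example and are not taken from anyone's actual tools. a word split across a line break is merged only when the joined form is a known word and the two fragments are not both words on their own.)

```python
def rejoin_split_words(lines, vocabulary):
    """Merge words split across a line break after their hyphen was dropped.

    `vocabulary` is an assumed set of lowercase known words.
    Note: runs of whitespace inside a line are collapsed to single spaces.
    """
    tokens_per_line = [line.split() for line in lines]
    for i in range(len(tokens_per_line) - 1):
        current, following = tokens_per_line[i], tokens_per_line[i + 1]
        if not current or not following:
            continue
        head, tail = current[-1], following[0]
        joined_is_word = (head + tail).lower() in vocabulary
        both_are_words = head.lower() in vocabulary and tail.lower() in vocabulary
        if joined_is_word and not both_are_words:
            current[-1] = head + tail   # restore the whole word on the first line
            following.pop(0)            # drop the fragment from the next line
    return [" ".join(tokens) for tokens in tokens_per_line]

words = {"the", "ship", "was", "remarkable", "able", "in", "every", "way"}
print(rejoin_split_words(["the ship was remark", "able in every way"], words))
```

(it ignores trailing punctuation on the fragments and cannot tell a dropped hyphen from a true compound, and the paragraphing, dashes and quote-marks would each need their own repair.)
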
finally, i have already found some books (just out of
the half-dozen i looked at) that have _lousy_ o.c.r.
look at the awful o.c.r. on a book from jules verne:
all in all, this text is so pathetic that i myself would be
_badly_embarrassed_ to release this o.c.r. to _anyone_
-- _anywhere_ -- let alone a _university_ community
and the public at large! it gives "shoddy" a bad name...
and, to add insult to injury, the only librarian from
umichigan who has been willing to go on the record
seems unconcerned about this atrocious quality --
plus their badly-flawed infrastructural architecture,
which is downright _hostile_ to user-remixability --
and even tried to blame google for the poor o.c.r.,
a dodge that's unsupported by the fact that a search
of the text at google does not reveal the same flaws.
as i said, i hoped umichigan would prove you wrong;
but i am sad to report that -- at this time, and until
something changes -- i'm the one who's wrong, and
our public-domain text is still being held hostage...
of course, we _do_ have access to the scans now
-- google has even given us one-button download
for each book, which is downright gracious of 'em --
so we _can_ start on the o.c.r. ourselves, right now.
but any hope umichigan stirred up was badly dashed.
and i won't be buying any big-ass gaudy trophy soon.
bowerbird on October 18, 2006 5:13 PM:
well, as i just demonstrated, i have no problem
admitting when i've been wrong on something...
but i also have no problem pointing out
when someone else is wrong on something,
and juliet just uttered a lot of wrong stuff.
as i noted just now, i took umichigan o.c.r.
for a whole book and turned it into a finished,
high-quality electronic-book. it took me 1 hour.
i wrote up the experiment timing extensively for
that series of posts on the bookpeople listserv.
i was testing the quality of my work by doing
the experiment on a book that's already been
digitized several times now, to full accuracy...
i ended up with 14 errors across the 280 pages,
most of those being minor errors in punctuation.
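
(editorial aside: the arithmetic behind those figures, as a small python sketch. the function name `error_rate` is made up for this example; the two numbers are the ones reported above.)

```python
def error_rate(errors, pages):
    """Return (errors per page, pages per error) for a proofed book."""
    return errors / pages, pages / errors

per_page, pages_per_error = error_rate(errors=14, pages=280)
print(f"{per_page:.2f} errors per page, one error every {pages_per_error:.0f} pages")
```

(that works out to 0.05 errors per page, or one error roughly every 20 pages.)
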
this degree of accuracy is more than good enough
to turn the book out to the public, who we will
then request to serve as the "final" proofreaders,
and hope they'll notice those punctuation errors.
(and if they don't, what does that say about the
seriousness of the errors? says "macht nichts".)
and no, i didn't proof every word on every page.
given today's o.c.r. software, which gives text
that is highly accurate if the scans are good,
it's a terrible waste of human time and energy
to take that specific approach to digitization.
and i've told juliet this fact over and over,
yet she still persists in giving this shtick.
(look at the scans at distributed proofreaders,
and you'll see immediately why they get such
crappy o.c.r. from them. it's totally obvious.
google's scans, on the other hand, are better,
so are much more likely to give good o.c.r.)
it's a good thing that the volunteers over at
juliet's distributed proofreaders don't know
how badly she's wasting their time and energy,
or they wouldn't stick around for very long...