Google's Book Search: A Disaster for Scholars?

Your humble Northwest History blogger is sometimes accused of being a Google fanboy. A fair cop. But you know who is not a Google fanboy? Geoffrey Nunberg, that is who. Over at the Chronicle of Higher Education, Nunberg has a witty jeremiad, "Google's Book Search: A Disaster for Scholars."

Nunberg's beef is with Google's sloppy and commercially driven metadata schemes. He demonstrates that even with such a basic item as date of publication, Google Books very frequently gets it wrong. This in turn often corrupts search results: "A search on 'Internet' in books published before 1950 produces 527 results; 'Medicare' for the same period gets almost 1,600." By comparing Google's data to that found in the catalogues of the contributing libraries Nunberg shows that these errors do in fact belong to Google, not to their partners.

Nunberg also whacks Google for classification errors, where books are placed in the wrong categories: "H.L. Mencken's The American Language is classified as Family & Relationships. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as Antiques and Collectibles . . . An edition of Moby Dick is labeled Computers; The Cat Lover's Book of Fascinating Facts falls under Technology & Engineering."

Worst of all to Nunberg is Google's adoption of the Book Industry Standards and Communications categories for Google Books, which he describes as a modern commercial invention used to sell books, rather than a scholarly system of classification like the Library of Congress subject headings: "For example the BISAC Juvenile Nonfiction subject heading has almost 300 subheadings, like New Baby, Skateboarding, and Deer, Moose, and Caribou. By contrast the Poetry subject heading has just 20 subheadings. That means that Bambi and Bullwinkle get a full shelf to themselves, while Leopardi, Schiller, and Verlaine have to scrunch together in the single subheading reserved for Poetry/Continental European. In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore."

I think that Nunberg has a number of good points, points he gathers together to form a molehill, from which he conjures up a mountain. Google's metadata may be everything he says (and I think he is probably right), but how great a problem is that really? This scholar at least uses Google Books either 1) to locate a digital copy of a book I already know about, or 2) via a string of search terms. In the first case, it is not relevant to me that Google has classified Adventures of Huckleberry Finn under "wild plants" or whatever. I know perfectly well what it is, and just want to find a quote I remember.

In the second case, I might search for mentions of the Columbia River in books published before 1860. And suppose a faulty date in Google's database brings me to something written after 1860. So what? Surely when I click on the link and find myself reading Sherman Alexie instead of Lewis and Clark, I will notice the fact. (Actually I just did the search and on the first 10 pages of results I don't see any errors at all. Take that, Nunberg.)

So for which scholars exactly is Google Book Search a "disaster"? Nunberg cites "linguists and assorted wordinistas" who are "adrenalized" at the thought of data mining to "track the way happiness replaced felicity in the 17th century, quantify the rise and fall of propaganda or industrial democracy over the course of the 20th century, or pluck out all the Victorian novels that contain the phrase 'gentle reader.'" But who does this? OK, I know that people do it, but most data mining of this type has always struck me as more of a parlour trick than actual scholarship.

The other thing Nunberg ignores is that metadata is not that hard to fix. Google already provides a "feedback" button on every virtual page so readers can report unreadable or missing pages. If we howl loud enough we could easily see similar feedback mechanisms on the "More book information" page so we could correct names and dates and categories.

Nunberg is absolutely correct to recognize the monumental importance to scholars of the Google Book Search project. It is vital that scholars take a critical stance that will push Google to improve the project and make it even more useful. His article is a valuable push in that direction.

UPDATE 9/3/09: Reader Ed points out that Geoff Nunberg also posted a nicely illustrated version of his article on the blog Language Log, and got a brief response in the comments from Jon Orwant, who manages the metadata at Google Books.