October 7, 2009
Why Is Google Giving Us the Finger?
Like just about every professional writer and reader, I have been curious about Google’s much-debated library of scanned books — for personal reasons. After critics of the Google Books project charged the company with copyright infringement, a tentative agreement was reached last year that promises to pay authors $60 for the rights to copy each of their publications, with other fees to come. But I’m less interested, frankly, in any future royalties than in the benefits of instant access to a library that is estimated to eventually top 20 million books.
So when a mobile version of Google Book Search showed up among the apps offered on my relatively new iPhone, I tried it out. I was delighted to find that I could browse every issue of Life magazine, from 1936 on, much as I had as a child (though I no longer retreated to the dark closet under the stairs in the decrepit ancient house of my great-aunt Olive). And I learned some surprising things.
One of the sample texts on Book Search was a Joseph Conrad novella from 1917, The Shadow Line. I read it while stranded in an airport waiting room, happy for the emergency material. The novella put me in mind of Conrad’s Under Western Eyes, a 1911 novel about terrorism that struck me as having renewed relevance for our time. I searched the free-books list on Book Search, and there it was, in the public domain.
The type was clear, and I found it easy to drag the text down the screen with my thumb — easier than with the Kindle or the Sony Reader, with its irritating page-refresh flicker. But I noticed that the scanning process occasionally stuttered. Bits of grit or loose paper appeared to throw off the character-recognition software. French phrases so confused the device that it threw in asterisks, tildes and carets. Now and then, it would give up completely and erupt in a string of dingbats like comic-book cursing. A couple of underlined sentences were suddenly reproduced photographically in the original book type rather than the screen type. Then the device seemed to hit the virtual carriage return a few times, producing a three-quarter-inch blank space.
I was surprised that Google didn’t make use of a higher class of scanner. And I was really surprised by what happened next: like a dirty photo falling from between the pages of a book, a photograph popped up.
It showed the hand of whoever fed pages into the scanner — a hand with a latex sheath on its index finger, like a condom. The person’s nails were nothing to brag about. The condom and the nails, combined with the sudden, unexpected appearance, made the picture seem obscene and unhealthy. I thought with horror of the guy who found a finger in his bowl of fast-food chili.
Was this the literal hand of Google? The fickle finger of the company that holds my copyrights? The sticky fingers that, to hear some tell it, threaten to grab our literary heritage?
I wondered what such sloppiness said about the book-scanning project — about how much we can trust Google and how much we should fear it. Even as people involved with publishing have debated the issue of Google’s right to digital content, most of us, impressed with the company’s search engine and maps, have assumed it would at least get the technical part right.
Rereading press coverage of Google Books, I learned that others had found finger photos, and some had posted them online. But these technical concerns were crowded out by the lovefest that prominent writers lavished on the project. Take Jeffrey Toobin’s sloppy kiss to the deal in the February 5, 2007, New Yorker.
In Toobin’s account, details about the scanning process are not so easy to pin down. He depicts Google’s chief scanner, Dan Clancy, a NASA veteran, as a lovable geek with granola-bar crumbs clinging to his clothes.
Clancy tells Toobin that the project’s enormous scope required the development of special scanning tools and leaves it at that. Says Toobin, “Google will not discuss its proprietary scanning technology, but, rather than investing in page-turning equipment, the company employs people to operate the machines, I was told by someone familiar with the process. ‘Automatic page-turners are optimized for a normal book, but there is no such thing as a normal book,’ Clancy said. ‘There is a great deal of variability over books in a library, in terms of size or dust or brittle pages.’”
According to a Wikipedia contributor, Google currently uses Elphel cameras for book scanning. These were apparently adapted from models used to capture street imagery for Google Maps. (Elphel is a little-known company based in Utah that, ironically, given Google’s secrecy, uses open-source software to operate its equipment.)
Some critics, of course, have highlighted concerns about the technical side of Google Books. In August, the linguist Geoffrey Nunberg, writing in The Chronicle of Higher Education, attacked the project for errors in the data used to file the books: author, title, subject and year of publication, to begin with the most basic classification elements.
Nunberg wrote that the “book search’s metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess…”
If you take Google’s word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler’s Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams’s Culture and Society 1780–1950, and Robert Shelton’s biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf’s letters is dated 1900, when she would have been 8 years old. Tom Wolfe’s Bonfire of the Vanities is dated 1888, and an edition of Henry James’s What Maisie Knew is dated 1848.
Part of the problem lies in the stupidity of the software — or of the grayware, the humans running it. But the scanning technology is also at fault, Nunberg believes: simple misreadings of the copyright page, for instance, seem to lie behind many of the incorrect dates.
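Some of these misdatings would surrender to even the crudest automated check. A minimal sketch, using hypothetical catalog records modeled on the examples above (not actual Google Books data), of flagging any book whose listed publication year precedes its author’s birth:

```python
# A sanity check of the kind Nunberg's examples suggest: flag records whose
# publication date predates the author's birth. The records below are
# illustrative, not drawn from Google's actual metadata.

def implausible_dates(records):
    """Return titles whose listed publication year precedes the author's birth year."""
    return [r["title"] for r in records if r["pub_year"] < r["author_born"]]

catalog = [
    {"title": "The Bonfire of the Vanities", "author_born": 1930, "pub_year": 1888},
    {"title": "Christine",                   "author_born": 1947, "pub_year": 1899},
    {"title": "The Shadow Line",             "author_born": 1857, "pub_year": 1917},
]

flagged = implausible_dates(catalog)
# The Wolfe and King entries are flagged; the correctly dated Conrad passes.
```

A real catalog would need authority records for authors’ dates, but the principle — cross-checking one field against another — is ordinary library practice.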
Nowhere in Google’s FAQs, or anywhere else, is there a clear answer to how books are physically scanned: whether they are disassembled in the process; what measures are taken to avert damage, especially to older, more fragile volumes with dry bindings and acidic paper; or what action readers or authors can take if they encounter errors in scanning, dating or classification.
Nor has Google’s press department answered my email asking these questions.
So it is likely that the company will also ignore this question: If the process of creating Google Books is open and its motives good, why is there so much secrecy about the nuts and bolts? Many experts believe there is room for only one digital super-library, and that Google’s is it. Geoffrey Nunberg writes, “No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project.” Why, then, does Google seem to fear competition from disclosing information about the process?
In an October 9 New York Times op-ed piece, Sergey Brin promised to improve the bibliographic information in Google Books. But he said nothing about scanning errors and seemed to dispute the prediction that the service is likely to become a de facto monopoly. Writing about the millions of out-of-print books threatened with extinction, the books he aims to preserve, he said, “I wish there were a hundred services with which I could easily look at such a book; it would have saved me a lot of time, and it would have spared Google a tremendous amount of effort. But despite a number of important digitization efforts to date (Google has even helped fund others, including some by the Library of Congress), none have been at a comparable scale, simply because no one else has chosen to invest the requisite resources. At least one such service will have to exist if there are ever to be one hundred. If Google Books is successful, others will follow.”
If there are to be many libraries, it is all the more important to get the quality of the original scans right. The same files might serve as material not only for other libraries but also for other formats, including Kindle or open-source-based readers.
Concentrating power and responsibility for any purpose in the hands of a single entity is rarely positive. You don’t have to read millions of scanned books to glean that lesson. Just try Suetonius, The Federalist Papers, Barbarians at the Gate or All the King’s Men. You can find them for free — at your public library.
By Phil Patton