Saturday, September 17, 2016

How in the world did they scan all those books at Google?

Mr Penumbra's 24-Hour Bookstore
by Robin Sloan

I read Mr Penumbra's 24-Hour Bookstore by Robin Sloan and loved it tremendously. It tells its story in a good mesh of the conventional and the modern worlds of reading. The protagonist is Clay, a newly graduated designer who is also quite adept at programming. He lost his job and subsequently settled for the next one he could get: night-shift clerking at a 24-hour bookstore.

A 24-hour bookstore! 

I've never found one in my life, and this one is a mysterious one. The mystery concerns the books in the Waybacklist, which is Clay's name for the tall vertical shelves at the back of the store, and a fellowship of readers called the Unbroken Spine. It's a clash between two worlds: between old leatherbound books and electronically rendered ones, between codes in the pages of a book and codes on the screens of computers, between handwritten log books and a 3-D model of the bookstore, between a small, dim, musty, decrepit bookstore and the massive, polished, sleek Google.

I actually learned so much from reading this book, particularly about Hadoop and Mechanical Turk. As the story progressed, Clay went to Google to get a book scanned. It had crossed my mind before to check how Google actually got all those books scanned, but I never did. Now that I have, I am impressed.

According to Wikipedia, "as of October 2015, the number of scanned book titles was over 25 million...Google estimated in 2010 that there were about 130 million distinct titles in the world, and stated that it intended to scan all of them." That is only about 19% scanned. 
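A quick way to check that percentage is to divide the two figures from the quote. This is only a minimal sketch; the variable names are made up for illustration, and the numbers are the rounded ones Wikipedia gives.

```python
# Figures taken from the Wikipedia quote above (both are rounded estimates).
scanned_titles = 25_000_000      # "over 25 million" as of October 2015
estimated_titles = 130_000_000   # Google's 2010 estimate of distinct titles

# Share of the world's estimated titles that had been scanned.
fraction_scanned = scanned_titles / estimated_titles

print(f"{fraction_scanned:.1%}")  # prints 19.2%
```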

And how did they scan such a massive number of pages?

"Many of the books are scanned using a customized Elphel 323 camera at a rate of 1,000 pages per hour. A patent awarded to Google in 2009 revealed that Google had come up with an innovative system for scanning books that uses two cameras and infrared light to automatically correct for the curvature of pages in a book. By constructing a 3D model of each page and then "de-warping" it, Google is able to present flat-looking pages without having to really make the pages flat, which requires the use of destructive methods such as unbinding or glass plates to individually flatten each page, which is inefficient for large scale scanning."

I could not find out what the Google scanning machine or contraption looks like, but I found this awesome high-speed book scanner I would love to get my hands on. It scans 250 pages a minute, 15 times faster than the Google one. And if I had one, I'd scan all of the 400-plus books in my library in just 20 hours! So I would just need to rent it for a day or two. I could ask the Ishikawa Oku Laboratory at the University of Tokyo, or not.
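The speed comparison can be worked out in a few lines. The sketch below uses only the rates quoted in this post; the 300-pages-per-book figure at the end is a purely hypothetical average added for comparison, not a number from the post or from either scanner's specifications.

```python
# Rates quoted in this post.
GOOGLE_PAGES_PER_HOUR = 1_000   # from the Wikipedia quote
FAST_PAGES_PER_MINUTE = 250     # the high-speed scanner

# Put both rates in the same unit before comparing.
fast_pages_per_hour = FAST_PAGES_PER_MINUTE * 60           # 15,000 pages per hour
speedup = fast_pages_per_hour / GOOGLE_PAGES_PER_HOUR      # 15.0

# What the "400 books in 20 hours" estimate implies about book length.
books = 400
hours = 20
implied_pages_per_book = fast_pages_per_hour * hours / books  # 750.0

# Hypothetical alternative: an assumed average of 300 pages per book.
assumed_pages_per_book = 300
hours_at_assumed_length = books * assumed_pages_per_book / fast_pages_per_hour  # 8.0

print(speedup, implied_pages_per_book, hours_at_assumed_length)
```

So the "15 times" figure holds, and the 20-hour estimate corresponds to books averaging about 750 pages each. With shorter books, and ignoring the time spent swapping books in and out, the job would go faster still.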



  1. 250 pages a minute? That is crazy fast.

    If you could scan all of your books would you donate or sell the paper versions?