Mining Legacy Content with OCR

As a publisher, you invariably have stacks of legacy documents: yellowing pages packed away in dusty cabinets and boxes that are bursting at the seams, mostly ignored or even forgotten. Well, you are sitting on a gold mine. With OCR (optical character recognition) technology, you can now convert (read: repurpose) those sneeze-inducing pages, and even obsolete archival formats, into digital assets and channel them into different platforms to generate new revenue streams. The question that springs to mind must be: What is possible with OCR?

Take one recent project handled by KnowledgeWorks Global Limited (KGL), a Cadmus Communications company: a Phi Beta Kappa directory that was completed within a month of receipt. Some 122,080 membership records, spanning 1780 to 1941, were scanned and then converted into searchable image-enhanced 300-dpi PDFs. “The extremely small type in three columns posed the biggest challenge, requiring rekeying to ensure high data accuracy,” said COO Atul Goel. “And since the institution did not have a DTD [document type definition], our service was extended to include the development of a semantic and tailor-made DTD that would accommodate its various record formats.”

Mumbai-based KGL offers numerous value-added services to complement its basic OCR workflow, such as enhancing scanned PDFs with bookmarks, tables of contents, and internal and external links. “More complex services include e-book conversion, Section 508-compliant PDF conversion and optimized print-on-demand file production. There is also an XML header service that captures article metadata, titles, authors, abstracts and reference lists with sufficient structural detail to enable citation linking,” said Goel, whose team also handled a series of 19th- and 20th-century engineering index books containing 1.7 million abstracts spanning 1884 to 1969.
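As a rough illustration of what such an XML header might capture, the sketch below builds one with Python's standard library. The element names (`article-header`, `ref` and so on), the `build_header` helper and the sample record are all hypothetical; they are not KGL's actual DTD or data.

```python
import xml.etree.ElementTree as ET


def build_header(title, authors, abstract, references):
    """Build a minimal article header; element names are illustrative only."""
    header = ET.Element("article-header")
    ET.SubElement(header, "title").text = title
    author_group = ET.SubElement(header, "authors")
    for name in authors:
        ET.SubElement(author_group, "author").text = name
    ET.SubElement(header, "abstract").text = abstract
    ref_list = ET.SubElement(header, "references")
    for number, ref in enumerate(references, start=1):
        # Keep each part of a reference in its own element so that a
        # citation linker can match on author, title and year separately.
        node = ET.SubElement(ref_list, "ref", id=f"r{number}")
        for field in ("author", "title", "year"):
            ET.SubElement(node, field).text = ref[field]
    return header


# Made-up sample record, for illustration only.
header = build_header(
    title="Sample Article",
    authors=["A. Author", "B. Author"],
    abstract="A short abstract.",
    references=[{"author": "C. Writer", "title": "An Earlier Work", "year": "1950"}],
)
xml_text = ET.tostring(header, encoding="unicode")
print(xml_text)
```

Splitting each reference into separate fields, rather than storing it as one string, is what would give a downstream system enough structure to attempt citation linking.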

Non-English content and special characters, said Goel, represent the greatest OCR challenges. “In such projects, a custom dictionary has to be built into the scanning process. Additionally, mathematical equations and colored tables have to be converted to images. Bad hard copies (wrinkled, misaligned and blotchy pages) and those with colored backgrounds or light-colored text create ‘noise’ during scanning and significantly reduce character recognition.”

To date, KGL has tackled origination materials ranging from hard copies (which sometimes require nondestructive scanning and careful handling, especially when they are the only copies remaining) to microfilms and microfiches, disks, large maps and radio survey forms. The survey forms are handwritten, and KGL has developed a sophisticated system for them, with built-in validation, to capture the data accurately. Maps, according to Goel, are the most challenging format in the OCR universe, a sentiment shared by Ashok Hinduja, head of the data unit at Chennai-based Newgen Imaging.

For Newgen’s 260-strong data unit, the largest archival project this year totaled 3.5 million pages, involving nondestructive scanning of maps and archives dating back to the mid-18th century. The project was completed within a year, six months ahead of schedule. Process-wise, the ball started rolling even before the first page got to the OCR machine, with a host of logistical issues to resolve, including humidity-controlled storage, shipping coordination, customs clearance, bar-coding of pages and creation of a database of materials.

“OCR project requirements vary in complexity,” said Hinduja. “We may be called on to create a Web-based storage and retrieval system, set up protocols to process the materials, distribute the content to third-party hosts and aggregators, or update and release new editions. In some cases, we assign full-time staff exclusively to a project to load and manage data, even after the content management system has been built. Presently, we are managing a technical publication requiring periodic updates in several languages.”

As to what publishers need to understand about OCR/scanning, Hinduja had this advice: know broadly what you want as a final product, but don’t foreclose any options for getting there. “If you don’t know what all the possible future requirements are, then we will lead you toward a flexible archiving format. Take e-book conversion as an example: one approach is to convert directly to the required format; another is to create a generic, richly tagged format that can be automatically reconverted to several other standards or proprietary formats. The latter, of course, will cost slightly more, but any future conversions will cost way less.”
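The second approach Hinduja describes, one richly tagged source that is reconverted automatically into several formats, can be sketched roughly as follows. This is only an illustration under stated assumptions: the tag names in `SOURCE`, the two converters and the sample text are invented here and do not represent Newgen's archiving format or tools.

```python
import html
import xml.etree.ElementTree as ET

# A made-up "generic, richly tagged" source; the tag names are illustrative.
SOURCE = """<book>
  <title>Sample Title</title>
  <chapter><heading>One</heading><para>First &amp; only paragraph.</para></chapter>
</book>"""


def to_html(book):
    """Render the tagged source as simple HTML."""
    parts = [f"<h1>{html.escape(book.findtext('title'))}</h1>"]
    for chapter in book.iter("chapter"):
        parts.append(f"<h2>{html.escape(chapter.findtext('heading'))}</h2>")
        for para in chapter.iter("para"):
            parts.append(f"<p>{html.escape(para.text)}</p>")
    return "\n".join(parts)


def to_plain_text(book):
    """Render the same tagged source as plain text."""
    lines = [book.findtext("title").upper(), ""]
    for chapter in book.iter("chapter"):
        lines.append(chapter.findtext("heading"))
        lines.extend(para.text for para in chapter.iter("para"))
    return "\n".join(lines)


# One parse of the tagged source feeds every output format; supporting a
# new format later means adding a converter, not rescanning or rekeying.
CONVERTERS = {"html": to_html, "txt": to_plain_text}
book = ET.fromstring(SOURCE)
outputs = {name: convert(book) for name, convert in CONVERTERS.items()}
print(outputs["html"])
```

The up-front cost lies in producing the tagged source; after that, each additional output format is a comparatively small converter, which is the trade-off the quote points to.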

To sum up, with OCR/scanning technology having improved by leaps and bounds in recent years, almost anything is now convertible.

Note: This is the fourth column in a regular series highlighting content/publishing services provided by India-based companies.

A basic digitization process