Every now and then, I like to pass along tips I've learned from my work preparing manuscripts for e-formatting. (My other posts on this subject: A Learning Process, Paragraph Indents, Styles and Formatting Palette.)
Today I want to discuss what I've learned from formatting OCR-scanned copy. OCR stands for Optical Character Recognition, and it's the most common way of transferring copy from a printed page into the computer as editable text. Authors and publishers do this when they want to republish older printed books in e-format but don't have (or don't have access to) a computer file of the book.
Tip No. 1—If you use this process to e-publish a book, proofread, proofread and proofread again. REASON: OCR scanning is notorious for generating typos—especially if scanning from serif fonts.
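For example, I recently formatted a scanned book in which the "fi" at the start of every word had been changed to "h". So, the word first ended up as hrst everywhere it was used in the book. Likewise, the "fl" at the start of words had been changed to "H". (A likely cause: many serif fonts print fi and fl as single joined characters, called ligatures, which OCR software can misread.)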
Even though those squiggly red lines in MS Word can drive us crazy, pay close attention to them if the copy has been scanned.
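If you're comfortable running a small script, the same kind of check can be roughed out in Python. This is only a sketch under assumptions: the function name suggest_ligature_fixes and the tiny word list are made up for illustration, and the "h" and "H" substitutions are just the ones from my example above. Your scan may garble letters differently, and a real check would need a full dictionary.

```python
import re

def suggest_ligature_fixes(text, known_words):
    """Return {suspect_word: suggested_fix} for likely fi/fl misreads."""
    # Assumed mapping, taken from the example above: "fi" became "h", "fl" became "H".
    misreads = {"h": "fi", "H": "fl"}
    suggestions = {}
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in known_words:
            continue  # already a real word, leave it alone
        replacement = misreads.get(word[0])
        if replacement is None:
            continue  # does not start with a suspect letter
        candidate = replacement + word[1:]
        if candidate.lower() in known_words:
            suggestions[word] = candidate
    return suggestions

# Hypothetical, deliberately tiny word list; use a real dictionary in practice.
known = {"the", "first", "flower", "opened", "at", "her", "house"}
sample = "The hrst Hower opened at her house."
print(suggest_ligature_fixes(sample, known))
# prints {'hrst': 'first', 'Hower': 'flower'}
```

The script only suggests fixes. It doesn't change the text, so you still get to proofread each one yourself.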
Tip No. 2—Copy your document into a plain-text program such as Notepad to clear all formatting before pasting it into MS Word. Yes, it means extra work, because you'll have to go through the book and reapply all italics and other font formatting, but it's worth it. REASON: OCR scanning can play havoc with formatting.
For instance, if during the printing process, the press (or printer) laid down the black ink too heavily, the scanner can interpret that as bold type. Formatting errors like these aren't always visible to the naked eye.
For example, a book I formatted had about 95 bold HTML tags scattered throughout the document, some in very odd places such as on a period at the end of a sentence. All of these had to be weeded out to give readers a clean, readable copy of the book.
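For anyone working directly with the HTML, here is a rough Python sketch of how stray bold tags like these could be counted and stripped in one pass. It assumes the bold is marked with plain b or strong tags (a scan might instead use styled spans, which this won't catch), and the name strip_bold_tags is my own invention. Note that it removes every bold tag, wanted or not, so any intentional bold has to be put back by hand afterward.

```python
import re

# Matches opening and closing <b> and <strong> tags, but not <br> or <body>.
BOLD_TAG = re.compile(r"</?(?:b|strong)\b[^>]*>", re.IGNORECASE)

def strip_bold_tags(html):
    """Return (cleaned_html, number_of_tags_removed)."""
    return BOLD_TAG.subn("", html)

sample = "<p>It was late<b>.</b> She <i>ran</i> home.</p>"
cleaned, removed = strip_bold_tags(sample)
print(cleaned)  # prints <p>It was late. She <i>ran</i> home.</p>
print(removed)  # prints 2
```

The count is handy as a sanity check: if it comes back far higher than the amount of bold you remember seeing in the printed book, the scanner probably invented most of it.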
In the end, don't we all want to give readers the best possible e-book we can produce, even if it means a little extra work?