Need to extract text from a scanned document? It is probably a lot easier than you think, and you may already have the software to do it. In this post we explore how to use Microsoft® Office Document Imaging, which comes with Office 2003 and 2007.
Microsoft Office Document Imaging (MODI)
Microsoft Office Document Imaging performs text recognition using optical character recognition (OCR) and comes with Office 2003 and 2007.
What makes a scanned document different from other documents? A scanned document does not contain actual text, but rather a “snapshot” of text, much like a photo taken with a digital camera.
OCR recognizes characters from images of text and converts them into actual text characters. This process makes it possible to edit that text by sending it to Microsoft Word or to find the file later using a keyword search. The results generally are not perfect, but this process can save considerable time and labor over having to re-create an entire document.
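If you would rather script this step than click through the MODI window, the same OCR engine can be driven from code. The following is a minimal sketch, not a tested recipe: it assumes a Windows machine with MODI installed and the third-party pywin32 package. The function name `ocr_with_modi` and its `dispatch` parameter are invented for this illustration, and the COM calls (`MODI.Document`, `Create`, `OCR`, `Images`, `Layout.Text`) and the language constant reflect the MODI object model as commonly documented, so verify them against your own installation.

```python
# Assumption: 9 is the MODI language id for English (miLANG_ENGLISH).
MI_LANG_ENGLISH = 9


def ocr_with_modi(image_path, dispatch=None):
    """Run MODI OCR on a scanned image and return the recognized text.

    `dispatch` is a hypothetical hook for supplying a COM factory;
    by default pywin32's Dispatch is used (Windows only).
    """
    if dispatch is None:
        # Imported lazily so the module still loads where pywin32 is absent.
        from win32com.client import Dispatch as dispatch

    doc = dispatch("MODI.Document")
    doc.Create(image_path)  # load the scanned file (for example a TIFF)
    try:
        # Arguments: language, auto-orient the page, straighten the page.
        doc.OCR(MI_LANG_ENGLISH, True, True)
        pages = [
            doc.Images.Item(i).Layout.Text  # recognized text of page i
            for i in range(doc.Images.Count)
        ]
        return "\n".join(pages)
    finally:
        doc.Close(False)  # close without saving changes to the image
```

Calling `ocr_with_modi(r"C:\scans\letter.tif")` (a made-up path) would return the recognized text as one string, which you could then save to a file or search for keywords.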
(Image created using Solid Capture Screen Capture)
The quality of the text created by MODI depends in large part on the quality of the scanned document. If you have a poor-quality image to work with, you may get poor results in your DOC file.
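To put a rough number on how much scan quality matters, you can compare the OCR output against a short passage you have typed by hand. The sketch below uses Python's standard `difflib` module for this. The helper name `ocr_similarity` is invented for this example, and the ratio it returns is only a crude similarity score, not a formal OCR accuracy metric.

```python
from difflib import SequenceMatcher


def ocr_similarity(recognized, reference):
    """Return a similarity ratio between 0.0 and 1.0 for two texts."""
    # Collapse whitespace so line-wrapping differences are not counted as errors.
    a = " ".join(recognized.split())
    b = " ".join(reference.split())
    return SequenceMatcher(None, a, b).ratio()


# A clean scan might come back verbatim; a poor one often confuses
# look-alike characters such as "l" and "1".
clean = ocr_similarity("Hello world", "Hello world")
noisy = ocr_similarity("He1lo wor1d", "Hello world")
print(clean > noisy)
```

Scanning the same page at two settings and comparing the scores is a quick way to see whether a rescan is worth the effort before you consider buying new hardware or software.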
In many cases it will help to get a better scanner (or purchase better OCR software), but it never hurts to try the tools you already have before spending additional money.