Monday, January 13, 2014

PDF Files - Converting to Word

Question: When I convert a PDF file to a Word document not all of the characters are correct.  Why does this happen?

Answer: Not all PDFs are made the same way.

PDFs - broadly speaking, there are two types of PDF files: those scanned from paper and those created on a computer.

Scanned PDFs - each page is an image. OCR (optical character recognition) has to look at all the dots of ink and work out which character each group of dots represents.

Based on patterns, our software tries to determine characters, images, fonts, etc. Some dots are noise and others are diacritics, and we try to tell them apart as best we can. The quality of the scan and the font type both affect the analysis and how successful detection is.
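
To give a feel for the noise-versus-diacritic problem, here is a minimal Python sketch. It is an illustration only, not the actual OCR logic: the `Box` type, the `classify_blob` function and the `max_gap_ratio` threshold are all hypothetical names and values. It assumes a small blob counts as a diacritic only if it is smaller than a letter, overlaps that letter horizontally, and sits just above or below it.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixels; y grows downward (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def classify_blob(blob: Box, letter: Box, max_gap_ratio: float = 0.5) -> str:
    """Label a small blob as 'diacritic' or 'noise' relative to one letter.

    Assumed rule (for illustration only): the blob must be smaller than the
    letter, overlap it horizontally, and sit directly above or below it
    within max_gap_ratio * letter height.
    """
    smaller = blob.width * blob.height < letter.width * letter.height
    overlaps = blob.left < letter.right and blob.right > letter.left
    if blob.bottom <= letter.top:
        gap = letter.top - blob.bottom      # blob is above the letter
    elif blob.top >= letter.bottom:
        gap = blob.top - letter.bottom      # blob is below the letter
    else:
        gap = None                          # vertically overlapping: not a separate mark
    near = gap is not None and gap <= max_gap_ratio * letter.height
    return "diacritic" if (smaller and overlaps and near) else "noise"


letter = Box(left=10, top=20, right=30, bottom=50)
accent = Box(left=15, top=12, right=25, bottom=17)   # just above the letter
speck = Box(left=60, top=5, right=63, bottom=8)      # off to the side

print(classify_blob(accent, letter))
print(classify_blob(speck, letter))
```

A poor scan shifts or merges these blobs, which is one reason scan quality matters so much for detection.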

Computer generated PDFs - these PDFs usually include font information. PDFs are built with mapping/encoding details; they do not contain characters, words, sentences, paragraphs or headers/footers as such. We work all of this out and reconstruct the document with its structure. Some PDF creators do not use standard encoding, so again we have to do some detective work to determine what is in the PDF. If the PDF was computer generated and includes the font details, and the computer you use to convert to Microsoft Word has the same font installed, then you should get those fonts in the Word file. For example, we convert computer generated CJK (Chinese, Japanese, Korean) PDFs into correct Word documents.
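
The encoding point can be illustrated with a small Python sketch. This is a simplified picture under stated assumptions, not how any particular converter works: the dictionaries stand in for the code-to-character mapping a PDF font may carry, and `decode_codes`, `standard_map` and `incomplete_map` are hypothetical names. When the mapping is incomplete or non-standard, the affected characters cannot be recovered directly.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character, shown when a code is unknown


def decode_codes(codes, to_unicode):
    """Translate font character codes into text using a code-to-character map.

    Codes missing from the map become the replacement character, which is
    roughly what an incorrect character in the converted document means.
    """
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)


# Hypothetical mappings: the first is complete, the second lacks the entry for "é".
standard_map = {1: "c", 2: "a", 3: "f", 4: "é"}
incomplete_map = {1: "c", 2: "a", 3: "f"}

codes = [1, 2, 3, 4]  # the same page content in both cases

print(decode_codes(codes, standard_map))
print(decode_codes(codes, incomplete_map))
```

With the complete map the codes decode to the intended word; with the incomplete one the last character is lost, and some other method (such as OCR on the rendered glyph) is needed to recover it.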

Fonts - if we have the information from the PDF, or we determine the language/font, and it includes diacritics, we use the correct characters for that font/language. We have done a lot of work to support foreign languages in both our OCR and our standard PDF conversion software.

Forcing OCR on computer generated PDFs - turning on text recovery for all characters and pages forces every PDF through our OCR processing, which is usually not the best solution. With non-standard encoded documents it can provide a better conversion, but if your PDF is a standard computer generated one, our default document reconstruction processing is best.
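
As a rough way to think about when forcing OCR might help, the Python sketch below measures how much of the extracted text looks wrongly decoded. It is only an illustration and not a feature of the software: `suspicious_ratio`, `should_force_ocr` and the 0.2 threshold are assumptions made for this example.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look wrongly decoded.

    A character is treated as suspicious if it is the Unicode replacement
    character, or falls in the private-use (Co), unassigned (Cn) or
    control (Cc) categories.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for ch in chars
        if ch == "\ufffd" or unicodedata.category(ch) in {"Co", "Cn", "Cc"}
    )
    return bad / len(chars)


def should_force_ocr(text: str, threshold: float = 0.2) -> bool:
    """Suggest OCR only when a large share of the text looks broken.

    The threshold is an arbitrary value chosen for this example.
    """
    return suspicious_ratio(text) > threshold


print(should_force_ocr("A standard computer generated page."))
print(should_force_ocr("\ue001\ue002\ue003 ok"))
```

Text from a standard PDF scores near zero, so the default reconstruction is the better choice; text full of private-use or replacement characters suggests a non-standard encoding, where OCR may do better.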