Wednesday, June 23, 2010

Best Practices for Converting PDF Files

Question: I have content I need to repurpose that is held captive in a PDF file. What do I do?
Answer: Solid Converter® PDF and Solid PDF Tools offer several options for PDF to Word conversion because one size does not fit all. Here’s how to get the best conversion for what you need.

Conversion options available in Solid PDF Tools and Solid Converter PDF include:

  1. Flowing - recovers page layout, columns, formatting and graphics, and preserves text flow
  2. Exact - recovers exact page presentation using text boxes in Microsoft Word
  3. Continuous - detects layout and columns but only recovers formatting, graphics and text

When should you use each? We get asked this question a lot, and we usually follow up with a question of our own:

"What are you intending to do with the Word document (recovered PDF content) once you have converted it?" Read on to see how to choose the best conversion method for your repurposing task.

Flowing Mode:

Why are there page breaks in Flowing mode?

Word documents do not have pages in the same sense as PDF files. They are re-paginated by Word whenever the document is loaded, and the page breaks can shift depending on all sorts of things (local fonts, paper size, print margin changes, etc.). The software has to balance the customer's first impression (they expect WYSIWYG) against editability. It is exceptionally hard to get a layout in Word that perfectly matches the one in the PDF. Word is limited in many ways: font sizes can only be set to the nearest half point, and kerning and spacing are less than ideal. We use all sorts of techniques to make the layout match. Without "page breaks", a single minor layout error early in a document will cascade from page to page and mess up all subsequent pages. This comes back to the question "what are you intending to do with the Word document?" Read on ...
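To make the cascade concrete, here is a minimal sketch. It is a toy model, not how Solid's software works: the function name `misaligned_pages`, the 40-line page capacity and the one-line overflow on the first page are all assumptions chosen for illustration. It reports which source pages no longer start at the top of an output page, with and without hard page breaks.

```python
def misaligned_pages(reconstructed_lines, capacity, hard_breaks):
    """Return indices of source pages that do not start at the top of an output page.

    reconstructed_lines: lines each source page occupies after conversion (assumed values).
    capacity: lines that fit on one output page (assumed value).
    hard_breaks: whether a hard page break is placed before each source page.
    """
    misaligned = []
    cursor = 0  # absolute line position in the output document
    for index, lines in enumerate(reconstructed_lines):
        if hard_breaks and cursor % capacity != 0:
            # A hard break pushes this source page to the top of a fresh output page.
            cursor += capacity - (cursor % capacity)
        if cursor % capacity != 0:
            misaligned.append(index)
        cursor += lines
    return misaligned


if __name__ == "__main__":
    # Page 0 comes out one line too long; the other pages fit exactly.
    pages = [41, 40, 40, 40, 40]
    print("without breaks:", misaligned_pages(pages, 40, hard_breaks=False))
    print("with breaks:   ", misaligned_pages(pages, 40, hard_breaks=True))
```

In this toy model, the single extra line on the first page shifts every later page when there are no breaks. With breaks, the overflow is confined to one short spill-over page and the rest of the document stays aligned.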

Exact Mode:

Exact mode is more than just "converting the PDF into text boxes". We do great paragraph-level reconstruction, combining PDF text chunks into larger logical text units that are editable at a paragraph level, in text boxes. This mode is the layout engine behind our new PDF to PowerPoint converter. Here is a perfect case where hard page breaks (between slides) make a lot of sense and exact mode is more desirable than flowing or continuous. If you take a PDF presentation and convert it to PowerPoint, we use exact mode behind the scenes and the results are fantastic. However, try taking a 200-page legal contract in PDF and converting it to PowerPoint: what would you expect the PowerPoint slideshow to look like? Exactly! A mess. "What are you intending to do with the Word document?"

Continuous Mode:

What about HTML? When converting from PDF to HTML, we automatically use our continuous reconstruction mode and set the Header and Footer option to detect and remove headers and footers. This takes complex multi-column PDF documents and re-flows them into a single continuous HTML document. There are no page breaks in HTML, and the format is intended to re-flow correctly when the width of the viewer varies. Breaking the content up with the headers and footers from the print (PDF) version would be messy and annoying. As far as we're concerned, this is the purest form of reconstruction for when the user is trying to re-purpose a large document with tons of text. In this use case, layout is very rarely important and will be re-applied with a new style template at some point in the future, once the text has been edited or re-purposed. SDL's translation software (Trados) uses Solid Framework in continuous reconstruction mode (for PDF to Word) when translating PDF files: there is no point in WYSIWYG layout since the German sentences will not be the same length as the English ones anyway. Re-styling and layout come after re-purposing.

Having said that, we need to remember that continuous is not plain text either: inline formatting, tables, images, etc. all need to be preserved. The software has two goals: get as close as possible to a perfect reading order of the text, and preserve as much inline formatting and content as possible while removing page artifacts (like headers and footers).

This brings us back to flowing mode. We see this mode used most effectively for medium-sized documents where the re-purposing is light. In other words, the user wants to do a lot more than simple text touch-up but not as much as translating the entire document. Editing will cause flow "ripples" when the new text is shorter or longer than the old. These layout issues can be mitigated manually in small documents (say, less than 20 pages) while retaining most of the layout as-is. Delete the odd page break for overflow. Shrink an image to make some space. However, for larger documents, there is no magic that allows you to heavily edit while retaining more complex layout. There is usually just not enough information in the PDF to get DTP-level structure in Word (which is not great for DTP anyway).
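The guidance in this post can be summed up as a small decision helper. This is only a sketch of the rules of thumb above: `suggest_mode` and its parameters are hypothetical names, not part of Solid Converter PDF, Solid PDF Tools or Solid Framework, and the "heavy repurposing" flag is a simplification of the question "what are you intending to do with the Word document?"

```python
def suggest_mode(target_format: str, heavy_repurposing: bool) -> str:
    """Suggest a reconstruction mode name from the post's rules of thumb.

    Hypothetical helper for illustration only; it does not call any Solid API.
    target_format: "PowerPoint", "HTML" or "Word" (case-insensitive).
    heavy_repurposing: True for translation or wholesale re-styling of the text.
    """
    target = target_format.strip().lower()
    if target == "powerpoint":
        # Slides are fixed pages, so exact presentation matters most.
        return "exact"
    if target == "html":
        # HTML has no page breaks; content should re-flow continuously.
        return "continuous"
    if target == "word":
        # Heavy edits break layout anyway, so keep reading order and inline
        # formatting; light edits can keep the recovered layout.
        return "continuous" if heavy_repurposing else "flowing"
    raise ValueError(f"unsupported target format: {target_format!r}")


if __name__ == "__main__":
    print(suggest_mode("Word", heavy_repurposing=False))
    print(suggest_mode("Word", heavy_repurposing=True))
    print(suggest_mode("PowerPoint", heavy_repurposing=False))
```

Document length is left out on purpose: the 20-page figure above is a rough guide to how much manual clean-up is practical in flowing mode, not a hard cut-off.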