Question: I have content I need to repurpose that is held captive in a PDF file. What do I do?
Answer: Solid Converter® PDF and Solid PDF Tools offer several PDF to Word conversion options because one size does not fit all. Here’s how to get the best conversion for what you need.
Conversion options available in Solid PDF Tools and Solid Converter PDF include:
- Flowing - recovers page layout, columns, formatting and graphics, and preserves text flow
- Exact - recovers exact page presentation using text boxes in Microsoft Word
- Continuous - detects layout and columns but only recovers formatting, graphics and text
When should you use each? We get asked this question a lot, and we usually follow up with a question of our own:
"What are you intending to do with the Word document (recovered PDF content) once you have converted it?" Read on to see how to choose the best conversion method for your repurposing task.
Flowing Mode:
Why are there page breaks in Flowing mode?
Exact Mode:
Exact mode is more than just "converting the PDF into text boxes". We do paragraph-level reconstruction, combining PDF text chunks into larger logical text units that are editable at a paragraph level, in text boxes. This mode is the layout engine behind our new PDF to PowerPoint converter. This is a perfect case where hard page breaks (between slides) make a lot of sense and exact mode is more desirable than flowing or continuous. If you take a PDF presentation and convert it to PowerPoint, we use exact mode behind the scenes and the results are fantastic. However, try taking a 200-page legal contract in PDF and converting it to PowerPoint: what would you expect the PowerPoint slideshow to look like? Exactly! A mess. So we come back to the question: "What are you intending to do with the Word document?"
Continuous Mode:
What about HTML? When converting from PDF to HTML, we automatically use our continuous reconstruction mode and set the Header and Footer option to detect and remove headers and footers. This takes complex multi-column PDF documents and re-flows them into a single continuous HTML document. There are no page breaks in HTML, and the format is intended to re-flow correctly when the width of the viewer varies. Breaking the content up with the headers and footers from the print (PDF) version would be messy and annoying.
As far as we're concerned, this is the purest form of reconstruction for when the user is trying to re-purpose a large document with tons of text. In this use case, layout is very rarely important and will be re-applied with a new style template at some point in the future, once the text has been edited or re-purposed. SDL's translation software (Trados) uses Solid Framework in continuous reconstruction mode (for PDF to Word) when translating PDF files: there is no point in WYSIWYG layout, since the German sentences will not be the same length as the English ones anyway. Re-styling and layout come after re-purposing.
Having said that, we need to remember that continuous is not plain text either: inline formatting, tables, images, etc. all need to be preserved. The software has two goals: to get as close as possible to a perfect reading order for the text, and to preserve as much inline formatting and content as possible while removing page artifacts (like headers and footers).
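To make the idea of removing page artifacts concrete, here is a minimal sketch in Python. It is not how Solid's software detects headers and footers; it only illustrates one simple, assumed heuristic: a first or last line that repeats on most pages (ignoring page numbers) is treated as a header or footer and dropped, and the remaining lines are joined into one continuous stream. The function name `strip_repeated_edges` and the `min_share` parameter are hypothetical.

```python
import re
from collections import Counter


def strip_repeated_edges(pages, min_share=0.6):
    """Join pages into one continuous list of lines, dropping edge lines
    that repeat across pages.

    pages: list of pages, each page a list of text lines in reading order.
    min_share: assumed share of pages on which a first/last line must
               appear before it is treated as a header or footer.
    """
    def key(line):
        # Normalize digits so "Page 1" and "Page 2" count as the same line.
        return re.sub(r"\d+", "#", line.strip().lower())

    # Count on how many pages each normalized edge line appears.
    counts = Counter()
    for page in pages:
        counts.update({key(line) for line in page[:1] + page[-1:]})

    threshold = min_share * len(pages)
    repeated = {k for k, n in counts.items() if n >= threshold}

    continuous = []
    for page in pages:
        lines = list(page)
        if lines and key(lines[0]) in repeated:
            lines = lines[1:]    # drop the header
        if lines and key(lines[-1]) in repeated:
            lines = lines[:-1]   # drop the footer
        continuous.extend(lines)  # no page break between pages
    return continuous
```

For example, three pages that each start with "Annual Report" and end with "Page 1", "Page 2" and "Page 3" come back as body text only, with no page boundaries. A real converter also has to keep inline formatting, tables and images, which this sketch ignores.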
This brings us back to flowing mode. We see this mode used most effectively for medium-sized documents where the re-purposing is light. In other words, the user wants to do a lot more than simple text touch-up but not as much as translating the entire document. Editing will cause flow "ripples" when the new text is shorter or longer than the old. These layout issues can be mitigated manually in small documents (say, less than 20 pages) while retaining most of the layout as-is. Delete the odd page break for overflow. Shrink an image to make some space. However, for larger documents, there is no magic that allows you to heavily edit while retaining more complex layout. There is usually just not enough information in the PDF to get DTP-level structure in Word (which is not great for DTP anyway).
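The guidance above can be summed up as a small decision helper. This is only a sketch of the reasoning in this article, not part of any Solid product or API: the names `Mode`, `choose_mode` and `manual_fixup_feasible` are hypothetical, the 20-page figure is the rough number mentioned above, and mapping simple touch-up to Exact mode is our assumption for illustration.

```python
from enum import Enum


class Mode(Enum):
    FLOWING = "flowing"
    EXACT = "exact"
    CONTINUOUS = "continuous"


def choose_mode(target, editing):
    """Pick a reconstruction mode from the intended use of the output.

    target:  "word", "powerpoint" or "html"
    editing: "touch-up" (minor text fixes), "moderate" (light re-purposing)
             or "heavy" (translation, large-scale re-purposing)
    """
    if target == "powerpoint":
        return Mode.EXACT        # hard breaks between slides make sense
    if target == "html":
        return Mode.CONTINUOUS   # no page breaks, content re-flows
    if editing == "heavy":
        return Mode.CONTINUOUS   # layout is re-applied after editing
    if editing == "moderate":
        return Mode.FLOWING      # keep layout, accept some flow ripples
    return Mode.EXACT            # assumption: touch-up keeps exact pages


def manual_fixup_feasible(pages, limit=20):
    """Rough rule of thumb: flow ripples can be fixed by hand in small
    documents (say, less than 20 pages)."""
    return pages < limit
```

A translator working on a long contract would call `choose_mode("word", "heavy")` and get continuous mode, while someone lightly reworking a 12-page brochure would get flowing mode and can expect to fix the odd page break by hand.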