Thursday, December 19, 2013

What is Non-Standard Encoding?

Let's start by looking at what standard encoding is and how it works.

Each English letter and digit (A, B, C or 1, 2, 3 and so on) has a corresponding ASCII (American Standard Code for Information Interchange) code, usually written in decimal or hexadecimal. Microsoft Word, Notepad and other plain text editors rely on these codes (or on Unicode, which keeps the same values for these characters) to display the desired characters.

An ASCII code is the numerical representation of a character such as 'a', 'A' or '@'. It can also represent an action (a control code), as shown in the chart below. ASCII-formatted text contains no font information and no decoration such as bold or italic.

Example:

The string 'Solid Documents' in ASCII is equivalent to:

Decimal value - 83 111 108 105 100 32 68 111 99 117 109 101 110 116 115

Hexadecimal value - 53 6F 6C 69 64 20 44 6F 63 75 6D 65 6E 74 73
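
As a quick check of the values above, the following minimal Python sketch prints the decimal and hexadecimal code for each character of the example string. The variable names are only illustrative. Note that the space between the two words is itself a character, with decimal code 32 (hexadecimal 20).

```python
# Look up the ASCII code of every character in the example string.
text = "Solid Documents"

# ord() returns the numeric code of a single character.
decimal_codes = [ord(ch) for ch in text]

# Format each code as a two-digit uppercase hexadecimal number.
hex_codes = [format(code, "02X") for code in decimal_codes]

print("Decimal value -", " ".join(str(code) for code in decimal_codes))
print("Hexadecimal value -", " ".join(hex_codes))
```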

Using the chart below, you can look up the decimal and hexadecimal values used in the example strings above. The chart maps each value to the character that is displayed when that value is used.



Does this mean that a PDF document uses the same codes? Not necessarily. A PDF document does not store plain text, nor does it store decoration denoting fonts, boldface or italics. Rather, a PDF document contains 'glyphs', or collections of glyphs, that display as the text you see. Commonly, each glyph also carries its own custom encoding (for the letter 'C' or 'b', for example) that differs from the ASCII or Unicode value.

In most cases, when standard encoding is used, each glyph or collection also carries the associated ASCII or Unicode value needed to map it to the correct character during a conversion to Microsoft Word, as in the example below.

Glyph name;Unicode scalar value

Aring;00C5
Aringacute;01FA
Aringbelow;1E00
Aringsmall;F7E5
Asmall;F761
Atilde;00C3
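
To make the mapping concrete, here is a minimal Python sketch that reads entries in the "glyph name;Unicode scalar value" format shown above and turns each one into the character it stands for. The function name parse_glyph_list is hypothetical, and the sketch assumes one code point per entry, as in this sample.

```python
# Sample entries in the "glyph name;hex code point" format shown above.
GLYPH_LIST = """\
Aring;00C5
Aringacute;01FA
Aringbelow;1E00
Aringsmall;F7E5
Asmall;F761
Atilde;00C3
"""


def parse_glyph_list(text):
    """Return a dict mapping each glyph name to its Unicode character.

    Assumes every entry holds exactly one hexadecimal code point.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, hex_value = line.split(";")
        mapping[name] = chr(int(hex_value, 16))
    return mapping


glyph_to_char = parse_glyph_list(GLYPH_LIST)
for name, char in glyph_to_char.items():
    print(f"{name} -> U+{ord(char):04X}")
```

A converter that finds the glyph name 'Aring' in a font can use a table like this to emit the character U+00C5 in the Word document.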

Non-Standard Encoding

Unfortunately, nothing requires PDF creation utilities to use a standard encoding such as ASCII, standard glyph names, or a mapping of glyph names to ASCII codes.

While the PDF may appear fine when viewed in Adobe Reader or Acrobat, the document actually lacks the encoding needed to convert successfully to Microsoft Word. The example below illustrates this:


Notice the PDF looks great. To tell whether a PDF uses standard encoding, copy text from the PDF and paste it into Word: if Word displays the correct characters, the text uses standard encoding; if not, it uses non-standard encoding.

This is known as Non-Standard Encoding (NSE).

While the created PDF document may render and look OK, the standard associations between glyphs and their hexadecimal or decimal values do not exist in the document, making it impossible for Microsoft Word to map the values and display the correct characters. You may have seen these results while working with a document created with low-quality PDF creation tools or with a poorly scanned document.
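
The effect described in the paragraph above can be simulated with a short Python sketch. It is only an illustration under stated assumptions: it imagines a hypothetical PDF creator that numbers glyphs in order of first use (1, 2, 3 and so on) instead of using ASCII codes, and the names code_for_char and to_unicode are invented for this example. It does not show how any particular product works.

```python
# Hypothetical font: glyph codes are assigned in order of first use,
# so they have nothing to do with ASCII values.
text = "Solid"
code_for_char = {}
for ch in text:
    code_for_char.setdefault(ch, len(code_for_char) + 1)

# What the simulated PDF stores for the word: custom codes only.
codes = [code_for_char[ch] for ch in text]

# Case 1: the document also carries a code-to-character mapping,
# so the original text can be recovered.
to_unicode = {code: ch for ch, code in code_for_char.items()}
recovered = "".join(to_unicode[code] for code in codes)

# Case 2: no mapping is available, so a consumer that treats the
# codes as ASCII gets control characters instead of letters.
naive = "".join(chr(code) for code in codes)

print(codes)
print(repr(recovered))
print(repr(naive))
```

In the second case the glyph shapes would still draw correctly on screen, which is why the PDF looks fine while the copied or converted text does not.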

So what happens when non-standard encoding exists in a PDF document? It is then up to software such as Solid Converter PDF or Solid PDF Tools to determine and accurately recreate each character in Word. When NSE is detected, the conversion engine of Solid Converter PDF and Solid PDF Tools studies each character individually and rebuilds it in order to supply Microsoft Word with the encoding needed to display the character correctly. This process requires a high degree of recognition and reconstruction accuracy.

To continually study and improve our conversion engine, Solid Documents tests thousands of PDF files, using the Solid Framework SDK to automate a rich set of conversion tests across a wide spectrum of PDF files in order to ensure high-quality, accurate conversion software.

With nearly a decade of success in delivering best-in-class document reconstruction and archiving software, Solid Documents provides a technically sound and innovative document reconstruction solution for standard as well as non-standard encoded PDF files.