Character encoding: Indexer

The indexer records which encoding it used to interpret the content. You can find this in the index.log file, which contains one entry for each indexed document:

index.log
{3 Windows-1252 HTML 84014 0 "acdehistuvADISU" 2014-09-22  263 [ T=14.000 W=695.000 Z=1.456 H=5.669 F=0.253 C=0.112 L=3.973 D=60.000 M=1.151 ] html}
  URL: http://forums.squizsuite.net/index.php?showtopic=12515
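To check the reported encoding across many documents, the log can be scanned with a short script. The following is a minimal sketch in Python. It assumes the layout seen in the single sample entry above (a line starting with "{" whose second field is the encoding, followed by an indented "URL:" line); the function name encodings_by_url is hypothetical and not part of the product.

```python
import re

# Assumed layout, inferred only from the sample entry: "{<field> <encoding> ..."
# on one line, then an indented "URL: <url>" line.
ENTRY = re.compile(r"^\{\S+\s+(?P<encoding>\S+)\s")
URL = re.compile(r"^\s*URL:\s*(?P<url>\S+)")

def encodings_by_url(lines):
    """Map each URL to the encoding reported on the preceding entry line."""
    result = {}
    current = None
    for line in lines:
        entry = ENTRY.match(line)
        if entry:
            current = entry.group("encoding")
            continue
        url = URL.match(line)
        if url and current is not None:
            result[url.group("url")] = current
            current = None
    return result

sample = [
    '{3 Windows-1252 HTML 84014 0 "acdehistuvADISU" 2014-09-22  263 '
    '[ T=14.000 W=695.000 Z=1.456 H=5.669 F=0.253 C=0.112 L=3.973 '
    'D=60.000 M=1.151 ] html}',
    "  URL: http://forums.squizsuite.net/index.php?showtopic=12515",
]
print(encodings_by_url(sample))
```

Filtering the resulting dictionary for values other than the expected encoding gives a quick list of documents worth inspecting.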

In our case PADRE used the Windows-1252 encoding, which can be considered equivalent to ISO-8859-1 (PADRE actually uses Windows-1252 instead of ISO-8859-1 for legacy reasons). The name is a bit misleading, but PADRE is doing the right thing here by correctly detecting the encoding. If it had reported UTF-8 or something else, that would have indicated a problem.
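The two encodings agree on ASCII and on bytes 0xA0 to 0xFF, and differ only in the range 0x80 to 0x9F, where Windows-1252 defines printable characters and ISO-8859-1 defines control characters. The sketch below illustrates this with Python's built-in codecs; the helper name decode_both is hypothetical, and the code says nothing about how PADRE itself decodes content.

```python
def decode_both(data: bytes):
    """Decode the same bytes as Windows-1252 and as ISO-8859-1."""
    return data.decode("cp1252"), data.decode("latin-1")

# 0xE9 is "é" in both encodings.
same = decode_both(b"\xe9")

# 0x93 is a left double quotation mark (U+201C) in Windows-1252,
# but the C1 control character U+0093 in ISO-8859-1.
different = decode_both(b"\x93")

print(same[0] == same[1])            # the two decodings agree
print(different[0] == different[1])  # the two decodings disagree
```

Python's cp1252 codec rejects a few unassigned bytes (0x81, for example), so the comparison is only meaningful for bytes that Windows-1252 actually defines.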

When to use the -force_iso, -isoinput, -utf8input indexer options

Those options need to be used carefully, as they will completely change how PADRE interprets the content:

  • -force_iso is to be used when you know that the actual content encoding is ISO-8859-1 (after having inspected it with a hex editor), and you know that the META tags or HTTP headers are lying. In our case it wouldn't have helped, because the content is correctly encoded in ISO-8859-1, except for those 2 non-breaking space bytes that are UTF-8. PADRE will correctly detect the encoding as ISO-8859-1, so there's no need to force it.
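If a hex editor is not at hand, the same inspection can be approximated with a script. The sketch below looks for the UTF-8 encoding of a non-breaking space (the bytes C2 A0) in raw content and prints the surrounding bytes in hex. The function name find_utf8_nbsp and the sample bytes are made up for illustration. Note that C2 A0 is also a valid ISO-8859-1 sequence ("Â" followed by a non-breaking space), so a hit is a hint to look closer, not proof.

```python
def find_utf8_nbsp(data: bytes, context: int = 4):
    """Return (offset, hex context) for each C2 A0 byte pair in data."""
    hits = []
    start = 0
    while True:
        pos = data.find(b"\xc2\xa0", start)
        if pos == -1:
            return hits
        window = data[max(0, pos - context): pos + 2 + context]
        hits.append((pos, " ".join(f"{b:02x}" for b in window)))
        start = pos + 2

# Illustrative sample: "café" encoded as ISO-8859-1, then a UTF-8
# non-breaking space, then "ok".
sample_bytes = "caf\u00e9".encode("latin-1") + b"\xc2\xa0" + b"ok"
for offset, context_hex in find_utf8_nbsp(sample_bytes):
    print(offset, context_hex)
```

Reading the file in binary mode (open(path, "rb")) keeps the bytes untouched, which is the point of the exercise.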

If a document has no META tags or HTTP headers to give a clue about its encoding, PADRE will default to Windows-1252. You can use:

  • -isoinput to force it to ISO-8859-1. Note that most people mean Windows-1252 when they talk about ISO-8859-1, so this option is very rarely used.
  • -utf8input to force it to UTF-8.
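Before reaching for either option, it helps to check what the bytes of an undeclared document actually look like. The following sketch uses strict decoding to separate pure ASCII, valid UTF-8, and everything else. It is an illustration only: the function name classify is hypothetical, and this is not PADRE's detection logic.

```python
def classify(data: bytes) -> str:
    """Roughly classify raw bytes as 'ascii', 'utf-8' or 'not-utf-8'."""
    try:
        data.decode("ascii")
        return "ascii"  # identical in UTF-8, ISO-8859-1 and Windows-1252
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict: raises on any invalid sequence
        return "utf-8"
    except UnicodeDecodeError:
        return "not-utf-8"  # likely a single-byte encoding

print(classify(b"plain text"))
print(classify("caf\u00e9".encode("utf-8")))
print(classify("caf\u00e9".encode("latin-1")))
```

A "not-utf-8" result does not say which single-byte encoding is in use; telling Windows-1252 from ISO-8859-1 still requires looking at the bytes in the 0x80 to 0x9F range.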

In practice, those options should only be used as a workaround when you can't fix the root cause of the problem, not tried in the hope of "magically" fixing it.
