Character encoding: validate content source


Review content source

The first step is to look at the original content as the crawler will see it, by making a simple HTTP request to the original web server. The old Squiz forums serve as the use case here.

Looking at the page, we can observe that it "looks" correct; we don't see any corrupted characters in the output:

Correct characters example for old Squiz Forum.

However, that doesn't mean the content is free of problems, because browsers are usually smart (smarter than our crawler) and can automatically fix a number of encoding-related errors. Additionally, collecting more information on those corrupted characters will help us later.

The first thing to look at is the HTML source of the page, to see how those characters are represented:

Source code of squiz forum example.

In this case those characters are represented as HTML entities. That's perfectly valid and the entities used there are valid too, looking at the list of HTML entities:

  • ' is a regular quote
  •   is a non breaking space

This content seems ok. 
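
The same check can be done outside the browser. The sketch below decodes a sample string with Python's standard `html.unescape` and prints the code point of each special character. The entity spellings `&#39;` and `&nbsp;` and the sample text are assumptions for illustration; the page source may use equivalent forms such as `&#039;` or `&#160;`.

```python
import html

# Assumed entity spellings and sample text, for illustration only.
sample = "Squiz&#39;s&nbsp;forum"

decoded = html.unescape(sample)

# Print the Unicode code point of each character that is not a letter or digit.
for ch in decoded:
    if not ch.isalnum():
        print(f"U+{ord(ch):04X}")
# U+0027  (regular quote)
# U+00A0  (non-breaking space)
```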

Declared encoding of the document

Our use case seems to be related to processing entities, so it's probably not related to actual encoding problems in the original document. When that's the case, other information is useful to look up.

What is the encoding declared in the HTML source?

In our case it's ISO-8859-1 as seen in the <meta charset="iso-8859-1"> tag.
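
When many pages need checking, the declaration can be read programmatically. This is a minimal sketch, assuming a simple regular expression is good enough for the pages at hand; it is not how a browser or the crawler parses HTML. The function name `declared_charset` and the 4096-byte cut-off are hypothetical choices.

```python
import re

# Matches <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb'''<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)''',
    re.IGNORECASE,
)

def declared_charset(raw_html: bytes):
    """Return the charset named in the first <meta> declaration, lowercased, or None."""
    # Hypothetical cut-off: declarations normally sit near the top of the page.
    match = META_CHARSET.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else None

print(declared_charset(b'<head><meta charset="iso-8859-1"></head>'))  # iso-8859-1
```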

What is the encoding returned by the webserver?

It can differ from what's in the page, and that's usually a source of problems.

To find out the encoding, use your browser developer tools and search for the Content-Type response header returned on the main HTTP request:

Response header of squiz forum page example.

Alternatively, use wget (observe the Content-Type header at the bottom of the response headers).

Terminal - Get HTTP header of page.
$ wget -S 'http://forums.squizsuite.net/index.php?showtopic=12515'
--2014-10-01 17:33:48--  http://forums.squizsuite.net/index.php?showtopic=12515
Resolving forums.squizsuite.net (forums.squizsuite.net)... 50.57.66.196
Connecting to forums.squizsuite.net (forums.squizsuite.net)|50.57.66.196|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Wed, 01 Oct 2014 15:33:48 GMT
  Server: Apache
  Set-Cookie: session_id=df2a0d03ae9572592471d1109e0942e0; path=/; httponly
  Set-Cookie: modpids=deleted; expires=Tue, 01-Oct-2013 15:33:47 GMT; path=/
  Cache-Control: no-cache, must-revalidate, max-age=0
  Expires: Tue, 30 Sep 2014 15:33:48 GMT
  Pragma: no-cache
  Connection: close
  Transfer-Encoding: chunked
  Content-Type: text/html;charset=ISO-8859-1
Length: unspecified [text/html]

In our case, the encoding in the Content-Type returned by the server and the encoding declared in the page match. Sometimes they don't, because content authors usually don't control the web server. When they don't match, browsers will use various heuristics to try to find the "right" one. The Funnelback crawler will follow an order of precedence, and you can ask RnD which encoding is considered first (the response header or the page content).
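
The comparison between the two declarations can be sketched as follows. The helper names `header_charset`, `canonical` and `charsets_agree` are hypothetical, and the code only reports whether the two values name the same encoding; it makes no claim about which one the Funnelback crawler prefers.

```python
import codecs
from email.message import Message

def header_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def canonical(name: str):
    """Map aliases such as 'latin1' and 'ISO-8859-1' to one codec name, or None if unknown."""
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None

def charsets_agree(content_type: str, meta_charset: str) -> bool:
    """True when the header and the in-page declaration name the same known encoding."""
    from_header = header_charset(content_type)
    if from_header is None:
        return False
    left, right = canonical(from_header), canonical(meta_charset)
    return left is not None and left == right

print(header_charset("text/html;charset=ISO-8859-1"))                # iso-8859-1
print(charsets_agree("text/html;charset=ISO-8859-1", "iso-8859-1"))  # True
print(charsets_agree("text/html; charset=UTF-8", "iso-8859-1"))      # False
```

Going through `codecs.lookup` means that aliases of the same encoding are not reported as a mismatch.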

What is the actual encoding of the page?

Despite the encoding values returned by the web server and/or the page META tags, it's possible that both of them are lying and that the content is actually encoded differently. The only way to find out is to use a hexadecimal editor and inspect the content at the byte level to see how it's actually encoded. In our specific case it doesn't matter because special characters are encoded as entities, but that's something you always need to confirm.
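
If no hexadecimal editor is at hand, the interesting bytes can be located with a few lines of code. This is a rough sketch: the function name `non_ascii_bytes` and the sample bytes are assumptions for illustration. It lists the offset and value of every byte outside the ASCII range, which are the only bytes whose meaning depends on the encoding.

```python
def non_ascii_bytes(data: bytes):
    """Return (offset, hex value) pairs for every byte outside the ASCII range."""
    return [(offset, f"0x{byte:02X}") for offset, byte in enumerate(data) if byte >= 0x80]

# Hypothetical sample: "café" with the "é" stored as a single byte.
sample = b"caf\xe9"
print(non_ascii_bytes(sample))  # [(3, '0xE9')]
```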

Consider the following HTML document: Character encoding example.

If you look at its byte representation, you'll see that the "ç" is represented with 2 bytes, 0xC3 0xA7. That's actually UTF-8, not ISO-8859-1.

If the page were really encoded in ISO-8859-1 as it claims, the "ç" would be a single 0xE7 byte.

Character encoding example.
<!doctype html>
<html lang=en-us>
<head><meta charset="iso-8859-1"></head>
<body>Parlons Français</body>
</html>
Incorrect Output: Byte representation of HTML Character encoding example.
$ hexdump -C wrong.html
00000000  3c 21 64 6f 63 74 79 70  65 20 68 74 6d 6c 3e 0a  |<!doctype html>.|
00000010  3c 68 74 6d 6c 20 6c 61  6e 67 3d 65 6e 2d 75 73  |<html lang=en-us|
00000020  3e 0a 3c 68 65 61 64 3e  3c 6d 65 74 61 20 63 68  |>.<head><meta ch|
00000030  61 72 73 65 74 3d 22 69  73 6f 2d 38 38 35 39 2d  |arset="iso-8859-|
00000040  31 22 3e 3c 2f 68 65 61  64 3e 0a 3c 62 6f 64 79  |1"></head>.<body|
00000050  3e 50 61 72 6c 6f 6e 73  20 46 72 61 6e c3 a7 61  |>Parlons Fran..a|
00000060  69 73 3c 2f 62 6f 64 79  3e 0a 3c 2f 68 74 6d 6c  |is</body>.</html|
00000070  3e 0a                                             |>.|
00000072
Correct Output: Byte representation of HTML Character encoding example with
$ hexdump.exe -C good.html
00000000  3c 21 64 6f 63 74 79 70  65 20 68 74 6d 6c 3e 0a  |<!doctype html>.|
00000010  3c 68 74 6d 6c 20 6c 61  6e 67 3d 65 6e 2d 75 73  |<html lang=en-us|
00000020  3e 0a 3c 68 65 61 64 3e  3c 6d 65 74 61 20 63 68  |>.<head><meta ch|
00000030  61 72 73 65 74 3d 22 69  73 6f 2d 38 38 35 39 2d  |arset="iso-8859-|
00000040  31 22 3e 3c 2f 68 65 61  64 3e 0a 3c 62 6f 64 79  |1"></head>.<body|
00000050  3e 50 61 72 6c 6f 6e 73  20 46 72 61 6e e7 61 69  |>Parlons Fran.ai|
00000060  73 3c 2f 62 6f 64 79 3e  0a 3c 2f 68 74 6d 6c 3e  |s</body>.</html>|
00000070  0a                                                |.|
00000071
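
The two dumps above can be told apart mechanically. The sketch below is a simple heuristic rather than a full encoding detector: bytes that decode cleanly as UTF-8 and contain non-ASCII values are very likely UTF-8, while anything else is assumed to be some single-byte encoding such as ISO-8859-1. The function name `sniff_encoding` and its return labels are hypothetical.

```python
def sniff_encoding(data: bytes) -> str:
    """Classify bytes as 'ascii', 'utf-8' or 'single-byte' (heuristic only)."""
    if data.isascii():
        return "ascii"  # no byte depends on the declared encoding
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return "single-byte"

body = "<body>Parlons Fran\u00e7ais</body>"
wrong = body.encode("utf-8")        # contains c3 a7, as in wrong.html
good = body.encode("iso-8859-1")    # contains e7, as in good.html

print(sniff_encoding(wrong))  # utf-8
print(sniff_encoding(good))   # single-byte

# Reading the UTF-8 bytes as ISO-8859-1 gives the typical corrupted output.
print(wrong.decode("iso-8859-1"))  # <body>Parlons FranÃ§ais</body>
```

A page that declares ISO-8859-1 but is classified as `utf-8` here is in the same situation as the incorrect example.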
