Character encoding: validate content source
Review content source
The first step is to look at the original content as the crawler will see it, by making a simple HTTP request to the original web server. We'll use the old Squiz forums as a use case.
By looking at the page, we can observe that it "looks" correct: we don't see any corrupted characters in the output:
However, that doesn't mean the content is not problematic, because browsers are usually smart (smarter than our crawler) and can automatically fix a number of encoding-related errors. Additionally, collecting more information on those corrupted characters will help us later.
The first thing to look at is the HTML source of the page, to see how those characters are represented:
In this case those characters are represented as HTML entities. That's perfectly valid, and the entities used here are valid too, according to the list of HTML entities:
- &#39; is a regular apostrophe
- &nbsp; is a non-breaking space
This content seems ok.
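To double-check what those entities decode to, you can run them through an entity decoder and inspect the resulting code points. A minimal sketch using Python's standard library:

```python
# Decode HTML entities the way a browser would, then inspect the
# actual Unicode code points behind them (stdlib only).
from html import unescape

raw = "It&#39;s&nbsp;here"
decoded = unescape(raw)
print(decoded)
print([hex(ord(c)) for c in decoded])  # 0x27 = apostrophe, 0xa0 = non-breaking space
```

Seeing `0x27` and `0xa0` confirms the entities map to a plain apostrophe and a non-breaking space, both legitimate characters.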
Declared encoding of the document
Our use case seems to be related to the processing of entities, so it's probably not related to actual encoding problems in the original document. When that's the case, other information is useful to look up.
What is the encoding declared in the HTML source?
In our case it's ISO-8859-1, as seen in the <meta charset="iso-8859-1"> tag.
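If you need to check the declared encoding across many pages, extracting it can be scripted. A minimal sketch using a regex (a real crawler would use a proper HTML parser, and this doesn't handle the older http-equiv form):

```python
# Sketch: extract the charset declared in a <meta charset="..."> tag.
# A regex is enough to illustrate the idea on well-formed pages.
import re

def declared_charset(html_text):
    m = re.search(r'<meta\s+charset=["\']?([\w-]+)', html_text, re.IGNORECASE)
    return m.group(1).lower() if m else None

print(declared_charset('<head><meta charset="iso-8859-1"></head>'))  # iso-8859-1
```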
What is the encoding returned by the webserver?
It can differ from what's in the page, and that's usually a source of problems.
To find out the encoding, use your browser developer tools and search for the Content-Type
response header returned on the main HTTP request:
Alternatively, use wget and observe the Content-Type header near the bottom of the output:
$ wget -S 'http://forums.squizsuite.net/index.php?showtopic=12515'
--2014-10-01 17:33:48-- http://forums.squizsuite.net/index.php?showtopic=12515
Resolving forums.squizsuite.net (forums.squizsuite.net)... 50.57.66.196
Connecting to forums.squizsuite.net (forums.squizsuite.net)|50.57.66.196|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 01 Oct 2014 15:33:48 GMT
Server: Apache
Set-Cookie: session_id=df2a0d03ae9572592471d1109e0942e0; path=/; httponly
Set-Cookie: modpids=deleted; expires=Tue, 01-Oct-2013 15:33:47 GMT; path=/
Cache-Control: no-cache, must-revalidate, max-age=0
Expires: Tue, 30 Sep 2014 15:33:48 GMT
Pragma: no-cache
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html;charset=ISO-8859-1
Length: unspecified [text/html]
In our case, the Content-Type returned by the server and the Content-Type declared in the page match. Sometimes they don't, because the content author usually doesn't control the web server. When they don't match, browsers will use various heuristics to try to find the "right" one. The Funnelback crawler follows an order of precedence, and you can ask RnD which encoding is considered first (either the response header or the page content).
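When debugging a mismatch, it helps to compare both declarations explicitly. A small sketch; note that treating the HTTP header as taking precedence here is an assumption for illustration, not a statement of the crawler's actual rule:

```python
# Sketch: pick an effective charset from the header and meta declarations,
# warning when they disagree. Header-first precedence is an ASSUMPTION
# here; confirm the crawler's real order with RnD.
def effective_charset(header_charset, meta_charset, default="utf-8"):
    if header_charset and meta_charset and header_charset.lower() != meta_charset.lower():
        print("warning: header says %s, page says %s" % (header_charset, meta_charset))
    return (header_charset or meta_charset or default).lower()

print(effective_charset("ISO-8859-1", "iso-8859-1"))  # iso-8859-1
```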
What is the actual encoding of the page?
Despite the encoding values returned by the web server and/or the page's META tag, it's possible that both of them are lying and that the content is actually encoded differently. The only way to find out is to use a hexadecimal editor and inspect the content at the byte level to see how it's actually encoded. In our specific case it doesn't matter, because the special characters are encoded as entities, but that's something you always need to confirm.
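A quick programmatic check can complement the hex editor. A rough heuristic, sketched below with the standard library only: if the raw bytes decode as strict UTF-8 and contain multi-byte sequences, the content is very likely UTF-8 regardless of what the page claims. (This is a simplification of what detection libraries like chardet do.)

```python
# Rough heuristic: valid strict UTF-8 containing multi-byte sequences
# is very unlikely to be anything other than UTF-8.
def looks_like_utf8(data):
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return len(text) != len(data)  # multi-byte sequences present

print(looks_like_utf8(b"Parlons Fran\xc3\xa7ais"))  # True: UTF-8 "ç"
print(looks_like_utf8(b"Parlons Fran\xe7ais"))      # False: Latin-1 "ç" is not valid UTF-8
```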
Consider the following HTML document: Character encoding example.
If you look at its byte representation, you'll see that the "ç" is represented with the 2 bytes 0xC3 0xA7. That's actually UTF-8, not ISO-8859-1.
If the page were encoded in ISO-8859-1 as it claims, it would have been a single 0xE7 byte.
<!doctype html>
<html lang=en-us>
<head><meta charset="iso-8859-1"></head>
<body>Parlons Français</body>
</html>
$ hexdump -C wrong.html
00000000 3c 21 64 6f 63 74 79 70 65 20 68 74 6d 6c 3e 0a |<!doctype html>.|
00000010 3c 68 74 6d 6c 20 6c 61 6e 67 3d 65 6e 2d 75 73 |<html lang=en-us|
00000020 3e 0a 3c 68 65 61 64 3e 3c 6d 65 74 61 20 63 68 |>.<head><meta ch|
00000030 61 72 73 65 74 3d 22 69 73 6f 2d 38 38 35 39 2d |arset="iso-8859-|
00000040 31 22 3e 3c 2f 68 65 61 64 3e 0a 3c 62 6f 64 79 |1"></head>.<body|
00000050 3e 50 61 72 6c 6f 6e 73 20 46 72 61 6e c3 a7 61 |>Parlons Fran..a|
00000060 69 73 3c 2f 62 6f 64 79 3e 0a 3c 2f 68 74 6d 6c |is</body>.</html|
00000070 3e 0a |>.|
00000072
$ hexdump.exe -C good.html
00000000 3c 21 64 6f 63 74 79 70 65 20 68 74 6d 6c 3e 0a |<!doctype html>.|
00000010 3c 68 74 6d 6c 20 6c 61 6e 67 3d 65 6e 2d 75 73 |<html lang=en-us|
00000020 3e 0a 3c 68 65 61 64 3e 3c 6d 65 74 61 20 63 68 |>.<head><meta ch|
00000030 61 72 73 65 74 3d 22 69 73 6f 2d 38 38 35 39 2d |arset="iso-8859-|
00000040 31 22 3e 3c 2f 68 65 61 64 3e 0a 3c 62 6f 64 79 |1"></head>.<body|
00000050 3e 50 61 72 6c 6f 6e 73 20 46 72 61 6e e7 61 69 |>Parlons Fran.ai|
00000060 73 3c 2f 62 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e |s</body>.</html>|
00000070 0a |.|
00000071
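The byte-level difference visible in the two hexdumps above can be reproduced directly by encoding the character in both charsets:

```python
# "ç" is two bytes in UTF-8 but a single byte in ISO-8859-1,
# matching the 0xC3 0xA7 vs 0xE7 seen in the hexdumps.
c = "ç"
print(c.encode("utf-8"))       # b'\xc3\xa7'
print(c.encode("iso-8859-1"))  # b'\xe7'
```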