Character encoding: web crawler, WARC file
Before checking for character encoding issues within the crawler and WARC files, please check that the content is properly served by the customer website. See Character encoding.
Assuming that we know the content is "good" and properly served by the customer website, we need to look at the crawler. The crawler will download the content, process it in some way, pass it to the filtering layer, and store it in the WARC file.
- There's not much to look at on the crawler itself, in the logs you'll just get the HTTP response headers, similar to what was achieved with
- There's not much to look at on the filters either, because they don't log much information by default. Moreover, debugging filters is a bit difficult as you need to add relevant log statements to your custom Groovy code.
- That's why one usually looks at the WARC file first. If the content is correct in the WARC file, we know that the problem lies elsewhere; if it's not, we know that we have to dig into the crawler and the filters.
Extracting content from the WARC file
Depending on your Funnelback version, you'll either have to use warc.pl (up to v13) or WarcCat (v14 and above).
linbin/java/bin/java -classpath "lib/java/all/*" com.funnelback.warc.util.WarcCat -stem data/squiz-forum-funnelback/live/data/funnelback-web-crawl -matcher MatchURI -MF "uri=http://forums.squizsuite.net/index.php?s=6258edbbc08a5347636117c80372a804&showtopic=12515" -printer All > /tmp/content.txt
Looking at the extracted content, we can see something interesting: The HTML numeric entity in the title has been converted to a named entity:
... <title>Funnel back search doesn't allow empty search query - Funnelback - Squiz Suite Support Forum</title> ...
While surprising, that's not necessarily unexpected, as the filters can do that especially when parsing HTML with JSoup. In any case, that's still valid HTML and valid entities.
Regarding the line breaks after "Hi", the numeric entities seem to have disappeared from the HTML source:
... <p>Hi</p> <p> </p> <p>We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder</p> <p> </p> ...
However, if we fire up our trusty hex editor, we can actually see what happened behind the scenes:
00006ca0 20 20 20 20 20 20 20 3c 70 3e 48 69 3c 2f 70 3e |       <p>Hi</p>|
00006cb0 20 0a 20 20 20 20 20 20 20 20 20 20 20 3c 70 3e | .           <p>|
00006cc0 c2 a0 3c 2f 70 3e 20 0a 20 20 20 20 20 20 20 20 |..</p> .        |
00006cd0 20 20 20 3c 70 3e 57 65 20 68 61 76 65 20 33 20 |   <p>We have 3 |
Observe that what sits between the opening and closing P tags is "0xC2 0xA0". This is the UTF-8 representation of the non-breaking space, which was previously represented as the &nbsp; named entity.
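If no hex editor is at hand, the same view can be approximated with a few lines of Python. This is only a sketch: the function name hexdump_around and the sample bytes are made up for illustration, and the output format merely imitates a typical hex editor.

```python
from typing import List


def hexdump_around(data: bytes, needle: bytes, width: int = 16) -> List[str]:
    """Return one hexdump-style line for every row of `data` in which `needle` starts."""
    lines = []
    seen_rows = set()
    start = data.find(needle)
    while start != -1:
        row = start - (start % width)  # offset of the row containing the match
        if row not in seen_rows:
            seen_rows.add(row)
            chunk = data[row:row + width]
            hex_part = " ".join(f"{b:02x}" for b in chunk)
            # Non-printable and non-ASCII bytes are shown as dots
            text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
            lines.append(f"{row:08x} {hex_part} |{text_part}|")
        start = data.find(needle, start + 1)
    return lines


# Illustrative sample only; for real content, read the extracted file in binary mode.
sample = b"<p>Hi</p> \n<p>\xc2\xa0</p>"
for line in hexdump_around(sample, b"\xc2\xa0"):
    print(line)
```

To run it against the content extracted earlier, replace the sample with the bytes of /tmp/content.txt, for example open("/tmp/content.txt", "rb").read().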
So here, we've actually found one problem: The page is still declared as ISO-8859-1, but non-breaking spaces are represented as UTF-8 sequences. Because the declared encoding is ISO-8859-1, it's likely PADRE will index the document with this encoding. Instead of interpreting 0xC2 0xA0 as a single UTF-8 character (non-breaking space), it will interpret it as 2 separate ISO-8859-1 characters, with:
- 0xC2 = 194 = Â
- 0xA0 = 160 = non breaking space
That explains why a Â is showing up in the results!
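The misinterpretation can be reproduced outside of Funnelback. The following Python sketch only illustrates the byte-level effect described above; it makes no claim about how PADRE decodes content internally.

```python
# The non-breaking space, U+00A0
nbsp = "\u00a0"

# Encoded as UTF-8 it becomes the two bytes seen in the hex dump
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes.hex(" "))

# Decoding those bytes with the declared (wrong) encoding yields two characters
misread = utf8_bytes.decode("iso-8859-1")
print([f"U+{ord(c):04X}" for c in misread])

# Decoding with the actual encoding yields the single original character
correct = utf8_bytes.decode("utf-8")
print(len(misread), len(correct))
```

The first character of the misread string is U+00C2 (Â) and the second is U+00A0, the non-breaking space itself, which is why only the Â is visible in the results.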
It's likely to be caused by a filter, because the crawler by itself doesn't modify the content. The only way to fix that is to pinpoint which filter is causing the problem.
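When narrowing down the culprit, it can help to test each intermediate copy of the content for this specific mismatch. The sketch below is a heuristic, not part of Funnelback: the function names are hypothetical, the charset detection is a simple regular expression rather than a full HTML parser, and it assumes you have a way to capture the bytes before and after a given filter.

```python
import re
from typing import Optional


def declared_charset(html: bytes) -> Optional[str]:
    """Return the first charset=... value found in the raw bytes, lowercased."""
    match = re.search(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", html, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


def looks_like_utf8(html: bytes) -> bool:
    """True if the bytes contain non-ASCII data that decodes cleanly as UTF-8."""
    if all(b < 0x80 for b in html):
        return False  # pure ASCII tells us nothing
    try:
        html.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def encoding_mismatch(html: bytes) -> bool:
    """True if a non-UTF-8 charset is declared but the bytes look like UTF-8."""
    charset = declared_charset(html)
    return (
        charset is not None
        and charset not in ("utf-8", "utf8")
        and looks_like_utf8(html)
    )


# Illustrative input: declared as ISO-8859-1 but containing a UTF-8 non-breaking space
page = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><p>\xc2\xa0</p>'
print(declared_charset(page), encoding_mismatch(page))
```

If the check is false for the bytes going into a filter and true for the bytes coming out, that filter is a good candidate for the one rewriting entities as UTF-8 without updating the declared encoding.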