Character encoding: web crawler, WARC file



Before checking for character encoding issues within the crawler and WARC files, please check that the content is properly served by the customer website. See Character encoding.

Assuming that we know the content is "good" and properly served by the customer website, we need to look at the crawler. The crawler will download the content, process it in some way, pass it to the filtering layer, and store it in the WARC file.

  • There's not much to look at on the crawler itself, in the logs you'll just get the HTTP response headers, similar to what was achieved with wget earlier.
  • There's not much to look at on the filters either, because they don't log much information by default. Moreover, debugging filters is a bit difficult as you need to add relevant log statements to your custom Groovy code.
  • That's why one usually looks at the WARC file first. If the content is correct in the WARC file, we'll know that the problem lies elsewhere, but if it's not, we'll know we have to dig into the crawler and filters.
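Before digging into the WARC file, it's worth confirming what charset the HTTP response headers actually declare. The sketch below (plain Python, not Funnelback tooling; the header values are made-up examples) shows how to pull the charset parameter out of a Content-Type header such as the one seen in the crawler logs or a wget dump:

```python
# Illustrative sketch: extract the declared charset from an HTTP
# Content-Type header value. The header strings below are examples,
# not taken from a real site.

def declared_charset(content_type):
    """Return the charset parameter of a Content-Type header, or None."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            return value.strip('"').upper()
    return None

print(declared_charset("text/html; charset=ISO-8859-1"))  # ISO-8859-1
print(declared_charset("text/html"))                      # None
```

If the header declares one encoding but the bytes on the wire use another, you already have your mismatch before the crawler is even involved.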

Extracting content from the WARC file

Depending on your Funnelback version, you'll either have to use (up to v13) or WarcCat (v14 and above).

Looking at the extracted content, we can see something interesting: The HTML numeric entity in the title has been converted to a named entity:

Terminal - Extract content from warc.
linbin/java/bin/java -classpath "lib/java/all/*" com.funnelback.warc.util.WarcCat -stem data/squiz-forum-funnelback/live/data/funnelback-web-crawl -matcher MatchURI -MF "uri=" -printer All > /tmp/content.txt
Example extracted content as html.
<title>Funnel back search doesn&apos;t allow empty search query - Funnelback - Squiz Suite Support Forum</title>

While surprising at first, that's not unexpected: the filters can do that, especially when parsing HTML with JSoup. In any case, the output is still valid HTML with valid entities.
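Whether a character is written as a numeric entity, a named entity, or raw UTF-8 bytes, it denotes the same Unicode code point, so such rewrites are lossless. This can be checked with Python's standard html module (illustrative only, unrelated to the JSoup filter itself):

```python
import html

# The apostrophe from the <title>: numeric and named entities agree.
print(html.unescape("&#39;") == html.unescape("&apos;") == "'")  # True

# The non-breaking space: numeric entity, named entity, and the
# raw UTF-8 byte sequence all decode to U+00A0.
numeric = html.unescape("&#160;")
named = html.unescape("&nbsp;")
raw = b"\xc2\xa0".decode("utf-8")
print(numeric == named == raw == "\u00a0")  # True
```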

Regarding the line breaks after "Hi", the numeric entities seem to have disappeared from the HTML source.

However, if we fire up our trusty hex editor, we can actually see what happened behind the scenes.

HTML source
           <p> </p>
           <p>We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder</p>
           <p> </p>
Hex of HTML source
00006ca0  20 20 20 20 20 20 20 3c  70 3e 48 69 3c 2f 70 3e  |       <p>Hi</p>|
00006cb0  20 0a 20 20 20 20 20 20  20 20 20 20 20 3c 70 3e  | .           <p>|
00006cc0  c2 a0 3c 2f 70 3e 20 0a  20 20 20 20 20 20 20 20  |..</p> .        |
00006cd0  20 20 20 3c 70 3e 57 65  20 68 61 76 65 20 33 20  |   <p>We have 3 |

Observe that what is between the opening and closing P tags is "0xC2 0xA0". This is the UTF-8 representation of the non-breaking space, which was previously represented as the &#160; numeric entity.

So here, we've actually found one problem: The page is still declared as ISO-8859-1, but non-breaking spaces are represented as UTF-8 sequences. Because the declared encoding is ISO-8859-1, it's likely PADRE will index the document with this encoding. Instead of interpreting 0xC2 0xA0 as a single UTF-8 character (non-breaking space), it will interpret it as 2 separate ISO-8859-1 characters:

  • 0xC2 = 194 = Â
  • 0xA0 = 160 = non-breaking space

That explains why a Â is showing up in the results!
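The mis-decoding described above can be reproduced in a couple of lines of Python (a sketch of the decoding behaviour, not PADRE code):

```python
utf8_bytes = b"\xc2\xa0"  # UTF-8 encoding of the non-breaking space

# Decoded with the correct encoding: a single character, U+00A0.
as_utf8 = utf8_bytes.decode("utf-8")
print(repr(as_utf8))  # '\xa0'

# Decoded as ISO-8859-1, each byte becomes its own character:
# 0xC2 -> 'Â' and 0xA0 -> non-breaking space.
as_latin1 = utf8_bytes.decode("iso-8859-1")
print(repr(as_latin1))  # 'Â\xa0'
```

One two-byte UTF-8 sequence becomes two Latin-1 characters, the first of which is the stray Â seen in the search results.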

It's likely caused by a filter, because the crawler by itself doesn't modify the content. The only way to fix it is to pinpoint which filter is causing the problem.
