Character encoding: filters


When the content stored in the WARC file is incorrect, it's likely because of a misbehaving filter. Filters need to preserve the encoding when reading or writing content.

Troubleshooting filters is hard: a single filter might be at the origin of the problem, or several of them, and what one filter does can be undone by another filter in the chain. Some tips to help diagnose the problem:

  • It's often easier to try to reproduce the problem on a separate collection, with only a single document causing the problem. This way the turnaround time is faster, and the logs are easier to read.
  • Do not blindly trust log files, or your terminal. Logs are written with a specific encoding, and your terminal displays content in a specific encoding as well. Depending on the fonts installed on your system, some characters might not show up even though they're present in the content.
  • If possible, try to add code to write the actual bytes that are being processed to a temporary file. You can then inspect this temporary file with a hex editor to remove any other factor (log file encoding, terminal fonts, etc.).
  • Be careful to write bytes and not strings, because when manipulating strings you need to know their encoding to correctly interpret the underlying bytes.
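As a sketch of the last two tips, raw bytes can be dumped to a temporary file with no charset conversion at all, then compared against a hex editor's view. The file name and sample content here are illustrative:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ByteDump {
    public static void main(String[] args) throws IOException {
        // Illustrative sample: "café" plus a non-breaking space, encoded as UTF-8
        byte[] content = "caf\u00E9\u00A0".getBytes(StandardCharsets.UTF_8);

        // Writing raw bytes involves no charset, so nothing can be silently re-encoded
        Path tmp = Files.createTempFile("filter-debug", ".bin");
        Files.write(tmp, content);

        // Print a hex dump to compare with what the hex editor shows
        StringBuilder hex = new StringBuilder();
        for (byte b : Files.readAllBytes(tmp)) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // 636166c3a9c2a0
    }
}
```

Note that `Files.write(Path, byte[])` takes bytes directly; the charset question only arises once you turn those bytes into a `String`.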

Pinpoint the filter

The first step is to try to pinpoint which filter(s) are causing the corruption. This is done by editing the filter.classes parameter in collection.cfg, removing all the filters, and then adding them back one by one:

# Original
# In our case we know our document is HTML, so it won't be processed by Tika or the ExternalFilterProvider; we can rule those out
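As a hedged sketch, the edited parameter might look like this in collection.cfg. The filter names and separator syntax here are illustrative only, not the actual chain from this collection:

```
# Full chain (illustrative names)
#filter.classes=TikaFilterProvider:ExternalFilterProvider:MetaDataScraperFilter

# Reduced chain while testing; re-add one filter at a time
filter.classes=MetaDataScraperFilter
```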

You need to run an update between each change to cause the content to be re-crawled and re-filtered.

In my case I was able to pinpoint that MetaDataScraperFilter is causing the problem. It's a Groovy filter living in lib/java/groovy.

Content reading

Browsing the source code, we can see that this filter converts the HTML content into a JSoup object, to be able to manipulate the HTML. While doing so, it tries to detect the charset of the document:

Let's print this charset using a logger, and inspect crawler.inline-filtering.log:

jsoup filter example

// Convert the String into an InputStream
InputStream is = new ByteArrayInputStream(input.getBytes());
BufferedInputStream bis = new BufferedInputStream(is);
// Get the character set
String c = TextUtils.getCharSet(bis);
// Create the JSoup document object with the detected character set
Document doc = Jsoup.parse(bis, c, address);

The log shows:

Detected charset for Windows-1252

Charset is detected as Windows-1252, which is equivalent to ISO-8859-1 for our purposes (see the Indexer section below for further explanation), so that looks correct.
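Windows-1252 and ISO-8859-1 agree everywhere except the 0x80–0x9F range, where Windows-1252 maps printable characters (curly quotes, the euro sign, etc.) and ISO-8859-1 maps invisible control characters. A small sketch of the difference:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDiff {
    public static void main(String[] args) {
        byte[] b = { (byte) 0x92 }; // a byte in the range where the two charsets differ

        // Windows-1252: 0x92 is a right single quotation mark (U+2019)
        String cp1252 = new String(b, Charset.forName("windows-1252"));

        // ISO-8859-1: 0x92 is the invisible control character U+0092
        String latin1 = new String(b, StandardCharsets.ISO_8859_1);

        System.out.println(cp1252.equals("\u2019")); // true
        System.out.println(latin1.equals("\u0092")); // true
    }
}
```

For typical web content that stays outside this range, the two are interchangeable, which is why the detection result above looks acceptable.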

However, despite the charset detection being correct, the content is still read incorrectly. That's because of this line:

InputStream is = new ByteArrayInputStream(input.getBytes());

The call to input.getBytes() used to convert the content string to an array of bytes doesn't specify a charset, so the platform default is used, as stated in the Javadoc. The default encoding in Funnelback is UTF-8. This means the byte stream will contain UTF-8 bytes, but since the detected charset of the content is Windows-1252, those bytes will be decoded as Windows-1252, resulting in corruption.
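This mismatch can be reproduced in isolation: encode a string with one charset, then decode the resulting bytes with another. The sample string here is illustrative:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String input = "caf\u00E9"; // "café"

        // getBytes() without an argument uses the platform default;
        // here we make the UTF-8 default explicit
        byte[] utf8Bytes = input.getBytes(StandardCharsets.UTF_8);

        // Decoding those UTF-8 bytes as Windows-1252 corrupts the text:
        // "é" (0xC3 0xA9 in UTF-8) becomes the two characters "Ã©"
        String corrupted = new String(utf8Bytes, Charset.forName("windows-1252"));
        System.out.println(corrupted); // cafÃ©
    }
}
```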

This corruption is only visible when the string is written back, making the problem harder to diagnose.

One should always be careful when converting Strings to byte arrays (String.getBytes()) and vice-versa (new String(byte[] data)). If the charset is not specified, a default is used which might not necessarily be what you want (usually UTF-8, but that's platform dependent unless specified with a command line argument when starting the JVM). It's better to always specify the charset to avoid any problems:

  • String.getBytes("Windows-1252")
  • new String(data, "Windows-1252")

This specific code is not easy to fix, because an InputStream is needed to detect the charset, but you need the charset to create the InputStream from the String! A better way is to build the JSoup object from the input string itself. This way, you don't need to worry about providing an encoding with an InputStream.

Document doc = Jsoup.parse(input, address);

Content writing

We've established that the content reading is wrong here, but for this guide's sake let's also inspect how the content is written back:


It's simply using the JSoup Document.html() method to do so. We need to dig into the JSoup documentation to understand what charset will be used by this method. By doing so, we find the Document.OutputSettings class.

Let's add some code to inspect the charset from the output settings just before writing the document:

logger.info("Output settings charset: " + doc.outputSettings().charset());
return doc.html();

The log now shows:
2014-10-03 21:55:04,235 [com.funnelback.crawler.NetCrawler 0]  INFO  filter.MetaDataScraperFilter  - Detected charset for Windows-1252
2014-10-03 21:55:04,294 [com.funnelback.crawler.NetCrawler 0]  INFO  filter.MetaDataScraperFilter  - Output settings charset:windows-1252

That's the correct charset, but we can still confirm that something is wrong in the filter by outputting the content before and after filtering, and comparing both:

logger.info("Raw content for " + address + ": \n\n" + input + "\n\n");
logger.info("Content for " + address + ": \n\n" + doc.html() + "\n\n");
Raw content (before filtering):

           <p> </p>
           <p>We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder</p>
           <p> </p>

Content (after filtering):

           <p>Â </p>
           <p>We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder</p>
           <p>Â </p>
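The stray Â is exactly the bug identified above at work: the non-breaking space (U+00A0) inside the empty paragraphs becomes the UTF-8 byte pair 0xC2 0xA0, and decoding those bytes as Windows-1252 yields Â (0xC2) followed by the original non-breaking space (0xA0):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NbspDemo {
    public static void main(String[] args) {
        String nbsp = "\u00A0"; // non-breaking space, as found in the <p> tags above

        // UTF-8 encodes U+00A0 as two bytes: 0xC2 0xA0
        byte[] utf8 = nbsp.getBytes(StandardCharsets.UTF_8);

        // Windows-1252 reads them as two characters: Â then a non-breaking space
        String mangled = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(mangled.equals("\u00C2\u00A0")); // true
    }
}
```

Each round trip through the filter would add another Â, since the corrupted string still contains a non-breaking space.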