Character encoding: filters
When the content stored in the WARC file is incorrect, it's likely because of a misbehaving filter. Filters need to preserve encoding when reading or writing content.
Troubleshooting filters is hard: a single filter might be at the origin of the problem, or several of them, and what one filter does can be undone by another filter further down the chain. Some tips to help diagnose the problem:
- It's often easier to reproduce the problem on a separate collection containing only the single document that exhibits the problem. This way the turnaround time is faster and the logs are easier to read.
- Do not blindly trust log files or your terminal. Logs are written with a specific encoding, and your terminal displays content in a specific encoding as well. Depending on the fonts installed on your system, some characters might not show up even if they're present in the content.
- If possible, add code to write the actual bytes being processed to a temporary file (see the sketch after this list). You can then inspect this file with a hex editor, removing any other factor (log file encoding, terminal fonts, etc.).
- Be careful to write bytes and not strings, because when manipulating strings you need to know their encoding to correctly interpret the bytes.
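For example, here is a minimal sketch of such a byte dump. dumpBytes is a hypothetical helper name, and it assumes you can get hold of the actual byte[] the filter is processing:

import java.nio.file.Files;
import java.nio.file.Path;

//Hypothetical debugging helper: dump the raw bytes being processed to a
//temporary file so they can be inspected with a hex editor
public static void dumpBytes(byte[] data, String label) throws java.io.IOException {
    Path out = Files.createTempFile("filter-debug-" + label + "-", ".bin");
    //Files.write() copies the bytes verbatim: no charset conversion is
    //involved, so nothing can be corrupted on the way out
    Files.write(out, data);
    System.err.println("Dumped " + data.length + " bytes to " + out);
}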
Pinpoint the filter
The first step is to try to pinpoint which filter(s) are causing the corruption. This is done by editing the filter.classes parameter in collection.cfg, removing all the filters, and then adding them back one by one:
# Original
filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider:com.funnelback.services.filter.MetaDataScraperFilter
# In our case we know our document is HTML, so it won't be processed by Tika nor by the ExternalFilterProvider; we can rule those out
filter.classes=CombinerFilterProvider:DocumentFixerFilterProvider
filter.classes=CombinerFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider
filter.classes=CombinerFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider:com.funnelback.services.filter.MetaDataScraperFilter
You need to run an update after each change to cause the content to be re-crawled and re-filtered.
In my case I was able to pinpoint that MetaDataScraperFilter is causing the problem. It's a Groovy filter living in lib/java/groovy.
Content reading
Browsing the source code, we can see that this filter converts the HTML content into a JSoup object, to be able to manipulate the HTML. While doing so, it tries to detect the charset of the document. Let's print this charset with a logger (the logger.info() call added below) and inspect crawler.inline-filtering.log:
//Converts the String into InputStream
InputStream is = new ByteArrayInputStream(input.getBytes());
BufferedInputStream bis = new BufferedInputStream(is);
bis.mark(Integer.MAX_VALUE);
//Get the character set
String c = TextUtils.getCharSet(bis);
//Log the detected charset (added for debugging)
logger.info("Detected charset for " + address + ": " + c);
bis.reset();
//Create the JSOUP document object with the calculated character set
Document doc = Jsoup.parse(bis, c, address);
doc.outputSettings().escapeMode(EscapeMode.xhtml);
Detected charset for http://forums.squizsuite.net/index.php?s=fb30266e34222a354db54c6e63c51aea&showtopic=12515: Windows-1252
The charset is detected as Windows-1252, which is equivalent to ISO-8859-1 for our purposes (see the Indexer section below for further explanation), so that looks correct.
However, despite the charset detection being correct, the content is still read incorrectly. That's because of:
InputStream is = new ByteArrayInputStream(input.getBytes());
The call to input.getBytes() used to convert the content string to an array of bytes doesn't specify a charset, so it uses the default one, as stated in the Javadoc. The default encoding in Funnelback is UTF-8. This means the string is encoded into UTF-8 bytes, but that byte stream is then decoded as the detected charset, Windows-1252, resulting in corruption.
This corruption is only visible when the string is written back, which makes the problem harder to diagnose.
One should always be careful when converting Strings to byte arrays (String.getBytes()) and vice versa (new String(byte[] data)). If the charset is not specified, a default is used which might not necessarily be what you want (usually UTF-8, but that's platform dependent unless set with a command line argument when starting the JVM). It's better to always specify the charset to avoid any problems:
String.getBytes("Windows-1252")
new String(data, "Windows-1252")
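To make the failure mode concrete, here is a small self-contained sketch (illustrative only, not part of the filter) reproducing the exact mismatch seen in this guide, where UTF-8 bytes are decoded as Windows-1252:

import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) throws Exception {
        //U+00A0 is a non-breaking space; UTF-8 encodes it as the two
        //bytes 0xC2 0xA0
        String input = "<p>\u00A0</p>";

        //Encode with one charset, decode with another: the same mismatch
        //as in the filter
        byte[] utf8Bytes = input.getBytes(StandardCharsets.UTF_8);
        String corrupted = new String(utf8Bytes, "windows-1252");
        System.out.println(corrupted); //Prints "<p>Â </p>": 0xC2 -> 'Â', 0xA0 -> nbsp

        //Using the same charset on both sides round-trips cleanly
        String intact = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(intact); //Prints "<p> </p>" with the nbsp preserved
    }
}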
This specific code is not easy to fix, because an InputStream is needed to detect the charset, but you need the charset to create the InputStream from the String! A better way is to build the JSoup object from the input string itself. This way, you don't need to worry about providing an encoding along with an InputStream.
//Parse the String directly; no String-to-bytes conversion is needed
Document doc = Jsoup.parse(input, address);
doc.outputSettings().escapeMode(EscapeMode.xhtml);
Content writing
The content reading is wrong here, but for the sake of this guide let's also inspect how the content is written back:
doc.html();
It's simply using the JSoup Document.html() method to do so. We need to dig into the JSoup documentation to understand which charset this method will use. By doing so, we find the Document.OutputSettings class.
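As an aside, those output settings can also be used to set the serialization charset explicitly. A minimal sketch (with UTF-8 as an arbitrary illustrative choice; this is not what the filter does):

//Force the charset used when serializing the document; characters not
//representable in it will be escaped as entities by html()
doc.outputSettings().charset("UTF-8");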
Let's add some code to inspect the charset from the output settings just before writing the document:
logger.info("Output settings charset:" + doc.outputSettings().charset())
return doc.html();
2014-10-03 21:55:04,235 [com.funnelback.crawler.NetCrawler 0] INFO filter.MetaDataScraperFilter - Detected charset for http://forums.squizsuite.net/index.php?s=44a3483b0fde4fdef17c34532a5a9724&showtopic=12515: Windows-1252
2014-10-03 21:55:04,294 [com.funnelback.crawler.NetCrawler 0] INFO filter.MetaDataScraperFilter - Output settings charset:windows-1252
That's the correct charset, but we can still confirm that something is wrong in the filter by logging the content before and after filtering, and comparing both:
logger.info("Output settings charset:" + doc.outputSettings().charset())
logger.info("Raw content for "+address+": \n\n"+ input +"\n\n")
logger.info("Content for "+address+": \n\n" + doc.html() + "\n\n")
Before:
<p>Hi</p>
<p> </p>
<p>We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder</p>
<p> </p>
After:
<p>Hi</p>
<p>Â </p>
<p>We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder</p>
<p>Â </p>

The stray "Â" is the signature of the mismatch described above: input.getBytes() encodes the non-breaking space (U+00A0) as the two UTF-8 bytes 0xC2 0xA0, and decoding those bytes as Windows-1252 yields "Â" (0xC2) followed by a non-breaking space (0xA0).