When the content stored in the WARC file is incorrect, it's likely because of a misbehaving filter. Filters need to preserve the content encoding when reading or writing content.
Troubleshooting filters is hard: a single filter might be at the origin of the problem, or multiple ones; what one filter does can be undone by another filter later in the chain, etc. Some tips to help diagnose the problem:
- It's often easier to try to reproduce the problem on a separate collection, with only a single document causing the problem. This way the turnaround time is faster, and the logs are easier to read.
- Do not necessarily trust log files, or your terminal. Logs are written with a specific encoding, and your terminal displays content in a specific encoding as well. Depending on the fonts installed on your system, some characters might not show up even when they're present in the content.
- If possible, try to add code to write the actual bytes being processed to a temporary file. You can then inspect this temporary file with a hex editor, removing any other factor (log file encoding, terminal fonts, etc.).
- Be careful to write bytes and not strings: to turn a string into bytes you must know which encoding to use, otherwise the bytes cannot be interpreted correctly.
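As a sketch of the last two tips, a filter could dump the raw bytes it receives to a temporary file for hex inspection. The class and method names here are illustrative, not part of Funnelback:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DumpBytes {
    // Write the raw bytes being filtered to a temporary file so they can be
    // inspected with a hex editor, free of any log or terminal encoding issues.
    static Path dump(byte[] content) {
        try {
            Path tmp = Files.createTempFile("filter-debug-", ".bin");
            Files.write(tmp, content); // raw bytes, no charset involved
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 bytes of a non-breaking space, invisible in most terminals
        byte[] raw = {(byte) 0xC2, (byte) 0xA0};
        System.out.println("Dumped to " + dump(raw));
    }
}
```

Note that `Files.write(Path, byte[])` takes bytes directly, so no charset is ever involved; this is exactly the property we want when diagnosing encoding problems.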
Pinpoint the filter
The first step is to try to pinpoint which filter(s) are causing the corruption. This is done by editing the filter.classes parameter in collection.cfg, removing all the filters, and then adding them back one by one:
```
# Original
filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider:com.funnelback.services.filter.MetaDataScraperFilter

# In our case we know our document is HTML so it won't be processed by Tika,
# nor the ExternalFilterProvider, so we can rule those out
filter.classes=CombinerFilterProvider:DocumentFixerFilterProvider
filter.classes=CombinerFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider
filter.classes=CombinerFilterProvider:DocumentFixerFilterProvider:InjectNoIndexFilterProvider:com.funnelback.services.filter.MetaDataScraperFilter
```
You need to run an update between each change, to cause the content to be re-crawled and re-filtered.
In my case I was able to pinpoint that MetaDataScraperFilter is causing the problem. It's a Groovy filter.
Browsing the source code, we can see that this filter converts the HTML content into a JSoup object, to be able to manipulate the HTML. While doing so, it tries to detect the charset of the document:
```java
//Converts the String into InputStream
InputStream is = new ByteArrayInputStream(input.getBytes());
BufferedInputStream bis = new BufferedInputStream(is);
bis.mark(Integer.MAX_VALUE);

//Get the character set
String c = TextUtils.getCharSet(bis);
bis.reset();

//Create the JSOUP document object with the calculated character set
Document doc = Jsoup.parse(bis, c, address);
doc.outputSettings().escapeMode(EscapeMode.xhtml);
```

Let's print this charset by using a logger, and inspect the logs:
```
Detected charset for http://forums.squizsuite.net/index.php?s=fb30266e34222a354db54c6e63c51aea&showtopic=12515: Windows-1252
```
Charset is detected as Windows-1252, which is equivalent to ISO-8859-1 for our purposes (see the Indexer section below for further explanation), so that looks correct.
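Windows-1252 and ISO-8859-1 decode bytes identically except in the 0x80-0x9F range, where ISO-8859-1 maps to invisible control characters and Windows-1252 maps to printable ones (curly quotes, the euro sign, etc.). A quick sketch illustrating why the two are interchangeable for most Western text:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCompare {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // 0xE9 decodes to 'é' in both charsets, like every byte
        // outside the 0x80-0x9F range
        byte[] common = {(byte) 0xE9};
        System.out.println(new String(common, cp1252));                      // é
        System.out.println(new String(common, StandardCharsets.ISO_8859_1)); // é

        // 0x93 is where they differ: a left curly quote (U+201C) in
        // Windows-1252, an invisible control character (U+0093) in ISO-8859-1
        byte[] differ = {(byte) 0x93};
        System.out.println(new String(differ, cp1252)); // prints a left quote
    }
}
```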
However, despite the charset detection being correct, the content is still read incorrectly. That's because of this line:
```java
InputStream is = new ByteArrayInputStream(input.getBytes());
```
The call to input.getBytes(), used to convert the content string to an array of bytes, doesn't specify a charset, so the platform default is used, as stated in the Javadoc. The default encoding in Funnelback is UTF-8. This means the content gets encoded to bytes as UTF-8, but those bytes are then decoded using the detected charset, Windows-1252, resulting in corruption.
This corruption is only visible when the string is written back, making the problem harder to diagnose.
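To make the mechanism concrete, here is a minimal, self-contained sketch (not the filter's actual code) reproducing this kind of corruption: a non-breaking space (U+00A0) encoded as UTF-8 becomes the byte pair 0xC2 0xA0, which Windows-1252 decodes as 'Â' followed by a non-breaking space:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // The document contains a single non-breaking space, U+00A0
        String input = "<p>\u00A0</p>";

        // Bug: encoding with the default charset (UTF-8 on Funnelback)
        // turns U+00A0 into the two-byte sequence 0xC2 0xA0
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);

        // Decoding those bytes with the *detected* charset, Windows-1252,
        // maps 0xC2 to 'Â' and 0xA0 to a non-breaking space
        String corrupted = new String(bytes, Charset.forName("windows-1252"));

        System.out.println(corrupted); // prints <p>Â </p>
    }
}
```

One character in, two characters out: the corruption compounds on every filtering pass, which is why a stray "Â" is the classic signature of UTF-8 bytes decoded as Windows-1252 (or ISO-8859-1).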
One should always be careful when converting Strings to byte arrays (String.getBytes()) and vice-versa (new String(byte[] data)). If the charset is not specified, a default is used which might not necessarily be what you want (usually UTF-8, but it's platform dependent unless set with a command line argument when starting the JVM). It's better to always specify the charset to avoid any problems:
new String(data, "Windows-1252")
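Since Java 7, the constants in java.nio.charset.StandardCharsets are a convenient way to do this: they rule out charset-name typos and, unlike the String-based overloads, don't throw the checked UnsupportedEncodingException. A sketch (the variable names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharsets {
    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café"

        // Encode and decode with the same explicit charset:
        // the round trip is lossless, regardless of platform defaults
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(text)); // true

        // 'é' is one byte in ISO-8859-1 but two bytes in UTF-8,
        // which is why the encode and decode charsets must always match
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(Arrays.toString(utf8));
    }
}
```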
This specific code is not easy to fix, because an InputStream is needed to detect the charset, but you need the charset to create the InputStream from the String! A better way is to build the JSoup object from the input string itself. This way, you don't need to worry about providing an encoding at all:
```java
Document doc = Jsoup.parse(input, address);
doc.outputSettings().escapeMode(EscapeMode.xhtml);
```
The content reading is what's wrong here, but for this guide's sake let's also inspect how the content is written back:
It's simply using the JSoup html() method to do so. We need to dig into the JSoup documentation to understand which charset will be used by this method; it turns out to be the one from the document's output settings. Let's add some code to inspect that charset just before writing the document:
```java
logger.info("Output settings charset:" + doc.outputSettings().charset())
return doc.html();
```
```
2014-10-03 21:55:04,235 [com.funnelback.crawler.NetCrawler 0] INFO filter.MetaDataScraperFilter - Detected charset for http://forums.squizsuite.net/index.php?s=44a3483b0fde4fdef17c34532a5a9724&showtopic=12515: Windows-1252
2014-10-03 21:55:04,294 [com.funnelback.crawler.NetCrawler 0] INFO filter.MetaDataScraperFilter - Output settings charset:windows-1252
```
That's the correct charset, but we can still confirm that something is wrong in the filter by outputting the content before and after filtering, and comparing both:
```java
logger.info("Output settings charset:" + doc.outputSettings().charset())
logger.info("Raw content for " + address + ": \n\n" + input + "\n\n")
logger.info("Content for " + address + ": \n\n" + doc.html() + "\n\n")
```
Before:

```html
<p>Hi</p>
<p> </p>
<p>We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder</p>
<p> </p>
```

After:

```html
<p>Hi</p>
<p>Â </p>
<p>We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder</p>
<p>Â </p>
```