Character encoding


Overview

This page is a guide to troubleshooting encoding/charset problems with Funnelback, where the symptom is usually "odd" or corrupted characters showing up in search results (titles, summaries or metadata).

There is no recipe for this kind of problem, so this page discusses approaches and troubleshooting examples rather than step-by-step instructions. It should give you enough background to understand the problem and diagnose it yourself.

What not to do

A quick fix that is often attempted is to write either a custom filter or a Groovy hook script to "clean" the output. That shouldn't be done, for multiple reasons:

  • It doesn't address the root cause of the problem, and by doing so might hide other issues
  • Those filters use a "blacklist" approach where only well-known "odd" characters are cleaned up. This is bound to fail as soon as new characters that weren't thought of appear in the content
  • In most cases there is no good replacement for those corrupted characters. Yes, you can replace a middle dash with a simple dash, but replace the French "é" with a non-accented "e" and you'll find that words may now have different meanings.

Understanding the problem

This kind of corruption happens when different components of Funnelback are not using the same character encoding. For example, one component can output ISO-8859-1, but Funnelback will read it as UTF-8.

Encoding and decoding is usually done at the boundary of each component. When a component writes a file to disk (e.g. the crawler), it uses a specific encoding. When another component reads this file back in (e.g. the indexer), it assumes an encoding. If those components are not using the same encoding, corruption occurs.
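As a minimal sketch of such a boundary mismatch (plain Java with a hypothetical file name, not Funnelback code), writing text with one charset and reading it back with another is enough to corrupt every non-ASCII character:

Java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BoundaryMismatch {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("boundary.txt");   // hypothetical file name

        // Component A writes the content using ISO-8859-1: "ç" is stored as the single byte 0xE7.
        Files.write(file, "Test: ç".getBytes(StandardCharsets.ISO_8859_1));

        // Component B reads the same bytes back assuming UTF-8: 0xE7 on its own is not a
        // valid UTF-8 sequence, so it is decoded as the replacement character U+FFFD.
        String read = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        System.out.println(read);   // prints "Test: �" instead of "Test: ç"
    }
}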

So, to be able to pinpoint the problem, one must look at the boundaries of each component of Funnelback, and at the original source document:

[Diagram: Funnelback content lifecycle]

The diagram shows that content follows a complex lifecycle, from where it is downloaded to where it is returned as a summary on a search results page. During this lifecycle it crosses multiple boundaries, each of which can have an encoding problem.

The right way to diagnose an encoding issue is to look at each boundary in turn, and confirm that each component is doing the right thing.

Different ways of representing non-ASCII characters

The only "safe" characters that can be represented in all encodings are ASCII characters, i.e. A to Z in lower and upper case, and some punctuation signs.

Representing non-ASCII characters can be achieved in different ways. Understanding those different ways is crucial to being able to diagnose encoding problems.

Using HTML character entities references

In an HTML document, "special" characters can be represented by entities, either:

  • Numeric entities: &#nnn;. For example, "ç" can be represented as &#231;, 231 being the Unicode code point of the c with cedilla
  • Named entities: &name;. For example, the same "ç" can be represented as &ccedil;

Those entities need to be decoded in order to display or process the content. For example, the indexer doesn't know what "Fran&ccedil;ais" means, but it knows what "Français" does.

Note that the encoding of the document (see below) doesn't matter in this case: HTML entities are always expressed with plain ASCII characters (either digits, or ASCII letters composing the entity name), so they survive whichever encoding the document is in.
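To make this concrete, here is a small sketch (plain Java, not a Funnelback API) showing that a numeric entity such as &#231; is just six ASCII characters which, once decoded, become a single "ç":

Java
public class EntityDecode {
    public static void main(String[] args) {
        String entity = "&#231;";                       // pure ASCII, safe in any encoding

        // Extract the decimal number between "&#" and ";" and turn it into a character.
        int codePoint = Integer.parseInt(entity.substring(2, entity.length() - 1));
        String decoded = new String(Character.toChars(codePoint));

        System.out.println(entity + " -> " + decoded);  // &#231; -> ç
    }
}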

Using a different encoding in the content

Understanding encodings is the key to diagnosing this kind of problem.

The main idea is that different languages use different characters. ASCII can only express 128 characters, which is not enough for non-English languages. In order to express more characters, different encodings were invented. For example, European languages might use ISO-8859, whereas Chinese could use Big5 and Japanese Shift JIS. The only universal way to represent all characters of all languages is to use Unicode, and to represent Unicode characters in a document one must use one of its encodings, such as UTF-8.

Some languages only need a few additional characters (e.g. French). In this case, the ASCII set was simply extended by adding 128 more characters. This means those character sets have at most 256 possible characters (0 to 255), which fits in a single byte. ISO-8859-1 is one example.

Other languages (e.g. Chinese, Japanese, ...) need a whole lot more characters. Adding 128 would not be enough, so many more were added, which requires more than one byte to represent the greater values (> 255). That's why non-ASCII characters, for example in UTF-8, are represented as 2 or more bytes. You see one character on the screen, but when you look at the bytes on disk, there is more than one byte.

To give an example, if the document is encoded using ISO-8859-1, the "ç" will be represented by a single byte of value 231 (0xE7 in hexadecimal), whereas if encoded in UTF-8, it will be represented as the 2 bytes 0xC3 0xA7:

Terminal
Content is "Test: ç" (followed by a new line)
 
$ hexdump -C iso.txt
00000000  54 65 73 74 3a 20 e7 0a                           |Test: ..|
00000008
 
 
$ hexdump -C utf8.txt
00000000  54 65 73 74 3a 20 c3 a7  0a                       |Test: ...|
00000009

Note that the byte representation is the only thing that matters. An HTML document can claim to be encoded in ISO-8859-1 in its META tags, but it's entirely possible that the content is actually written as UTF-8 bytes. That mismatch is the cause of most encoding problems.
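A quick way to see what such a mismatch looks like (a plain Java sketch, not Funnelback code): take the UTF-8 bytes of "ç" and decode them as ISO-8859-1, and the classic two-character mojibake appears:

Java
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        // UTF-8 encodes "ç" as the two bytes 0xC3 0xA7.
        byte[] utf8Bytes = "ç".getBytes(StandardCharsets.UTF_8);

        // A component that wrongly assumes ISO-8859-1 decodes each byte as its own character.
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(misread);   // prints "Ã§", the typical "corrupted characters" symptom
    }
}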


Diagnosing

When diagnosing, each of the following articles should be checked, in this order.

  1. Character encoding: validate content source
The first step is to look at the original content, as the crawler will see it, by making a simple HTTP request to the original web server (see the first sketch after this list).
  2. Character encoding: web crawler, WARC file
The crawler will download the content, process it in some way, pass it to the filtering layer, and store it in the WARC file.
  3. Character encoding: filters
When the content stored in the WARC file is incorrect, it's likely to be because of a misbehaving filter. Filters need to preserve the encoding when reading or writing content (see the second sketch after this list).
  4. Character encoding: custom workflow scripts
    This article discusses how to avoid character encoding issues when working with custom workflow scripts.
  5. Character encoding: Indexer
    The indexer will tell you which encoding it used to interpret the content. That can be found in the index.log file, where there's one line for each indexed document.
  6. Character encoding: Index file
    It's interesting to look at the index file, because PADRE will have processed the content and possibly transformed it before storing it on disk. For example, it's likely that HTML entities will get decoded, so it's worth checking that they were correctly decoded by PADRE.
  7. Character encoding: query processor
This article details how to check that the query processor is returning the content in the correct form.
  8. Character encoding: user interface layer
The content is then read by the Modern UI, possibly transformed, and rendered with FreeMarker to be presented. We need to inspect each of these steps to narrow down the problem.
  9. Character Encoding: CMS integration / browser display
Some CMSes don't support rendering UTF-8. There is no possible workaround for that on the Funnelback side. The CMS (or something sitting between the CMS and Funnelback) needs to read the content as UTF-8, and then do its own charset conversion.
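For step 1, a minimal sketch of such a check in plain Java (the URL is hypothetical; curl or wget work just as well): fetch the page as raw bytes, print the charset the server advertises, and dump the first bytes so they can be compared with what the headers and META tags claim.

Java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckSourceCharset {
    public static void main(String[] args) throws Exception {
        String url = "https://www.example.com/page.html";   // hypothetical URL

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

        // The charset advertised by the web server; it may be missing, or simply wrong.
        System.out.println("Content-Type: "
                + response.headers().firstValue("Content-Type").orElse("(none)"));

        // Dump the first bytes: this is what actually matters, regardless of what the
        // headers or META tags claim (compare with the hexdump examples above).
        byte[] body = response.body();
        for (int i = 0; i < Math.min(64, body.length); i++) {
            System.out.printf("%02x ", body[i]);
        }
        System.out.println();
    }
}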
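For step 3, the key point is that a filter should decode and re-encode with an explicit, consistent charset rather than relying on the platform default. A minimal sketch of that idea (plain Java with hypothetical file names, not the actual Funnelback filter API):

Java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FilterSketch {
    public static void main(String[] args) throws Exception {
        Path input = Path.of("document.html");      // hypothetical input
        Path output = Path.of("filtered.html");     // hypothetical output

        // Decode with an explicit charset, never the platform default, and use the same
        // charset that the upstream component used when it wrote the file (UTF-8 here).
        String content = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);

        // Whatever transformation the filter performs on the decoded text...
        String filtered = content.replace("<!-- internal note -->", "");

        // ...then re-encode with the same charset so downstream components see consistent bytes.
        Files.write(output, filtered.getBytes(StandardCharsets.UTF_8));
    }
}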