Character encoding: custom workflow scripts
Background
On non-web collections, it's common to have custom workflow scripts that process the content, which is usually stored in individual files (often XML).
- If the content was gathered using Funnelback components, they all produce UTF-8. The custom workflow scripts need to read the files back as UTF-8, and to write them as UTF-8 as well.
- If the content was gathered by custom means, you need to know what encoding was used when the files were stored on disk.
It's crucial to preserve the correct encoding when processing content with a workflow script. If you read it with the wrong encoding, or write it with the wrong encoding, the data will get corrupted, and the rest of the chain (indexing, query processing, etc.) will produce corrupted output.
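To make the corruption concrete, here is a minimal Groovy sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. The sample text is hypothetical, and the non-ASCII character is written as a \u escape so the result does not depend on how the script file itself is saved.

```groovy
// Hypothetical sample text: "café", with "é" written as a Unicode escape.
def original = "caf\u00E9"

// Encode as UTF-8, the way Funnelback components store content.
byte[] utf8Bytes = original.getBytes("UTF-8")

// Wrong: decoding UTF-8 bytes as ISO-8859-1 turns "é" into two characters.
def misread = new String(utf8Bytes, "ISO-8859-1")
assert misread == "caf\u00C3\u00A9"   // displays as "cafÃ©"

// Right: decoding with the encoding that was used to write the bytes.
def correct = new String(utf8Bytes, "UTF-8")
assert correct == original

// Writing the misread text back as UTF-8 stores the corrupted form:
// the original "é" now takes four bytes instead of two.
assert misread.getBytes("UTF-8").length == 7
```

Seeing sequences such as "Ã©" in the output of a workflow script is usually a sign that UTF-8 content was read with a single-byte encoding somewhere along the way.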
Make sure you check the result of your workflow script in isolation to confirm that the processed files were saved with the right encoding. If they are XML files, try opening them in your browser to confirm they're valid and display correctly.
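Besides opening the files in a browser, the check can be scripted. The sketch below is one possible approach, not a Funnelback feature: the helper name isValidUtf8 and the temporary file are assumptions made for illustration. It decodes the bytes strictly as UTF-8 and then parses the file with the XML parser that ships with the JDK.

```groovy
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.CodingErrorAction
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory

// Hypothetical helper: true only if the bytes decode as UTF-8 without error.
boolean isValidUtf8(byte[] bytes) {
    def decoder = StandardCharsets.UTF_8.newDecoder()
    // Report bad input instead of silently replacing it.
    decoder.onMalformedInput(CodingErrorAction.REPORT)
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT)
    try {
        decoder.decode(ByteBuffer.wrap(bytes))
        return true
    } catch (CharacterCodingException e) {
        return false
    }
}

// Hypothetical output file standing in for one written by a workflow script.
def file = File.createTempFile("workflow-output", ".xml")
file.write('<?xml version="1.0" encoding="UTF-8"?><doc>caf\u00E9</doc>', "UTF-8")

// The bytes on disk are valid UTF-8 ...
assert isValidUtf8(file.bytes)

// ... and the file parses as well-formed XML with the expected text.
def doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file)
assert doc.documentElement.textContent == "caf\u00E9"

// The same text stored as ISO-8859-1 is rejected by the strict decoder.
assert !isValidUtf8("caf\u00E9".getBytes("ISO-8859-1"))
```

A file that passes this check is valid UTF-8, but that alone does not prove the content is intact: text that was misread and then written back as UTF-8 is still valid UTF-8, so a visual check of a few known characters remains worthwhile.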
Finding which encoding was used to store the content
There's no easy way to automagically find the encoding of a file, so it's better to know what encoding was used by the custom gather component. However, you can get some clues by looking at the content itself.
- Try to pinpoint a "special" character in the content, like an accented letter or a Chinese / Japanese character. Open the file with a hex editor, and observe how the character is represented at the byte level.
- You can then try to locate the character in the ISO-8859-1 code page layout, or on websites like http://www.fileformat.info/ which will give you the UTF-8 representation.
- As an example, if "é" is represented as a single byte of value 233 (0xE9 in hexadecimal), it's ISO-8859-1. If it's represented as two bytes 0xC3 0xA9, it's UTF-8.
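If no hex editor is at hand, the byte values quoted above can be reproduced with a few lines of Groovy. This is only a sketch; the helper name hexBytes is made up for illustration.

```groovy
// Hypothetical helper: formats each byte of the text, in the given
// encoding, as two uppercase hex digits separated by spaces.
String hexBytes(String text, String charsetName) {
    text.getBytes(charsetName).collect { String.format("%02X", it & 0xFF) }.join(" ")
}

// "é" (U+00E9) is a single byte in ISO-8859-1 ...
assert hexBytes("\u00E9", "ISO-8859-1") == "E9"

// ... and two bytes in UTF-8.
assert hexBytes("\u00E9", "UTF-8") == "C3 A9"
```

Comparing these values with what the hex editor shows for the same character in your file indicates which of the two encodings was used.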
Ensuring your workflow script reads and writes it right
This is highly dependent on the language used to implement workflow scripts. Usually, API calls to read or write files will allow you to specify an encoding.
When using Groovy, there are different ways to configure the encoding:
- Use the -Dfile.encoding=... flag when starting Groovy. With this flag you can specify the encoding to use in the script without having to specify it for each read / write call.
- Alternatively, specify the encoding when reading a file using File.getText() or File.withReader(), and similarly when writing files:
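Configure character encoding in Groovy:
// Read the whole file, decoding it as ISO-8859-1
new File("/tmp/test.txt").getText("ISO-8859-1")
// Read through a Reader that decodes ISO-8859-1
new File("/tmp/test.txt").withReader("ISO-8859-1") { reader ->
...
}
// Append text, encoding it as ISO-8859-1
new File("/tmp/test.txt").append("hello", "ISO-8859-1")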
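Putting the read and write sides together, a workflow script that converts content stored as ISO-8859-1 into UTF-8 could look like the sketch below. The temporary files and the sample text are assumptions made to keep the example self-contained; a real script would point at the collection's own data files.

```groovy
// Hypothetical input and output files.
def source = File.createTempFile("gathered", ".txt")
def target = File.createTempFile("processed", ".txt")

// Simulate content stored by a custom gatherer as ISO-8859-1.
source.write("caf\u00E9\n", "ISO-8859-1")

// Read with the encoding the file was stored in, write as UTF-8.
target.withWriter("UTF-8") { writer ->
    source.withReader("ISO-8859-1") { reader ->
        reader.eachLine { line ->
            writer.write(line)
            writer.write("\n")
        }
    }
}

// "é" is one byte in ISO-8859-1 and two bytes in UTF-8.
assert source.bytes.length == 5
assert target.bytes.length == 6

// Reading the result back as UTF-8 gives the original text.
assert target.getText("UTF-8") == "caf\u00E9\n"
```

Naming the encoding explicitly on both the reader and the writer keeps the script independent of the JVM default, so it behaves the same whether or not -Dfile.encoding is set.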