Data cleansing


Clean data as close to the source as possible. Implementations should aim to avoid custom data cleansing workflow steps.

Should a filter or a hook script be used?

Data cleansing efforts should be applied as close to the source as possible. The order of priority for cleaning should be:

  • Source: Can you arrange for the data to be as close as possible to the expected format? Can you gather only what is needed (include / exclude patterns, noindex tags)?
  • Custom filter (Groovy)
  • Hook scripts (Groovy)
  • Server-side template (FreeMarker)
  • Client-side scripting (JavaScript)
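The priority above can be illustrated with a small sketch (Python here purely for illustration; in practice the filter and hook layers are Groovy, and all record and function names below are invented for this example). Cleansing at ingest time means every later layer sees the corrected value:

```python
# Illustrative sketch only: a hypothetical ingest-time cleansing step.
# The record structure and function names are invented for this example.

RAW_RECORDS = [
    {"title": "  Annual report &amp; accounts  "},
    {"title": "Contact\u00a0us"},  # non-breaking space from the source CMS
]

def clean_at_ingest(record):
    """Filter-style cleansing: fix the stored value once, near the source."""
    title = record["title"]
    title = title.replace("&amp;", "&")   # undo a double-encoded entity
    title = title.replace("\u00a0", " ")  # normalise whitespace characters
    record["title"] = " ".join(title.split())
    return record

cleaned = [clean_at_ingest(dict(r)) for r in RAW_RECORDS]
# Every downstream consumer (index, cache, JSON/XML endpoints, templates)
# now sees the corrected titles:
print([r["title"] for r in cleaned])
```

Had the same fix been written in client-side JavaScript instead, only the rendered page would benefit, and every other consumer of the data would still see the dirty values.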

The rationale is that the farther you get from the data, the harder it is to understand the cleansing code. For example, having JavaScript code correct something in the data for display would require an implementer to inspect the JavaScript, then the FreeMarker template, then the hook scripts, then the filters, and finally the source data before they could understand what the JavaScript is doing.

Additionally, content cleansed close to the source benefits other systems. For example, cleansing code in the FreeMarker template does not carry through to the JSON and XML output, and cleansing done in a hook script will not be reflected in the cached copy of the document.
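This point can be sketched as follows (Python for illustration only; the value, function, and output names are invented). One cleansing function applied before storage feeds every output format, whereas a fix made only in the HTML template would leave the JSON output dirty:

```python
# Illustrative sketch: cleanse once near the source, and every output
# format benefits. All names here are invented for this example.
import json

def cleanse(value):
    # Undo a double-encoded entity and collapse stray whitespace.
    return " ".join(value.replace("&amp;", "&").split())

stored_title = cleanse("News &amp;  events")   # cleaned once, near the source

as_json = json.dumps({"title": stored_title})  # JSON endpoint benefits
as_html = f"<h1>{stored_title}</h1>"           # HTML template benefits too

print(as_json)
print(as_html)
```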
