Character encoding: user interface layer
This content is then read by the Modern UI, possibly transformed, and then rendered with FreeMarker for presentation. We need to inspect each of these steps to narrow down the problem.
Note that when inspecting XML and JSON endpoints, your browser will interpret the content to render it in a nice way. It's preferable to use command line tools like wget or curl to get the raw content, to prevent misleading transformations by the browser.
padre-sw.cgi
The first step is to use padre-sw.cgi. The output should be strictly identical to running the query processor on the command line (see Character encoding: query processor), but it's a good way to confirm that nothing unexpected happens when PADRE is run by Jetty.
Use the URL /s/padre-sw.cgi:
curl 'http://search-internal.cbr.au.funnelback.com/s/padre-sw.cgi?collection=squiz-forum-funnelback&query=!nullquery' | less
...
<result>
<rank>3</rank>
<score>553</score>
<title>Funnel back search doesn't allow empty search query - Funnelback - Squiz Suite Support Forum</title>
...
<summary><![CDATA[Funnel back search doesn't allow empty search query - posted in Funnelback: Hi  We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder  The first 2 will display all the available results when the page is initiall]]></summary>
...
Here we're getting the exact same output as the command line, so we're good.
search.xml / search.json
Using the XML or JSON endpoint bypasses the FreeMarker rendering, which lets you check that the Modern UI is building the right data model.
It's recommended to use JSON, because XML has its own quirks (such as requiring XML entities for special characters) which can be misleading.
# Use python to make the JSON more readable
curl 'http://search-internal.cbr.au.funnelback.com/s/search.json?collection=squiz-forum-funnelback&query=!nullquery' | python -m json.tool |less
...
"score": 553,
"summary": "Funnel back search doesn&#39;t allow empty search queryHi \u00c2\u00a0 We have 3 funnelback searches on our site: Business Finder; Course Finder; Job Finder \u00c2\u00a0 The first 2 will display all the available results when the page is initiall",
"tags": [],
"tier": 1,
"title": "Funnel back search doesn't allow empty search query"
...
- (Aside: We can confirm that our "incorrect" non breaking space is still preserved as C2 A0 in the Modern UI.)
- We can see that the title apostrophe is preserved as is.
- But we can see that the summary apostrophe has been encoded to &#39;.
That's normal! The technical details are a bit complex, but the Modern UI re-encodes some special characters from the PADRE result packet: \, ", ', <, >, &.
However, that's where our problem lies: depending on how the value is interpreted in FreeMarker, the encoding will or will not show.
If you encounter a problem at this point, it can be either a Modern UI bug (check with RnD), or a custom hook script that's doing something wrong. Try disabling all the hook scripts first.
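When reading the JSON output, it can help to look at the exact code points behind an escaped string. The sketch below uses Python's standard json module on a made-up fragment (the `sample` value is illustrative, not real endpoint output) to show that \u00c2\u00a0 is two separate characters, which together correspond to the UTF-8 bytes C2 A0 of a single non-breaking space.

```python
import json

# Hypothetical fragment shaped like the summary above; not real endpoint output.
sample = r'"Hi \u00c2\u00a0 We have"'

text = json.loads(sample)

# Each JSON \uXXXX escape is one code point, so two non-ASCII characters show up here.
suspicious = [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 127]
print(suspicious)

# Mapping every code point back to a single byte (Latin-1) recovers the bytes C2 A0,
# which decode as UTF-8 to one non-breaking space (U+00A0).
raw = text.encode("latin-1")
repaired = raw.decode("utf-8")
print(repr(repaired))
```

If the list of suspicious code points contains a pair such as U+00C2 followed by U+00A0, the content was most likely decoded with the wrong charset somewhere upstream of the Modern UI.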
FreeMarker
Now that we've confirmed that the data model is right, we need to look at FreeMarker. FreeMarker will take the data model as is, and render it to HTML. Different parameters can affect the rendering (the FreeMarker documentation about charset might be worth reading).
<#ftl encoding> directive
This directive is set at the top of the template: <#ftl encoding="utf-8">. It's used to define in which encoding the .ftl file is stored on disk.
This should not usually be changed because Funnelback writes templates in UTF-8 by default. You need to change that only if you converted the .ftl file to a different encoding for some reason (but there should be no reason to do that).
A global <escape> directive
Since v14, a global escaping directive has been declared at the top of the template: <#escape x as x?html>. It means that all the values from the data model are automatically HTML-escaped. For example, characters like "<" will be converted into &lt;.
This is for security reasons, to prevent people from injecting JavaScript or HTML. If someone enters a query term like "<script>", it will be escaped to "&lt;script&gt;" and will be correctly displayed as "<script>" by the browser rather than being interpreted as a script tag.
You need to be aware of this: since version 14, all variables are HTML encoded by default.
Ad-hoc ?html or ?url statements
Variables can be HTML encoded with ?html, or URL encoded with ?url.
- HTML encoding will convert special characters like <, >, ... into their HTML entities (&lt;, &gt;, ...). That's used when you want to display content to the user that contains special characters that shouldn't be interpreted by the browser. For example, if you want to display the literal word "<script>", you need to write &lt;script&gt;, otherwise the browser will interpret it as a <script> HTML tag.
- URL encoding converts special characters in a URL to their percent-encoded form. For example, a "$" will be converted to %24. That's used when you want to embed special characters in a URL (link, image source, etc.).
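The difference between the two encodings can be tried outside FreeMarker. The following is a minimal sketch using Python's standard library as a stand-in for ?html and ?url; the exact set of characters each one encodes is not guaranteed to match FreeMarker's (for instance, Python writes an apostrophe as &#x27;).

```python
from html import escape
from urllib.parse import quote

# HTML encoding: the browser shows the text instead of parsing it as a tag.
html_encoded = escape("<script>")
print(html_encoded)  # &lt;script&gt;

# URL encoding: special characters become percent-encoded.
url_encoded = quote("$", safe="")
print(url_encoded)  # %24
```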
An ad-hoc <#noescape> directive
Sometimes, you don't want the automatic escaping to take place. In that case, you need to enclose the variable with a <#noescape> directive.
That's for example the case with Curator or Best Bets HTML messages. We know that those messages can contain HTML tags like <b>, <i>, ... If they were automatically escaped, they would end up being displayed as a literal "<b>" in the results, rather than being interpreted as bold.
You'll note that by default the results summaries are also not escaped:
<@s.boldicize><#noescape>${s.result.summary}</#noescape></@s.boldicize>
That's because the Modern UI is re-encoding some special characters as described in the previous section. Let's take an example:
- Result summary contains the sentence "5 is > 4"
- Modern UI re-encodes it to "5 is &gt; 4"
- If the variable was escaped, or displayed with summary?html, it would result in another encoding pass: "5 is &amp;gt; 4". The &amp; would be interpreted as "&" by the browser, and the page would display "5 is &gt; 4".
- If the variable was not escaped, the data is inserted as is in the HTML source: "5 is &gt; 4". The browser will then interpret &gt; as ">", and the page will display "5 is > 4".
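The same sequence can be reproduced with a short sketch. It uses Python's html.escape and html.unescape as stand-ins for the Modern UI re-encoding, the FreeMarker escaping, and the browser's single decoding pass; this is an assumption made for illustration, not how Funnelback implements it.

```python
from html import escape, unescape

summary = "5 is > 4"

# Stand-in for the Modern UI re-encoding the PADRE summary.
encoded_once = escape(summary)
# Stand-in for an extra escaping pass in the template (global escape or ?html).
encoded_twice = escape(encoded_once)

# The browser decodes entities exactly once when rendering.
shown_without_escape = unescape(encoded_once)
shown_with_escape = unescape(encoded_twice)

print(encoded_twice)         # 5 is &amp;gt; 4
print(shown_with_escape)     # 5 is &gt; 4
print(shown_without_escape)  # 5 is > 4
```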
In our specific form example, that's where the problem lies. Looking at the form, we can see the summary being displayed as:
<p>
<small class="text-muted">
<#if s.result.date??>${s.result.date?date?string("d MMM yyyy")}: </#if>
<#if s.result.metaData["a"]??>${s.result.metaData["a"]} - </#if>
</small>
<@s.boldicize>
${s.result.summary?html}
<!-- ${s.result.metaData["c"]!} -->
</@s.boldicize>
</p>
Notice that the summary has a ?html statement in it. Additionally, the global escaping is still in place at the top of the file, resulting in a double encoding!
- Modern UI data model contains "Funnelback doesn't"
- The global escape directive at the top of the form encodes it to "Funnelback doesn&#39;t"
- The additional ?html statement encodes it to "Funnelback doesn&amp;#39;t"
- The browser will decode the &amp;, resulting in "Funnelback doesn&#39;t" being displayed to the user
The correct way to display it should be to bypass all possible encodings:
<#noescape>${s.result.summary}</#noescape>
HTML output declared encoding
Your FreeMarker template will probably declare an encoding in the HEAD section, something like <meta charset="utf-8">.
This cannot be changed to something else. The Modern UI will always output UTF-8, and there's no way to change that. If you declare a different encoding in your HTML page, the browser will try to interpret the UTF-8 data as something else, which will result in corruption.
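To illustrate the kind of corruption a mismatched declaration produces, here is a minimal sketch that decodes UTF-8 bytes as windows-1252, roughly what a browser would do if the page declared that charset. The sample word and the choice of windows-1252 are assumptions made for the example.

```python
text = "café"

# The bytes actually sent on the wire are UTF-8.
utf8_bytes = text.encode("utf-8")

# Reading those bytes with the wrong charset turns one character into two.
misread = utf8_bytes.decode("windows-1252")
print(misread)  # cafÃ©

# Reading them as UTF-8 gives back the original text.
correct = utf8_bytes.decode("utf-8")
print(correct)  # café
```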