Squiz Matrix integration
The following items should be considered as part of any Funnelback-Matrix integration, regardless of how the search results are integrated with the CMS:
- External metadata for all binary documents that are indexed by Funnelback (via Matrix asset listings)
- Noindex tagging of all Matrix pages to hide non-content regions.
- Increase the crawler.request_timeout (eg. to 30s) as Matrix sites commonly time out.
- The new Matrix Funnelback Search Page asset can be used to display Funnelback output, using Matrix tags to format the search results. However, this asset can't be used if you need facets. Note: when using the Funnelback Search Page asset you need to specify the web path to xml.cgi (not search.cgi) - note: use of the is not recommended.
- The REST asset is another way to integrate Funnelback output into a Matrix page.
- Squiz CMS/MySource Mini doesn't support remote content so search results pages must be formatted using Funnelback's templating system. (Note: possibility of including headers and footers via IncludeUrl)
- Take care when using instant updates (see below)
- If any of the searches involve DLS that requires impersonation (eg. TRIM), don't integrate in this way; otherwise you'll need a solution to ensure that impersonation is passed through.
- Make sure you account for the remote user's IP address being replaced with that of the Matrix server (see below)
REST asset (use this in preference to a Remote Content Asset)
Things to remember:
- Snippets can be used to insert chunks of Matrix code into a Funnelback form (eg. if you need to include some Matrix content inside the block of code returned by Funnelback).
Advantages / disadvantages
- You lose IP based stats in Funnelback's analytics (as all requests come from the Matrix server's IP). Note: this can be mitigated (see instructions below for IP address handling)
- You get the increased overhead of Matrix constructing the page so you will have longer response times compared to using Funnelback native forms. This is very important and should not be underestimated; it is a fairly common problem with Matrix.
- If you go the other way (i.e. use Funnelback IncludeUrl calls to pull in Matrix templates) search results will be instantly updated.
- Squiz sometimes cache even the REST asset output, so Funnelback results will not update with the index.
- The ui.modern_*_link values need to be set so that the form links work (ui_search_link can usually be left blank, but the other values need to be set to the full path to the relevant Funnelback endpoint, eg. http://FUNNELBACK_SERVER/s/redirect or http://FUNNELBACK_SERVER/s/cache).
- Don't forget to exclude the page that hosts the REST asset from crawling, either in the Matrix site's robots.txt or with robots meta tags in the page's metadata.
Modifying the Funnelback returned html code at the Matrix front end
Note that it is preferable to fix the Freemarker template rather than modify the return from Funnelback within the REST asset.
The code is input as part of the REST asset configuration from within Matrix.
For example, the following JavaScript was used on the Digital UK collection to make some front-end modifications:
var response = _REST.response.body;

// Fix markup
response = response.replace(/(<div class="facet">[\n\s]*<div class="facetLabel">[\n\s]*All results) : (<a href="\?.*>)all(<\/a>)(\s*→\s.*)(<\/div>)/g, "$1</div> <div class=\"category\"><span class=\"categoryName\">$2All$3</span></div> <div class=\"category\"><span class=\"categoryName\">$4</div>");

// Make 'pdf' uppercase in the sidebar on the left
response = response.replace(/(<span class="categoryName"><a href=".*">)pdf(<\/a>)/g, "$1PDF$2");
response = response.replace(/(→ )pdf/g, "$1PDF");

// Make 'map' proper case in the sidebar on the left
response = response.replace(/(<span class="categoryName"><a href=".*">)map(<\/a>)/g, "$1Map$2");
response = response.replace(/(→ )map/g, "$1Map");

// Make 'video' proper case in the sidebar on the left
response = response.replace(/(<span class="categoryName"><a href=".*">)video(<\/a>)/g, "$1Video$2");
response = response.replace(/(→ )video/g, "$1Video");

print(response);
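The repeated case fixes in the Digital UK code could also be driven from a small lookup table, so that adding a new label is a one-line change. The following is only a sketch: the function name fixCategoryCase and the LABELS table are hypothetical, the markup patterns are copied from the Digital UK example and would need adjusting for other templates, and the labels are assumed to contain no regular expression special characters.

```javascript
// Hypothetical lookup table: lower-case label as returned -> display form.
var LABELS = { pdf: "PDF", map: "Map", video: "Video" };

// Apply the same two substitutions as the Digital UK code for every label.
function fixCategoryCase(html, labels) {
  for (var label in labels) {
    if (!Object.prototype.hasOwnProperty.call(labels, label)) continue;
    var display = labels[label];

    // Link text of a category in the sidebar.
    var linkPattern = new RegExp(
      '(<span class="categoryName"><a href=".*">)' + label + "(</a>)", "g");
    html = html.replace(linkPattern, function (match, before, after) {
      return before + display + after;
    });

    // Label that follows the arrow (U+2192) in the selected category text.
    var arrowPattern = new RegExp("(\u2192 )" + label, "g");
    html = html.replace(arrowPattern, function (match, arrow) {
      return arrow + display;
    });
  }
  return html;
}

// Inside the REST asset this would be used as:
//   print(fixCategoryCase(_REST.response.body, LABELS));
```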
- Speed: This seems to be an issue as the search results generally don't seem to be cached, so every search is like specifying _nocache
- Statistics: All requests to Funnelback come from the Matrix server, so any stats that want to break down by IP won't work.
- Character encoding problems: Can occur if the Matrix site isn't UTF8. See below for a workaround.
Character encoding issues
Funnelback returns text as UTF-8. A problem was encountered with DUK, where the Matrix site was set up to return pages as ISO-8859-1, resulting in garbage characters in the embedded Funnelback code. The workaround was to replace the affected characters with HTML entities in the REST asset's JavaScript:
var response = _REST.response.body;
response = response.replace(/£/g, "&pound;");
response = response.replace(/©/g, "&copy;");
response = response.replace(/\u2014/g, "&mdash;");
response = response.replace(/\u2013/g, "&ndash;");
response = response.replace(/\u00a0/g, "&nbsp;");
response = response.replace(/\u2018/g, "&lsquo;");
response = response.replace(/\u2019/g, "&rsquo;");
response = response.replace(/\u201c/g, "&ldquo;");
response = response.replace(/\u201d/g, "&rdquo;");
response = response.replace(/\u2026/g, "&hellip;");
print(response);
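As a more general alternative to replacing characters one at a time, every non-ASCII character can be converted to a numeric character reference, which displays correctly whatever encoding the Matrix page declares. This is a sketch rather than the code that was used on the site; the function name encodeNonAscii is hypothetical.

```javascript
// Replace every character outside ASCII with a numeric HTML entity.
// Surrogate pairs are matched first so that characters above U+FFFF
// become a single entity instead of two invalid ones.
function encodeNonAscii(html) {
  return html.replace(
    /[\ud800-\udbff][\udc00-\udfff]|[\u0080-\uffff]/g,
    function (ch) {
      var code = ch.length === 2
        ? (ch.charCodeAt(0) - 0xd800) * 0x400 +
          (ch.charCodeAt(1) - 0xdc00) + 0x10000
        : ch.charCodeAt(0);
      return "&#" + code + ";";
    }
  );
}

// Inside the REST asset this would be used as:
//   print(encodeNonAscii(_REST.response.body));
```

ASCII markup passes through untouched, so the function can be applied to the whole response body.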
Funnelback search page asset (Squiz Matrix asset)
These are brief instructions on how to connect Funnelback Enterprise to Squiz Matrix using the Funnelback search page asset in Matrix to handle the front-end formatting of the search results.
Advantage: we only need to set up the configuration for the crawler - front end formatting is handled by Squiz Matrix.
- Requires at least MySource Matrix 3.26.
- Requires at least Funnelback 9.0.
- The Funnelback collection must not use faceting or events search (the search asset can only render the things currently supported by Funnelback OEM).
- As xml.cgi is used, you can't use anything that requires search.cgi.
- Set up a Funnelback search collection as normal and crawl the Matrix site. Exclude things like the search page, URLs containing SQ_DESIGN_NAME=print and so on. Make sure you exclude SQ_ACTION as well; otherwise, if you are running an authenticated crawl, the crawler might log itself off.
- Create a Funnelback search page asset (see: http://manuals.matrix.squiz.net/funnelback-search/chapters/funnelback-search-page)
- From the details screen of the asset choose: Funnelback server search.
- Enter the path to xml.cgi (search.cgi doesn't work and will return a Matrix FNB002 error message)
- Enter the Funnelback collection name.
This should handle the connection details.
Squiz should now be able to format the search results using the Matrix keywords that are available.
- Check that proxying isn't turned on inside Matrix (the search asset doesn't support it, though there is a workaround).
- If you get an FNB002 message, check that the requests are reaching the Funnelback server by watching the Apache request log.
IP address handling
When Funnelback is accessed via Matrix the originating IP address seen by Funnelback is that of the Matrix server. This means that every request logged will have the Matrix server's IP against it. This is bad as all the analytics reports which rely on IP address (eg. location reports) will be incorrect.
There is a working hook script solution in place for NSW Gov which involves the remote user's IP being passed in as a CGI parameter.
However, the preferred approach is to supply the remote user's IP address in the X-Forwarded-For HTTP header.
In the context of a Matrix REST asset this can be done by configuring the asset to send X-Forwarded-For as an HTTP request header, with the remote user's IP address as its value.
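The header takes the form X-Forwarded-For: <remote user's IP>. When a request passes through more than one proxy the value can become a comma-separated list, with the originating client conventionally listed first. The sketch below illustrates that convention only; it is not Funnelback's implementation, and the function name originatingIp is hypothetical.

```javascript
// Pick the originating client address out of an X-Forwarded-For value.
// Falls back to the address of the direct connection (here, the Matrix
// server) when the header is missing or empty.
function originatingIp(forwardedFor, connectionIp) {
  if (typeof forwardedFor !== "string") return connectionIp;
  var first = forwardedFor.split(",")[0].trim();
  return first !== "" ? first : connectionIp;
}
```

Without the header every request falls through to the connection address, which is why all analytics entries show the Matrix server's IP.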
One major problem that comes up time and time again is associating a binary document (eg. a PDF or Word document) with the metadata recorded for it in MySource Matrix. This mainly affects titles and descriptions: users (rightly) believe that they have entered a title, but don't see it reflected in the search results. This is because the metadata is stored in the Matrix database and Funnelback has no idea it exists (and no way to get at it).
The best solution is to apply this metadata using the external metadata mechanism.
- Find out what metadata is available for PDFs and other binary files on the site in question (requires contacting Squiz).
- Ask Squiz to create an asset listing that outputs text in our external metadata format. Once you know what metadata is available you can define how the external metadata file should look. The file needs to look like this, one entry per line:
<URL> <md_class>:"<md_value>" <md_class2>:"<md_value2>" etc
www.somewhere.com/path/to/document.pdf t:"This is the title" c:"This is the description" s:"These|Are|The Keywords"
www.somewhere.com/path/to/anotherdocument.pdf t:"Another title" c:"the description" s:"The Keywords"
The standard Funnelback md_classes that should be populated are:
t (document title)
c (document description)
There are also other standard ones that should be populated if relevant metadata is available:
a (document author)
s (document subject)
Other metadata can be specified too, but we need to ensure that there is an appropriate metadata class defined in Funnelback to map the data to.
- Once the asset listing is set up and available, you can set a pre-gather step in Funnelback to download the page and save it as the external metadata file (eg. using wget)
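To illustrate the file format described above, the sketch below builds one external metadata line from a URL and a set of metadata values. The function name and the shape of the input object are hypothetical. Replacing embedded double quotes with single quotes is an assumption made here to keep each value inside its quotes; check how your Funnelback version expects quotes to be handled.

```javascript
// Build one line of an external metadata file:
//   <URL> <md_class>:"<md_value>" <md_class2>:"<md_value2>" ...
// Array values are joined with "|", as in the keywords example.
function externalMetadataLine(url, fields) {
  var parts = [url];
  for (var mdClass in fields) {
    if (!Object.prototype.hasOwnProperty.call(fields, mdClass)) continue;
    var value = fields[mdClass];
    if (Array.isArray(value)) value = value.join("|");
    // Assumption: swap double quotes for single quotes, and collapse
    // whitespace so that each entry stays on a single line.
    value = String(value).replace(/"/g, "'").replace(/\s+/g, " ").trim();
    if (value === "") continue; // skip classes with nothing to record
    parts.push(mdClass + ':"' + value + '"');
  }
  return parts.join(" ");
}
```

For example, calling it with the first sample document's URL and the fields t, c and s (the last as an array of three keywords) reproduces the first sample line shown earlier.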
Authenticated http requests
The easiest way to index Matrix sites that have authentication is to get HTTP authentication enabled on the Matrix server. This allows the standard http_user and http_passwd collection.cfg settings to be used to crawl the site.
Using this method:
- HTTP authentication needs to be enabled via an admin interface setting in Matrix - the normal login screen for Matrix doesn't use HTTP authentication.
If this isn't possible, or you need to authenticate using Matrix's standard login screen (which doesn't use http_auth):
Set up a form_interaction.cfg with an entry something like:
[path to login screen] 1 SQ_LOGIN_USERNAME=[USERNAME]&SQ_LOGIN_PASSWORD=[PASSWD]
Ensure you use pre-crawl authentication (ie. don't set crawler.form_interaction_in_crawl=true in collection.cfg)
Problems with authentication
An issue has also been discovered relating to authenticated requests: any code that parses the responses to authenticated requests to Matrix needs to take a licence expiry check into account.
Non-authenticated HTTP requests won't trigger Matrix to check the warranty key validity, but authenticated ones will. So a lesson to be learnt from this for Matrix integration is that any authenticated HTTP requests to Matrix should check for the 'Invalid Warranty Key' screen, or access Live assets with Public Read permission so authentication isn't required.
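A check for that screen could be as simple as the sketch below. It assumes the screen contains the literal text 'Invalid Warranty Key' named above; the exact markup Matrix returns is not known here, and the function name is hypothetical.

```javascript
// Return true when a response body from Matrix looks like the
// 'Invalid Warranty Key' screen rather than the requested content.
function isWarrantyKeyError(body) {
  return typeof body === "string" &&
    body.indexOf("Invalid Warranty Key") !== -1;
}
```

Code that processes authenticated responses (for example an asset listing fetched before an update) can call this first and abort rather than treat the error screen as content.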
TODO: flesh out properly
- Avoid setting Matrix triggers to fire off an update directly. This approach works for the odd change, but if you move part of a matrix site you can end up triggering thousands of simultaneous instant updates which will crash the Funnelback server.
- A better solution is to get Matrix to batch up the updates, eg. trigger an event every x (eg. 5) minutes that puts together a Funnelback feed, which is then sent to Funnelback's feed interface.
- An alternative (not as good) is to modify the feed interface on the Funnelback side to batch up the updates.
Example sites where this has been implemented with varying degrees of success: NSW Treasury (PD01), Federal Court (installed client), UK Electoral Commission (UK SaaS), Digital UK (UK SaaS).
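The batching idea can be sketched as follows: changed URLs are collected as they arrive and sent as one batch on a timer, so moving part of a site produces one update rather than thousands. This is an illustration of the approach only, not code from any of the sites listed; UpdateBatcher and the sendBatch callback (which would post the batch to the feed interface) are hypothetical.

```javascript
// Collect changed URLs and hand them over in batches.
function UpdateBatcher(sendBatch) {
  this.pending = {};          // URL -> true, so repeated URLs collapse
  this.sendBatch = sendBatch; // hypothetical callback that submits a batch
}

// Record a changed URL; nothing is sent yet.
UpdateBatcher.prototype.add = function (url) {
  this.pending[url] = true;
};

// Send everything collected so far as one batch and start a new one.
// Returns the number of URLs sent.
UpdateBatcher.prototype.flush = function () {
  var urls = Object.keys(this.pending);
  this.pending = {};
  if (urls.length > 0) this.sendBatch(urls);
  return urls.length;
};

// A timer would then drive the flush, eg. every 5 minutes:
//   setInterval(function () { batcher.flush(); }, 5 * 60 * 1000);
```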
Sourcing metadata from Matrix
Several setups supplement document information with metadata that has been assigned in Matrix. The basic process is:
- Squiz create an asset list that lists documents and their assigned metadata (e.g. URL, title, author, expiry, etc.)
- Funnelback builds a start urls list based on the asset list.
- Funnelback crawls the URLs.
- During filtering, Funnelback matches the metadata in the asset list against downloaded documents and inserts them into the file.
- Update proceeds as normal.
- Caching: Try to use '/_nocache' with the asset list URL. Squiz have to cache Matrix output as it is so server intensive to generate pages. Occasionally, they may cache these asset lists as well.
- Timeouts: If the asset list is timing out, increase 'crawler.max_timeout_retries' and 'crawler.request_timeout'.
- Pagination: Squiz could also try paginating the asset list (e.g. Parks Victoria do this).
- Modifying crawled file vs. using external metadata: It's recommended that you modify the file rather than using external metadata, to avoid getting appended metadata fields (e.g. the title being output as '<title>Default title|Matrix title</title>').
- Redirects: Sometimes, at go-live, Squiz might decide to implement 302 redirects to preserve external search engine rankings, without updating the document metadata list. This means that Funnelback will access the files, get redirected to the new URL and subsequently not be able to link it back to the asset list. A workaround is to read the redirects.txt into a hash/map and reference this when applying the external metadata. See the attached 'ProcessExternalMetadata.groovy' script for an example.
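The redirect workaround can be sketched as a lookup that walks back from the URL Funnelback ended up at to the URL that appears in the asset list. This is not the attached Groovy script. It assumes each line of redirects.txt holds the original URL followed by the redirect target, separated by whitespace; check the actual file format before relying on it. The function names are hypothetical.

```javascript
// Parse redirect lines of the assumed form "<original URL> <target URL>"
// into a map from target URL back to original URL.
function parseRedirects(text) {
  var targetToOriginal = {};
  var lines = text.split(/\r?\n/);
  for (var i = 0; i < lines.length; i++) {
    var fields = lines[i].trim().split(/\s+/);
    if (fields.length >= 2) targetToOriginal[fields[1]] = fields[0];
  }
  return targetToOriginal;
}

// Find the URL under which metadata was listed for a crawled document,
// following redirects backwards. Returns null when no match is found.
function assetListUrlFor(crawledUrl, targetToOriginal, metadataByUrl) {
  var has = Object.prototype.hasOwnProperty;
  var url = crawledUrl;
  var seen = {};
  while (!has.call(metadataByUrl, url)) {
    // Stop on a redirect loop or when there is nothing left to follow.
    if (seen[url] || !has.call(targetToOriginal, url)) return null;
    seen[url] = true;
    url = targetToOriginal[url];
  }
  return url;
}
```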