Best practices - Gathering


This section details gathering best practices organised by data source type.

API generated content

API generated content is content returned by accessing an API. The content is usually returned in a structured format such as JSON or XML via a REST-style web call.

If the content is fully returned via HTTP/HTTPS with no authentication, or with basic HTTP, NTLM or form-based authentication, then a web collection will probably be the most appropriate collection type to use.

If you require custom logic to access the content, such as multi-layered API requests or multiple API requests that need to be aggregated, or the repository has authentication requirements beyond what a web collection supports, then use a custom collection to gather the content.
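
The kind of multi-layered request logic that pushes a source towards a custom collection can be sketched as follows. This is a plain Python illustration rather than Funnelback's custom gatherer API: the `/items` and `/items/<id>` endpoints, the `id` field and the function names are all hypothetical, and authentication is left out.

```python
import json
from typing import Callable, Dict, List
from urllib.request import urlopen


def http_get_json(url: str):
    """Fetch a URL and decode the JSON body (no authentication handling)."""
    with urlopen(url) as response:
        return json.load(response)


def gather_records(base_url: str,
                   fetch: Callable[[str], object] = http_get_json) -> List[Dict]:
    """Combine one listing request with one detail request per item."""
    # Hypothetical listing endpoint returning a JSON array of items.
    listing = fetch(f"{base_url}/items")
    records = []
    for item in listing:
        # Hypothetical detail endpoint, keyed on the item's "id" field.
        detail = fetch(f"{base_url}/items/{item['id']}")
        # Merge listing and detail fields into a single record per item.
        records.append({**item, **detail})
    return records
```

Passing the fetch function in as a parameter keeps the aggregation logic testable without network access. Each merged record would then be stored as one document for indexing.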

Recommended collection type: custom or web

CSV

CSV data, if accessible via an HTTP/HTTPS URL, should be gathered using a web collection and the built-in CSVtoXML filter to convert the CSV to XML appropriate for indexing by Funnelback.

If the CSV is unavailable via HTTP/HTTPS and there is another appropriate collection type for accessing it, then use that collection type to gather the CSV data and apply the built-in CSVtoXML filter.

If using a custom collection to gather the CSV data (e.g. from an API) then use the built-in CSVtoXML filter to convert the CSV to XML appropriate for indexing by Funnelback.

Further modifications can then be made using the filter framework, operating on the XML output from the CSVtoXML filter.
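
To make the CSV-to-XML step concrete, here is a minimal sketch of the general idea using only the Python standard library. It is not the CSVtoXML filter itself and does not claim to reproduce its output: the `records` and `record` element names are assumptions chosen for illustration, and column headers are assumed to be valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str, root_tag: str = "records",
               row_tag: str = "record") -> str:
    """Turn each CSV row into one XML element with a child per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element(root_tag)
    for row in reader:
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # The column header becomes the element name.
            ET.SubElement(record, column).text = value
    return ET.tostring(root, encoding="unicode")
```

For example, `csv_to_xml("id,title\n1,Alpha\n")` yields one `record` element containing an `id` and a `title` child.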

Recommended collection type: web or custom

JSON

JSON data, if accessible via an HTTP/HTTPS URL, should be gathered using a web collection and the built-in JSONtoXML filter to convert the JSON to XML appropriate for indexing by Funnelback.

If the JSON is unavailable via HTTP/HTTPS and there is another appropriate collection type for accessing it, then use that collection type to gather the JSON data and apply the built-in JSONtoXML filter.

If using a custom collection to gather the JSON data (e.g. from an API) then use the built-in JSONtoXML filter to convert the JSON to XML appropriate for indexing by Funnelback.

Further modifications can then be made using the filter framework, operating on the XML output from the JSONtoXML filter.
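
The JSON-to-XML conversion can be pictured with a similar sketch. Again, this is a generic illustration and not the JSONtoXML filter's actual mapping: the `json` root element and the `item` element used for array members are assumptions, and object keys are assumed to be valid XML element names.

```python
import json
import xml.etree.ElementTree as ET


def build(parent: ET.Element, value) -> None:
    """Recursively attach a decoded JSON value beneath an XML element."""
    if isinstance(value, dict):
        for key, child in value.items():
            build(ET.SubElement(parent, key), child)
    elif isinstance(value, list):
        for child in value:
            # Array members have no name in JSON, so use a fixed tag.
            build(ET.SubElement(parent, "item"), child)
    elif value is not None:
        # Scalars are stored as text using Python's str() formatting.
        parent.text = str(value)


def json_to_xml(json_text: str, root_tag: str = "json") -> str:
    """Convert a JSON document to an XML string."""
    root = ET.Element(root_tag)
    build(root, json.loads(json_text))
    return ET.tostring(root, encoding="unicode")
```

Nested objects become nested elements, which is what allows later filters to address individual fields in the converted XML.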

Recommended collection type: web or custom

SQL database

The rows of a table of results returned by an SQL database query can be indexed as individual result items.

If the SQL database is accessible from the Funnelback server (and there is an appropriate JDBC driver) then a database collection should be used.

If the SQL database is not accessible (e.g. because of security requirements) then a web collection can be used, provided there is a web-accessible location from which the results can be fetched. This could be an export process that runs the SQL query and converts the results into an XML file (ideally), or a simple web script that Funnelback can access remotely, which queries the database and returns the results dynamically as XML. XML is the preferred delivery format as it doesn't require any further conversion by Funnelback. JSON and CSV would also work but would require extra configuration to convert to XML.
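
An export process of this kind could look roughly like the sketch below. It uses Python's built-in `sqlite3` module purely as a stand-in for whichever database is involved, and the `rows` and `row` element names, the function name and the sample table are all hypothetical. Column names are assumed to be valid XML element names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_query_as_xml(connection: sqlite3.Connection, query: str) -> str:
    """Run a query and serialise each result row as one XML element."""
    cursor = connection.execute(query)
    columns = [description[0] for description in cursor.description]
    root = ET.Element("rows")
    for values in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            # NULL values are written as empty elements.
            ET.SubElement(row, column).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Usage example with an in-memory database and a hypothetical table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE staff (id INTEGER, name TEXT)")
connection.executemany("INSERT INTO staff VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])
exported = export_query_as_xml(connection, "SELECT id, name FROM staff ORDER BY id")
```

The resulting string could be written to a file in a web-accessible location, or returned directly by a web script, for a web collection to fetch.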

Recommended collection type: database or web

Websites

A web collection should be used:

  • For any website that is not authenticated. This includes intranet sites that are delivered via a content management system.
  • For any website requiring basic HTTP authentication, NTLM authentication or form-based authentication if the page content is not personalised for the user and no document level security is required.

For websites (mostly intranets) that require document level security it is likely a custom connector will be required to index the site.

Recommended collection type: web

XML

XML data, if accessible via an HTTP/HTTPS URL, should be gathered using a web collection.

If the XML needs to be split into individual records:

  • If no transformation of the XML is required and the XML fields to extract are contained in simple XPaths then use the built-in XML splitting that can be configured via the XML processing options.
  • If the XML needs to be transformed then use the filter framework to split the XML (using the SplitXml filter available from GitHub; see https://github.com/funnelback/groovy-filters/tree/master/crawl%20filters) chained with a custom filter that makes the required modifications to the XML. See the section on filtering best practices below.

If the XML is unavailable via HTTP/HTTPS and there is another appropriate collection type for accessing it, then use that collection type to gather the XML and use the filter framework to make the required modifications to the XML.
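
As an illustration of what splitting means here, the sketch below breaks one XML document into a separate fragment per record. It is a plain Python stand-in, not the SplitXml filter or Funnelback's built-in XML splitting, and the function name and the example record path are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def split_xml(xml_text: str, record_path: str) -> List[str]:
    """Return one XML string per element matched by record_path."""
    root = ET.fromstring(xml_text)
    return [
        # strip() drops whitespace that followed the element in the source.
        ET.tostring(record, encoding="unicode").strip()
        for record in root.findall(record_path)
    ]
```

Each returned fragment corresponds to one record. Any transformation step would then operate on the fragments individually.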

If using a custom collection to gather the XML data (e.g. from an API) then use the filter framework to make the required modifications to the XML.

Recommended collection type: web or custom
