Splitting XML files



This article discusses two different techniques for splitting XML files into separate items within the search index.

Splitting XML using the indexer

The Funnelback indexer includes built-in support for splitting XML files using a specified X-Path.

Using the indexer to split the XML document is ideal if the XML source does not need to be transformed in any way.

After splitting, each record matched by the element path will be indexed as a separate document within Funnelback.
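The effect of path-based splitting can be illustrated with a minimal sketch. This is not Funnelback's indexer code; it only assumes Python's standard xml.etree.ElementTree module, a made-up sample document, and a hypothetical helper named split_records. Note that ElementTree evaluates paths relative to the root element, so the record path is written as ./book where a full X-Path would be /catalog/book.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input: one XML file holding two records.
SAMPLE = """<catalog>
  <book><title>First</title></book>
  <book><title>Second</title></book>
</catalog>"""


def split_records(xml_text, record_path):
    """Return each element matched by record_path as its own XML string."""
    root = ET.fromstring(xml_text)
    # tostring() also emits the whitespace that follows each element,
    # so strip it to leave just the record itself.
    return [
        ET.tostring(element, encoding="unicode").strip()
        for element in root.findall(record_path)
    ]


records = split_records(SAMPLE, "./book")
for record in records:
    print(record)
```

Each string in records corresponds to one item that would be indexed as a separate document.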

Method: XML processing options (Funnelback 15.14 and newer)

The XML processing options screen allows configuration of an X-Path used to split the XML. Set this value in the XML document splitting field on the XML processing screen, which is available from the administer tab in the administration interface.

See: Funnelback documentation: XML documents - XML document splitting

Method: xml.cfg (Funnelback 15.12 and earlier)

XML configuration in Funnelback 15.12 and earlier used xml.cfg to configure both the XML field mappings and other XML options.

The docurl field in xml.cfg sets the X-Path used to split the XML document into individual documents.

See: Funnelback documentation: xml.cfg

Splitting XML using the filter framework (Funnelback 15.8 and newer)

The filter framework in Funnelback 15.8 and newer can be used to split an XML document into multiple documents that can then be processed further in subsequent filters within the collection's filter chain.

A string document filter can be implemented that parses the input document text into an XML object, and then splits it into separate documents with unique URLs.
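The parse, split, and assign-URL steps can be sketched as follows. This is an illustration in Python rather than a Funnelback filter: it does not use the Funnelback filter API, and the function name split_with_urls, the sample feed, and the scheme of appending a record number to the parent URL are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def split_with_urls(xml_text, record_path, parent_url):
    """Split xml_text on record_path and pair each record with a unique URL."""
    root = ET.fromstring(xml_text)
    documents = []
    for index, element in enumerate(root.findall(record_path), start=1):
        # Assumed URL scheme: parent URL plus the record's position,
        # which is unique within a single source document.
        url = f"{parent_url}/{index}"
        content = ET.tostring(element, encoding="unicode").strip()
        documents.append((url, content))
    return documents


# Hypothetical input document and source URL.
feed = (
    "<staff>"
    "<person><name>Ana</name></person>"
    "<person><name>Raj</name></person>"
    "</staff>"
)
docs = split_with_urls(feed, "./person", "http://example.com/staff.xml")
for url, content in docs:
    print(url, content)
```

A position-based URL changes if records are reordered in the source, so a real filter may prefer to build the URL from a stable identifier inside each record where one exists.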

See: Funnelback documentation: Document filtering

A sample filter that can be used for XML document splitting is available on the Funnelback GitHub site. It has only had basic testing in Funnelback 15.18. It should work in earlier versions, although versions that don't support the Grapes/Grab syntax used in the filter will require some modification. The code can also be adapted to use Groovy's XML parser, in which case element paths must be specified using GPaths instead of X-Paths.

See: SplitXML filter documentation

Download: SplitXML filter
