Configure Funnelback to index additional file types


Background

Funnelback supports the indexing of HTML, Microsoft Office (Word/Excel/PowerPoint), RTF and text documents out of the box.

The binary formats are converted to text using Apache Tika, which supports a large number of document formats.

These formats can easily be added to the filetypes indexed by Funnelback.

The first approach discussed below does not convert the documents to text but uses metadata to describe the binary documents. The other approaches use Tika or an external filter to extract text from the binary documents.

Indexing non-textual files

For Funnelback to successfully index a document it needs a textual representation of the document. For text-based documents such as PDFs or Microsoft Word documents, filtering is used to extract the text contained within the document, and this extracted text is what Funnelback indexes.

For other file types such as multimedia (e.g. images, movies, sound files) the filtering will only extract metadata that has been embedded within the files. This is rarely useful for search because the embedded metadata usually describes attributes of the file, such as the bit rate, the duration or the camera used to take a photo.

The best approach for indexing non-textual files is to index text that has been written to describe the files and to associate that text with the file's URL. For a sound file or movie, index a transcript, or write descriptive metadata such as a title, description and keywords which can then be used as the text that describes the file.

Tip: When you use this approach to index non-textual documents the actual files themselves do not need to be downloaded by Funnelback (e.g. if you're doing a web crawl these can be in the exclude list). This is because the files themselves are not indexed - Funnelback is using the XML record to index the file and then attaching the file's URL to the search result.

Tutorial: Index non-textual files using an XML file listing

This tutorial shows how to use a simple XML file to describe a number of non-text files to add them to the search index.

Consider a site that has 3 non-text files:

  • An image (shakespeare.jpg) containing a picture of William Shakespeare
  • A sound file (hamlet.mp3) containing a radio performance of Hamlet
  • A video file (lear.mov) of a performance of King Lear

  1. Produce an XML file containing all the useful fielded information describing your files. This could be produced manually, or automatically generated from metadata/database information. For example:
    <?xml version="1.0" encoding="UTF-8" ?>
    <files>
    	<file>
    		<title><![CDATA[Chandos portrait]]></title>
    		<uri>http://shakespeare.example.com/images/shakespeare.jpg</uri>
    		<description><![CDATA[The Chandos portrait is the most famous of the portraits that may depict William Shakespeare. Painted between 1600 and 1610, it may have served as the basis for the engraved portrait of Shakespeare used in the First Folio in 1623. It is named after the Dukes of Chandos, who formerly owned the painting.]]></description>
    		<author><![CDATA[John Taylor]]></author>
    		<date>1610</date>
    		<location><![CDATA[National Portrait Gallery, London]]></location>
    		<keywords>
    			<keyword><![CDATA[John Taylor]]></keyword>
    			<keyword><![CDATA[William Shakespeare]]></keyword>
    			<keyword><![CDATA[painting]]></keyword>
    		</keywords>
    		<filetype>Image</filetype>
    		<filesize>820kB</filesize>
    		<format>jpg</format>
    	</file>
    	<file>
    		<title><![CDATA[Hamlet]]></title>
    		<uri>http://shakespeare.example.com/radio/hamlet.mp3</uri>
    		<description><![CDATA[A full-text radio production of the play, co-produced by the BBC and the Renaissance Theatre Company. Features Kenneth Branagh as Hamlet, Derek Jacobi as Claudius, Judi Dench as Gertrude, and John Gielgud as the Ghost.]]></description>
    		<author><![CDATA[William Shakespeare]]></author>
    		<author><![CDATA[Kenneth Branagh]]></author>
    		<author><![CDATA[Derek Jacobi]]></author>
    		<author><![CDATA[Judi Dench]]></author>
    		<author><![CDATA[Renaissance Theatre Company]]></author>
    		<author><![CDATA[British Broadcasting Corporation]]></author>
    		<date>1992</date>
    		<keywords>
    			<keyword><![CDATA[William Shakespeare]]></keyword>
    			<keyword><![CDATA[audio]]></keyword>
    			<keyword><![CDATA[radio]]></keyword>
    			<keyword><![CDATA[BBC Radio 3]]></keyword>
    		</keywords>
    		<duration>235</duration>
    		<duration_units>min</duration_units>
    		<filetype>Sound recording</filetype>
    		<filesize>399.7MB</filesize>
    		<format>mp3</format>
    	</file>
    	<file>
    		<title><![CDATA[King Lear]]></title>
    		<uri>http://shakespeare.example.com/video/lear.mov</uri>
    		<description><![CDATA[King Lear is a 2018 British-American television film directed by Richard Eyre. An adaptation of the play of the same name by William Shakespeare, cut to just 115 minutes, it was broadcast on BBC Two on 28 May 2018.]]></description>
    		<author><![CDATA[William Shakespeare]]></author>
    		<author><![CDATA[Richard Eyre]]></author>
    		<author><![CDATA[Jim Broadbent]]></author>
    		<author><![CDATA[Jim Carter]]></author>
    		<author><![CDATA[Tobias Menzies]]></author>
    		<author><![CDATA[Emily Watson]]></author>
    		<author><![CDATA[John Macmillan]]></author>
    		<author><![CDATA[Florence Pugh]]></author>
    		<author><![CDATA[Emma Thompson]]></author>
    		<author><![CDATA[Anthony Calf]]></author>
    		<author><![CDATA[Anthony Hopkins]]></author>
    		<author><![CDATA[Simon Manyonda]]></author>
    		<author><![CDATA[Chukwudi Iwuji]]></author>
    		<author><![CDATA[Karl Johnson]]></author>
    		<author><![CDATA[Samuel Valentine]]></author>
    		<author><![CDATA[Andrew Scott]]></author>
    		<author><![CDATA[Christopher Eccleston]]></author>
    		<date>2018</date>
    		<keywords>
    			<keyword><![CDATA[William Shakespeare]]></keyword>
    		</keywords>
    		<duration>115</duration>
    		<duration_units>min</duration_units>
    		<filetype>Video recording</filetype>
    		<filesize>4.8GB</filesize>
    		<format>mov</format>
    	</file>
    </files>
    
  2. Make the XML available at a web accessible address. (e.g. http://shakespeare.example.com/files.xml)
  3. Ensure that the XML file is included in your search. e.g. for a web collection you could add the XML's URL to your start URLs.
  4. Update your search collection.
  5. Set the following XML processing options (Note the paths here are specific to the example XML above). This will split the XML document into multiple records, and assign the URL and filetype based on the contents of specified fields in the XML.
    • XML document splitting: /files/file
    • Document URL: /files/file/uri
    • Document filetype: /files/file/format
  6. Create metadata mappings for all of the fields that you wish to include in the index. e.g.
    • t: //title
    • author: //author
    • etc.
  7. Re-index the live view to incorporate the metadata.
  8. At this point you should see the additional results appearing in your search results. You will need to modify your template to display the result appropriately.
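
Step 1 notes that the XML listing can be generated automatically. The following is a minimal Python sketch of that idea, assuming the file descriptions are already available as a list of dictionaries. The RECORDS data and the build_listing and split_records helper names are hypothetical and are not part of Funnelback. The split_records helper only imitates what the /files/file splitting and the uri/format paths in step 5 select; it is not how Funnelback itself processes the XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical input records; in practice these might come from a database export.
RECORDS = [
    {
        "title": "Chandos portrait",
        "uri": "http://shakespeare.example.com/images/shakespeare.jpg",
        "author": ["John Taylor"],
        "keywords": ["John Taylor", "William Shakespeare", "painting"],
        "format": "jpg",
    },
    {
        "title": "Hamlet",
        "uri": "http://shakespeare.example.com/radio/hamlet.mp3",
        "author": ["William Shakespeare", "Kenneth Branagh"],
        "keywords": ["William Shakespeare", "audio", "radio"],
        "format": "mp3",
    },
]


def build_listing(records):
    """Build a <files> document with one <file> element per record."""
    root = ET.Element("files")
    for record in records:
        file_el = ET.SubElement(root, "file")
        ET.SubElement(file_el, "title").text = record["title"]
        ET.SubElement(file_el, "uri").text = record["uri"]
        for author in record.get("author", []):
            ET.SubElement(file_el, "author").text = author
        keywords_el = ET.SubElement(file_el, "keywords")
        for keyword in record.get("keywords", []):
            ET.SubElement(keywords_el, "keyword").text = keyword
        ET.SubElement(file_el, "format").text = record["format"]
    return root


def split_records(root):
    """Imitate the /files/file split: return (uri, format) for each <file>."""
    return [(f.findtext("uri"), f.findtext("format")) for f in root.findall("file")]


if __name__ == "__main__":
    listing = build_listing(RECORDS)
    # ElementTree escapes special characters itself, so CDATA sections are not needed.
    print(ET.tostring(listing, encoding="unicode"))
    for uri, fmt in split_records(listing):
        print(fmt, uri)
```

The serialised output would then be published at the web accessible address described in step 2.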

Add an additional filetype that is supported by Tika

Before going any further check the list of supported document types. When checking ensure you look at the correct version of Tika - you can find out the version by finding the Tika jar files that sit within the $SEARCH_HOME/lib/java/all folder.

For formats supported by Tika see: Funnelback - Tika versions.

Add additional types to web collections

Step 1. crawler reject files list

Ensure the filetype extension is not present in the crawler.reject_files list.

The default value in collection.cfg is:

crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip

Step 2. parser mime types

If you want links to be extracted from the document (for crawl purposes), ensure the MIME type is listed in the crawler.parser.mimeTypes list. Note: only text documents should be listed here.

The default value in collection.cfg is:

crawler.parser.mimeTypes=text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml

Step 3. non html files list

Add the file extension of the new filetype to the crawler.non_html list

The default value in collection.cfg is:

crawler.non_html=pdf,doc,ps,ppt,xls,rtf

Step 4. Tika processed file types

Check that the file extension is listed in the filter.tika.types list.

The default value in collection.cfg is:

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm

Step 5. collection update

Update the collection by running a full crawl.
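
The checks in steps 1, 3 and 4 can be scripted before running the crawl. The sketch below is an illustration only and is not a Funnelback tool. It assumes that the three settings are written out explicitly in collection.cfg as key=value lines and that lines starting with # are comments. A key that is missing from the file is treated as an empty list here, whereas Funnelback would fall back to its default value. The parse_cfg, as_list and web_checklist names are hypothetical.

```python
SAMPLE_CFG = """
crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip
crawler.non_html=pdf,doc,ps,ppt,xls,rtf
filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
"""


def parse_cfg(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def as_list(settings, key):
    """Return the comma-separated value of `key` as a list of items."""
    return [item.strip() for item in settings.get(key, "").split(",") if item.strip()]


def web_checklist(settings, ext):
    """List the changes needed before a web collection can index `ext`."""
    actions = []
    if ext in as_list(settings, "crawler.reject_files"):
        actions.append(f"remove {ext} from crawler.reject_files")
    if ext not in as_list(settings, "crawler.non_html"):
        actions.append(f"add {ext} to crawler.non_html")
    if ext not in as_list(settings, "filter.tika.types"):
        actions.append(f"add {ext} to filter.tika.types")
    return actions


if __name__ == "__main__":
    for action in web_checklist(parse_cfg(SAMPLE_CFG), "tif"):
        print(action)
```

With the default values shown above, tif is rejected by the crawler and missing from crawler.non_html, so both settings would need to change, while filter.tika.types already lists it.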

Add additional types to filecopy collections

Step 1. filecopier file types list

Add the file extension of the new filetype to the filecopy.filetypes list

The default value in collection.cfg is:

filecopy.filetypes=doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx
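
As a small illustration of this step, the sketch below appends an extension to a comma-separated setting value only if it is not already present. The add_extension helper is hypothetical, and epub is used purely as an example extension.

```python
def add_extension(value, ext):
    """Return the comma-separated list `value` with `ext` appended if missing."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    if ext not in items:
        items.append(ext)
    return ",".join(items)


DEFAULT_FILETYPES = "doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx"

if __name__ == "__main__":
    # The resulting line would be placed in collection.cfg.
    print("filecopy.filetypes=" + add_extension(DEFAULT_FILETYPES, "epub"))
```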

Step 2. Tika processed file types

Check that the file extension is listed in the filter.tika.types list.

The default value in collection.cfg is:

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm

Step 3. collection update

Update the collection.

Add additional types to trimpush collections

Step 1. trim extracted file types list

Add the file extension of the new filetype to the trim.extracted_file_types list

The default value in collection.cfg is:

trim.extracted_file_types=*,doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,txt,htm,html,jpg,gif,tif,vmbx

Step 2. Tika processed file types

Check that the file extension is listed in the filter.tika.types list.

The default value in collection.cfg is:

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm

Step 3. collection update

Update the collection.

Add additional types to other collection types

Note: this applies to all other collection types except local collections, which do not filter binary documents.

Step 1. Tika processed file types

Check that the file extension is listed in the filter.tika.types list.

The default value in collection.cfg is:

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm

Step 2. Update the collection

For a push collection call the push API by issuing a PUT request to /v1/collections/{collection}/documents and specify the optional filters parameter to include tika in the supplied filter chain.
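
The sketch below shows one way such a PUT request could be assembled with the Python standard library. Only the /v1/collections/{collection}/documents path and the filters parameter come from the text above. The base URL, the key parameter used to carry the document's URL, the content type, and the TikaFilterProvider value for filters are assumptions made for illustration and should be checked against the push API documentation for your Funnelback version. Authentication is omitted, and the request is built but not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_push_request(base_url, collection, key, content, filters):
    """Build (but do not send) a PUT request that adds one document."""
    # `key` as the name of the document-URL parameter is an assumption.
    query = urlencode({"key": key, "filters": filters})
    url = f"{base_url}/v1/collections/{quote(collection, safe='')}/documents?{query}"
    return Request(
        url,
        data=content,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},  # assumed content type
    )


if __name__ == "__main__":
    request = build_push_request(
        "https://search.example.com/push-api",  # hypothetical base URL
        "example-push",                          # hypothetical collection name
        "http://shakespeare.example.com/docs/folio.epub",
        b"...binary document content...",
        "TikaFilterProvider",                    # assumed filter chain value
    )
    print(request.get_method(), request.full_url)
```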

For other collection types update the collection.

Add an additional filetype using an external filter

Warning

The use of external filters is generally discouraged because a separate system process is run for each document that is filtered, which significantly slows down updates.

Step 1. Install binaries

Ensure any extra binaries are installed onto the Funnelback server and made executable by the search user (or relevant Windows user account used to run updates).

Step 2. Executables configuration

Add any new binaries to executables.cfg and create a textify.cfg containing extension-to-command mappings.

Step 3. Filter chain

Ensure that ExternalFilterProvider is included in the filter chain for the collection.

The default value in collection.cfg is:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
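
A quick way to check this is to split the filter.classes value into provider names and look for ExternalFilterProvider. The sketch below simply splits on both the comma and colon separators that appear in the default value; it does not model how Funnelback interprets those separators. The chain_providers and has_provider names are hypothetical.

```python
import re

DEFAULT_FILTER_CLASSES = (
    "CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider"
    ":JSoupProcessingFilterProvider:DocumentFixerFilterProvider"
)


def chain_providers(value):
    """Split a filter.classes value on ',' and ':' into provider names."""
    return [name.strip() for name in re.split(r"[,:]", value) if name.strip()]


def has_provider(value, provider):
    """Return True if `provider` appears anywhere in the filter chain."""
    return provider in chain_providers(value)


if __name__ == "__main__":
    print(has_provider(DEFAULT_FILTER_CLASSES, "ExternalFilterProvider"))
```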

Step 4. Parser and non html file types

Ensure that the filetype is added to the acceptable files for the collection using the Tika instructions above (web: crawler.non_html and optionally crawler.parser.mimeTypes; filecopy: filecopy.filetypes; TRIM/HP RM: trim.extracted_file_types)

Step 5. Tika processed file types

If the external filter is overriding Tika then ensure that the file extension is removed from filter.tika.types.
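
For illustration, the sketch below removes an extension from the default filter.tika.types value. The remove_extension helper is hypothetical, and pdf is used only as an example of a type that an external filter might take over.

```python
def remove_extension(value, ext):
    """Return the comma-separated list `value` without `ext`."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    return ",".join(item for item in items if item != ext)


DEFAULT_TIKA_TYPES = (
    "doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,"
    "tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm"
)

if __name__ == "__main__":
    # The resulting line would replace the existing setting in collection.cfg.
    print("filter.tika.types=" + remove_extension(DEFAULT_TIKA_TYPES, "pdf"))
```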

Step 6. collection update

Update the collection.
