Configure Funnelback to index additional file types

Background

Funnelback indexes HTML, Microsoft Office (Word/Excel/PowerPoint), RTF and text documents out of the box.

The binary formats are converted to text using Apache Tika, which supports a large number of document formats.

Other formats that Tika supports can be added to the file types indexed by Funnelback.

Add an additional filetype that is supported by Tika

Before going any further, check the list of supported document types. Make sure you look at the list for the correct version of Tika: you can find the version by locating the Tika jar files in the $SEARCH_HOME/lib/java/all folder.

For formats supported by Tika see: Funnelback - Tika versions.
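
As a rough aid for the version check, the sketch below reads version numbers out of Tika jar file names. The naming pattern (for example tika-core-1.24.jar) and the function names are assumptions for illustration, not something this article specifies; adjust the pattern if the jars on your server are named differently.

```python
import re
from pathlib import Path

# Assumption: Tika jars are named like "tika-core-1.24.jar".
TIKA_JAR = re.compile(r"^tika-[a-z-]+-(\d+(?:\.\d+)*)\.jar$")


def tika_versions(jar_names):
    """Return the distinct version strings found in Tika jar file names."""
    versions = set()
    for name in jar_names:
        match = TIKA_JAR.match(name)
        if match:
            versions.add(match.group(1))
    # Plain string sort; good enough to spot which versions are present.
    return sorted(versions)


def installed_tika_versions(search_home):
    """Scan $SEARCH_HOME/lib/java/all for Tika jars and report their versions."""
    lib_dir = Path(search_home) / "lib" / "java" / "all"
    return tika_versions(path.name for path in lib_dir.glob("tika*.jar"))
```

Calling installed_tika_versions with the value of $SEARCH_HOME should return a single version on a normal install; more than one suggests leftover jars worth investigating.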

Add additional types to web collections

Step 1. crawler reject files list

Ensure the file type's extension is not present in the crawler.reject_files list.

The default value in collection.cfg is:

crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip
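
A minimal sketch of this check, assuming the option value is a plain comma-separated list as shown above. The helper names and the svg example are hypothetical. The comparison is case-sensitive because the default list contains an upper-case Z entry.

```python
def split_list(value):
    """Split a comma-separated collection.cfg value into trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def remove_extension(value, extension):
    """Return the list value without the given extension (case-sensitive)."""
    return ",".join(item for item in split_list(value) if item != extension)


# Shortened reject list for illustration; "svg" is a hypothetical target type.
reject_files = "gif,ico,jar,png,svg,swf,zip"
print("crawler.reject_files=" + remove_extension(reject_files, "svg"))
```

The printed line can then be set as the collection's crawler.reject_files value.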

Step 2. parser mime types

If you want links to be extracted from the document (for crawl purposes), ensure the MIME type is listed in the crawler.parser.mimeTypes list. Note: only text documents should be listed here.

The default value in collection.cfg is:

crawler.parser.mimeTypes=text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml

Step 3. non html files list

Add the file extension of the new file type to the crawler.non_html list.

The default value in collection.cfg is:

crawler.non_html=pdf,doc,ps,ppt,xls,rtf
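
The edit can be sketched as appending to the comma-separated value without creating a duplicate. The function name is hypothetical, and epub is used only as an example of a type that already appears in the default filter.tika.types list.

```python
def add_extension(value, extension):
    """Append an extension to a comma-separated list value if it is missing."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    if extension not in items:
        items.append(extension)
    return ",".join(items)


# Default value from above, extended with an example type.
print("crawler.non_html=" + add_extension("pdf,doc,ps,ppt,xls,rtf", "epub"))
```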

Step 4. Tika processed file types

Check that the file extension is listed in the filter.tika.types list.

The default value in collection.cfg is:

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm

Step 5. collection update

Update the collection by running a full crawl.
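
Before starting the crawl, steps 1, 3 and 4 can be sanity-checked together. The sketch below is an illustration only: it assumes you pass in the effective option values (the defaults listed above apply when a key is absent from collection.cfg), and the function names are hypothetical.

```python
def as_set(value):
    """Turn a comma-separated option value into a set of trimmed items."""
    return {item.strip() for item in value.split(",") if item.strip()}


def web_collection_issues(config, extension):
    """List the steps still outstanding for indexing `extension` in a web collection."""
    issues = []
    if extension in as_set(config.get("crawler.reject_files", "")):
        issues.append("remove from crawler.reject_files")
    if extension not in as_set(config.get("crawler.non_html", "")):
        issues.append("add to crawler.non_html")
    if extension not in as_set(config.get("filter.tika.types", "")):
        issues.append("add to filter.tika.types")
    return issues


# Shortened example values; with these, only crawler.non_html needs changing.
config = {
    "crawler.reject_files": "gif,zip",
    "crawler.non_html": "pdf,doc,ps,ppt,xls,rtf",
    "filter.tika.types": "doc,pdf,epub",
}
print(web_collection_issues(config, "epub"))
```

An empty list means the three options are consistent for that extension. The optional crawler.parser.mimeTypes step is not covered because it depends on the MIME type rather than the extension.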

Add additional types to filecopy collections

Step 1. filecopier file types list

Add the file extension of the new file type to the filecopy.filetypes list.

The default value in collection.cfg is:

filecopy.filetypes=doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx
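
If you script the change rather than editing the file by hand, one possible approach is to rewrite the matching key=value line in the configuration text. This sketch assumes collection.cfg holds one key=value pair per line; the function name and the odt example are hypothetical.

```python
def set_option(cfg_text, key, value):
    """Replace the `key=...` line in collection.cfg text, or append it if absent."""
    lines = cfg_text.splitlines()
    new_line = f"{key}={value}"
    for index, line in enumerate(lines):
        # Compare only the text before the first "=" so values are ignored.
        if line.split("=", 1)[0].strip() == key:
            lines[index] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


cfg_text = "filecopy.filetypes=doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx\n"
print(set_option(cfg_text, "filecopy.filetypes",
                 "doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx,odt"), end="")
```

Keep a backup of the original file before writing the result back.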

Step 2. Tika processed file types

Check that the file extension is listed in the filter.tika.types list.

The default value in collection.cfg is:

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm

Step 3. collection update

Update the collection.

Add additional types to trimpush collections

Step 1. trim extracted file types list

Add the file extension of the new file type to the trim.extracted_file_types list.

The default value in collection.cfg is:

trim.extracted_file_types=*,doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,txt,htm,html,jpg,gif,tif,vmbx

Step 2. Tika processed file types

Check that the file extension is listed in the filter.tika.types list.

The default value in collection.cfg is:

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm

Step 3. collection update

Update the collection.

Add additional types to push collections

Step 1. Tika processed file types

Check that the file extension is listed in the filter.tika.types list.

The default value in collection.cfg is:

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm

Step 2. Push files with filter classes

Call the push API by issuing a PUT request to /v1/collections/{collection}/documents and specify the optional filters parameter to include tika in the supplied filter chain.
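
The request URL can be assembled as in the sketch below. Only the path and the filters parameter come from this article. The base URL, the key parameter used to identify the document, and the exact format of the filters value are assumptions; check them against the push API documentation for your Funnelback version. The example passes TikaFilterProvider, the class name that appears in the default filter.classes value.

```python
from urllib.parse import quote, urlencode


def push_document_url(base_url, collection, key, filters):
    """Build the URL for a PUT to /v1/collections/{collection}/documents.

    `key` (the document identifier) is an assumed parameter name;
    `filters` is passed through exactly as supplied by the caller.
    """
    query = urlencode({"key": key, "filters": filters})
    path = f"/v1/collections/{quote(collection, safe='')}/documents"
    return f"{base_url}{path}?{query}"


# Hypothetical host, collection and document key.
print(push_document_url(
    "https://search.example.com/push-api",
    "my-push-collection",
    "http://example.com/docs/report.epub",
    "TikaFilterProvider",
))
```

The document content is then sent as the body of the PUT request to this URL, using whatever HTTP client and authentication your environment requires.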

Add an additional filetype using an external filter

Warning

The use of external filters is generally discouraged: a separate system process is run for each document being filtered, which significantly reduces performance.

Step 1. Install binaries

Ensure any extra binaries are installed onto the Funnelback server and made executable by the search user (or relevant Windows user account used to run updates).

Step 2. Executables configuration

Add any new binaries to executables.cfg and create a textify.cfg containing extension-to-command mappings.

Step 3. Filter chain

Ensure that ExternalFilterProvider is included in the filter chain for the collection.

The default value in collection.cfg is:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
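
A quick way to confirm the filter is present is to split the filter.classes value into class names and look for ExternalFilterProvider. The sketch assumes that both the comma and the colon act as separators between class names, as the default value above suggests; the function names are hypothetical.

```python
import re


def filter_chain_names(filter_classes):
    """Split a filter.classes value on "," and ":" into individual class names."""
    return [name.strip() for name in re.split(r"[,:]", filter_classes) if name.strip()]


def has_filter(filter_classes, name):
    """Report whether the named filter class appears anywhere in the chain."""
    return name in filter_chain_names(filter_classes)


default_chain = (
    "CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider"
    ":JSoupProcessingFilterProvider:DocumentFixerFilterProvider"
)
print(has_filter(default_chain, "ExternalFilterProvider"))
```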

Step 4. Parser and non html file types

Ensure that the file type is added to the accepted file types for the collection using the Tika instructions above (web: crawler.non_html and optionally crawler.parser.mimeTypes; filecopy: filecopy.filetypes; TRIM/HP RM: trim.extracted_file_types).

Step 5. Tika processed file types

If the external filter is overriding Tika for this file type, ensure that the file extension is removed from filter.tika.types.

Step 6. collection update

Update the collection.
