Include binary documents in the search index

By default, Funnelback removes from the index any binary document that it is unable to filter.  This is sometimes undesirable, because you may want the document's URL to appear in search results even if no useful text can be extracted.

Achieving this is possible, although a little complicated.

The first step is to ensure that the crawler stores the documents.  Once they are stored, some changes to the indexing process are required.

  1. For a web collection, ensure that the file extensions are listed in the crawler.non_html collection.cfg option
    e.g.
    crawler.non_html=doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm,zip
  2. Remove the extension from the crawler.reject_files collection.cfg option
    e.g. (zip has been removed from this list)
    crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,wav,wmv,wrl,xpm
  3. Prevent Funnelback from trying to filter these documents by adding their MIME type to the filter.ignore.mimeTypes collection.cfg option
    e.g.
    filter.ignore.mimeTypes=application/zip
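
The first two settings are easy to get out of step, so a quick check before crawling may help. The following is a minimal sketch, not a Funnelback tool: the function names cfg_values and check_extension are hypothetical, and it assumes each option sits on a single key=value line in collection.cfg.

```shell
#!/bin/sh
# Hypothetical helper, not part of Funnelback.
# Assumes each option is a single "key=comma,separated,values" line.

# Print the values of one option, one per line (last definition wins).
cfg_values() {
    # $1 = path to collection.cfg, $2 = option name
    awk -F= -v k="$2" '$1 == k { sub(/^[^=]*=/, ""); v = $0 } END { print v }' "$1" \
        | tr ',' '\n'
}

# Report whether an extension is ready to be stored by the crawler.
# Returns 0 when it is listed in crawler.non_html and absent from
# crawler.reject_files, 1 otherwise.
check_extension() {
    cfg=$1
    ext=$2
    status=0
    if ! cfg_values "$cfg" crawler.non_html | grep -qxF "$ext"; then
        echo "$ext is missing from crawler.non_html"
        status=1
    fi
    if cfg_values "$cfg" crawler.reject_files | grep -qxF "$ext"; then
        echo "$ext is still listed in crawler.reject_files"
        status=1
    fi
    return $status
}
```

For example, check_extension $SEARCH_HOME/conf/$COLLECTION_NAME/collection.cfg zip should print nothing and return 0 once steps 1 and 2 are in place for zip files.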

You should then be able to crawl the files; they should be listed in the collection's stored.log.

Next, add the -ibd indexer option, which tells the indexer to include binary documents in the index.  However, when the index is built the indexer sets a flag on each of these documents that prevents it from being displayed, so this flag needs to be removed.

To do this, first generate a list of the URLs whose flag should be removed.  If you want all binary documents in the index to be visible, you can run something like:

$SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index -show > $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt

The following command then removes the binary document flag from the index:

$SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt -bits 17f AND
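
The 17f mask is easier to read in binary. Assuming the value is hexadecimal and is ANDed with each document's flag bits (an inference from the command itself, not something confirmed here), 0x17f is 1 0111 1111, so the operation clears bit 0x80 while leaving the seven lower bits and bit 0x100 as they were. Any bits above 0x100 would be cleared as well. The shell arithmetic below illustrates this on made-up flag values; apply_mask is a hypothetical helper and the inputs are not real padre flags.

```shell
#!/bin/sh
# Illustration only: the input values are made up, not real padre flags.
# apply_mask prints (in hex) the result of ANDing a hex value with 0x17f.
apply_mask() {
    printf '%x\n' $(( 0x$1 & 0x17f ))
}

apply_mask 80    # only bit 0x80 set           -> prints 0
apply_mask ff    # bit 0x80 plus lower bits    -> prints 7f
apply_mask 180   # bit 0x80 plus bit 0x100     -> prints 100
```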

This can be done automatically by adding a post_index_command to your collection.cfg:

 post_index_command=$SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index -show > $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt && $SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt  -bits 17f AND
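
If the one-line version becomes hard to maintain, the same two steps could be kept in a small script instead. This is only a sketch: the function and script name unflag_binary are hypothetical, the collection name and view are passed as arguments, and SEARCH_HOME is assumed to be set in the environment. It uses only the padre-fl invocations shown above.

```shell
#!/bin/sh
# Hypothetical wrapper around the two padre-fl steps described above.
# Usage: unflag_binary.sh <collection> <view>
# Assumes SEARCH_HOME is set in the environment.

unflag_binary() {
    collection=$1
    view=$2
    padre_fl="$SEARCH_HOME/bin/padre-fl"
    index_stem="$SEARCH_HOME/data/$collection/$view/idx/index"
    url_list="$SEARCH_HOME/conf/$collection/binaryurls.txt"

    # Step 1: write the list of URLs in the index to binaryurls.txt.
    # Stop here if the listing fails, so a stale list is never applied.
    "$padre_fl" "$index_stem" -show > "$url_list" || return 1

    # Step 2: remove the binary document flag for every URL in the list.
    "$padre_fl" "$index_stem" "$url_list" -bits 17f AND
}

# Run only when called with both arguments.
if [ "$#" -eq 2 ]; then
    unflag_binary "$1" "$2"
fi
```

If the script were saved as, say, $SEARCH_HOME/conf/$COLLECTION_NAME/unflag_binary.sh and made executable, the collection.cfg entry might shrink to post_index_command=$SEARCH_HOME/conf/$COLLECTION_NAME/unflag_binary.sh $COLLECTION_NAME $CURRENT_VIEW, assuming those variables are expanded in the same way as in the one-line version.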