Customise undesirable text


Background

The undesirable words report in content auditor identifies pages (URLs) that contain words considered undesirable. By default this covers common misspellings, but the list can be augmented with organisation-specific words, such as words to avoid under a company style policy, or industry-specific terms.

Funnelback uses Wikipedia’s common misspellings list to identify undesirable words. This list can be replaced or augmented with custom lists of terms.

Customisation of undesirable text requires a full update of the collection to apply the changes.

Process

  1. From the administration interface, switch to the desired collection and open the file manager.
  2. Create a new file by selecting undesirable-text.*.cfg from the create menu that appears at the bottom of the config section.
  3. Set the filename to undesirable-text.additional.cfg by editing the text field above the main content editor. Add the list of undesirable terms, one per line, then save the file. This configures content auditor to identify pages that contain these words.
  4. Edit collection.cfg, add the following line, then save the file:

    filter.jsoup.undesirable_text-source.additional=$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.additional.cfg
    
  5. Run a full update of the collection. Note: a full update is required. An incremental update is not sufficient because filter changes are not applied to content that is not downloaded again.

  6. After the update completes, return to the content auditor report for the collection. Occurrences of the words added to the custom undesirable text file now appear among the words listed as undesirable text. Clicking one of the terms filters the report to only the pages containing that word.
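
As an illustration of step 3, a minimal undesirable-text.additional.cfg might look like the sketch below. The four terms are hypothetical examples chosen for illustration and are not taken from any shipped list. The sketch contains no comment lines, because this article does not state whether the file supports them.

    synergy
    leverage
    utilise
    irregardless

With the collection.cfg line from step 4 in place and a full update completed, these words would be expected to appear alongside the default misspellings in the undesirable text report.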