Adding limited wildcard support to DAAT mode

Background

Funnelback does not generally permit wildcards in search queries because they cause a large performance hit.

Funnelback has two modes of processing a query:

  • Document at a time (DAAT): processes matches by examining each document and deciding if the document is relevant for search results. This is the default mode of processing queries.
  • Term at a time (TAAT): processes matches by examining each term then checking each document for relevance. This was the default mode of processing queries until Funnelback 10.

Document at a time is much more efficient in processing queries, especially over large datasets - however it does not support any form of wildcards. Term at a time mode supported wildcards (via a truncation operator), which slowed down the search dramatically but did allow wildcard matching.
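
The difference between the two modes can be illustrated with a toy inverted index. This is only a sketch of the general idea, not Funnelback's implementation: the index contents, the closure names and the "one point per matching term" scoring are all invented for illustration.

```groovy
// Toy inverted index: term -> list of document ids (illustrative data only)
def index = [
    commerce: [1, 3],
    common  : [2, 3],
    smith   : [3, 4]
]

// Term at a time: walk one posting list after another,
// accumulating a partial score for every document seen.
def termAtATime = { List<String> terms ->
    def scores = [:]
    terms.each { term ->
        (index[term] ?: []).each { docId ->
            scores[docId] = (scores[docId] ?: 0) + 1
        }
    }
    scores
}

// Document at a time: visit candidate documents in id order and
// score each one completely against all terms before moving on.
def docAtATime = { List<String> terms ->
    def postings = terms.collect { index[it] ?: [] }
    def scores = [:]
    postings.flatten().unique().sort().each { docId ->
        scores[docId] = postings.count { docId in it }
    }
    scores
}

println termAtATime(['commerce', 'smith'])
println docAtATime(['commerce', 'smith'])
```

Both closures arrive at the same scores; they differ only in the order in which the work is done.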

Sometimes it is desirable to provide wildcard support - especially if a database-style query is being performed. This article shows how to configure a collection to provide limited wildcard support when using Funnelback's document at a time mode.

Process

The following process adds limited truncation support, allowing the asterisk operator to be used on the right hand side (only) of words in a query. The code uses the auto-completion service to fetch suggestions for each starred term, 5 by default. A configuration option adjusts the number of auto-completions requested from the service for each starred term.

This method provides an efficient way of supporting wildcards - however the expanded query won't return all the matching results that a full expansion of the wildcard would. There is a fine balance between performance and completeness of the search result set, and the goal of a web search engine such as Funnelback has always been to return a set of relevant results. This set isn't necessarily complete, as shortcuts are taken to determine relevant results quickly.
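
The trade-off can be seen in a small sketch. The suggest closure below is a hypothetical stand-in for the auto-completion service and the vocabulary is made up; it simply returns prefix matches in list order, whereas the real service decides which suggestions to return.

```groovy
// Hypothetical stand-in for the auto-completion service: returns up to
// `limit` vocabulary words that start with the given prefix.
def suggest = { String prefix, List<String> vocabulary, int limit ->
    vocabulary.findAll { it.startsWith(prefix) }.take(limit)
}

// Made-up vocabulary for illustration
def vocabulary = ['commerce', 'commercial', 'common', 'community',
                  'computing', 'company', 'smith']

// Capped expansion, as used by the approach in this article (default of 5)
def capped = suggest('com', vocabulary, 5)

// Full expansion of the wildcard, for comparison
def full = suggest('com', vocabulary, Integer.MAX_VALUE)

println "capped expansion: ${capped}"
println "terms missed by the capped expansion: ${full - capped}"
```

Documents that match only the missed terms will not appear in the results for com*, which is the incompleteness described above.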

Step 1. Add wildcard support and expand the query

The following script adds support for query terms that use a trailing wildcard operator, implemented as an asterisk.

Note: this only works with standard query terms. Using a wildcard on a term that contains metadata or other query language operators is not supported.
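
The script decides whether to expand a term with the pattern \w+\*$ applied to the whole term. The following sketch applies that same pattern to a few sample terms to show which ones qualify; the isExpandable name and the sample terms are only for illustration.

```groovy
// Same pattern the hook script uses to detect a trailing wildcard.
// ==~ requires the whole term to match: word characters followed by one *.
def isExpandable = { String term -> term ==~ /\w+\*$/ }

['Dan*', 'Smith', '*an', 'D*n', 'author:Dan*', '"Dan*"'].each {
    println "${it} -> ${isExpandable(it)}"
}
```

Only Dan* matches. Terms with a leading or embedded asterisk, a metadata prefix or surrounding quotes do not match the pattern, so the script leaves them in the query unchanged.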

Create a hook_pre_process.groovy script for the collection with the following contents:

hook_pre_process.groovy
// Imports for the classes used below.
// NOTE: these package names are assumptions; confirm them against the API documentation for your Funnelback version.
import com.funnelback.common.Environment
import com.funnelback.dataapi.connector.padre.PadreConnector
import com.funnelback.dataapi.connector.padre.suggest.Suggestion

def q = transaction.question
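// Compare against "true" explicitly: a bare truthiness check would also treat the string "false" as enabled
if ("true".equalsIgnoreCase(q.collection.configuration.value(["partial_query_enabled"]))) {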
    // Convert a partial query into a set of query terms
    // Maximum number of query terms to expand each partial term to.
    // Read from the collection.cfg partial_query_expansion_index parameter; defaults to 5.
    // e.g. com* might expand to [commerce commercial common computing]
    def partial_query_expansion_index = 5
    def configuredExpansion = q.collection.configuration.value(["partial_query_expansion_index"])
    if (configuredExpansion != null && configuredExpansion.isInteger()) {
        // the configuration value is a string, so convert it to an integer before use
        partial_query_expansion_index = configuredExpansion.toInteger()
    }
    if (q.query != null) {
        // explode the query and expand each item that ends with a *
        def terms = q.query.tokenize(" ");
        terms.each {
            def term = it
            if (term ==~ /\w+\*$/) {
                //remove term from q.query
                terms -= term
                def termclean = term.replaceAll(~/\*$/,"")
                // Read $SEARCH_HOME
                def sH = Environment.getValidSearchHome().getCanonicalPath();
                File searchHome = new File(sH)
//              File searchHome = new File("/opt/funnelback")
                File indexStem = new File(q.collection.configuration.value(["collection_root"]) + File.separator + "live" + File.separator + "idx","index")
                // NOTE: CONSTRUCTOR HAS CHANGED post v14.2 and requires searchHome as the first param
                List<Suggestion> suggestions = new PadreConnector(searchHome,indexStem)
                  .suggest(termclean)
                  .suggestionCount(partial_query_expansion_index)
                  .fetch();
                // build the expanded query from the list of suggestions
                def expanded_query = ''
                suggestions.each {
                    expanded_query += '"'+it.key+'" '
                }
                // set the query to the expanded set of query terms ORed together
                if (expanded_query != "") {
                    if (q.rawInputParameters["s"] == null) {
                        q.rawInputParameters["s"] = ["["+expanded_query+"]"]
                    }
                    else {
                        q.rawInputParameters["s"][0] += " ["+expanded_query+"]"
                    }
                }
            }
        }
        // reconstruct query.
        q.query = terms.join(" ");
    }
}

Step 2. Enable wildcard support and configure the level of expansion

Add the following to the collection's collection.cfg:

partial_query_enabled=true
# optionally add the following line to indicate a maximum number of terms to expand a starred term to.  The default is 5
# e.g. expand each starred term with 3 expansions
partial_query_expansion_index=3

Queries such as Dan* Smith should now be accepted. The expanded queries can be seen by viewing the JSON or XML output and looking at the query, queryAsProcessed, queryRaw, querySystemRaw and queryCleaned values in the response packet. Expansions are injected into the querySystemRaw element.
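
As a rough guide to what to expect, the sketch below mimics the rewrite that the hook script performs, using plain Groovy collections instead of the Funnelback transaction. The suggestions for Dan are hypothetical, and the rewrite closure is a simplified stand-in for the script rather than the script itself.

```groovy
// Hypothetical suggestions standing in for the auto-completion service
def suggestionsFor = [Dan: ['dan', 'daniel', 'danielle']]

// Simplified stand-in for the hook script: starred terms are removed from
// the user query and their expansions become bracketed sets of quoted
// terms in the system query (the s parameter).
def rewrite = { String query ->
    def kept = []
    def systemTerms = []
    query.tokenize(' ').each { term ->
        if (term ==~ /\w+\*$/) {
            def stem = term[0..-2]   // drop the trailing asterisk
            def expansions = suggestionsFor[stem] ?: []
            if (expansions) {
                systemTerms << '[' + expansions.collect { '"' + it + '"' }.join(' ') + ']'
            }
        } else {
            kept << term
        }
    }
    [query: kept.join(' '), s: systemTerms.join(' ')]
}

def result = rewrite('Dan* Smith')
println "query: ${result.query}"
println "s:     ${result.s}"
```

Under these assumptions the user query is reduced to Smith and the bracketed set of expansions is what would be carried in the system query.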
