Debug a crawl for missing documents


Background

This article outlines the general process for determining why a document is missing from a web search.

This is tailored to web collections, but the process is similar regardless of collection type. For non-web collections, check the collection type's gather logs in place of the crawl logs described in the "Check the crawl logs" section below.

Process

Check the search results

  1. Search for the URL using the v metadata field. e.g. to look for http://site.com/path/to/file.pdf, run a search for v:path/to/file.pdf (the v metadata field holds the path component of the document's URL).
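
If it helps, the query term can be built from the full URL by stripping the scheme and host. A minimal shell sketch, using the article's example URL (adjust for the document you are looking for):

    # Strip the scheme and host, leaving the path component,
    # then prefix it with the v metadata class to form the query term.
    URL="http://site.com/path/to/file.pdf"
    echo "v:$(echo "$URL" | sed -E 's|^https?://[^/]+/||')"
    # => v:path/to/file.pdf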

Check the crawl logs

If the document can't be located in the search results:

  1. Check stored.log to see if the URL is listed as being stored (example grep commands for these log checks are sketched after this list). If it is stored, skip ahead to the "Check the index logs" section below.

  2. Check url_errors.log to see if an error was recorded against the URL when it was accessed. If it's listed here, the error will need to be addressed. If it was a timeout, the crawler request timeout can be increased; otherwise it could be a transient issue that self-corrects on the next crawl.

  3. Grep crawl.log.X for the URL and investigate any messages recorded against it (for example, it may have been rejected due to an include/exclude rule, or because the file is too large).

  4. Check whether the site has a robots.txt, and whether any rules prevent the crawler from accessing areas of the site that may include the URL of interest.

  5. Check crawler.central.log (or crawler.inline-filter.log for older versions) to see if an error was raised when attempting to filter the document. Binary documents in particular can fail to convert to text for many reasons, and if this occurs the URL is likely to be missing from the index.

  6. If you still can't find the URL in any of these logs then it's likely that the crawler did not see the URL during the crawl. This could be caused by:

    • a parent page that links to the URL returning an error (in which case the link to the document of interest was never extracted)
    • another issue with the parent page
    • a domain-level alias causing Funnelback to request the URL on a different domain, which can sometimes result in an error
    • the crawl timing out before it was able to request the URL
    • the SimpleRevisitPolicy being enabled, which means infrequently changing URLs are not fetched on every crawl.
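
The log checks above can be run quickly with grep. A minimal sketch follows; the log directory and collection name shown are assumptions (log locations vary with the Funnelback version and collection configuration), so adjust the paths to match your installation. The URL is the article's example.

    # Assumed log directory - substitute your collection's actual log location.
    LOGDIR=/opt/funnelback/data/your-collection/offline/log
    URL="site.com/path/to/file.pdf"

    grep "$URL" "$LOGDIR/stored.log"            # was the document stored?
    grep "$URL" "$LOGDIR/url_errors.log"        # was a fetch error recorded?
    grep "$URL" "$LOGDIR"/crawl.log*            # include/exclude or size messages?
    grep "$URL" "$LOGDIR/crawler.central.log"   # did filtering raise an error?

    # Check the site's robots.txt for rules that might block the crawler.
    curl -s http://site.com/robots.txt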

Check the index logs

  1. If you've determined that the URL has been stored, check Step-Index.log to see if an error was recorded at index time. Grep Step-Index.log for the URL (an example grep is sketched after this list).

    • If it's marked as BINARY then the document failed to filter correctly and a binary file remained after filtering. Binary files are flagged in the index and suppressed at search time.
    • If it's excluded due to a pattern, check that the URL doesn't contain the install path (usually /opt/funnelback) - if the install path appears in the URL, the URL is skipped by the indexer unless you set the -check_url_exclusion=false indexer option.
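
A minimal grep sketch for the index log is below; as above, the log directory and collection name are assumptions and should be adjusted to match your installation.

    # Assumed log directory - substitute your collection's actual log location.
    LOGDIR=/opt/funnelback/data/your-collection/offline/log
    URL="site.com/path/to/file.pdf"

    grep "$URL" "$LOGDIR/Step-Index.log"
    # Look for the BINARY marker (the document failed filtering) or messages
    # about the URL being excluded by a pattern (e.g. containing the install path).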