Debug a crawl for missing documents


Background

This article outlines the rough process to follow to determine why a document might be missing from a web search.

This is tailored for web collections, but the process is similar regardless of collection type. For non-web collections, check the collection type's gather logs instead of the crawl logs described in the "Check the crawl logs" section below.

Process

Check the search results

  1. Search for the URL using the v metadata field, e.g. to look for http://site.com/path/to/file.pdf run a search for v:path/to/file.pdf (the v metadata field holds the path component of the document's URL).
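The query for this step can be derived mechanically from the URL. The following is a minimal Python sketch, assuming the query form shown above (the v field followed by the URL path without its leading slash). The helper name path_query is hypothetical and is not part of Funnelback.

```python
from urllib.parse import urlsplit


def path_query(url: str, field: str = "v") -> str:
    """Build a metadata query for the path component of a URL.

    Assumption: the metadata field holds the path without a leading slash,
    as in the example in this article.
    """
    path = urlsplit(url).path.lstrip("/")
    return f"{field}:{path}"


print(path_query("http://site.com/path/to/file.pdf"))  # prints v:path/to/file.pdf
```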

Check the site's robots.txt

Visit the website where the file is hosted and check robots.txt at the root of the site, e.g. for http://site.com/path/to/file.pdf check the robots.txt at http://site.com/robots.txt.

If you get a not found message, skip to the next section. If a robots.txt file is returned, inspect it for any rules that instruct Funnelback to reject the file.
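For a long robots.txt it can be easier to test the rules than to read them. The sketch below uses Python's standard urllib.robotparser to evaluate a copy of the robots.txt content against a URL. The user agent string "ExampleCrawler" and the sample rules are placeholders; substitute the user agent your crawler actually sends. Python's parser may also interpret edge cases differently from the crawler, so treat the result as a hint rather than a definitive answer.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str) -> bool:
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only; paste the real robots.txt content here.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

# "ExampleCrawler" is a placeholder user agent, not Funnelback's real one.
print(is_allowed(SAMPLE_ROBOTS, "http://site.com/path/to/file.pdf", "ExampleCrawler"))
print(is_allowed(SAMPLE_ROBOTS, "http://site.com/private/file.pdf", "ExampleCrawler"))
```

With the sample rules, the first URL is allowed and the second is rejected.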

Check the crawl logs

If the document can't be located in the search results:

  1. Check stored.log to see if it's listed as being stored. If it's stored then jump to the step about checking the indexer logs.

  2. Check url_errors.log to see if it's listed as having an error recorded against it when accessed. If it's listed here then the error will need to be addressed. If it was a timeout then the crawler request timeout can be increased; otherwise it could just be a transient issue that self-corrects on the next crawl. If the file 'exceeded the maximum download size' then the crawler.max_download_size setting can be increased.

  3. Grep crawl.log.X for the URL and investigate any messages recorded against the URL (it could have been rejected due to an include/exclude rule, or because the file is too large).

  4. Check the crawler.central.log (or crawler.inline-filter.log for older versions) to see if an error was raised when attempting to filter the document. Binary documents in particular can fail to convert to text for many reasons, and if this occurs then the URL is likely to be missing from the index.

  5. If you still can't find the URL in any of these logs then it's likely that the crawler did not see the URL during the crawl. This could be caused by:

    • the URL matches an exclude pattern in collection.cfg (remember that standard patterns are substring matched against the URL so a pattern like /m will exclude http://example.com/media)
    • the file is unlinked on the site where the file is hosted.
    • the file is only linked from pages that are excluded either via robots.txt rules, or by crawler exclude patterns. Remember that the crawler must be able to find a path of links to follow from one of the seed URLs to the final page and robots.txt and exclude patterns can prevent the crawler from reaching the final page.
    • a parent page that links to the URL resulted in an error (in which case the URL of the document of interest wasn't extracted).
    • a parent page that links to the URL contains robots meta tags or link properties (such as nofollow) that stop the crawler from following its links.
    • a domain-level alias causes Funnelback to request the page on a different domain, which can sometimes result in an error.
    • the crawl timed out before it was able to request the URL.
    • the SimpleRevisitPolicy is set, which means that infrequently changing URLs are not checked on each crawl.
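The log checks in steps 1 to 4 and the exclude pattern check in step 5 can be scripted. The following is a rough Python sketch, not a Funnelback tool. It assumes the crawl logs sit together in one directory that you pass in and that the file names follow the ones mentioned above (the .log files plus crawl.log.X). Both function names are hypothetical. The pattern check only models plain substring patterns as described above, not any other pattern syntax.

```python
from pathlib import Path


def find_url_in_logs(log_dir, url, patterns=("*.log", "crawl.log.*")):
    """Return (file name, line number, line) for every log line that mentions url.

    Assumption: all relevant logs live directly inside log_dir.
    """
    hits = []
    seen = set()
    for pattern in patterns:
        for log_file in sorted(Path(log_dir).glob(pattern)):
            if log_file in seen or not log_file.is_file():
                continue
            seen.add(log_file)
            # errors="replace" keeps the scan going if a log has odd bytes.
            with log_file.open(errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if url in line:
                        hits.append((log_file.name, number, line.rstrip("\n")))
    return hits


def matching_exclude_patterns(url, exclude_patterns):
    """Return the plain exclude patterns that occur anywhere in url."""
    return [pattern for pattern in exclude_patterns if pattern in url]


# The pattern /m excludes /media because it is matched as a substring.
print(matching_exclude_patterns("http://example.com/media", ["/m", "/private"]))  # prints ['/m']
```

If find_url_in_logs returns no hits at all, the crawler probably never saw the URL, and the causes listed in step 5 are the next things to rule out.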

Check the index logs

  1. If you've determined that the URL has been stored then check the Step-Index.log to see if an error is recorded at index time. Grep the Step-Index.log for the URL.

    • If it's marked as BINARY then the document failed to filter correctly and a binary file remained after filtering. Binary files are flagged in the index and suppressed at search time.
    • If it's marked as a DUPLICATE then it's possible that a canonical URL is being assigned to the page that has already been assigned to another page in the index, or that the page content once extracted is identical to other pages in the index (this can happen if an error page is returned with a 200 status code).
    • If it's excluded due to a pattern, check that the URL doesn't contain the install path (usually /opt/funnelback). If the install path is in the URL then it's skipped by the indexer unless you set the -check_url_exclusion=false indexer option.
  2. If you can't find it in the index log but it was definitely stored then it's possible that a canonical URL is set, resulting in the document being indexed under a different URL. Check the source page to see if a canonical URL is defined.
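Checking a page source for a canonical URL can be done by eye, but a small script helps when there are many pages to check. This Python sketch uses the standard html.parser module to pull the href from the first link element whose rel attribute includes canonical. The class and function names are hypothetical, the sample HTML is illustrative, and the sketch does not cover canonical URLs declared in other ways (for example in HTTP headers).

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in html, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Illustrative page source; in practice pass in the saved source of the page.
SAMPLE_PAGE = """
<html><head>
<title>Example</title>
<link rel="canonical" href="http://site.com/other/page.html">
</head><body></body></html>
"""

print(find_canonical(SAMPLE_PAGE))  # prints http://site.com/other/page.html
```

If the value returned differs from the URL you were looking for, search the index log for the canonical URL instead.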
