Debug a crawl for missing documents
Background
This article outlines the rough process to follow to determine why a document might be missing from a web search.
This is tailored for web collections, but the process is similar regardless of collection type. For non-web collections, check the collection type's gather logs in place of the crawl logs covered in the Check the crawl logs section below.
Process
Check the search results
- Search for the URL using the v metadata field, e.g. to look for http://site.com/path/to/file.pdf run a search for v:path/to/file.pdf (the v metadata field holds the path component of the document's URL).
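For instance, assuming the standard HTML search endpoint and a placeholder hostname and collection name (both vary by installation), the query could be run from the command line:

```
# Placeholder host and collection - substitute your own.
curl 'https://search.example.com/s/search.html?collection=example-web&query=v:path/to/file.pdf'
```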
Check the site's robots.txt
Visit the website where the file is hosted and check robots.txt at the root of the site, e.g. for http://site.com/path/to/file.pdf check http://site.com/robots.txt.

If you get a not found message, skip to the next section. If a robots.txt is returned, inspect the file and check whether any rules instruct Funnelback to reject the file.
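A quick command-line check (a sketch: the crawler's user agent is configurable, so matching on 'funnelback' below is an assumption; rules under User-agent: * also apply):

```
# Fetch robots.txt and show rule blocks for Funnelback or for all user agents.
curl -s http://site.com/robots.txt | grep -i -A 10 -E 'user-agent: *(funnelback|\*)'
```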
Check the crawl logs
If the document can't be located in the search results:
- Check stored.log to see if the URL is listed as being stored. If it's stored then jump ahead to the Check the index logs section below.
- Check url_errors.log to see if an error is recorded against the URL when it was accessed. If it's listed here then the error will need to be addressed. If it was a timeout then the crawler request timeout can be increased; otherwise it could just be a transient issue that self-corrects on the next crawl. If the file 'exceeded the maximum download size' then crawler.max_download_size can be increased.
- Grep crawl.log.X for the URL and investigate any messages recorded against it (it could have been rejected due to an include/exclude rule, or because the file is too large). A quick way to sweep all of these logs at once is sketched after this list.
- Check the crawler.central.log (or crawler.inline-filter.log for older versions) to see if an error was raised when attempting to filter the document. Binary documents in particular can fail to convert to text for many reasons, and if this occurs then the URL is likely to be missing from the index.
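The log sweep mentioned above can be done in one pass. This is a sketch: the install path (/opt/funnelback), the collection name (example-collection) and the offline log directory are assumptions that vary between installations and versions.

```
# Placeholder paths - substitute your own install path and collection name.
LOGS=/opt/funnelback/data/example-collection/offline/log

# List which of the crawl logs mention the missing URL.
grep -l 'path/to/file.pdf' "$LOGS"/stored.log "$LOGS"/url_errors.log \
    "$LOGS"/crawl.log.* "$LOGS"/crawler.central.log
```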
If you still can't find the URL in any of these logs then it's likely that the crawler did not see the URL during the crawl. This could be caused by:
- the URL matches an exclude pattern in collection.cfg (remember that standard patterns are substring matched against the URL, so a pattern like /m will exclude http://example.com/media - see the sketch after this list)
- the file is unlinked on the site where it is hosted
- the file is only linked from pages that are excluded, either via robots.txt rules or by crawler exclude patterns. Remember that the crawler must be able to follow a path of links from one of the seed URLs to the final page, and robots.txt rules and exclude patterns can prevent the crawler from reaching it
- a parent page that links to the URL returned an error (in which case the link to the document of interest wasn't extracted)
- a parent page that links to the URL contains robots meta tags or link attributes (such as rel="nofollow")
- a domain-level alias causing Funnelback to request the page on a different domain, which can sometimes result in an error
- the crawl timed out before it was able to request the URL
- the SimpleRevisitPolicy, if set, which means that infrequently changing URLs are not checked on each crawl
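To illustrate the substring pitfall noted in the first cause above, here is a rough shell equivalent of how a plain pattern matches (a sketch of the matching behaviour only, not Funnelback's implementation):

```
# Plain exclude patterns are substring-matched against the whole URL,
# so a short pattern like '/m' matches far more than intended.
url='http://example.com/media/report.pdf'
pattern='/m'
case "$url" in
  *"$pattern"*) echo "excluded: '$pattern' matched '$url'" ;;
  *)            echo "not excluded" ;;
esac
```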
Check the index logs
If you've determined that the URL has been stored then check the Step-Index.log to see if an error is recorded at index time. Grep the Step-Index.log for the URL (see the example after this list).

- If it's marked as BINARY then the document failed to filter correctly and a binary file remained after filtering. Binary files are flagged in the index and suppressed at search time.
- If it's marked as a DUPLICATE then it's possible that a canonical URL assigned to the page has already been assigned to another page in the index, or that the extracted page content is identical to that of other pages in the index (this can happen if an error page is returned with a 200 status code).
- If it's excluded due to a pattern, check that the URL doesn't contain the install path (usually /opt/funnelback) - if the install path is in the URL then it's skipped by the indexer unless you set the -check_url_exclusion=false indexer option.
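For the grep step above, something like the following works (same placeholder install path and collection name as earlier):

```
# Show the URL and any flag (BINARY, DUPLICATE, etc.) recorded against it.
grep -n 'path/to/file.pdf' /opt/funnelback/data/example-collection/offline/log/Step-Index.log
```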
If you can't find it in the index log but it was definitely stored, then it's possible that a canonical URL is set, resulting in the document being indexed under a different URL. Check the source page to see if a canonical URL is defined.
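A quick command-line check for a canonical URL (placeholder page URL; a sketch that applies to HTML pages):

```
# Look for a canonical link element in the page source.
curl -s http://site.com/path/to/page.html | grep -i 'rel="canonical"'
```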