Crawl WordPress sites

Background

This article shows how to crawl and index a WordPress site.

Crawling and indexing are much easier if the WordPress site uses the Yoast WordPress plugin. However, it is still possible to crawl and index a WordPress site that does not use this plugin.

Crawling with the Yoast plugin

Yoast produces:

  • high-quality sitemap.xml files (readable and parse-able as HTML via XSLT)
  • Twitter and OpenGraph metadata
  • Canonical URLs
  • Breadcrumbs

The following assumes a reasonable robots.txt configuration that references the sitemap:

http://www.example.org/robots.txt
...
Sitemap: http://www.example.org/sitemap.xml
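Before configuring the crawl, it can help to confirm that robots.txt really advertises a sitemap. The following is a minimal sketch in Python, not part of the crawler itself; the function name sitemap_urls and the sample robots.txt body (including its User-agent and Disallow lines) are illustrative assumptions.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs named in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt body, used only for illustration.
sample = """User-agent: *
Disallow: /wp-admin/
Sitemap: http://www.example.org/sitemap.xml
"""

print(sitemap_urls(sample))
```

An empty list would suggest that the sitemap location has to be found some other way before it is used as a start URL.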

The recommended starting point for the crawl is sitemap.xml:

collection.cfg.start.urls
http://example.org/sitemap.xml

To focus the crawler solely on URLs listed in the sitemap (ignoring categories and tags), set:

collection.cfg
crawler.max_link_distance=2
crawler.use_sitemap_xml=true
exclude_patterns=category-sitemap,post_tag-sitemap,xsl
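To preview which child sitemaps survive the exclude_patterns line above, the sketch below reads a sitemap index and drops any entry containing one of the excluded substrings. It assumes that plain exclude_patterns entries behave as substring matches on the URL; the sample index and its file names are hypothetical, and the helper kept_sitemaps is not part of any crawler API.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Same substrings as the exclude_patterns setting
# (assumed here to be matched as plain substrings of the URL).
EXCLUDE = ["category-sitemap", "post_tag-sitemap", "xsl"]


def kept_sitemaps(index_xml: str, exclude: list[str]) -> list[str]:
    """List <loc> URLs in a sitemap index that contain no excluded substring."""
    root = ET.fromstring(index_xml)
    locs = [(el.text or "").strip() for el in root.findall("sm:sitemap/sm:loc", NS)]
    return [url for url in locs if not any(pattern in url for pattern in exclude)]


# Hypothetical sitemap index, used only for illustration.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.org/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>http://example.org/page-sitemap.xml</loc></sitemap>
  <sitemap><loc>http://example.org/category-sitemap.xml</loc></sitemap>
  <sitemap><loc>http://example.org/post_tag-sitemap.xml</loc></sitemap>
</sitemapindex>"""

for url in kept_sitemaps(SAMPLE_INDEX, EXCLUDE):
    print(url)
```

With this sample, only the post and page sitemaps remain, which is the effect the configuration is aiming for.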

Crawling without the Yoast plugin

Without the plugin, straightforward sitemaps or full page listings may not be easily accessible. Most WordPress pages also generate additional URLs for the page's RSS feed and for its comments RSS feed. URLs and feeds are also created on a per-tag and per-category basis. Feeds and similar generated URLs can be excluded from the crawl:

collection.cfg
exclude_patterns=regexp:/feed/|xmlrpc|/archive/|.*/page/\d+/$
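An exclusion regex of this kind is easy to get subtly wrong, so it is worth testing against a few URLs first. The sketch below uses Python's re module with a pattern that reflects the assumed intent (feeds, xmlrpc, archives and paginated listings such as /page/2/). The sample URLs are hypothetical, and the crawler's own regex engine may differ in detail.

```python
import re

# Assumed intent of the exclusion regex (an illustration, not a verified
# crawler setting): feeds, xmlrpc, archives, and paginated listings.
EXCLUDE_RE = re.compile(r"/feed/|xmlrpc|/archive/|/page/\d+/$")


def is_excluded(url: str) -> bool:
    """Return True if any alternative of the regex matches somewhere in the URL."""
    return EXCLUDE_RE.search(url) is not None


# Hypothetical URLs of the kinds described above.
samples = [
    "http://example.org/hello-world/",
    "http://example.org/hello-world/feed/",
    "http://example.org/xmlrpc.php",
    "http://example.org/page/12/",
]
for url in samples:
    print("exclude" if is_excluded(url) else "keep   ", url)
```

An empty alternative (for example, one left by a trailing |) matches every string, so always check that an ordinary page URL is reported as kept.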

The category and tag pages still need to be crawled (stored, filtered and parsed) so that all content linked from them is gathered. They can then be killed from the index after indexing:

kill_partial.cfg
http://example.org/category/
http://example.org/tag/
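To check which crawled URLs these entries would remove, the sketch below treats each kill_partial.cfg line as a URL prefix. That matching rule is an assumption made for illustration, as are the sample URLs and the helper name would_be_killed.

```python
# Entries from kill_partial.cfg, assumed here to match as URL prefixes.
KILL_PARTIAL = [
    "http://example.org/category/",
    "http://example.org/tag/",
]


def would_be_killed(url: str, prefixes: list[str]) -> bool:
    """Return True if the URL starts with any of the given prefixes."""
    return url.startswith(tuple(prefixes))


# Hypothetical crawled URLs, used only for illustration.
crawled = [
    "http://example.org/hello-world/",
    "http://example.org/category/news/",
    "http://example.org/tag/howto/",
]
for url in crawled:
    print("kill" if would_be_killed(url, KILL_PARTIAL) else "keep", url)
```

Under this assumption, the listing pages are dropped while the posts they link to stay in the index.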