Crawl WordPress sites
Background
This article shows how to crawl and index a WordPress site.
Crawling and indexing is much easier if the WordPress site in question uses the Yoast WordPress plugin. However, it is still possible to crawl and index a WordPress site that does not use this plugin.
Crawling with the Yoast plugin
Yoast produces:
- high-quality sitemap.xml files (human-readable, and parseable as HTML via XSLT)
- Twitter and OpenGraph metadata
- Canonical URLs
- Breadcrumbs
Assuming a reasonable robots.txt configuration:
...
Sitemap: http://www.example.org/sitemap.xml
We recommend starting the crawl at the sitemap.xml file:
http://example.org/sitemap.xml
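The sitemap location can also be read from robots.txt rather than typed in by hand. The following is a minimal sketch, assuming Python 3.9 or later and the standard library's urllib.robotparser; the helper name sitemaps_from_robots and the Disallow line in the sample are illustrative assumptions, not part of the original configuration.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the Sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


# Hypothetical robots.txt content; only the Sitemap line comes from the article.
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /wp-admin/
Sitemap: http://www.example.org/sitemap.xml
"""

print(sitemaps_from_robots(EXAMPLE_ROBOTS))
```

Any URL returned this way is a candidate start URL for the crawl.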
To focus the crawler solely on URLs listed in the sitemap (ignoring categories and tags), use:
crawler.max_link_distance=2
crawler.use_sitemap_xml=true
exclude_patterns=category-sitemap,post_tag-sitemap,xsl
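To preview which child sitemaps would survive the exclude patterns above, the sitemap index can be filtered outside the crawler. This is only a sketch: it assumes the exclude patterns behave as plain substring matches, and the child sitemap names in the sample index are illustrative, not taken from a real site.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Same substrings as the exclude_patterns setting above.
EXCLUDED = ("category-sitemap", "post_tag-sitemap", "xsl")

# Hypothetical sitemap index used only for illustration.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.org/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>http://example.org/page-sitemap.xml</loc></sitemap>
  <sitemap><loc>http://example.org/category-sitemap.xml</loc></sitemap>
  <sitemap><loc>http://example.org/post_tag-sitemap.xml</loc></sitemap>
</sitemapindex>"""


def child_sitemaps(index_xml: str) -> list[str]:
    """Return child sitemap URLs that contain none of the excluded substrings."""
    root = ET.fromstring(index_xml)
    locs = [
        el.text.strip()
        for el in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if el.text
    ]
    return [url for url in locs if not any(p in url for p in EXCLUDED)]


print(child_sitemaps(SAMPLE_INDEX))
```

With the sample index, only the post and page sitemaps remain, which matches the intent of ignoring categories and tags.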
Crawling without the Yoast plugin
Without the plugin, straightforward sitemaps or full page listings may not be easily accessible. Most WordPress pages also generate additional URLs for the page's RSS feed and for its comments RSS feed. URLs and feeds are also created on a per-tag and per-category basis. These extra URLs can be excluded with a pattern such as:
exclude_patterns=regexp:/feed/|xmlrpc|/archive/|.*/page/\d+/$
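It can help to try an exclude pattern against a few sample URLs before starting a crawl. The sketch below uses the feed, xmlrpc, archive and pagination rules from the pattern above, with the pagination rule written as `.*/page/\d+/$`. It assumes the crawler applies the expression as a partial match (modelled here with re.search), which may differ from the crawler's actual matching rules; the sample URLs are hypothetical.

```python
import re

# Assumed partial-match semantics; the real crawler may anchor differently.
EXCLUDE = re.compile(r"/feed/|xmlrpc|/archive/|.*/page/\d+/$")


def is_excluded(url: str) -> bool:
    """Return True if the URL matches any alternative in the exclude pattern."""
    return EXCLUDE.search(url) is not None


# Hypothetical URLs used only to exercise each alternative.
for url in (
    "http://example.org/my-post/feed/",
    "http://example.org/xmlrpc.php",
    "http://example.org/category/news/page/2/",
    "http://example.org/my-post/",
):
    print(url, is_excluded(url))
```

Note that an empty alternative (for example a trailing `|`) would match every URL under partial matching, so the pattern should not end with one.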
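The category and tag pages cannot simply be excluded from the crawl: they still need to be stored, filtered and parsed so that all content linked from them is gathered. They should then be removed from the index after indexing (killed post-index):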
http://example.org/category/
http://example.org/tag/
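The split between pages to keep and pages to remove after indexing can be sketched as a simple path-prefix check. This is an illustration only: the function name split_for_index and the sample URLs are assumptions, and the way the resulting list is fed to the indexer depends on the search product and is not shown.

```python
from urllib.parse import urlsplit

# Path prefixes taken from the two URLs listed above.
KILL_PREFIXES = ("/category/", "/tag/")


def split_for_index(urls):
    """Partition crawled URLs into (keep, kill) lists by path prefix."""
    keep, kill = [], []
    for url in urls:
        path = urlsplit(url).path
        # str.startswith accepts a tuple and checks each prefix in turn.
        (kill if path.startswith(KILL_PREFIXES) else keep).append(url)
    return keep, kill


# Hypothetical crawl output.
crawled = [
    "http://example.org/my-post/",
    "http://example.org/category/news/",
    "http://example.org/tag/search/",
    "http://example.org/about/",
]
keep, kill = split_for_index(crawled)
print("keep:", keep)
print("kill:", kill)
```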