Parsing & transforming HTML in custom filters


Parsing and transforming HTML should only be done as a last resort, because it is brittle. Whenever possible, make the required change in the source content instead: a transformation can break whenever the source HTML is updated.

If you do need to do it, it is strongly recommended to use a proper HTML parser, such as Jsoup, when manipulating HTML content. Using regular expressions to parse HTML is not recommended because:

  • It's not reliable, as it depends heavily on the exact HTML syntax.
    • For example, attributes may be wrapped in single or double quotes: <img src='...'> vs. <img src="...">. Writing a regex that accounts for both syntaxes is difficult.
    • Similarly, accounting for attribute order is difficult, e.g. <img src=... alt=...> vs. <img alt=... src=...>.
    • The resulting regular expressions end up very complex and hard to maintain.
  • Complex regular expressions on large HTML pages might time out or take a very long time to process, slowing down (and sometimes blocking) the crawl.
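
The fragility described in the list above can be illustrated with a minimal Java sketch. The class name, the pattern, and the sample markup are hypothetical and chosen only for illustration; the pattern is deliberately naive, assuming double quotes and src as the first attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBrittleness {
    // Hypothetical naive pattern: expects double quotes and src directly after the tag name.
    static final Pattern IMG_SRC = Pattern.compile("<img src=\"([^\"]*)\"");

    // Returns the captured src value, or null when the pattern does not match.
    static String extractSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Matches: double quotes, src first.
        System.out.println(extractSrc("<img src=\"a.png\" alt=\"A\">"));
        // No match: same image, single quotes.
        System.out.println(extractSrc("<img src='a.png' alt='A'>"));
        // No match: same image, attributes reordered.
        System.out.println(extractSrc("<img alt=\"A\" src=\"a.png\">"));
    }
}
```

All three inputs describe the same image, yet only the first one is matched. Extending the pattern to cover each variant is possible, but every addition makes it harder to read and maintain.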

Jsoup is a better approach because it doesn't rely on the specific HTML syntax. Instead, it builds a tree representation of the HTML, and nodes can be selected using selectors that apply to the structure of the HTML, not its syntax.
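
As a rough sketch of the tree-based approach, the following Java example uses Jsoup to select img elements by structure. The class name, method names, and sample markup are hypothetical, and the way such code is wired into a custom filter is not shown here; it assumes the Jsoup library is available on the classpath.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupImageExample {
    // Collects the src value of every <img> that has a src attribute,
    // regardless of quoting style or attribute order.
    static List<String> extractSrcs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("img[src]").eachAttr("src");
    }

    // Adds a fallback alt attribute to images that lack one and
    // returns the serialized document.
    static String addMissingAlt(String html, String fallbackAlt) {
        Document doc = Jsoup.parse(html);
        for (Element img : doc.select("img[src]")) {
            if (!img.hasAttr("alt")) {
                img.attr("alt", fallbackAlt);
            }
        }
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        // The same selector handles all three syntactic variants.
        System.out.println(extractSrcs("<img src=\"a.png\" alt=\"A\">"));
        System.out.println(extractSrcs("<img src='a.png' alt='A'>"));
        System.out.println(extractSrcs("<img alt=\"A\" src=\"a.png\">"));
    }
}
```

The selector img[src] describes what to find (an img element that has a src attribute) and leaves the handling of quotes, whitespace, and attribute order to the parser.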
