Create a jsoup filter


The following basic example adds a custom filter to the jsoup filter chain. The filter scrapes the alt text of the first image in a document and writes it to a custom metadata field.

  1. Create a @groovy/filter/jsoup folder within your collection. Add your jsoup filters to this folder.
  2. Create your filter. For example, create a filter that will be added to the jsoup filter chain as filter.jsoup.ScrapeMetadata:
    Create a file named ScrapeMetadata.groovy inside the @groovy/filter/jsoup folder, containing the following:
ScrapeMetadata.groovy (Example)
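package filter.jsoup;
// Imports for the jsoup filter framework (IJSoupFilter, FilterContext)
import com.funnelback.common.filter.jsoup.*;
// Imports for logging
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;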
/**
 * Scrapes the alt text of the first image and inserts it as custom metadata.
 */
public class ScrapeMetadata implements IJSoupFilter {
    private static final Logger logger = LogManager.getLogger(ScrapeMetadata.class)
    /** Configure some metadata field names - this will be used to add a <meta name="fb.custom" /> field */
    public static final String CUSTOM_META = "fb.custom";
    public void processDocument(FilterContext context) {
        // get the document object
        def doc = context.getDocument();
        // run some Jsoup selects. eg select the first img element in the document
        def image = doc.select("img").first()
  
        if (image != null) {
            // attr() returns an empty string when the attribute is absent
            def meta = image.attr('alt')
            if (!meta.isEmpty()) {
                // print out some logging to crawler.inline_filter.log
                logger.error("Extracted metadata: "+meta)
                context.getAdditionalMetadata().put(CUSTOM_META, meta)
            }
        }
    }
}
  3. Add the filter to the jsoup filter chain by appending it to the filter.jsoup.classes setting:
collection.cfg
filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,filter.jsoup.ScrapeMetadata
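
The selection logic used by the filter can be checked outside Funnelback before a crawl is run. The following is a minimal sketch, assuming jsoup can be fetched with Groovy's Grape (the version number is illustrative) and using a hypothetical helper named firstImageAlt that is not part of Funnelback. It mirrors the filter's behaviour: take the first img element and use its alt text only when it is non-empty.

```groovy
// Assumption: jsoup is resolved via Grape; the version shown is illustrative.
@Grab('org.jsoup:jsoup:1.15.3')
import org.jsoup.Jsoup

/**
 * Hypothetical helper (not part of Funnelback): returns the alt text of the
 * first <img> in the HTML, or null when there is no image or no alt text.
 */
String firstImageAlt(String html) {
    def doc = Jsoup.parse(html)
    def image = doc.select("img").first()   // null when no <img> is present
    if (image == null) {
        return null
    }
    def alt = image.attr("alt")             // empty string when the attribute is missing
    return alt.isEmpty() ? null : alt
}

def sample = '<html><body><img src="a.png" alt="Company logo"><img src="b.png" alt="Second"></body></html>'
println firstImageAlt(sample)                   // prints: Company logo
println firstImageAlt('<p>No images here</p>')  // prints: null
```

For the sample document above, the filter would store "Company logo" under the fb.custom metadata field.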
