Create a jsoup filter
This page provides a basic example that adds a custom filter to the jsoup filter chain. The filter scrapes the alt text of the first image in the document and writes it to a custom metadata field.
- Create a @groovy/filter/jsoup folder within your collection. Add your jsoup filters to this folder.
- Create your filter. For example, create a filter that will be added to the jsoup filter chain as filter.jsoup.ScrapeMetadata:
Create a file inside the @groovy/filter/jsoup folder called ScrapeMetadata.groovy (the file name must match the class name) containing the following:
ScrapeMetadata.groovy (Example)
package filter.jsoup;

// Funnelback jsoup filter API (provides IJSoupFilter and FilterContext)
import com.funnelback.common.filter.jsoup.*;
// Imports for logging
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

/**
 * Scrapes the alt text of the first image and inserts it as custom metadata
 */
public class ScrapeMetadata implements IJSoupFilter {
    private static final Logger logger = LogManager.getLogger(ScrapeMetadata.class);

    /** Metadata field name - this will be used to add a <meta name="fb.custom" /> field */
    public static final String CUSTOM_META = "fb.custom";

    @Override
    public void processDocument(FilterContext context) {
        // Get the jsoup document object
        def doc = context.getDocument();
        // Run some jsoup selects, e.g. select the first img element in the document
        def image = doc.select("img").first();
        if (image != null) {
            // attr() returns an empty string (never null) when the attribute is missing
            def meta = image.attr("alt");
            if (!meta.isEmpty()) {
                // Print out some logging to crawler.inline_filter.log
                logger.info("Extracted metadata: " + meta);
                context.getAdditionalMetadata().put(CUSTOM_META, meta);
            }
        }
    }
}
- Add the filter to the end of the jsoup filter chain in collection.cfg:
collection.cfg
filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,filter.jsoup.ScrapeMetadata
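Before adding the filter to the chain, it can help to check the selection logic on its own. The following is a minimal standalone sketch, not part of Funnelback: it assumes jsoup can be fetched with Groovy's @Grab (the version shown is an assumption; any recent jsoup release should behave the same), and firstImageAlt is a hypothetical helper that mirrors the logic in processDocument. The sample HTML is made up for illustration.

```groovy
// Standalone sketch: runs outside Funnelback and needs only jsoup.
// The jsoup version below is an assumption; substitute the one you use.
@Grab('org.jsoup:jsoup:1.15.3')
import org.jsoup.Jsoup

// Hypothetical helper mirroring the filter's selection logic:
// returns the alt text of the first img element, or null if there is none.
String firstImageAlt(String html) {
    def doc = Jsoup.parse(html)
    // first() returns null when the selector matches nothing
    def image = doc.select("img").first()
    // attr() returns an empty string when the attribute is missing
    def alt = image?.attr("alt")?.trim()
    return alt ? alt : null
}

// Illustrative input only
def html = '<html><body><img src="a.png" alt="Company logo"><img src="b.png" alt="Second"></body></html>'
println firstImageAlt(html)                     // prints "Company logo"
println firstImageAlt('<p>No images here</p>')  // prints "null"
```

Only the first img element is inspected, so a page whose first image has no alt text produces no metadata even if later images have one.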