
Extracting URLs with Chained Extractors

Many websites provide a list of products or search results in which each entry links to a page with more details. To retrieve the details for all of the products, you can build two extractors and chain them together. The parent extractor captures a list of URLs (links) to the product pages. The child extractor uses the output of the parent extractor to collect data for the individual products. This method is known as chaining extractors.

A good example is http://owlkingdom.com/, where the homepage shows a list of owls.

Each listing on this page includes URLs to the pages for Pointy, Needy, Smart, Tall, Shy, and Snowy, which have more details about each owl.

To capture all of the details for each owl, you can create a listings extractor that extracts data from http://owlkingdom.com/. During the training, Import.io should automatically detect the list and capture all of the links in one of the columns.
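
As a rough illustration of what the listings step captures, the sketch below pulls link rows out of a small piece of HTML using only the Python standard library. It does not show how Import.io detects lists internally. The sample markup, the `pointy.html` path, and the names `LinkCollector` and `extract_listing_rows` are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup standing in for a listings page; the real page will differ.
SAMPLE_LISTING_HTML = """
<ul>
  <li><a href="pointy.html">Pointy</a></li>
  <li><a href="snowy.html">Snowy</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collects one row per <a href> tag, resolving relative links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rows = []
        self._current = None  # row being built while inside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"name": "", "url": urljoin(self.base_url, href)}

    def handle_data(self, data):
        # Only text that appears inside a link becomes part of the row.
        if self._current is not None:
            self._current["name"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.rows.append(self._current)
            self._current = None


def extract_listing_rows(html, base_url):
    """Return a list of {"name", "url"} rows, one per link found."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.rows


rows = extract_listing_rows(SAMPLE_LISTING_HTML, "http://owlkingdom.com/")
for row in rows:
    print(row["name"], row["url"])
```

The `url` field in each row plays the role of the link column that the listings extractor is expected to capture.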

Once the listings extractor is created, you can create a details extractor against one of the owl pages, such as http://owlkingdom.com/snowy.html. After training and saving the details extractor, you can set it to use the URLs extracted by the listings extractor. On the Settings tab of your details extractor, change the Extract from dropdown to URLs from another Extractor, set the Parent Extractor to your listings extractor, and select the URL column that contains the extracted URLs.
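
Conceptually, the settings above tell the child extractor to read one column of the parent's output and visit each URL in it. The sketch below mimics that flow in plain Python. The row data, the `link` column name, and the functions `urls_from_column`, `run_child`, and `fake_details` are hypothetical, and skipping empty or duplicate URLs is a choice made for this example rather than documented Import.io behavior.

```python
# Rows as a parent (listings) extractor might produce them; values are illustrative.
parent_rows = [
    {"name": "Pointy", "link": "http://owlkingdom.com/pointy.html"},
    {"name": "Snowy", "link": "http://owlkingdom.com/snowy.html"},
    {"name": "Snowy", "link": "http://owlkingdom.com/snowy.html"},  # duplicate row
    {"name": "Unknown", "link": ""},  # row with no URL
]


def urls_from_column(rows, url_column):
    """Return the non-empty URLs from one column, in order, without repeats."""
    seen = set()
    urls = []
    for row in rows:
        url = row.get(url_column, "")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def run_child(urls, extract_details):
    """Apply the child (details) step to every URL supplied by the parent."""
    return [extract_details(url) for url in urls]


def fake_details(url):
    # Placeholder for the real details extraction; it only records the URL.
    return {"source_url": url}


child_inputs = urls_from_column(parent_rows, "link")
details = run_child(child_inputs, fake_details)
print(details)
```

Choosing the wrong column in `urls_from_column` gives the child nothing to visit, which is why the URL column selection matters in the Settings tab.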

You can also enable the option Always run the parent first - run this when parent finishes. This option automatically runs the parent extractor before the child extractor, regardless of which of the two is triggered. When it is enabled, the child extractor's schedule is set by the parent extractor: the child is triggered each time the parent extractor finishes a run.

When an extractor has the Always run the parent first option enabled, an arrow will appear next to it in the extractor list to indicate that it is chained.

Chained extractors can be multiple levels deep, such as a products extractor that is chained to a listings extractor that is in turn chained to a categories extractor. If the Always run the parent first - run this when parent finishes option is enabled for both the listings extractor and the products extractor, triggering any of the three extractors runs the categories extractor first, then the listings extractor, and then the products extractor.
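
The run order described above can be modeled with a small sketch. This is one interpretation of the documented behavior, not Import.io's scheduler: it walks up from the triggered extractor while the option is enabled, then runs downward through every child that has the option enabled. The function `run_order` and the dictionaries `parent_of` and `run_parent_first` are hypothetical names.

```python
def run_order(triggered, parent_of, run_parent_first):
    """Return extractor names in the order they would run in this model."""
    # Walk up: an extractor with the option enabled defers to its parent.
    root = triggered
    visited = {root}
    while run_parent_first.get(root, False) and root in parent_of:
        root = parent_of[root]
        if root in visited:
            raise ValueError("chain contains a cycle")
        visited.add(root)

    # Walk down: each finished parent triggers children that have the option on.
    order = [root]
    queue = [root]
    while queue:
        parent = queue.pop(0)
        for child, its_parent in parent_of.items():
            enabled = run_parent_first.get(child, False)
            if its_parent == parent and enabled and child not in order:
                order.append(child)
                queue.append(child)
    return order


# Three-level chain from the example: products -> listings -> categories.
parent_of = {"products": "listings", "listings": "categories"}
run_parent_first = {"products": True, "listings": True}

for name in ("categories", "listings", "products"):
    print(name, "->", run_order(name, parent_of, run_parent_first))
```

In this model, every trigger in the three-level example yields the same order: categories, listings, products.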

Elements of the Chaining View

  1. Extract from: Dropdown to set whether the extractor uses URLs from an explicit list of URLs provided or URLs extracted by another extractor.
  2. Parent extractor: Extractor that extracts the list of URLs to use.
  3. URL column: Specific column in the Parent extractor's output that contains the URLs to use.
  4. Always run the parent first: On/off toggle to automatically trigger the current child extractor after its parent extractor completes a crawl run.
  5. Save Input Mapping: Saves the chaining settings for this extractor.
  6. Run Inputs: Triggers the parent extractor to run, and then runs the current child extractor.