Extractor Studio

What is Extractor Studio?

Extractor Studio is a toolchain that lets import.io users and managed service providers build scalable extractor definitions by organizing them into a modular Extractor Library.

Concepts

An import.io Extractor Library is a Git repository that contains a library of modules and extractors for one or more organizations.

There are multiple types of modules:

Robot

  • An extractor template from which an org extractor inherits when it is created

Schema

  • A definition of what columns are expected to be returned

  • Primarily used to scaffold extraction YAML files when creating extractors from a robot template

Action

  • A browser control and logic building block

  • Uses a browser context (see IContext) to control the browser

  • An action may be used as an interface

    • A default definition may be provided

    • Named parameters (e.g. domain, country)

Extraction

  • A definition of what to extract on the page

  • Configuration includes the fields to extract and their corresponding selectors, XPaths, regular expressions, types, etc.
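
To make the idea of an extraction concrete, here is a minimal Python sketch of fields paired with an XPath, an optional regular expression, and a type. It is not Extractor Studio's configuration format or API: FieldConfig, extract, and the sample page are hypothetical names invented for illustration, and the page is parsed with Python's standard xml.etree module (which supports only a limited XPath subset) rather than a real browser.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FieldConfig:
    """Hypothetical description of one field to extract (illustration only)."""
    name: str                               # column name in the output row
    xpath: str                              # where to find the value on the page
    regex: Optional[str] = None             # optional pattern; group 1 is kept
    cast: Callable[[str], object] = str     # type conversion applied last


def extract(page: ET.Element, fields: List[FieldConfig]) -> Dict[str, object]:
    """Apply each field's XPath, then its regex, then its type conversion."""
    row: Dict[str, object] = {}
    for field in fields:
        text = page.findtext(field.xpath)
        if text is None:
            # Nothing matched the XPath: leave the column empty.
            row[field.name] = None
            continue
        text = text.strip()
        if field.regex is not None:
            match = re.search(field.regex, text)
            if match is None:
                row[field.name] = None
                continue
            text = match.group(1)
        row[field.name] = field.cast(text)
    return row


# A small, well-formed sample page (assumed markup, not from a real site).
PAGE = ET.fromstring(
    "<html><body>"
    "<h1 class='title'> Blue Widget </h1>"
    "<span class='price'>USD 19.99</span>"
    "</body></html>"
)

FIELDS = [
    FieldConfig(name="title", xpath=".//h1[@class='title']"),
    FieldConfig(
        name="price",
        xpath=".//span[@class='price']",
        regex=r"([0-9]+\.[0-9]+)",
        cast=float,
    ),
]

row = extract(PAGE, FIELDS)
print(row)
```

The order used here (locate, narrow with a regular expression, then convert the type) is one reasonable reading of the configuration items listed above; the actual extraction files may combine them differently.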

Each module instance maps to a file in the Git repository under the top-level src/library folder and has a URI composed of its type and path, e.g. "action:product/details".
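
As an illustration of the URI shape, the following Python sketch splits a module URI into its type and path. It is not part of the Extractor Studio toolchain: ModuleUri and parse_module_uri are hypothetical names, and the lowercase type names other than "action" are assumed from the module types listed above. The on-disk file layout under src/library is not spelled out here, so the sketch stops at parsing and does not guess file names.

```python
from typing import NamedTuple

# Assumed lowercase spellings of the four module types; only "action"
# appears in the documented example URI.
MODULE_TYPES = frozenset({"robot", "schema", "action", "extraction"})


class ModuleUri(NamedTuple):
    """Hypothetical parsed form of a module URI such as 'action:product/details'."""
    module_type: str
    path: str


def parse_module_uri(uri: str) -> ModuleUri:
    """Split '<type>:<path>' at the first colon and validate both parts."""
    module_type, separator, path = uri.partition(":")
    if not separator or not path:
        raise ValueError(f"expected '<type>:<path>', got {uri!r}")
    if module_type not in MODULE_TYPES:
        raise ValueError(f"unknown module type {module_type!r} in {uri!r}")
    return ModuleUri(module_type, path)


def format_module_uri(module: ModuleUri) -> str:
    """Rebuild the URI string from its parts."""
    return f"{module.module_type}:{module.path}"


parsed = parse_module_uri("action:product/details")
print(parsed.module_type, parsed.path)
```

Splitting at the first colon only keeps any later colons as part of the path, which seems the safer choice when the path grammar is not known.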