Authenticated Extractors

What is an Authenticated Extractor?

An Authenticated Extractor is an extractor whose data sits behind a login, meaning you must be logged in as a user on the target website in order to extract the data you need.

Extractor Studio allows any robot to behave as an Authenticated Extractor, but requires some additional configuration in order to do so.

How do Authenticated Extractors work?

Before building an Authenticated Extractor it’s important to understand how they will work at runtime.

Browser Session

Every Extractor that runs on import.io, Authenticated or not, uses a browser session to navigate to the target website and perform the needed actions.

For regular (non-Authenticated) Extractors the browser session and its state are not important. Think of it as opening an incognito tab for each input you wish to extract. Browser state is not persisted between extractions, and the name of the game is to extract many inputs in parallel without a care for session cookies and the like.

For Authenticated Extractors the browser state and session are important: we want to make sure the "user" remains logged in, otherwise our data may be invalid or unavailable. For this reason Authenticated Extractors must first log in (once) to the website before attempting to extract any inputs, and must detect when the session has been invalidated so that they can log in again.

Auth Interactions

An Authenticated Extractor must log in before attempting to extract data. "Auth Interactions" serve this purpose.

"Auth Interactions" map to authInteractions on the extractor runtime configuration and consist of an interaction sequence to be performed in order to log the user in.

authInteractions execute once before any extraction inputs are attempted, and execute again only if checkAuthInteractions throws an error.

"Auth Interactions" are defined in the authentication section of the robot template. More on that in the "Configuration" section below.

Check Authentication

Throughout your data extraction the target website may log you out, or the browser session may be invalidated. This will cause the data you’re seeking to be either missing or incorrect. "Check Authentication" actions serve as a means to validate your session before attempting to extract data.

You can "check" that your auth session is still valid by configuring "Check Authentication" actions. These map to "checkAuthInteractions" on the extractor runtime configuration and are configured in the checkAuthentication section of the robot template. More on this in the "Configuration" section below.

If present, "Check Authentication" runs before each input. If it throws an error, the browser re-executes the "Auth Interactions" before performing the data extraction.
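The flow described above can be sketched in a few lines of JavaScript. This is a simplified illustration, not the actual import.io runtime: the function runAuthenticatedExtraction, its option names and the simulated site in demo are all hypothetical.

```javascript
// Hypothetical sketch of the runtime flow. None of these names are part of
// the import.io API; they only illustrate the order of operations.
async function runAuthenticatedExtraction(inputs, { authenticate, checkAuth, extract }) {
  const results = [];

  // "Auth Interactions" run once before any input is attempted.
  await authenticate();

  for (const input of inputs) {
    if (checkAuth) {
      try {
        // "Check Authentication" runs before each input.
        await checkAuth(input);
      } catch (err) {
        // A thrown error means the session is no longer valid: log in again.
        await authenticate();
      }
    }
    results.push(await extract(input));
  }
  return results;
}

// Simulated site that logs the user out after every second extraction.
async function demo() {
  let loggedIn = false;
  let logins = 0;
  let served = 0;
  const results = await runAuthenticatedExtraction(['a', 'b', 'c'], {
    authenticate: async () => { loggedIn = true; logins += 1; },
    checkAuth: async () => { if (!loggedIn) throw new Error('session expired'); },
    extract: async (input) => {
      served += 1;
      if (served % 2 === 0) loggedIn = false; // the site invalidates the session
      return `data for ${input}`;
    },
  });
  return { results, logins };
}
```

In this simulation all three inputs are extracted and the login runs twice: once up front, and once more when the check before the third input fails.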

Configuration

Robot

Any robot can support Authentication. To allow a robot to support Authentication simply:

  • Add an authentication entry point to your robot.yaml.

    • Behaves the same as entryPoint

    • Can have a dynamic entrypoint by resolving parameters. For example: shared/auth/${domain}

  • (Optional) Add a checkAuthentication entryPoint

    • Serves to validate login

    • Runs before each extraction

    • Supports Dynamic entry points

Example:

Below is an example robot.yaml file that supports authentication:

---
proxy:
  zone: USA
  type: DATA_CENTER
honorRobots: false
schema: product/details
parameters:
  - store
  - country
  - domain
entryPoint: product/search
pathTemplate: product/${store[0:1]}/${store}/${country}/details
authentication: shared/auth/action
checkAuthentication: shared/checkAuth/action
---
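To make the parameter placeholders more concrete, here is a minimal sketch of how a template such as shared/auth/${domain} or the pathTemplate above might be resolved. The resolveTemplate helper is hypothetical, and the slice semantics for ${store[0:1]} (start inclusive, end exclusive) are an assumption rather than documented behaviour.

```javascript
// Hypothetical resolver, not the Extractor Studio implementation.
// Assumption: ${name[start:end]} takes the characters from start (inclusive)
// to end (exclusive) of the parameter value.
function resolveTemplate(template, parameters) {
  return template.replace(
    /\$\{(\w+)(?:\[(\d+):(\d+)\])?\}/g,
    (match, name, start, end) => {
      if (!(name in parameters)) {
        throw new Error(`Missing parameter: ${name}`);
      }
      const value = String(parameters[name]);
      // Without a slice the whole value is substituted.
      return start === undefined ? value : value.slice(Number(start), Number(end));
    }
  );
}

// Example parameter values, made up for illustration.
const exampleParameters = { store: 'acme', country: 'us', domain: 'example.com' };

const authEntryPoint = resolveTemplate('shared/auth/${domain}', exampleParameters);
// 'shared/auth/example.com'

const detailsPath = resolveTemplate(
  'product/${store[0:1]}/${store}/${country}/details',
  exampleParameters
);
// 'product/a/acme/us/details'
```

Under these assumptions, a single robot can point at a different authentication entry point per domain simply by changing the extractor's parameters.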

Below is an example authentication entry point:

---
async function implementation (
  inputs,
  parameters,
  context,
  dependencies
) {
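  // Credentials are passed in on the inputs under the _credentials key.
  const { _credentials } = inputs;
  const credentials = _credentials || {};
  // Run the login steps in order; each one is a separate action dependency.
  await dependencies.gotoLogin({});
  await dependencies.preLogin(credentials);
  await dependencies.doLogin(credentials);
  await dependencies.postLogin(credentials);
  console.log('Logged in!');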
}
module.exports = {
  parameters: [
    {
      name: 'domain',
      description: '',
      optional: false
    }
  ],
  inputs: [
    {
      name: '_credentials',
      description: '',
      type: 'string',
      optional: false
    }
  ],
  dependencies: {
    gotoLogin: 'action:shared/auth/gotoLogin',
    preLogin: 'action:shared/auth/preLogin',
    postLogin: 'action:shared/auth/postLogin',
    doLogin: 'action:shared/auth/doLogin'
  },
  path: './domains/${domain[0:2]}/${domain}/authenticate',
  implementation
};
---

Below is an example checkAuthentication entryPoint:

---
async function implementation (
  inputs,
  parameters,
  context,
  dependencies
) {
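  const { url } = inputs;
  const { loggedInSelector } = parameters;
  await dependencies.goto({ url });
  // Wait for an element that is only present for logged-in users. If this
  // throws, the "Auth Interactions" are executed again.
  await context.waitForSelector(loggedInSelector);
  console.log('Logged in!');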
}
module.exports = {
  parameters: [
    {
      name: 'domain',
      description: '',
      optional: false
    },
    {
      name: 'loggedInSelector'
    }
  ],
  inputs: [
    {
      name: 'url',
      description: '',
      type: 'string',
      optional: false
    }
  ],
  dependencies: {
    goto: 'action:shared/goto'
  },
  path: '../auth/domains/${domain[0:2]}/${domain}/checkAuth',
  implementation
};
---

Extractor (extractor.yaml)

Just because a robot supports Authentication does not mean that every Extractor that implements that robot needs it. For this reason Authentication is opt-in.

To turn an existing extractor into an authenticated one simply:

  • Add authenticated: true to the extractor.yaml

  • Re-run the import-io extractor:new scaffold command with the --auth flag to generate the needed dependencies

  • Fill out the parameters and train the generated files as usual

  • Fill out the credentials.yaml file in the directory containing the extractor.yaml (more on Credentials files below)

Example:

Below is an example extractor.yaml file that uses authentication:

---
robot: examples/simple
parameters:
  domain: doom.import.io
authenticated: true
---
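The opt-in behaviour can be pictured with a small sketch. It is illustrative only: buildRuntimeConfig is a hypothetical helper, and the shape of the runtime configuration (the authentication entry points copied as plain strings) is a simplifying assumption, not the real import.io format.

```javascript
// Hypothetical sketch: `robot` and `extractor` stand for the parsed robot.yaml
// and extractor.yaml files. The returned shape is an assumption.
function buildRuntimeConfig(robot, extractor) {
  const config = {
    entryPoint: robot.entryPoint,
    parameters: extractor.parameters || {},
  };

  // Authentication is opt-in: the robot's auth entry points are ignored
  // unless the extractor sets authenticated: true.
  if (extractor.authenticated === true) {
    if (!robot.authentication) {
      throw new Error('Extractor is authenticated but the robot has no authentication entry point');
    }
    config.authInteractions = robot.authentication;
    if (robot.checkAuthentication) {
      // checkAuthentication is optional on the robot.
      config.checkAuthInteractions = robot.checkAuthentication;
    }
  }
  return config;
}

const exampleRobot = {
  entryPoint: 'product/search',
  authentication: 'shared/auth/action',
  checkAuthentication: 'shared/checkAuth/action',
};

// Same robot, two extractors: only the first one opts in.
const withAuth = buildRuntimeConfig(exampleRobot, {
  parameters: { domain: 'doom.import.io' },
  authenticated: true,
});
const withoutAuth = buildRuntimeConfig(exampleRobot, {
  parameters: { domain: 'doom.import.io' },
});
```

Here withAuth carries both authInteractions and checkAuthInteractions, while withoutAuth has neither, even though both extractors implement the same robot.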

Credentials

Credentials used to log in to a target website can be stored in a credentials.yaml file.

When deploying, the credentials object is safely encrypted and stored by import.io. At runtime these credentials are passed to an action's inputs under the key _credentials.

For security reasons it is recommended that credentials.yaml files be gitignored.

Example:

Below is an example credentials.yaml file.

In the example the default credentials are username: example@example.com and password: meep.

Branch-specific credentials can be stored in the branches section. In the example, the dev branch uses different credentials from the default ones. If no branch-specific credentials are specified, the default credentials are used.

---
default:
  username: example@example.com
  password: meep
branches:
  dev:
    username: different@example.com
    password: otherPassword123
---
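The branch fallback rule can be expressed as a short function. This is a sketch of the rule as described here, not import.io's implementation; resolveCredentials is a hypothetical name and the object literal simply mirrors the example file above.

```javascript
// Hypothetical helper: `credentialsFile` is the parsed credentials.yaml.
function resolveCredentials(credentialsFile, branch) {
  const branches = credentialsFile.branches || {};
  // Use branch-specific credentials when they exist...
  if (branch && Object.prototype.hasOwnProperty.call(branches, branch)) {
    return branches[branch];
  }
  // ...otherwise fall back to the default credentials.
  return credentialsFile.default;
}

// The example credentials.yaml, written out as a JavaScript object.
const exampleCredentials = {
  default: { username: 'example@example.com', password: 'meep' },
  branches: {
    dev: { username: 'different@example.com', password: 'otherPassword123' },
  },
};

const devCredentials = resolveCredentials(exampleCredentials, 'dev');
// { username: 'different@example.com', password: 'otherPassword123' }

const otherCredentials = resolveCredentials(exampleCredentials, 'feature-x');
// falls back to { username: 'example@example.com', password: 'meep' }
```

The branch name 'feature-x' is made up; any branch without its own entry resolves to the default credentials.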

Scaffolding

When creating a new robot using the import-io robot:new command, the --authentication and --checkAuthentication flags can be provided to point to the respective entry points.

When creating a new extractor using the import-io extractor:new command, the --auth flag can be provided. It scaffolds the needed dependencies and adds a credentials.yaml file to the extractor directory.

Testing

For testing Authenticated Extractors locally, the import-io extractor:run:[local or remote] commands are recommended.

The extractor:run commands will sequentially execute the following entry points:

  • checkAuthentication

  • authentication (if checkAuthentication fails)

  • entryPoint

By default, browser sessions are cached for 15 minutes. You can clear your browser state by running import-io cache:clear.
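As a mental model for the session cache, the sketch below shows a time-limited cache with a 15 minute lifetime. It is purely conceptual: how the CLI actually stores sessions is not described here, and the SessionCache class, its methods and the key used in the usage example are hypothetical.

```javascript
// Conceptual sketch only; not the import-io CLI's cache implementation.
const SESSION_TTL_MS = 15 * 60 * 1000; // 15 minutes

class SessionCache {
  // `now` is injectable so that expiry can be demonstrated without waiting.
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map();
  }

  set(key, session) {
    this.entries.set(key, { session, storedAt: this.now() });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= SESSION_TTL_MS) {
      // Expired: drop the entry so the next run has to log in again.
      this.entries.delete(key);
      return undefined;
    }
    return entry.session;
  }

  // Comparable in spirit to running import-io cache:clear.
  clear() {
    this.entries.clear();
  }
}

// Usage with a fake clock.
let fakeTime = 0;
const sessionCache = new SessionCache(() => fakeTime);
sessionCache.set('doom.import.io', { cookies: ['session=abc'] });

fakeTime = 14 * 60 * 1000;
const stillCached = sessionCache.get('doom.import.io') !== undefined;

fakeTime = 15 * 60 * 1000;
const expired = sessionCache.get('doom.import.io') === undefined;
```

With this model, a second test run within the 15 minute window reuses the logged-in session, which is why clearing the cache is useful when you want to exercise the authentication entry point again.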

Documentation for these commands can be found by clicking here

Deploying

Authenticated Extractors can be deployed to SaaS or Workbench, though the requirements for each differ slightly.

Workbench

Deploying an Authenticated Extractor as a Source to Workbench using the import-io source:deploy command requires:

  • Valid User Token configured in developer environment

  • User Token must belong to the Organization you are attempting to deploy to

  • Only the default credentials will be used. Credentials management in Workbench is coming soon

SaaS

For security purposes, when deploying an authenticated extractor to app.import.io, the API key used in your environment must belong to the account you are deploying to. Otherwise you will get an error when attempting to deploy.

Running on Workbench

It is important to note that in order to run an Authenticated Extractor on Workbench you must first have the "Legacy Platform Id" saved on the Organization. This ID is used for security purposes to validate that the extractor is authorized to log in as the saved user on the extractor and perform the web automation it requires for data extraction.