Authenticated Extractors
What is an Authenticated Extractor?
An Authenticated Extractor is an extractor whose data sits behind a login, meaning you must be logged in as a user on the target website in order to extract the data you need.
Extractor Studio allows any robot to behave as an Authenticated Extractor, but requires some additional configuration in order to do so.
How do Authenticated Extractors work?
Before building an Authenticated Extractor it’s important to understand how they will work at runtime.
Browser Session
For every Extractor that runs on import.io (Authenticated or not) browser sessions are used in order to navigate to the target website and perform the needed actions.
For regular Extractors (not Authenticated) the browser session and state are not important. Think of it as opening an incognito tab for each input you wish to extract. Browser state is not persisted between extractions, and the name of the game is to extract many inputs in parallel without a care for session cookies and the like.
For Authenticated Extractors the browser state and session are important. We want to make sure the "user" remains logged in, otherwise our data may be invalid or unavailable. For this reason Authenticated Extractors must first log in (once) to the website before attempting to extract any inputs, and must detect when this session becomes invalid so that they can log in again.
Auth Interactions
For Authenticated Extractors we must first log in before attempting to extract the data. "Auth Interactions" serve this purpose.
"Auth Interactions" map to authInteractions on the extractor runtime configuration and consist of an interaction sequence to be performed in order to log the user in.
authInteractions execute once before any extraction inputs are attempted, and will only execute again if the checkAuthInteractions throw an error.
"Auth Interactions" are defined in the authentication section of the robot template. More on that in the "Configuration" section below.
Check Authentication
Throughout your data extraction the target website may log you out or the browser session may be invalidated. This of course will cause the data you’re seeking to either not be present or incorrect. "Check Authentications" serve as a means to validate your session prior to attempting to extract data.
You can "check" that your auth session is still valid by configuring "Check Authentication" actions. These map to checkAuthInteractions on the extractor runtime configuration and are configured in the checkAuthentication section of the robot template. More on this in the "Configuration" section below.
If present, "Check Authentication" runs before each input. If it throws an error, the browser re-executes the "Auth Interactions" before performing the data extraction.
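The flow described in this section can be sketched roughly as follows. This is only an illustration of the ordering, not import.io's runtime code: runAuthenticated and the three callbacks passed to it are hypothetical stand-ins for the configured authInteractions, checkAuthInteractions, and the extraction itself.

```javascript
// Hypothetical sketch of the ordering described above (not the real runtime).
// authInteractions, checkAuthInteractions and extract are stand-in async functions.
async function runAuthenticated({ authInteractions, checkAuthInteractions, extract }, inputs) {
  const results = [];
  // Log in once before any input is attempted.
  await authInteractions();
  for (const input of inputs) {
    if (checkAuthInteractions) {
      try {
        // Validate the session before each input.
        await checkAuthInteractions(input);
      } catch (err) {
        // A thrown error signals an invalid session, so log in again.
        await authInteractions();
      }
    }
    results.push(await extract(input));
  }
  return results;
}
```

In this sketch a failing check triggers exactly one new login attempt before the input is extracted; whether the real runtime retries further is not covered here.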
Configuration
Robot
Any robot can support Authentication. To allow a robot to support Authentication simply:
- Add an authentication entry point to your robot.yaml.
  - Behaves the same as entryPoint
  - Can have a dynamic entry point by resolving parameters. For example: shared/auth/${domain}
- (Optional) Add a checkAuthentication entryPoint.
  - Serves to validate login
  - Runs before each extraction
  - Supports dynamic entry points
Example:
Below is an example robot.yaml file that supports authentication:
proxy:
  zone: USA
  type: DATA_CENTER
honorRobots: false
schema: product/details
parameters:
  - store
  - country
  - domain
entryPoint: product/search
pathTemplate: product/${store[0:1]}/${store}/${country}/details
authentication: shared/auth/action
checkAuthentication: shared/checkAuth/action
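To make the parameter placeholders in the file above more concrete, here is a minimal sketch of how such a template could be resolved. It assumes that ${name[start:end]} means a substring slice with an exclusive end index; that reading, the resolveTemplate helper, and the sample parameter values are all assumptions for illustration, and the real resolver may behave differently.

```javascript
// Hypothetical helper: resolves ${name} and ${name[start:end]} placeholders.
// Assumption: [start:end] is a substring slice with an exclusive end index.
function resolveTemplate(template, parameters) {
  return template.replace(/\$\{(\w+)(?:\[(\d+):(\d+)\])?\}/g, (match, name, start, end) => {
    const value = parameters[name];
    if (value === undefined) {
      throw new Error(`Missing parameter: ${name}`);
    }
    const text = String(value);
    // Without a slice, substitute the whole value.
    return start === undefined ? text : text.slice(Number(start), Number(end));
  });
}

// Sample values invented for this example.
const sampleParameters = { store: 'acme', country: 'us', domain: 'example.com' };
console.log(resolveTemplate('shared/auth/${domain}', sampleParameters));
// prints "shared/auth/example.com"
console.log(resolveTemplate('product/${store[0:1]}/${store}/${country}/details', sampleParameters));
// prints "product/a/acme/us/details"
```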
Below is an example authentication entry point
---
async function implementation (
  inputs,
  parameters,
  context,
  dependencies
) {
  // _credentials is passed in at runtime from the stored credentials
  const { _credentials } = inputs;
  const credentials = _credentials || {};
  // Run the login steps in order
  await dependencies.gotoLogin({});
  await dependencies.preLogin(credentials);
  await dependencies.doLogin(credentials);
  await dependencies.postLogin(credentials);
  console.log('Logged in!');
}

module.exports = {
  parameters: [
    {
      name: 'domain',
      description: '',
      optional: false
    }
  ],
  inputs: [
    {
      name: '_credentials',
      description: '',
      type: 'string',
      optional: false
    }
  ],
  dependencies: {
    gotoLogin: 'action:shared/auth/gotoLogin',
    preLogin: 'action:shared/auth/preLogin',
    postLogin: 'action:shared/auth/postLogin',
    doLogin: 'action:shared/auth/doLogin'
  },
  path: './domains/${domain[0:2]}/${domain}/authenticate',
  implementation
};
---
Below is an example checkAuthentication entryPoint:
---
async function implementation (
  inputs,
  parameters,
  context,
  dependencies
) {
  const { url } = inputs;
  const { loggedInSelector } = parameters;
  await dependencies.goto({ url });
  // Throws if the selector never appears, which signals an invalid session
  await context.waitForSelector(loggedInSelector);
  console.log('Logged in!');
}

module.exports = {
  parameters: [
    {
      name: 'domain',
      description: '',
      optional: false
    },
    {
      name: 'loggedInSelector'
    }
  ],
  inputs: [
    {
      name: 'url',
      description: '',
      type: 'string',
      optional: false
    }
  ],
  dependencies: {
    goto: 'action:shared/goto'
  },
  path: '../auth/domains/${domain[0:2]}/${domain}/checkAuth',
  implementation
};
---
Extractor (extractor.yaml)
Just because a robot supports Authentication does not mean that every Extractor that implements that robot needs it. For this reason Authentication is opt-in.
To turn an existing extractor into an authenticated one simply:
- Add authenticated: true to the extractor.yaml
- Re-run the import-io extractor:new scaffold command with the --auth flag to generate the needed dependencies
- Fill out the parameters and train the generated files as usual
- Fill out the credentials.yaml file in the same directory as the extractor.yaml as necessary (more on credentials files below)
Credentials
Credentials used to log in to a target website can be stored in a credentials.yaml file.
On deployment, the credentials object is encrypted and stored by import.io. At runtime these credentials are passed to an action's inputs under the key _credentials.
For security reasons it is recommended that credentials.yaml files be gitignored.
Example:
Below is an example credentials.yaml file.
In the example the default credentials are username: example@example.com and password: meep.
Branch specific credentials can be stored in the branches section. In the example, the dev branch references different credentials than the default ones. If no branch specific credentials are specified, default will be used.
---
default:
  username: example@example.com
  password: meep
branches:
  dev:
    username: different@example.com
    password: otherPassword123
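The fallback rule described above (use the branch entry if one exists, otherwise default) can be sketched as a small lookup. selectCredentials is a hypothetical helper written for this illustration and is not part of the import-io tooling; the object simply mirrors the example file after parsing.

```javascript
// Hypothetical helper illustrating the branch fallback rule described above.
function selectCredentials(credentialsFile, branch) {
  const branches = credentialsFile.branches || {};
  // Only use a branch entry that is actually defined in the file.
  if (Object.prototype.hasOwnProperty.call(branches, branch)) {
    return branches[branch];
  }
  return credentialsFile.default;
}

// Mirrors the example credentials.yaml after parsing.
const exampleCredentials = {
  default: { username: 'example@example.com', password: 'meep' },
  branches: {
    dev: { username: 'different@example.com', password: 'otherPassword123' }
  }
};

console.log(selectCredentials(exampleCredentials, 'dev').username);
// prints "different@example.com"
console.log(selectCredentials(exampleCredentials, 'staging').username);
// prints "example@example.com"
```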
Scaffolding
When creating a new robot using the import-io robot:new command, the --authentication and --checkAuthentication flags can be provided to point to the respective entry points.
When creating a new extractor using the import-io extractor:new command, the --auth flag can be provided, which will scaffold out the needed dependencies and add a credentials.yaml file to the extractor directory.
Testing
For testing Authenticated Extractors locally the import-io extractor:run:[local or remote] commands are recommended.
The extractor:run commands will sequentially execute the following entry points:
- checkAuthentication
- authentication (if checkAuthentication fails)
- entryPoint
By default browser sessions are cached for 15 minutes. You can clear your browser state by running import-io cache:clear.
Documentation for these commands can be found by clicking here.
Deploying
Authenticated Extractors can be deployed to SaaS or Workbench, though the restrictions for each differ slightly.
Workbench
Deploying an Authenticated Extractor as a Source to Workbench using the import-io source:deploy command requires:
- A valid User Token configured in the developer environment
- The User Token must belong to the Organization you are attempting to deploy to
- Only the default credentials will be used. Credentials management in Workbench is coming soon
Running on Workbench
To run an Authenticated Extractor on Workbench, the "Legacy Platform Id" must first be saved on the Organization. This ID is used for security purposes to validate that the extractor is authorized to log in as the saved user and perform the web automation it requires for data extraction.