Quick Start Guide

Purpose

By following this guide you will create a Project in Import.io Workbench. The Project will execute and deliver data to your S3 bucket. The "Further Steps" section also implements some data quality Checks on the data and outlines the steps to make your Project production-ready.

This document presumes you already have a Workbench Organization set up.

The high-level steps for creating your Project are outlined in the numbered sections below.

Refer to this diagram to understand the relationships between Workbench entities:
Entities and their relationships
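
If the diagram is unavailable, the hierarchy used in this guide is roughly the following (derived from the steps below; the diagram remains the authoritative reference):

    Organization
    ├── Schemas (shared assets)
    ├── Destinations (shared assets)
    └── Projects
        ├── Collections — use a Schema, link to Destinations, contain Sources
        │   └── Sources — produce Snapshots of data
        └── Flows — schedule and execute a Collection end-to-end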

Create Shared Assets

1. Define a Schema

A Schema in import.io defines the output shape of your data, along with validation rules for that data.

  • Click Schemas from your org page

  • To add a new Schema click the + icon

add plus
  • Enter a Name and optionally update the slug/ID

  • Create fields for the data you will be collecting by naming them and clicking "Add Field"

  • As you add fields you can also set "Validation Rules".

  • When you’ve added fields click Save As Draft

Here is an example schema with the "Output Preview"

qs example schema
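
Purely as an illustration of what field definitions and validation rules express, here is a minimal Python sketch; the field names and rules are assumed examples, not Workbench's actual validation engine:

    # Illustrative only: the kind of per-field checks a Schema's validation
    # rules describe. Field names and rules are assumed examples.
    schema_fields = {
        "product_name": {"type": str, "required": True},
        "price":        {"type": float, "required": True},
        "in_stock":     {"type": bool, "required": False},
    }

    def validate_row(row: dict) -> list:
        """Return human-readable validation errors for one output row."""
        errors = []
        for field, rules in schema_fields.items():
            value = row.get(field)
            if value is None:
                if rules["required"]:
                    errors.append(f"{field}: missing required value")
                continue
            if not isinstance(value, rules["type"]):
                errors.append(f"{field}: expected {rules['type'].__name__}")
        return errors

    print(validate_row({"product_name": "Widget", "price": "9.99"}))
    # -> ['price: expected float']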

Go Deep on Schemas

2. Create a Destination

Your Project's Collections publish to a Destination such as an S3 bucket or an SFTP site.

It’s a good practice to create a Staging or Test Destination while you are finalizing your Collection.
  • Click Destinations from your Org page

  • To add a new Destination click the + icon

add plus
  • Enter a Name and Type (S3 or SFTP)

  • Select the data file format

  • Depending on the Type, fill in the credentials for the destination

  • There are many filename / path template variables available; a sketch of how they expand follows this list.
    Try these for example:

    • Bucket: Your bucket name

    • Path Template:

          :org-:project-:start_YYYY-:start_MM-:start_DD
    • Filename:

           :collection-:source-:snapshot_id.:ext
  • Click Save
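
As a rough illustration of how these templates expand, here is a minimal Python sketch; the slugs and snapshot values are assumed placeholders, and Workbench performs the real substitution server-side:

    # Illustrative only: expanding the path and filename templates above with
    # assumed values. Not Workbench's actual template engine.
    path_template = ":org-:project-:start_YYYY-:start_MM-:start_DD"
    filename_template = ":collection-:source-:snapshot_id.:ext"

    values = {
        "org": "acme",              # org slug (assumed)
        "project": "myproject",     # project slug (assumed)
        "start_YYYY": "2024",       # snapshot start date parts (assumed)
        "start_MM": "05",
        "start_DD": "01",
        "collection": "products",   # collection slug (assumed)
        "source": "site-a",         # source slug (assumed)
        "snapshot_id": "3f2a9c1b",  # snapshot id (assumed)
        "ext": "csv",               # the file format selected above
    }

    def expand(template: str) -> str:
        for key, val in values.items():
            template = template.replace(f":{key}", val)
        return template

    print(expand(path_template))      # acme-myproject-2024-05-01
    print(expand(filename_template))  # products-site-a-3f2a9c1b.csv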

Go Deep on Destinations

New Project steps

3. Create a new Project

This Project will house your Collections and Sources. Your organization may have many Projects; you need at least one.

You should make an effort to edit the Slug / ID to be something useful, as it is used throughout the project as a unique ID and cannot be changed once created.

  • Click Projects from your Org page and click the + icon to add a new Project

  • Enter a Name and update the Slug / ID

  • Optionally, you can fill in the README with a description of the project. This can be updated later.

  • Click Save

Go Deep on Projects

4. Add a Collection

Now that you have a Schema, you can create a Collection.

  • Click Projects > 'Your Project Name' > Collections, then click the + icon to add a new Collection

add plus
  • Enter a Name and update the Slug / ID

  • Select the Schema created in the previous step and Save

  • Add the locale and domain parameters (case sensitive)

qs parameters
Domain is used for better rate limiting
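
For example, the parameter values might look like this (illustrative values only):

    locale: en-US
    domain: example.com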

Go Deep on Collections

5. Add a Collection Source

A Source is an extractor created on app.import.io or with import-io-cli-public.

You will need an extractor ID (guid) from one of these platforms to proceed. The output of the extractor must match the schema created in the previous steps.

  • To add a new Source go to the Collection and select Sources

  • click the + icon

add plus
  • Enter a Name and update the Slug / ID

  • Enter an extractor ID and Save

Test your source

You can now try executing the source with a small set of inputs.

  • Click the Run Source Icon

qs runsource
  • This will create a Snapshot with State START_PENDING

  • Click refresh to see it go through the states

qs refresh
  • The STATE should be PENDING_IMPORT when it is complete

  • To verify the Snapshot data, click Drilldown in your Snapshot

  • Run the details query on "Data - Internal" to see the columns of data.

  • Once the snapshot is successful you can set your Source STATE to ACTIVE; this will allow it to push to your Destination after you complete the following steps.

Go Deep on Sources

6. Link a Destination to your Collection

  • From your Collection page click Destinations

  • Click the "Link/Unlink" toggle

  • Select the "Linked" checkbox on the Destination (created in the previous steps)

7. Add a Project Flow

  • From your Project Page click Flows

  • To add a new Flow click the + icon

add plus
  • Enter a Name and (optionally) update the Slug / ID

  • Select Type "SIMPLE" (this allows you to control execution from Workbench only)

  • You can specify your Collection Information Hours; enter "1" for testing

  • You can specify an S3 location for inputs, otherwise it will run your extractor’s defaults.

  • Click Save

You are now ready to run your project end-to-end using your project’s flow.

  • Select Flows > <flow name> under your Project

  • Click the "Run Flow" icon

qs runflow

This will redirect you to the Delivery view of your flow

  • Click refresh to see it go through the states

qs refresh
  • When finished, check your S3 bucket for the data.

Go Deep on Flows

Further Steps

Now that you have a working Flow in Workbench you can kick it off via the API or add Quality Checks.

Create a flow via API

  • POST the following payload to /api/orgs/:slug/flows (a request sketch follows the payload)

{
  "slug": "mystagingflow",
  "name": "My staging flow",
  "pushHours": 2,
  "dataHours": 1,
  "closeHours": 3,
  "cron": "0 0,12 * * *",
  "active": false,
  "type": "SIMPLE",
  "definition": {
    "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
    "inputUriTemplate": "s3://mybucket/testing/input.json", // optional, uses the access credentials below
    "chunks": 10, // optional, requires inputUriTemplate
    "chunkCollectHours": 1  // optional, requires inputUriTemplate; chunk collection window length
  },
  "encryptedConfig": {  // required for inputUriTemplate
    "accessKeyId": "xxx",
    "secretAccessKey": "xxx"
  }
}
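
If you prefer to script this, here is a minimal Python sketch of the POST above using the requests library. The base URL and the bearer-token Authorization header are assumptions; substitute whatever your Workbench organization actually uses.

    # A minimal sketch of creating the flow over the API with the payload above.
    # BASE_URL and the Authorization header style are assumptions.
    import requests

    BASE_URL = "https://workbench.import.io"   # assumed base URL
    ORG_SLUG = "my-org"                        # your org slug
    API_KEY = "xxx"                            # assumed bearer-token credential

    payload = {
        "slug": "mystagingflow",
        "name": "My staging flow",
        "pushHours": 2,
        "dataHours": 1,
        "closeHours": 3,
        "cron": "0 0,12 * * *",  # runs at 00:00 and 12:00 (server time zone)
        "active": False,
        "type": "SIMPLE",
        "definition": {
            "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
            "inputUriTemplate": "s3://mybucket/testing/input.json",
            "chunks": 10,
            "chunkCollectHours": 1,
        },
        "encryptedConfig": {
            "accessKeyId": "xxx",
            "secretAccessKey": "xxx",
        },
    }

    resp = requests.post(
        f"{BASE_URL}/api/orgs/{ORG_SLUG}/flows",
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    print(resp.json())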

Create your first delivery

  • Upload the test inputs to the location you defined in the flow

  • Run the flow manually, either from the UI or by POSTing an empty body to /api/orgs/:slug/flows/:id/_start (see the sketch after this list)

  • Check that the input gets picked up, is chunked, and is delivered to the S3 URI as expected.
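
A minimal Python sketch of the manual start call, under the same assumed base URL and auth as the earlier example (the flow id value is also assumed):

    # Starts the flow by POSTing an empty body to the _start endpoint.
    import requests

    BASE_URL = "https://workbench.import.io"   # assumed base URL
    ORG_SLUG = "my-org"                        # your org slug
    FLOW_ID = "mystagingflow"                  # flow id or slug (assumed)
    API_KEY = "xxx"                            # assumed bearer-token credential

    resp = requests.post(
        f"{BASE_URL}/api/orgs/{ORG_SLUG}/flows/{FLOW_ID}/_start",
        json={},  # empty body, per the step above
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()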

Production configuration

Duplicate the staging environment

  • Create the (production) input & output buckets and an IAM user that has read-write access to them, and add test input

  • Fork the staging schema into the production schema

  • Create a production collection

    • Collection settings - custom output, blank row

  • Create a prod extractor by duplicating the staging extractor

    • The prod extractor should not be changed in future; to "deploy" a new version, you just PATCH the latestConfigId of the prod extractor to the good staging latestConfigId

  • Duplicate the collection checks

  • Create a Destination with the same path but pointing to a different test bucket and link it to the collection - DO NOT ACTIVATE THIS

Set up the production collection quality checks

You can set up a number of automated data quality checks on a collection. These use per-source statistics.

If you want to trigger alerts to different groups, you can choose to escalate some of the checks; if any such check fails, snapshots will go into an ESCALATED state rather than PENDING_QA.

We suggest that you set up:

| Description | Type | Metric | Test | Escalate on fail | Blocking |
| --- | --- | --- | --- | --- | --- |
| % Blocked | HEALTH_NUMBER | blockPct | < 1% | | |
| System errors | HEALTH_NUMBER | errorPct | < 1% | | |
| Missing HTML snapshots | HEALTH_NUMBER | noHtmlPct | < 1% | | |
| % 200 responses generating no data | HEALTH_NUMBER | noData200Pct | < ±10% | | |
| % 404/410 responses | HEALTH_NUMBER | noData200Pct | < ±10% | | |
| Rows per input/page (when input generates data) | HEALTH_NUMBER | rowsPerPage | < ±10% | | |
| Total pages/inputs | HEALTH_NUMBER | pages | < ±10% | | |
| Total rows | HEALTH_NUMBER | rows | < ±10% | | |
| % Duplicates | HEALTH_NUMBER | dupePct | < ±10% | | |
| % Filtered Rows | HEALTH_NUMBER | filteredPct | < ±10% | | |
| Quality score | HEALTH_NUMBER | qualityScore | > 0.9 | | |
| Validation errors | VALIDATION_ERRORS | rule name | < ±10% | | |

The validation checks will automatically appear once data is run through, and are initialized to fail if any validation errors are reported.

With these checks in place, data that passes them will pass QA automatically.
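
As a worked illustration of the "< ±10%" tests, assuming they compare a snapshot's metric against the per-source statistics mentioned above (the exact baseline Workbench uses is not shown here):

    # Illustrative only: evaluating a "< ±10%" check on total rows against an
    # assumed per-source baseline.
    baseline_rows = 10_000   # historical per-source figure (assumed)
    snapshot_rows = 8_800    # rows in the snapshot under review (assumed)

    deviation_pct = abs(snapshot_rows - baseline_rows) / baseline_rows * 100
    print(f"{deviation_pct:.1f}% deviation")  # 12.0% deviation
    print(deviation_pct < 10)                 # False -> the check fails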

Enable data store

Turn on a data store table for the production collection.

Production flow configuration

  • Create an inactive production flow configuration with the correct timings, but still with the test inputs, e.g.

{
  ...
  "pushHours": 24,
  "dataHours": 12,
  "closeHours": 48,
  "definition": {
    "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
    "inputUriTemplate": "s3://mybucket/testing/input.json", // optional, uses the access credentials below
    "chunks": 48, // optional, requires inputUriTemplate
    "chunkCollectHours": 1  // optional, requires inputUriTemplate; chunk collection window length
  },
  ...
}

Create alarms and subscriptions

It is suggested that you create a PagerDuty subscription and link it to an Alarm Group.

The Alarm Group status is ALARM if any of the Alarms in the group is in state ALARM.

When the Alarm Group transitions in or out of ALARM, we create or resolve an event in PagerDuty with severity "error".

Alarms within a group can be, for example:

  • When snapshot status counts breach a threshold:

    • snapshots.status.ALL_FAILED = INTERNAL_FAILURE + FAILED + START_FAILED > 0

    • snapshots.status.PENDING_QA > 0

    • snapshots.status.FAILED_QA > 0

    • snapshots.status.ESCALATED > 0

  • When push status counts breach a threshold:

    • pushes.status.FAILED > 0

  • When the number of snapshots that should have finished, but have not (snapshots.afterEndByCount) > X (based on constant throughput)

  • When the number of snapshots that should have pushed, but have not (snapshots.afterDeliverByCount) > X (based on constant throughput)

  • When the % of inputs that have been processed within the specified collection windows is (health.collectedByPct) < X%

  • When the % of inputs that have been blocked is (health.blockedPct) > 1%

  • When the % of inputs that have had errors is (health.errorPct) > 1%

  • When the % of inputs that return a 200 but have no data is (health.noData200Pct) > 1%

  • When the % of inputs that return a 200 but have no HTML captured is (health.noHtmlPct) > 1%

If you have different groups to respond to different issues, you can set up multiple alarm groups.

Try running the production flow manually

At this point this will not push any data, but you should be able to check that it is working as expected.

Integration

Activate the flow to enable the schedule

Now the flow should be generating deliveries on the configured timescales, but with test inputs (of which there are fewer, e.g. 1 QPS).

Set up bucket replication

Now set up replication from your managed output bucket to your customer output bucket, and from their input bucket to your input bucket.
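
How you configure replication depends on your AWS setup; as one hedged sketch, here is how it might be done with boto3 (the bucket names and IAM role ARN are placeholders, and both buckets must have versioning enabled for S3 replication to work):

    # A sketch of enabling S3 replication from the managed output bucket to the
    # customer output bucket using boto3. Bucket names and the replication role
    # ARN are assumed placeholders; repeat in the other direction for inputs.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_replication(
        Bucket="my-managed-output-bucket",  # source bucket (assumed name)
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
            "Rules": [
                {
                    "ID": "replicate-deliveries",
                    "Priority": 1,
                    "Status": "Enabled",
                    "Filter": {},  # replicate everything
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {
                        "Bucket": "arn:aws:s3:::customer-output-bucket"
                    },
                }
            ],
        },
    )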

Use the real inputs

Now PATCH (or use the UI) to set the flow definition to use the real input URIs:

{
  "definition": {
    "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
    "inputUriTemplate": "s3://{bucketname}/:YYYY/:ww/:source/inputs.json", // optional, uses the access credentials below
    "chunks": 48, // optional, requires inputUriTemplate
    "chunkCollectHours": 1  // optional, requires inputUriTemplate; chunk collection window length
  }
}
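
A minimal Python sketch of that PATCH, under the same assumed base URL and auth as the earlier examples (the exact PATCH endpoint path is also an assumption; the UI is the safe option if in doubt):

    # PATCHes the flow definition to point at the real input URIs.
    import requests

    BASE_URL = "https://workbench.import.io"   # assumed base URL
    ORG_SLUG = "my-org"                        # your org slug
    FLOW_ID = "myproductionflow"               # flow id or slug (assumed)
    API_KEY = "xxx"                            # assumed bearer-token credential

    patch_body = {
        "definition": {
            "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
            "inputUriTemplate": "s3://{bucketname}/:YYYY/:ww/:source/inputs.json",
            "chunks": 48,
            "chunkCollectHours": 1,
        }
    }

    resp = requests.patch(
        f"{BASE_URL}/api/orgs/{ORG_SLUG}/flows/{FLOW_ID}",
        json=patch_body,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()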

Activate the destination

Now when you are ready to publish data to the replicated output bucket, activate the destination.

Set up more sources

Go through the process of setting up sources in your staging collection, and when you are happy with them, add them to the production collection.

Operational issues

Snapshots failing to start

If a snapshot fails to start, you should set up PagerDuty to inform you.

You can then go to the snapshot page to see the error reason. If it was a transient error you can click the "retry start" action to try again.

QA Failures

When QA failures happen, they should trigger an alarm in PagerDuty. Normally this will be snapshots being transitioned to either PENDING_QA or ESCALATED, depending on whether a failing check had the "escalate" option turned on. The PagerDuty alert will contain the URL of the delivery page where the issues are.

Someone then needs to work out why QA failed. That person should mark the snapshot as assigned to them; in the case of PENDING_QA this will automatically transition it into QA.

There are multiple things that then can happen:

False alarm

You look at the checks page to see exactly why the automated quality checks were flagged as failing, and inspect the data. The data looks OK, so you move the snapshot to PASSED_QA.

Bad data

There was some bad data, the % of blocked inputs was too high, etc.

There are some options:

  1. Go ahead and push the data because it’s better to deliver it even with missing data: put the snapshot into a PUSHED_IGNORE state so it doesn’t skew the statistics for "good data", and alert someone to look at the extractor so it performs better next time.

  2. Edit the calculated columns, parameters or other settings and re-import the same data with the updated configuration - you can do this by clicking the "re-import" action - this will move the old one into a SUPERSEDED state

  3. Edit the extractor and re-import the data with an updated extractor configuration - you can do this by clicking the "re-extract" action - this will move the old one into a SUPERSEDED state

  4. Entirely re-run the extractor by clicking the "re-run" action on the snapshot page - this will move the old one into a SUPERSEDED state

We have not yet implemented partial retries, where we retry some inputs based on a condition (certain error code, etc.)

Editing extractors

You should always edit the staging version of the extractor, and then update the latestConfigId in the production version when you have a new version.

Make sure you test the staging extractor against relevant data. Currently, this is done by changing the inputs in the legacy SaaS application, which can be done over the API. You can also download the inputs from the product over the API.

Inputs will soon be editable in Workbench.
There are currently some extractor-level settings that you may need to copy over manually, e.g. the proxy pool.