Quick Start Guide
Purpose
By following this guide you will have a Project in Import.io Workbench. The Project will execute and deliver data to your S3 bucket. In the "Further Steps" section we will also implement some data quality Checks on the data and outline steps to make your Project production ready.
This document presumes you already have a Workbench Organization set up.
The high-level steps for creating your Project are:
- Create Shared Assets - These assets can be shared between Projects
Refer to this diagram to understand the relationships of Workbench Entities: Entities and their relationships
1. Define a Schema
A Schema in import.io defines the output shape of the data, along with some validation rules for that data.
- Click Schemas from your org page
- To add a new Schema, click the + icon

- Enter a Name and optionally update the Slug / ID
- Create fields for the data you will be collecting by naming them and clicking "Add Field"
- As you add fields you can also set "Validation Rules".
- When you’ve added fields, click Save As Draft
Here is an example schema with the "Output Preview"
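For a concrete sense of what a schema might contain, here is a purely illustrative sketch of field definitions for a product-pricing collection. The field names, types, and rule syntax below are made up for illustration and are not Workbench's schema format; you define the real fields and Validation Rules through the schema editor described above.

# Illustrative only: example fields you might define for a product-pricing schema.
# Names, types, and rules here are assumptions, not Workbench's actual schema format.
EXAMPLE_FIELDS = [
    {"name": "product_name", "type": "string", "rules": ["required"]},
    {"name": "price",        "type": "number", "rules": ["required", "value >= 0"]},
    {"name": "currency",     "type": "string", "rules": ["length == 3"]},
    {"name": "product_url",  "type": "url",    "rules": ["required"]},
]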

Go Deep on Schemas
2. Create a Destination
Your Project’s Collections publish to a Destination such as S3 or an SFTP site.
It’s a good practice to create a Staging or Test Destination while you are finalizing your Collection.
- Click Destinations from your Org page
- To add a new Destination, click the + icon

- Enter a Name and Type (S3 or SFTP)
- Select the data file format
- Depending on the Type, fill in the credentials for the destination
- There are many filename / path template variables available. Try these for example (see the sample expansion after this list):
  - Bucket: your bucket name
  - Path Template: :org-:project-:start_YYYY-:start_MM-:start_DD
  - Filename: :collection-:source-:snapshot_id.:ext
- Click Save
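For illustration only: assuming :org and :project expand to your org and project slugs, :start_YYYY/:start_MM/:start_DD to the snapshot start date, :collection and :source to the collection and source slugs, :snapshot_id to the snapshot GUID, and :ext to the chosen file format, the templates above would produce an object key roughly like acme-myproject-2024-03-01/products-somesite-<snapshot-guid>.csv in your bucket. All names here are made up; check the output in your test Destination to confirm the exact expansion.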
Go Deep on Destinations
3. Create a new Project
This project will house your collections and sources. Your organization may have many Projects. You need at least one.
- Click into your Org name from https://workbench.import.io/orgs
- Click Create New under Projects
- Enter a Name for your project and a project Slug / ID
Take care to choose a meaningful Slug / ID: it is used throughout the project as a unique identifier and cannot be changed once created.
- Optionally, you can fill in the README with a description of the project. This can be updated later.
- Click Save
Go deep on Projects
4. Add a Collection
Now that you have a Schema, you can create a Collection.
- Click Projects > 'Your Project Name' > Collections

- Enter a Name and update the Slug / ID
- Select the Schema created in the previous step and Save
- Add the locale and domain parameters (case sensitive)

Domain is used for better rate limiting.
Go Deep on Collections
5. Add a Collection Source
A Source is an extractor created on app.import.io or with import-io-cli-public. You will need an extractor ID (GUID) from one of these platforms to proceed. The output of the extractor must match the Schema created in the previous steps.
- To add a new Source, go to the Collection and select Sources
- Click the + icon

- Enter a Name and update the Slug / ID
- Enter an extractor ID and Save
Test your source
You can now try executing the source with a small set of inputs.
- Click the Run Source icon

- This will create a Snapshot with state START_PENDING
- Click refresh to see it go through the states

- The state should be PENDING_IMPORT when it is complete
- To verify the Snapshot data, click Drilldown in your Snapshot
- Run the details query on "Data - Internal" to see the columns of data.
- Once the snapshot is successful, you can set your Source state to ACTIVE; this will allow it to push to your Destination after you complete the following steps.
Go Deep on Sources
6. Link Collection to Destination
- From your Collection page, click Destinations
- Click the "Link/Unlink" toggle
- Select the "Linked" checkbox on the Destination (created in previous steps)
Go Deep on Collection Destinations
7. Add a Project Flow
- From your Project page, click Flows
- To add a new Flow, click the + icon

- Enter a Name and (optionally) update the Slug / ID
- Select Type "SIMPLE" (this lets you control execution from Workbench only)
- You can specify your Collection Information Hours; enter "1" for testing
- You can specify an S3 location for inputs; otherwise it will run your extractor’s defaults.
- Click Save
You are now ready to run your project end-to-end using your project’s flow.
- Select Flows > <flow name> under your Project
- Click the "Run Flow" icon

This will redirect you to the Delivery view of your flow.
- Click refresh to see it go through the states

- When finished, check your S3 bucket for the data.
Go Deep on Flows
Further Steps
Now that you have a working Flow in Workbench, you can kick it off via the API or add Quality Checks.
Create a flow via API
- POST the following to /api/orgs/:slug/flows:
{
  "slug": "mystagingflow",
  "name": "My staging flow",
  "pushHours": 2,
  "dataHours": 1,
  "closeHours": 3,
  "cron": "0 0,12 * * *",
  "active": false,
  "type": "SIMPLE",
  "definition": {
    "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
    "inputUriTemplate": "s3://mybucket/testing/input.json", // optional, uses the access credentials below
    "chunks": 10, // optional, requires inputUriTemplate
    "chunkCollectHours": 1 // optional, requires inputUriTemplate; chunk collection window length
  },
  "encryptedConfig": { // required for inputUriTemplate
    "accessKeyId": "xxx",
    "secretAccessKey": "xxx"
  }
}
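For illustration, here is a minimal sketch of making this call from Python with the requests library. The base URL, the bearer-token Authorization header, and the org slug are assumptions (check your Workbench API credentials and documentation for the exact scheme); the IDs and credentials are the placeholders from the example above.

# Sketch only: create a Workbench flow via the API.
# Assumptions: the API is served under https://workbench.import.io and accepts a
# bearer token in the Authorization header; verify both against your Workbench
# API documentation. Org slug, collection ID and credentials are placeholders.
import requests

ORG_SLUG = "my-org"      # placeholder
API_TOKEN = "xxx"        # placeholder; keep real tokens out of source control

flow = {
    "slug": "mystagingflow",
    "name": "My staging flow",
    "pushHours": 2,
    "dataHours": 1,
    "closeHours": 3,
    "cron": "0 0,12 * * *",   # twice a day, at 00:00 and 12:00
    "active": False,
    "type": "SIMPLE",
    "definition": {
        "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
        "inputUriTemplate": "s3://mybucket/testing/input.json",  # optional
        "chunks": 10,                                            # optional, requires inputUriTemplate
        "chunkCollectHours": 1,                                  # optional, requires inputUriTemplate
    },
    "encryptedConfig": {         # required for inputUriTemplate
        "accessKeyId": "xxx",
        "secretAccessKey": "xxx",
    },
}

resp = requests.post(
    f"https://workbench.import.io/api/orgs/{ORG_SLUG}/flows",
    json=flow,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()
print(resp.json())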
Production configuration
Duplicate the staging environment
- Create the (production) input & output buckets and an IAM user that has read-write access to them, and add test input
- Fork the staging schema into the production schema
- Create a production collection
  - Collection settings - custom output, blank row
- Create a prod extractor by duplicating the staging extractor
  - This should not be changed in future; you just need to PATCH the latestConfigId of the prod one to "deploy" with the good staging latestConfigId
- Duplicate the collection checks
- Create a destination with the same path but to a different test bucket and link it to the collection - DO NOT ACTIVATE THIS
Set up the production collection quality checks
You can set up a number of automated data quality checks on a collection. These use statistics per source.
If you want to trigger alerts to different groups, you can choose to escalate some of the checks; snapshots will go into an ESCALATED state rather than PENDING_QA if any such check fails.
We suggest that you set up:
Description | Type | Metric | Test | Escalate on fail
--- | --- | --- | --- | ---
Blocking | HEALTH_NUMBER | blockPct | < 1% | ✅
System errors | HEALTH_NUMBER | errorPct | < 1% | ✅
Missing HTML snapshots | HEALTH_NUMBER | noHtmlPct | < 1% | ✅
% 200 responses generating no data | HEALTH_NUMBER | noData200Pct | < ±10% |
% 404/410 responses | HEALTH_NUMBER | noData200Pct | < ±10% |
Rows per input/page (when input generates data) | HEALTH_NUMBER | rowsPerPage | < ±10% |
Total pages/inputs | HEALTH_NUMBER | pages | < ±10% |
Total rows | HEALTH_NUMBER | rows | < ±10% |
% Duplicates | HEALTH_NUMBER | dupePct | < ±10% |
% Filtered Rows | HEALTH_NUMBER | filteredPct | < ±10% |
Quality score | HEALTH_NUMBER | qualityScore | > 0.9 |
Validation errors | VALIDATION_ERRORS | rule name | < ±10% |
The validation checks will automatically appear once data is run through, and are initialized to fail if any validation errors are reported.
If you have these checks in place, the data will pass automatically.
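To make the "Test" column above concrete, here is a small sketch of how the two kinds of thresholds can be read: checks like "blockPct < 1%" compare the metric against a fixed limit, while one plausible reading of the "< ±10%" checks is that the metric should stay within 10% of its recent per-source baseline. This is only an illustration of the intent, not Workbench's actual implementation.

# Illustrative only: how the two threshold styles in the table can be interpreted.
def under_absolute_limit(current_pct: float, limit_pct: float = 1.0) -> bool:
    """Checks such as 'blockPct < 1%': the metric must stay under a fixed limit."""
    return current_pct < limit_pct

def within_tolerance(current: float, baseline: float, tolerance: float = 0.10) -> bool:
    """Checks such as 'rows < ±10%': stay within 10% of the recent per-source baseline."""
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) < tolerance

print(under_absolute_limit(0.4))        # True: 0.4% of inputs blocked is below 1%
print(within_tolerance(8500, 10000))    # False: rows dropped 15% versus the baseline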
Production flow configuration
- Create an inactive production flow configuration with the correct timings, but still with the test inputs, e.g.
{
  ...
  "pushHours": 24,
  "dataHours": 12,
  "closeHours": 48,
  "definition": {
    "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
    "inputUriTemplate": "s3://mybucket/testing/input.json", // optional, uses the access credentials below
    "chunks": 48, // optional, requires inputUriTemplate
    "chunkCollectHours": 1 // optional, requires inputUriTemplate; chunk collection window length
  },
  ...
}
Create alarms and subscriptions
It is suggested that you create a PagerDuty subscription and link it to an Alarm Group.
The Alarm Group status is ALARM if any of the Alarms in the group is in state ALARM.
When the Alarm Group transitions in or out of ALARM, we create or resolve an event in PagerDuty with severity error.
Alarms within a group can be, for example:
- When snapshot status counts breach a threshold:
  - snapshots.status.ALL_FAILED = INTERNAL_FAILURE + FAILED + START_FAILED > 0
  - snapshots.status.PENDING_QA > 0
  - snapshots.status.FAILED_QA > 0
  - snapshots.status.ESCALATED > 0
- When push status counts breach a threshold:
  - pushes.status.FAILED > 0
- When the number of snapshots that should have finished, but have not (snapshots.afterEndByCount), is > X (based on constant throughput)
- When the number of snapshots that should have pushed, but have not (snapshots.afterDeliverByCount), is > X (based on constant throughput)
- When the % of inputs that have been processed within the specified collection windows (health.collectedByPct) is < X%
- When the % of inputs that have been blocked (health.blockedPct) is > 1%
- When the % of inputs that have had errors (health.errorPct) is > 1%
- When the % of inputs that return a 200 but have no data (health.noData200Pct) is > 1%
- When the % of inputs that return a 200 but have no HTML captured (health.noHtmlPct) is > 1%
If you have different groups to respond to different issues, you can set up multiple alarm groups.
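As a small illustration of the Alarm Group rule described above (the group is ALARM if any of its Alarms is in state ALARM), here is a sketch; the alarm names simply mirror the bullet list and this is not Workbench code.

# Sketch only: derive an Alarm Group's status from its member alarms, per the
# rule above. Alarm names mirror the examples in this section.
from typing import Dict

def alarm_group_status(alarm_states: Dict[str, str]) -> str:
    """Return "ALARM" if any alarm in the group is in state ALARM, else "OK"."""
    return "ALARM" if any(state == "ALARM" for state in alarm_states.values()) else "OK"

example_group = {
    "snapshots.status.FAILED_QA > 0": "OK",
    "pushes.status.FAILED > 0": "ALARM",
    "health.blockedPct > 1%": "OK",
}
print(alarm_group_status(example_group))  # "ALARM" -> a PagerDuty event would be created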
Try running the production flow manually
At this point the flow will not push any data, but you should be able to check that it is working as expected.
Integration
Activate the flow to enable the schedule
Now the flow should be generating deliveries on the configured timescales, but with the test inputs (of which there are fewer, e.g. 1 QPS).
Set up bucket replication
Now set up replication from your managed output bucket to your customer output bucket, and from their input bucket to your input bucket.
Use the real inputs
Now PATCH (or use the UI) to set up the flow definition to use the real input URIs:
{
  "definition": {
    "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
    "inputUriTemplate": "s3://{bucketname}/:YYYY/:ww/:source/inputs.json", // optional, uses the access credentials below
    "chunks": 48, // optional, requires inputUriTemplate
    "chunkCollectHours": 1 // optional, requires inputUriTemplate; chunk collection window length
  }
}
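For illustration, the same change could be made from Python; the flow endpoint path and the bearer-token header below are assumptions (this guide only documents POST /api/orgs/:slug/flows), so confirm them against the Workbench API before relying on this.

# Sketch only: point the flow at the real input URIs via a PATCH request.
# Assumptions: the flow resource lives at /api/orgs/:slug/flows/:flowSlug and the
# API accepts a bearer token; org and flow slugs are placeholders.
import requests

ORG_SLUG = "my-org"            # placeholder
FLOW_SLUG = "mystagingflow"    # placeholder
API_TOKEN = "xxx"              # placeholder

patch_body = {
    "definition": {
        "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
        "inputUriTemplate": "s3://{bucketname}/:YYYY/:ww/:source/inputs.json",
        "chunks": 48,
        "chunkCollectHours": 1,
    }
}

resp = requests.patch(
    f"https://workbench.import.io/api/orgs/{ORG_SLUG}/flows/{FLOW_SLUG}",
    json=patch_body,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()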
Operational issues
Snapshots failing to start
If a snapshot fails to start, you should set up PagerDuty to inform you.
You can then go to the snapshot page to see the error reason. If it was a transient error you can click the "retry start" action to try again.
QA Failures
When QA failures happen, they should trigger an alarm in PagerDuty. Normally this will be snapshots being transitioned to either PENDING_QA or ESCALATED, depending on whether a failing check had the "escalate" option turned on. The PagerDuty alert will contain the URL to the delivery page where the issues are.
Someone then needs to go and work out why QA has failed. That person should mark the snapshot as assigned to them; in the case of PENDING_QA this will automatically transition it into QA.
There are several things that can then happen:
False alarm
You inspect the data and see on the checks page exactly why we think the automated quality checks failed. The data looks OK, so you move the snapshot to PASSED_QA.
Bad data
There was some bad data, the % of blocked inputs was too high, etc.
There are some options:
- Go ahead and push the data because it’s better to push it even with missing data - you should put the snapshot into a PUSHED_IGNORE state so it doesn’t mess up the statistics for "good data". Alert someone to look at the extractor so it performs better next time.
- Edit the calculated columns, parameters or other settings and re-import the same data with the updated configuration - you can do this by clicking the "re-import" action - this will move the old one into a SUPERCEDED state
- Edit the extractor and re-import the data with an updated extractor configuration - you can do this by clicking the "re-extract" action - this will move the old one into a SUPERCEDED state
- Entirely re-run the extractor by clicking the "re-run" action on the snapshot page - this will move the old one into a SUPERCEDED state
We have not yet implemented partial retries, where we retry some inputs based on a condition (certain error code, etc.).
Editing extractors
You should always edit the staging version of the extractor, and then update the latestConfigId in the production version when you have a new version.
Make sure you test the staging extractor against relevant data. Currently, this is done by changing the inputs in the legacy SaaS application, which can be done over the API. You can also download the inputs from the product over the API.
Inputs will soon be editable in Workbench.
There are currently some extractor-level settings that you may need to manually copy over, e.g. proxy pool.