Quick start

Presuming you already have an organization set up…

Initial development

Create the organization

Create the org in the product.

Create a legacy SaaS account

  • Provision the correct number of credits

Create a new project

This project will house the staging and production collections and sources.

Set up your staging environment

Create a draft staging schema for the data you are collecting

Create a staging collection

  • Add the locale, tz and domain parameters (case sensitive)

  • Domain is used for better rate limiting

  • Set up whatever "custom output" settings you need

Create a staging extractor that matches your schema

  • Build it in the account that is linked to the organization

  • Test it in the legacy UI by doing small crawl runs

Add a source to the staging collection

  • Set the parameters

  • Link it to the staging extractor

TODO: enforce the extractor being in the correct account

Add a test destination to the staging collection

  • Create a test destination with the same path variables, but pointing to a different test bucket, and link it to the collection

    • Note that it could have /:org/:project/:collection/:source/ in the path if you want to reuse the test bucket across projects & customers

Run the source

Run the source from the UI or API. Check that the data is good. Alternatively, import a crawl run from the legacy platform.

TODO: Make it an option(?) that if running sources/importing snapshots they do not get pushed, or make destinations part of the flow config (rather than attached to the collection)

Publish the schema

Publish the schema now so that the data can go through QA.

Check published data

  • Re-import the snapshot, or re-run to generate data.

  • Manually QA it, and pass it.

  • Check that the data files are correctly written to the S3 bucket.

Create a flow

  • POST to /api/orgs/:slug/flows with a body like:

  {
    "slug": "mystagingflow",
    "name": "My staging flow",
    "pushHours": 2,
    "dataHours": 1,
    "closeHours": 3,
    "cron": "0 0,12 * * *",
    "active": false,
    "type": "SIMPLE",
    "definition": {
      "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
      "inputUriTemplate": "s3://mybucket/testing/input.json", // optional, uses the access credentials below
      "chunks": 10, // optional, requires inputUriTemplate
      "chunkCollectHours": 1 // optional, requires inputUriTemplate; chunk collection window length
    },
    "encryptedConfig": { // required for inputUriTemplate
      "accessKeyId": "xxx",
      "secretAccessKey": "xxx"
    }
  }
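
As a sketch, the flow-creation call could be made from Python's standard library like this. The base URL and bearer-token auth header are assumptions; substitute your deployment's values and auth scheme:

```python
import json
import urllib.request

API_BASE = "https://app.example.com/api"  # assumption: your deployment's base URL
TOKEN = "xxx"                             # assumption: bearer-token auth

def create_flow_request(org_slug: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /api/orgs/:slug/flows request."""
    return urllib.request.Request(
        f"{API_BASE}/orgs/{org_slug}/flows",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {TOKEN}",
        },
        method="POST",
    )

flow = {
    "slug": "mystagingflow",
    "name": "My staging flow",
    "pushHours": 2,
    "dataHours": 1,
    "closeHours": 3,
    "cron": "0 0,12 * * *",
    "active": False,
    "type": "SIMPLE",
    "definition": {"collectionId": "ec47317d-04da-4678-8be0-32d02a54e955"},
}
req = create_flow_request("myorg", flow)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```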

Create your first delivery

  • Upload the test inputs to where you defined them in the flow

  • Run the flow manually, either from the UI or by POSTing an empty body to /api/orgs/:slug/flows/:id/_start

  • Check that the input is picked up, chunked, and delivered to the S3 URI as expected.
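
A minimal sketch of the manual-start call in Python; the base URL is an assumption, and any auth headers your deployment needs would be added the same way as for flow creation:

```python
import urllib.request

API_BASE = "https://app.example.com/api"  # assumption: your deployment's base URL

def start_flow_request(org_slug: str, flow_id: str) -> urllib.request.Request:
    """Build the manual-start call: POST with an empty body to
    /api/orgs/:slug/flows/:id/_start."""
    return urllib.request.Request(
        f"{API_BASE}/orgs/{org_slug}/flows/{flow_id}/_start",
        data=b"",  # empty body
        method="POST",
    )

req = start_flow_request("myorg", "ec47317d-04da-4678-8be0-32d02a54e955")
# urllib.request.urlopen(req)  # uncomment to actually send the request
```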

Production configuration

Duplicate the staging environment

TODO: this could be automated

  • Create the (production) input & output buckets and an IAM user that has read-write access to them, and add test input

  • Fork the staging schema into the production schema

  • Create a production collection

    • Collection settings - custom output, blank row

  • Create a prod extractor by duplicating the staging extractor

    • This should not be changed directly in future; to "deploy" a new version, you just PATCH the latestConfigId of the prod extractor to the known-good staging latestConfigId

  • Duplicate the collection checks

    • TODO: this should be able to be automated

  • Create a destination with the same path, but pointing to a different test bucket, and link it to the collection - DO NOT ACTIVATE THIS
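
A hedged sketch of the "deploy" step described above. The PATCH endpoint path used here (/api/extractors/:id) is an assumption, as this document does not give the exact extractor URL; only the latestConfigId body field comes from the text above:

```python
import json
import urllib.request

def deploy_extractor_request(api_base: str, prod_extractor_id: str,
                             staging_latest_config_id: str,
                             token: str) -> urllib.request.Request:
    """Build the PATCH that points the production extractor at a known-good
    staging config. Endpoint path is an assumption; adjust to your API."""
    return urllib.request.Request(
        f"{api_base}/extractors/{prod_extractor_id}",
        data=json.dumps({"latestConfigId": staging_latest_config_id}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="PATCH",
    )

req = deploy_extractor_request("https://app.example.com/api",
                               "prod-extractor-id",
                               "good-staging-config-id",
                               "xxx")
```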

Set up the production collection quality checks

You can set up a number of automated data quality checks on a collection. These checks use per-source statistics.

If you want to trigger alerts to different groups, you can choose to escalate some of the checks; snapshots will then go into an ESCALATED state rather than PENDING_QA if any such check fails.
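
The PENDING_QA/ESCALATED rule above can be sketched as follows (illustrative only, not the product's code; the state names come from this document):

```python
def snapshot_qa_state(failed_checks):
    """failed_checks: (check_name, escalate) pairs for the checks that failed.
    Returns the state a snapshot lands in after automated QA."""
    if not failed_checks:
        return "PASSED_QA"   # all checks passed: data passes automatically
    if any(escalate for _, escalate in failed_checks):
        return "ESCALATED"   # at least one failing check has escalate on
    return "PENDING_QA"      # failures, but none marked "escalate"
```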

We suggest that you set up:

  • % Blocked: < 1% (escalate on fail)

  • System errors: < 1%

  • Missing HTML snapshots: < 1%

  • % 200 responses generating no data: < ±10%

  • % 404/410 responses: < ±10%

  • Rows per input/page (when input generates data): < ±10%

  • Total pages/inputs: < ±10%

  • Total rows: < ±10%

  • % Duplicates: < ±10%

  • % Filtered Rows: < ±10%

  • Quality score: > 0.9

  • Validation errors (per rule name): < ±10%

The validation checks will automatically appear once data is run through, and are initialized to fail if any validation errors are reported.

With these checks in place, data that passes all of them will pass QA automatically.

Enable data store

Turn on a data store table for the production collection.

Production flow configuration

  • Create an inactive production flow configuration with the correct timings, but still with the test inputs, e.g.

  {
    "pushHours": 24,
    "dataHours": 12,
    "closeHours": 48,
    "definition": {
      "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
      "inputUriTemplate": "s3://mybucket/testing/input.json", // optional, uses the access credentials below
      "chunks": 48, // optional, requires inputUriTemplate
      "chunkCollectHours": 1 // optional, requires inputUriTemplate; chunk collection window length
    }
  }

Create alarms and subscriptions

It is suggested that you create a PagerDuty subscription and link it to an Alarm Group.

The Alarm Group status is ALARM if any of the Alarms in the group is in state ALARM.

When the Alarm Group transitions in or out of ALARM, we create or resolve an event in PagerDuty with severity error.

Alarms within a group can be, for example:

  • When snapshot status counts breach a threshold:

    • snapshots.status.PENDING_QA > 0

    • snapshots.status.FAILED_QA > 0

    • snapshots.status.ESCALATED > 0

  • When push status counts breach a threshold:

    • pushes.status.FAILED > 0

  • When the number of snapshots that should have finished, but have not (snapshots.afterEndByCount) > X (based on constant throughput)

  • When the number of snapshots that should have pushed, but have not (snapshots.afterDeliverByCount) > X (based on constant throughput)

  • When the % of inputs that have been processed within the specified collection windows is (health.collectedByPct) < X%

  • When the % of inputs that have been blocked is (health.blockedPct) > 1%

  • When the % of inputs that have had errors is (health.errorPct) > 1%

  • When the % of inputs that return a 200 but have no data is (health.noData200Pct) > 1%

  • When the % of inputs that return a 200 but have no HTML captured is (health.noHtmlPct) > 1%

If you have different groups to respond to different issues, you can set up multiple alarm groups.
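
The Alarm Group and PagerDuty semantics described above can be sketched as (illustrative only, not the product's implementation):

```python
def alarm_group_status(alarm_states):
    """An Alarm Group is ALARM if any alarm in the group is in state ALARM."""
    return "ALARM" if any(s == "ALARM" for s in alarm_states) else "OK"

def pagerduty_action(previous, current):
    """Transitions into ALARM create a PagerDuty event (severity error);
    transitions out of ALARM resolve it; otherwise nothing happens."""
    if previous != "ALARM" and current == "ALARM":
        return "trigger"
    if previous == "ALARM" and current != "ALARM":
        return "resolve"
    return None
```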

Try running the production flow manually

At this point this will not push any data, but you should be able to check that it is working as expected.


Activate the flow to enable the schedule

Now the flow should be generating deliveries on the configured schedule, but still with test inputs (of which there are fewer, e.g. 1 QPS).

Set up bucket replication

Now set up replication from your managed output bucket to your customer output bucket, and from their input bucket to your input bucket.

Use the real inputs

Now PATCH (or use the UI) to update the flow definition to use the real input URIs:

  "definition": {
    "collectionId": "ec47317d-04da-4678-8be0-32d02a54e955",
    "inputUriTemplate": "s3://{bucketname}/:YYYY/:ww/:source/inputs.json", // optional, uses the access credentials below
    "chunks": 48, // optional, requires inputUriTemplate
    "chunkCollectHours": 1 // optional, requires inputUriTemplate; chunk collection window length
  }
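
For illustration, here is how the :YYYY/:ww/:source placeholders might expand. This sketch assumes ISO year/week numbering for :YYYY/:ww; confirm the actual expansion rules for your deployment before relying on it:

```python
from datetime import date

def expand_input_uri(template: str, source: str, on: date) -> str:
    """Expand :YYYY (ISO year), :ww (ISO week) and :source in an
    inputUriTemplate. ISO-week semantics are an assumption here."""
    iso_year, iso_week, _ = on.isocalendar()
    return (template
            .replace(":YYYY", f"{iso_year:04d}")
            .replace(":ww", f"{iso_week:02d}")
            .replace(":source", source))

uri = expand_input_uri("s3://mybucket/:YYYY/:ww/:source/inputs.json",
                       "mysource", date(2024, 1, 4))
# under ISO-week rules, 2024-01-04 falls in week 01 of 2024
```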

Activate the destination

Now when you are ready to publish data to the replicated output bucket, activate the destination.

Set up more sources

Go through the process of setting up sources in your staging collection and, when you are happy with them, add them to the production collection.

Operational issues

Snapshots failing to start

If a snapshot fails to start, you should set up PagerDuty to inform you.

You can then go to the snapshot page to see the error reason. If it was a transient error, you can click the "retry start" action to try again.

QA Failures

When QA failures happen, an alarm should trigger in PagerDuty. Normally this means snapshots have transitioned to either PENDING_QA or ESCALATED, depending on whether a failing check had the "escalate" option turned on. The PagerDuty alert will contain the URL of the delivery page where the issues are.

Someone then needs to go and work out why QA has failed. That person should assign the snapshot to themselves; in the case of PENDING_QA, this will automatically transition it into QA.

There are multiple things that then can happen:

False alarm

On the checks page you can see exactly why we think the automated quality checks failed. You inspect the data, it looks OK, so you move the snapshot to PASSED_QA.

Bad data

There was some bad data, the percentage of blocked inputs was too high, etc.

There are some options:

  1. Go ahead and push the data, because pushing it even with missing data is better than not pushing it - put the snapshot into a PUSHED_IGNORE state so it doesn’t mess up the statistics for "good data", and alert someone to look at the extractor so it performs better next time.

  2. Edit the calculated columns, parameters or other settings and re-import the same data with the updated configuration - you can do this by clicking the "re-import" action - this will move the old one into a SUPERCEDED state

  3. Edit the extractor and re-import the data with an updated extractor configuration - you can do this by clicking the "re-extract" action - this will move the old one into a SUPERCEDED state

  4. Entirely re-run the extractor by clicking the "re-run" action on the snapshot page - this will move the old one into a SUPERCEDED state

We have not yet implemented partial retries, where we retry some inputs based on a condition (certain error code, etc.)

Editing extractors

You should always edit the staging version of the extractor, and then update the latestConfigId in the production version when you have a new version.

Make sure you test the staging extractor against relevant data. Currently, this is done by changing the inputs in the legacy SaaS application, which can be done over the API. You can download the inputs from the product over the API.

Inputs will soon be editable in workbench.

There are currently some extractor-level settings that you may need to manually copy over, e.g. proxy pool.