Quick start

Creating a project

Create a new project by clicking the + icon on your projects page.

The readme should provide context for team members involved in the project; it supports Markdown syntax.

It is important to pick good slugs because they cannot be changed later: for example, they are used within the data lake and for the database schema name.

Creating a destination

A destination is where the data that passes QA gets published to.

To create a new destination click the + icon on the destinations page.

Data will only be published to a destination if the destination is marked as active at the time that the data passes QA.

S3 Destinations

You can select what type of files you wish to be published to an S3 bucket, and add a path and filename template. The default filename template is `:snapshot_id.:ext`.
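As an illustration of how such a template might be expanded, here is a minimal sketch: the `:snapshot_id` and `:ext` tokens come from the default template above, but the expansion logic and any other token names are assumptions, not documented behaviour.

```python
import re

def expand_template(template: str, values: dict) -> str:
    """Replace each :token in a filename template with its value.

    Hypothetical sketch only; the platform's actual token list and
    expansion rules are not documented here.
    """
    return re.sub(r":(\w+)", lambda m: str(values[m.group(1)]), template)

print(expand_template(":snapshot_id.:ext", {"snapshot_id": "abc123", "ext": "csv"}))
# abc123.csv
```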

"Custom" files can be defined on the collection settings page.

Creating a schema

A schema defines the output format of the data.

To create a new schema click the + icon on the schemas page.


You can see a preview on the right of what the data would look like as JSON.

Currently, if you use a nested schema, you must push data in via the Avro import method.
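For reference, a nested schema of the kind that requires the Avro import method looks like the following. This is an illustrative Avro record schema only; the record and field names are made up for the example.

```python
# Illustrative nested Avro record schema (as a Python dict).
# "Product" contains a nested "Price" record, so flat row import
# would not suffice.
nested_schema = {
    "type": "record",
    "name": "Product",
    "fields": [
        {"name": "title", "type": "string"},
        {
            "name": "price",
            "type": {
                "type": "record",
                "name": "Price",
                "fields": [
                    {"name": "amount", "type": "double"},
                    {"name": "currency", "type": "string"},
                ],
            },
        },
    ],
}
```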

Field settings

| Setting | Description |
| --- | --- |
| Single value | Whether or not the value is an array. |
| Filter | If the value is falsy (zero, false, blank), mark this row as filtered and do not include it in the data pushed to destinations. |
| Internal | If true, the data is excluded from the data pushed to destinations. |
| Primary key | The fields that are part of the composite primary key give the rows the `_id` metadata column: a UUID generated from the hash of the column values. The data pushed to destinations is deduplicated on this ID. |
| Type | The type determines the Avro/Parquet data types, and also controls how the system turns extracted text values into typed values. The locale parameter on a source is used when doing this conversion. |
| Default value | A textual default value for the column. Note that this should be in ISO format for dates/times, and JSON format for numbers and booleans. |
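The primary-key behaviour can be sketched as follows. The exact hashing and UUID derivation used by the platform are not documented here; this assumes a UUID built from an MD5 hash of the joined key-column values, with later rows replacing earlier duplicates.

```python
import hashlib
import uuid

def row_id(row: dict, primary_key: list) -> str:
    """Derive a deterministic UUID from the primary-key column values.

    Assumption: MD5 over the joined values; the platform's actual
    hash function may differ.
    """
    joined = "|".join(str(row[k]) for k in primary_key)
    return str(uuid.UUID(bytes=hashlib.md5(joined.encode()).digest()))

def dedupe(rows, primary_key):
    """Keep one row per _id; later rows replace earlier duplicates."""
    seen = {}
    for row in rows:
        seen[row_id(row, primary_key)] = row
    return list(seen.values())

rows = [
    {"sku": "A1", "price": 10},
    {"sku": "A1", "price": 12},  # duplicate primary key "A1"
    {"sku": "B2", "price": 5},
]
print(len(dedupe(rows, ["sku"])))  # 2
```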

Validation settings

These settings contribute towards the validation error statistics for snapshots of data.

These settings are soft indicators: they do NOT filter data. To filter rows, use the Filter option in the field settings.

Creating a collection

A collection is a group of sources that share the same schema and collection window.

Create a new collection by clicking the + icon on your collections page within the project section.


The readme should provide instructions beyond the schema on how the sources for this collection are built.

The parameters are important for distinguishing sources. For example, you might have a domain parameter for a number of sites that are being built. The parameter values, along with the readme and schema, give the people implementing the sources the information required to build each source. There are two special parameters, locale and tz, which are also used in data typing if set.
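As a hedged sketch of how the tz parameter could be applied during data typing, the example below parses an extracted timestamp and attaches the source's timezone. The platform's actual conversion logic is not specified here.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def type_timestamp(text: str, tz: str) -> datetime:
    """Parse an extracted ISO timestamp and attach the source's tz.

    Illustrative only; assumes naive ISO-format input text.
    """
    return datetime.fromisoformat(text).replace(tzinfo=ZoneInfo(tz))

dt = type_timestamp("2023-05-01T09:30:00", "Europe/London")
print(dt.isoformat())  # 2023-05-01T09:30:00+01:00 (BST)
```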

Linking a destination to the collection

To add or remove links to destinations for a collection to push data to, visit the "Destinations" section from the collection homepage.

Setting up the data quality checks for the collection

To set up the data quality checks for a collection, visit the "Checks" section from the collection homepage.

If you want to escalate a failed snapshot to your L2 support automatically on failure, you can choose to do so.

If there are no data quality checks, or every data quality check passes without human intervention when the data is imported, QA will automatically pass and the data will be pushed to the configured destinations, unless the source is in development or maintenance.

Creating a source

To create a source in a collection, visit the "Sources" section from the collection homepage, and click the + icon.

When a source is created it is in the QUEUED state.

If a user takes ownership of a QUEUED source, it moves to IN PROGRESS. Once they are ready to have the source checked, they can move it to QA. The source will automatically move to ACTIVE if QA passes, and the data will be published.

If a source fails QA, the source moves into a MAINTENANCE state.

You can change the state of the source manually by editing the source.
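The automatic lifecycle described above can be sketched as a small state machine. The transitions shown are taken from the text; note that manual edits (as just described) can move a source between states outside these rules, so anything beyond the listed transitions is an assumption.

```python
# Automatic source lifecycle transitions, per the description above.
ALLOWED = {
    "QUEUED": {"IN PROGRESS"},       # a user takes ownership
    "IN PROGRESS": {"QA"},           # ready to be checked
    "QA": {"ACTIVE", "MAINTENANCE"}, # QA pass -> ACTIVE, QA fail -> MAINTENANCE
}

def transition(state: str, nxt: str) -> str:
    """Apply an automatic transition, rejecting moves the text does not describe."""
    if nxt not in ALLOWED.get(state, set()):
        raise ValueError(f"cannot move from {state} to {nxt}")
    return nxt

state = "QUEUED"
state = transition(state, "IN PROGRESS")
state = transition(state, "QA")
state = transition(state, "ACTIVE")  # QA passed
print(state)  # ACTIVE
```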

Linking an extractor to a source

You can link an extractor by ID to the source in the edit source view.

Importing a crawl run

To import a crawl run for a source, visit the "Snapshots" section from the source homepage, and click the "Import" button.

Viewing the import status

To import a crawl run for a source, visit the "Import status" section from the snapshot homepage.
