Data Quality Tests

Data Quality Tests are set up on a per-Collection basis.

To set up the data quality tests for a Collection, visit the "Tests" section from the collection homepage.

Metric tests

Automated tests can be set up against numeric metrics that are machine generated when snapshots are imported into the platform.

If the metric value is available, a metric test passes or fails. If the metric value is not available, the result cannot be determined automatically and is left for a user to decide; this should generally not happen, and may indicate a misconfiguration.

Target types

When setting up a metric check, we set a target value for the metric, a comparator (such as >), and a target type:

Type | Description | Interpretation of number
VALUE | The actual value we are looking for | Literal value, e.g. row count > 10
PCT | Looking for the value to be within a % of the last value to pass QA | Size of the range in %
STDDEV | Looking for the value to be within a certain number of standard deviations of the mean over the last calendar month, for snapshots that passed QA | Size of the range in σ

Snapshots that are moved into the PUSHED_IGNORE state are not counted as having passed QA; only snapshots in the PUSHED and PASSED_QA states are.

It is from these settings that we determine whether the measurement represents a pass or a fail.
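
As an illustration only (not the platform's implementation), the three target types could be evaluated roughly as follows, assuming a > comparator and a history list holding the metric values from snapshots that previously passed QA:

import statistics

# Illustration only. value is the metric for the current snapshot, target is the
# configured number, and history holds the metric values from snapshots that
# previously passed QA (PUSHED or PASSED_QA; PUSHED_IGNORE is excluded), limited
# to the last calendar month for the STDDEV case.
def evaluate_check(value, target, target_type, history):
    if value is None:
        return None  # cannot be determined automatically; left for a user
    if target_type == "VALUE":
        return value > target  # literal comparison, e.g. row count > 10
    if target_type == "PCT":
        last = history[-1]  # last value to pass QA
        return abs(value - last) <= abs(last) * target / 100
    if target_type == "STDDEV":
        mean = statistics.mean(history)
        return abs(value - mean) <= target * statistics.stdev(history)
    raise ValueError("unknown target type: " + target_type)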

Types of metric check

Validation errors

The first time a validation error is seen, a check is inserted into the collection tests (code VALIDATION_ERRORS); by default we are looking for zero validation errors to occur. This can be adjusted in the tests configuration.

Statistical metrics

Statistical metrics can also be selected (code METRIC_NUMBER). A number of statistical metrics are collected for every column, for every page of results from an input, and for the overall snapshot.

We organize the metrics by namespace (group), target (column), name (metric), and measure (statistic).

Table 1. Examples of statistical metrics for namespaces
Namespace | Target | Name | Measure
pages:meta | rows | count | count
pages:field | url | stringUpperRatio | p99
data:meta | _screenCapture | exists | pct_not_null
data:field | salePriceHighCurrency | stringPunctRatio | stddev
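
The four parts combine into a path of the form namespace/target/name/measure, for example pages:field/url/stringUpperRatio/p99; this is the form used to reference column metrics in the baseline tests in Table 3 below (e.g. data:field/column/exists/pct_true).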

SQL metrics

You can also use SQL to create metrics.

You can write a single query that returns a single value:

select count(*) from S3Object s where foo='bar'

Alternatively, you can write several statements and use an Excel-compatible function to combine them, for example:

(SQL statement 1) select count(*) as total_a from S3Object s where foo='bar'
(SQL statement 2) select count(*) as total_a_and_b from S3Object s where foo='bar' and bar='foo'
(Metric) total_a_and_b/total_a
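
For example, with hypothetical counts: if statement 1 returns total_a = 2000 and statement 2 returns total_a_and_b = 500, the metric evaluates to 500/2000 = 0.25, and the configured comparator and target are then applied to that value.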

Health metrics

The system also creates a number of top-level aggregate "health" metrics, which can be used as HEALTH_NUMBER checks.

Table 2. Health metrics
Code | Description
pages | # of pages
rows | # of rows
dataPct | % Data
noDataPct | % No Data (excludes 404/410)
notFoundPct | % No Data Available (404/410)
blockedPct | % Blocked
errorPct | % Error
noScreenshotPct | % Missing screenshot
noHtmlPct | % Missing HTML
rowsPerPage | Rows per page with data
avgAttempts | Average attempts
verrorsP99 | 99% of rows have fewer than X validation errors
verrorsP95 | 95% of rows have fewer than X validation errors
verrorsP75 | 75% of rows have fewer than X validation errors
verrorsP50 | 50% of rows have fewer than X validation errors
dupePct | % Duplicates
filteredPct | % Filtered

Human sampling

You can also elect to have humans sample the data; this is a RECORDS check.

Sample size

By default we use a 95% confidence level, a 5% margin of error, and an assumption that 5% of the population data is bad.

For example, for a snapshot with 2,000 rows we would sample 71 rows.
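
The 71-row figure is consistent with the standard sample size calculation (Cochran's formula with a finite population correction); the sketch below reproduces it, although the platform's exact rounding may differ:

import math

def sample_size(population_rows, z=1.96, margin=0.05, p_bad=0.05):
    # Cochran's formula for an unbounded population (z = 1.96 for 95% confidence).
    n0 = (z ** 2) * p_bad * (1 - p_bad) / (margin ** 2)
    # Finite population correction, rounded up to whole rows.
    return math.ceil(n0 / (1 + (n0 - 1) / population_rows))

print(sample_size(2000))  # 71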

When the data is sampled, we record which cell values were marked as missing or incorrect; a developer can then view and drill down into these to better understand and fix the issues.

From this we determine an estimated % accuracy for each column, and this forms the metric value that is used.

The default expectation is that the average column accuracy should be above 99%.
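
As a minimal sketch of how the per-column estimate could be derived from the review results (the column names and flagged counts below are hypothetical, and the platform's exact calculation may differ):

# Hypothetical review results: flagged cells (missing or incorrect) per column,
# out of 71 sampled rows.
sampled_rows = 71
flagged = {"url": 0, "salePriceHighCurrency": 2, "availability": 1}

accuracy = {col: 1 - bad / sampled_rows for col, bad in flagged.items()}
average_accuracy = sum(accuracy.values()) / len(accuracy)

# Default expectation: the average column accuracy should be above 99%.
passes = average_accuracy > 0.99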

Stratified sampling

We can also add a SQL condition to a RECORDS test, e.g. availability='InStock'.

Automated testing

By default, we add new checks automatically, but you can opt out of this in the collection settings page, and then manually add any missing baseline tests by clicking the "Add baseline tests" button on the tests page.

Table 3. Baseline tests
Description | Type | Metric | Test | Escalate on fail | Added for
Manual row checking | RECORDS | (Estimated % of incorrect rows) | < 1%
Blocking | HEALTH_NUMBER | blockPct | < 1%
System errors | HEALTH_NUMBER | errorPct | < 1%
Missing HTML snapshots | HEALTH_NUMBER | noHtmlPct | < 1%
% Inputs generating no data | HEALTH_NUMBER | noDataPct | Within 2σ
Rows per input/page (when input generates data) | HEALTH_NUMBER | rowsPerPage | Within 2σ
Total rows | HEALTH_NUMBER | rows | Within 2σ
% Duplicates | HEALTH_NUMBER | dupePct | Within 2σ
% Filtered Rows | HEALTH_NUMBER | filteredPct | Within 2σ
Column fill rate | METRIC_NUMBER | data:field/column/exists/pct_true | Within 2σ | | All data columns
Column composite metric score | METRIC_NUMBER | data:field/column/composite/value | > 0.9 | | All data columns
Column value outlier rate | METRIC_NUMBER | data:field/column/outliers/pct | Within 2σ | | All data columns
Validation errors | VALIDATION_ERRORS | rule name | Within 2σ | | All validation rules

The composite metric score compares all the metrics for a column with the last 30 days of values for "Good data", and returns a score based on how far outside the expectation the column metrics are. This includes fill rate, but having fill rate as a separate explicit check makes it easier to surface errors that are specifically about fill rate.
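
The exact scoring function is internal to the platform; purely as an illustration of the idea, a score of this kind could be built from how far each of the column's metrics sits from its recent history, for example:

import statistics

# Illustration only: current maps metric name to today's value, history maps
# metric name to its values over the last 30 days of good data.
def composite_score(current, history):
    deviations = []
    for name, value in current.items():
        mean = statistics.mean(history[name])
        sd = statistics.stdev(history[name]) or 1e-9  # guard against zero spread
        deviations.append(abs(value - mean) / sd)
    # 1.0 means fully in line with expectation; larger deviations lower the score.
    return max(0.0, 1.0 - max(deviations) / 10)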

You can then drill down into the specific column metrics.


Snapshot state transition after checks are run

When we are running the automatic QA, the snapshot is in the AUTO_QA state.

When all of the tests are completed, we transition the snapshot automatically out of the AUTO_QA state and into another state, depending on what the outcome was.

If we cannot determine the result of any test, or there were no tests, we place the snapshot into the manual QA queue (PENDING_QA).

If a metric check fails, the snapshot is by default transitioned into the FAILED_QA state, which will transition the linked source into MAINTENANCE to have a human look at what the issue is.

You can elect to have the snapshot escalated into the ESCALATED state by ticking the "Escalate on fail" option for a test. This puts the snapshot into the L2 support queue.
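
The decision described above can be sketched as follows; the PASSED_QA outcome for an all-pass run is an assumption based on the states named earlier on this page, and the Lifecycle reference remains the authoritative description:

def next_state(test_results, escalate_on_fail):
    # test_results holds the per-test outcomes: True (pass), False (fail),
    # or None (result could not be determined).
    if not test_results or any(r is None for r in test_results):
        return "PENDING_QA"  # manual QA queue
    if not all(test_results):
        # A failing check also moves the linked source into MAINTENANCE.
        return "ESCALATED" if escalate_on_fail else "FAILED_QA"
    return "PASSED_QA"  # assumed pass state; see the Lifecycle reference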

You can see these states on the Lifecycle reference.