Collection Checks

This page allows you to perform data quality checks on a per-Collection basis. Certain checks and
validations can run automatically once a Snapshot is imported, before humans evaluate the data.
This verification occurs during the testing phase, before data is pushed to the customer.

[Screenshot: existing Collection Checks]

Add a Collection Check

[Screenshot: adding a Collection Check]

To add a Collection Check:

  1. From the left navigation pane, click Checks.

  2. From the top right of the Checks page, click the Add a Collection Check icon or plus (+) symbol.

  3. From the Add Collection Check modal, enter text in the Check Name field.

  4. From the Add Collection Check modal, click the drop-down arrow and choose a Check Type
    from the list.

  5. To save the check, click OK; to discard it, click Cancel.

[Screenshot: saved Collection Check]

Collections may have Check rules applied that either automatically flag an issue with the data or signal
a QA user to perform certain Human Sampling validations.

You can configure a check as a validation directly against the current data by using Check Type VALUE.

Check types PCT and STDDEV are evaluated against previous Snapshots that passed QA.

Metric Tests

You can establish automated tests against numeric metrics that are machine-generated when Snapshots
are imported into the platform. A metric test passes or fails automatically when the metric value is
available. If the metric value is not available, the result cannot be determined automatically and is
left for a user to decide; this is uncommon and may indicate a misconfiguration.

Target Types

When establishing a metric check, the import.io team sets a target value for a metric,
a comparator (such as >), and the target type:

Type     Description                                            Interpretation of Number
VALUE    Actual value sought                                    Literal value, such as row count > 10
PCT      Value within a % of the last value to pass QA          Size of the range in %
STDDEV   Value within a certain number of standard deviations   Size of the range in σ
         of the mean, in the last calendar month, for
         Snapshots that passed QA

Snapshots moved into the PUSHED_IGNORE state are not counted among the Snapshots that have passed QA; only the PUSHED and PASSED_QA states are included.

From these settings, the import.io team determines whether a measurement represents a Pass or a Fail.
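
For illustration only, here is a minimal Python sketch of how these three target types might be evaluated. The function names, thresholds, and sample history are assumptions for this example, not the platform's implementation:

from statistics import mean, stdev

def check_value(metric, target):
    # VALUE: compare the metric against a literal target, e.g. row count > 10.
    return metric > target

def check_pct(metric, last_qa_value, pct_range):
    # PCT: pass if the metric lies within pct_range % of the last value to pass QA.
    return abs(metric - last_qa_value) <= last_qa_value * pct_range / 100

def check_stddev(metric, history, n_sigma):
    # STDDEV: pass if the metric lies within n_sigma standard deviations of the
    # mean of the QA-passed Snapshots from the last calendar month.
    mu, sigma = mean(history), stdev(history)
    return abs(metric - mu) <= n_sigma * sigma

history = [1000, 1020, 980, 1010, 990]   # row counts of prior QA-passed Snapshots
print(check_value(1050, 10))             # True: 1050 > 10
print(check_pct(1050, 990, 10))          # True: within 10% of 990
print(check_stddev(1050, history, 2))    # False: more than 2 sigma from the mean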

Types of Metric Checks

Validation Errors

The first time a validation error is seen, a check with code VALIDATION_ERRORS is inserted into the Collection tests. By default, the import.io team expects zero validation errors to occur; this can be adjusted in the test configuration.

Statistical Metrics

Statistical metrics (code METRIC_NUMBER) also can be selected. A number of statistical metrics are
collected for each column, for every page of results from an input, and for the overall Snapshot.

The import.io team organizes the metrics by namespace (group), target (column), name (metric), and
measure (statistic).

Table 1. Examples of Statistical Metrics for Namespaces
Namespace    Target                 Name              Measure
pages:meta   rows                   count             count
pages:field  url                    stringUpperRatio  p99
data:meta    _screenCapture         exists            pct_not_null
data:field   salePriceHighCurrency  stringPunctRatio  stddev

SQL Metrics

You also can use SQL to create metrics.

You can write a single query that returns a single value:

select count(*) from S3Object s where foo='bar'

Alternatively, you can write multiple queries and combine their results with an Excel-compatible formula. For example:

(SQL statement 1) select count(*) as total_a from S3Object s where foo='bar'
(SQL statement 2) select count(*) as total_a_and_b from S3Object s where foo='bar' and bar='foo'
(Metric) total_a_and_b/total_a
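
If, for example, the first statement returns total_a = 200 and the second returns total_a_and_b = 150, the metric evaluates to 150/200 = 0.75.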

Health Metrics

A number of top-level aggregate health metrics also are created by the system and can be used as HEALTH_NUMBER checks.

Table 2. Health Metrics
Code             Description
pages            # of pages
rows             # of rows
dataPct          % Data
noDataPct        % No Data (excludes 404/410)
notFoundPct      % No Data Available (404/410)
blockedPct       % Blocked
errorPct         % Error
noScreenshotPct  % Missing screenshot
noHtmlPct        % Missing HTML
rowsPerPage      Rows per page with data
avgAttempts      Average attempts
verrorsP99       99% of rows have fewer than X validation errors
verrorsP95       95% of rows have fewer than X validation errors
verrorsP75       75% of rows have fewer than X validation errors
verrorsP50       50% of rows have fewer than X validation errors
dupePct          % Duplicates
filteredPct      % Filtered

Human Sampling

Note: This feature is not yet released. RECORDS currently is specific to a fixed number of records.

You can also elect to have humans sample the data; this is a RECORDS check.

[Screenshot: sample size]

By default, the import.io team uses a 95% confidence level and a 5% margin of error, along with an assumption that 5% of the population data is bad.

For example, for a Snapshot with 2,000 rows, the import.io team would sample 71 rows.
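
The exact formula the platform uses is not documented on this page, but the standard sample-size calculation (Cochran's formula with a finite population correction) reproduces the 71-row example. A minimal Python sketch, in which the function name and rounding choice are assumptions:

import math

def sample_size(population, z=1.96, margin=0.05, p_bad=0.05):
    # Cochran's formula: required sample for an effectively infinite population
    # at 95% confidence (z = 1.96), 5% margin of error, 5% assumed bad data.
    n0 = (z ** 2) * p_bad * (1 - p_bad) / (margin ** 2)
    # Finite population correction for the actual Snapshot row count.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(2000))  # 71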

When the data is sampled, the team records which cell values are either missing or incorrect.
A Developer can review and fix issues.

From this information, the import.io team determines an estimated % accuracy for each column (which forms the metric value that is used).

The default expectation is that the average column accuracy should be above 99%.
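
For example, if 2 of the 71 sampled values in a column are missing or incorrect, that column's estimated accuracy is 69/71, or roughly 97.2%.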

Stratified Sampling

The import.io team also can add an SQL condition to a RECORDS test, such as availability='InStock'.

Automated Testing

By default, the team adds new checks automatically; however, you can opt out of this on the Collection Settings page. You can manually add any missing baseline tests by clicking the Add Baseline Tests
button on the Tests page.

Table 3. Baseline Tests
Description                                      Type               Metric                             Test       Escalate on Fail  Added for
Manual row checking                              RECORDS            (Estimated % of incorrect rows)    < 1%
Blocking                                         HEALTH_NUMBER      blockPct                           < 1%
System errors                                    HEALTH_NUMBER      errorPct                           < 1%
Missing HTML snapshots                           HEALTH_NUMBER      noHtmlPct                          < 1%
% Inputs generating no data                      HEALTH_NUMBER      noDataPct                          Within 2σ
Rows per input/page (when input generates data)  HEALTH_NUMBER      rowsPerPage                        Within 2σ
Total rows                                       HEALTH_NUMBER      rows                               Within 2σ
% Duplicates                                     HEALTH_NUMBER      dupePct                            Within 2σ
% Filtered Rows                                  HEALTH_NUMBER      filteredPct                        Within 2σ
Column fill rate                                 METRIC_NUMBER      data:field/column/exists/pct_true  Within 2σ                    All data columns
Column composite metric score                    METRIC_NUMBER      data:field/column/composite/value  > 0.9                        All data columns
Column value outlier rate                        METRIC_NUMBER      data:field/column/outliers/pct     Within 2σ                    All data columns
Validation errors                                VALIDATION_ERRORS  rule name                          Within 2σ                    All validation rules

The composite metric score compares all of a column's metrics with the last 30 days of values for
Good data, and returns a score based on how far the column's metrics fall outside expectations.
The composite already includes fill rate, but keeping fill rate as its own explicit check makes it
easier to surface errors specifically around fill rate.
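
The exact scoring function is not documented on this page. Purely as an illustration of the idea, a composite of this kind could be built from per-metric z-scores against the 30-day history; the names, the clamping, and the 0..1 mapping below are all assumptions:

from statistics import mean, stdev

def composite_score(current_metrics, history, cap=3.0):
    # current_metrics: {metric_name: latest value for this column}
    # history: {metric_name: values from the last 30 days of Good data}
    penalties = []
    for name, value in current_metrics.items():
        mu, sigma = mean(history[name]), stdev(history[name])
        z = abs(value - mu) / sigma if sigma else 0.0   # distance from expectation
        penalties.append(min(z, cap) / cap)             # clamp extreme outliers
    return 1.0 - mean(penalties)                        # 1.0 = fully as expected

# Scores below the baseline threshold of 0.9 would fail the composite check.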

You can then drill down into the specific column metrics:

[Screenshot: column stats]

Snapshot State Transitions after Checks Run

If the result of any test cannot be determined, or there were no tests, the Snapshot is placed into the manual QA queue, PENDING_QA.

If a metric check fails, the Snapshot by default transitions into the FAILED_QA state, which also
moves the linked Source into MAINTENANCE so a user can evaluate the issue.

You can elect to have the Snapshot moved into the ESCALATED state instead by selecting the Escalate on Fail option for the check. This places the Snapshot in the L2 support queue.
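
As a rough summary of this flow, here is an illustrative Python sketch; the state names come from this page, but the routing logic (and especially the pass branch, which this section does not describe) is an assumption:

# Illustrative routing of a Snapshot after its checks run; not platform code.
def next_snapshot_state(results, escalate_on_fail=False):
    if not results or None in results:   # no tests, or an undetermined result
        return "PENDING_QA"              # manual QA queue
    if not all(results):                 # at least one check failed
        # A failure also moves the linked Source into MAINTENANCE.
        return "ESCALATED" if escalate_on_fail else "FAILED_QA"
    return "PASSED_QA"                   # assumed outcome when every check passes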