Import Status

Importing the crawl run data (and creating the subsequent Snapshot) is achieved via an
import pipeline. Import Status tracks each stage of this data import process.

Several screens in the DOC environment let you view the import status or check the data.
If, for example, the import does not complete, there are several checks you can perform.
One is to review the Import Status content to determine the stage at which the import
failed.

Import Status is also useful when an unforeseen issue makes it necessary to retry
certain stages of the Snapshot lifecycle, for example when:

  • The JQ transform for custom files changes.

  • The runtime configuration of a Snapshot is modified.

  • The import fails.

There are different pipeline sequences that may be invoked. The content below focuses
on a sequence with 11 stages, from UpdateStatus to Generate Assets.

First Set of the 11 Stages:

[Image: import status primary]

Second Set of the 11 Stages:

[Image: import status secondary]

The Import Status table includes the Stage, Started at, Finished at, Progress, Error,
and Completed columns. A checkmark in the Completed column indicates that the corresponding
stage finished successfully.
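
As a rough illustration, the table's columns can be modeled as a simple record, and the stage where an import stopped can be found by scanning the rows in order. This is a minimal Python sketch, not an API of the DOC environment: the `StageRow` class, its field names, and the `first_incomplete_stage` helper are hypothetical and only mirror the column names above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageRow:
    """One row of the Import Status table (hypothetical field names)."""
    stage: str
    started_at: Optional[str] = None   # "Started at" column
    finished_at: Optional[str] = None  # "Finished at" column
    progress: Optional[str] = None     # "Progress" column
    error: Optional[str] = None        # "Error" column
    completed: bool = False            # checkmark in the "Completed" column


def first_incomplete_stage(rows: List[StageRow]) -> Optional[StageRow]:
    """Return the first row without a Completed checkmark, or None if all finished."""
    for row in rows:
        if not row.completed:
            return row
    return None


# Illustrative rows only; the timestamps and error text are made up.
rows = [
    StageRow("UpdateStatus", "10:00:01", "10:00:02", completed=True),
    StageRow("Download Crawl Run", "10:00:02", "10:00:09", completed=True),
    StageRow("MergeExtractions", "10:00:09", error="example error message"),
]

failed = first_incomplete_stage(rows)
print(failed.stage if failed else "all stages completed")
```

Scanning in table order reflects the idea that the first stage without a checkmark is the one to investigate; the Error column of that row would be the natural place to look next.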

The Import Status content is aligned with the Snapshot. As data is deployed into the DOC environment,
the progress of the crawl run and import is tracked.

[Image: import status snapshot]

The image above shows a crawl run with one input and an import of 25 rows of data.
All 11 stages completed successfully.

Import Stages

This section defines the 11 stages of the import pipeline, as referenced in the images above.

Stage Description

UpdateStatus

This stage handles Snapshot status changes.
The status can be one of several values, including
PENDING_QUEUE, PENDING_QA, COMPLETED, and PUSHED.

Download Crawl Run

Here, the crawl run data is imported into the DOC
environment.

MergeExtractions

The crawl run data may appear in a number of forms,
including multi-part, depending on the circumstances
of the crawl run. This stage normalizes the data
into a single form that later stages can process
more efficiently.

PageSummarization

At this stage, the Web pages slated for access
are analyzed for statistical purposes. The resulting
stats become available when the Snapshot import
process is complete. The Snapshot Download menu,
depicted in the image below this table, includes
the related reference: Summarized Pages (JSON).

AvroConversion

The imported data is converted from its crawl run
format (JSON) to another data download format: Avro.

ParquetConversion

The imported data is also converted to another
data download format: Parquet.

UpdateStatus

This stage handles Snapshot status changes.
The status can be one of several values, including
PENDING_QUEUE, PENDING_QA, COMPLETED, and PUSHED.

StatisticsGeneration

Many Snapshot statistics are gathered here,
covering the Snapshot itself, its health, and
how it compares to previous Snapshots. The data
is available for Checks and is summarized on the
Compare page. It is also available as a JSON
download, although that download is not listed
in the Snapshot Download menu in the image
below. This information is useful for managing
data quality.

TestsGeneration

Here, checks are performed and the results are
recorded.

UpdateStatus

This stage handles Snapshot status changes.
The status can be one of several values, including
PENDING_QUEUE, PENDING_QA, COMPLETED, and PUSHED.

Generate Assets

This stage creates certain custom files.
The Custom Output Settings on the Collection Settings
page are used here to produce the intended custom output.
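
The order of the 11 stages described above can be written down as a simple list. The Python sketch below is only an illustration: the `IMPORT_STAGES` name and the `stages_from` helper are hypothetical, and the sketch assumes the stages run in the order in which the table lists them. Because UpdateStatus appears three times, the helper works with positions rather than stage names.

```python
from typing import List

# Stage names in the order the table lists them (assumed to be the run order).
IMPORT_STAGES: List[str] = [
    "UpdateStatus",
    "Download Crawl Run",
    "MergeExtractions",
    "PageSummarization",
    "AvroConversion",
    "ParquetConversion",
    "UpdateStatus",
    "StatisticsGeneration",
    "TestsGeneration",
    "UpdateStatus",
    "Generate Assets",
]


def stages_from(position: int) -> List[str]:
    """Return the stages from a 1-based position to the end of the sequence.

    Positions are used instead of names because UpdateStatus occurs
    at positions 1, 7, and 10.
    """
    if not 1 <= position <= len(IMPORT_STAGES):
        raise ValueError(f"position must be between 1 and {len(IMPORT_STAGES)}")
    return IMPORT_STAGES[position - 1:]


print(len(IMPORT_STAGES))
print(stages_from(8))
```

For example, if the row at position 8 (StatisticsGeneration) has no checkmark, `stages_from(8)` lists that stage and everything listed after it in the table.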

The PageSummarization, AvroConversion, and ParquetConversion stages referenced
in the table above are specific to data downloads, as shown in the following
Snapshot Download menu:

[Image: import status download snapshot]

Additional Import Status stages associated with other data pipeline sequences include:

Stage Description

AggregateChunks

A Flow that has multiple chunks and is configured
to aggregate the resulting data performs the
AggregateChunks operation. You can view this
operation on parent Snapshots.

Download Parquet

This stage is necessary to perform
StatisticsGeneration again.

Download Pages Parquet

This stage is necessary to perform
StatisticsGeneration again.

Reextract

This stage re-executes the crawl run and restarts
the import process.
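
The table above describes both Download Parquet and Download Pages Parquet as necessary for performing StatisticsGeneration again. One way to picture that relationship is a small prerequisite map. This Python sketch is purely illustrative: the `PREREQUISITES` mapping and the `rerun_plan` helper are hypothetical, and the relative order of the two download stages is an assumption, not something the description states.

```python
from typing import Dict, List

# Stages that the table describes as necessary before another stage can be
# performed again. The order of the two download stages is an assumption.
PREREQUISITES: Dict[str, List[str]] = {
    "StatisticsGeneration": ["Download Parquet", "Download Pages Parquet"],
}


def rerun_plan(stage: str) -> List[str]:
    """Return the stage preceded by any prerequisite stages recorded for it."""
    return PREREQUISITES.get(stage, []) + [stage]


print(rerun_plan("StatisticsGeneration"))
print(rerun_plan("TestsGeneration"))
```

Stages with no recorded prerequisites, such as TestsGeneration in this sketch, simply yield a one-item plan.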