Snapshots

A Snapshot, which maps to a crawl run, is the extracted data or output.

There are three ways to import extracted data into the DOC environment:

  • Import by Crawl Run ID / Import Latest Crawl Run

  • Modify Source Settings

  • Access the Run Source option (Recommended)

With each option, you can track the progress of the Crawl Run and the Import,
each represented by an in-progress chart. The crawl run is the process whereby a designated
website is accessed to retrieve data; the import occurs when this data is received and brought
into the DOC environment. After a successful crawl run and import, Health Metrics are available.
This information, located below the crawl run and import charts, reflects the fitness of the data.
Below Health Metrics is an Activity section, which may provide additional detail about the data import.

After you initiate the import, you can track the status by reviewing the progress charts. Health
Metrics are only available upon import completion.

[Figure: Snapshot metrics]

Import by Crawl Run ID

  1. From the Snapshots page, click the ellipsis (More Options) icon at the top right of the page.

  2. From the list that appears, select Import by Crawl Run ID.

  3. From the Import Crawl Run by ID modal, enter the Crawl Run ID.

    You can retrieve this ID from the Extractor application. Identify the Extractor whose data you want
    to import, open the Run History tab, click the Preview Data icon (right), and copy the Crawl Run ID
    that appears near the top center of this tab.

  4. Return to the DOC environment, paste this ID into the modal, and click the OK button.

Import Latest Crawl Run

  1. From the Snapshots page, click the ellipsis (More Options) icon at the top right of the page.

  2. From the list that appears, select Import Latest Crawl Run.

    The system will examine your existing list of crawl runs and import the most recent.
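
The selection that Import Latest Crawl Run performs can be pictured with a short sketch. This is only an illustration, not the DOC implementation: the record layout, the run IDs, and the use of the start time to decide which run is "most recent" are all assumptions.

```python
from datetime import datetime
from typing import Dict, List


def latest_crawl_run(runs: List[Dict]) -> Dict:
    """Return the run with the newest start time (assumed ordering key)."""
    return max(runs, key=lambda run: run["started_at"])


# Hypothetical crawl run records; the IDs and timestamps are made up.
crawl_runs = [
    {"id": "run-a", "started_at": datetime(2024, 1, 5, 9, 0)},
    {"id": "run-c", "started_at": datetime(2024, 3, 1, 14, 30)},
    {"id": "run-b", "started_at": datetime(2024, 2, 12, 8, 15)},
]

print(latest_crawl_run(crawl_runs)["id"])  # run-c
```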

Source Settings

  1. From the Source level, click Settings.

  2. From the Source Settings page, select the Automatically Import Data checkbox.

    A Source Saved message appears near the top right of the page, confirming your selection.

  3. Return to the Extractor and run it.

    This action sends an event to the DOC environment, which triggers the import of the extracted data.

Run Source (Recommended)

This method is preferred, as it is the most efficient way to import extracted data
into the DOC environment.

  1. From the Snapshots page, click the Run Source icon located at the top right of the page.

    Running the Source establishes background windows, permits, and other processes which render
    this extraction method the most efficient.

As data is imported into the DOC environment, the status of the Snapshot typically transitions
through a number of states, from PENDING_QUEUE to DRAFT_SCHEMA. The DRAFT_SCHEMA state
indicates that the Schema has not been published. You cannot push data to its
Destination until a Schema is published. Review the data carefully before
publishing the Schema, because you cannot make any breaking changes after it is published.
You can add columns without issue; however, changing a column type from Text to Currency,
for example, is a breaking change.
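
The distinction between additive and breaking Schema changes can be sketched as a small check. This is a minimal illustration under stated assumptions: the Schema is modeled as a plain mapping of column name to type name, the function name is hypothetical, and treating a removed column as breaking is an assumption that the text above does not state.

```python
from typing import Dict, List


def breaking_changes(published: Dict[str, str], proposed: Dict[str, str]) -> List[str]:
    """Describe each breaking change between a published and a proposed schema.

    Both schemas are hypothetical mappings of column name -> column type.
    """
    problems = []
    for name, old_type in published.items():
        if name not in proposed:
            # Assumption: dropping a published column counts as breaking.
            problems.append(f"column removed: {name}")
        elif proposed[name] != old_type:
            problems.append(f"type changed: {name} ({old_type} -> {proposed[name]})")
    # Columns that appear only in `proposed` are additions, which are allowed.
    return problems


published = {"title": "Text", "price": "Text"}
with_new_column = {"title": "Text", "price": "Text", "sku": "Text"}
with_type_change = {"title": "Text", "price": "Currency"}

print(breaking_changes(published, with_new_column))   # []
print(breaking_changes(published, with_type_change))  # ['type changed: price (Text -> Currency)']
```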

There are several pages you can access to view the import status or check the data. The Import
Status page enables you to see the import pipeline stages. For each stage, you can view Start
and Finish times along with the Progress, Errors, and Completion. You can also access the
Checks page to determine the validity of the data. If an alert indicates that a certain number
of rows is expected for a particular run but significantly fewer rows are output, you can
view this page to troubleshoot and identify any data issues.
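
The kind of row-count alert described above can be approximated as follows. This is a sketch only: the function name and the 20% tolerance are assumptions chosen for illustration, and the actual checks and thresholds are whatever is configured in the DOC environment.

```python
def row_count_alert(expected_rows: int, actual_rows: int, tolerance: float = 0.2) -> bool:
    """Return True when actual_rows falls short of expected_rows by more than tolerance.

    `tolerance` is a hypothetical fraction (0.2 means a 20% shortfall is accepted).
    """
    if expected_rows <= 0:
        # Nothing was expected, so there is no shortfall to report.
        return False
    shortfall = (expected_rows - actual_rows) / expected_rows
    return shortfall > tolerance


print(row_count_alert(1000, 950))  # False: 5% below the expected count
print(row_count_alert(1000, 400))  # True: 60% below the expected count
```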

After import, a Download icon appears at the top right of the Snapshots page. Upon selection,
you can choose from various file types. You can download this Snapshot file to further evaluate
the extracted data or perform troubleshooting.

Retry Snapshot

If a Snapshot import fails, the JQ transform for custom files changes or fails, or the runtime
configuration of a Snapshot changes, you might need to retry certain stages of the Snapshot lifecycle.

The Snapshots page includes the ellipsis (More Options) icon at the top right of the page.
When you select it after a data import, four options appear. The table below describes each.

Option                     Purpose

Re-extract Snapshot        Perform this action if you have updated the Extractor.
                           The original Snapshot will assume a SUPERCEDED state,
                           and the new Snapshot will include both a new crawl run
                           and import. In addition, the Re-extract stage will be
                           added to the import pipeline of the new Snapshot.

Re-import Snapshot         Perform this action if a Snapshot fails to import prior
                           to the Generate Assets stage of the import pipeline. You
                           can re-import data if there are inconsistencies between
                           the Extractor and DOC Schemas. The original import will
                           assume a SUPERCEDED state, and a new Snapshot will
                           include both a new crawl run and import.

Re-run Snapshot            Perform this action if an entirely new crawl run is
                           necessary without a revision to an Extractor’s runtime
                           configuration. The original Snapshot will assume a
                           SUPERCEDED state, and a new Snapshot will
                           be created along with a new crawl run.

Regenerate Custom Assets   Generate Assets is the final stage of the import pipeline
                           for a Snapshot. If the Snapshot’s Collection has any
                           linked Destinations with a custom file type selected
                           (custom text or custom parquet), they will be created
                           during this stage. If the JQ transform associated
                           with a Snapshot fails for a custom file, the Snapshot
                           will fail to import (for example, status will be FAILED)
                           with a relevant error message. If a Snapshot fails
                           because of the Generate Assets stage, this stage can
                           be retried with the Regenerate Custom Assets option.
                           This action will only retry the Generate Assets stage.
                           If the regeneration is successful, a Snapshot can
                           continue to QA to be pushed.
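
As a rough aid for choosing among the four options, the guidance in the table can be written as a small decision function. This is a hedged sketch: the function, its parameters, and the order in which the conditions are checked are assumptions, since the table does not rank the options, and "EARLIER_STAGE" is a placeholder rather than a real pipeline stage name.

```python
from enum import Enum
from typing import Optional

GENERATE_ASSETS = "Generate Assets"  # final pipeline stage named in the table


class RetryOption(Enum):
    RE_EXTRACT = "Re-extract Snapshot"
    RE_IMPORT = "Re-import Snapshot"
    RE_RUN = "Re-run Snapshot"
    REGENERATE = "Regenerate Custom Assets"


def choose_retry_option(
    extractor_updated: bool = False,
    failed_stage: Optional[str] = None,
    needs_new_crawl: bool = False,
) -> Optional[RetryOption]:
    """Map a situation to the retry option that the table suggests.

    The order of the checks is an assumption made for this sketch.
    """
    if failed_stage == GENERATE_ASSETS:
        return RetryOption.REGENERATE
    if extractor_updated:
        return RetryOption.RE_EXTRACT
    if failed_stage is not None:
        # Any failure before the Generate Assets stage.
        return RetryOption.RE_IMPORT
    if needs_new_crawl:
        return RetryOption.RE_RUN
    return None


print(choose_retry_option(failed_stage=GENERATE_ASSETS).value)  # Regenerate Custom Assets
print(choose_retry_option(extractor_updated=True).value)        # Re-extract Snapshot
print(choose_retry_option(failed_stage="EARLIER_STAGE").value)  # Re-import Snapshot
print(choose_retry_option(needs_new_crawl=True).value)          # Re-run Snapshot
```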