Re-Extract Snapshot

At a high level, the Extraction process is two-fold:

  • Fetch the website page content.

  • Parse the page content to retrieve specific data.

Re-extracting a Snapshot is a manual process that re-executes only the second step: parsing the page
content to retrieve data. Because all initial page content is saved, you do not have to re-fetch the
page or make subsequent requests for this information, which saves time. A single extraction may, in
some cases, take in excess of 12 or 24 hours, given the data retrieval process coupled with
considerations such as proxy pools and captcha-solving. Re-extraction is therefore not only
time-efficient but also cost-effective.

You re-extract content from saved pages, so you already have access to the page content.
Re-extraction allows you to retrieve different information from that content, sometimes more than
the original extraction and sometimes less.
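The split between fetching and parsing can be illustrated with a minimal sketch. It is not the
platform's implementation: the function names, the sample HTML, and the column names are all
hypothetical, chosen only to show that a second parse with a different column set needs no second
fetch.

```python
import re

# Hypothetical saved page content; in practice this is whatever the fetch step stored.
SAVED_HTML = (
    '<tr><td class="symbol">ABC</td><td class="company">Example Corp</td>'
    '<td class="last_value">10.50</td></tr>'
    '<tr><td class="symbol">XYZ</td><td class="company">Sample Inc</td>'
    '<td class="last_value">7.25</td></tr>'
)

fetch_calls = 0  # counts how often the costly fetch step runs


def fetch_page(url):
    """Step 1 (fetch): stand-in for the slow network request; returns canned HTML."""
    global fetch_calls
    fetch_calls += 1
    return SAVED_HTML


def parse_page(html, columns):
    """Step 2 (parse): keep only the requested columns from each table row."""
    rows = []
    for row_html in re.findall(r"<tr>(.*?)</tr>", html, flags=re.S):
        cells = dict(re.findall(r'<td class="(\w+)">(.*?)</td>', row_html, flags=re.S))
        rows.append({name: cells.get(name, "") for name in columns})
    return rows


# Initial extraction: fetch once, then parse three columns.
saved = fetch_page("https://example.com/stocks")
first = parse_page(saved, ["symbol", "company", "last_value"])

# Re-extraction: parse the saved content again with fewer columns, without fetching.
second = parse_page(saved, ["symbol", "company"])
```

After both parses, `fetch_calls` is still 1, which is the source of the time and cost savings
described above.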

The Re-extract Snapshot process also allows you to run additional transforms to further format the data.
In addition, it allows you to retain the data history relative to a moment in time. Since the
initial data is saved, you can return to a previous date to retrieve other information. Historical
data is important, since many customers have an interest in analytics, forecasting, and trends.

Re-extraction is possible only when a Snapshot has been imported and the runtime configuration of
the subsequent Extractor differs from that of the previous one. You can change the runtime
configuration via the SaaS application, the CLI tool (by editing the extractor.yaml file, for
example), or Hades. You should change the runtime configuration in the environment where the
original Extractor was created, which is commonly the SaaS application or the CLI environment.
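The two conditions above can be expressed as a small check. This is only a sketch of the rule as
stated: how the platform actually detects a changed runtime configuration is not documented here,
so the fingerprint comparison, the function names, and the configuration keys are assumptions.

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a configuration dict in a key-order-independent way (hypothetical helper)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def can_reextract(snapshot_imported, snapshot_config, current_config):
    """Mirror the stated rule: the Snapshot is imported AND the configuration has changed."""
    changed = config_fingerprint(snapshot_config) != config_fingerprint(current_config)
    return snapshot_imported and changed


# Hypothetical configurations: the second one drops a column.
original = {"columns": ["Symbol", "Company", "Last Value"]}
edited = {"columns": ["Symbol", "Company"]}

print(can_reextract(True, original, original))  # unchanged configuration
print(can_reextract(True, original, edited))    # changed configuration
```

With an unchanged configuration the check returns False, which corresponds to the disabled
Re-extract Snapshot option; once the configuration differs, it returns True.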

SaaS Application

Below is a basic example of re-extraction that begins by using the SaaS application to build an
Extractor. Here, the MarketWatch website is accessed, and data from three columns is retrieved:
Symbol, Company, and Last Value.

Running this Extractor produces an Extractor ID and a Crawl Run ID:

snapshot reextract extractorA
snapshot reextract extractorA1

In the DOC environment, you can add this Extractor ID to the Source page and then run the Source.

snapshot reextract runsource

Next, perform the crawl run and import the data.

snapshot reextract disabled

These actions produce a Snapshot.

  1. From the Snapshot page, click the ellipsis or More Options icon at the top right of the page.

  2. In the list that appears, note that Re-extract Snapshot is disabled.

  3. Click snapshots in the path near the top center of the page to display the
    Demo Source Snapshots page. Here, you can view a list of Snapshots.

snapshot reextract demo source snapshots

To enable the Re-extract Snapshot option on the Snapshot page, you must change the runtime configuration.

First, return to the SaaS application and edit the Extractor by, for example, removing a column.
The Extractor now contains only the Symbol and Company columns. You may also provide a new
Extractor name. When you only Save these changes, the Extractor ID and Crawl Run ID remain the
same. When you Save the changes and Run the inputs (a common practice when testing), the Extractor
ID remains the same; however, the Crawl Run ID changes.

snapshot reextract extractorB
snapshot reextract extractorB1

  1. Return to the DOC environment, accessing the original Snapshot.

  2. From the Snapshot page, click the snapshot reference in the path near the top center of the page.

    This action displays the Demo Source Snapshots page and also serves as a page refresh.
    Confirm which Snapshot you want to re-extract.

  3. Click View at the start of the line for that Snapshot.

    snapshot reextract demo source snapshots2

  4. From the Snapshot page, click the ellipsis or More Options icon at the top right of the page.

  5. From the list that appears, select the now-enabled Re-extract Snapshot.

    A Confirmation message appears.

    snapshot reextract enabled

  6. Click OK to confirm this selection.

    snapshot reextract confirm

  7. Proceed with the Re-extract Snapshot process by completing the data import for the new runtime configuration.

snapshot reextract continue

If you re-extract or re-import a Snapshot, the state of the original Snapshot becomes SUPERCEDED.
In addition, a new Snapshot is created, and you are automatically redirected to it. SUPERCEDES
appears on the right of the Snapshot page along with an associated ID, which corresponds to the
numeric ID of the original Snapshot.

snapshot reextract demo source snapshots3
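The relationship between the original and the new Snapshot can be modeled as a simple state change
plus a back-reference. The sketch below is illustrative only: the class, the field names, and the
"ACTIVE" starting state are assumptions. Only the SUPERCEDED state and the SUPERCEDES ID come from
the behavior described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    id: int
    state: str = "ACTIVE"             # hypothetical name for the normal state
    supercedes: Optional[int] = None  # numeric ID of the Snapshot this one replaces


def reextract(original, new_id):
    """Mark the original as SUPERCEDED and return a new Snapshot that points back to it."""
    original.state = "SUPERCEDED"
    return Snapshot(id=new_id, supercedes=original.id)


old = Snapshot(id=101)
new = reextract(old, new_id=102)
print(old.state, new.supercedes)
```

Here `old.state` becomes "SUPERCEDED" and `new.supercedes` holds 101, the ID of the original
Snapshot.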

CLI

You can change the runtime configuration by modifying the extractor.yaml file or updating any Extractor
code (to include code specific to executing different actions).

Below is a sample extractor.yaml file, though from a different Extractor than previously referenced:

snapshot reextract cli

Here, you can modify the inputs, for example, to trigger a different configuration before deploying
this data to the DOC environment. Next, access the Snapshot and refresh the page, which might
involve first accessing the Demo Source Snapshots page, identifying the Snapshot, and clicking View
to select it. Finally, proceed with the Re-extract Snapshot process by selecting this option from
the list that appears after you click the ellipsis or More Options icon at the top right of the
Snapshot page.

Hades

You can change the runtime configuration by modifying the Extractor via Hades. Here, you can modify
field information, for example, to trigger a different configuration before deploying data to the DOC environment.

snapshot reextract hades1
snapshot reextract hades2

For more detailed Extractors, you can make additional runtime configuration changes. Below is code
associated with a different Extractor:

snapshot reextract hades3

Re-extracting a Snapshot is a useful process that not only provides flexibility but also promotes
efficiency, as you are able to update, fine-tune, and reformat Extractors.