Data Operations Center

[Overview diagram]

Data Operations Center (DOC), formerly Workbench, is a layered platform that functions
as a data delivery pipeline. Extracted data is imported into this environment, where it is
downloaded, transformed and formatted per customer specifications, and tested.
The data is also analyzed to ensure its validity. After the data has been verified, it is delivered
to the customer at a preferred Destination.

The DOC environment provides customers with an overview of their data, including its quality and
completeness. In addition, DOC enables the Delivery team to perform its duties while pushing data
to the customer Destination in a standardized manner.

Internal staffers and some external customers have access to the DOC platform. There are several
primary personas:

System Role

  • System Admin

Organization User Role

  1. Admin

  2. Ops

  3. Schema Admin

  4. Member

  5. Source Engineer

Each persona has access only to certain DOC features, providing an additional layer of security.
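The persona-to-feature mapping above can be sketched as a simple lookup table. This is an illustrative sketch only: the role constants follow the naming used later in this document (ORG_ADMIN, OPS, ORG_MEMBER), but the feature names are hypothetical, not DOC's actual API.

```python
# Illustrative sketch of persona-based feature gating in DOC.
# Role names follow this document; feature names are assumptions.
ROLE_FEATURES = {
    "SYSTEM_ADMIN":    {"manage_system", "manage_orgs"},
    "ORG_ADMIN":       {"manage_users", "lock_project", "edit_source"},
    "OPS":             {"lock_project", "edit_source", "deliver_data"},
    "SCHEMA_ADMIN":    {"edit_schema"},
    "ORG_MEMBER":      {"view_data"},
    "SOURCE_ENGINEER": {"edit_source"},
}

def can_use(role: str, feature: str) -> bool:
    """Return True if the given persona may use the feature."""
    return feature in ROLE_FEATURES.get(role, set())
```

An unknown role resolves to an empty feature set, so access checks fail closed by default.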

The DOC platform has multiple layers:


Organization

The company or entity requesting the data extraction.
An Organization may comprise multiple users.

Project

A group of one or more Collections; these Collections
contain Extractors. Projects may share the same name,
but identically named Projects must each be associated
with a different Environment (DEV, STAGING, or
PRODUCTION). Projects can be locked, restricting write
and edit operations in each subsequent DOC layer (such
as Collections, Sources, and Snapshots). You will not
have Project access unless you are an ORG_ADMIN, OPS,
or assigned to work on a specific layer of the Project.
For example, if your role is ORG_MEMBER and you are
attempting to edit a Source, you cannot make changes
unless that Source is assigned to you.

Collection

A group of Sources (or Extractors) that adheres to a
particular Schema; a Collection is the mechanism for grouping Sources.

Source

The Extractor or web crawling tool. A Source maps to an Extractor ID.

Snapshot

The extracted data or output. A Snapshot maps to a crawl run.

Each layer has a one-to-many relationship. One Organization may have many Projects.
One Project may have many Collections. One Collection may have many Sources.
One Source may have many Snapshots.
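The one-to-many hierarchy described above can be modeled with nested records, each layer holding a list of its children. This is a minimal sketch of the structure, not DOC's actual data model; field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    """The extracted data or output; maps to one crawl run."""
    snapshot_id: str

@dataclass
class Source:
    """The Extractor or web crawling tool; maps to an Extractor ID."""
    extractor_id: str
    snapshots: list = field(default_factory=list)

@dataclass
class Collection:
    """A group of Sources adhering to a particular Schema."""
    name: str
    sources: list = field(default_factory=list)

@dataclass
class Project:
    """A group of Collections, tied to one Environment."""
    name: str
    environment: str  # DEV, STAGING, or PRODUCTION
    locked: bool = False
    collections: list = field(default_factory=list)

@dataclass
class Organization:
    """The company or entity requesting the data extraction."""
    name: str
    projects: list = field(default_factory=list)
```

Walking from an Organization down to its Snapshots traverses each one-to-many link in turn: projects, then collections, then sources, then snapshots.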

Although not part of the DOC layered structure, Schemas and Destinations are critical to the platform.
A Schema is a set of inputs (column names) that forms a subset of an Extractor's fields. A Destination is the
delivery location for a customer's extracted data, such as an S3 bucket or Azure storage. A Destination is
defined at the Organization level.
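Because a Schema is essentially a set of column names, a delivery step could validate a Snapshot's columns against the Schema before pushing data to the Destination. The check below is an assumed illustration of that idea, not DOC's actual implementation; the example column names are hypothetical.

```python
def missing_columns(schema: set, snapshot_columns: set) -> set:
    """Columns required by the Schema but absent from the Snapshot."""
    return schema - snapshot_columns

# Hypothetical example: a Schema expecting three columns,
# checked against the columns present in one Snapshot.
schema = {"product_name", "price", "url"}
snapshot_cols = {"product_name", "url", "image"}
# missing_columns(schema, snapshot_cols) -> {"price"}
```

A non-empty result would flag the Snapshot as incomplete relative to its Schema before delivery.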

As the Import.IO team continues to enhance its flagship applications, it is important
to note that the current Extractor will be replaced by a new CLI tool: Extractor Studio. This more
technical tool will allow greater data extraction efficiency: because it is code-driven, Extractors
can be built at a larger scale. Having observed how customers use Extractors, the Import.IO team
developed the CLI tool (in part) to align with that behavior. It cannot be overstated that the
transition from the current Extractor to the CLI tool will occur gradually and will not immediately
impact customers who use the existing SaaS product.