Data Operations Center

[Figure: overview diagram]

Data Operations Center (DOC), formerly Workbench, is a layered platform that functions
as a data delivery pipeline. Extracted data is imported into this environment, where it is
downloaded, transformed and formatted per customer specifications, and tested.
The data is also analyzed to ensure its validity. After the data has been verified, it is delivered
to the customer at a preferred Destination.
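
The flow described above (format per customer specifications, check validity, then deliver) can be
sketched as a small pipeline. This is only an illustration under stated assumptions: the function
names, the "no empty values" validity rule, and the deliver callback are hypothetical and are not
part of DOC.

```python
from typing import Callable

# A record is one row of extracted data: column name -> value.
Record = dict[str, str]


def format_records(records: list[Record], columns: list[str]) -> list[Record]:
    """Keep only the requested columns, in the requested order (missing values become "")."""
    return [{col: rec.get(col, "") for col in columns} for rec in records]


def find_invalid(records: list[Record]) -> list[Record]:
    """Return the records that contain at least one empty value (assumed validity rule)."""
    return [rec for rec in records if any(value == "" for value in rec.values())]


def run_pipeline(
    records: list[Record],
    columns: list[str],
    deliver: Callable[[list[Record]], None],
) -> int:
    """Format, validate, and deliver records; return the number of records delivered."""
    formatted = format_records(records, columns)
    invalid = find_invalid(formatted)
    if invalid:
        # Nothing is delivered unless every record passes validation.
        raise ValueError(f"{len(invalid)} record(s) failed validation")
    deliver(formatted)
    return len(formatted)


if __name__ == "__main__":
    delivered: list[Record] = []  # stands in for a customer Destination
    raw = [{"name": "Widget", "price": "9.99", "debug": "x"}]
    print(run_pipeline(raw, ["name", "price"], delivered.extend), delivered)
```

In this sketch, delivery happens only after every record passes the check, which mirrors the
verify-then-deliver order described above.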

The DOC environment provides customers with an overview of their data, including its quality and
completeness. In addition, DOC enables the Delivery team to perform its duties while pushing data
to the customer Destination in a standardized manner.

Internal staffers and some external customers have access to the DOC platform. There are several
primary personas:

System Role

  • System Admin

Organization User Role

  1. Admin

  2. Ops

  3. Schema Admin

  4. Member

  5. Source Engineer

Each persona has access only to certain DOC features, which provides an additional layer of security.

The DOC platform has multiple layers:

Layer         Purpose
------------  ----------------------------------------------------------------------

Organization  Company or entity requesting the data extraction.
              An Organization may comprise multiple users.

Project       A group of one or more Collections. These Collections
              contain Extractors. Projects can share the same name,
              provided that each identically named Project is
              associated with a different Environment
              (DEV, STAGING, or PRODUCTION). For now, the
              Environment references generally function as tags
              or labels. Projects can be locked, which restricts write
              and edit operations in each DOC layer beneath them
              (such as Collections, Sources, and Snapshots). You will
              not have Project access unless you are an ORG_ADMIN,
              OPS, or assigned to work on a specific layer
              of this Project. For example, if your role is ORG_MEMBER
              and you attempt to edit a Source, you cannot make
              changes unless that Source is assigned to you.

Collection    A group of Sources (or Extractors) that adheres to a
              particular Schema. A Collection is the means of grouping Sources.

Source        The Extractor or web crawling tool. A Source maps to an Extractor ID.

Snapshot      The extracted data or output. A Snapshot maps to a crawl run.
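
The Project rules in the table above (identically named Projects need different Environments, a
locked Project blocks edits, and non-privileged users can edit only what is assigned to them) can
be sketched as follows. This is a minimal illustration, not DOC's implementation: the class and
function names are hypothetical, and the sketch assumes that a lock blocks edits for every role,
which the text above does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Environment(Enum):
    DEV = "DEV"
    STAGING = "STAGING"
    PRODUCTION = "PRODUCTION"


class OrgRole(Enum):
    # Only the roles named in the Project rules are modeled here.
    ORG_ADMIN = "ORG_ADMIN"
    OPS = "OPS"
    ORG_MEMBER = "ORG_MEMBER"


@dataclass
class ProjectInfo:
    """Hypothetical record holding just the Project fields the rules need."""
    name: str
    environment: Environment
    locked: bool = False


def find_name_conflicts(projects: list[ProjectInfo]) -> list[tuple[str, Environment]]:
    """Return each (name, Environment) pair that is used by more than one Project."""
    seen: set[tuple[str, Environment]] = set()
    conflicts: list[tuple[str, Environment]] = []
    for project in projects:
        key = (project.name, project.environment)
        if key in seen and key not in conflicts:
            conflicts.append(key)
        seen.add(key)
    return conflicts


def can_edit_source(
    role: OrgRole,
    user_id: str,
    project: ProjectInfo,
    source_assignees: set[str],
) -> bool:
    """Decide whether a user may edit a Source that lives under the given Project."""
    if project.locked:
        # Assumption: a locked Project blocks edits regardless of role.
        return False
    if role in (OrgRole.ORG_ADMIN, OrgRole.OPS):
        return True
    # Other roles can edit only a Source that is assigned to them.
    return user_id in source_assignees
```

For example, an ORG_MEMBER who is not among a Source's assignees is refused, while the same user
is allowed once the Source is assigned to them and the Project is unlocked.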

Each layer has a one-to-many relationship with the layer beneath it. One Organization may have
many Projects. One Project may have many Collections. One Collection may have many Sources.
One Source may have many Snapshots.
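
The one-to-many nesting can be pictured as a set of nested records. The sketch below is only a
mental model: the class and field names are assumptions chosen for illustration and do not
reflect DOC's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    crawl_run_id: str  # a Snapshot maps to a crawl run


@dataclass
class Source:
    extractor_id: str  # a Source maps to an Extractor ID
    snapshots: list[Snapshot] = field(default_factory=list)


@dataclass
class Collection:
    name: str
    sources: list[Source] = field(default_factory=list)


@dataclass
class Project:
    name: str
    collections: list[Collection] = field(default_factory=list)


@dataclass
class Organization:
    name: str
    projects: list[Project] = field(default_factory=list)


def count_snapshots(org: Organization) -> int:
    """Walk every layer from the Organization down and count its Snapshots."""
    return sum(
        len(source.snapshots)
        for project in org.projects
        for collection in project.collections
        for source in collection.sources
    )


if __name__ == "__main__":
    source = Source("extractor-1", [Snapshot("run-1"), Snapshot("run-2")])
    org = Organization("Acme", [Project("retail", [Collection("products", [source])])])
    print(count_snapshots(org))
```

Each class holds a list of the layer beneath it, so reaching a Snapshot always means passing
through its Source, Collection, and Project.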

Although they are not part of the DOC layered structure, Schemas and Destinations are critical to
this platform. A Schema is a set of inputs (column names) that is a subset of an Extractor's columns.
A Destination is the delivery location of a customer’s extracted data, such as an S3 bucket or Azure.
A Destination is defined at the Organization level.
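
One way to read the Schema description is that every Schema column must also appear among the
Extractor's columns. The check below sketches that reading; the function names are hypothetical,
and the interpretation itself is an assumption rather than documented DOC behavior.

```python
def missing_columns(schema_columns: list[str], extractor_columns: list[str]) -> list[str]:
    """Return the Schema columns that the Extractor does not provide, in Schema order."""
    available = set(extractor_columns)
    return [col for col in schema_columns if col not in available]


def conforms_to_schema(schema_columns: list[str], extractor_columns: list[str]) -> bool:
    """True when the Schema columns are a subset of the Extractor's columns."""
    return not missing_columns(schema_columns, extractor_columns)


if __name__ == "__main__":
    schema = ["name", "price"]
    extractor = ["name", "price", "url"]
    print(conforms_to_schema(schema, extractor), missing_columns(["name", "sku"], extractor))
```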

As the Import team continues to enhance the flagship applications, it is important
to note that the current Extractor will be replaced by a new CLI tool: Extractor Studio. This tool is
more technical to use, but it will make data extraction more efficient because it is code-driven and
allows Extractors to be built on a larger scale. Having observed how customers use Extractors, the
Import team developed the CLI tool (in part) to align with customer behavior. The transition from the
current Extractor to the CLI tool will occur gradually and will not immediately impact customers who
use the existing SaaS product.