Simple Flows

[Image: Simple Flow]

Unlike Legacy Flows, this Flow type automatically starts each Source associated with the Collection.
You then need only start or schedule the Deliveries, which trigger the crawl runs, making this
Flow type more convenient than Legacy. You can also cancel Simple Flows, which you cannot do with
Legacy Flows; for Simple Flows, canceling stops the Snapshot crawl runs in progress and does not
import their data. Simple Flows provide more options for customization and, as such, prompt you to
respond to additional questions during configuration. Simple and Chained are the most common Flow types.

  1. From the left navigation pane, click Flows.

  2. From the top right of the Flows page, click the Add a Flow icon or plus (+) symbol.

  3. From the New Flow page, enter text in the Name field.

    The name you enter autofills the Slug/ID field. Slug/IDs are self-defined identifiers that you
    can associate with many DOC platform objects, and they can be useful when you reference APIs or
    create variable names. Because you cannot change a Slug/ID after it is created, ensure it is
    meaningful.

  4. Use the drop-down arrow to select a Type.

    You may choose from the Legacy, Simple, and Chained Flow types.

  5. Leave the Active checkbox selected, or clear it.

    When selected, this checkbox ensures Deliveries run as scheduled, per the Cron Expression
    visible after you adjust the toggle switch: Do you want to add scheduling? The Active checkbox
    is selected by default. If you clear it, no Deliveries will run, even if a Cron Expression is
    specified. Clearing the Active checkbox lets you maintain the Flow without running scheduled
    Deliveries; in that case, you can still run the Flow manually, as needed, by clicking the
    Run Flow button at the top right of the page specific to that configured Flow.

  6. Use the drop-down arrow to select a Collection.

    A Collection is a group of Sources (or Extractors) that adheres to a particular Schema; it is
    the mechanism by which Sources are grouped. Choose the Collection that contains the Extractors
    whose data you want to send to the customer Destination.

Source Filters

Adjust, as needed, the toggle switch: Do you want to filter the sources included in this Flow?

Sliding the toggle switch to the right or ON position triggers display of the default Locale and
Domain Parameters along with any other Source Parameters you defined. This allows you to filter
the Sources included in the Delivery based on the Source Parameters you established for the
Collection. For example, you may want only the Locale=en_us Extractors to run for the Delivery.
A backend query selects the Sources that meet the specified criteria for Delivery inclusion.
Parameters are not required.
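
As a hypothetical illustration, if the Collection defines Locale and Domain Parameters, filtering
on Locale = en_us and Domain = example.com (sample values, not taken from any real environment)
would limit the Delivery to the Sources whose Parameters match both values.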

QA Checks

Adjust, as needed, the toggle switch: Do you want to skip QA checks for this Flow?

Sliding the toggle switch to the right or ON position bypasses any QA checks. Occasionally,
there are crawl runs where you collect data but want to skip QA review. In this circumstance,
when Snapshots import, they automatically transition from Passed_QA to Pushed and are sent to
the customer. Skipping QA checks may be preferred when internally testing Flows or Extractors,
as the only need may be to review this data to see how closely it aligns with specifications.
In other cases, customers may simply want to review the raw data.

S3 Configuration / Input Transformation

  1. Adjust, as needed, the toggle switch: Do you want to add an S3 configuration (and chunks)?

    Sliding the toggle switch to the right or ON position triggers display of S3 directory
    information and additional questions. Adding an S3 configuration allows you to specify inputs
    used for crawl runs and does not require that you use chunking. However, inputs from S3 are
    required if you do want to chunk Snapshots.

  2. Enter the S3 Bucket, Path Template, Filename, and File Type.

    These responses designate the file location of customer inputs. As such, you must identify
    a Bucket, Path Template, Filename, and a .json or .csv File Type. Available template variables
    and date/time formats allow further customization; a hypothetical example of how a template
    might resolve to a concrete S3 path appears after this list.

    The Bucket Name represents the AWS S3 cloud storage location. The Path Template may represent
    a folder you created in the AWS S3 environment along with a concatenation of Source Parameter
    names and template variables. For example:

    workbench-dev-assets represents an S3 Bucket.
    :source.collection/:YYYY/:WW/:source represents a Path Template.
    inputs represents a Filename.

    You can make your S3 templates more readable by adding curly braces. Without braces:

    s3://eds-scrape-pool/scrapers/import/production/outgoing/master_lists/srd/:source.country/MasterList_:source.input_name_:source.country_FullTextReviews.csv

    With braces:

    s3://eds-scrape-pool/scrapers/import/incoming/BAU/srd_01/:{source.frequency}/:{source.country_code}/:{YYYY}/:{MM}/:{DD}/:{source.dest_prefix}.:{source.output_name}.:{YYYY}-:{MM}-:{DD}T00_00_:{version}.csv

  3. Enter a response in the field: Do you want to specify an input transform?

    An input transform allows you to enter a JQ expression that adjusts the input format, and you
    can write JQ transform expressions for a number of purposes. If, for example, a customer’s
    inputs require reformatting, a JQ expression can reformat the data so that a format issue does
    not cause the extraction to fail. In another scenario, if inputs are uploaded as RPCs but should
    instead be Product IDs, a JQ expression can correct the input format. You can also use a JQ
    expression to filter data or specify product details. A sample expression appears after this list.

  4. Enter an Access Key ID and a Secret Access Key.

    These references, similar to passwords, are your S3 credentials and were likely provided by a
    member of your IT department. The text you enter will be encrypted.
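
As a hypothetical illustration of how a Path Template might resolve (the exact values depend on
your Collection and Sources), a configuration with Bucket workbench-dev-assets, Path Template
:source.collection/:YYYY/:WW/:source, Filename inputs, and File Type .csv could produce a path
similar to:

    s3://workbench-dev-assets/retail-products/2024/12/amazon-us/inputs.csv

where retail-products and amazon-us stand in for the Collection and Source Slug/IDs, and :YYYY and
:WW are assumed to resolve to the four-digit year and two-digit week of the run.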
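
As a sample input transform, the following JQ expression (hypothetical field names) renames an rpc
field to product_id in each input record while keeping the url field:

    map({product_id: .rpc, url: .url})

Whether a transform like this applies depends entirely on the shape of the customer’s inputs; treat
it as a sketch rather than a ready-made expression.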

Chunk Information

  1. Enter a value in the Number of Chunks field.

    Chunking involves separating large amounts of data into smaller groups to facilitate processing
    and crawl runs. This value represents the number of smaller groups or chunks of data. If you have
    1 million inputs, for example, you might have 10 chunks (Snapshots) running in parallel – each
    containing 100,000 inputs. You must load these inputs into an S3 bucket or the customer’s
    preferred input Destination. It is important to note that these chunks or child Snapshots
    have a parent level.

  2. Enter a value in the Chunk Collect Hours field.

    This value represents the timeframe within which you want each chunk to run.

  3. Enter a value in the Chunk Push Hours field.

    This value represents the timeframe from crawl run start to Destination push.

  4. Enter a value in the Chunk Timeout Hours field.

    This value represents the maximum chunking timeframe before crawl run cancellation. If a crawl
    run does not complete within the timeframe you designate, the crawl run is canceled; only the
    data collected during this timeframe is imported, and all other data is disregarded. The value
    you enter here also displays in the Must Finish By field on the Delivery Snapshot page, so a
    value in that field indicates Timeout Hours exist. Unlike similar values you enter on the Flows
    page, which primarily have SLA significance, the value you enter here is enforced and has
    consequences.

  5. Adjust, as needed, the toggle switch: Do you want to aggregate the chunks to be pushed for this Flow?

    Sliding the toggle switch to the right or ON position allows the chunks of data to be combined
    at the parent level before they are pushed to the customer. Each chunk (Snapshot) has a parent.
    When a child Snapshot finishes, its state is Completed; once the last Snapshot completes, they
    all aggregate at the parent level and are pushed to the Destination as a single file. Enabling
    this toggle switch reveals the Aggregate Chunks and Retain Original Chunk File Pushes checkboxes.

  6. Select or clear the Aggregate Chunks checkbox.

    You must select this checkbox if you want to combine the chunks (Snapshots) at the parent level
    and push them to the Destination as a single file.

  7. Select or clear the Retain Original Chunk File Pushes checkbox.

    You must select this checkbox if you do not want to combine the chunks (Snapshots) but, instead,
    want to maintain the data chunks and push them to the Destination as separate groups of data.

  8. Adjust, as needed, the toggle switch: Do you want to add scheduling?

    Sliding the toggle switch to the right or ON position triggers display of the Minutes, Hours,
    Day (Month), Month, and Day (Week) fields, which are visible in the Cron Expression section.
    Each has specific entry criteria, evident as you access each field.
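
As a hypothetical illustration of the chunk settings in steps 1 through 4 above, entering
Number of Chunks = 10, Chunk Collect Hours = 6, Chunk Push Hours = 8, and Chunk Timeout Hours = 12
(sample values only) would split the inputs across 10 child Snapshots, give each chunk 6 hours to
run and collect data and 8 hours from start to Destination push, and cancel any chunk still running
after 12 hours, importing only the data collected up to that point.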

[Image: Cron Expression fields]
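
As a hypothetical example, assuming the fields follow standard cron semantics, a Cron Expression
with Minutes = 0, Hours = 6, Day (Month) = *, Month = *, and Day (Week) = 1 (sample values only)
would schedule the Delivery for 6:00 AM every Monday.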

Collection Information

The content you enter in this section is used primarily for Service Level Agreement (SLA) purposes.
As such, if the targets associated with the values you enter in the fields below are not met,
there is no significant impact. For example, if you enter “1” in the Hours to Collect Data field
and the crawl run does not complete within this time period, this may be noted on the
Delivery Snapshot page; however, there is no other consequence. In general, certain timelines
and metrics are tracked in the DOC environment; this information appears on Dashboards and is
evaluated by internal staffers.

  1. Enter a value in the Hours to Collect Data field.

    This value represents the time period (from start to finish) required to retrieve the extracted data.
    This value is the crawl run completion timeframe and serves as the entire collection window for all
    Snapshots in a Delivery.

  2. Enter a value in the Hours from Start to Destination Push field.

    This value represents the timeframe from crawl run start to push to Destination. When a Snapshot
    transitions from Passed_QA to Pushed or Completed, it triggers a Destination push which moves
    customer files to the specified Destination. On the Delivery Snapshot page, a selected
    Pushed in Window checkbox indicates that this timeframe was satisfied; in addition, the
    Collected in Window column indicates the percentage of data retrieved during this period.

  3. Enter a value in the Hours to Finish field.

    This value represents the total Delivery window, from extraction start to Destination push, for
    all Snapshots in the Delivery. Not every Snapshot begins at the same time; they normally are
    staggered to reduce potential strain on the system. After this timeframe lapses, the Delivery is
    moved to a Closed state; no in-progress (in-flight) action is canceled.

  4. To share helpful information about the Flow with team members, enter text in the README section.

    This section, which supports Markdown syntax, allows you to provide additional Flow context and insight.

  5. To store content, click Save. To disregard, click Cancel.
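
As a hypothetical illustration of the SLA fields in steps 1 through 3 above, entering Hours to
Collect Data = 24, Hours from Start to Destination Push = 36, and Hours to Finish = 48 (sample
values only) would set the expectation that all Snapshots finish collecting within 24 hours, are
pushed to the Destination within 36 hours of starting, and that the entire Delivery closes within
48 hours; because these values are primarily SLA targets, missing them is noted on the Delivery
Snapshot page and Dashboards rather than enforced.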

Programming Message
A Snapshot's time limit is calculated when the Snapshot has a collectBy value. The time limit is
the difference between the current time and the collectBy time on the Snapshot (in minutes), and
it is added to the crawl run object associated with the Snapshot. The msPerInput (how long each
input is expected to take in order to complete within the expected timeframe) is only calculated
for a crawl run when a stopBy value is present on the Snapshot. The msPerInput for a crawl run is
the difference between the current time and the stopBy time on the Snapshot, divided by the number
of inputs. If the current time is after the stopBy time (the delivery is running late), the value
used to calculate msPerInput is based on the deliverBy value instead.
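
The following Python sketch restates that calculation under the assumptions above; the snapshot
dictionary, field names, and function names are hypothetical and are meant only to illustrate the
arithmetic, not the platform's actual implementation.

    from datetime import datetime, timezone

    def time_limit_minutes(snapshot, now=None):
        # Sketch: time limit in minutes, derived from the Snapshot's collectBy value.
        if snapshot.get("collectBy") is None:
            return None
        now = now or datetime.now(timezone.utc)
        return (snapshot["collectBy"] - now).total_seconds() / 60

    def ms_per_input(snapshot, input_count, now=None):
        # Sketch: expected milliseconds per input, derived from stopBy, or from
        # deliverBy when the current time is already past stopBy (running late).
        if snapshot.get("stopBy") is None:
            return None
        now = now or datetime.now(timezone.utc)
        target = snapshot["stopBy"] if now <= snapshot["stopBy"] else snapshot["deliverBy"]
        return (target - now).total_seconds() * 1000 / input_count
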
You cannot delete Flows.