Chained Flows

This Flow type is the most complex because it involves two or more chained Extractors, so named
because one depends on the output of another. Depending on the amount of website data, two or more
extractions might be required to retrieve the complete data set. For example, Extractor 1 might
search for a list of books and categories (performing an extraction for each page of search results)
and output book names and categories. Extractor 2 requires the output from Extractor 1 in order to
run: it takes the search results, accesses each page, and retrieves product details such as name,
product URL, rank, price, and availability. The Extractors are chained so that when Extractor 1
finishes, Extractor 2 starts, using the output from Extractor 1 as its input.

Chained Flows are commonly used to perform a search and then provide related product details, a task
that is difficult to accomplish with a single Extractor. Chained Flows are organized into Segments,
each of which corresponds to a set of Extractors in the chain. Each Segment is attached to a
Collection, which may comprise several Sources.

  1. From the left navigation pane, click Flows.

  2. From the top right of the Flows page, click the Add a Flow icon or plus (+) symbol.

  3. From the New Flow page, enter text in the Name field.

    The name you enter autofills the Slug/ID field. Slug/IDs are self-defined identifiers associated
    with many DOC platform objects and can be useful when you reference APIs or create variable names.
    You cannot change a Slug/ID after it is created, so ensure the one you enter is meaningful.
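
    For example (hypothetical), naming the Flow Retail Books Chained might generate a Slug/ID along
    the lines of retail-books-chained; the exact format the platform generates may differ.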

  4. Use the drop-down arrow to select a Type.

    You may choose from the Legacy, Simple, and Chained Flow types; for this procedure, select Chained.

  5. Leave selected (the default) or clear the Active checkbox.

    When selected, this checkbox ensures Deliveries run as scheduled, per the Cron Expression that
    becomes visible after you adjust the toggle switch: Do you want to add scheduling? The Active
    checkbox is selected by default. If you clear it, no Deliveries run, even if a Cron Expression
    is specified. Clearing the Active checkbox lets you maintain the Flow without running scheduled
    Deliveries; in that case, you can run the Flow manually (as needed) by clicking the Run Flow
    button at the top right of that Flow's page.

Segment 1

  1. Use the drop-down arrow to select a Collection.

    A Collection is a group of Sources (or Extractors) that adheres to a particular Schema; it is the
    mechanism by which Sources are grouped. Select the Collection that contains the Extractors whose
    data you want to send to the customer Destination.

  2. Enter a value in the Hours to Collect Data field.

    This value represents the time period (from start to finish) required to retrieve the extracted data.
    This value is the crawl run completion timeframe and serves as the entire collection window for all
    Snapshots in a Delivery.

  3. Enter a value in the Chunk Collect Hours field.

    This value represents the timeframe within which you want each chunk to run.

Segment 2

  1. Enter text in the JQ Input Transform field.

    This field allows you to enter a JQ expression that adjusts the format of the extracted data, or
    the output from Segment 1, before it is used as input. In the example below, the categoryUrl value
    (defaulting to an empty string if it is missing) and the category value from Segment 1's output are
    mapped to URL and category fields.

    {URL: .categoryUrl// "", category: .category}
  2. Enter text in the JQ Deduplication Transform field.

    This field allows you to enter a JQ expression as a check or validation to ensure there is no
    duplicate data.
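
    A minimal hypothetical sketch, assuming duplicates are identified by projecting each record down to
    the field that should be unique (productUrl here is illustrative, not taken from an actual Schema):

    {productUrl: .productUrl}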

  3. Use the drop-down arrow to select a Collection.

    A Collection is a group of Sources (or Extractors) that adheres to a particular Schema; it is the
    mechanism by which Sources are grouped. Select the Collection that contains the Extractors whose
    data you want to send to the customer Destination.

  4. Enter a value in the Hours to Collect Data field.

    This value represents the time period (from start to finish) required to retrieve the extracted data.
    This value is the crawl run completion timeframe and serves as the entire collection window for all
    Snapshots in a Delivery.

  5. Enter a value in the Chunk Collect Hours field.

    This value represents the timeframe within which you want each chunk to run.

Source Filters

Adjust, as needed, the toggle switch: Do you want to filter the sources included in this Flow?

Sliding the toggle switch to the right, or ON, position displays the default Locale and Domain
Parameters along with any other Source Parameters you defined. This allows you to filter data based
on the Source Parameters you established for the Collection. For example, you may want only the
Locale=en_us Extractors to run for the Delivery. A backend query selects the Sources that meet the
specified criteria for inclusion in the Delivery. Parameters are not required.

QA Checks

Adjust, as needed, the toggle switch: Do you want to skip QA checks for this Flow?

Sliding the toggle switch to the right, or ON, position disregards any QA checks. Occasionally,
there are crawl runs where you collect data but want to skip QA review. In this circumstance,
when Snapshots import, they automatically transition from Passed_QA to Push to customer.
Skipping QA checks may be preferred when internally testing Flows or Extractors, where the goal is
simply to review the data and see how closely it aligns with specifications. In other cases,
customers may simply want to review raw data.

S3 Configuration / Input Transformation

  1. Enter text in the fields specific to Do you want to add an S3 configuration (and chunks)?

    For Chained Flows, this toggle switch is disabled and set to the ON position because the S3-related
    fields are required. Adding an S3 configuration allows you to specify the inputs used for crawl runs
    and does not require that you use chunking. However, inputs from S3 are required if you do want to
    chunk Snapshots.

    The S3 configuration identifies the file location of the customer inputs. These inputs specify the
    URLs where crawl runs are performed and data is extracted. Customer inputs also may provide other
    specifications that identify exactly what data should be retrieved. Chunks are used to group data
    into smaller sets, which is common when using chained Extractors. The file inputs are used by
    Segment 1.

    Example (Input Transform): Below, the customer provides an input file that is used by Segment 1.
    The input URL is amazon.com. Segment 1 performs an extraction and produces two outputs:
    books, amazon.com/books and games, amazon.com/games. The JQ transform takes each categoryUrl
    (for example, amazon.com/books) and category (books) and uses that output as the input for
    Segment 2, which performs an extraction, loops through the Amazon book and game data, and
    produces a list of books and games. This extraction produces additional detail (based on the
    defined Schema), such as the name, product URL, rank, price, and availability. The final output
    is a list of books and games, which could comprise hundreds of rows, each row containing
    product-specific detail.

    File
    URL: amazon.com
    Input --> Segment 1 --> Output  books, amazon.com/books
                                    games, amazon.com/games
    JQ Transform
    URL: amazon.com/books, category: books
    Segment 2
    Input --> Segment 2 --> Output  book1, book2
    FINAL
    books, amazon.com/books, book1, 1
    books, amazon.com/books, book2, 2
    games, amazon.com/games, game1, 1
  2. Enter the S3 Bucket, Path Template, Filename, and File Type.

    These responses designate the file location of customer inputs. As such, you must identify a
    Bucket, Path Template, Filename, and File Type (.json or .csv). Available template variables and
    date/time formats allow you to customize these values further.

    The Bucket Name represents the AWS S3 cloud storage location. The Path Template may represent
    a folder you created in the AWS S3 environment along with a concatenation of Source Parameter
    names and template variables. For example:

    workbench-dev-assets represents an S3 Bucket.
    :source.collection/:YYYY/:WW/:source represents a Path Template.
    inputs represents a Filename.
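
    As a hypothetical illustration (the exact expansion of template variables depends on the platform),
    a Source named amazon-us in a Collection named books, running in week 05 of 2024, might resolve to
    an input file path such as:

    workbench-dev-assets/books/2024/05/amazon-us/inputs.json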

    You can make your S3 templates more readable by adding curly braces:

    s3://eds-scrape-pool/scrapers/import/production/outgoing/master_lists/srd/:source.country/MasterList_:source.input_name_:source.country_FullTextReviews.csv

    VS

    s3://eds-scrape-pool/scrapers/import/incoming/BAU/srd_01/:{source.frequency}/:{source.country_code}/:{YYYY}/:{MM}/:{DD}/:{source.dest_prefix}.:{source.output_name}.:{YYYY}-:{MM}-:{DD}T00_00_:{version}.csv
  3. Enter a response in the field: Do you want to specify an input transform?

    An input transform allows you to enter a JQ expression that adjusts the input format. If, for
    example, a customer's inputs require reformatting, you can write a JQ expression that reformats
    the data so the issue does not cause the extraction to fail. In another scenario, if inputs are
    uploaded as RPCs but should instead be Product IDs, you can write a JQ expression to correct the
    input format.
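
    A minimal hypothetical sketch for the second scenario, assuming each input row carries the correct
    value under an rpc key that simply needs to be submitted as productId (the key names are
    illustrative):

    {productId: .rpc}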

  4. Enter an Access Key ID and a Secret Access Key.

    These references, similar to passwords, are your S3 credentials and were likely provided
    by a member of your IT department. The text you enter will be encrypted.

  5. Enter a Chaining Parameter.

    As noted previously, the Chained Flow type involves two or more chained Extractors, one dependent
    on the other. Depending on the amount of website data, some Extractors require two or more
    extractions for complete data retrieval. Segment 1 runs first, and Segment 2 requires the output
    from Segment 1. The Chaining Parameter is what links Segment 1 and Segment 2, establishing an
    association and indicating which Sources are tied together; it is the glue that connects the two
    Segments. You make this association by creating Parameters, adding them to the Collection, and
    assigning a value to each.

    For ease of association, it is a best practice to name the Parameter Chain. If you are performing
    a crawl run that involves a store or business, you might indicate Chain=CVS. As Segment 2 looks for
    Segment 1, this Parameter name helps find the match. Be certain to avoid typos when chaining
    Parameters: if one Segment cannot find the other because, for instance, the Parameter name changed
    or was misspelled, the link is broken and the Delivery data will not be valid.

Chunk Information

  1. Enter a value in the Number of Chunks field.

    Chunking involves separating large amounts of data into smaller groups to facilitate processing
    and crawl runs. This value represents the number of smaller groups, or chunks, of data. If you have
    1 million inputs, for example, you might have 10 chunks (Snapshots) running in parallel, each
    containing 100,000 inputs. You must load these inputs into an S3 bucket or the customer's preferred
    input Destination. Note that these chunks, or child Snapshots, have a parent level.

  2. Enter a value in the Chunk Collect Hours field.

    This value represents the timeframe within which you want each chunk to run.

  3. Enter a value in the Chunk Push Hours field.

    This value represents the timeframe from crawl run start to Destination push.

  4. Enter a value in the Chunk Timeout Hours field.

    This value represents the maximum time a chunk may run before the crawl run is canceled. If a
    crawl run does not complete within the timeframe you designate, it is canceled; only data collected
    during this timeframe is imported, and all other data is disregarded. The value you enter here also
    displays in the Must Finish By field on the Delivery Snapshot page; a value in that field therefore
    indicates that Timeout Hours exist. Unlike similar values entered on the Flows page, which
    primarily have SLA significance, the value you enter here is enforced and has consequences.

  5. Adjust, as needed, the toggle switch: Do you want to aggregate the chunks to be pushed for this Flow?

    Sliding the toggle switch to the right, or ON, position combines the chunks of data at the parent
    level before they are pushed to the customer. Each chunk (Snapshot) has a parent. When a child
    Snapshot finishes, its state is Completed. Once the last Snapshot completes, all chunks aggregate
    at the parent level and are pushed to the Destination as a single file. Enabling this toggle switch
    reveals the Aggregate Chunks and Retain Original Chunk File Pushes checkboxes.

  6. Select or clear the Aggregate Chunks checkbox.

    You must select this checkbox if you want to combine the chunks (Snapshots) at the parent level
    and push to the Destination as a single file.

  7. Select or clear the Retain Original Chunk File Pushes checkbox.

    You must select this checkbox if you do not want to combine the chunks (Snapshots) but, instead,
    want to maintain the data chunks and push them to the Destination as separate groups of data.

  8. Adjust, as needed, the toggle switch: Do you want to add scheduling?

    Sliding the toggle switch to the right or ON position triggers display of the Minutes, Hours,
    Day (Month), Month, and Day (Week) fields, which are visible in the Cron Expression section.
    Each has specific entry criteria, evident as you access each field.
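
    For example, assuming standard cron semantics, entering 0 for Minutes, 6 for Hours, and 1 for
    Day (Week), with an asterisk (*) for Day (Month) and Month (that is, 0 6 * * 1), schedules the
    Delivery for 06:00 every Monday.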

Collection Information

The content you enter in this section is used primarily for Service Level Agreement (SLA) purposes.
As such, if the targets associated with the values you enter in the fields below are not met, there
is no significant impact. For example, if you enter “1” in the Hours to Collect Data field and the
crawl run does not complete within that time period, this may be noted on the Delivery Snapshot
page; however, there is no other consequence. In general, certain timelines and metrics are tracked
in the DOC environment. This information appears on Dashboards and is evaluated by internal staff.

  1. Enter a value in the Hours to Collect Data field.

    This value represents the time period (from start to finish) required to retrieve the extracted data.
    This value is the crawl run completion timeframe and serves as the entire collection window for all
    Snapshots in a Delivery.

  2. Enter a value in the Hours from Start to Destination Push field.

    This value represents the timeframe from crawl run start to push to Destination. When a Snapshot
    transitions from Passed_QA to Pushed or Completed, it triggers a Destination push which moves
    customer files to the specified Destination. On the Delivery Snapshot page, a selected
    Pushed in Window checkbox indicates that this timeframe was satisfied; in addition, the
    Collected in Window column indicates the percentage of data retrieved during this period.

  3. Enter a value in the Hours to Finish field.

    This value represents the total Delivery window, from extraction start to Destination push, for
    all Snapshots in the Delivery. Not every Snapshot begins at the same time; they normally are
    staggered to eliminate potential strain on the system. After this timeframe elapses, the Delivery
    is moved to a Closed state; no in-progress (in-flight) action is canceled.

  4. To share helpful information about the Flow with team members, enter text in the README section.

    This section, which supports Markdown syntax, allows you to provide additional Flow context and insight.

  5. To store content, click Save. To disregard, click Cancel.

Programming Message
A Snapshot's time limit is calculated when the Snapshot has a collectBy value. The time limit is
the difference (in minutes) between the current time and the collectBy time on the Snapshot, and it
is added to the crawl run object associated with the Snapshot. The msPerInput (how long each input
is expected to take in order to complete within the expected timeframe) is only calculated for a
crawl run on a Snapshot when a stopBy value is present on the Snapshot. The msPerInput for a crawl
run is calculated as the difference between the current time and the stopBy time on the Snapshot,
divided by the number of inputs. If the current time is after the stopBy (a Delivery is running
late), the value used to calculate msPerInput is based on the deliverBy value.
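
A worked example with assumed values: if a Snapshot's stopBy time is 2 hours away (7,200,000 ms) and
the crawl run has 120,000 inputs, msPerInput = 7,200,000 / 120,000 = 60 ms per input. If the current
time were already past stopBy, the same calculation would be made against the deliverBy time instead.
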
You cannot delete Flows.