Release Notes

Release v7.0.3 - Jan 11, 2021

Bug Fixes

  • Fixed a scenario where second-segment snapshots in a CHAINED Flow received the wrong collection windows, resulting in an incorrect query gap being set for crawl runs.

  • Retain parent/chunk relationships when a Snapshot aggregation is retried.

  • Fix email integration form displaying wrong information

  • Fixed issue where navigating to Deliveries from a selected Delivery resulted in deliveries missing from the list

  • Destination Push fixes

    • Make sure that failures are properly marked with FAILED instead of INTERNAL_FAILURE

    • Ensure that failed revert attempts receive status of REVERT_FAILED

    • Allow retrying a revert that failed

Release v7 - Jan 4, 2021

Generate Assets Stage

“Generate Assets” is a new stage for the import pipeline, and will be the final stage before the snapshot is ready to be reviewed by a user, or automatically pushed. If a custom destination is designated for any of the snapshot’s destinations, a custom file will be created and copied to DOC’s internal assets. Moving the custom asset generation to the import pipeline allows for the “Destination Push” process to be more efficient, and also allows the user to download a custom file (similar to downloading the JSON on a snapshot now from the home page) without a push taking place. A link will be made available in the “Download Snapshot” window for this.

IMPORTANT: Note that if you need to apply changes to a custom file after it has been imported you will need to "Regenerate Custom Assets" from the Snapshot Home page.

For more, see Snapshot Retry Options.

Delivery Alarm Severities

An Alarm Group’s and Delivery’s state will reflect the most severe Alarm state.

The priority is as follows (most severe first):

CRITICAL → ERROR → WARNING → INFO
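
The rollup described above can be sketched as picking the highest-ranked state among the alarms in a group. This is only an illustration of the ordering, not the application's implementation; the function name most_severe and the behavior for an empty list are assumptions.

```python
from typing import Iterable, Optional

# Severity order from the release note, most severe first.
SEVERITY_ORDER = ["CRITICAL", "ERROR", "WARNING", "INFO"]
_RANK = {name: index for index, name in enumerate(SEVERITY_ORDER)}


def most_severe(states: Iterable[str]) -> Optional[str]:
    """Return the most severe state, or None when there are no alarms (assumed)."""
    # A lower rank means a more severe state, so min() picks the worst one.
    return min(states, key=_RANK.__getitem__, default=None)


print(most_severe(["INFO", "ERROR", "WARNING"]))  # ERROR
```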

Pass/Revert all QA Checks

A checkbox is available on the QA Checks page which will allow you to PASS all QA checks or revert to their initial calculated value.

View History of Metric values

When performing QA Checks you now have the option to "Show Historical Values" which will fetch and display a history of values for that particular field.

Usability improvements

  • Removed eliding for breadcrumbs

  • Source list can now be filtered by parameter values for bulk editing

Bug fixes

  • Fixed double locale conversion during import stage where column typing was getting applied twice, causing incorrect data.

Release v6.1.0 - Dec 9, 2020

SOURCE_ENGINEER Role

The SOURCE_ENGINEER role can now be assigned to users of an Organization. This role restricts the creation and editing of most resources but allows:

  • Creating and Editing Sources

  • Snapshot operations

    • Running, re-running, re-importing, re-extracting

    • Passing/Failing QA Checks

    • Reverting/Retrying Destination Pushes

    • Commenting

  • Editing Source Filters on a flow

Byte Order Mark on Custom Output

You can now include a BOM header in your custom output. This is configured via a checkbox on the Collection Settings page.

View Accepted Checks

All tests which have manually been passed will be indicated with an orange checkmark. The "Initial Result" is also available on a QA Test.

Bug fixes

  • Performance improvements around Snapshot Stats Generation and Re Extract

Release v6.0.0 - Nov 17 2020

New Snapshot QA Checks UI

A redesigned look and feel of the Snapshot Checks page.

Bug fix

  • A failure to get snapshot inputs from S3 now fails immediately instead of retrying, so users quickly know that the inputs file is not available.

  • Fix sort order on "History" views

  • Faster re-extract performance

Release v5.5.0 - Nov 9 2020

Chunk Collect Hours on chained segments

"Chunk Collect Hours" can now be configured at the segment level, allowing more granular control of your "Collected on Time" metrics.

Delivery Alarms for Snapshot QA Checks

Alarms can now be configured using failedRequiredTests and failedOptionalTests as metrics

orgs/:orgSlug/snapshots/:snapshotId redirects to orgs/:orgSlug/projects/:projectSlug/collections/:collectionSlug/sources/:sourceSlug/snapshots/:snapshotId

orgs/:orgSlug/deliveries/:deliveryId redirects to orgs/:orgSlug/projects/:projectSlug/flows/:flowSlug/deliveries/:deliveryId

Bug Fixes

  • Fix for Snapshots not timing out properly on non chained Flows

  • Snapshot Home download button is now disabled when files are not available

Release v5.4.0 - Nov 6 2020

Source Params in input transforms

The PARAM() function is now available in input transforms and can be used to reference Source parameters. Example: %{ PARAM("foo") }%

Chunk Aggregation enhancements

Better performance during chunk aggregation. Bug fix for running out of space when aggregating files.

Bug fix

Fix for Snapshot import error "Job attempt duration exceeded timeout" during TestGeneration stage

Release v5.3.0 - Oct 29 2020

Source Deployment APIs

API support for deploying sources from the Extractor Studio CLI

Release v5.2.0 - Oct 27 2020

Optional Collection Checks

Collection checks can now be marked as required or optional to allow snapshots to "auto-push"

  • Required checkbox available on Collection Checks table

  • If a check is optional, its result will still be evaluated, but will not prevent a snapshot from pushing if it fails

  • Metrics for the counts of failed optional checks and failed required checks are available on the snapshot health and are included on the snapshot tables

Warning for saving flows

Usability improvement to provide a warning to users who are editing a flow, and prevent removal of source filters by accident. If a user tries to edit a flow that has source filters, and the source filters are empty, they will be prompted to confirm that this is intentional.

Support for positive/negative swing checks

Previously, PCT checks could only test the size of the deviation of a value between two snapshots, regardless of its direction. With this feature, placing a sign (+/-) before the target number lets the check take the direction of the deviation into account. For example, a PCT check of % Error <= +10% will no longer trigger if the change is -20%.
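
The difference between signed and unsigned targets can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's check engine: the function names are hypothetical, and the handling of a negative target (only decreases beyond the limit fail) is an assumed reading by symmetry with the positive case described above.

```python
def pct_change(previous: float, current: float) -> float:
    """Percentage change from the previous snapshot value to the current one."""
    return (current - previous) * 100.0 / previous


def swing_within_target(change_pct: float, target: str) -> bool:
    """Evaluate a '<=' PCT check against a target such as '10%', '+10%' or '-10%'."""
    text = target.strip().rstrip("%")
    limit = abs(float(text))
    if text.startswith("+"):
        # Signed positive target: only increases beyond the limit fail.
        return change_pct <= limit
    if text.startswith("-"):
        # Signed negative target (assumed semantics): only decreases beyond the limit fail.
        return change_pct >= -limit
    # Unsigned target: the direction of the change is ignored.
    return abs(change_pct) <= limit


change = pct_change(previous=100.0, current=80.0)  # a 20% drop
print(swing_within_target(change, "+10%"))  # True: the drop does not trigger the check
print(swing_within_target(change, "10%"))   # False: the unsigned check still triggers
```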

Bug Fixes

  • Performance improvement for failed chunk aggregation when files are very large

Release v5.1.0 - Oct 22 2020

New Collection Check Functions

Checks can now reference stats and health metrics

  • HEALTH() - References snapshot health values.

    • Example %{ IF(HEALTH("rows") > 100000, 2, 10) }%

  • METRIC() - References values contained in the stats file

    • Example %{ METRIC("data:field/name/stringPunctRatio/p05") }%

Re-extract improvements

  • PUSHED Snapshots can now be re-extracted

    • Removes data from Redshift Datalake

    • Supersedes snapshot and re-extracts

  • PUSHED Snapshots can also be transitioned back to QA

    • Removes data from the Redshift Datalake

    • Allows for re-extraction

Reverting pushed files from destinations is still a manual operation, which can be done by "reverting" the destination push.

Bug Fixes

  • Fix for application displaying duplicate Deliveries

  • Increased default limit of displayed collection checks to 1000 (previously 100).

Release v5.0.1 - Oct 12 2020

  • Increased memory for Import jobs

    • Reduce likelihood of Snapshot Imports failing during stats generation

  • Remove POST_PROCESSING from allowed Snapshot transitions

Release v5.0.0 - Sep 29 2020

  • Deduplication of inputs across chained chunks

    • As a chunked and chained Flow executes, inputs are guaranteed to be unique in each chunk.

    • For example: In a chained & chunked flow, segment 2 chunk 2 will not run any inputs that are in any of the sister chunks in segment 2.

    • A separate JQ Deduplication Transform can be provided which will allow you to configure what should be considered unique.

  • Source Input Thresholds

    • Users can now specify the maximum number of inputs a Source can run with. If the number of inputs on a Snapshot exceeds the maximum, the Snapshot will fail to start.

  • Read-only view of Collection Destinations for a flow.

    • This includes showing destinations by segment (if chained flow) and a link to the collection destinations page for easy access to update when necessary.

  • flow (flow slug) now available as a template parameter for destination paths

  • Source Validations

    • Users cannot mark a Source as ACTIVE if it has no extractorId

    • "Play" button is disabled on Source Home if there is no extractorId
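
The deduplication of inputs across chained chunks listed above can be illustrated with a short sketch. It is not the platform's implementation: dedupe_chunks is a hypothetical name, and the key callable stands in for the JQ Deduplication Transform, which in the product is a JQ expression rather than a Python function.

```python
from typing import Callable, Hashable, Iterable, List


def whole_input(item: dict) -> Hashable:
    """Default uniqueness key: the entire input (assumes flat, hashable values)."""
    return tuple(sorted(item.items()))


def dedupe_chunks(
    chunks: Iterable[List[dict]],
    key: Callable[[dict], Hashable] = whole_input,
) -> List[List[dict]]:
    """Drop any input whose key already appeared in this or an earlier sister chunk."""
    seen = set()
    result = []
    for chunk in chunks:
        unique = []
        for item in chunk:
            item_key = key(item)
            if item_key not in seen:
                seen.add(item_key)
                unique.append(item)
        result.append(unique)
    return result


segment_2 = [
    [{"url": "a", "zip": "1"}, {"url": "b", "zip": "1"}],
    [{"url": "a", "zip": "2"}, {"url": "c", "zip": "1"}],
]
# Treat inputs with the same url as duplicates, whatever their zip.
print(dedupe_chunks(segment_2, key=lambda item: item["url"]))
```

With the default key, all four inputs in the example differ and are kept. With the url key, the second chunk keeps only the input for "c".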

Bug Fixes

  • Create/Edit Source API now returns 400 instead of 500 when a bad extractor Id is passed.

  • Better handling of chunk splitting

  • SQL Query optimization

  • Proper handling of input files with bad encoding

Release v4.1.0 - Sep 16, 2020

  • Optimized chunk scheduling in CHAINED flows

    • Chained segments will wait for all snapshots to reach a finished state before starting the next segment.

  • Clicking on an Organization/Delivery navigates to Flow/Delivery to provide user with Flow context.

Bug Fixes

  • UI performance improvements on delivery pages

  • Various bug fixes on user interface

  • Typed column fixes for locale de-ch

Release v4.0.0 - Sep 8, 2020

  • Delivery Snapshots UI

    • If a delivery is CHAINED, the Snapshot table will be grouped by segment/collection

  • Improved Destination Pushes UI

    • Destination Pushes are now available under the "Health Metrics" section of Snapshot Home

    • Push errors are now displayed in the UI

    • Successful pushes are now displayed in Destination Push Spotlight

  • Maintenance/Outage message in UI

    • If the platform undergoes scheduled maintenance or is experiencing an outage, a message will be displayed at the top of each page

  • Bulk Edit Source Parameters

    • Users can now edit parameters for multiple sources at once

Bug fixes

  • Not all parameters visible on Source home page

  • API error and UI failing to display when a single snapshot’s page count is 0

  • Snapshot row count displayed "-" instead of 0 after importing

  • IMPORT performance issues. "Essential container in task exited"

  • Collection Check for swing % fails if a column consistently returns no data

Release v3.6.0 - Aug 31, 2020

  • Input transforms on SIMPLE and CHAINED Flows

    • For Flows whose inputs are stored on S3, a JQ expression can be declared which will transform the inputs before running snapshots.

  • Cancel Parent/Chunked Snapshots

    • It is now possible to cancel a CHUNKED snapshot which will cancel all running chunked children

  • New Collection Checks Page

    • A redesigned Collection Checks page which allows text to be inserted to support check parameterization. See Release v3.4.0 below.

Release v3.5.0 - Aug 24, 2020

  • Custom parquet output

    • A custom Parquet file can be generated from the custom output using a predefined Schema. Configured on Collection Settings.

  • Configurable dataHours on each Segment for CHAINED Flows

    • Specify how long each segment in your chain should take to collect data.

  • Re-Extract improvements

    • CLI output and Paginated Extractors from the SaaS platform can now be re-extracted using saved HTML.

Release v3.4.0 - Aug 18, 2020

  • Initial support for parameterizing Collection Checks (API)

    • Collection checks can now be written with the syntax %{ PARAM("foo") }% and %{ DEFAULT(PARAM("zip"), 400) }% to substitute parameters specified on sources.

  • Ability to cancel Snapshots in START_SCHEDULED
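
To make the substitution concrete, here is a rough sketch of how the two placeholder forms from this release could be expanded. It is an assumption-laden illustration, not the application's parser: render_check is a hypothetical name, only the two forms shown in these notes are recognized, and raising an error for a missing parameter without a DEFAULT is assumed behavior.

```python
import re
from typing import Mapping

# Recognizes only %{ PARAM("name") }% and %{ DEFAULT(PARAM("name"), fallback) }%.
_PLACEHOLDER = re.compile(
    r'%\{\s*(?:'
    r'DEFAULT\(\s*PARAM\("(?P<dname>[^"]+)"\)\s*,\s*(?P<fallback>[^)]+?)\s*\)'
    r'|PARAM\("(?P<name>[^"]+)"\)'
    r')\s*\}%'
)


def render_check(template: str, params: Mapping[str, object]) -> str:
    """Substitute Source parameter values into a check expression."""

    def replace(match):
        if match.group("name") is not None:
            # Plain PARAM: a missing parameter raises KeyError (assumed behavior).
            return str(params[match.group("name")])
        value = params.get(match.group("dname"))
        # DEFAULT: fall back to the literal when the parameter is not set.
        return str(value) if value is not None else match.group("fallback")

    return _PLACEHOLDER.sub(replace, template)


print(render_check('%{ DEFAULT(PARAM("zip"), 400) }%', {}))              # 400
print(render_check('%{ DEFAULT(PARAM("zip"), 400) }%', {"zip": 90210}))  # 90210
print(render_check('%{ PARAM("foo") }%', {"foo": 5}))                    # 5
```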

Bug Fix

  • Bug fix for Chained Flow Segments failing to start due to input lookup error

Release v3.3.0 - Aug 10, 2020

  • Support for “Chunk Aggregation” in chained flows.

    • In CHAINED flows, parent snapshots in the last segment of a chain will aggregate chunks before pushing if “aggregateChunks” is turned on.

Bug Fix

  • Bug fix for “re-extract” on Snapshots

Release v3.2.0 - August 3rd, 2020

  • Zip compression available on destinations for custom, stats, pages and json files.

  • Delivery metrics and speed enhancements on CHAINED flows

    • stopBy timestamps on snapshots will be dynamically generated as the delivery progresses based on time left in “dataHours”, thus improving speed and metric reporting.

Release v3.1.0 - July 23rd, 2020

  • Start delivery API enhancements.

    • API now accepts a payload to override the saved flow configuration, negating the need to PATCH a flow before running it.

    • A new FlowDelivery relationship was added to track what configuration a delivery ran with.

Release v3.0 - July 8th, 2020

  • CHAINED Flows

    • New type of flow which allows chaining the output from one collection into the inputs of another. Data can be transformed across collection segments using jq syntax.

  • Support for transform failures and metrics

    • New Extractors support embedded transforms. These transforms are executed during the Snapshot import pipeline. If a transform fails, the failure is marked on the Page Summary, and the percentage of failed transforms is tracked as a metric on the snapshot health.

  • Multipart upload support when pushing to s3 destinations.

    • Allows files > 5 GB to be pushed to buckets

Bug Fix

  • Bug fix for aggregated push variables

Release v2.6.1 - July 14th, 2020

  • New option to skipQaTests on a flow

  • Application now displays status of chunk aggregation.

  • Crawl run id is now clickable and navigates to crawl run debugger

  • New links under “Download Snapshot”

    • Inputs - the inputs that the crawl run ran with

    • Crawl Run (Log) - log file from crawl run output

    • Crawl Run (JSON) - raw crawl run output before DOC transformation

  • delivery_id available as a variable for destination output path

Release v2.6.0 - June 30, 2020

  • Chunk aggregation

    • Option on Flows to wait until all child "chunks" of a snapshot are PASSED_QA before aggregating data and pushing to destinations.

Release v2.5.0 - June 22, 2020

  • Application now runs snapshots as the legacyPlatformId user (SaaS Platform) if this property is set on the organization. This allows for proper billing and supports authenticated extractors.

Release v2.4.0 - June 19, 2020

  • Support for fractional hours in flow configuration.

  • Comments thread on Deliveries

Release v2.3.0 - June 10, 2020

What’s New

  • Flow source filtering

    • Now you can select/filter which Sources get run in a SIMPLE Flow by any of the parameters in your Source.

  • Destinations accept Source Parameter values in your output template

    • Now you can use any Source Parameter as a destination template variable. For example, if you have a Parameter locale, it can be set in your destination path/filename as :source.locale. When your Source is run with locale set to UK, the destination will output with sourcename-UK.

  • Added support for import.io CLI extractor output

    • Now extractors created by the import.io CLI can be imported into workbench.
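
The Source Parameter substitution in destination templates described above can be pictured with a small sketch. The template string, the function name render_destination, and the error on an unknown parameter are all assumptions for illustration; only the :source.locale variable form comes from these notes.

```python
import re
from typing import Mapping

# Matches variables of the form :source.<parameterName>.
_SOURCE_VAR = re.compile(r":source\.([A-Za-z_][A-Za-z0-9_]*)")


def render_destination(template: str, source_params: Mapping[str, object]) -> str:
    """Replace each :source.<name> variable with the Source's parameter value."""
    # An unknown parameter raises KeyError (assumed behavior).
    return _SOURCE_VAR.sub(lambda m: str(source_params[m.group(1)]), template)


# Hypothetical path template; only the :source.locale part is from the release note.
print(render_destination("exports/sourcename-:source.locale", {"locale": "UK"}))
# exports/sourcename-UK
```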

Bug fix

  • Setting QPS for one chunk is now supported.

Release v1.8.2 - April 24, 2020

  • Fixed latest delivery + destination pushes bugs.

  • Added assigned user to snapshot tables

  • Added seconds to the destination and simple flow timestamps

  • Updated the user docs

  • Health metrics are open on the selected snapshot home page

  • Added a snapshot link in delivery sidebar, instead of delivery home dropdown

  • Added ‘All’ button to snapshot and pushes spotlight

  • Added revert option to pushes spotlight table

  • Fixed Delivery Push and Bull Queue metrics

Release v1.8.0 - April 10, 2020

Collections

  • “First seen” column - A new setting is available for collections whose schema has a primary key. When enabled, it records the date when an item was "first seen" and includes it in the snapshot data as a metadata timestamp column, "_firstSeen".
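
The idea behind "_firstSeen" can be sketched as a lookup keyed on the primary key: a row keeps the timestamp of the first snapshot in which its key appeared. This is an assumed model for illustration only; the function name, the in-memory dictionary, and the ISO 8601 formatting are not taken from the product.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def stamp_first_seen(
    rows: Iterable[dict],
    first_seen: Dict[object, datetime],
    primary_key: str,
    now: datetime,
) -> List[dict]:
    """Attach a _firstSeen value to each row, remembering new keys in first_seen."""
    stamped = []
    for row in rows:
        # setdefault keeps the earlier timestamp when the key is already known.
        seen_at = first_seen.setdefault(row[primary_key], now)
        stamped.append({**row, "_firstSeen": seen_at.isoformat()})
    return stamped


history: Dict[object, datetime] = {}
day_1 = datetime(2020, 4, 10, tzinfo=timezone.utc)
day_2 = datetime(2020, 4, 11, tzinfo=timezone.utc)
stamp_first_seen([{"sku": "A"}], history, "sku", day_1)
# "A" keeps day_1 as its first-seen date; "B" is new and gets day_2.
print(stamp_first_seen([{"sku": "A"}, {"sku": "B"}], history, "sku", day_2))
```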

Schemas

  • “FILE” Type column - This new column type on schemas helps include the metadata information about files extracted and downloaded during extraction time.

Destinations

  • Include downloaded files in destination pushes - You can now include any files and images that were downloaded during your extraction in pushes to your destinations.

  • SFTP Destinations Type - You can now configure SFTP servers as destinations for your data.

Alarms

  • Severity on Alarms - Flow alarms now have a “severity” option which can be set on the alarm and included in any alarm notifications.

  • APIs and UI to Resolve or Trigger Alarms - Now when a delivery is open you have the option to manually trigger or resolve alarms for the selected delivery alarm page.

Snapshots

  • Importing cancelled snapshots - If you cancel a snapshot, you now have an option in the UI to manually import the data that was collected.

Flows

  • Max chunk crawl time - When configuring a flow you can set the chunkTimeoutHours which is the max time a chunk should run for. If the chunk times out the data collected will be automatically imported.