Release Notes
Release v7.0.3 - Jan 11, 2021
Bug Fixes
- Fixed scenario where second-segment snapshots in a CHAINED Flow were receiving the wrong collection windows, resulting in an incorrect query gap being set for crawl runs.
- Retain parent/chunk relationships when a Snapshot aggregation is retried.
- Fixed email integration form displaying wrong information.
- Fixed issue where navigating to Deliveries from a selected Delivery resulted in deliveries missing from the list.
- Destination Push fixes:
  - Make sure that failures are properly marked with FAILED instead of INTERNAL_FAILURE
  - Ensure that failed revert attempts receive a status of REVERT_FAILED
  - Allow retrying a revert that failed
Release v7 - Jan 4, 2021
Generate Assets Stage
“Generate Assets” is a new stage in the import pipeline and is the final stage before a snapshot is ready to be reviewed by a user or automatically pushed. If a custom destination is designated for any of the snapshot’s destinations, a custom file will be created and copied to DOC’s internal assets. Moving custom asset generation into the import pipeline makes the “Destination Push” process more efficient, and also lets a user download a custom file (similar to downloading the JSON on a snapshot now from the home page) without a push taking place. A link for this will be available in the “Download Snapshot” window.
IMPORTANT: Note that if you need to apply changes to a custom file after it has been imported you will need to "Regenerate Custom Assets" from the Snapshot Home page.
For more, see Snapshot Retry Options.
Delivery Alarm Severities
An Alarm Group’s and Delivery’s state will reflect the most severe Alarm state.
The priority is as follows (most severe first):
CRITICAL → ERROR → WARNING → INFO
Pass/Revert all QA Checks
A checkbox is available on the QA Checks page that allows you to PASS all QA checks or revert them to their initial calculated values.
View History of Metric values
When performing QA Checks you now have the option to "Show Historical Values" which will fetch and display a history of values for that particular field.
Release v6.1.0 - Dec 9, 2020
SOURCE_ENGINEER Role
The SOURCE_ENGINEER role is now available to assign to users of an Organization. This role restricts the creation and editing of most resources but allows:
- Creating and editing Sources
- Snapshot operations:
  - Running, re-running, re-importing, re-extracting
  - Passing/failing QA Checks
  - Reverting/retrying Destination Pushes
  - Commenting
- Editing Source Filters on a flow
Byte Order Mark on Custom Output
You can now include a BOM header in your custom output. This is configured via a checkbox on the Collection Settings page.
Release v5.5.0 - Nov 9, 2020
Chunk Collect Hours on chained segments
"Chunk Collect Hours" can now be configured at the segment level, allowing more granular control of your "Collected on Time" metrics.
Delivery Alarms for Snapshot QA Checks
Alarms can now be configured using failedRequiredTests and failedOptionalTests as metrics.
Release v5.4.0 - Nov 6, 2020
Release v5.2.0 - Oct 27, 2020
Optional Collection Checks
Collection checks can now be marked as required or optional, allowing snapshots to "auto-push":
- A Required checkbox is available on the Collection Checks table
- If a check is optional, its result will still be evaluated, but a failure will not prevent a snapshot from pushing
- Metrics for the counts of failed optional checks and failed required checks are available on the snapshot health and are included on the snapshot tables
Warning for saving flows
A usability improvement provides a warning to users who are editing a flow, preventing accidental removal of source filters. If a user tries to save a flow that has source filters and the source filters are empty, they will be prompted to confirm that this is intentional.
Support for positive/negative swing checks
Previously, PCT checks could only test the magnitude of a value's deviation between two snapshots. With this feature, placing a sign (+/-) before the target number lets the check take the direction of the deviation into account. For example, a PCT check of % Error <= +10% will no longer trigger if the change is -20%.
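A minimal sketch of the difference (the % Error metric and exact threshold spelling are illustrative, following the example above):

Example: % Error <= 10% (unsigned: triggers on a swing of more than 10% in either direction)
Example: % Error <= +10% (signed: triggers only when the value increases by more than 10%; a -20% change is ignored)
Example: % Error <= -10% (signed: triggers only when the value decreases by more than 10%)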
Release v5.1.0 - Oct 22, 2020
New Collection Check Functions
Checks can now reference stats and health metrics:
- HEALTH() - References snapshot health values.
  Example: %{ IF(HEALTH("rows") > 100000, 2, 10) }%
- METRIC() - References values contained in the stats file.
  Example: %{ METRIC("data:field/name/stringPunctRatio/p05") }%
Re-extract improvements
- PUSHED Snapshots can now be re-extracted:
  - Removes data from the Redshift Datalake
  - Supersedes the snapshot and re-extracts
- PUSHED Snapshots can also be transitioned back to QA:
  - Removes data from the Redshift Datalake
  - Allows for re-extraction
- Reverting pushed files from destinations is still a manual operation; this can be achieved by "reverting" the destination push.
Release v5.0.1 - Oct 12, 2020
- Increased memory for Import jobs
  - Reduces the likelihood of Snapshot Imports failing during stats generation
- Removed POST_PROCESSING from allowed Snapshot transitions
Release v5.0.0 - Sep 29, 2020
- Deduplication of inputs across chained chunks
  - As a chunked and chained Flow executes, inputs are guaranteed to be unique in each chunk.
  - For example: in a chained & chunked flow, segment 2 chunk 2 will not run any inputs that appear in any of the sister chunks in segment 2.
  - A separate JQ Deduplication Transform can be provided, allowing you to configure what should be considered unique.
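As a sketch, a JQ Deduplication Transform might reduce each input to just the fields that define uniqueness, so two inputs with the same values for those fields are treated as duplicates (the url and locale field names are hypothetical, and the exact shape the transform must return is an assumption):

Example: { url: .url, locale: .locale }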
- Source Input Thresholds
  - Users can now specify the maximum number of inputs a Source can run with. If the number of inputs on a Snapshot exceeds the maximum, the Snapshot will fail to start.
- Read-only view of Collection Destinations for a flow
  - This includes showing destinations by segment (for chained flows) and a link to the Collection Destinations page for easy access when updates are necessary.
- flow (flow slug) is now available as a template parameter for destination paths
- Source Validations
  - Users cannot mark a Source as ACTIVE if it has no extractorId
  - The "Play" button is disabled on Source Home if there is no extractorId
Release v4.1.0 - Sep 16, 2020
- Optimized chunk scheduling in CHAINED flows
  - Chained segments will wait for all snapshots to reach a finished state before starting the next segment.
- Clicking on an Organization/Delivery navigates to Flow/Delivery to provide the user with Flow context.
Release v4.0.0 - Sep 8, 2020
- Delivery Snapshots UI
  - If a delivery is CHAINED, the Snapshot table will be grouped by segment/collection
- Improved Destination Pushes UI
  - Destination Pushes are now available under the "Health Metrics" section of Snapshot Home
  - Push errors are now displayed in the UI
  - Successful pushes are now displayed in the Destination Push Spotlight
- Maintenance/Outage message in UI
  - If the platform undergoes scheduled maintenance or is experiencing an outage, a message will be displayed at the top of each page
- Bulk Edit Source Parameters
  - Users can now edit parameters for multiple sources at once
- Bug fixes
  - Not all parameters were visible on the Source home page
  - API error and UI failing to display when a single snapshot’s page count is 0
  - Snapshot row count displayed "-" instead of 0 after importing
  - IMPORT performance issues ("Essential container in task exited")
  - Collection Check for swing % failing if a column consistently returns no data
Release v3.6.0 - Aug 31, 2020
- Input transforms on SIMPLE and CHAINED Flows
  - For Flows whose inputs are stored on S3, a JQ expression can be declared which will transform the inputs before running snapshots.
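As a sketch, such an input transform is an ordinary jq program run over the stored inputs before snapshots start (the active flag and url field are hypothetical, and it is assumed here that the expression receives the array of inputs):

Example: map(select(.active == true) | .url |= ascii_downcase)

This would keep only inputs flagged active and lowercase their url field.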
- Cancel Parent/Chunked Snapshots
  - It is now possible to cancel a CHUNKED snapshot, which will cancel all running chunked children
- New Collection Checks Page
  - A redesigned Collection Checks page allows text to be inserted to support check parameterization. See Release v3.4.0 below.
Release v3.5.0 - Aug 24, 2020
- Custom Parquet output
  - A custom Parquet file can be generated from the custom output using a predefined Schema. Configured on Collection Settings.
- Configurable dataHours on each Segment for CHAINED Flows
  - Specify how long each segment in your chain should take to collect data.
- Re-Extract improvements
  - CLI output and Paginated Extractors from the SaaS platform can now be re-extracted using saved HTML.
Release v3.4.0 - Aug 18, 2020
- Initial support for parameterizing Collection Checks (API)
  - Collection checks can now be written with the syntax %{ PARAM("foo") }% and %{ DEFAULT(PARAM("zip"), 400) }% to substitute using parameters specified on sources.
- Ability to cancel Snapshots in START_SCHEDULED
Release v3.2.0 - August 3rd, 2020
- Zip compression available on destinations for custom, stats, pages and json files.
- Delivery metrics and speed enhancements on CHAINED flows
  - stopBy timestamps on snapshots will be dynamically generated as the delivery progresses, based on the time left in “dataHours”, improving speed and metric reporting.
Release v3.1.0 - July 23rd, 2020
- Start Delivery API enhancements
  - The API now accepts a payload to override the saved flow configuration, negating the need to PATCH a flow before running it.
  - A new FlowDelivery relationship was added to track what configuration a delivery ran with.
Release v3.0 - July 8th, 2020
- CHAINED Flows
  - A new type of flow which allows chaining the output from one collection into the inputs of another. Data can be transformed across collection segments using jq syntax.
- Support for transform failures and metrics
  - New Extractors support embedded transforms. These transforms are executed during the Snapshot import pipeline. If a transform fails, it is marked on the Page Summary, and the percentage of failed transforms is tracked as a metric on the snapshot health.
- Multipart upload support when pushing to S3 destinations
  - Allows files > 5 GB to be pushed to buckets
Release v2.6.1 - July 14th, 2020
- New option to skipQaTests on a flow
- Application now displays the status of chunk aggregation
- Crawl run id is now clickable and navigates to the crawl run debugger
- New links under “Download Snapshot”:
  - Inputs - the inputs that the crawl run ran with
  - Crawl Run (Log) - log file from crawl run output
  - Crawl Run (JSON) - raw crawl run output before DOC transformation
- delivery_id available as a variable for destination output paths
Release v2.6.0 - June 30, 2020
- Chunk aggregation
  - Option on Flows to wait until all child “chunks” of a snapshot are PASSED_QA before aggregating data and pushing to destinations.
Release v2.5.0 - June 22, 2020
- The application now runs snapshots as the legacyPlatformId user (SaaS Platform) if this property is set on the organization, allowing for proper billing and support for authenticated extractors.
Release v2.4.0 - June 19, 2020
- Support for fractional hours in flow configuration
- Comments thread on Deliveries
Release v2.3.0 - June 10, 2020
What’s New
- Flow source filtering
  - You can now select/filter which Sources get run in a SIMPLE Flow by any of the parameters in your Source.
- Destinations accept Source Parameter values in your output template
  - You can now use any Source Parameter as a destination template variable. For example, if you have a parameter locale, it can be referenced in your destination path/filename as :source.locale. When your Source is run with locale set to UK, the destination will output with sourcename-UK.
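As a sketch, a destination filename template using the locale parameter from the example above (the daily-feed prefix is illustrative):

Example: daily-feed-:source.locale.json

When the Source runs with locale set to UK, the pushed file would be named daily-feed-UK.json.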
- Added support for import.io CLI extractor output
  - Extractors created by the import.io CLI can now be imported into workbench.
Release v1.8.2 - April 24, 2020
- Fixed latest delivery and destination pushes bugs
- Added assigned user to snapshot tables
- Added seconds to the destination and simple flow timestamps
- Updated the user docs
- Health metrics are open on the selected snapshot home page
- Added a snapshot link in the delivery sidebar, instead of the delivery home dropdown
- Added an ‘All’ button to the snapshot and pushes spotlights
- Added a revert option to the pushes spotlight table
- Fixed Delivery Push and Bull Queue metrics
Release v1.8.0 - April 10, 2020
Collections
- “First seen” column - A collection setting is now available for collections whose schema has a primary key. It generates the date when an item was “first seen” and includes it as a metadata timestamp column "_firstSeen" in the snapshot data.
Schemas
- “FILE” Type column - This new column type on schemas includes metadata about files extracted and downloaded during extraction.
Destinations
- Include downloaded files in destination pushes - You can now include any files and images that were downloaded during your extraction in pushes to your destinations.
- SFTP Destination Type - You can now configure SFTP servers as destinations for your data.
Alarms
- Severity on Alarms - Flow alarms now have a “severity” option which can be set on the alarm and included in any alarm notifications.
- APIs and UI to Resolve or Trigger Alarms - When a delivery is open, you can now manually trigger or resolve alarms from the selected delivery’s alarm page.