Release Notes

Release v5.1.0 - Oct 22 2020

New Collection Check Functions

Checks can now reference stats and health metrics

  • HEALTH() - References snapshot health values.

    • Example %{ IF(HEALTH("rows") > 100000, 2, 10) }%

  • METRIC() - References values contained in the stats file

    • Example %{ METRIC("data:field/name/stringPunctRatio/p05") }%
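
To make the HEALTH() example easier to read, the Python sketch below mirrors it. It assumes IF(condition, a, b) behaves like a conventional ternary; the helper names and the sample health values are hypothetical and are not part of the platform.

```python
# Hypothetical snapshot health values; real values come from the snapshot.
snapshot_health = {"rows": 250_000}


def health(name):
    """Stand-in for HEALTH(): look up a snapshot health value by name."""
    return snapshot_health[name]


def if_(condition, when_true, when_false):
    """Stand-in for IF(), assumed here to behave like a ternary."""
    return when_true if condition else when_false


# Mirrors %{ IF(HEALTH("rows") > 100000, 2, 10) }%
value = if_(health("rows") > 100000, 2, 10)
print(value)
```

Under that assumption, a snapshot with more than 100000 rows resolves the expression to 2, and a smaller snapshot resolves it to 10.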

Re-extract improvements

  • PUSHED Snapshots can now be re-extracted

    • Removes data from Redshift Datalake

    • Supersedes snapshot and re-extracts

  • PUSHED Snapshots can also be transitioned back to QA

    • Removes data from the Redshift Datalake

    • Allows for re-extraction

Reverting pushed files from destinations is still a manual operation; it can be done by "reverting" the destination push.

Bug Fixes

  • Fix for application displaying duplicate Deliveries

  • Increased the default limit of displayed collection checks from 100 to 1000.

Release v5.0.1 - Oct 12 2020

  • Increased memory for Import jobs

    • Reduce likelihood of Snapshot Imports failing during stats generation

  • Remove POST_PROCESSING from allowed Snapshot transitions

Release v5.0.0 - Sep 29 2020

  • Deduplication of inputs across chained chunks

    • As a chunked and chained Flow executes, inputs are guaranteed to be unique in each chunk.

    • For example: In a chained & chunked flow, segment 2 chunk 2 will not run any inputs that are in any of the sister chunks in segment 2.

    • A separate JQ Deduplication Transform can be provided which will allow you to configure what should be considered unique.

  • Source Input Thresholds

    • Users can now specify the maximum number of inputs a Source can run with. If the number of inputs on a Snapshot exceeds the maximum, the Snapshot will fail to start.

  • Read-only view of Collection Destinations for a flow.

    • This includes showing destinations by segment (if chained flow) and a link to the collection destinations page for easy access to update when necessary.

  • flow (flow slug) now available as a template parameter for destination paths

  • Source Validations

    • Users cannot mark a Source as ACTIVE if it has no extractorId

    • "Play" button is disabled on Source Home if there is no extractorId
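
The deduplication of inputs across chunks listed above can be pictured with the Python sketch below. It is only an illustration: the function name, the "first chunk keeps the input" rule, and the key function (standing in for the JQ Deduplication Transform) are assumptions, not the platform's implementation.

```python
import json


def dedupe_chunks(chunks, key=None):
    """Return the chunks with any input already seen in an earlier chunk removed."""
    if key is None:
        # Default notion of "unique": the whole input, serialized deterministically.
        key = lambda item: json.dumps(item, sort_keys=True)
    seen = set()
    result = []
    for chunk in chunks:
        unique = []
        for item in chunk:
            k = key(item)
            if k not in seen:
                seen.add(k)
                unique.append(item)
        result.append(unique)
    return result


chunks = [
    [{"url": "https://example.com/a", "page": 1}, {"url": "https://example.com/b", "page": 1}],
    [{"url": "https://example.com/b", "page": 1}, {"url": "https://example.com/a", "page": 2}],
]

# Whole-input uniqueness keeps the page 2 input in the second chunk.
by_whole_input = dedupe_chunks(chunks)

# A custom key (here, the URL only) treats both inputs in the second chunk as duplicates.
by_url = dedupe_chunks(chunks, key=lambda item: item["url"])
```

Passing a different key changes what counts as a duplicate, which is the role the configurable transform plays in the notes above.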

Bug Fixes

  • Create/Edit Source API now returns 400 instead of 500 when a bad extractor Id is passed.

  • Better handling of chunk splitting

  • SQL Query optimization

  • Proper handling of input files with bad encoding

Release v4.1.0 - Sep 16, 2020

  • Optimized chunk scheduling in CHAINED flows

    • Chained segments will wait for all snapshots to reach a finished state before starting the next segment.

  • Clicking on an Organization/Delivery navigates to Flow/Delivery to provide user with Flow context.

Bug Fixes

  • UI performance improvements on delivery pages

  • Various bug fixes on user interface

  • Typed column fixes for locale de-ch

Release v4.0.0 - Sep 8, 2020

  • Delivery Snapshots UI

    • If a delivery is CHAINED, the Snapshot table will be grouped by segment/collection

  • Improved Destination Pushes UI

    • Destination Pushes are now available under the "Health Metrics" section of Snapshot Home

    • Push errors are now displayed in the UI

    • Successful pushes are now displayed in Destination Push Spotlight

  • Maintenance/Outage message in UI

    • If the platform undergoes scheduled maintenance or is experiencing an outage, a message will be displayed at the top of each page

  • Bulk Edit Source Parameters

    • Users can now edit parameters for multiple sources at once

Bug fixes

  • Not all parameters visible on Source home page

  • API error and UI failing to display when a single snapshot’s page count is 0

  • Snapshot row count displayed "-" instead of 0 after importing

  • IMPORT performance issues. "Essential container in task exited"

  • Collection Check for swing % fails if a column consistently returns no data

Release v3.6.0 - Aug 31, 2020

  • Input transforms on SIMPLE and CHAINED Flows

    • For Flows whose inputs are stored on S3, a JQ expression can be declared which will transform the inputs before running snapshots.

  • Cancel Parent/Chunked Snapshots

    • It is now possible to cancel a CHUNKED snapshot which will cancel all running chunked children

  • New Collection Checks Page

    • A redesigned Collection Checks page which allows text to be inserted to support check parameterization. See Release v3.4.0 below.

Release v3.5.0 - Aug 24, 2020

  • Custom parquet output

    • A custom Parquet file can be generated from the custom output using a predefined Schema. Configured on Collection Settings.

  • Configurable dataHours on each Segment for CHAINED Flows

    • Specify how long each segment in your chain should take to collect data.

  • Re-Extract improvements

    • CLI output and Paginated Extractors from the SaaS platform can now be re-extracted using saved HTML.

Release v3.4.0 - Aug 18, 2020

  • Initial support for parameterizing Collection Checks (API)

    • Collection checks can now be written with syntax %{ PARAM("foo") }% and %{ DEFAULT(PARAM("zip"), 400) }% to substitute using parameters specified on sources.

  • Ability to cancel Snapshots in START_SCHEDULED
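
The PARAM and DEFAULT substitution described above can be sketched in Python as follows. This is a minimal illustration that only handles the two forms shown in these notes; the function name, the regular expressions, and the sample check text are assumptions and do not reflect how the platform parses checks.

```python
import re

# Matches a %{ ... }% placeholder and captures the expression inside it.
TOKEN = re.compile(r'%\{\s*(.*?)\s*\}%')
PARAM = re.compile(r'^PARAM\("([^"]+)"\)$')
DEFAULT = re.compile(r'^DEFAULT\(PARAM\("([^"]+)"\),\s*(.+)\)$')


def render_check(text, params):
    """Substitute source parameters into a check (hypothetical helper)."""
    def replace(match):
        expr = match.group(1)
        default = DEFAULT.match(expr)
        if default:
            name, fallback = default.groups()
            # Use the source parameter if present, else the literal fallback.
            return str(params.get(name, fallback))
        param = PARAM.match(expr)
        if param:
            # Raises KeyError if the source does not define the parameter.
            return str(params[param.group(1)])
        # Leave expressions this sketch does not understand untouched.
        return match.group(0)
    return TOKEN.sub(replace, text)


check = 'rows > %{ PARAM("foo") }% and zip == %{ DEFAULT(PARAM("zip"), 400) }%'
print(render_check(check, {"foo": 5}))
print(render_check(check, {"foo": 5, "zip": 90210}))
```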

Bug Fix

  • Bug fix for Chained Flow Segments failing to start due to input lookup error

Release v3.3.0 - Aug 10, 2020

  • Support for “Chunk Aggregation” in chained flows.

    • In CHAINED flows, parent snapshots in the last segment of a chain will aggregate chunks before pushing if “aggregateChunks” is turned on.

Bug Fix

  • Bug fix for “re-extract” on Snapshots

Release v3.2.0 - August 3rd, 2020

  • Zip compression available on destinations for custom, stats, pages and json files.

  • Delivery metrics and speed enhancements on CHAINED flows

    • stopBy timestamps on snapshots will be dynamically generated as the delivery progresses based on time left in “dataHours”, thus improving speed and metric reporting.

Release v3.1.0 - July 23rd, 2020

  • Start delivery API enhancements.

    • API now accepts a payload to override the saved flow configuration, negating the need to PATCH a flow before running it.

    • A new FlowDelivery relationship was added to track what configuration a delivery ran with.

Release v3.0 - July 8th, 2020

  • CHAINED Flows

    • New type of flow which allows chaining the output from one collection into the inputs of another. Data can be transformed across collection segments using jq syntax.

  • Support for transform failures and metrics

    • New Extractors support embedded transforms. These transforms are executed during the Snapshot import pipeline. If a transform fails, the failure is marked on the Page Summary, and the percentage of failed transforms is tracked as a metric on the snapshot health.

  • Multipart upload support when pushing to s3 destinations.

    • Allows files larger than 5 GB to be pushed to buckets
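
For context on the multipart upload item above: S3 caps a single PUT at 5 GB, while a multipart upload allows up to 10,000 parts of 5 MiB to 5 GiB each. The Python sketch below shows one way a part size could be chosen within those limits. The function name and the 64 MiB preferred part size are assumptions made for illustration, not the platform's settings.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3

MIN_PART_SIZE = 5 * MIB   # S3 minimum part size (the last part may be smaller)
MAX_PART_SIZE = 5 * GIB   # S3 maximum part size
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def plan_parts(file_size, preferred_part_size=64 * MIB):
    """Return (part_size, part_count) for a multipart upload (hypothetical helper)."""
    # Grow the part size when the file would otherwise need more than MAX_PARTS parts.
    part_size = max(preferred_part_size, MIN_PART_SIZE, math.ceil(file_size / MAX_PARTS))
    if part_size > MAX_PART_SIZE:
        raise ValueError("file is too large for a single multipart upload")
    part_count = max(1, math.ceil(file_size / part_size))
    return part_size, part_count


print(plan_parts(6 * GIB))
print(plan_parts(1024 * GIB))
```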

Bug Fix

  • Bug fix for aggregated push variables

Release v2.6.1 - July 14th, 2020

  • New option to skipQaTests on a flow

  • Application now displays status of chunk aggregation.

  • Crawl run id is now clickable and navigates to crawl run debugger

  • New links under “Download Snapshot”

    • Inputs - the inputs that the crawl run ran with

    • Crawl Run (Log) - log file from crawl run output

    • Crawl Run (JSON) - raw crawl run output before DOC transformation

  • delivery_id available as a variable for destination output path

Release v2.6.0 - June 30, 2020

  • Chunk aggregation

    • Option on Flows to wait until all child "chunks" of a snapshot are PASSED_QA before aggregating data and pushing to destinations.
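
The waiting rule above can be sketched in Python as follows. The dictionary shape of a chunk and both function names are hypothetical; the sketch only illustrates "aggregate once every child chunk is PASSED_QA".

```python
def ready_to_aggregate(children):
    """True only when there is at least one chunk and every chunk is PASSED_QA."""
    return bool(children) and all(child["status"] == "PASSED_QA" for child in children)


def aggregate(children):
    """Concatenate chunk rows, or return None while any chunk is still pending."""
    if not ready_to_aggregate(children):
        return None
    rows = []
    for child in children:
        rows.extend(child["rows"])
    return rows


pending = [
    {"status": "PASSED_QA", "rows": [1, 2]},
    {"status": "IMPORTING", "rows": []},  # hypothetical non-final status
]
done = [
    {"status": "PASSED_QA", "rows": [1, 2]},
    {"status": "PASSED_QA", "rows": [3]},
]
print(aggregate(pending))
print(aggregate(done))
```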

Release v2.5.0 - June 22, 2020

  • Application now runs snapshots as the legacyPlatformId user (SaaS Platform) if this property is set on the organization. This allows for proper billing and supports authenticated extractors.

Release v2.4.0 - June 19, 2020

  • Support for fractional hours in flow configuration.

  • Comments thread on Deliveries

Release v2.3.0 - June 10, 2020

What’s New

  • Flow source filtering

    • Now you can select/filter which Sources get run in a SIMPLE Flow by any of the parameters in your Source.

  • Destinations accept Source Parameter values in your output template

    • Now you can use any Source Parameter as a destination template variable. For example, if you have a Parameter named locale, it can be referenced in your destination path/filename as :source.locale. When your Source is run with locale set to UK, the destination will output with sourcename-UK.

  • Added support for import.io CLI extractor output

    • Now extractors created by the import.io CLI can be imported into workbench.
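
The destination template variables described above (for example :source.locale) can be illustrated with the Python sketch below. Only the :source.<parameter> form comes from these notes; the function name, the regular expression, and the choice to leave unknown parameters untouched are assumptions.

```python
import re

# Matches :source.<parameter name>, e.g. :source.locale
SOURCE_VAR = re.compile(r":source\.(\w+)")


def render_destination(template, source_params):
    """Fill :source.<name> variables from a Source's parameters (hypothetical helper)."""
    def replace(match):
        name = match.group(1)
        # Unknown parameters are left as-is so the gap stays visible in the output.
        return str(source_params.get(name, match.group(0)))
    return SOURCE_VAR.sub(replace, template)


print(render_destination("sourcename-:source.locale", {"locale": "UK"}))
```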

Bug fix

  • Setting QPS for one chunk is now supported.

Release v1.8.2 - April 24, 2020

  • Fixed latest delivery + destination pushes bugs.

  • Added assigned user to snapshot tables

  • Added seconds to the destination and simple flow timestamps

  • Updated the user docs

  • Health metrics are open on the selected snapshot home page

  • Added a snapshot link in delivery sidebar, instead of delivery home dropdown

  • Added ‘All’ button to snapshot and pushes spotlight

  • Added revert option to pushes spotlight table

  • Fixed Delivery Push and Bull Queue metrics

Release v1.8.0 - April 10, 2020

Collections

  • “First seen” column - A new setting is available for collections whose schema has a primary key. It generates the date when an item was "first seen" and includes it as a meta-data timestamp column "_firstSeen" in the snapshot data.
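
The idea behind the "_firstSeen" column can be sketched in Python as follows. The function name, the in-memory history dictionary, and the ISO date format are assumptions made for illustration; the notes do not say how or where the platform stores first-seen dates.

```python
from datetime import date


def stamp_first_seen(rows, primary_key, first_seen, today):
    """Add a _firstSeen value to each row, remembering the first date per primary key."""
    stamped = []
    for row in rows:
        key = row[primary_key]
        # Record the date only the first time this primary key appears.
        first_seen.setdefault(key, today)
        stamped.append({**row, "_firstSeen": first_seen[key].isoformat()})
    return stamped


history = {}  # hypothetical store of primary key -> first seen date
day1 = stamp_first_seen([{"sku": "A"}], "sku", history, date(2020, 4, 1))
day2 = stamp_first_seen([{"sku": "A"}, {"sku": "B"}], "sku", history, date(2020, 4, 2))
print(day2)
```

Item A keeps its original date on the second run, while item B is stamped with the date of the run in which it first appeared.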

Schemas

  • “FILE” Type column - This new column type on schemas helps include the metadata information about files extracted and downloaded during extraction time.

Destinations

  • Include downloaded files in destination pushes - You can now include any files and images that were downloaded during your extraction in pushes to your destinations.

  • SFTP Destinations Type - You can now configure SFTP servers as destinations for your data.

Alarms

  • Severity on Alarms - Flow alarms now have a “severity” option which can be set on the alarm and included in any alarm notifications.

  • APIs and UI to Resolve or Trigger Alarms - Now when a delivery is open you have the option to manually trigger or resolve alarms for the selected delivery alarm page.

Snapshots

  • Importing cancelled snapshots - If you cancel a snapshot, you now have an option in the UI to manually import the data that was collected.

Flows

  • Max chunk crawl time - When configuring a flow you can set the chunkTimeoutHours which is the max time a chunk should run for. If the chunk times out the data collected will be automatically imported.
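
The timeout rule above can be sketched in Python as follows. Only chunkTimeoutHours comes from these notes; the function name, its arguments, and the sample timestamps are hypothetical, and the sketch does not show the import step that follows a timeout.

```python
from datetime import datetime, timedelta, timezone


def chunk_timed_out(started_at, chunk_timeout_hours, now):
    """True once a chunk has been running for at least chunkTimeoutHours."""
    return now - started_at >= timedelta(hours=chunk_timeout_hours)


started = datetime(2020, 4, 10, 8, 0, tzinfo=timezone.utc)
print(chunk_timed_out(started, 1.5, datetime(2020, 4, 10, 9, 0, tzinfo=timezone.utc)))
print(chunk_timed_out(started, 1.5, datetime(2020, 4, 10, 9, 30, tzinfo=timezone.utc)))
```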