Destinations

A Destination is the location where customer data is delivered. You configure Destinations
at the Organization level. Although Destinations are used frequently, they are not always
required. For example, when testing, you might download output rather than push it to a
Destination. Similarly, with Chained Flows, the earlier segments are often not pushed to a
Destination; only the final segment, which contains the requested output, is pushed.

Add a Destination


To add a Destination:

  1. From the left navigation pane at the Organization level, click Destinations.

  2. From the top right of the Destinations page, click the Add a Destination icon or plus (+) symbol.

  3. From the New Destination page that now appears, enter text in the Name field.

  4. Use the drop-down arrow to select a Type, choosing from Azure, S3 or SFTP.

    The Azure Destination type allows you to deliver data to Azure Blob Storage.

    An S3 bucket is an AWS service that serves many purposes; in this case, it serves as a storage
    location.

    Similarly, SFTP is another vehicle by which you can transfer and store data.

  5. Ignore or clear the Active checkbox.

    The Active checkbox is selected by default. If you clear this checkbox, no data will be
    pushed to this Destination; this allows you to keep the Destination configured while
    preventing any data from being pushed to it.

  6. Choose a File Type, or keep the default value, Individual Files.

    This option is not available for the Azure and SFTP Destination types; the Individual Files value is used by default.

  7. Choose a File.

    To further understand the context of Destinations relative to file types, it is important
    to realize that you can retrieve four kinds of data via an Import.io web data extraction:

    • The actual dataset

    • Statistics specific to the crawl run and extraction

    • Pages, which represent the web pages that the browser rendered when accessing the websites to extract data

    • Downloaded files/images (if configured in the Extractor)

      You can configure Import.io to download actual files that are accessible via a web page;
      for example, you might extract data from a web page that includes a list of links to PDF
      documents. You can extract the URL values that point to the PDFs; in addition, you can
      configure Import.io to download the actual files (not just extract the URL value of the
      download link).

      Commonly, customers are only interested in the extracted dataset and, potentially,
      downloaded files/images (if this is a project requirement). The data specific
      to stats or pages is available to monitor the health of the project and is used
      by the developer building the Extractor or the team running the extraction. While useful
      for development and monitoring, neither stats nor pages are part of the data
      the customer actually uses.


Destinations offer different file types for each Snapshot of data. The customer may prefer
a specific file type, the Delivery team may make a recommendation, or the two may decide
together. Generally, this conversation results in an agreed-upon file type at the start
of a project. The file type also might be determined during Solution Design by a Solution
Architect engaged with the customer.

File Types and Descriptions

Parquet

This is a standard file format and the most common choice.
Parquet has many advantages, such as its compact size, the
fact that it includes schema information, the speed at which
it opens, and the ability to read partial datasets
(subsets) from a parquet file (see the sketch after this
list of file types). Data Science teams often prefer parquet
to JSON, and our Solution Architects commonly recommend parquet.

Custom Text

This allows you to configure the export of datasets
in the .csv and .tsv formats.

JSON

This is a standard file format. JSON is a structured
data format, is commonly used by APIs, and is generally
the go-to format for structured data for Engineering
teams; however, JSON is not favorable for large datasets.

Custom Parquet

This is a standard file format.

Stats (JSON)

This file type allows you to retrieve data about the
statistical properties of the extraction to be exported
for analysis. While rare for customers, this file type
is useful for the internal Managed Services team.

Pages (JSON)

This file type allows you to retrieve data about the
HTML pages encountered during the extraction to be
exported. While rare for customers, this file type is
useful for the internal Development team.

Downloaded Files/Images

This file type allows you to export the files/images
downloaded during extraction. It requires that file download
be configured in the Extractor. Unlike stats and pages, this
output is often directly relevant to the customer.
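
As noted in the Parquet description above, one advantage of parquet is that a consumer can read only the columns they need. The following is a minimal, illustrative Python sketch of how a customer might do that with pandas; the file name and column names are hypothetical placeholders, not part of the product.

# Illustrative only: read a subset of columns from a delivered parquet snapshot.
# "snapshot.parquet", "title", and "price" are hypothetical placeholders.
import pandas as pd

# The columns argument avoids loading the entire dataset into memory.
subset = pd.read_parquet("snapshot.parquet", columns=["title", "price"])
print(subset.head())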

Compression Options

You can use the radio button to select a Compression Option.

gzip, the default selection, is used for single files and compresses faster than zip.
In addition, gzip is widely available on *nix machines and is generally preferred; however,
some customers may only be able to use zip (which can compress entire directories
but is slower). It is the Import.io practice to send datasets in a compressed format.

This option is not available for the Azure Destination type; gzip is used by default.
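
For context, the following Python sketch shows how a recipient might decompress a gzip-delivered snapshot. The compression itself is performed by the platform when it pushes data to the Destination; the file names below are hypothetical placeholders.

# Illustrative only: decompress a gzip-delivered snapshot on the receiving side.
import gzip
import shutil

# Hypothetical file names; a delivered snapshot might arrive as <name>.<ext>.gz.
with gzip.open("snapshot.parquet.gz", "rb") as compressed, \
        open("snapshot.parquet", "wb") as plain:
    shutil.copyfileobj(compressed, plain)  # stream-copy so large files are not held in memory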

Configuration

This section allows you to establish the location where the customer data will be pushed.

Azure

You must enter an Azure Account, Container and Account key.

Path Template and Filename are optional. If Filename is not provided, the default :snapshot_id.:ext template will be used. Collectively, this path designates the Destination where the customer data will be pushed or delivered.

  1. Enter an Azure Blob Storage Account.

    The Azure Storage Account contains all of your Azure Storage data objects.

  2. Enter a Container Name.

    The Container Name represents the Azure Blob Storage location. For example,
    azure-workbench-assets represents a container name.

  3. Enter a Path Template.

    The Path Template may represent a folder you created in the Azure Blob Storage environment along with a
    concatenation of Source Parameter names and template variables. Certain variables require
    prefixes. For example, input/:start_YYYY/:source.stage might represent a Path Template
    (where :start_ is the prefix to the YYYY date/time variable). If the Path Template folder structure doesn't exist in the Container, it is created when data is pushed to the Destination.

  4. Enter a Filename.

    The Filename can include template variables and the inferred extension, .ext. For example,
    :snapshot_id.:ext represents the output format and file extension. A combination
    of the Container Name, Path Template, and Filename might appear as follows:
    azure-workbench-assets / input/:start_YYYY/:source.stage / :snapshot_id.:ext

  5. Enter an Account key.

    This reference, similar to a password, is your Azure Blob Storage account credential and was likely provided
    by a member of your IT department. The text you enter will be encrypted.

  6. To store content, click Save. To disregard, click Cancel.
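
If you want to confirm that the Account, Container, and Account key are valid before saving, one option is to attempt a small upload from your own machine with the azure-storage-blob Python package. This is an illustrative sketch only (the account, key, container, and blob names are placeholders); it is not part of the Workbench UI.

# Illustrative only: verify Azure Blob Storage credentials with a small test upload.
from azure.storage.blob import BlobServiceClient

account = "mystorageaccount"          # placeholder Azure Storage Account
account_key = "<account key>"         # placeholder Account key
container = "azure-workbench-assets"  # placeholder Container Name

service = BlobServiceClient(
    account_url=f"https://{account}.blob.core.windows.net",
    credential=account_key,
)

# Upload a tiny test blob to the container the Destination will use.
blob = service.get_blob_client(container=container, blob="input/connectivity-test.txt")
blob.upload_blob(b"destination connectivity test", overwrite=True)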

S3

You must enter an S3 Bucket Name, Access Key ID, and Secret Access Key.

Path Template and Filename are optional. If Filename is not provided, the default :snapshot_id.:ext template will be used. Collectively, this path designates the Destination where the customer data will be pushed or delivered.

  1. Enter a Bucket Name.

    The Bucket Name represents the AWS S3 cloud storage location. For example,
    aws-workbench-assets represents a bucket name.

  2. Enter a Path Template.

    The Path Template may represent a folder you created in the AWS S3 environment along with a
    concatenation of Source Parameter names and template variables. Certain variables require
    prefixes. For example, input/:start_YYYY/:source.stage might represent a Path Template
    (where :start_ is the prefix to the YYYY date/time variable).

  3. Enter a Filename.

    The Filename can include template variables and the inferred extension, .ext. For example,
    :snapshot_id.:ext represents the output format and file extension. A combination
    of the Bucket Name, Path Template, and Filename might appear as follows:
    aws-workbench-assets / input/:start_YYYY/:source.stage / :snapshot_id.:ext

  4. Enter an Access Key ID and a Secret Access Key.

    These references, similar to passwords, are your S3 credentials and were likely provided
    by a member of your IT department. The text you enter will be encrypted.

  5. To store content, click Save. To disregard, click Cancel.
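
Similarly, you can verify the Bucket Name and credentials before saving by attempting a small upload with the boto3 Python package. This is an illustrative sketch only; the bucket name, key, and credentials are placeholders.

# Illustrative only: verify S3 credentials with a small test upload.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="<access key id>",          # placeholder
    aws_secret_access_key="<secret access key>",  # placeholder
)

# Write a tiny test object under the prefix your Path Template will use.
s3.put_object(
    Bucket="aws-workbench-assets",   # placeholder Bucket Name
    Key="input/connectivity-test.txt",
    Body=b"destination connectivity test",
)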

SFTP

The Secure File Transfer Protocol (SFTP) allows you to securely send and receive files over the Internet.

  1. Enter a Path Template.

    The Path Template may represent a folder along with a concatenation of Source Parameter
    names and template variables.

  2. Enter the Filename.

    The Filename can include template variables and an inferred extension, .ext. For example,
    :snapshot_id.:ext represents the output format and file extension.

  3. Enter a Host.

    A Host is a computer that is accessible over the network.

  4. Enter a Port.

    A Port is a number that identifies the network service on the Host; SFTP typically uses port 22.

  5. Enter a Username, Password, and an SSH Key.

    These references, similar to passwords, are your SFTP credentials.

  6. To store content, click Save. To disregard, click Cancel.
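
As with the other Destination types, you can sanity-check the Host, Port, and credentials from your own machine before saving; one option is the paramiko Python package. This is an illustrative sketch with placeholder values, not part of the Workbench UI.

# Illustrative only: confirm SFTP connectivity and list the landing directory.
import paramiko

transport = paramiko.Transport(("sftp.example.com", 22))         # placeholder Host and Port
transport.connect(username="workbench", password="<password>")   # placeholder credentials

sftp = paramiko.SFTPClient.from_transport(transport)
print(sftp.listdir("."))  # listing the directory confirms the account has access

sftp.close()
transport.close()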

Edit a Destination

To edit a Destination:

  1. From the Destinations page, use the Search feature or scrollbar (if necessary) to locate
    the Destination you want to modify.

  2. Click the Edit button associated with this entry.

    The Edit Destination page appears, allowing you to make modifications to each field except the
    Type field where you previously designated Azure, S3 or SFTP.

  3. To store updates, click Save. To disregard, click Cancel.

Template Variables / Date-Time Formats / Prefixes

Template variables, prefixes, and date/time variables allow you to add detail and specificity
to file locations. Template variables can reference Source Parameters, enabling you to include
Country and Language information, for example. You can also use Parameters downstream when you
want to specify where to push the data. Date/time variables require prefixes: :start_YYYY might
indicate a 2021 start year, where :start_ is the prefix associated with the YYYY
date/time variable.

Template Variables

:org, :project, :source, :source.parameter (e.g., :source.domain), :collection, :snapshot_id,
:delivery_id, :flow, :rows, :ext (inferred extension)

Date/Time Formats

'YYYY', 'YY', 'M', 'MM', 'W', 'WW', 'D', 'DD', 'd', 'H', 'HH', 'm', 'mm', 's', 'ss'
Also 'w' and 'ww' for a week that starts on Sunday instead of Monday

Prefixes

:scheduled_, :start_, :end_, :delivery_start_, :delivery_end_

You can make your S3 templates more readable by adding curly braces:

s3://eds-scrape-pool/scrapers/import/production/outgoing/master_lists/srd/:source.country/MasterList_:source.input_name_:source.country_FullTextReviews.csv

vs.

s3://eds-scrape-pool/scrapers/import/incoming/BAU/srd_01/:{source.frequency}/:{source.country_code}/:{YYYY}/:{MM}/:{DD}/:{source.dest_prefix}.:{source.output_name}.:{YYYY}-:{MM}-:{DD}T00_00_:{version}.csv
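
For illustration, the sketch below shows how curly-brace template variables of this kind might expand once values are substituted. The actual substitution is performed by the platform when it pushes data; the helper function and sample values are hypothetical and handle only the :{name} form shown above.

# Illustrative only: expand :{name} template variables with hypothetical values.
import re

def expand(template: str, values: dict) -> str:
    # Replace each :{name} token with its value from the dictionary.
    return re.sub(r":\{([^}]+)\}", lambda m: values[m.group(1)], template)

values = {
    "source.country_code": "us",
    "YYYY": "2021",
    "MM": "07",
    "DD": "15",
    "snapshot_id": "abc123",
    "ext": "parquet",
}

print(expand("input/:{YYYY}/:{source.country_code}/:{snapshot_id}.:{ext}", values))
# -> input/2021/us/abc123.parquet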

Delete a Destination

You cannot delete Destinations.