Collection Settings

This page allows you to establish Collection preferences and designations that determine how the
extracted data output appears.

[Image: collection settings primary]

Global Output Settings

Blank Row for Empty Page

Select this checkbox to add a blank row for each imported Snapshot that has no data. Some
customers require this blank row to appear. A blank row might also appear because of an error.

[Image: error rows]

Include First Seen

Select this checkbox to add a column to the output. This column indicates the timestamp of the
Snapshot in which the record was first encountered. Tracking records between crawl runs requires
that the Schema associated with the Collection have a well-defined primary key.

Custom Output Settings

Some customers want the extracted data reformatted. You can use JQ to transform the JSON data,
which lets you customize the data output.

Field Description

Header

You can add headers for output formats that support them.

JQ Transform Override

To develop a JQ expression, download the JSON output
for a Snapshot. Next, iterate on the expression as needed.
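
As a sketch of that iteration loop (assuming jq is installed locally; snapshot.json and its
fields stand in for a real downloaded Snapshot):

```shell
# Hypothetical snapshot file; a real Snapshot's name and fields will differ.
printf '%s\n' '{"_url":"https://example.com/p/1","site":"example"}' > snapshot.json

# Try an expression against the first record, refine it, and repeat.
head -n 1 snapshot.json | jq -rc '[._url, .site] | @tsv'
```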

Footer

You can add footers for output formats that support them.

File Extension

Enter an extension that matches the file type of the output.

Schema

You must select a Schema for the parquet output format.

Include Byte Order Mark

Select this checkbox to include the byte order mark
(BOM), a byte sequence that may appear at the
beginning of an encoded file.

If you designated custom text output in the Destination, this section allows you
to configure how that output is generated when published. It uses JQ to take the Snapshot
as JSON and convert it to another output such as JSON, CSV, or TSV. This is performed
in streaming mode. While you cannot use the slurp command-line switch, [., inputs]
within the transform performs a similar action.

JQ EXAMPLE:

head -n 1 ~/Desktop/vs.json \
| jq -rc '[._url,.site,.event_name,.event_date,(._input |fromjson |.ProdId),
(._input |fromjson |.Token),._pageTimestamp]|@tsv'

It is common to enable the -r and -c flags.
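
To illustrate the slurp substitute mentioned above (a minimal sketch, assuming jq is installed
and using made-up records): without the -s flag, jq binds the first input to . and the rest to
inputs, so [., inputs] gathers every record into a single array.

```shell
# Three newline-delimited JSON records, collected into one array,
# mimicking what --slurp would produce.
printf '%s\n' '{"id":1}' '{"id":2}' '{"id":3}' \
| jq -c '[., inputs]'
```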

Custom Parquet Output

If you selected custom parquet output in the Destination, you should configure Custom Output
Settings. However, there are a few considerations:

  • Select a Schema that matches the intended parquet output. The menu will list the Schemas
    available to the Organization. Create one that represents the intended output if there is not
    an appropriate one available.

  • The output generated by the provided JQ transformation must be
    a JSON object with properties that match the selected Schema.

  • The extension is ignored. The output will always be .custom.parquet.

  • Headers and footers are ignored.

When crafting the transform, be mindful of the selected Schema. Although a field may not be
required, it must be present in the output. Use null to represent a field with no value.
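
As a minimal sketch of that rule (assuming jq is installed; the field names name and price are
hypothetical stand-ins for a selected Schema's fields): emit every Schema field, using null when
the record has no value for it.

```shell
# The input record lacks "price"; the transform still emits the key,
# with null, so the output matches the assumed Schema.
echo '{"name":"widget"}' | jq -c '{name: .name, price: (.price // null)}'
```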

Group-by JQ field selection

If you need to group records by a key field or a group of key fields, use the "Group-by JQ Field Selection" text box. This field accepts a jq transform that yields an array of values, which is used to efficiently group the records before they are passed on for subsequent processing by the "JQ Transform Override". For example, assume there is a need to group JSON records such as the following by LastName:

{"FirstName": "Fred", "LastName": "Flintstone", "City": "Bedrock"}
{"FirstName": "Betty", "LastName": "Rubble", "City": "Bedrock"}
{"FirstName": "George", "LastName": "Jetson", "City": "Orbit City"}
{"FirstName": "Jane", "LastName": "Jetson", "City": "Orbit City"}

The following jq transformation could be used to group the records by last name:

[.LastName]

The resulting output from the grouping would be:

[{"FirstName":"Fred","LastName":"Flintstone","City":"Bedrock"}]
[{"FirstName":"George","LastName":"Jetson","City":"Orbit City"},{"FirstName":"Jane","LastName":"Jetson","City":"Orbit City"}]
[{"FirstName":"Betty","LastName":"Rubble","City":"Bedrock"}]

Notice that each line may now contain multiple records: each line is an array of records. This means that if the "JQ Transform Override" is defined, it must not be written assuming there is one record on each line.

To process these grouped records one record at a time, the JQ Transform Override must be changed somewhat. The easiest way to do this is to add .[] | to the beginning of the transform. This instructs jq to iterate over each record of the line's group and process each record as described after the '|'. For example:

.[] | [.LastName, .FirstName, .City] | @csv

would yield:

"Flintstone","Fred","Bedrock"
"Jetson","George","Orbit City"
"Jetson","Jane","Orbit City"
"Rubble","Betty","Bedrock"

Or, the grouping can be preserved to produce a more complex record using this JQ Transform Override:

{group: input_line_number, people: [(.[] | {last: .LastName, first: .FirstName, city: .City})]}

which yields:

{"group":1,"people":[{"last":"Flintstone","first":"Fred","city":"Bedrock"}]}
{"group":2,"people":[{"last":"Jetson","first":"George","city":"Orbit City"},{"last":"Jetson","first":"Jane","city":"Orbit City"}]}
{"group":3,"people":[{"last":"Rubble","first":"Betty","city":"Bedrock"}]}
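
You can reproduce the per-record transform locally against a single grouped line (assuming jq is
installed; the records are the sample Jetson data from above):

```shell
# One grouped line (an array of records), processed one record at a time
# via the leading .[] | as described above.
printf '%s\n' \
  '[{"FirstName":"George","LastName":"Jetson","City":"Orbit City"},{"FirstName":"Jane","LastName":"Jetson","City":"Orbit City"}]' \
| jq -r '.[] | [.LastName, .FirstName, .City] | @csv'
```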

Data Store Settings

A Data Store is an online repository that stores information. More specifically, the
Data Store captures historical data for a Collection. Commonly, Data Store and Redshift Data Lake
are used interchangeably. Data Store is an internal destination managed by the application.

A Super Admin can provision a Data Store for an Organization. Once an Organization has a Data Store,
it is provided connection information and credentials to access the database. The Organization can
connect to the database using a standard SQL client. From the application, the Organization can create
tables for any of their Collections in the Data Store. If the Organization has a product_details
Collection, for example, the Organization can create a product_details table within their Data Store.
If the Redshift Data Lake is selected as a Linked Destination, all Snapshots are pushed to the
Data Store once they pass QA, and you can query this data. This is helpful if you
want to create visualizations using a Business Intelligence (BI) tool, or if you want to write
complex queries specific to this data.

Provision Table for Collection

Here, you can add a table to the Data Store.

All the historical data that has passed QA is immediately available for query.

[Image: collection settings datastore]

There is a Schema created in your database for each project, which is named using the Project slug.

There is a table created for the Collection data, which is named using the Collection slug.

Nested data (arrays and objects) is handled using SQL extensions for Redshift.

Get Table Name

[Image: collection settings datastore2]

Screen Capture Settings

Select the Include Screenshots checkbox if you want to include images, as designated
in the associated Extractor.