Collection Settings

This page allows you to set Collection preferences that determine how extracted data output
is formatted.

[Screenshot: Collection Settings]

Global Output Settings

Blank Row for Empty Page

[Screenshot: Global Output Settings]

Select this checkbox to add a blank row for each imported Snapshot that contains no data. Some
customers require this blank row to appear in the output. A blank row can also appear because
of an error.

[Screenshot: error rows]

Include First Seen

Select this checkbox to add a column to the output. This column indicates the timestamp of the
Snapshot in which the record was first encountered. Because records must be tracked between
crawl runs, the Schema associated with the Collection must have a well-defined primary key.

Custom Output Settings

Some customers need the extracted data in a specific format. You can use jq to transform the
JSON data, which lets you customize the data output.

[Screenshot: Custom Output Settings]
The following fields are available:

Group by JQ Field Selection

Enter a jq expression that organizes (groups) records before they are transformed. For
example, you can group records by a key field or a set of key fields.

Header

You can add a header for output formats that support one.

JQ Transform Override

Enter the jq expression that transforms each record. To develop the expression, download
the JSON output for a Snapshot and iterate on the expression as needed.

Footer

You can add a footer for output formats that support one.

File Extension

Enter an extension that matches the file type of the output.

Schema

You must select a Schema for the parquet output
format. Schemas align with the Extractor column names.

Include Byte Order Mark

Select this checkbox to include the Unicode BOM
at the beginning of the file.
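
For reference, the UTF-8 byte order mark is the three-byte sequence EF BB BF. A quick way to
confirm it at the start of a downloaded file (the filename here is hypothetical):

head -c 3 output.csv | xxd
# expected output begins: efbb bf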

If you designated custom text output in the Destination, this section allows you to configure
how that output is generated when published. The application uses jq to take the Snapshot as
JSON and convert it to another format, such as JSON, CSV, or TSV. The transform runs in
streaming mode, so you cannot use the --slurp (-s) command-line switch; however, [., inputs]
within the transform performs a similar action, primarily to provide a means of grouping the
records before transforming them. As noted in the Group by JQ Field Selection section below,
there is a better method for grouping records.
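
To see what [., inputs] does, here is a minimal sketch you can run locally (the sample records
are hypothetical):

printf '%s\n' '{"id":1}' '{"id":2}' '{"id":3}' \
| jq -c '[., inputs]'
# emits one array gathering the first record (.) and all remaining records (inputs):
# [{"id":1},{"id":2},{"id":3}]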

jq EXAMPLE (iterating on a transform locally against the first record of a downloaded Snapshot):

head -n 1 ~/Desktop/vs.json \
| jq -rc '[._url, .site, .event_name, .event_date,
           (._input | fromjson | .ProdId),
           (._input | fromjson | .Token),
           ._pageTimestamp] | @tsv'

Run jq with the -r and -c flags enabled.

As indicated in the table above, you may also add headers and footers for output formats that
support them. Provide an extension that matches the file type of the output. For custom text
output, there is no need to select a Schema.
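
For example, a hypothetical configuration for CSV output of the name records used in the next
section might be:

Header:                 LastName,FirstName,City
JQ Transform Override:  [.LastName, .FirstName, .City] | @csv
File Extension:         csv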

Group by JQ Field Selection

If you need to group records by a key field or a set of key fields, use the Group by JQ Field
Selection text box. This field accepts a jq transform that yields an array of values; that
array is used to group the records before they are passed to the JQ Transform Override for
further processing. For example, assume there is a need to group the following JSON records
by LastName:

{"FirstName": "Fred", "LastName": "Flintstone", "City": "Bedrock"}
{"FirstName": "Betty", "LastName": "Rubble", "City": "Bedrock"}
{"FirstName": "George", "LastName": "Jetson", "City": "Orbit City"}
{"FirstName": "Jane", "LastName": "Jetson", "City": "Orbit City"}

The following jq transformation could be used to group the records by last name:

[.LastName]

Resulting output from grouping:

[{"FirstName":"Fred","LastName":"Flintstone","City":"Bedrock"}]
[{"FirstName":"George","LastName":"Jetson","City":"Orbit City"},{"FirstName":"Jane","LastName":"Jetson","City":"Orbit City"}]
[{"FirstName":"Betty","LastName":"Rubble","City":"Bedrock"}]

Each line is now an array that may contain multiple records. If the JQ Transform Override is
also defined, it should not be written with the assumption that there is a single, bare record
on each line.

To process these grouped records one at a time, make a slight adjustment to the JQ Transform
Override by adding .[] | to the beginning of the transform. This addition instructs jq to
iterate over each record in the line's group, processing each record according to the
expression that follows the | (pipe) symbol. For example:

.[] | [.LastName, .FirstName, .City] | @csv

would yield:

"Flintstone","Fred","Bedrock"
"Jetson","George","Orbit City"
"Jetson","Jane","Orbit City"
"Rubble","Betty","Bedrock"

Alternatively, you can preserve the grouping to produce a more complex record using this
JQ Transform Override:

{group: input_line_number, people: [(.[] | {last: .LastName, first: .FirstName, city: .City})]}

which yields:

{"group":1,"people":[{"last":"Flintstone","first":"Fred","city":"Bedrock"}]}
{"group":2,"people":[{"last":"Jetson","first":"George","city":"Orbit City"},{"last":"Jetson","first":"Jane","city":"Orbit City"}]}
{"group":3,"people":[{"last":"Rubble","first":"Betty","city":"Bedrock"}]}

Custom Parquet Output

If you selected custom parquet output in the Destination, configure the Custom Output Settings.
However, there are a few considerations:

  • Select a Schema that matches the intended parquet output. The menu lists the Schemas
    available to the Organization. If no appropriate Schema exists, create one that represents
    the intended output.

  • The output generated by the provided jq transformation must be
    a JSON object with properties that match the selected Schema.

  • The extension is ignored. The output will always be .custom.parquet.

  • Headers and footers are ignored.

When crafting the transform, be mindful of the selected Schema. Although a field may not be
required, it must be present in the output. Use null to represent a field with no value.
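
For example, assuming the selected Schema defines last_name, first_name, and middle_name
columns (hypothetical names), the transform must emit every one of them, even for records
that lack a value:

{ last_name: .LastName,
  first_name: .FirstName,
  # jq yields null for a missing key, so middle_name is always present in the output
  middle_name: .MiddleName }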

Data Store Settings

A Data Store is an online repository that captures historical data for a Collection. The terms
Data Store and Redshift Data Lake are often used interchangeably. A Data Store is an internal
destination managed by the application.

A Super Admin can provision a Data Store for an Organization. Once an Organization has a Data
Store, it receives connection information and credentials for the database and can connect
using a standard SQL client. From the application, the Organization can create tables in the
Data Store for any of its Collections. If the Organization has a product_details Collection,
for example, it can create a product_details table within its Data Store. If Redshift Data Lake
is selected as a Linked Destination, all Snapshots are pushed to the Data Store once they pass
QA, and you can query that data. This is helpful if you want to create visualizations with a
Business Intelligence (BI) tool or write complex queries against the data.

Provision Table for Collection

Here, you can add a table to the Data Store.

All historic data that has passed QA is immediately available for query.

[Screenshot: Data Store settings]

  • A database schema is created for each Project, named using the Project slug.

  • A table is created for the Collection data, named using the Collection slug.

  • Nested data (arrays and objects) is handled using SQL extensions for Redshift.
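
As a sketch, assuming a Project slug of my_project and the product_details Collection
mentioned above, you could query the table with a standard SQL client such as psql (the
endpoint, database, and user come from your provisioned connection details):

psql "host=<datastore-endpoint> port=5439 dbname=<database> user=<user>" \
  -c 'SELECT COUNT(*) FROM my_project.product_details;'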

Get Table Name

[Screenshot: Get Table Name]

Screen Capture Settings

Select the Include Screenshots checkbox if you want to include images, as designated
in the associated Extractor.