Adding a Transform as a Dependency

Overview

Transform functions are used when developing CLI Extractors to revise or clean up data from an extraction. Currently, a transform must be passed as a parameter on a code action; it is then added to the crawl run output and evaluated during a snapshot’s import (Merge Extractions stage). This structure has a limitation: when a transform is passed as a parameter, the stringified version of the function is included on each relevant row of data in the crawl run output, which makes it hard to adjust a transform function without creating a brand new crawl run. To address this, version 3.0.1 of the CLI will introduce two new ways to add a transform to an extractor’s code actions: as a dependency, or listed on an extraction config.

Benefits

  • Transforms will be included on the Runtime Configuration (RTC) for an Extractor, making them more visible and easier to access.

  • Instead of storing the stringified version of a transform on every single data row, the crawl run output will reflect the path to a particular function, which can be referenced on the RTC used for that crawl run.

  • An updated transform can be re-run without creating a new crawl run. When a transform is stored on the RTC, a user can re-run it with a specific RTC, as long as the path reference remains consistent between versions.

Adding a Transform

Transforms are not new to the CLI, nor is the formatting.

Example format of a transform function (not changed):

Transform

A transform function can be named as needed. The name of the exported function must be included in the path when the transform is used as a dependency (preferred) or on an extraction config; see the sections below for examples.

The transform is still passed through the extract method as part of the merge options. Local testing should still reflect the transform on the output. In a crawl run output, the path reference to a transform will be included, as opposed to the full stringified function. The transform will be stored in the RTC for the extractor, saved as a similar reference.

Transform as a Dependency

Similar to listing an extraction config as a dependency on an action, a transform can be added to the dependencies object in the same way.

The dependency name should be the key, and the actual path reference should be listed as 'transform:path/to/transform.[name of transform export (defaults to transform)]'.

Example:

transform dependency example
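A minimal sketch of how such a dependency entry might look on a code action. Only the `dependencies` object, the `transform:` prefix, the `MERGE_ROWS` merge option, and the implementation signature come from this page; the dependency name `transformDependency`, the paths, the export name `cleanPrice`, and the `extraction:` prefix are placeholders or assumptions.

```javascript
// Hypothetical code action fragment; other action fields are omitted.
const action = {
  dependencies: {
    // ASSUMPTION: the format of an extraction config dependency is not shown on this page.
    extraction: 'extraction:path/to/extraction',
    // key = dependency name, value = 'transform:' + path + '.' + export name
    transformDependency: 'transform:path/to/transforms/cleanPrice.cleanPrice',
  },
  // Dependencies arrive by name in the fourth argument; the transform is
  // still passed through the extract method as part of the merge options.
  implementation: async ({ url }, parameters, context, { extraction, transformDependency }) => {
    await context.goto(url);
    await context.extract(extraction, { type: 'MERGE_ROWS', transform: transformDependency });
  },
};

module.exports = action;
```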

Transform on an Extraction Config

To add a transform to a specific extraction config, list the path on the config as path/to/transform.[name of transform]. A transform specified on an extraction config does not need to be listed as part of the merge options; it is added automatically during the extraction.

Example:

transform extraction config
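The path reference format used for both dependencies and extraction configs can be illustrated with a small parser. `parseTransformReference` is a hypothetical helper written for this page, not a CLI function, and it assumes the export name is whatever follows the last dot after the final slash. The fallback to `transform` is documented above for dependencies; applying it to extraction config paths as well is an assumption.

```javascript
// Hypothetical helper, not part of the CLI: splits a transform reference
// into the file path and the exported function name.
function parseTransformReference(reference) {
  // Dependencies carry a 'transform:' prefix; extraction configs list the bare path.
  const prefix = 'transform:';
  const bare = reference.startsWith(prefix) ? reference.slice(prefix.length) : reference;
  const lastSlash = bare.lastIndexOf('/');
  const lastDot = bare.lastIndexOf('.');
  // A dot after the final slash separates the path from the export name.
  if (lastDot > lastSlash) {
    return { path: bare.slice(0, lastDot), exportName: bare.slice(lastDot + 1) };
  }
  // No export name given: fall back to the default name, 'transform'.
  return { path: bare, exportName: 'transform' };
}

console.log(parseTransformReference('transform:path/to/transform.cleanPrice'));
// { path: 'path/to/transform', exportName: 'cleanPrice' }
console.log(parseTransformReference('path/to/transform'));
// { path: 'path/to/transform', exportName: 'transform' }
```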

Migration

In order to use transforms as a dependency (or on an extraction config), extractors need to be re-deployed with the newest CLI version.

Steps:

  • Ensure all transforms are in files external to a code action and are exported

  • Move the transform reference from a parameter to the dependencies list (parameters can be used as variables to define the path/name if necessary)

  • Re-deploy source(s)

  • Verify change to RTC for extractor version (can be done in Hades or by checking dist folder locally)

FAQs and Workarounds

What happens if I specify a transform on my extraction config and as a dependency in my code action?

The dependency reference to a transform will always take precedence over the extraction config.
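The precedence rule can be pictured with a short sketch. `resolveTransformReference` and its argument names are hypothetical and only illustrate the ordering; this is not the CLI's own resolution code.

```javascript
// Hypothetical helper illustrating the precedence rule; not the CLI's code.
function resolveTransformReference({ dependencyReference, configReference }) {
  // A transform listed as a dependency wins over one on the extraction config.
  return dependencyReference ?? configReference ?? null;
}

console.log(resolveTransformReference({
  dependencyReference: 'path/to/a.transform',
  configReference: 'path/to/b.transform',
})); // path/to/a.transform
```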

Is there any benefit to adding a transform as a dependency versus listing it on the extraction config?

Adding a transform as a dependency is preferable because it makes the code more readable and allows parameters to be used as variables in the path definition.

Will my existing extractors still be compatible with v3.0.1 if transforms are being passed as a parameter?

Yes. Transforms defined as parameters are still supported, but they will not be compiled to the RTC, and transforms cannot be re-run for data rows where the transform is a parameter.

How can my source/extractor be migrated to utilize the "Re-run Transforms" feature?

Transforms can be re-run for a Snapshot when the RTC used for the crawl run has transforms listed and the path reference to a transform is found on the data row. If both conditions hold, you can deploy a new extractor version with a revised transform and fix your Snapshot by using "Re-run Transforms".
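The two conditions can be expressed as a small check. The shapes used here are hypothetical: `rtc.transforms` as an object keyed by path reference and `row.transformPath` as the reference stored on a data row are assumptions for illustration, not the actual RTC or crawl run output schema.

```javascript
// Hypothetical shapes: `rtc.transforms` maps path references to transforms,
// and `row.transformPath` holds the reference stored on a data row.
function canRerunTransform(rtc, row) {
  const listed = rtc.transforms ?? {};
  return (
    typeof row.transformPath === 'string' &&
    Object.prototype.hasOwnProperty.call(listed, row.transformPath)
  );
}

const rtc = { transforms: { 'path/to/transform.cleanPrice': '...' } };
console.log(canRerunTransform(rtc, { transformPath: 'path/to/transform.cleanPrice' })); // true
console.log(canRerunTransform(rtc, { transform: '(rows) => rows' })); // false
```

A row that still carries a stringified parameter transform has no path reference, so it falls outside the check.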

Can someone with an older version of the CLI deploy an extractor with a transform listed as a dependency?

No. Any extractors that use the functionality provided by v3.0.1 and higher require the CLI to be updated to the latest version.

If I want to migrate an action or extraction config but cannot migrate every extractor that implements it, can I continue to have mixed usage?

Yes, there are a few options for migrating "top-level actions" or extraction configs without having to migrate any of their children.

  1. Add a conditional in the implementation that checks for the presence of the transform parameter. If it is present, it continues to be used; otherwise, the dependency is used.

    // Example implementation
    
    implementation: async ({ url }, { transformParameter }, context, { extraction, extra, transformDependency }) => {
      await context.goto(url, { timeout: 10000, waitUntil: 'load', checkBlocked: true });
      await context.extract(extraction);
    
      // check for parameter before dependency
      await context.extract(
        extra,
        { type: 'MERGE_ROWS', transform: transformParameter || transformDependency }
      );
    }
    This approach could also be reversed to check for the dependency first, particularly when using parameters as variables in the path.
  2. IMergeOptions is the definition that specifies where a transform is defined. A new attribute, transformPath, will be available for passing a transform dependency. This allows both a transform and a transformPath to be passed, but if a transformPath exists, it is the one used during import. That makes this option less ideal, but it does allow the transform to be compiled into the RTC. The transform path needs to be referenced as transformDependency.id.

    // Example implementation
    
    implementation: async ({ url }, { transformParameter }, context, { extraction, extra, transformDependency }) => {
      await context.goto(url, { timeout: 10000, waitUntil: 'load', checkBlocked: true });
      await context.extract(extraction);
    
      // pass the parameter transform together with the dependency's path reference
      await context.extract(
        extra,
        { type: 'MERGE_ROWS', transform: transformParameter, transformPath: transformDependency.id }
      );
    }
    This should be reserved for specific edge cases and not used as a migration path. Where possible, use the first option for a phased migration. If you are testing an action locally, the parameter transform will be evaluated; however, running the source in DOC will use the transformPath during MergeExtractions instead of the parameter.
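Option 1 notes that the check can be reversed so that the dependency is preferred. A minimal sketch of that variant, reusing the names from the examples above. The `mockContext` object and the placeholder string values are stand-ins added so the snippet can run on its own; they are not part of the CLI.

```javascript
const implementation = async ({ url }, { transformParameter }, context, { extraction, extra, transformDependency }) => {
  await context.goto(url, { timeout: 10000, waitUntil: 'load', checkBlocked: true });
  await context.extract(extraction);

  // check for dependency before parameter
  await context.extract(extra, {
    type: 'MERGE_ROWS',
    transform: transformDependency || transformParameter,
  });
};

// Stand-in context that records merge options instead of crawling (mock, for illustration only).
const seen = [];
const mockContext = {
  goto: async () => {},
  extract: async (config, options) => { seen.push(options); },
};

implementation(
  { url: 'https://example.com' },
  { transformParameter: 'parameter transform' },
  mockContext,
  { extraction: 'extraction', extra: 'extra', transformDependency: 'dependency transform' }
).then(() => console.log(seen[1].transform)); // logs "dependency transform"
```

With this ordering, extractors that have already moved to the dependency use it, while children that still pass only the parameter keep working unchanged.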