Sources

A Source is an Extractor or web crawling tool. A Source maps to an Extractor ID.
A group of Sources is a Collection.

Add a Source

To add a Source:

  1. From the left navigation pane, click Sources.

  2. From the top right of the Collection Sources page, click the Add Source icon
    or plus (+) symbol.

  3. From the New Source page, enter text in the Name field.

    The name you enter autofills the Slug/ID field. You can associate a Slug/ID, which
    serves as a self-defined identifier, with many DOC platform objects. Slug/IDs are
    useful when you reference APIs or create variable names. Because you cannot change
    a Slug/ID, make sure it is meaningful.

  4. Enter content in the Status Text field.

    A Source may transition from one state to another. States include QUEUED, IN_PROGRESS,
    ISSUE, READY, and ACTIVE, among others. The Status Text field allows you to associate
    a description with the state. If the state is ISSUE, for example, the following text
    might appear in this field: "Returns 500 responses instead of 404. This is a block
    and is no longer used. Lambdas now perform this job."

    The content in the Status Text field might also indicate that the website changed
    (and is no longer accessible) or that the Extractor must be retrained. In these cases,
    the text describes the ISSUE. When you first create a Source, you may leave this field
    blank. When you save the entry, the State field defaults to QUEUED; as the import
    process progresses and the state changes, you may enter clarifying text in this field.

  5. Enter a value in the Extractor ID field.

    This value represents the alphanumeric assignment of the Extractor you designate.
    You can retrieve this value from the browser of the Extractor.

  6. Enter a value in the Maximum Allowed Inputs field, or use the arrows at the end
    of the field to make a selection.

    This value is a threshold: a Source will not run if the number of inputs exceeds it.
    Excessive inputs can consume resources that other projects require. Currently, if the
    number of inputs exceeds the threshold, the Snapshot fails instead of starting. With
    future development, the Snapshot will instead be trimmed to the Maximum Allowed Inputs.

  7. Enter Labels.

    Labels are user-defined tags. High frequency, dev, development, and staging
    are commonly used labels. Labels are similar, in use, to Parameters.

  8. Enter Parameters.

    Parameters help you further group or distinguish Sources and dictate behavior. Locale and
    Domain are the default Parameters. If you have an Extractor that performs ebay.com searches,
    for example, ebay.com-uk and ebay.com-fr might represent locales. Domain is the website;
    in this case, ebay.com. Parameters have numerous downstream uses. Using Parameters, you can
    tag Sources, establish key/value pairs, and filter Sources. For example, you can add a Stage
    Parameter and associate it with the dev environment. You can add Parameters on the
    Collections page and provide a value for each one.
    For example, Locale=en_us, Domain=ebay.com, Stage=dev.

  9. Enter content in the Status Code Formula field.

    Commonly, the status code refers to the HTTP response code. For example, a 200 response code
    is usually returned when the page fetch is successful. However, some sites return unusual status
    codes which may cause the system to register failures when, perhaps, there are none. A 200
    response code may still be returned when the intended data is not provided/returned. The Status
    Code Formula allows the status code to be rewritten based on various clues or indicators.

    According to the screen help text or tooltip for this field:

    "Falsy value means no change, -1 means blocked, otherwise return a number. Supported functions,
    plus PARAM('name') to get a Source Parameter and INPUT('name') to get an input value and all the
    page row columns as variables, such as exceptionType."

    This help text indicates that the Formula may yield a result based on Source Parameters,
    input values, and values in the output columns; you also can simply rewrite the status code
    directly. For example, the following Status Code Formula could be written for a site that
    returns a 101 status code instead of 200.

    IF(statusCode = 101, 200, statusCode)

    The statement above changes the status code to 200 if it was originally 101.

    In another example, the statement references a column of data to check that it has
    a meaningful value:

    IF(productId = NULL, statusCode, 200)

    The statement above keeps the original status code if productId is NULL; otherwise,
    it sets the status code to 200.

  10. To share helpful Source information with team members, enter text in the README section.

    This section, which supports the markdown syntax, allows you to provide additional
    context and insight.

  11. To store content, click Save. To disregard, click Cancel.
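
The rewrite rule described in step 9 can be illustrated outside the platform. The following Python sketch is only an illustration under stated assumptions, not the platform's formula engine: `apply_status_formula`, `rewrite_101`, `require_product_id`, and the `row` dictionary are hypothetical names. The handling of a falsy result (no change) and of -1 (blocked) follows the tooltip text quoted in step 9.

```python
from typing import Any, Callable, Mapping

BLOCKED = -1  # per the tooltip, a result of -1 means "blocked"

Row = Mapping[str, Any]
Formula = Callable[[Row], Any]


def apply_status_formula(row: Row, formula: Formula) -> int:
    """Return the status code for a page row after applying a formula.

    Hypothetical helper: a falsy result (None, 0, False) leaves the
    original status code unchanged; any other number replaces it.
    """
    result = formula(row)
    if not result:
        return int(row["statusCode"])
    return int(result)


# Python equivalent of IF(statusCode = 101, 200, statusCode)
def rewrite_101(row: Row) -> int:
    return 200 if row["statusCode"] == 101 else row["statusCode"]


# Python equivalent of IF(productId = NULL, statusCode, 200)
def require_product_id(row: Row) -> int:
    return row["statusCode"] if row.get("productId") is None else 200


if __name__ == "__main__":
    print(apply_status_formula({"statusCode": 101}, rewrite_101))
    print(apply_status_formula({"statusCode": 500, "productId": None}, require_product_id))
```

Modeling the formula as a plain function of the row keeps the sketch small; the real field accepts a formula expression, so treat the function bodies as a reading aid only.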

Returning to the Sources page allows you to view a table which contains defined
Sources. The table displays the Name, Status, Labels, Parameters, Extractor, and
Assignee columns. In addition, the Set Params button allows you to create Parameters and
assign name/value pairs. You must select a Source to enable this button.

You can filter and sort the entries in this table as needed; in addition, you can use
the Search feature or scrollbar to locate existing Sources.
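
Because Parameters are key/value pairs, filtering on them amounts to matching a subset of keys. The sketch below shows the idea in Python with made-up data. The `Source` class, the `filter_sources` function, and the sample Source names and values are hypothetical and do not represent a DOC platform API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Source:
    """Minimal stand-in for a Source row: name, Labels, and Parameters."""
    name: str
    labels: List[str] = field(default_factory=list)
    parameters: Dict[str, str] = field(default_factory=dict)


def filter_sources(sources: List[Source], **wanted: str) -> List[Source]:
    """Keep Sources whose Parameters match every requested key/value pair."""
    return [
        s for s in sources
        if all(s.parameters.get(key) == value for key, value in wanted.items())
    ]


# Made-up sample data for illustration only.
sources = [
    Source("ebay-us", ["dev"],
           {"Locale": "en_us", "Domain": "ebay.com", "Stage": "dev"}),
    Source("ebay-uk", ["staging"],
           {"Locale": "en_gb", "Domain": "ebay.com", "Stage": "staging"}),
]

dev_sources = filter_sources(sources, Domain="ebay.com", Stage="dev")
print([s.name for s in dev_sources])  # prints ['ebay-us']
```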

Edit a Source

After you add a Source, it defaults to the QUEUED state. If you assume ownership of a
QUEUED Source, its state transitions to IN_PROGRESS. After the Source is checked,
you can change the state to QA. If the Source passes QA, its state automatically
transitions to ACTIVE, and its data may be published. If a Source Snapshot fails QA,
the Source transitions to MAINTENANCE. Once it passes QA, it returns to ACTIVE.
You also can change the state of the Source by performing a manual edit.

Immediately after you save a Source, a new page appears where you may make edits. You also
may perform edits by selecting a Source from the primary Sources page.

To edit a Source:

  1. From the left navigation pane, click Sources.

  2. From the Sources page, use the Search feature or scrollbar to locate the Source
    you want to modify.

  3. Click to select this entry.

    A new page appears. (This is the same page that appears immediately after you save
    a new Source.) On this page, you can edit the Labels, Extractor, and Maximum Inputs
    fields by clicking the Edit or pencil-like icon associated with each field. To make
    other modifications, however, you must use the Edit icon at the top right of the page.
    You also can click the Run Source icon at the top right of the page to run the Source.
    Running the Source creates a Snapshot.

  4. From the top right of this page, click the Edit Source or pencil-like icon.

    You may make edits to all fields on this page except the Slug/ID field.

  5. To store updates, click Save. To disregard, click Cancel.

You cannot delete Sources.
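
The state transitions described at the start of this section can be summarized as a small lookup table. The Python sketch below is a reading aid, not platform code. The event names are hypothetical labels, the assumption that a failed QA moves a Source from ACTIVE to MAINTENANCE is an interpretation of the text, and other states (such as ISSUE and READY) and manual edits are left out.

```python
from enum import Enum


class State(Enum):
    QUEUED = "QUEUED"
    IN_PROGRESS = "IN_PROGRESS"
    QA = "QA"
    ACTIVE = "ACTIVE"
    MAINTENANCE = "MAINTENANCE"


# Transitions keyed by (current state, event).
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    (State.QUEUED, "assume_ownership"): State.IN_PROGRESS,
    (State.IN_PROGRESS, "checked"): State.QA,
    (State.QA, "qa_passed"): State.ACTIVE,
    # Assumption: the failing Snapshot belongs to an ACTIVE Source.
    (State.ACTIVE, "qa_failed"): State.MAINTENANCE,
    (State.MAINTENANCE, "qa_passed"): State.ACTIVE,
}


def next_state(current: State, event: str) -> State:
    """Return the state that follows `current` for `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(
            f"no transition from {current.value} on {event!r}"
        ) from None


if __name__ == "__main__":
    state = State.QUEUED
    for event in ("assume_ownership", "checked", "qa_passed"):
        state = next_state(state, event)
    print(state.value)  # prints ACTIVE
```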