A Source is an Extractor or web crawling tool. A Source maps to an Extractor ID.
A group of Sources is a Collection.
To add a Source:
From the left navigation pane, click Sources.
From the top right of the Collection Sources page, click the Add Source icon
or plus (+) symbol.
From the New Source page, enter text in the Name field.
The name you enter autofills the Slug/ID field. Slug/IDs are self-defined identifiers
that you can associate with many DOC platform objects. They can be useful when you
reference APIs or create variable names. You cannot change a Slug/ID, so make sure
it is meaningful.
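As an illustration only, the sketch below shows one way a name could be turned into a slug. The rule used here (lowercase, alphanumeric runs joined with hyphens) and the function name slugify are assumptions made for this example; the actual rule the DOC platform applies when it autofills the Slug/ID field is not documented in this section.

```python
import re


def slugify(name: str) -> str:
    """Build a slug from a display name.

    Assumed rule (not confirmed for the DOC platform): lowercase the name,
    keep runs of letters and digits, and join them with hyphens.
    """
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "-".join(words)


print(slugify("eBay UK Search"))  # ebay-uk-search
```

Because the Slug/ID cannot be changed later, previewing the result of a naming scheme like this before saving can help you settle on a meaningful name.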
Enter content in the Status Text field.
A Source may transition from one state to another. QUEUED, IN_PROGRESS, ISSUE,
READY, and ACTIVE (among others) are states. The Status Text field allows
you to associate a description with the current state. If the state is ISSUE, for example,
the following text might appear in this field: "Returns 500 responses instead
of 404. This is a block and is no longer used. Lambdas now perform this job."
The content in the Status Text field also might indicate that the website changed
(and is no longer accessible) or that the Extractor must be retrained. In these cases,
the text describes the ISSUE. When you first create a Source, you may leave this field
blank. When you save the entry, the State field defaults to QUEUED. As the import
process progresses and the state changes, you may choose to enter clarifying text
in the Status Text field.
Enter a value in the Extractor ID field.
This value is the alphanumeric identifier of the Extractor you designate.
You can retrieve this value from the browser of the Extractor. For example:
Enter a value in the Maximum Allowed Inputs field or use the arrows at the end
of the field to make a selection.
Excessive inputs can consume resources that other projects require. When you run a Flow,
if the number of inputs for a Source exceeds the Maximum Allowed Inputs value, the inputs file
is trimmed so that the Snapshot can still run (rather than fail to start).
The number of trimmed inputs is captured in the Trimmed Input Count field
on the Snapshots page.
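The trimming behavior described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function name trim_inputs is hypothetical, and keeping the first inputs in file order is an assumption, since this section does not say which inputs are dropped.

```python
from typing import List, Tuple


def trim_inputs(inputs: List[dict], max_allowed: int) -> Tuple[List[dict], int]:
    """Cap the inputs at max_allowed and report how many were dropped.

    Assumption: the first max_allowed inputs are kept, in their original order.
    """
    if max_allowed < 0:
        raise ValueError("max_allowed must be zero or greater")
    kept = inputs[:max_allowed]
    trimmed_count = len(inputs) - len(kept)
    return kept, trimmed_count


# Five hypothetical inputs against a limit of three.
inputs = [{"url": f"https://example.com/item/{i}"} for i in range(5)]
kept, trimmed_count = trim_inputs(inputs, max_allowed=3)
print(len(kept), trimmed_count)  # 3 2
```

The second return value plays the role of the Trimmed Input Count: it is zero whenever the inputs fit within the limit.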
Labels are user-defined tags. High frequency, dev, development, and staging
are commonly used labels. Labels are used in much the same way as Parameters.
Parameters help you further group or distinguish Sources and dictate behavior. Locale and
Domain are the default Parameters. If you have an Extractor that performs ebay.com searches,
for example, ebay.com-uk and ebay.com-fr might represent locales. Domain is the website;
in this case, ebay.com. Parameters have numerous downstream uses: you can use them to tag
Sources with key/value pairs, and you can use them as filters. For example, you can add
a Stage Parameter and set it to dev to mark Sources that belong to the dev environment.
You can add Parameters on the Collections page. For each Parameter, you can provide a value.
For example, Locale=en_us, Domain=ebay.com, Stage=dev.
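To make the idea of Parameters as key/value pairs and filters concrete, here is a small sketch. The Source class, the filter_by_parameters function, the Source names, and every Parameter value other than Locale=en_us, Domain=ebay.com, and Stage=dev are hypothetical and exist only for this example; they do not represent a DOC platform API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Source:
    """Hypothetical stand-in for a Source and its Parameters."""
    name: str
    parameters: Dict[str, str] = field(default_factory=dict)


def filter_by_parameters(sources: List[Source], **wanted: str) -> List[Source]:
    """Return the Sources whose Parameters match every requested key/value pair."""
    return [
        s for s in sources
        if all(s.parameters.get(key) == value for key, value in wanted.items())
    ]


sources = [
    Source("ebay-us", {"Locale": "en_us", "Domain": "ebay.com", "Stage": "dev"}),
    Source("ebay-uk", {"Locale": "en_gb", "Domain": "ebay.com", "Stage": "dev"}),
    Source("ebay-fr", {"Locale": "fr_fr", "Domain": "ebay.com", "Stage": "prod"}),
]

dev_sources = filter_by_parameters(sources, Domain="ebay.com", Stage="dev")
print([s.name for s in dev_sources])  # ['ebay-us', 'ebay-uk']
```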
Enter content in the Status Code Formula field.
Commonly, the status code refers to the HTTP response code. For example, a 200 response code
is usually returned when the page fetch is successful. However, some sites return unusual status
codes, which may cause the system to register failures where there are none. Conversely, a site
may return a 200 response code even when the intended data is missing. The Status
Code Formula allows the status code to be rewritten based on various clues or indicators.
According to the screen help text or tooltip for this field:
"Falsy value means no change, -1 means blocked, otherwise return a number. Supported functions,
plus PARAM('name') to get a Source Parameter and INPUT('name') to get an input value and all the
page row columns as variables, such as exceptionType."
This help text indicates that the Formula may yield a result based on Source
Parameters, input values, and values in the output columns. You could also simply rewrite
the status code directly. For example, the following Status Code Formula could be written
for a site that returns a 101 status code instead of 200:
IF(statusCode = 101, 200, statusCode)
The statement above changes the status code to 200 if it was originally 101.
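In another example, the statement could reference columns of data to check that they contain
values. The following Formula keeps the original status code when productId is empty and
returns 200 otherwise: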
IF(productId = NULL, statusCode, 200)
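The logic of the two Formulas above, together with the tooltip rule that a falsy result means no change and -1 means blocked, can be mirrored in Python for illustration. This is a sketch of the described behavior, not the platform's Formula engine; all function names are hypothetical, and representing a page row as a dictionary is an assumption.

```python
from typing import Optional

BLOCKED = -1  # per the tooltip, -1 means the request was blocked


def remap_101(status_code: int) -> int:
    """Mirrors IF(statusCode = 101, 200, statusCode)."""
    return 200 if status_code == 101 else status_code


def check_product_id(status_code: int, row: dict) -> int:
    """Mirrors IF(productId = NULL, statusCode, 200)."""
    return status_code if row.get("productId") is None else 200


def apply_formula_result(original_status: int, result: Optional[int]) -> int:
    """Apply a Formula result: a falsy value (None, 0) leaves the status unchanged;
    any other number, including -1 for blocked, replaces it."""
    if not result:
        return original_status
    return int(result)


print(remap_101(101))                              # 200
print(check_product_id(500, {"productId": None}))  # 500
print(apply_formula_result(200, BLOCKED))          # -1
```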
To share helpful Source information with team members, enter text in the README section.
This section, which supports Markdown syntax, allows you to provide additional
context and insight.
To store your entries, click Save. To discard them, click Cancel.
Returning to the Sources page allows you to view a table which contains defined
Sources. The table displays the Name, Status, Labels, Parameters, Extractor, and
Assignee columns. In addition, the Set Params button allows you to create Parameters and
assign name/value pairs. You must select a Source to enable this button.
You can filter and sort the entries in this table as needed; in addition, you can use
the Search feature or scrollbar to locate existing Sources.
To edit a Source:
After you add a Source, it defaults to the QUEUED state. If you assume ownership of a
QUEUED Source, its state transitions to IN_PROGRESS. After the Source has been checked,
you can change the state to QA. The Source automatically transitions
to ACTIVE if it passes QA; at that point, its data may be published. If a Source Snapshot
fails QA, the Source transitions to MAINTENANCE. Once it passes QA, it returns to ACTIVE.
You can also change the state of the Source by performing a manual edit.
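The state lifecycle described above can be sketched as a small transition table. The event names (assume_ownership, mark_checked, qa_passed, qa_failed) are hypothetical labels chosen for this example, and treating a QA failure as a transition from ACTIVE is an assumption drawn from the statement that the Source "returns to ACTIVE". Other states, such as ISSUE and READY, and manual edits are left out.

```python
from enum import Enum


class State(Enum):
    QUEUED = "QUEUED"
    IN_PROGRESS = "IN_PROGRESS"
    QA = "QA"
    ACTIVE = "ACTIVE"
    MAINTENANCE = "MAINTENANCE"


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (State.QUEUED, "assume_ownership"): State.IN_PROGRESS,
    (State.IN_PROGRESS, "mark_checked"): State.QA,
    (State.QA, "qa_passed"): State.ACTIVE,
    (State.ACTIVE, "qa_failed"): State.MAINTENANCE,
    (State.MAINTENANCE, "qa_passed"): State.ACTIVE,
}


def next_state(current: State, event: str) -> State:
    """Return the next state, or the current one if the event does not apply."""
    return TRANSITIONS.get((current, event), current)


state = State.QUEUED
for event in ["assume_ownership", "mark_checked", "qa_passed", "qa_failed", "qa_passed"]:
    state = next_state(state, event)
print(state.name)  # ACTIVE
```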
Immediately after you save a Source, a new page appears. Here, you may make edits. You also
may perform edits by selecting a Source from the primary Sources page.
From the left navigation pane, click Sources.
From the Sources page, use the Search feature or scrollbar to locate the Source
you want to modify.
Click to select this entry.
A new page becomes visible. (This is the same page that is available immediately upon saving
a new Source.) On this new page, you can edit the State, Assignee, Labels, Extractor,
and Maximum Inputs fields. You must access the Edit Source icon at the top right
of the page, however, to perform other modifications. This page also displays the Extractor
Information section. Using the Migrate Extractor button, you can move the associated
Extractor and its most current Runtime Configuration from the Legacy platform to the DOC
environment. Afterward, a list of Extractor Versions and their metadata is available. In addition,
Extractors deployed through the CLI include information for locating the code reference
in the Project’s GitHub repository. You can change the Extractor Version on a Source, and you
can start crawl runs with that version’s Runtime Configuration.
You can click the Run Source icon at the top right of the page to run the Source.
Running the Source creates a Snapshot.
Projects can be locked, which restricts write and edit operations in each subsequent DOC layer
(such as Collections, Sources, and Snapshots). You will not have Project access unless you are an
ORG_ADMIN, OPS, or assigned to work on a specific layer of this Project. If your role is
ORG MEMBER and you attempt to edit a Source, for example, you cannot make changes
unless the Source is assigned to you.
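The access rule above can be summarized in a short sketch. It is a simplified reading of this section, not the platform's authorization logic: the function name and arguments are hypothetical, and any role other than ORG_ADMIN or OPS is assumed to need the Source assigned to it.

```python
from typing import Optional

# Roles that this section says have access without a specific assignment.
PRIVILEGED_ROLES = {"ORG_ADMIN", "OPS"}


def can_edit_source(role: str, user: str, assignee: Optional[str]) -> bool:
    """Return True if the user may edit the Source under the rule described above.

    Assumption: users in other roles may edit only Sources assigned to them.
    """
    if role in PRIVILEGED_ROLES:
        return True
    return assignee is not None and assignee == user


print(can_edit_source("OPS", "dana", None))       # True
print(can_edit_source("OTHER", "dana", "dana"))   # True
print(can_edit_source("OTHER", "dana", "lee"))    # False
```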
The View Readme Version History watch-like icon is located near the top center of the page.
Upon selection, you can view the Preview Version, Published, Updated by, and Actions
columns. You can click the content in the Preview Version column to view README file
information. The README file provides additional Source context and insight.
From the top right of this page, click the Edit Source or pencil-like icon.
You may make edits to all fields on this page except the Slug/ID field.
To store your updates, click Save. To discard them, click Cancel.
Note: You cannot delete Sources.