This content explains how to build a crawler. The process involves accessing a website, determining category page URLs along with sublevel URLs, and, ultimately, navigating these pages to retrieve and return data. Crawlers themselves can be deployed to the SaaS application, and crawled data can be deployed to the DOC environment.
To understand crawlers, you must first understand how they work and how they differ
from Extractors. An Extractor is software that you can train or modify to retrieve specific data
from websites, either on a routine, scheduled basis or ad hoc, on demand.
You can use a crawler for different use cases. For example, you can use a crawler to capture category
URLs from different levels of a website, accessing the initial links and pages and then their
sub-links and sub-pages while collecting data. You can also use a crawler to capture any
file URLs that are available.
Because categories nest, a crawl can run to different depths, from top-level categories only
down to the deepest subcategory.
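The category-depth idea can be sketched as a breadth-first walk that stops at a configured level. The sketch below is illustrative only: it replaces real HTTP fetches with an in-memory link map, and the names (CategoryCrawler, linksOf, collect) are assumptions, not the actual crawler API.

```java
import java.util.*;

public class CategoryCrawler {
    // Stand-in for fetching a page and parsing out its category links.
    static final Map<String, List<String>> SITE = Map.of(
        "/",            List.of("/electronics", "/clothing"),
        "/electronics", List.of("/electronics/phones", "/electronics/tvs"),
        "/clothing",    List.of("/clothing/shoes"));

    static List<String> linksOf(String url) {
        return SITE.getOrDefault(url, List.of());
    }

    // Breadth-first walk: visit one category level at a time until maxDepth.
    static List<String> collect(String root, int maxDepth) {
        List<String> found = new ArrayList<>();
        Set<String> seen = new HashSet<>(List.of(root));
        Queue<String> frontier = new ArrayDeque<>(List.of(root));
        int depth = 0;
        while (!frontier.isEmpty() && depth < maxDepth) {
            int levelSize = frontier.size();
            for (int i = 0; i < levelSize; i++) {
                for (String link : linksOf(frontier.poll())) {
                    if (seen.add(link)) {
                        found.add(link);
                        frontier.add(link);
                    }
                }
            }
            depth++;
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(collect("/", 1)); // top-level categories only
        System.out.println(collect("/", 2)); // includes subcategories
    }
}
```

With maxDepth 1, only the top-level category URLs are returned; raising it to 2 pulls in their subcategories as well.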
While both the SaaS application and the CLI tool (Extractor Studio) access websites and retrieve
specific data, a crawler navigates these sites and collects category data each time it encounters
a new link or enters a new page. This navigation approach is known as the spider technique, and the
related software is known as the sand crawler (which is written in Java).
To emulate a crawler, you would have to chain multiple Extractors, and you would need to modify
the chain each time you wanted to drill down to another subcategory. By contrast, you can use the same
crawler to scrape different websites, modifying only its configuration.
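The reuse point above can be sketched as one crawler parameterized by per-site configuration. The field names (startUrl, categoryPattern, maxDepth) and the example URLs are assumptions for illustration, not the real configuration schema.

```java
import java.util.regex.Pattern;

public class ConfigDrivenCrawler {
    // Hypothetical per-site configuration; only this changes between sites.
    record CrawlerConfig(String startUrl, String categoryPattern, int maxDepth) {}

    // The same crawl logic decides, for any site, whether a link is a
    // category link by testing it against that site's configured pattern.
    static boolean isCategoryLink(CrawlerConfig cfg, String url) {
        return Pattern.matches(cfg.categoryPattern(), url);
    }

    public static void main(String[] args) {
        CrawlerConfig shopA = new CrawlerConfig(
            "https://shop-a.example/", "https://shop-a\\.example/c/.*", 3);
        CrawlerConfig shopB = new CrawlerConfig(
            "https://shop-b.example/", "https://shop-b\\.example/categories/.*", 2);

        System.out.println(isCategoryLink(shopA, "https://shop-a.example/c/shoes"));  // true
        System.out.println(isCategoryLink(shopB, "https://shop-b.example/c/shoes"));  // false
    }
}
```

Swapping in a new site then means writing a new CrawlerConfig, not new crawler code.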
You can deploy and run the crawler from the SaaS platform because it is built to appear as a
standard Extractor; however, it executes custom code that can perform an extensive crawl of an
entire website. In brief, you feed a URL to the crawler. The crawler, in turn, accesses that URL
and, based on the logic included in the crawler, navigates the page along with its subpages and
sublevels, filtering for data that matches the template.
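The template filtering described above can be sketched as keeping only records that carry every field the template requires. The template shape here (a set of required field names) is an assumption for illustration, not the crawler's actual template format.

```java
import java.util.*;

public class TemplateFilter {
    // Keep only the scraped records that contain all fields the template names.
    static List<Map<String, String>> filter(Set<String> template,
                                            List<Map<String, String>> records) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> rec : records) {
            if (rec.keySet().containsAll(template)) {
                kept.add(rec);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> template = Set.of("name", "price", "categoryUrl");
        List<Map<String, String>> scraped = List.of(
            Map.of("name", "TV", "price", "499", "categoryUrl", "/electronics/tvs"),
            Map.of("name", "banner-ad")); // missing price/categoryUrl: dropped
        System.out.println(filter(template, scraped).size()); // 1
    }
}
```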
To build a crawler, you must:
1. Request access to the GitHub poc-ecom-taxonomy repository (Taxonomy Repo) and review its content. This repo contains several folders and scripts; in the configs folder, there is one configuration for each source.
2. Clone the poc-ecom-taxonomy repository.
3. Access and review the README file, which is the primary starting point for building a crawler.
4. Address and adhere to all prerequisites, dependencies, and installations.
5. Create the crawler.
6. Configure the crawler.
7. Run the crawler.
8. Upload the crawler to the SaaS application.
9. Deploy the crawler and its data to the DOC environment with an associated Organization/Project/Collection.
10. Add the Extractor ID to the Source page and proceed with data delivery to the customer.
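As a rough illustration of the per-source files in the configs folder, a crawler configuration might look like the sketch below. The field names are assumptions for illustration only, not the actual schema used in the Taxonomy Repo; consult the README and the existing configs for the real format.

```json
{
  "source": "shop-a",
  "startUrl": "https://shop-a.example/",
  "categoryLinkPattern": "/c/.*",
  "maxDepth": 3,
  "outputCollection": "shop-a-taxonomy"
}
```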
For a high-level overview, see the Crawler Documentation Video. Passcode: MRw+A7YY