Create/Deploy Process

Step 1: Satisfy Prerequisites

  • Install Java 8+

  • Download the latest sandcrawler fat jar into this directory.

  • Install node 11+ (via nvm, for example)

  • Install jq and sponge (via brew or another package manager); sponge lets you read from
    and write back to the same file, which the scripts rely on.

    To install sponge on a Mac, enter:

    brew install sponge

  • Identify a website and review its format/layout and its related categories. Understand how the
    categories are organized.
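
A quick way to confirm the prerequisites are in place from a shell (a minimal sketch; the jar filename
pattern is an assumption, so adjust it to the file you downloaded):

    java -version        # expect 8 or higher
    node --version       # expect v11 or higher
    jq --version
    command -v sponge    # sponge ships with moreutils on most package managers
    ls sandcrawler*.jar  # adjust the pattern to match the fat jar you downloaded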

Step 2: Install Libraries

npm install

Step 3: Establish Environment

IMPORTANT

  • HTTPie should be installed; see the HTTPie 2.4.0 (latest) documentation.

  • The sandcrawler fat jar version is embedded in the script file crawl-site.sh. If a newer jar file
    is used, update the version in the script for the dev environment.

  • If there is an issue installing the sitemapper package, remove the dependency from the package.json file.

  • For Windows, add an IO_API_KEY shared environment variable containing the SaaS account API key and
    run "npm install bash" in the VS environment. Use Bash in the terminal to execute the commands.

Step 4: Check Design, Access/Build Sitemap

Make your team aware of project/task ownership. Next, ensure you have the information necessary
to navigate your source website by finding/identifying:

  1. Template(s) for product detail and category pages.

  2. A RegEx (regular expression) that retrieves a unique ID from a product page.

  3. Home page.

  4. The sitemap, typically accessible at /sitemap.xml or /sitemap.xml.gz, or referenced from /robots.txt.

  5. An HTML sitemap page, if the site provides one. If necessary, use Google to locate it with a
    search such as sitemap site:www.mysite.com.

  6. Locale of website, such as en_US or en_GB.

Access Sitemap

A sitemap lists the website pages. Typing boots.com/robots.txt in the browser, for example, produces
output similar to the following:

[Image: crawler documentation sitemap1]
If you cannot retrieve the sitemap via the robots.txt file, try /sitemap.xml directly. Also, carefully
review the website, even pasting URLs from the sitemap XML into your browser, to ensure you fully
understand the website layout and its associated categories. You must ensure the sitemap is an accurate
representation of the data. If you cannot find a sitemap, you must build one.
[Image: crawler documentation sitemap2]
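
If you prefer the command line, the following sketch locates a sitemap the same way; boots.com is
simply the example domain used above:

    # List any sitemaps declared in robots.txt
    curl -sL https://boots.com/robots.txt | grep -i '^sitemap:'

    # Fall back to the conventional locations if robots.txt does not declare one
    # (prints the final HTTP status code for each candidate URL)
    curl -sL -o /dev/null -w '%{http_code}\n' https://boots.com/sitemap.xml
    curl -sL -o /dev/null -w '%{http_code}\n' https://boots.com/sitemap.xml.gz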

Build Sitemap

  1. Access the website.

  2. Determine if the required URLs can be found in the sitemap based on the project/categories.

  3. Navigate or drill down several levels, selecting links one at a time to access additional data.

  4. Modify the configuration file, adjusting the startUrls and maxDepth values (for example)
    as explained in the Configure Crawler section and sketched below.
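
For example, startUrls and maxDepth can be adjusted in place with jq and sponge (installed in Step 1).
The configs/oldnavy.gap.com.json path is hypothetical; use whichever file init-config.sh created for
your domain:

    # Seed the crawl with the start URLs and allow one more level of depth
    jq '.config.startUrls = ["https://oldnavy.gap.com", "https://oldnavy.gap.com/products/index.jsp"]
        | .config.maxDepth = 4' configs/oldnavy.gap.com.json | sponge configs/oldnavy.gap.com.json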

Step 5: Understand Template Syntax

Example

www.jcpenney.com/p/{not-slash}/{alpha}{num}{query-string?}$

Adding a question mark makes the element optional, e.g. {any?}. In addition, the regular expressions
generated are case insensitive.

Template Variable   Description                              Regex

{any}               anything                                 .*
{num}               an integer                               \d+
{alpha}             a-z characters                           [a-z]+
{alpha-num}         either alpha or num                      [a-z\d]+
{not-slash}         not a slash in a URL path                [^/#?]*
{uuid}              a UUID
{query-string}      a query string, e.g. ?a=1&b=2&c=3        \?[^#]*
{query-params}      a partial query string, e.g. a=1&b=2     [^#]*
$                   match the end of the URL                 $
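
As a rough illustration (not output from the crawler itself), expanding the example template above with
the mappings in this table yields a regular expression you can sanity-check from the shell; the sample
URL is made up:

    # www.jcpenney.com/p/{not-slash}/{alpha}{num}{query-string?}$ expands roughly to:
    TEMPLATE_REGEX='www\.jcpenney\.com/p/[^/#?]*/[a-z]+[0-9]+(\?[^#]*)?$'

    # -E: extended regex, -i: case-insensitive (the generated regexes are case insensitive)
    echo 'https://www.jcpenney.com/p/some-product/prod12345?color=blue' | grep -Ei "$TEMPLATE_REGEX"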

Step 6: Initialize Configuration File

  ./init-config.sh <company name> <domain>
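
For example, with hypothetical arguments matching the Old Navy configuration used in Step 8:

    ./init-config.sh oldnavy oldnavy.gap.com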

Step 7: Create Crawler

export IO_API_KEY=$(cat ~/.import.io/apikey)
./create-crawler-extractor.sh $DOMAIN
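
These commands assume DOMAIN has already been exported, for example:

    export DOMAIN=oldnavy.gap.com   # hypothetical; use the domain you initialized in Step 6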

Cache

A crawl will cache all of the HTML for the pages, so you do not have to hit the site again
for a subsequent crawl. The webcacheTtl is configurable in the config file.

You can view a cached page by visiting https://webcache.import.io/resource/${resourceid};
the resource ID appears in the crawl output and the log.
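
For example, once you have a resource ID from the output or the log, the cached HTML can be pulled down
with curl (a sketch; any authentication the webcache endpoint requires is not shown):

    RESOURCE_ID="<resource id from the crawl output>"   # placeholder
    curl -s "https://webcache.import.io/resource/${RESOURCE_ID}" -o cached-page.html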

Step 8: Configure Crawler

Each crawler is unique, with its own configuration file. The code below is a single example
configuration; you must adjust it to align with your crawl. The paragraphs that follow this code,
along with the Configuration References table, provide insight and context.

{
  "config": {
    "startUrls": [
      "https://oldnavy.gap.com",
      "https://oldnavy.gap.com/products/index.jsp"
    ],
    "loadSitemaps": false,
    "crawlTemplate": [
      "oldnavy.gap.com/browse/category.do"
    ],
    "jsTemplate": [
      "oldnavy.gap.com/browse/category.do"
    ],
    "noCrawlTemplate": [],
    "dataTemplate": [
      "oldnavy.gap.com/browse/product.do"
    ],
    "dataUrlIdRegexp": "[?&]pid=(\\d+)",
    "webcacheOptions": {
      "wait_js": "document.querySelectorAll('.product-card__link').length > 0"
    },
    "webcacheTtl": 604800,
    "priorityLinkTextRegexp": "(?i)\\b(sales?|clearance)\\b",
    "maxDepth": 3,
    "maxRetries": 3,
    "connections": 10,
    "pauseMillis": 1000,
    "maxDataUrls": 15000,
    "maxFetches": 500,
    "maxFetches": 2500,
    "canonicalStrategy": "MARK_FETCHED",
    "obeyRobotsTxt": true,
    "blacklistedUrlQueryParams": null,
    "whitelistedUrlQueryParams": null
  },
  "locale": "en_US",
  "status": "IN_PROGRESS",
  "crawlerExtractorGuid": "7a6689dd-6ad8-43b6-9929-99d405182fad",
  "domain": "oldnavy.gap.com",
  "detailsExtractorGuid": "4946431e-4ac3-40d9-b240-13ee4bf86e9f"
}

The startUrls setting, as its name suggests, is the starting point: the initial website URLs. From here,
the crawler begins to collect URLs. While these high-level URLs may not always contain the needed data,
they must be accessed and leveraged by the crawler to dive deeper into the tree of website categories,
navigating each subcategory and collecting data that aligns with the crawler logic.

crawlTemplate defines which pages the crawler may access and match against. The webcacheTtl value
controls how long a fetched page remains cached and available for retrieving data.

Templates contain regular expressions and wildcards. The crawl template and the data template,
collectively, allow you to control the filtering and spider logic: the crawler collects data from any URL
that matches a template. The data template matches the links you need, the pages holding the data you are
seeking. The crawl template matches the URLs that you want to fetch and follow. The noCrawlTemplate
matches URLs that do not require navigation; use it to avoid crawling a particular page number, for example.

priorityLinkTextRegexp allows you to prioritize links whose anchor text matches terms such as sales
and clearance. webcacheOptions tells the crawler to remain on the crawled page until a certain
HTML element has been retrieved.

You can also control settings such as the number of parallel connections and how deep the crawler
descends. For example, the start page is depth 1; following a category link from it is depth 2;
following a subcategory link from there is depth 3. The crawler continues to run until it reaches
the maximum depth, at which point it stops.

Based on the code above, the crawler initially accesses the startUrls. It then looks for links that
match the crawlTemplate, that is, the URLs you want to fetch. The matching below is then repeated at
each level until the configured maxDepth (3 in the example above) is reached, collecting data from
each crawled page:

    "loadSitemaps": false,
    "crawlTemplate": [
      "oldnavy.gap.com/browse/category.do"
    ],
    "jsTemplate": [
      "oldnavy.gap.com/browse/category.do"
    ],
    "noCrawlTemplate": [],
    "dataTemplate": [
      "oldnavy.gap.com/browse/product.do"
    ],

The maxDataUrls value is 15,000 here. So, while there might be 100,000 URLs that match based on the
crawler logic, you are only seeking 15,000 data URLs.

RECOMMENDATION: Use the wait_js option when rendering JavaScript. This makes Chromium aware
when the load has finished, rather than waiting for the default timeout.

The process above comprises the initial crawler configuration step. This process must be performed
for each feed.

Configuration References

startUrls
    URLs that seed the crawler; the initial website starting point.

loadSitemaps
    If true, loads the sitemaps recursively from robots.txt.

crawlTemplate
    Where to crawl; the URLs you will fetch.

jsTemplate
    Where to turn ON JavaScript when crawling.

webcacheOptions
    The "opt block" options applied IF JavaScript is turned ON.

noCrawlTemplate
    Where NOT to crawl; no URL navigation required.

dataTemplate
    URL pattern(s) to feed into the details Crawler/Extractor; the links you need for the data
    you are attempting to retrieve.

dataUrlIdRegexp
    Regular expression to extract the unique ID for a product from a product URL; used for deduping.

priorityLinkTextRegexp
    If the text, title, or aria label of a crawl URL anchor matches this regexp, apply priority
    to the URL and its links.

webcacheTtl
    The cache TTL to apply, in seconds, when getting a page.

maxDepth
    How deep to go in the crawl; the number of layers/subcategories.

connections
    How many connections to use in parallel.

pauseMillis
    How long to pause after rendering a page, in milliseconds.

maxDataUrls
    Target number of data URLs.

minFetches
    Minimum number of URLs to crawl; sitemaps are only enqueued when this is reached
    (or there is nothing left in the queue).

maxFetches
    Maximum number of URLs to crawl.

maxRetries
    How many times to retry a URL if rendering fails.

canonicalStrategy
    Since there can be duplicate URLs for the same webpage, a canonical URL is the designated URL
    for that page. The canonical strategy determines how the crawler handles the URL.
    MARKED_FETCHED processes the URL as usual and does not visit the canonical URL going forward.
    QUEUE, as its name suggests, queues the URL as normal. REDIRECT does not process the page;
    the canonical URL is fetched directly instead.

obeyRobotsTxt
    Whether to check robots.txt.

blacklistedUrlQueryParams
    List of URL query parameter names that are removed during URL normalization.

whitelistedUrlQueryParams
    List of URL query parameter names that are kept during URL normalization.

You can add an XML sitemap as a start URL by adding sitemap: as a URL prefix, for example:
sitemap:https://www.sallybeauty.com/sitemap_0.xml

Step 9: Alternative – Use Extractor (Rare Exception)

Use only as necessary.

In some cases, you might need to use an Extractor as opposed to a crawler. If there are crawler issues
(you are unable to configure it or retrieve category page URLs, for example), you can use a different
sandcrawler which, essentially, uses a different library on the back end.

  • For crawler issues specific to solving captchas or capturing additional information:

    • Download the latest sandcrawler fat jar (version 0.4.0 or higher) and update the crawl-site.sh file with the version.

    • Create a SaaS extractor to apply captcha solving or other proxy information.

    • Update the crawler configuration with "renderer": "LIVE_QUERY" and "extractorGuid": <<extractor id from the step above>>.

    • Run the crawl-site script.
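
A minimal sketch of that configuration change, again using jq and sponge; the config path, the placement
of the keys, and the GUID value are assumptions to adjust for your setup:

    EXTRACTOR_GUID="<extractor id from the step above>"   # placeholder
    jq --arg guid "$EXTRACTOR_GUID" \
       '.config.renderer = "LIVE_QUERY" | .config.extractorGuid = $guid' \
       configs/$DOMAIN.json | sponge configs/$DOMAIN.json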

Step 10: SaaS Deployment

Once the crawler is working on your local/dev machine, you must deploy it to the Extractor
or SaaS application. Although the process is currently not the cleanest, there are multiple scripts
available in the poc-ecom-taxonomy repo:

  • change-extractor-owner.sh

  • create-crawler-extractor.sh

Deploy Crawler to SaaS Application

Current Process

To deploy the crawler, create a placeholder source in a DOC TEST Collection, for example.

  • Deploy the crawler as a SaaS extractor using create-crawler-extractor.sh, with the SaaS account
    API key in the environment variable and the SaaS account UserId as SHARED_ACCOUNT_ID in env.sh.
    The SHARED_ACCOUNT_ID specifies where the Extractor should be deployed. For Test Company,
    for example, you might deploy an Extractor using a Test Company shared account:

[Image: crawler documentation saas deploy1]
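
Put together, the deployment invocation looks roughly like this; the SHARED_ACCOUNT_ID value is a
placeholder and is normally set in env.sh:

    export IO_API_KEY=$(cat ~/.import.io/apikey)
    export SHARED_ACCOUNT_ID="<SaaS account UserId>"   # placeholder; normally set in env.sh
    ./create-crawler-extractor.sh $DOMAIN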

When you execute create-crawler-extractor.sh with the above-referenced SHARED_ACCOUNT_ID, the
crawlerExtractorGuid (the crawler Extractor ID) in the same config is updated:

[Image: crawler documentation saas deploy2]

As noted in the image above, crawlerRtcGuid and crawlerRtcDigest also get updated.

Run Crawl

  • Use crawl-site.sh to run the crawl:

    ./crawl-site.sh <domain>

Upload Crawl

  • Use upload-crawl.sh to upload the run from your local machine:

    ./upload-crawl.sh $CRAWLRUN_DIR

This attaches the data from the crawler to the Extractor or SaaS application.
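
For example, using the run directory naming shown in the tree example later in this section
(substitute your actual crawl run directory):

    CRAWLRUN_DIR=temp/crawlrun-www.airbnb.com-1572897608   # example path from the tree output below
    ./upload-crawl.sh $CRAWLRUN_DIR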

[Image: crawler documentation saas app]

Create a Tree

The output of the crawl run can be viewed, per the example below:

./data-urls.sh temp/crawlrun-www.airbnb.com-1572897608/data.ndjson | head -n 1000 | ./tree.js 5
[
  {
    "section": "www.airbnb.com",
    "matches": 999,
    "sample": [
      "https://www.airbnb.com/rooms/32258029",
      "https://www.airbnb.com/rooms/22723288",
      "https://www.airbnb.com/rooms/35657347",
      "https://www.airbnb.com/rooms/35657574",
      "https://www.airbnb.com/rooms/18152139"
    ],
    "children": [
      {
        "section": "rooms",
        "matches": 999,
        "sample": [
          "https://www.airbnb.com/rooms/1160445",
          "https://www.airbnb.com/rooms/3748652",
          "https://www.airbnb.com/rooms/15548410",
          "https://www.airbnb.com/rooms/20359210",
          "https://www.airbnb.com/rooms/2408817"
        ],
        "children": [
          {
            "section": "{num}",
            "matches": 999,
            "sample": [
              "https://www.airbnb.com/rooms/24103232",
              "https://www.airbnb.com/rooms/23835092",
              "https://www.airbnb.com/rooms/5882862",
              "https://www.airbnb.com/rooms/32935864",
              "https://www.airbnb.com/rooms/32612318"
            ],
            "children": []
          }
        ]
      }
    ]
  }
]

Step 11: DOC Deployment

Deploy Data to DOC Environment

Current Process

Currently, the deployment process is in development. Once complete, many of the following steps will run
automatically via scripts.

  • The Extractor owner is updated to the project account.

To attach the Extractor ID to the Source in the DOC environment:

  • Enter your user credentials to access the DOC environment.

  • Navigate to the preferred Organization.

  • Select a Project.

  • Access the Collection.

  • Select a Source.

  • From the Extractor field on this Source page, enter the new Extractor ID
    (which aligns with the crawl run).

  • Proceed with the remaining DOC data pipeline steps which culminate in customer data delivery
    to the preferred location.

Automated Process

Once the extracted data is ready for deployment into the DOC environment, many of the subsequent
steps will proceed automatically via a series of scripts. Currently, this process is in development.

The Production process, largely driven by the Delivery Management team, occurs after you upload
the config to GitHub. It runs in AWS Batch or on a cloud server, uploading and importing data files
to the repository automatically via scripts. These scripts deploy the extracted data to DOC and
proceed with data delivery. Future development might include a Run Crawler Source button at the
top right of the Source page.

Future Development

Production Process via Scripts
  • AWS Batch running a Docker container.

  • Configure the S3 credentials for the intermediate S3 location.

  • Schedule and run the crawler configs to upload the data files to the intermediate S3 location,
    then use the DOC API to import the data files as Snapshots to the relevant sources.
    (Future scripts will be created.)

Dev Process via Scripts
  • Configure the S3 credentials for the intermediate S3 location.

  • Run the crawler config to upload the data file to the intermediate S3 location, then use the DOC API
    to import the data file as Snapshots to the source being implemented. (Future scripts will be created.)