Building out projects

Executive summary of status

A user can see a dashboard that summarizes the status of every project in the organizations for which they hold the DELIVERY_MANAGER or ADMIN role.

For each project, the dashboard shows the following in aggregate and broken down by collection:

  • a horizontal bar breakdown by % of source status

  • the average MTTR, MTTF, and MTBF

  • the % available (ACTIVE) of sources currently in maintenance mode

  • (what other top level metrics?)
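As a rough illustration of how the figures in the list above might be derived, the sketch below computes a status breakdown and the three reliability averages from per-source records. The record shape (`Outage`), the `BROKEN` status, and the observation window are assumptions made for the example, not the platform's data model. It uses the common definitions MTTR = downtime / failures, MTTF = uptime / failures, and MTBF = MTTF + MTTR.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Outage:
    """One failure of a source, in hours since the window start (hypothetical shape)."""
    failed_at: float
    repaired_at: float


def status_breakdown(statuses: list[str]) -> dict[str, float]:
    """Percentage of sources in each status, e.g. to feed a horizontal bar."""
    counts = Counter(statuses)
    total = len(statuses)
    return {status: 100.0 * n / total for status, n in counts.items()}


def reliability(outages: list[Outage], window_hours: float) -> dict[str, float]:
    """MTTR, MTTF and MTBF in hours over one observation window."""
    failures = len(outages)
    if failures == 0:
        raise ValueError("no failures in the window; the means are undefined")
    downtime = sum(o.repaired_at - o.failed_at for o in outages)
    uptime = window_hours - downtime
    mttr = downtime / failures
    mttf = uptime / failures
    return {"mttr": mttr, "mttf": mttf, "mtbf": mttf + mttr}


# "BROKEN" is a made-up status for the example; only ACTIVE appears in the spec.
breakdown = status_breakdown(["ACTIVE", "ACTIVE", "ACTIVE", "BROKEN"])
metrics = reliability([Outage(10, 12), Outage(50, 54)], window_hours=100)
print(breakdown)
print(metrics)
```

Averaging across a project or collection would then be a matter of pooling the outages of its sources before calling `reliability`.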

Milestones

Once a project and a collection have been set up, there is a clear work specification for building out the sources of the collection. Batches of sources can be requested as a milestone, and each milestone can have a date set for it.

The system can then show you how you are tracking against these milestones by displaying a burndown for each milestone.
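To make the burndown idea concrete, here is a minimal sketch that compares the number of sources still to be built against a straight-line ideal for each day of a milestone. The function name, its inputs, and the linear ideal line are assumptions for illustration; the spec does not define how the burndown is calculated.

```python
from datetime import date, timedelta


def burndown(total_sources: int, completed_on: list[date],
             start: date, due: date) -> list[tuple[date, int, float]]:
    """Return (day, actual remaining, ideal remaining) for each day of a milestone."""
    days = (due - start).days
    if days <= 0:
        raise ValueError("the due date must be after the start date")
    rows = []
    for offset in range(days + 1):
        day = start + timedelta(days=offset)
        # Sources whose build finished on or before this day.
        done = sum(1 for finished in completed_on if finished <= day)
        # Straight line from total_sources on the start date to zero on the due date.
        ideal = total_sources * (1 - offset / days)
        rows.append((day, total_sources - done, ideal))
    return rows


rows = burndown(
    total_sources=4,
    completed_on=[date(2024, 1, 2), date(2024, 1, 2), date(2024, 1, 4)],
    start=date(2024, 1, 1),
    due=date(2024, 1, 5),
)
for day, remaining, ideal in rows:
    print(day.isoformat(), remaining, ideal)
```

A milestone is behind plan on any day where the actual remaining count sits above the ideal value.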

Project status

A user can drill further down into a project or collection to see additional information:

  • a graph showing source build completion over time

  • a graph showing source status over time

  • (what else?)

Managing teams

Each organization maintains a number of teams, each of which can hold one or more roles: development, QA, maintenance, and ops. These teams are maintained by the delivery manager or admin for the organization.

A project has a development, maintenance, QA, and ops team assigned to it. The development team is responsible for completing the milestones.

Development

Build queue

If a user is in a development team, they can see a build queue view upon login to the workbench.

The queue is ordered by the target start date of each source, which is generated by the milestone plan.

The user can take ownership of the item at the top of the queue to start building. This transitions the source from QUEUED to IN PROGRESS and takes the user to the source view.
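The queue behaviour described above can be sketched as follows. The `QueuedSource` record and the function names are hypothetical; only the QUEUED and IN PROGRESS statuses and the ordering by target start date come from the description.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class QueuedSource:
    """Hypothetical record for one source awaiting a build."""
    slug: str
    target_start: date            # generated by the milestone plan
    status: str = "QUEUED"
    owner: Optional[str] = None


def build_queue(sources: list[QueuedSource]) -> list[QueuedSource]:
    """Queued sources ordered by target start date, earliest first."""
    queued = [s for s in sources if s.status == "QUEUED"]
    return sorted(queued, key=lambda s: s.target_start)


def take_ownership(sources: list[QueuedSource], user: str) -> Optional[QueuedSource]:
    """Assign the item at the top of the queue to the user and mark it IN PROGRESS."""
    queue = build_queue(sources)
    if not queue:
        return None
    top = queue[0]
    top.status = "IN PROGRESS"
    top.owner = user
    return top


sources = [
    QueuedSource("source-a", date(2024, 3, 10)),
    QueuedSource("source-b", date(2024, 3, 1)),
    QueuedSource("source-c", date(2024, 3, 5)),
]
taken = take_ownership(sources, "alice")
remaining = [s.slug for s in build_queue(sources)]
print(taken.slug, taken.status, remaining)
```

Filtering on `owner` over the same records would give the "My sources" view for a user.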

My sources

A user can also see a view showing the sources currently assigned to them.

Building a source

There are multiple different types of source you can build out, and you can choose whether to create an extractor or a crawler for a source.

The standard workflow

Most configuration can be done through our CLI tool, which is a Node.js package.

To configure it, run: npx @import/cli configure

Each project is linked to a git repository. You can create a new repo using the CLI:

npx @import/cli init-repo <org slug> <project slug>

The repository is laid out as follows:

/
├── my_collection_slug_1/
│   ├── my_source_slug_1/
│   │   ├── config.yml
│   │   ├── tests.yml
│   │   └── some-injected-script.js
│   └── my_source_slug_2/
│       ├── config.yml
│       ├── tests.yml
│       └── some-injected-script.js
├── my_collection_slug_2/
│   ├── my_source_slug_1/
│   │   ├── config.yml
│   │   ├── tests.yml
│   │   └── some-injected-script.js
│   └── my_source_slug_2/
│       ├── config.yml
│       ├── tests.yml
│       └── some-injected-script.js
└── config.yml

The definition for each source is held in the config.yml file in the source directory, and there can be extra assets in the directory as required by the type of source you are building. The tests that are run during editing are defined in the tests.yml, and the structure of this may depend upon the type of source you are building.
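Given that layout, a script can discover the sources in a checked-out repository by looking for config.yml files two levels below the root. This is only a sketch of how the convention could be consumed; `list_sources` is a hypothetical helper, not part of the CLI.

```python
import tempfile
from pathlib import Path


def list_sources(repo_root: Path) -> list[str]:
    """Return "<collection slug>/<source slug>" for every source directory."""
    found = []
    # Exactly two levels deep, so the project-level config.yml is not matched.
    for config in repo_root.glob("*/*/config.yml"):
        source_dir = config.parent
        found.append(f"{source_dir.parent.name}/{source_dir.name}")
    return sorted(found)


# Build a throwaway repository that mirrors the layout, then list it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("my_collection_slug_1/my_source_slug_1",
                "my_collection_slug_2/my_source_slug_1"):
        (root / rel).mkdir(parents=True)
        (root / rel / "config.yml").write_text("---\n")
        (root / rel / "tests.yml").write_text("---\n")
    (root / "config.yml").write_text("---\n")
    sources = list_sources(root)

print(sources)
```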

If you wish to create a new source, that can be done in the workbench or over the API.

To synchronize the repo with the project, run npx @import/cli sync in the repository. This may take some time, as the latest production code bundle is downloaded for every source.

Alternatively, if you need to create the templates for a single source, you can run the following:

npx @import/cli init-source <collection slug>/<source slug>

This will switch to a branch named collection_slug/source_slug where you can edit the files. This should always be your branch name when editing a source.

When you wish to submit a change, open a pull/merge request in your code repository solution. Changes to the configuration of a source should be reviewed in the same way as any other type of code.

You can submit a request to test the current version of the source on the edit branch: npx @import/cli stage.

This can be set up to happen automatically when a pull request is opened or when new commits are pushed to it.

This will package up the source configuration and run a data quality test over the proposed new version. We select 1,000(?) inputs from recent runs, or if the source is new (???). The test runs the new version and the current production version over the same set of sample inputs and produces a regression report that highlights any potential issues with the data. It also runs the checks defined on the collection.
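The comparison at the heart of the regression report can be sketched as below: run both versions over the same inputs and record every field whose value differs. The `Extractor` callables and the report shape are assumptions for illustration; the actual report format is not specified here.

```python
from typing import Callable

Record = dict[str, object]
Extractor = Callable[[str], Record]   # hypothetical: maps one input to one record


def regression_report(inputs: list[str], production: Extractor,
                      candidate: Extractor) -> list[dict[str, object]]:
    """Run both versions over the same inputs and list field-level differences."""
    diffs = []
    for item in inputs:
        old, new = production(item), candidate(item)
        # Compare the union of fields so added and removed fields are reported too.
        for field in sorted(old.keys() | new.keys()):
            if old.get(field) != new.get(field):
                diffs.append({
                    "input": item,
                    "field": field,
                    "production": old.get(field),
                    "candidate": new.get(field),
                })
    return diffs


def production_version(sku: str) -> Record:
    return {"sku": sku, "price": "10.00"}


def candidate_version(sku: str) -> Record:
    # Simulate a regression: the new version loses the price for one input.
    return {"sku": sku, "price": None if sku == "sku-2" else "10.00"}


report = regression_report(["sku-1", "sku-2", "sku-3"],
                           production_version, candidate_version)
print(report)
```

A difference is not necessarily a regression (the new version may be fixing a bad value), which is one reason human verification remains part of the process.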

It is the goal for these tests to be fully automated where possible, but they will probably need some human verification in all cases for some time.

Once the code and the data report have been reviewed, commented on, and iterated on, you are ready to promote the staged version. To do so, run npx @import/cli promote <collection slug>/<source slug>

This can be set up to happen automatically when the pull request is merged.

After promotion, the source is run using the new configuration.

Creating an extractor

Edit the config.yml generated from the template. It already contains an extract data action based on the inputs you have been given.

---
realBrowser: true
actions:
- GotoAction:
    url: https://foo.com/product/${sku}
- ViewportAction:
    width: 1665
    height: 816
- WaitLoadingAction
- FunctionAction:
    file: example.js
    timeout: 10000
- DumpWindowState
- ExtractHtmlDataAction:
    fields:
      siteName:
        defaultValue: Banana Republic
      sitePromo:
        selector: div.wcd_headline__content
      breadcrumbs:
        xpath: //div[contains(@class,"product-breadcrumb")]/span/a | //div[contains(@class,"product-breadcrumb")]//a
      category:
        json:
            xpath: //script[@type="application/ld+json"]
            jsonpath: .graph.foo.blah[1]
      # field names must be unique within fields; this name is illustrative
      categoryFromWindowState:
        windowstate:
            jsonpath: .whatever.now
    singleRecord: true
    noscript: true
    screenCapture: true
    htmlExtraction: true
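The `${sku}` placeholder in the GotoAction URL suggests that each input is substituted into the URL before playback. How the platform performs this substitution is not described here, so the following is only a sketch of the idea; it relies on the fact that `${name}` is also the placeholder syntax of Python's `string.Template`, and `build_url` is a hypothetical helper.

```python
from string import Template
from urllib.parse import quote


def build_url(url_template: str, inputs: dict[str, str]) -> str:
    """Fill ${name} placeholders in a URL template with percent-encoded input values."""
    encoded = {name: quote(value, safe="") for name, value in inputs.items()}
    # substitute() raises KeyError if the template names an input that was not supplied.
    return Template(url_template).substitute(encoded)


url = build_url("https://foo.com/product/${sku}", {"sku": "AB 12"})
print(url)
```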

You can now open the import.io developer application and select the directory that contains the configuration. Here you can add and remove sample inputs. The pages loaded for the sample inputs are cached at the point of extraction. These cached pages are what allow the data preview for the example data to update instantly when you save the configuration.

Should you change the playback steps, you will need to regenerate the cache. It is best to try to get all the information into the cache - the same process is used for re-extracting data later.
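One way to picture this caching rule is to key each cached page on a fingerprint of the playback steps plus the sample input, leaving the extraction step out of the fingerprint. The sketch below follows that assumption; `PageCache`, `playback_key`, and the fetch callback are hypothetical and do not describe the application's internals.

```python
import hashlib
import json


def playback_key(actions: list) -> str:
    """Fingerprint of the playback steps; the extraction step is excluded."""
    playback = [a for a in actions
                if not (isinstance(a, dict) and "ExtractHtmlDataAction" in a)]
    return hashlib.sha256(json.dumps(playback, sort_keys=True).encode()).hexdigest()


class PageCache:
    """Caches one fetched page per (playback fingerprint, sample input)."""

    def __init__(self):
        self._pages = {}

    def get_or_fetch(self, actions, sample_input, fetch):
        key = (playback_key(actions), sample_input)
        if key not in self._pages:
            self._pages[key] = fetch(sample_input)
        return self._pages[key]


fetches = []


def fake_fetch(sample_input):
    # Stand-in for running the playback steps in a browser.
    fetches.append(sample_input)
    return f"<html>{sample_input}</html>"


actions = [
    {"GotoAction": {"url": "https://foo.com/product/${sku}"}},
    "WaitLoadingAction",
    {"ExtractHtmlDataAction": {"fields": {"siteName": {"defaultValue": "Banana Republic"}}}},
]
cache = PageCache()
cache.get_or_fetch(actions, "sku-1", fake_fetch)

# Editing only the extraction fields reuses the cached page.
actions[2]["ExtractHtmlDataAction"]["fields"]["sitePromo"] = {"selector": "div.wcd_headline__content"}
cache.get_or_fetch(actions, "sku-1", fake_fetch)
fetches_after_field_edit = len(fetches)

# Editing a playback step changes the fingerprint, so the page is fetched again.
actions.insert(1, {"ViewportAction": {"width": 1665, "height": 816}})
cache.get_or_fetch(actions, "sku-1", fake_fetch)
print(fetches_after_field_edit, len(fetches))
```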

You can also debug the steps at which playback fails using the Chrome DevTools that are built into the application.

You can also add branching within the DSL. For example, you may take different paths to get the same data depending on whether a particular DOM element is present in the page.

If you do not require a browser, you can specify this in the configuration. Fewer actions are then available, but it is still possible to load HTML, JSON, etc. from a URL, generate HTML/JSON/XML, and extract data.

Creating a crawler

Edit the config.yml generated from the template. It already contains an extract data action based on the inputs you have been given.

---
config:
 startUrls:
  - "https://www.abercrombie.com/shop/us"
  - "https://www.abercrombie.com/shop/us/SiteMapView?storeId=10051&langId=-1&catalogId=10901"
 crawlTemplate:
  - "www.abercrombie.com/shop/us/{not-slash}$"
 noCrawlTemplate:
 dataTemplate:
  - "www.abercrombie.com/shop/us/p/"
 dataUrlIdRegexp: "/shop/us/p/[^/]*-(\\d+)(?:[?#]|$)"
 jsTemplate:
 webcacheOptions: null
 webcacheTtl: 604800
 priorityLinkTextRegexp: "(?i)\\b(sales?|clearance)\\b"
 maxDepth: 3
 connections: 10
 pauseMillis: 1000
 maxDataUrls: 15000
 minFetches: 500
 maxFetches: 2500
 canonicalStrategy: MARK_FETCHED
 obeyRobotsTxt: true
 loadSitemaps: true
domain: "abercrombie.com"
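The two regular expressions in this configuration can be tried out in isolation. The sketch below applies them with Python's `re` module; the crawler's own regex engine, and whether it searches within the string or requires a full match, are assumptions, and the sample URLs and link texts are invented. Note that the doubled backslashes in the YAML are string escaping, so the patterns themselves contain `\d` and `\b`.

```python
import re
from typing import Optional

# dataUrlIdRegexp and priorityLinkTextRegexp, copied from the configuration.
DATA_URL_ID = re.compile(r"/shop/us/p/[^/]*-(\d+)(?:[?#]|$)")
PRIORITY_LINK_TEXT = re.compile(r"(?i)\b(sales?|clearance)\b")


def data_url_id(url: str) -> Optional[str]:
    """Return the numeric product ID captured from a data URL, if any."""
    match = DATA_URL_ID.search(url)
    return match.group(1) if match else None


def is_priority_link(text: str) -> bool:
    """True if the link text mentions a sale or clearance as a whole word."""
    return PRIORITY_LINK_TEXT.search(text) is not None


product_id = data_url_id("https://www.abercrombie.com/shop/us/p/soft-tee-12345?x=1")
category_id = data_url_id("https://www.abercrombie.com/shop/us/mens")
print(product_id, category_id)
print(is_priority_link("Summer Sale"), is_priority_link("Wholesale"))
```

Capturing an ID from the URL gives each data page a stable identity, so the same product reached through different query strings can be recognized as one page.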

Ops

Alerts

You can see a history of alerts that have been generated for each project in the application.

You can configure a collection to generate an alert if its data fails automatic QA.

Alerts can also be routed to PagerDuty, Opsgenie, Slack, and others.