When the extractor linked to a Source creates a new crawl run, that crawl run is imported into the system as a Snapshot.

Some checks run automatically on import, for example to detect high block rates, and as a result the snapshot may be automatically put into the ESCALATED state.

If the source is in the IN_PROGRESS or ISSUE state, the snapshot is put into the FIXING state; otherwise it is put into the PENDING_QA state.
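The rule above can be written as a minimal Python sketch. The enum and function names are hypothetical and not taken from the product. The sketch also assumes that an automated escalation takes precedence over the source-state rule, which the text does not state explicitly.

```python
from enum import Enum


class SourceState(Enum):
    IN_PROGRESS = "IN_PROGRESS"
    ISSUE = "ISSUE"
    ACTIVE = "ACTIVE"
    MAINTENANCE = "MAINTENANCE"


class SnapshotState(Enum):
    PENDING_QA = "PENDING_QA"
    FIXING = "FIXING"
    ESCALATED = "ESCALATED"


def initial_snapshot_state(source_state, failed_automated_checks=False):
    """Return the state a newly imported snapshot starts in (illustrative only)."""
    # Assumption: a failed automated check (e.g. a high block rate) wins
    # over the source-state rule and escalates the snapshot.
    if failed_automated_checks:
        return SnapshotState.ESCALATED
    # Sources still being onboarded or with a known issue go straight to fixing.
    if source_state in (SourceState.IN_PROGRESS, SourceState.ISSUE):
        return SnapshotState.FIXING
    # Every other source state sends the snapshot to the QA backlog.
    return SnapshotState.PENDING_QA
```

For example, `initial_snapshot_state(SourceState.ACTIVE)` gives `SnapshotState.PENDING_QA`.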

The snapshot then becomes available in the QA backlog for either onboarding or monitoring:

img 0

The QA team should monitor these backlogs. They are sorted with the oldest data at the top, because new snapshots can be created at any time.
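As a rough illustration of the oldest-first ordering, the sketch below sorts backlog entries by creation time. The `BacklogEntry` type and its fields are hypothetical; the real backlog may sort on a different timestamp.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class BacklogEntry:
    snapshot_id: str       # hypothetical identifier
    created_at: datetime   # assumed to be the timestamp the backlog sorts on


def sort_backlog(entries):
    """Return the entries with the oldest snapshot first."""
    return sorted(entries, key=lambda entry: entry.created_at)
```

With this ordering, a newly created snapshot always lands at the bottom of the list, so older work is picked up first.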

A QA person then takes ownership of the snapshot by clicking the "Start QA" button, which transitions it into the QA state:

img 1

They can then see the QA tasks they currently have in progress on their dashboard:

img 2

The project manager can see all the snapshots for a project in progress in another view:

img 3

The QA person then needs to complete a QA checklist for the snapshot.

img 4

You can click the "show" button to see the data side by side with the screenshot.

They can compare the snapshot against previous "good" data (data that passed QA):

img 5

They can also see a graphed comparison with previous data:

img 6

They can drill down into the data, pages, top values and validation issues:

img 7

You can click on a row to get a side-by-side QA view:

img 8

Once this has been done, they can either PASS the snapshot or push it back for fixing:

img 9

Passing QA will transition the source to ACTIVE if it is not already in this state.

When a source is in the ACTIVE state, the snapshot is automatically moved to the PUSHED state and pushed to whatever destinations are set up on the collection. You can see this on the snapshots summary:

img 10

There is an aggregated view of all pushes for a collection on the collection page:

img 11

Moving a snapshot to FIXING will transition the source into MAINTENANCE state, and will assign the snapshot back to whoever is assigned to the source - if anyone - for fixing.
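The two QA outcomes described above can be summarised in a small Python sketch. The `Source` and `Snapshot` classes and both function names are hypothetical. Treating PUSHED as a snapshot state that follows directly from a pass is an interpretation of the text, not a confirmed detail of the system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    state: str
    assignee: Optional[str] = None


@dataclass
class Snapshot:
    state: str
    assignee: Optional[str] = None


def pass_qa(snapshot, source):
    """QA passes: the source becomes ACTIVE and the snapshot is pushed."""
    # The source moves to ACTIVE if it is not already there.
    source.state = "ACTIVE"
    # Assumption: with an ACTIVE source the snapshot moves straight to PUSHED.
    snapshot.state = "PUSHED"


def send_for_fixing(snapshot, source):
    """QA fails: the source enters MAINTENANCE and the snapshot goes back for fixing."""
    snapshot.state = "FIXING"
    source.state = "MAINTENANCE"
    # Reassign to whoever owns the source; this stays None if nobody does.
    snapshot.assignee = source.assignee
```

Delivery to the collection's destinations is left out of the sketch, since the text does not describe how it is triggered.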

Engineers and QA are kept up to date in Slack whenever sources are transitioned.

The data engineers have a view where they can see all the unassigned snapshots pushed back into a FIXING state:

img 12

They should then look at the Snapshot timeline and QA report to see why the snapshot has not passed QA:

img 13

They then edit the extractor, for example by adding more URLs. Once they think it is good to go and have published a new extractor version, they can re-extract the snapshot:

img 14

When they re-extract the snapshot it will create a new snapshot in the FIXING state and place the previous snapshot into the SUPERCEDED state.

Once they are happy that the snapshot is good, they transition it to the PENDING_QA state, which adds it to the QA queue for retesting. This also unassigns the snapshot.
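The re-extract and resubmit steps can be sketched as follows. The `Snapshot` class, its fields, and the function names are hypothetical. Whether the new snapshot keeps the previous assignee is an assumption; the text only says that resubmitting for QA unassigns it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    crawl_run_id: str
    state: str
    assignee: Optional[str] = None


def re_extract(previous):
    """Supersede the previous snapshot and return a new one in FIXING."""
    # The state name is spelled as it appears in the system.
    previous.state = "SUPERCEDED"
    return Snapshot(
        crawl_run_id=previous.crawl_run_id,
        state="FIXING",
        assignee=previous.assignee,  # assumption: the engineer keeps ownership
    )


def resubmit_for_qa(snapshot):
    """Return the snapshot to the QA queue and clear its assignee."""
    snapshot.state = "PENDING_QA"
    snapshot.assignee = None
```

An engineer would typically call `re_extract` after publishing a new extractor version, check the result, and then call `resubmit_for_qa` on the new snapshot.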

If the data engineer is not able to fix the source, they should transition the snapshot to the ESCALATED state.

The escalation point is an internal expert. They can view a list of all the snapshots that have been escalated in the "in progress" view.

They may need to raise tickets with engineering to fix issues.

If known issues are found, they must be added to the source as labels. The current labels are:

  • ISSUE_CAPTCHA - issues with undetected CAPTCHAs

  • ISSUE_404_REDIRECT - redirects on product unavailable

  • ISSUE_CURRENCY - non-dollar currencies seen

The list of sources can be filtered to show those with a specific label.
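The filtering described above amounts to a simple membership test. The sketch below is illustrative only: the mapping of source names to label sets and the function name are hypothetical, and the label list reflects only the three labels named above.

```python
# Labels listed above; the set may grow over time.
KNOWN_ISSUE_LABELS = {
    "ISSUE_CAPTCHA",       # issues with undetected CAPTCHAs
    "ISSUE_404_REDIRECT",  # redirects on product unavailable
    "ISSUE_CURRENCY",      # non-dollar currencies seen
}


def sources_with_label(labels_by_source, label):
    """Return the names of sources carrying the given label, sorted by name."""
    return sorted(
        name for name, labels in labels_by_source.items() if label in labels
    )
```

For instance, passing a mapping such as `{"shop-a": {"ISSUE_CAPTCHA"}, "shop-b": set()}` with the label `"ISSUE_CAPTCHA"` returns only `shop-a`.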

What if an import fails?

Delete the snapshot, and then click the button to try again:

img 15

How do I know if a snapshot was created using an old extractor version?

You will see a banner - you can click the link to re-extract this crawl run using the latest configuration.

img 16