JSON Extraction

Sometimes the data you want to extract lives inside JSON.

This JSON may be embedded within a script tag, sit inside a DOM element, or arrive as the result of an XHR request. In the past, JSON extraction was difficult: it required complex regular expressions, or code that added hidden elements to the page so they could be extracted later.

Thankfully the Extractor Studio has tools to easily extract JSON.

Capturing the JSON

When the JSON is the result of a fetch or external API request, you’ll first need to save the JSON onto the page.

You can do this with context.saveJson(id: string, data: any).

saveJson adds the JSON to a script tag in the DOM, using the id provided. You can then target this id and use jq to retrieve what you need from the JSON (see "Extracting the JSON" below).
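To make the effect of saveJson more concrete, here is a minimal sketch of the kind of markup it could leave on the page. The helper name buildJsonScriptTag and the type="application/json" attribute are assumptions made for this illustration; the only documented behaviour is that the JSON ends up in a script tag with the id you pass in.

```javascript
// Hypothetical helper: NOT the Studio's implementation, only an illustration
// of "JSON stored in a script tag with the id provided".
function buildJsonScriptTag(id, data) {
  // Escape "<" so a string such as "</script>" inside the data cannot
  // close the tag early. "\u003c" is still valid JSON for "<".
  const json = JSON.stringify(data).replace(/</g, '\\u003c');
  return `<script id="${id}" type="application/json">${json}</script>`;
}

const tag = buildJsonScriptTag('foo', { name: 'Fred', age: 1 });
console.log(tag);
```

Whatever the real markup looks like, the important part is the id: it is what you target later with a selector such as #foo.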

Example:

The following example takes a JavaScript object and saves it to the DOM in a script tag with the id foo (selector #foo).

const myObject = {
  name: 'Fred',
  age: 1
};
await context.saveJson('foo', myObject);
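For the fetch case mentioned above, the same call can be combined with a request. The following is a sketch under stated assumptions: captureJson, fetchImpl, and the example URL are hypothetical names introduced here, and only context.saveJson comes from the platform. Where the request actually runs depends on how your extractor is set up.

```javascript
// Sketch only: `context` is the object the platform gives you; `fetchImpl`
// is any fetch-compatible function, passed in so the sketch stays testable.
async function captureJson(context, fetchImpl, url, id) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const data = await response.json();
  // Store the parsed JSON on the page so a field can target '#' + id later.
  await context.saveJson(id, data);
  return data;
}

// Possible usage (placeholder URL):
// await captureJson(context, fetch, 'https://example.com/api/products', 'products');
```

Checking response.ok before saving avoids storing an error payload that later jq expressions would silently fail to match.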

Extracting the JSON (using JQ)

Just as you would combine XPath with a regular expression, you can use the jq property on an extraction field to parse and extract JSON.

Target

First, select the DOM node where the JSON resides. The JSON can be embedded in a script tag, in an attribute, or in the inner HTML of an element.

You can use a manual selector or XPath to properly target the DOM element.
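Conceptually, targeting means "find the node, then read the JSON text from it". The sketch below illustrates that idea with DOM-like stand-in objects; readJsonFromNode and its attribute parameter are hypothetical and not part of the Extractor Studio API.

```javascript
// Illustration only: `node` is any DOM-like object exposing getAttribute()
// and textContent. When `attribute` is given, the JSON is read from that
// attribute; otherwise it is read from the node's text content.
function readJsonFromNode(node, attribute) {
  const raw = attribute ? node.getAttribute(attribute) : node.textContent;
  return JSON.parse(raw);
}

// Stand-ins for a script tag and for an element carrying JSON in an attribute.
const scriptTag = {
  textContent: '{"name":"Fred","age":1}',
  getAttribute: () => null,
};
const divWithAttribute = {
  textContent: '',
  getAttribute: (name) => (name === 'data-product' ? '{"sku":"A1"}' : null),
};

console.log(readJsonFromNode(scriptTag).name);
console.log(readJsonFromNode(divWithAttribute, 'data-product').sku);
```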

Getting the data

Once you have targeted the JSON you can use a jq property on the field to parse and manipulate the JSON to get the data you want.

Example:

The following example targets the same JSON we added to the DOM above by referencing the selector #foo, and extracts both name and age.

singleRecord: false
regionsSelector: null
recordSelector: null
recordXPath: null
fields:
  - name: Name
    type: TEXT
    manualSelector: '#foo'
    jq: '.name'
  - name: Age
    type: TEXT
    manualSelector: '#foo'
    jq: '.age'
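To see how the two jq values above resolve against the saved object, here is a deliberately tiny path evaluator. It is a sketch that handles only identity, object identifier-index, and array index; evalPath is a hypothetical name and this is not the jq engine the Studio uses.

```javascript
// Minimal evaluator for paths such as ".", ".name", ".foo.bar", ".[1][0].x".
// Anything beyond these forms (pipes, functions, slices) is out of scope.
function evalPath(expr, input) {
  const token = /\.([A-Za-z_][A-Za-z0-9_]*)|\[(-?\d+)\]/g;
  let value = input;
  for (const [, key, index] of expr.matchAll(token)) {
    // Like jq, indexing into a missing value yields null instead of throwing.
    if (value === null || value === undefined) return null;
    if (key !== undefined) {
      value = value[key];
    } else {
      const i = Number(index);
      // Negative indexes count from the end of the array.
      value = value[i < 0 ? value.length + i : i];
    }
  }
  return value === undefined ? null : value;
}

const saved = { name: 'Fred', age: 1 };
console.log(evalPath('.name', saved));
console.log(evalPath('.age', saved));
```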

JQ Functionality

The jq functionality supported for extraction via the configuration files may differ slightly from standard jq. The supported syntax and functions are listed below.

Supported syntax constructs

  • Identity: .

  • Object identifier-index: .foo, .foo.bar

  • Generic object index: .["foo"]

  • Array index: .[2]

  • Array/string slice: .[10:15]

  • Optional index/slice: .foo?, .["foo"]?, .[2]?, .[10:15]?

  • Array/object value iterator: .[]

  • Optional iterator: .[]?

  • Comma: ,

  • Pipe: |

  • Parentheses: (. + 2) * 5

  • Array construction: [1, 2, 3]

  • Object construction: {"a": 42, "b": 17}

  • Arithmetics: +, -, *, /, %

  • String/array concatenation with +: "foo" + "bar", [1, 2] + [3, 4]

  • Object merging with +: {a: 1} + {b: 2}

  • Comparisons: ==, !=, <, <=, >, >=

  • Boolean operators: and, or (not is available as a function)

  • Alternative operator: //

  • If-then-else expressions: if A then B else C end, if A1 then B1 elif A2 then B2 else C end

  • Escape sequences (\", \\, \/, \b, \f, \n, \r, \t) in string literals: "foo\"bar", "foo\nbar"
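If you are more familiar with JavaScript than with jq, it can help to read a few of the constructs above next to a rough JavaScript equivalent. The data below is invented for this illustration, and the JavaScript lines only approximate what the jq expressions in the comments compute.

```javascript
// Sample data invented for this illustration.
const input = {
  items: [
    { name: 'Pen', price: 2, stock: null },
    { name: 'Bag', price: 30, stock: 4 },
  ],
};

// jq: [.items[] | .name]
// Iterator + pipe + array construction.
const names = input.items.map((item) => item.name);

// jq: [.items[] | .stock // 0]
// The alternative operator falls back when the left side is null or false.
const stock = input.items.map((item) =>
  item.stock === null || item.stock === false ? 0 : item.stock
);

// jq: [.items[] | if .price >= 10 then "high" else "low" end]
const bands = input.items.map((item) => (item.price >= 10 ? 'high' : 'low'));

// jq: [.items[] | {name: .name, double: (.price * 2)}]
// Object construction with parentheses around the arithmetic.
const doubled = input.items.map((item) => ({
  name: item.name,
  double: item.price * 2,
}));

console.log(names, stock, bands, doubled);
```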

Supported functions

  • ascii_downcase

  • ascii_upcase

  • downcase (non-standard extension, see below)

  • empty

  • false

  • from_entries

  • infinite

  • isfinite

  • isinfinite

  • isnan

  • isnormal

  • join

  • keys

  • keys_unsorted

  • length

  • map

  • map_values

  • nan

  • not

  • null

  • reverse

  • select

  • sort

  • sort_by

  • to_entries

  • tonumber

  • tostring

  • true

  • with_entries

  • upcase (non-standard extension, see below)

The extension functions downcase and upcase are not present in standard jq. They differ from ascii_downcase and ascii_upcase in that they change the casing of all Unicode letters, not only ASCII letters (A-Z).
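The difference between the ASCII-only and Unicode-aware variants is easiest to see on a word with an accented letter. The two JavaScript functions below are stand-ins that mirror the behaviour described above; their names are made up for this sketch.

```javascript
// Mirrors ascii_downcase as described: only the letters A-Z are changed.
function asciiDowncase(text) {
  return text.replace(/[A-Z]/g, (ch) => ch.toLowerCase());
}

// Mirrors downcase as described: every Unicode letter is lowercased.
function unicodeDowncase(text) {
  return text.toLowerCase();
}

const word = '\u00C9COLE'; // "ÉCOLE", written with an escape to pin the exact code point

console.log(asciiDowncase(word));   // the accented capital is left untouched
console.log(unicodeDowncase(word)); // the accented capital is lowercased too
```

If your source data contains non-English text, the Unicode-aware functions are usually the safer choice when normalising case.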

Supported features

Each entry lists a feature followed by example expressions.

  • Identity: ., .

  • Array Index: .[0], .[1 ], .[-1], .[ 1][0], .[1][1].x, .[1][1].x[0], .[ -1 ]

  • Object Identifier-Index: .foo, .bar, .bar.x, .foo[1]

  • Generic Object Index: .["foo"], .["bar"].x, .bar[ "y"], .["2bar"], .["a b" ]

  • Pipe: .a | .b, .a|.b

  • Parentheses: ( .a), .a, (-1 ), (-5.5), (.4), (. | .)

  • Addition (numbers): 1 + 1, .a + [.b][0], .b + .a, 3 + 4.1 + .a, 3 + (-3)

  • Subtraction (numbers): .a - .b, .b - .a, 4- 3, -3 -(4)

  • Multiplication (numbers): 1 * 1, .a * [.b][0], .b * .a, 3 * 4.1 * .a, 3 * (-.3)

  • Modulo (numbers): 1 % 1, .a % [.b][0], .b % .a, 3 % 4 % .a

  • Division (numbers): .a / .b, .b / .a, 4/ 3, -3/(4), -1.1 + (3 * (.4 - .b) / .a) + .b

  • Array Construction: [], [ ], [4], [ -6, [0]], [7 | 4], [.], [. | [6]], [5, 6] | .

  • Object Construction: {}, { }, {"foo": 6}, {"foo": 6, "bar": [5, 3]}, {"x": 3} | {"y": .x}, {foo: "bar"}, {({"a": "b"} | .a): true}, {"a": 4, "b": 3, "c": -1, "d": "f"}

  • Integer literal: 3, 6, -4, 0, 8

  • Float literal: .3, 6.0, -4.001, 3.14, 0.1

  • Boolean literal: true, false

  • Double quote String literal: "true", "false", "foo", ["ba'r"]

  • length: [] | length, length

  • keys: keys

  • keys_unsorted: keys_unsorted

  • to_entries: . | to_entries

  • from_entries: . | from_entries

  • reverse: . | reverse

  • map: map(.+1 ), . | map( {foo: .})

  • map_values: map_values(.+1 ), . | map_values( {foo: .})

  • with_entries: with_entries({key: .key, value: (2 * .value)}), with_entries({key: "a", value: (2 * .value)})

  • tonumber: tonumber

  • tostring: tostring

  • sort: sort, [4, 5, 6] | sort

  • sort_by: sort_by(-.), sort_by(1 + .), sort_by(1)

  • join: join(", "), join(""), join(.[0])

  • Additive inverse: -(1 + 3), -(-1), .a | -(.b), [--1]

  • Array Construction: [], [4]

  • Array/Object Value Iterator: .[], .[ ]

  • Array/Object Value Iterator 2: .["foo"][], .foo[]

  • Pipe: .[] | ., .[] | .name

  • Stream as object value: {names: .[] | .name}, {"names": .[] | .name, "ages": .[] | .age}, {"names": .[] | .name, "x": 3}, {"names": 5.4, "x": .[] | .age}, {names: 5.4, ages: .[] | .age, ages2: .[] | .id}

  • Array/String slice: .[2:4], .[0:1]
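A few of the functions listed here are easier to reason about next to a JavaScript counterpart. The snippet below is a rough sketch using invented sample data: it approximates what standard jq computes for each expression in the comments, and the Studio's implementation may differ in details such as entry order.

```javascript
// Sample data invented for this illustration.
const prices = { bag: 30, pen: 2 };

// jq: to_entries
// Turns an object into an array of {key, value} pairs.
const entries = Object.entries(prices).map(([key, value]) => ({ key, value }));

// jq: from_entries
// Rebuilds an object from {key, value} pairs.
const rebuilt = Object.fromEntries(entries.map(({ key, value }) => [key, value]));

// jq: with_entries({key: .key, value: (2 * .value)})
// Shorthand for to_entries | map(...) | from_entries.
const doubledPrices = Object.fromEntries(
  Object.entries(prices).map(([key, value]) => [key, 2 * value])
);

// jq: .[2:4]
// Slices are start-inclusive and end-exclusive, like Array.prototype.slice.
const sliced = [10, 20, 30, 40, 50].slice(2, 4);

// jq: sort_by(-.)
// Sorting numbers by their negation gives descending order.
const descending = [4, 5, 6].slice().sort((a, b) => b - a);

console.log(entries, rebuilt, doubledPrices, sliced, descending);
```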