JSON Extraction
Sometimes the data you want to extract lives in JSON.
This JSON may be embedded within a script tag, stored inside a DOM element, or returned by an XHR request. In the past, JSON extraction was a difficult process that required complex regular expressions, or code that added hidden elements to the page so they could be extracted later.
Thankfully, the Extractor Studio has tools that make extracting JSON easy.
Capturing the JSON
When the JSON is the result of a `fetch` or external API request, you’ll first need to save the JSON onto the page. This can be achieved by using `context.saveJson(id: string, data: any)`.
`saveJson` adds the JSON to a script tag on the DOM, with the id provided. You can then target this id later and use `jq` to retrieve what you need from the JSON (see below).
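As a rough illustration of the step above, the sketch below saves a payload under the id `foo`. Only the `saveJson(id, data)` signature comes from this page. The `ExtractorContext` interface, the in-memory stand-in for `context`, the `void` return type, and the sample payload are all assumptions made so the snippet can run outside the Extractor Studio.

```typescript
// Hypothetical shape of the part of `context` used here. Only the
// saveJson(id, data) signature is taken from the documentation; the
// return type is an assumption.
interface ExtractorContext {
  saveJson(id: string, data: any): void;
}

// Stand-in for the real context so the sketch runs anywhere. It records the
// serialized JSON per id instead of writing a script tag to the DOM.
const savedScripts: Record<string, string> = {};
const context: ExtractorContext = {
  saveJson(id: string, data: any): void {
    savedScripts[id] = JSON.stringify(data);
  },
};

// Made-up payload standing in for the parsed body of a fetch or XHR response.
const apiResponse = { name: "Ada", age: 36 };

// Save it under the id "foo" so a field can later target it with '#foo'.
context.saveJson("foo", apiResponse);
```

In the real Studio the JSON would end up in a script tag with that id, which a field can then select with `#foo`.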
Extracting the JSON (using JQ)
Just as you would combine XPath and regular expressions, you can use the `jq` property on an extraction field to parse and extract JSON.
Target
First you must select a DOM node where the JSON resides. The JSON can be embedded within a `script` tag, or stored as an attribute or the inner HTML of an element.
You can use a manual selector or XPath to target the DOM element.
Getting the data
Once you have targeted the JSON, you can use a `jq` property on the field to parse and manipulate the JSON to get the data you want.
Example:
The following example targets JSON that was saved to the DOM with the id `foo` (as described above) by referencing the selector `#foo`, and extracts both the name and the age.
singleRecord: false
regionsSelector: null
recordSelector: null
recordXPath: null
fields:
  - name: Name
    type: TEXT
    manualSelector: '#foo'
    jq: '.name'
  - name: Age
    type: TEXT
    manualSelector: '#foo'
    jq: '.age'
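To make concrete what the two `jq` expressions in the configuration do, here is a minimal sketch that mimics them in TypeScript. It is not the Studio's jq engine: `evalSimplePath` is a hypothetical helper that handles only dotted object identifier-indexes such as `.name` or `.foo.bar`, and the script content is a made-up payload.

```typescript
// Evaluate a jq path made only of object identifier-indexes, e.g. ".foo.bar".
// This is NOT a jq implementation; it covers just enough to illustrate the
// configuration example. Unlike jq, it returns null instead of raising an
// error when a non-object value is indexed.
function evalSimplePath(jsonText: string, path: string): unknown {
  let current: any = JSON.parse(jsonText);
  const keys = path.split(".").filter((k) => k.length > 0);
  for (const key of keys) {
    if (current === null || typeof current !== "object") {
      return null;
    }
    current = key in current ? current[key] : null;
  }
  return current;
}

// Made-up text content of the element selected by '#foo'.
const scriptText = '{"name": "Ada", "age": 36}';

const nameValue = evalSimplePath(scriptText, ".name"); // the Name field
const ageValue = evalSimplePath(scriptText, ".age");   // the Age field
```

With this payload, `nameValue` is the string `"Ada"` and `ageValue` is the number `36`.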
JQ Functionality
The `jq` functionality supported for extraction via the configuration files may differ slightly from standard jq. The supported constructs and functions are listed below.
Supported syntax constructs
- Identity: `.`
- Object identifier-index: `.foo`, `.foo.bar`
- Generic object index: `.["foo"]`
- Array index: `.[2]`
- Array/string slice: `.[10:15]`
- Optional index/slice: `.foo?`, `.["foo"]?`, `.[2]?`, `.[10:15]?`
- Array/object value iterator: `.[]`
- Optional iterator: `.[]?`
- Comma: `,`
- Pipe: `|`
- Parentheses: `(. + 2) * 5`
- Array construction: `[1, 2, 3]`
- Object construction: `{"a": 42, "b": 17}`
- Arithmetic: `+`, `-`, `*`, `/`, `%`
- String/array concatenation with `+`: `"foo" + "bar"`, `[1, 2] + [3, 4]`
- Object merging with `+`: `{a: 1} + {b: 2}`
- Comparisons: `==`, `!=`, `<`, `<=`, `>`, `>=`
- Boolean operators: `and`, `or` (and `not` as a function)
- Alternative operator: `//`
- If-then-else expressions: `if A then B else C end`, `if A1 then B1 elif A2 then B2 else C end`
- Escape sequences (`\"`, `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, `\t`) in string literals: `"foo\"bar"`, `"foo\nbar"`
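A few of the constructs above are easy to misread, so the sketch below restates three of them as TypeScript. It follows standard jq semantics, which are assumed (not confirmed by this page) to carry over to the Studio. The helper names `alternative` and `optionalIndex` are hypothetical, and `undefined` is used here to stand for "no output".

```typescript
// `a // b`: yields a unless a is null or false, in which case it yields b.
// Note that 0 and "" are NOT treated as missing.
function alternative(a: unknown, b: unknown): unknown {
  return a === null || a === undefined || a === false ? b : a;
}

// `.foo?`: like `.foo`, but produces no output (modelled as undefined)
// instead of an error when the input cannot be indexed by a string.
function optionalIndex(input: unknown, key: string): unknown {
  if (input === null) {
    return null; // indexing null yields null in jq
  }
  if (typeof input !== "object" || Array.isArray(input)) {
    return undefined; // `.foo` would fail here; `?` suppresses the error
  }
  const obj = input as Record<string, unknown>;
  return key in obj ? obj[key] : null;
}

// `.[1:3]` on an array: start index inclusive, end index exclusive.
const letters = ["a", "b", "c", "d", "e"];
const sliced = letters.slice(1, 3);
```

Here `alternative(null, "n/a")` gives `"n/a"`, `alternative(0, 1)` gives `0`, and `sliced` is `["b", "c"]`. One caveat: jq slices strings by code point, while JavaScript's `slice` counts UTF-16 code units, so the analogy is exact only for arrays and for strings without surrogate pairs.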
Supported functions
- ascii_downcase
- ascii_upcase
- downcase (non-standard extension, see below)
- empty
- false
- from_entries
- infinite
- isfinite
- isinfinite
- isnan
- isnormal
- join
- keys
- keys_unsorted
- length
- map
- map_values
- nan
- not
- null
- reverse
- select
- sort
- sort_by
- to_entries
- tonumber
- tostring
- true
- with_entries
- upcase (non-standard extension, see below)
The extension functions `downcase` and `upcase` are not present in standard jq. They differ from `ascii_downcase` and `ascii_upcase` in that they change casing for all Unicode letters, not only for ASCII letters (A-Z).
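The difference between the ASCII-only and Unicode-aware variants can be sketched as follows. These two functions are TypeScript analogues written for illustration, not the Studio's implementation. `unicodeDowncase` relies on JavaScript's `toLowerCase`, which is assumed to behave comparably to `downcase` for common letters.

```typescript
// Analogue of ascii_downcase: only the letters A-Z are changed.
function asciiDowncase(s: string): string {
  return s.replace(/[A-Z]/g, (c) => String.fromCharCode(c.charCodeAt(0) + 32));
}

// Analogue of the downcase extension: all Unicode letters are changed.
function unicodeDowncase(s: string): string {
  return s.toLowerCase();
}

// "\u00C9" is the precomposed capital E with acute accent.
const sample = "\u00C9COLE";

const asciiResult = asciiDowncase(sample);     // accent letter left untouched
const unicodeResult = unicodeDowncase(sample); // accent letter lowercased too
```

For this input, `asciiResult` is `"École"` and `unicodeResult` is `"école"`.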
Supported features
Feature | Example |
---|---|
Identity | |
Array Index | |
Object Identifier-Index | |
Generic Object Index | |
Pipe | |
Parentheses | |
Addition (numbers) | |
Subtraction (numbers) | |
Multiplication (numbers) | |
Modulo (numbers) | |
Division (numbers) | |
Array Construction | |
Object Construction | |
Integer literal | |
Float literal | |
Boolean literal | |
Double quote String literal | |
length | |
keys | |
keys_unsorted | |
to_entries | |
from_entries | |
reverse | |
map | |
map_values | |
with_entries | |
tonumber | |
tostring | |
sort | |
sort_by | |
join | |
Additive inverse | |
Array Construction | |
Array/Object Value Iterator | |
Array/Object Value Iterator 2 | |
Pipe | |
Stream as object value | |
Array/String slice | |
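Of the functions listed, `to_entries`, `from_entries`, and `with_entries` are probably the least self-explanatory. The sketch below restates them in TypeScript under standard jq semantics, which are assumed to apply here. The helper names and the `JqEntry` type are hypothetical, and details such as the alternative key names jq accepts in `from_entries` are left out.

```typescript
// One key/value pair, the element type produced by to_entries.
interface JqEntry {
  key: string;
  value: unknown;
}

// to_entries: {"a": 1} becomes [{"key": "a", "value": 1}].
function toEntries(obj: Record<string, unknown>): JqEntry[] {
  return Object.keys(obj).map((key) => ({ key, value: obj[key] }));
}

// from_entries: the inverse, rebuilding an object from key/value pairs.
function fromEntries(entries: JqEntry[]): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const entry of entries) {
    result[entry.key] = entry.value;
  }
  return result;
}

// with_entries(f): shorthand for to_entries | map(f) | from_entries.
function withEntries(
  obj: Record<string, unknown>,
  f: (entry: JqEntry) => JqEntry
): Record<string, unknown> {
  return fromEntries(toEntries(obj).map(f));
}

// Roughly: with_entries({key: (.key | ascii_upcase), value: .value})
const upperKeys = withEntries({ a: 1, b: 2 }, (entry) => ({
  key: entry.key.toUpperCase(),
  value: entry.value,
}));
```

Here `upperKeys` is `{ A: 1, B: 2 }`.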