Import JSON without schema

Is it possible to import huge stored JSONs without creating a schema first?

Hi,

I hope you are well.

As storage?, what operations do you need to do on the data once onboard?

Best

I have some JSONs that I need to load to the DB where I would like to read them based on their timestamp and perform statistical analysis like average, sliding windows, etc. based on other key nested values that are in the JSON.

Ok, a good way to start is to realise that CrateDB has type object:

https://crate.io/docs/crate/reference/en/latest/general/ddl/data-types.html?highlight=object#object

create table t (o object);

The literal syntax of an object is almost JSON:

https://crate.io/docs/crate/reference/en/latest/general/ddl/data-types.html?highlight=object#object-literals

Three separate JSON objects like:

{ "id": 0, "ts": 1590432824744, "cargo": {"key": "july", "value": 7}}
{ "id": 1, "ts": 1590432824745, "samba": {"name": "de Janeiro", "value": 10}}
{ "id": 1, "ts": 1590432824746, "key": "length", "value": {"units": "meter", "qty": 12.4}}

can be inserted like:

insert into t values ({ id = 0, ts = 1590432824744, cargo = {key = 'july', value = 7}});
insert into t values ({ id = 1, ts = 1590432824745, samba = {name = 'de Janeiro', value = 10}});
insert into t values ({ id = 2, ts = 1590432824746, key = 'length', value = { units = 'meter', qty = 12.4}});

Every insert creates nested objects dynamically in the schema, so when you do:

cr> show create table t;
+-----------------------------------------------------+
| SHOW CREATE TABLE doc.t                             |
+-----------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "doc"."t" (              |
|    "o" OBJECT(DYNAMIC) AS (                         |
|       "id" BIGINT,                                  |
|       "key" TEXT,                                   |
|       "ts" BIGINT,                                  |
|       "cargo" OBJECT(DYNAMIC) AS (                  |
|          "key" TEXT,                                |
|          "value" BIGINT                             |
|       ),                                            |
|       "samba" OBJECT(DYNAMIC) AS (                  |
|          "name" TEXT,                               |
|          "value" BIGINT                             |
|       ),                                            |
|       "value" OBJECT(DYNAMIC) AS (                  |
|          "qty" REAL,                                |
|          "units" TEXT                               |
|       )                                             |
|    )                                                |
| )                                                   |
| ...                                                 |
+-----------------------------------------------------+

And you can query:

cr> select * from t where o['ts'] > 1590432824744;
+-------------------------------------------------------------------------------------------+
| o                                                                                         |
+-------------------------------------------------------------------------------------------+
| {"id": 1, "samba": {"name": "de Janeiro", "value": 10}, "ts": 1590432824745}              |
| {"id": 2, "key": "length", "ts": 1590432824746, "value": {"qty": 12.4, "units": "meter"}} |
+-------------------------------------------------------------------------------------------+
SELECT 2 rows in set (0.003 sec)
cr> select o['samba'], o['value']['qty'] from t where o['ts'] > 1590432824744;
+-------------------------------------+-------------------+
| o['samba']                          | o['value']['qty'] |
+-------------------------------------+-------------------+
| {"name": "de Janeiro", "value": 10} |            NULL   |
| NULL                                |              12.4 |
+-------------------------------------+-------------------+
SELECT 2 rows in set (0.011 sec)

So you need to transform the original JSON into CrateDB object notation.

Awesome! And, what would happen if the data type inferred is wrong once the table is created, can I change it?

Hi,

For your convenience, this will work as well:

insert into t values('{ "id": 0, "ts": 1590432824744, "cargo": {"key": "july", "value": 7}}');
insert into t values('{ "id": 1, "ts": 1590432824745, "samba": {"name": "de Janeiro", "value": 10}}');
insert into t values('{ "id": 2, "ts": 1590432824746, "key": "length", "value": {"units": "meter", "qty": 12.4}}');

i.e. you take a JSON literal and enclose it in ' to make it a text/string, then this text is cast to object on insert, because the column o is declared an object, same as:

'<your humongous JSON literal>'::object;

In SQL:

select pg_typeof('{ "id": 2, "ts": 1590432824746, "key": "length", "value": {"units": "meter", "qty": 12.4}}');
 'text'
--------
 text

select pg_typeof('{ "id": 2, "ts": 1590432824746, "key": "length", "value": {"units": "meter", "qty": 12.4}}'::object);
 'object'
----------
 object

select '{ "id": 2, "ts": 1590432824746, "key": "length", "value": {"units": "meter", "qty": 12.4}}'::object['key'];
'length'
----------
 length

Here is a hot spot:

select pg_typeof('{ "id": 2, "ts": 1590432824746, "key": "length", "value": {"units": "meter", "qty": 12.4}}'::object['key']);
 'undefined'
-------------
 undefined

No surprises here, there is no schema for this dynamic object literal, but you can always cast it:

select pg_typeof('{ "id": 2, "ts": 1590432824746, "key": "length", "value": {"units": "meter", "qty": 12.4}}'::object['key']::varchar);
 'text'
--------
 text

As the parser detects new columns (whether nested or not), it finds and interprets tokens assigning them a CrateDB type as specified in the reference https://crate.io/docs/crate/reference/en/latest/general/ddl/data-types.html . The next insert that contains an already defined column be careful, because the type will be already defined, so there will be an implicit cast:

insert into t values({id = '6789'});      -- implicit cast of text '6789' to bigint
insert into t values({id = '6789'::int}); -- explicit cast of text '6789' to bigint

insert into t values({id = 'banana'});
ERROR:  Cannot cast `{"id"='banana'}` of type `object` to type `object` 

The types for your JSON will be assigned as understood by CrateDB, column by column.

You can always cast the fields individually on select, on the way out. Or perhaps at a later date, whence the dynamic schema for your object is settled (you have ingested the whole lot of JSON), you can have a look at the resulting schema, create a new table with the curated schema, and then re-ingest the data.

2 Likes

Pretty cool, thanks a lot @miguel.arregui!

1 Like