Starting out with cratedb

Is there a doc on database design for time series data IOT sensors and cratedb that I can start out with.
Dos and donts.
partitioned tables with composite keys.
Beginners guide

Thanks
Bryan

1 Like

Hi @bww00

Unfortunately we don’t have such a guide yet, but it is definitely on our back log.
Of course schema design largely depends on what you want to achieve i.e. it is quite different if you want to store high frequency data and.

A typical time series table might look something like this:

CREATE TABLE iot (
   id bigint,
   value double precision,
   metadata object,
   ts timestamp with time zone,
   ts_month timestamp with time zone GENERATED ALWAYS AS date_trunc('month', ts)
 ) PARTITIONED BY (ts_month);

I would like to point you to the following:


If you are more specific, what you want to achieve, I might be able to give you some hints.
Further I can offer you to have a short call to discuss your use case.

Best regards
Georg

1 Like

Hello @proddata,

I know this is an old post, but it fits well with my question. (if not, then I’ll be happy to create a new post)

I’m trying to test a use-case for up to a 1000 sensors with data every minute. (that’s 1440 “rows” of data per sensor per day.
After looking at your example time series table, I was wondering how i could go about implementing my schema in my scenario while satisfying crate’s best practices for long term use?

I would like to query this data easily (combining multiple sensors in 1 query) for either historical or live data.
From a MySQL perspective, I would’ve put sensor ID’s as column names and fill each “cell” with its value in order to have 1 big table with 1000 columns and 1440 rows adding up each day.
However, this doesn’t feel like the right approach, but having to separate each sensor in a different table would require me to join them one by one for a query which changes depending on the user (e.g. using variable dropdowns in grafana to freely choose which combination of sensors to visualize in a graph).

In crate I was thinking of using objects/arrays in order to combine all sensors per row.

Can you or anyone advice me with your/their experience on such a scenario? With my proposal I’d have all metadata in a separate table, even though I believe that might be “frowned upon” with how cratedb was meant to be used.

Example of a possible very simple psuedo-query (could be 10 or more sensors in 1 query):

SELECT 
  ts, 
  sensor1, 
  sensor330,
  sensor42,
  ...
  sensor591
FROM "doc"."test4"
1 Like

I am afraid, there probably is no simple answer to that question, as it depends a bit on how you ingest/collect the data and how you want to query it.

We by now at least have some guide for time-series data, which you can find here


For Grafana something like

CREATE TABLE table_1 (
  ts TIMESTAMP,
  id TEXT,
  val DOUBLE
);
INSERT INTO table_1 (ts,id,val) VALUES (now(),'id1',1.0),(now(),'id2',2.0);

would also be fine if e.g. id would be selected as metric in Grafana.

That would be 1440 * 1000 * 365 = 525 Mio records / year


Alternatives:

As you mentioned, you could use individual columns for every sensor.
This however would lead to 1000 indexes by default.

CREATE TABLE table_1 (
  ts TIMESTAMP,
  sensor01 DOUBLE
   ...
   sensor1000 DOUBLE 
);

if the sensor are always read in bulk you could use arrays.
However this kinda implies, that you always read all the sensors at once.

CREATE TABLE table_1 (
  ts TIMESTAMP,
  sensors ARRAY(DOUBLE)
);
INSERT INTO table_1 (ts,sensors) VALUES (now(),[1.0,2.0,3.1,...,...])

and

SELECT ts,  sensors[1], sensors[123] FROM table_1

another altnernative would be using OBJECTs

with dynamic objects, this would be very similar to individual columns

CREATE TABLE table_1 (
  ts TIMESTAMP,
  sensors OBJECT(DYNAMIC)
);

with ignored objects, you probably save some space, but performance is worse for large aggregations

CREATE TABLE table_1 (
  ts TIMESTAMP,
  sensors OBJECT(IGNORED)
);

Also there is a soft-limit of 1000 columns per table, which can be increased, however this might also have performance implications depending on other factors.


So in general it would be good to get a little bit more insight in the use case, to really recommend optimal schemas / settings :slight_smile:

@proddata thank you for replying.

The scenario I’m looking at right now is using 1 edge-server which runs cratedb in docker.
This server will be accompanied by 1 IoT-controller which collects all sensordata (x1000) each minute. The controller then pushes this data every minute to the crate’s HTTP API endpoint, thus ideally, retaining a log file with a life-cycle of max 1 minute of data.

So cratedb should somehow ingest about 1000 values each minute. This is excluding all the necessary metadata for info about each sensor, the project, the controller and so on.

With this I’d use Grafana to generate different graphs by querying a number of sensors (specific sensors or a variety of different sensors all together). This is also where metadata gets very important in order to present the user with measurement names for each sensor instead of sensorID’s. Combining this with multiple IoT-controllers, this would mean using metadata of the project and controller to differentiate in between the many collections of sensors per controller.

I hope that this gives a bit more context to what I’m experimenting with.

So for cratedb, each sensor will have a unique type of “id-number” which is also present in a different metadata-table in order to connect the 2-tables with a join.
Off the top of my head, I’m thinking something like this when pushing to cratedb HTTP endpoint (abbreviated 1000 values into 3 values for short):

{
    "stmt": "INSERT INTO test4 (ts, lifecounter, fields) VALUES (?, ?, ?)",
    "bulk_args": [
        [
            "2022-05-24T16:11:56.000Z",
            49,
            [
                {
                    "id": "10.1.1.0 ",
                    "val": 6.200
                },
                {
                    "id": "10.1.2.0 ",
                    "val": 3.900
                },
                ...
                ...
                ...
                {
                    "id": "10.4.235.0 ",
                    "val": 52.250
                },
            ]
        ]
    ]

This might not even be the right structure, but I was experimenting with postman in order to play with the console in the adminUI. This already proved quite difficult to query the way I presented earlier in my previous post.

This is with the following table (again, this is probably a bad structure, but I was just experimenting):

CREATE TABLE IF NOT EXISTS test4 (ts TIMESTAMP, lifecounter INTEGER, fields ARRAY(OBJECT) WITH ('mapping.total_fields.limit' = 2000)

Of course after reading your suggested articles, I’ll have to consider using partitioning as well as defining the right amount of shards beforehand.
But just looking at the data-concept, I’m stuck at how to structure my data while taking into account that I have to combine this sensor data with the necessary metadata somehow.

e.g. from a different perspective: I’ve also been experimenting with InfluxDB up until now where the controller will create 1000 lines (in line protocol). Each line will have a bunch of tags (all necessary metadata) and at the end 1 field for the _value with it’s timestamp next to it.
With this I could query influx by filtering on the tags for the necessary ID’s or names of the specific sensors. Others would have a variable showing all the name-tags (when user is choosing sensor-measurement names) which would be used in a query to read out the variable and filter the chosen sensors to visualize in a graph.

Perhaps in my scenario I might also be “forced” to add more edge-servers/nodes for stability instead of using it as a single-node?