can you help me understand what I am reading in this documentation?https://crate.io/docs/reference/sql/partitioned_tables.html
In these example tables, column
id long is not a
primary_key ; indeed,
id could not be a primary key here, because as noted below “if a primary key is set, it must be present in the
PARTITION BY clause”
In my app, I’ve historically had a
primary key on
id string NOT NULL , but now I want to add partitioning on this table, on a generated date column just like in the example
partition_date timestamp GENERATED ALWAYS AS date_trunc('day', created_at) . I’ve read that partitioning on a date column will help with speed of queries scoped by time period (counting all today’s records, for example, would only hit today’s partitions), and helping me archive older frames of data (e.g. anything > 180 days), but I don’t want to lose performance of single PK lookups.
So since I can’t do just
PARTITIONED BY (partition_date) , is it best if I…
a) Remove the primary key constraint from
id ? I’m nervous this would affect my performance for single row lookups! In this context, it makes sense that PK must be in the partition key, because to lookup
WHERE id = "abc-123" should ideally only have to hit a single node.
b) use both columns as partition key, like
PARTITIONED BY (id, partition_date) – This seems weird, because instinctively, I want to assume that
id would have high cardinality and be a bad choice for a partition column, and ‘day’ or ‘month’ would be better, like is shown in the example on your docs. In this case, is my PK lookup hitting every partition, or does it know exactly where to go? If I run an aggregate query scoped to today only, will it hit every partition or only the one holding today’s data?