Filtering Shards during query with clustering column filter

Hello,

I could not find information in the documentation about whether the shards are filtered during a SELECT query if the filter of the query includes the CLUSTERING KEY.

For example, if a table like the following one

CREATE TABLE test.example (
   "c1" INTEGER,
   "c2" BIGINT,
   "c3" BIGINT,
   PRIMARY KEY ("c1", "c2")
)
CLUSTERED BY ("c1") INTO 6 SHARDS

is filtered with the following query

SELECT *
FROM test.example
WHERE c1 = 55

will the engine skip the shards that do not contain c1=55 or will it still search all the shards even if their distribution was based on this exact field?

Additionally, I am interesting in the equivalent question with the partition key instead. Will only the shards of one partition be searched in this case?

CREATE TABLE test.example (
   "c1" INTEGER,
   "c2" BIGINT,
   "c3" BIGINT,
   PRIMARY KEY ("c1", "c2")
)
PARTITIONED BY ("c1") 
1 Like

For partitions it is definitely taken into account. You can check that by running your queries with EXPLAIN ANALYZE. This shows which shards are queried.
How it exactly work with only the routin_column being set in CLUSTERED BY I would have to check.

Generally if you are interested in how CrateDB works I can recommend you the excellent articles from @mfussenegger

How CrateDB plans and executes queries
https://zignar.net/2021/05/15/how-cratedb-plans-and-executes-queries/

Distributed SELECT statement execution in CrateDB
https://zignar.net/2021/05/20/distributed-select-statement-execution-in-cratedb/

Query Then Fetch execution in CrateDB
https://zignar.net/2021/05/28/query-then-fetch-execution-in-cratedb/

Group By query execution in CrateDB
https://zignar.net/2021/06/04/group-by-query-execution-in-cratedb/

Turns out that it didn’t take the clustered by columns into consideration, but will in the future: Optimize routing based on clustered by columns in where clause by mfussenegger · Pull Request #11753 · crate/crate · GitHub

1 Like