Group by query with join taking forever

Hello,
during tests on an 8-node, bare-metal CrateDB 4.6.4 cluster we noticed that one GROUP BY query involving a join takes forever to finish.

Reproduction steps

CREATE TABLE IF NOT EXISTS test_table1 (
   "id" BIGINT,
   test1 BIGINT DEFAULT NULL,
   PRIMARY KEY ("id")
)
CLUSTERED BY ("id") INTO 140 SHARDS;
INSERT INTO test_table1 SELECT col1, CEIL(random() * (col1 % 4000000)) FROM generate_series(1, 3000000, 1);
CREATE TABLE IF NOT EXISTS test_table2 (
   "id" BIGINT,
   test1 BIGINT DEFAULT NULL,
   PRIMARY KEY ("id")
)
CLUSTERED BY ("id") INTO 140 SHARDS;
INSERT INTO test_table2 SELECT col1, CEIL(random() * (col1 % 4000000)) FROM generate_series(1, 123000000, 1);
select t1.test1, COUNT(*)
from test_table1 t1
join test_table2 t2 on (t1.id = t2.id)
group by t1.test1 limit 100;

When I split the above query into 32 range queries as below, I receive the results within 2 minutes:

select t1.test1, COUNT(*) from test_table1 t1 join test_table2 t2 on (t1.id = t2.id)  where t1.test1 >= <min_id> and t1.test1 < <max_id> group by t1.test1
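The 32 range boundaries can be generated mechanically. A sketch in Python (the value span 1 .. 4,000,000 for `test1` is an assumption derived from the `CEIL(random() * (col1 % 4000000))` generator above, not something stated explicitly):

```python
# Sketch: generate 32 non-overlapping [min, max) ranges over t1.test1 so
# each range query from the workaround above can be run independently.

def make_ranges(lo, hi, parts):
    """Split the half-open interval [lo, hi) into `parts` contiguous ranges."""
    step, rem = divmod(hi - lo, parts)
    ranges, start = [], lo
    for i in range(parts):
        end = start + step + (1 if i < rem else 0)  # spread any remainder
        ranges.append((start, end))
        start = end
    return ranges

TEMPLATE = (
    "SELECT t1.test1, COUNT(*) FROM test_table1 t1 "
    "JOIN test_table2 t2 ON (t1.id = t2.id) "
    "WHERE t1.test1 >= {lo} AND t1.test1 < {hi} GROUP BY t1.test1"
)

queries = [TEMPLATE.format(lo=lo, hi=hi)
           for lo, hi in make_ranges(1, 4_000_001, 32)]
```

The ranges are contiguous and half-open, so no `test1` value is counted twice and none is skipped.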

The inner join is probably the performance killer. However, when we were testing CrateDB 4.5.1 on a single node 6 months ago, the same query took 4.3 minutes.

Could you give me a hint?

Looks like you are creating far too many shards for the respective data.
The average shard size should be ~10 - 50 GiB per shard.
Both of the above-mentioned tables combined would have 280 shards.

As per the guide on your page:

To increase the chances that a query can be parallelized and distributed maximally, there should be at least as many shards for a table than there are CPUs in the cluster. This is because CrateDB will automatically balance shards across the cluster so that each node contains as few shards as possible.

So 8 nodes x 16 CPUs = 128 shards.

it’s generally advisable to slightly over-allocate

That’s why I went with 140 shards.
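In numbers, that reasoning looks like this (a quick sketch using only the figures from this thread):

```python
# Shard-count reasoning from this thread, written out explicitly.
nodes = 8
cpus_per_node = 16

# At least one shard per CPU in the cluster, so every CPU can work in parallel:
min_shards = nodes * cpus_per_node          # 8 * 16 = 128

# "Slightly over-allocate" -> pick a value a bit above the minimum:
chosen_shards = 140
shards_per_node = chosen_shards / nodes     # 17.5 shards per node

print(min_shards, chosen_shards, shards_per_node)
```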

Could you tell me whether I misunderstood something?

In the meantime I have tried to run the same test with 16 shards, 6, 3, and 1, without success.
@proddata were you able to reproduce the issue?

I have looked into the source code of HashInnerJoinBatchIterator and, if I understand correctly, the steps are as below:

  1. Create a HashMap for the left side (the size of the HashMap depends on available memory, row size, and a hardcoded value of 500k (PAGE_SIZE)).
  2. Iterate through the right side and perform the matching.
  3. Repeat 1 and 2 until all blocks of the left side are processed.

In my case the left table has 3 million records and the right side has 123 million records, so we have at least 6 full scans (including hash calculation) of the right side. I think this might be the performance killer here. It would be great if you could add a task to optimize this algorithm, e.g.:

  - use some disk-based hashing and increase the max size of the HashMap beyond 500k
  - pre-sort the left and right sides (all columns are indexed by default, so I think it should be fast), then iterate through both sides simultaneously, compare them on the fly, and finish as soon as one of the sides reaches its end
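For illustration, the three steps above can be sketched in Python (a simplified model of a block hash join, not CrateDB’s actual implementation):

```python
# Illustrative block hash join: the left side is consumed in fixed-size
# blocks; for every block the ENTIRE right side is re-scanned. A 3M-row
# left table with a 500k block size therefore forces
# ceil(3_000_000 / 500_000) = 6 full passes over the 123M-row right side.
from collections import defaultdict
from itertools import islice

def block_hash_join(left, right, block_size):
    """Yield (left_row, right_row) pairs matching on row[0] (the join key)."""
    left_it = iter(left)
    while True:
        block = list(islice(left_it, block_size))   # step 1: next left block
        if not block:
            break
        table = defaultdict(list)                   # hash the block by key
        for row in block:
            table[row[0]].append(row)
        for r in right:                             # step 2: full right scan
            for l in table.get(r[0], ()):
                yield (l, r)
        # step 3: repeat until the left side is exhausted

left = [(i, i * 10) for i in range(6)]
right = [(i, i * 100) for i in range(4)]
pairs = list(block_hash_join(left, right, block_size=2))
```

With `block_size=2` and 6 left rows, the right side is scanned 3 times, which is exactly the cost the post above is pointing at.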
In the meantime I have tried to run the same test with 16 shards, 6, 3, and 1, without success.
@proddata were you able to reproduce the issue?

Yes, I was able to reproduce it to some degree, but the testing wasn’t conclusive yet. I will probably only have time to dig deeper next week.

  1. Create a HashMap for the left side (the size of the HashMap depends on available memory, row size, and a hardcoded value of 500k (PAGE_SIZE)).

That explains why limiting the search space to 500k values is significantly faster :worried:

select t1.test1, COUNT(*)
from test_table2 t2
join test_table1 t1 on (t1.id = t2.id)
where  t1.id < 500000
group by t1.test1 limit 100;
-- SELECT OK, 1 record returned (10.466 seconds)


select t1.test1, COUNT(*)
from test_table2 t2
join test_table1 t1 on (t1.id = t2.id)
where  t2.id < 500000
group by t1.test1 limit 100;
-- SELECT OK, 1 record returned (0.981 seconds)

with

CREATE TABLE IF NOT EXISTS "doc"."test_table1" (
   "id" BIGINT,
   "test1" BIGINT DEFAULT NULL,
   PRIMARY KEY ("id")
)
CLUSTERED BY ("id") INTO 1 SHARDS
WITH (
   number_of_replicas = '1-all'
);

CREATE TABLE IF NOT EXISTS "doc"."test_table2" (
   "id" BIGINT,
   "test1" BIGINT DEFAULT NULL,
   PRIMARY KEY ("id")
)
CLUSTERED BY ("id") INTO 14 SHARDS;

There are quite a few open issues regarding JOINs: Issues · crate/crate · GitHub


Could you tell me whether I misunderstood something?

No, in general that is true. Having more shards can have performance benefits, e.g. for aggregations, but more shards also come with a performance penalty. Typically the default setting (2 shards per node) is a good starting point. Don’t forget that the default replication is set to 0-1, i.e. there would be twice the number of shards on a multi-node cluster anyway.

Were you able to do some more tests on this?

@proddata could you raise an issue on GitHub?

@jayeff @proddata could you check this?

Hey @kamcio181

I don’t really know how I could help you here right now. There are currently limitations within CrateDB regarding JOINs, and there are several open issues within crate/crate. In terms of prioritisation it is always good to have a concrete use case in mind that a performance improvement / feature could help with. We are always happy about contributions, from reproducible bug reports and feature requests with use cases to actual code contributions to one of our repositories.

Also, if you can describe in a little more detail what you are actually trying to achieve, there might be other solutions within CrateDB.


As mentioned before, there are already quite a few open issues regarding improving JOIN performance.


Regarding sort-merge joins and JOIN performance, you might find this article interesting to read:

@proddata thank you for your response. According to the links you posted, a sort-merge join algorithm may not work here. I will try to investigate the block hash join more and maybe come up with some ideas.

What we want to achieve in this particular test is to constrain table t1 with the big set of ids available in table t2. The first solution that comes to mind is an inner join, but for some reason it is not working well in our case.

A subselect

SELECT * FROM test_table1
WHERE id IN (SELECT id FROM test_table2)

might be significantly faster
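For intuition, the difference can be sketched in Python (an illustration of the general semi-join idea behind `IN`, not CrateDB’s implementation):

```python
# A semi-join only needs to remember WHICH join keys exist on one side --
# a set of keys rather than hashed row blocks -- and it emits each left
# row at most once, even if the key occurs many times on the other side.

def semi_join(left_rows, right_ids):
    """Return the left rows whose id appears in right_ids (id = row[0])."""
    id_set = set(right_ids)           # one pass to build the key set
    return [row for row in left_rows if row[0] in id_set]

table1 = [(1, 'a'), (2, 'b'), (3, 'c')]
table2_ids = [2, 3, 3, 4]
print(semi_join(table1, table2_ids))  # -> [(2, 'b'), (3, 'c')]
```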

Unfortunately, a Java heap space exception is thrown during the query.

An exception or a circuit breaker?

How big are the nodes / how much memory / heap?

I am not sure; how can I recognize whether it was a circuit breaker?

Each of the 8 nodes has 64 GB RAM, with about 30 GB for the heap.