SQL Without LIMIT unoptimized ? doc Schema performance different compared to other schemas?

Hi everyone :slight_smile:

I have two strange behaviours. (i’m testing in nightly Build 4.5, and i will test on previous version)

On the same CrateDB Server, i have two schemas :

  • Doc schema
  • Bonjour schema

Both have the same tables and datas (table parameters are the same).

image

But performances are strangely different depending on schema and also if LIMIT clause is present or not :

in Red : On schema “Bonjour”
In Blue : On schema “Doc”

Any idea ?

Dear Mike,

thanks for your report. Is the behavior reproducible in a way that reissuing both statements without LIMIT clauses once more will also yield a long query duration?

The reason for the initial (longer) duration might be some initialization overhead. May I ask what hardware and environment you are running CrateDB on and how your configuration looks like? Having enough memory available for CrateDB is crucial for reasonable operation.

That the first query took 30 seconds for answering might indicate something into the direction that the environment CrateDB is running on needs some time to get ready. It might also indicate that the number of configured shards is too high for the fair amount of (~100000?) records as this would definitively induce a significant overhead.

With kind regards,
Andreas.

Hi Andreas, it seems after re-importing my 100 000 Rows into my two different schemas, performances are now the same for both.

However, even after re-importing, the SELECT * from users (containing 100 001 rows) and SELECT * from users LIMIT 1000000 (which returns the same result because i have only 100 001 rows) still does a different thing (done multiple times) :

image

In CrateDB WebAdmin UI Console, i can’t reproduce due to LIMIT security :slight_smile: (but with a LIMIT, i have the same duration results as in my CrateDBAdmin .Net client).

I have also tried with a Where clause (just to be sure) :

And i have tried with one result only … The limit clause at 1 performs about 10x faster than without the limit clause …

So, in conclusion … we would count() results before systematically, and re-request with the Count() result in the LIMIT clause :wink:

Something is boosting requests with LIMIT clause, even if the limit is over the number of resulted rows.

This behaviour is totally hidden in Web Admin UI Console due to LIMIT security clause :slight_smile: (but in custom applications … it may not be the case …)

I’m’ using CrateDB 4.5 in a Windows Test Machine (Core I7, 16GB RAM) with nothing running on it …

But a good info : on the same machine, i have switched for the CrateDB 4.4.2 and 4.1.4 and i can’t reproduce the problem (with same tables and data and config… only one node)…

You can reproduce that behaviour on version 4.5 on your side ?

2 Likes

Dear Mike,

thank you for investigating further.

Sure, that is obvious, right?

This is the thing I would like to follow up on and dig deeper.

So, according to your observations, you believe it happened somewhere in between 4.4.2 and 4.5.0? There is one other release before 4.5.0 happened, namely 4.4.3. Maybe you can also check that one in order to narrow down your observations further?

With kind regards,
Andreas.

You have a quick link for the 4.4.3 Windows x64 package please ? :slight_smile:

Index of /downloads/releases/cratedb/x64_windows/ :wink:

yes i found it sorry :wink:

I can reproduce it with the 4.4.3 version (but not for the LIMIT 1 with a where clause as the 4.5 version) :

image

1 Like

No because i had a where clause which returned only one result with or without the limit clause :slight_smile: but this usecase is not reproductible in 4.4.3

Hi Mike,

thank you again.

All right, so the observed performance regression on the with or without LIMIT clause thing might have slipped in between 4.4.2 and 4.4.3, right?

All right, so the observed performance regression on the with a WHERE clause only returning one record, even adding a LIMIT 1 will gain more performance thing only happened between 4.4.3 and 4.5.0, right?

With kind regards,
Andreas.

That’s right. You’ve well summarized :slight_smile:

1 Like

Ok, other test :

I have installed the 4.5 version on a Linux Debian Stretch on a dedicated server. I have the same behaviour on LIMIT Clauses but not on the WHERE + LIMIT 1 usecase (i think we can forget it as it does not have the same behaviour on linux (which is OK) and windows (not OK)).

Another info, my table “users” have a column “id” which is a PRIMARY KEY. Table rows have been initially imported (by JSON files) on a disordered manner (thanks to multi-threading ! :p)

So if i do a SELECT * from users ORDER BY id , performances are now equal to the same request with a LIMIT 1000000 (but without the ORDER BY) …

I can’t see the link between the ORDER BY [primary key] and the LIMIT clause (which should takes place after the query results) but optimization has certainly multiple paths (and passes :p) :

image

Finally… i don’t know what to think about all of that :slight_smile: