I need to run a query across 500 GB of data. Do I need 500 GB of RAM?
CrateDB and memory management:
- CrateDB runs on the JVM, and you set its heap size with an environment variable.
- CrateDB uses the G1 garbage collector. Large heaps may eventually result in increased latency, which may call for fine-tuning the G1 parameters.
- When you issue a query, all intermediate result structures, as well as the final ones, must reside in heap space. The heap therefore needs to be at least as big as these result sets, whose size depends on your specific use case.
- Our usual recommendation is to start with a heap of 25% of the memory available on the host. Note that all Lucene-level memory management is done off-heap, via memory-mapped files. Lucene is the indexing, retrieval and persistence engine we use at the bottom of the application stack.
- We also recommend that you do not exceed 30.5 GB of heap, so that you can benefit from a JVM-level optimisation (compressed object pointers).
For details, see the CrateDB memory configuration guide.
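As a rough sketch of the sizing rule above (start at 25% of host memory, stay at or below 30.5 GB), here is a small Python helper. The function name and its parameters are made up for illustration. CRATE_HEAP_SIZE is the environment variable name used in the CrateDB documentation, but check it against your version before relying on it.

```python
def recommended_heap_gb(host_memory_gb, fraction=0.25, cap_gb=30.5):
    """Starting-point heap size: a fraction of host memory, capped at cap_gb.

    Hypothetical helper; the 25% and 30.5 GB defaults come from the
    recommendations in this answer, not from any CrateDB API.
    """
    if host_memory_gb <= 0:
        raise ValueError("host_memory_gb must be positive")
    return min(host_memory_gb * fraction, cap_gb)


if __name__ == "__main__":
    for host_gb in (8, 64, 256):
        heap_gb = recommended_heap_gb(host_gb)
        # Assumed variable name; verify against the docs for your version.
        print(f"host={host_gb} GB -> CRATE_HEAP_SIZE={heap_gb:g}g")
```

This is only a starting point. The right value still depends on how large your query result sets get.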
Finally, CrateDB uses memory circuit breakers at the cluster level. Any query that pushes memory usage above a certain threshold will be terminated, and so will any query issued while the cluster is at its memory utilisation limit. There are six kinds of circuit breaker.
Parting thoughts. The query breaker limit is a percentage of the heap. If your node has an 8 GB heap, the default query breaker limit of 60% means 60% of 8 GB = 4.8 GB. Your query's intermediate and final results (the live set) would need to fit in 4.8 GB.
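The same arithmetic as a minimal Python sketch. It assumes the breaker limit is applied as a fraction of the configured heap. The function names are hypothetical, and the live-set size is something you would have to estimate for your own query.

```python
def query_breaker_limit_gb(heap_gb, limit_fraction=0.60):
    """Memory a single query may use before the query breaker trips.

    Hypothetical helper: limit_fraction=0.60 mirrors the default
    mentioned in this answer.
    """
    return heap_gb * limit_fraction


def fits_under_breaker(live_set_gb, heap_gb, limit_fraction=0.60):
    """True if the estimated live set stays within the breaker limit."""
    return live_set_gb <= query_breaker_limit_gb(heap_gb, limit_fraction)


if __name__ == "__main__":
    print(query_breaker_limit_gb(8))      # limit for an 8 GB heap
    print(fits_under_breaker(3.0, 8))     # a 3 GB live set fits
    print(fits_under_breaker(6.0, 8))     # a 6 GB live set does not
```

What has to fit is the size of the result structures, not the 500 GB of data on disk.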
A count(distinct) query on an extremely large dataset will tend to be shut down by the query breaker, so for such cases we recommend using hyperloglog_distinct instead.
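To see why a HyperLogLog estimate is so much gentler on the heap than an exact distinct count, here is a toy Python version of the general technique. It is not CrateDB's implementation, and the function name, the SHA-1 hash and the precision parameter p are all choices made for this sketch. An exact count must remember every distinct value, while the sketch only keeps 2**p small registers no matter how many rows it sees.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Approximate the number of distinct values with 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash
        idx = h >> (64 - p)                          # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)             # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)                 # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)               # small-range (linear counting) correction
    return raw


if __name__ == "__main__":
    exact = len(set(range(100_000)))                 # memory grows with distinct values
    approx = hll_estimate(range(100_000))            # memory fixed at 4096 registers
    print(exact, round(approx))
```

The price is a small relative error (on the order of 1.04 / sqrt(2**p) for this scheme), which is usually acceptable when the alternative is the query being terminated.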