CrateDB nodes constantly crashing

Hi everyone,

We have a 3-node cluster (8 cores and 64 GB RAM per node) in the AWS cloud. Our use case is an IoT platform, and our data ingestion rate is about 350 insertions/sec, with a data retention window of about 6 months.

CRATE_HEAP_SIZE = 30.5 GB

We are currently getting node disconnections every 30 minutes or less. The node completely crashes and restarts, and a new master node is elected. This happens all day.

Any advice?

Thanks in advance

Could you share some further information with us?

  • Which version of CrateDB are you using?
  • Which OS are you running CrateDB on?

In addition, it would help if you could provide:

  • DB Schemas
  • Config File
  • A heap dump
  • The crate log of the crashing node

… and …

  • Monitoring snapshots (e.g. Grafana) of the exposed JMX metrics
    • crate_threadpools queueSize/active/rejected
    • GC information:
      • Young + Old Generation avg (jvm_gc_collection_seconds_sum/jvm_gc_collection_seconds_count)
      • Survivor space (jvm_memory_pool_bytes_used)
      • GC rates (jvm_gc_collection_seconds_count)
    • DirectBuffer memory usage (jvm_buffer_pool_used_bytes)
    • Queries per second (crate_query_total_count)
    • Query error rate (crate_query_failed_count)
    • CircuitBreaker memory usages (crate_circuitbreakers)
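
If no dashboard is available, the average GC pause mentioned above (jvm_gc_collection_seconds_sum divided by jvm_gc_collection_seconds_count) can be computed directly from the raw metrics text. The following is only a rough sketch: it assumes the metrics are exposed in the Prometheus text format with a `gc` label per collector, and the function name and the sample values are made up for illustration. Note that this ratio is an average over the whole lifetime of the JVM, so for a crashing node the trend between two scrapes is usually more telling.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   jvm_gc_collection_seconds_sum{gc="G1 Young Generation",} 12.0
# Assumption: each collector is identified by a single `gc` label.
_GC_LINE = re.compile(
    r'^jvm_gc_collection_seconds_(sum|count)\{gc="([^"]+)",?\}\s+(\S+)$'
)


def average_gc_pause(metrics_text):
    """Return {collector name: average seconds per collection}."""
    totals = defaultdict(dict)
    for line in metrics_text.splitlines():
        match = _GC_LINE.match(line.strip())
        if match is None:
            continue  # comment lines and unrelated metrics
        kind, collector, value = match.groups()
        totals[collector][kind] = float(value)
    # Skip collectors that have not run yet (count == 0) or lack a sum.
    return {
        collector: values["sum"] / values["count"]
        for collector, values in totals.items()
        if values.get("count") and "sum" in values
    }


# Hypothetical sample scrape, for illustration only.
SAMPLE = """
# HELP jvm_gc_collection_seconds Time spent in a given JVM garbage collector.
jvm_gc_collection_seconds_count{gc="G1 Young Generation",} 600.0
jvm_gc_collection_seconds_sum{gc="G1 Young Generation",} 12.0
jvm_gc_collection_seconds_count{gc="G1 Old Generation",} 2.0
jvm_gc_collection_seconds_sum{gc="G1 Old Generation",} 3.0
"""

averages = average_gc_pause(SAMPLE)
```

Long or rapidly growing old-generation averages shortly before a node drops out would point towards heap pressure rather than a network problem.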

Hey Daniel, any update on this one?

I will be glad to look into this. To reproduce your setup as closely as possible, I would appreciate the following information:

  • CrateDB version
  • crate.yml / configuration flags
  • SHOW CREATE TABLE output for the table you ingest into
  • how much data you have in the table/partitions
  • OS version/image
  • EC2 instance type you use
  • Load balancer type
  • filesystem type and size of the volume where the data is stored
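
Some of these items can be collected with a few SQL statements. Here is a minimal sketch that sends them to the CrateDB HTTP endpoint. It assumes a node is reachable on the default HTTP port 4200 at localhost, and `doc.sensor_data` is only a placeholder for your actual table name; the helper names are also just for illustration.

```python
import json
import urllib.request

# Assumption: a CrateDB node answers on the default HTTP port.
CRATE_URL = "http://localhost:4200/_sql"

# "doc.sensor_data" is a placeholder; replace it with the real table.
STATEMENTS = {
    "version_and_os": (
        "SELECT name, version['number'], os_info['name'], os_info['version'] "
        "FROM sys.nodes"
    ),
    "table_definition": "SHOW CREATE TABLE doc.sensor_data",
    "data_per_partition": (
        "SELECT table_name, partition_ident, sum(num_docs), sum(size) "
        'FROM sys.shards WHERE "primary" = true '
        "GROUP BY table_name, partition_ident"
    ),
}


def build_payload(stmt):
    """Encode one SQL statement as the JSON body the endpoint expects."""
    return json.dumps({"stmt": stmt}).encode("utf-8")


def run_statement(stmt, url=CRATE_URL):
    """POST one statement and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=build_payload(stmt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_diagnostics(url=CRATE_URL):
    """Run every statement and return the result rows keyed by label."""
    return {
        label: run_statement(stmt, url)["rows"]
        for label, stmt in STATEMENTS.items()
    }
```

Calling `collect_diagnostics()` against one of the nodes and pasting the output here would already cover the version, OS, table definition and data volume questions.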

Regards,
Walter