How does CrateDB compare to CockroachDB and YugabyteDB?

Hello,

I’m trying to understand how CrateDB compares to other “newSQL” (distributed SQL) systems such as CockroachDB and YugabyteDB.

Is there a report/whitepaper about this, and maybe some benchmarks?

Things I’m most focused on at the moment:

  • How CrateDB compares with respect to horizontal scalability and high availability.
  • How CrateDB compares to them in terms of administrative operations, e.g. upgrades, backups, adding/removing nodes, etc.
  • How CrateDB compares to them with respect to resource utilization, e.g. for X million rows the amount of per node RAM+CPU for such and such performance, etc.

I’m also investigating whether it would make sense to combine CrateDB for time-series and IoT data with YugabyteDB/CockroachDB for non-time-series/non-IoT data. Does anybody have experience with that and is willing to share some feedback?

1 Like

Hi @Emre_Sevinc

Strictly speaking, CrateDB is not a NewSQL database, as CrateDB trades ACID for an eventually consistent approach. CrateDB is a document store with a PostgreSQL-compatible SQL interface. Compared to Postgres, however, objects are first-class citizens in CrateDB. This also brings the flexibility to work with dynamic schemas and mappings.

CrateDB uses Lucene as its storage/indexing engine, bringing fast distributed indexes as well as columnar stores (doc values) to the table, which allows you to run queries, including aggregations, across TiB of data and billions of rows within milliseconds.

In short: Cockroach and Yugabyte excel in use cases that require strong consistency or geo-distributed data, while CrateDB mainly focuses on providing a solution for large-scale analytics applications.

I don’t think that we have any recent white paper comparing CrateDB with Yugabyte or Cockroach. Benchmarks typically depend strongly on the actual use case and require deep knowledge of all the systems being tested. I can’t really help you with Yugabyte or Cockroach, but I can definitely get you started and assist with CrateDB.

  • How CrateDB compares with respect to horizontal scalability and high availability.
    → There are production workloads running on up to several hundred CrateDB nodes, with single nodes often holding multiple TiB of data. High availability is a major feature of CrateDB, with highly configurable replication strategies that take the actual infrastructure into account (e.g. hardware zones), since our customers often use it at the core of their applications (i.e. if CrateDB is not operational, the application isn’t either). With the next release (4.7) we will also integrate cross-cluster replication.

  • How CrateDB compares to them in terms of administrative operations, e.g. upgrades, backups, adding/removing nodes, etc.
    → Again, I can only speak for CrateDB here. Upgrades between minor versions are typically done in a rolling fashion, meaning the cluster stays available throughout. Adding a node to an existing cluster is typically as easy as spinning up another container/VM; data is then redistributed across the nodes automatically. Removing a node can be done with a decommission statement, which automatically takes care of moving shards to the remaining nodes. Backups are realised using a repository/snapshot mechanism that allows safe incremental updates, as only changed segments are transferred to the repository.

  • How CrateDB compares to them with respect to resource utilization, e.g. for X million rows the amount of per node RAM+CPU for such and such performance, etc.
    → Generally speaking, CrateDB is very efficient in terms of resources, especially given the large number of indexes it maintains, and it runs mostly on commodity hardware. For example, we ran tests storing and querying 4 TiB of indexed time-series data with 2 cores and 4 GB of memory, so it is possible to store large amounts of data with limited resources.
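
As a rough illustration of the administrative operations described above, here is a minimal Python sketch that sends the relevant SQL statements to CrateDB's HTTP endpoint (`/_sql`). The host, the repository name `backups`, the snapshot name `nightly`, the node name `crate-node-3` and the backup path are hypothetical placeholders, and an `fs` repository additionally assumes the location is permitted via `path.repo` on every node. Treat it as a starting point under those assumptions, not a recommended procedure.

```python
import json
import urllib.request

# Assumed address of a CrateDB node; adjust host/port for your cluster.
CRATE_SQL_URL = "http://localhost:4200/_sql"

# Administrative statements. All identifiers and paths are hypothetical.
ADMIN_STATEMENTS = {
    # Register a shared-filesystem repository to hold snapshots.
    "create_repository": (
        "CREATE REPOSITORY backups TYPE fs "
        "WITH (location = '/mnt/crate-backups')"
    ),
    # Snapshot all tables and wait until the snapshot has finished.
    "create_snapshot": (
        "CREATE SNAPSHOT backups.nightly ALL "
        "WITH (wait_for_completion = true)"
    ),
    # Gracefully remove a node; its shards are moved to the remaining nodes.
    "decommission_node": "ALTER CLUSTER DECOMMISSION 'crate-node-3'",
}


def build_request(stmt, url=CRATE_SQL_URL):
    """Build (but do not send) an HTTP request carrying one SQL statement."""
    body = json.dumps({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run(stmt, url=CRATE_SQL_URL):
    """Send the statement to a running cluster and return the decoded JSON."""
    with urllib.request.urlopen(build_request(stmt, url)) as resp:
        return json.load(resp)
```

With a cluster running, `run(ADMIN_STATEMENTS["create_snapshot"])` would trigger a snapshot; `build_request` is kept separate so the payload can be inspected without a live node.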


BTW, while IoT and time-series data are relevant use cases for CrateDB, we have quite a lot of users outside of this space using CrateDB for network traffic analysis, streaming analytics, retail, marketing and many more.

I hope that helps you a little bit :slight_smile:

3 Likes

In the documentation on a multi-cluster setup, I’ve found this bit:

[…] replicas are written synchronously and making a write operation wait for all the replicas to write somewhere in a data center hundreds of miles away can lead to noticeable latency and cause the cluster to slow down.

To me, this seems contradictory. Can you elaborate on this a bit? Why would a system which is eventually consistent need synchronous write operations?

I hope this is relevant to the original thread, as YugabyteDB as well as CockroachDB also perform synchronous writes across a multi-cluster setup, precisely to guarantee ACID conformance.

Thanks & best regards!

2 Likes

To me, this seems contradictory. Can you elaborate on this a bit? Why would a system which is eventually consistent need synchronous write operations?

The actual writing on the nodes happens in an async, non-blocking manner. A completed write does not mean that the data is immediately available for querying, though, only that durability has been achieved (i.e. the data is in the WAL). The order in which inserts happen on different nodes is not guaranteed.

It only really means that the client has to wait until all async write operations have finished. This doesn’t prevent other clients from reading the data in the meantime.
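
To make the difference between "durable" and "visible to queries" more concrete, here is a toy Python model. It is only a sketch, not how CrateDB is implemented: the names `ToyShard` and `insert` are made up, the writes run sequentially rather than asynchronously, and the refresh is triggered by hand (in CrateDB, rows become visible through a periodic refresh or an explicit `REFRESH TABLE`).

```python
class ToyShard:
    """Toy model of one shard copy: a write-ahead log plus a searchable view."""

    def __init__(self):
        self.wal = []         # rows that have been durably logged
        self.searchable = []  # rows that queries can currently see

    def write(self, row):
        # Logging the row makes it durable, but not yet visible to queries.
        self.wal.append(row)

    def refresh(self):
        # A refresh exposes everything logged so far to queries.
        self.searchable = list(self.wal)

    def search(self):
        return list(self.searchable)


def insert(row, primary, replicas):
    """Acknowledge only after the primary and every replica logged the row."""
    for shard in [primary, *replicas]:
        shard.write(row)
    return "ack"


primary, replica = ToyShard(), ToyShard()
status = insert({"id": 1}, primary, [replica])  # client waits for this ack
before = replica.search()                       # durable, but not visible yet
replica.refresh()
after = replica.search()                        # visible after the refresh
```

In this model the client's wait covers only the logging step on all copies; whether a concurrent reader sees the row depends on when each copy was last refreshed, which is where the eventual consistency comes from.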


If you want a true multi-zone setup, all primaries and replicas that are actively written to should reside within a single zone (which can be enforced).

With CrateDB 4.7 we will enable logical replication, which asynchronously syncs individual tables or the whole cluster between two different clusters. This can be bi-directional, but isn’t synchronised.

2 Likes