We already talked about this outside of the community last week, but let me summarize a few points from our conversation here for everyone’s benefit.
Although CrateDB historically comes from an IoT/manufacturing background, it is still a generic database product that is not limited to a single vertical. We see a wide range of analytics use cases enabled by CrateDB, such as with marketing or financial data. The benefits of high ingest performance, horizontal scaling and performant analytics queries are relevant there as well.
The shared-nothing architecture enables a lot of these benefits, but it also has implications. Because data is distributed, joins become more expensive, as data needs to be transported between nodes. While this isn’t a problem per se for a small number of joins with (typically rather small) dimension tables, the cost can become considerable in data warehousing use cases with large, hierarchical data models involving ten or more joins. Replicating dimension tables on all nodes removes the performance problem, but comes at the cost of redundant storage.
The fact that CrateDB doesn’t support transactions can become relevant when dealing with ETL loads populating dimension tables. A complete reload of a table’s data (i.e. deleting the previous rows and inserting new ones) cannot be encapsulated in a transaction, so intermediate states are visible to other clients. There are strategies to work around this, such as adding a version number for each load and always querying the latest version. Once the load for a table has finished successfully, the previous version can safely be deleted.
That being said, the bottom line is that CrateDB is suitable for a lot more than just IoT, but whether a given use case fits CrateDB’s architecture needs to be evaluated on a per-use-case basis.