I went down this road with Clickhouse and spent 18 months setting it up for realtime analytics. I'm not sure it's going to stick, primarily because our Data Transformation & Aggregation steps require some gnarly joins across large tables and Clickhouse does not handle large joins well. The consequence is that the aggregation has to happen in a separate system (currently Snowflake), and when what we're processing changes, it sometimes requires gymnastics both in the aggregation layer and inside of Clickhouse to accommodate the change. Denormalizing was rife with tradeoffs, mostly just to make Clickhouse happy. On top of that, we leaned heavily on projections for performance, which is wonderfully automated but also means waiting for unpredictable background processing during backloads, etc.
We might stick with Clickhouse, but after working with it for a year and a half, I'm curious to see whether a system that handles joins more effectively would be a better fit. To that end, my next R&D project is to set up a vertical slice of our analytics on Apache Doris to see how well it handles a similar workload.
Disclaimer: I'm the product manager for ClickHouse core database.
Which version are you working with? I recommend trying the latest version. For the last eight months, we have spent considerable time improving our JOINs.
For example, we just merged https://github.com/ClickHouse/ClickHouse/pull/80848, which will help a lot with performance in the near future.
Sai from ClickHouse here. Adding to the above, we just released a blog post presenting JOIN benchmarks of ClickHouse against Snowflake and Databricks, after the recent enhancements made to the ClickHouse core: https://clickhouse.com/blog/join-me-if-you-can-clickhouse-vs.... The benchmarks cover two dimensions: speed and cost.
This is really encouraging! I commented elsewhere in the thread, but this was one of the main odd points I ran into when experimenting with ClickHouse, and the changes in the PR and those mentioned in the recent video about join improvements (https://www.youtube.com/watch?v=gd3OyQzB_Fc&t=137s) seem to hit some of the problems. I'm curious whether the "condition pushdown" mentioned in the video will make it so "a.foo_id=3 and b.foo_id=a.foo_id" doesn't need "b.foo_id=3" added for optimal speed.
I also share nrjames's curiosity about whether the spill-to-disk situation has improved. Not having to even think about whether a join fits in memory would be a game changer.
Will Clickhouse spill to disk yet when joins are too large for memory?
[Edited to add: looking at comments from saisrirampur and SnooBananas6657, ClickHouse has done some recent work on joins, and there's a very recent PR around optimizing join order. Cool!]
In playing a little with ClickHouse, issues with joins were one of the biggest things that jumped out at me.
It wouldn't change the join order to try to ensure the smaller side was in RAM. It wouldn't propagate WHERE conditions to the other side of a join, so "WHERE a.customer_id=3 AND b.customer_id=a.customer_id" wouldn't automatically filter the scan of b to blocks where b.customer_id=3. Projections could theoretically be used to speed joins on the projection's ORDER BY columns, but it didn't appear to happen. You'd sometimes need to manually set the join algo for best performance, when I'd rather the optimizer choose it, maybe adaptively based on whether it seemed to make sense to do a join in-RAM.
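Concretely, a sketch of the kind of query I mean (table and column names are made up): to get good performance you had to both add the redundant filter and pick the join algorithm yourself, something like:

    SELECT count()
    FROM a
    INNER JOIN b ON b.customer_id = a.customer_id
    WHERE a.customer_id = 3
      AND b.customer_id = 3                      -- redundant filter added by hand so b's scan gets pruned
    SETTINGS join_algorithm = 'parallel_hash';   -- algorithm also chosen by hand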
ClickHouse was also great at certain things, like scanning a ton of rows very efficiently and interfacing with lots of systems (its table functions could pull from lots of sources). It was easy to pick up and start playing with, and seems to have found use in a lot of organizations and at substantial scale. And there are a lot of features that look potentially quite powerful (incremental materialized views, aggregating/deduping merge tree variants, etc.). The best way I could process what I saw experimenting with it was that they'd focused on providing a really efficient low-level execution engine, but sometimes you needed to act as the query optimizer.
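As a rough illustration of those table functions (endpoints and credentials here are made up), you can query outside systems directly:

    -- read Parquet straight out of an S3 bucket
    SELECT count()
    FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');

    -- or pull rows from a Postgres table
    SELECT *
    FROM postgresql('pg-host:5432', 'shop', 'orders', 'reader', 'secret')
    LIMIT 10;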
I got a recent ClickHouse and played with it, and looked around GitHub some; replying to this dead thread to update outdated bits of the above.
In general: given a huge hash join, you can shrink it (reorder, etc., so fewer rows need to be hashed), sort (take advantage of sort order for an efficient join), and/or spill it to disk. ClickHouse's improvements from 2024.12 forward have made it much better at the first of the three; it could get much further with the other two.
1) Shrinking -- "Can we make it a small join?" They've done a lot to solve this one! They'll swap table order, propagate WHERE conditions, and end up using projections, so quite often if the join can be small, it will be, without special effort on the query writer's part. To be clear, this is a huge win for many use cases, and I'm not writing more because all I can say is "great."
2) Sorting -- "Can we take advantage of an existing sort order?" Huge joins can be efficient and not run out of RAM if the data is already ordered by the join key and all you have to do is walk through part of it in order. ClickHouse doesn't automatically do this; it defaults to hashing even when the tables' natural row ordering would make sort-based joins fast. Docs suggest users consider full_sorting_merge when tables are already ordered by the join key. Perhaps a future setting could make the optimizer consider it.
Joins with coinciding ORDER BYs seem like they could occur naturally in "might-have" situations, e.g. joining pageviews to conversions on a shared key.
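A sketch of the manual route today (names made up, and assuming both tables are already ORDER BY user_id):

    SELECT p.url, c.amount
    FROM pageviews AS p
    LEFT JOIN conversions AS c ON c.user_id = p.user_id
    SETTINGS join_algorithm = 'full_sorting_merge';  -- you have to ask for the merge join explicitly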
2a) Sorting with projections -- "Can we use sort orders if we include projections?" If rows aren't already on disk in the correct order to make a sort join fast, the ClickHouse administrator could add a projection with a different ORDER BY to one or both tables to fix that! Right now ClickHouse will pick projections to aid in filtering, but won't look for projections that allow a fast sort join.
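For instance (sticking with the made-up tables above), an admin could give conversions a copy sorted by the join key:

    ALTER TABLE conversions ADD PROJECTION by_user (SELECT * ORDER BY user_id);
    ALTER TABLE conversions MATERIALIZE PROJECTION by_user;  -- rewrites existing parts in the background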
Combining 2 and 2a, ClickHouse admins would gain a pretty big hammer to use on problem joins: create a projection or two allowing a fast sort-join, and now the former problem query is efficient with no effort on query authors' part. This is valuable when the admins are willing to understand query and execution patterns precisely, but not everyone (or everything, when some queries are generated) can write queries with optimization in mind.
Query authors can force use of a projection with the mergeTreeProjection function, and force join_algorithm = 'full_sorting_merge', but that's the user acting as query optimizer, not the desired end state.
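Roughly like this, assuming a mergeTreeProjection(database, table, projection) signature and the made-up by_user projection from above:

    SELECT p.url, c.amount
    FROM pageviews AS p
    INNER JOIN mergeTreeProjection(currentDatabase(), 'conversions', 'by_user') AS c
        ON c.user_id = p.user_id
    SETTINGS join_algorithm = 'full_sorting_merge';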
3) Spilling -- "Will we automatically spill to disk instead of having an out-of-memory error?" Looks like this is still a "no". Docs show options to stop joins at a certain size, but I don't see a way to e.g. trigger a retry with a more spill-friendly join_algorithm when a join gets too big (or if the planner expects it will be big). There is a very interesting patch in https://github.com/ClickHouse/ClickHouse/pull/72728 which introduces the idea of query-wide RAM accounting that can trigger spill to disk, but it's only used in grace_hash as of that PR.
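The knobs I could find today either cap the join or let you hand-pick a spill-friendlier algorithm (my reading of them, not a recommendation):

    SET max_bytes_in_join = 10000000000;      -- stop the join past ~10 GB instead of OOM-ing the server
    SET join_overflow_mode = 'throw';
    SET join_algorithm = 'grace_hash';        -- grace hash partitions into buckets that can go to disk
    SET grace_hash_join_initial_buckets = 8;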
--
I haven't really looked at DISTINCT or aggregation's RAM use, or whether they take advantage of sorts. And maybe the better path is to make transparent spill-to-disk something you can rely on, then reduce how often you need to spill as an optimization. Either way, ClickHouse could get from its current state--where joins that fit in RAM are efficient but larger ones need some hand-holding--to a state where you really can throw all sorts of queries at it and reliably get results without having to guide it.
Curious, you mention Doris. I wonder if you've tried looking into StarRocks?
+1 to StarRocks. You can cut out a lot of the weight associated with denormalization, which ClickHouse almost forces on you, along with crazy big cluster sizes.
https://medium.com/@marvin_data/testing-query-speed-for-duck...
What about Druid? I don't hear much about Druid these days. Has it fallen off?
Yeah, it's legacy tech at this point.
Not yet, but it's on the list! This is R&D work that I'm doing on the side, when I have time. Do you prefer StarRocks to Doris?
Yeah, interestingly StarRocks was originally a fork from Doris, but these days it tends to outperform in most of the use cases I’ve read up on.
Low latency with high concurrency is a hard problem on large data.
We also have a budget issue with Snowflake, so it would help to aggregate elsewhere.
In our experience, the primary driver of Snowflake costs is not the compute for aggregation, but the compute required for lots of reads/scans.
We recently built a Snowflake-to-ClickHouse pipeline for a customer where aggregates are built hourly in Snowflake, then pushed into a ClickHouse table to power their user-facing dashboards.
By offloading dashboard queries to ClickHouse, they slashed their Snowflake bill by ~25%, which was worth millions to them.
(Admittedly, running aggregations elsewhere—for example, in Spark—could further reduce costs, but you would then need Iceberg to make the tables queryable in Snowflake.)
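(A minimal sketch of the ClickHouse side of such a pipeline, with made-up bucket and table names: the hourly aggregates are exported from Snowflake to S3 as Parquet and pulled in with the s3 table function.)

    INSERT INTO dashboard_hourly
    SELECT *
    FROM s3('https://bucket.s3.amazonaws.com/snowflake-exports/latest/*.parquet', 'Parquet');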
I'm in an enterprise environment where a central IT platform team controls what size warehouses we can have in Snowflake. They are not receptive to arguments for larger warehouses, unfortunately. Our issue becomes long-running queries b/c Snowflake spills the data to disk during the joins. TBH, I could join the data more quickly on my laptop than in the warehouse I'm allowed to use. Anyhow, I do have an old build server that is beefy & has 512 GB of RAM, so I can set up my aggregation and/or OLAP services there, since it's an unencumbered playground.
Simon is one of the few people exploring the contemporary challenges of data modelling in a meaningful way.
Is there anyone I can learn from about the latest challenges in data modeling?
What even is Clickhouse?