Ask HN: Best way to approach user-facing analytics
I’m looking for advice on technical strategies for user-facing analytics features (think real-time, customizable dashboards). I’ve ended up in a situation where I want to design analytics dashboards for end users that display aggregates/historical data spanning a number of years. The queries I want to run take way too long to execute on the fly, so I’ve resorted to pre-computing the data I care about on a schedule. This approach works fine, but it’s not real time, and it adds friction to introducing new features/views (every new feature requires backfilling years’ worth of data). I’m curious whether others have better strategies for building analytics software like this that’s real-time, responsive, and quick to iterate on. OLAP DBs like ClickHouse? Reverse ETL? Better OLTP indexing?
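For concreteness, the pre-computation I’m doing looks roughly like this (Postgres, with simplified/hypothetical names):

    -- Nightly rollup, rebuilt by a cron job.
    CREATE MATERIALIZED VIEW daily_signups AS
    SELECT date_trunc('day', created_at) AS day, count(*) AS signups
    FROM users
    GROUP BY 1;

    -- REFRESH ... CONCURRENTLY needs a unique index, but it avoids
    -- blocking dashboard reads during the rebuild.
    CREATE UNIQUE INDEX ON daily_signups (day);
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_signups;

Dashboards read daily_signups instead of the raw rows, which is why every new view means a new rollup plus a backfill.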
There are a few different approaches. The main categories, from simplest to most complex:
1) Read replicas with copied data. The most straightforward option, letting you keep the same SQL syntax and tooling. Examples: Postgres read replicas and BemiDB (disclaimer: I'm a contributor)
2) Operational databases with integrations. Designed for sub-second real-time results; they bring their own extended SQL syntax for things like window functions. Examples: Materialize and RisingWave
3) Analytical databases with syncing. You write to and read from them directly, and they're optimized for analytical workloads (see the ClickHouse sketch after this list). Examples: ClickHouse and DuckDB
4) Data warehouses with ETL. Great for large volumes of data, traditionally used with ETL batch processing. Examples: Snowflake and Redshift
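To make option 3 concrete, here's a minimal ClickHouse sketch (the events table and its columns are hypothetical): raw rows land in a MergeTree table, and a materialized view maintains a pre-aggregated daily rollup so dashboard queries never scan raw events.

    CREATE TABLE events
    (
        user_id    UInt64,
        event_type LowCardinality(String),
        ts         DateTime
    )
    ENGINE = MergeTree
    ORDER BY (user_id, ts);

    -- Aggregates each inserted block on the fly; the view only sees
    -- rows inserted after it is created.
    CREATE MATERIALIZED VIEW events_daily
    ENGINE = SummingMergeTree()
    ORDER BY (user_id, day)
    AS
    SELECT user_id, toDate(ts) AS day, count() AS events
    FROM events
    GROUP BY user_id, day;

    -- Background merges are asynchronous, so re-aggregate at query time.
    SELECT user_id, day, sum(events) AS events
    FROM events_daily
    GROUP BY user_id, day;

Because the raw table is retained, adding a new dashboard view is just another materialized view over it, which helps with the iteration/backfill problem in the original question.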
A common approach is to use a data warehouse that can execute the OLAP queries behind your dashboards (to load/calculate data) in seconds. In the simplest cases ('small data') this can simply be a replica of the app's database: a separate, read-only DB server used for reporting/analytics purposes. It is easy to configure primary-replica replication with the built-in capabilities of SQL Server, PostgreSQL, or MySQL (a minimal Postgres sketch follows).
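For the replica route, a minimal Postgres streaming-replication sketch (host and role names are hypothetical; wal_level = replica is the default on modern versions, and the standby's address must be allowed in pg_hba.conf):

    -- On the primary: create a role the standby connects with.
    CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';

    -- On the standby host (shell, not SQL): clone the primary. The -R
    -- flag writes standby.signal plus the connection settings, so the
    -- server starts as a read-only hot standby.
    --   pg_basebackup -h primary.internal -U replicator \
    --     -D /var/lib/postgresql/data -R

    -- Back on the primary: confirm the standby is streaming.
    SELECT client_addr, state, replay_lag FROM pg_stat_replication;

Analytics queries then point at the standby's connection string instead of the primary's.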
If the replica server cannot execute queries fast enough, a specialized database (optimized for OLAP workloads) should be used instead. This can be a cloud service (like BigQuery, Redshift, Snowflake, or MotherDuck) or a self-hosted solution (ClickHouse, PostgreSQL with the pg_analytics extension, or even in-process DuckDB). Data sync is performed either with a scheduled full copy (simple, but not suitable for near-real-time analytics) or via CDC (see Airbyte).
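At the in-process end of that spectrum, a rough DuckDB sketch of the scheduled full-copy approach, using its postgres extension (database, schema, and table names are hypothetical):

    INSTALL postgres;
    LOAD postgres;
    ATTACH 'dbname=app host=localhost' AS pg (TYPE postgres);

    -- Snapshot a table into DuckDB's columnar storage; re-run this on
    -- a schedule (simple, but not real-time).
    CREATE OR REPLACE TABLE orders AS
    SELECT * FROM pg.public.orders;

    -- OLAP queries now run against the local columnar copy.
    SELECT date_trunc('month', created_at) AS month, sum(total) AS revenue
    FROM orders
    GROUP BY month
    ORDER BY month;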
You can take a look at Quill.co if you just want it taken care of for you. Bring any SQL DB, and all you need to worry about is frontend styling & workflows.
Looks interesting. Have you used this for complex dashboards before? How’s the latency?