Author here. For context, I was the tech lead for the Postgres team at Cloudflare, and this came directly out of a challenge I kept hitting there: BI and dashboard teams needed to run long-running analytical queries, and the answer was always to spin up another bespoke read replica or stand up an ETL dump into an analytical database and query that.
So the question I started with was: what's the fewest components I could get away with? That led to the architecture here — Streambed connects to Postgres as a logical replication subscriber (same mechanism as a read replica) and streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB. There are a lot of edge cases to handle, and it's very much early days.
Welcome any feedback.
To me being able to query over psql is secondary. I’m fine with any SQL. What is very important is being able to transform the data to better suite analytical queries. That is, define custom transformations, define how data sectioned and what indices available.
Thanks for releasing this! How do you handle DDL queries? Are table changes synchronized to the Iceberg table automatically?
Also, I recently started looking into olake[0] to serve the same purpose. What would you say differentiates Streambed?
[0] https://github.com/datazip-inc/olake
Thanks for the kind words!
Short answer: yes, column-level schema changes sync to Iceberg automatically[0].
Logical replication (pgoutput in v1) doesn't actually stream DDL statements. Instead, Postgres emits a fresh Relation message describing the table's current column layout right before the next change to that table. So we diff that against the last layout we knew and infer what changed.
From there we evolve the Iceberg schema in place: flush any buffered rows under the old schema first, then write a new metadata version with the change. What's handled today:
[0] https://github.com/viggy28/streambed/pull/21
Hey vira28, thanks a lot for your work. This is a very promising project because other alternative like supabase/etl, Kuvasz-streamer, Sequin all have some subtle issues.
Few questions: 1) For a supabase project can we setup replication slot on replica instead of primary? https://sequinstream.com/docs/reference/databases#using-sequ...
2) For a planetscale cluster are the replication slots on primary or the follower nodes?
I'm asking because isn't setting up slots on primary riskier than setting them on replicas/followers? Because If you have them primary In case of WAL buildup your primary will go down?
Welcome. To avoid primary running out of disk space, you can configure max_slot_wal_keep_size https://www.postgresql.org/docs/17/runtime-config-replicatio...
Since Supabase is vanilla Postgres, streambed should work with replica as the source.
reg, Planetscale, I haven't looked at their offerings yet.
Where do you host your DB currently? Happy to try out with that provider as the source.
Just wanted to say thank you! Very relevant to our use cases. I'll report if I find any issues.
Welcome. Would love to hear your experience. Feel free to share here or in the repo. Fully open source.
> streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB
Why not use Ducklake instead of Apache Iceberg? Wouldn't that simplify the architecture substantially?
> queryable from psql via an embedded DuckDB.
noob question here from someone who ony played a bit with iceberg and trino: what's the reason to do the analytics stil inside the postgres -- is it so that you don't eat up the IOPS/bandwidth of the main postgresql disks?
How does it compare to https://github.com/supabase/etl ?
Very cool! What would a 10,000 feet solution look like for MySQL to Iceberg on S3?
Should be fairly doable using binlog-based producer https://github.com/go-mysql-org/go-mysql.
Why are your queries slow?