points by vira28 2 days ago

Author here. For context, I was the tech lead for the Postgres team at Cloudflare, and this came directly out of a challenge I kept hitting there: BI and dashboard teams needed to run long-running analytical queries, and the answer was always to spin up another bespoke read replica or stand up an ETL dump into an analytical database and query that.

So the question I started with was: what's the fewest components I could get away with? That led to the architecture here — Streambed connects to Postgres as a logical replication subscriber (same mechanism as a read replica) and streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB. There are a lot of edge cases to handle, and it's very much early days.

Welcome any feedback.

kikimora 2 days ago

To me being able to query over psql is secondary. I’m fine with any SQL. What is very important is being able to transform the data to better suite analytical queries. That is, define custom transformations, define how data sectioned and what indices available.

erikcw 2 days ago

Thanks for releasing this! How do you handle DDL queries? Are table changes synchronized to the Iceberg table automatically?

Also, I recently started looking into olake[0] to serve the same purpose. What would you say differentiates Streambed?

[0] https://github.com/datazip-inc/olake

  • vira28 2 days ago

    Thanks for the kind words!

    Short answer: yes, column-level schema changes sync to Iceberg automatically[0].

    Logical replication (pgoutput in v1) doesn't actually stream DDL statements. Instead, Postgres emits a fresh Relation message describing the table's current column layout right before the next change to that table. So we diff that against the last layout we knew and infer what changed.

    From there we evolve the Iceberg schema in place: flush any buffered rows under the old schema first, then write a new metadata version with the change. What's handled today:

      - ADD COLUMN — new field ID allocated; the column's Postgres DEFAULT is carried into Iceberg's initial-default/write-default, so existing rows read back correctly
      - DROP COLUMN — removed from the current schema, existing data files untouched
      - Type widening — int4→int8, float4→float8 (the changes Iceberg considers compatible)
      - REPLICA IDENTITY changes
    

    [0] https://github.com/viggy28/streambed/pull/21

saxenaabhi 2 days ago

Hey vira28, thanks a lot for your work. This is a very promising project because other alternative like supabase/etl, Kuvasz-streamer, Sequin all have some subtle issues.

Few questions: 1) For a supabase project can we setup replication slot on replica instead of primary? https://sequinstream.com/docs/reference/databases#using-sequ...

2) For a planetscale cluster are the replication slots on primary or the follower nodes?

I'm asking because isn't setting up slots on primary riskier than setting them on replicas/followers? Because If you have them primary In case of WAL buildup your primary will go down?

  • vira28 1 day ago

    Welcome. To avoid primary running out of disk space, you can configure max_slot_wal_keep_size https://www.postgresql.org/docs/17/runtime-config-replicatio...

    Since Supabase is vanilla Postgres, streambed should work with replica as the source.

    reg, Planetscale, I haven't looked at their offerings yet.

    Where do you host your DB currently? Happy to try out with that provider as the source.

ashtuchkin 2 days ago

Just wanted to say thank you! Very relevant to our use cases. I'll report if I find any issues.

  • vira28 2 days ago

    Welcome. Would love to hear your experience. Feel free to share here or in the repo. Fully open source.

kshri24 2 days ago

> streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB

Why not use Ducklake instead of Apache Iceberg? Wouldn't that simplify the architecture substantially?

raducu 2 days ago

> queryable from psql via an embedded DuckDB.

noob question here from someone who ony played a bit with iceberg and trino: what's the reason to do the analytics stil inside the postgres -- is it so that you don't eat up the IOPS/bandwidth of the main postgresql disks?