Show HN: Serverless OLAP with Seafowl and GCP

54 points by paws 3 years ago

Hello HN! I'm an engineer at Splitgraph and recently started learning Rust so I could make my first contribution to Seafowl [0], an early stage analytical database. Along the way I figured out a database hosting hack on GCP and wanted to share it with HN. It's a way to achieve "true" scale to zero database hosting that could be useful for certain side projects or spiky traffic situations.

A recurring problem I've faced with side projects is the need for Postgres, but no desire to deploy or maintain new instances. So when I learned GCP's "always free" tier includes serverless [1] I got curious to see if I could run a database.

While a lot of classic databases aren't usually a great fit for serverless, Seafowl separates compute, storage and catalog (catalog == a SQLite file of metadata). [2] Last month I was able to introduce GCS bucket compat to Seafowl, which enabled me to mount the catalog via gcsfuse (i.e. an adapter that allows attaching GCS buckets to local filesystems). Upshot: while FUSE does add HTTP requests to container startup, init time remains comparatively quick, even cold starts, because fetching is limited to the single catalog SQLite file only.

With this approach you get a URL you can query directly from your FE if you want, e.g. fetch() can send SELECT * ... queries straight from your users' browser. You could plot a graph from a static React frontend, or observablehq.com editor, with no persistent backend needed. So at times when nobody's using your app, 100% of your stack can scale to zero with obvious cloud spend advantages. And even if you exceed free tier limits, being PAYG offers a good chance you'll come out ahead on hosting costs anyway.

NB: Seafowl is an early stage project, so it's not really suitable if you need transactions or fast single-row writes. Otherwise, this could be a nice way to get free database hosting at a big 3 cloud provider, especially for e.g. read-only analytical reporting queries.

Feedback and suggestions are appreciated. Hope it helps you! More available if you want [3].

[0] https://seafowl.io/docs/getting-started/introduction

[1] https://cloud.google.com/run/pricing#cpu-requests

[2] Neon is another interesting project that separates compute and storage. https://neon.tech/blog/architecture-decisions-in-neon

One issue I observed was a noticeably longer startup time vs this FUSE approach, which I believe may be related to Postgres connection setup time/roundtrips. Looking forward to trying Neon again in future.

[3] https://www.splitgraph.com/blog/deploying-serverless-seafowl

aleda145 3 years ago

I use seafowl hosted on Cloud Run for a side project for Swedish Real Estate data. Around a million rows, seafowl works great!

One killer feature (aside from scaling to zero) is that the queries can be constructed as GET requests. That means we can cache the query results with cloudflare.

I have it exposed here if you want to write some SQL and check it out live: https://bostadsbussen.se/sold/query

paws 3 years ago

Glad to hear you've had good results! Yes, @mildbyte did a great job making Seafowl comply with HTTP cache semantics (i.e. Etags/Cache-Control), and it should give good results for both CDNs and browsers. When building Open Data Monitor [0] I certainly observed some nice speed ups.
For those interested in how caching works (i.e. if your dataset is public it could be an easy win) more info is in the docs [1]
[0] https://open-data-monitor.splitgraph.io/week/2023-05-22
It's a Socrata scraper that renders diffs of public/government datasets
[1] https://seafowl.io/docs/getting-started/tutorial-fly-io/part...

lee101 3 years ago

[dead]