Show HN: Arroyo – Write SQL on streaming data

github.com

115 points by necubi a year ago

Hey HN,

Arroyo is a modern, open-source stream processing engine that lets anyone write complex queries on event streams just by writing SQL: windowing, aggregating, and joining events with sub-second latency.
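
Here's a quick sketch of the kind of query we mean (the table and fields are made up, and the window syntax is only illustrative; the getting started guide below has real, runnable examples):

  -- count page views per URL over one-minute tumbling windows
  SELECT
      url,
      count(*) AS views
  FROM page_views
  GROUP BY
      tumble(interval '1 minute'),
      url;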

Today, data processing typically happens in batch data warehouses like BigQuery and Snowflake, even though most of the data arrives as streams. Data teams have to build complex orchestration systems to handle late-arriving data and job failures while trying to minimize latency. Stream processing offers an alternative approach: the query is compiled into a streaming program that constantly updates as new data comes in, providing low-latency results as soon as the data is available.

I started the Arroyo project after spending the past five years building real-time platforms at Lyft and Splunk. I saw firsthand how hard it is for users to build correct, reliable pipelines on top of existing systems like Flink and Spark Streaming, and how hard those pipelines are for infra teams to operate. I saw the need for a new system that would be easy enough for any data team to adopt, built on modern foundations and incorporating the lessons of the past decade of research and industry development.

Arroyo works by taking SQL queries and compiling them into an optimized streaming dataflow program: a distributed DAG of computation with nodes that read from sources (like Kafka), perform stateful computations, and eventually write results to sinks. That state is consistently snapshotted using a variation of the Chandy-Lamport checkpointing algorithm for fault tolerance and to enable fast rescaling and updates of the pipelines. The entire system is easy to self-host on Kubernetes or Nomad.

See it in action here: https://www.youtube.com/watch?v=X1Nv0gQy9TA or follow the getting started guide (https://doc.arroyo.dev/getting-started) to run it locally.

thom a year ago

Unbounded streams, but with watermarks (which right now seem fixed length?):

https://doc.arroyo.dev/concepts#watermarks

It also works based on fixed, pre-built pipelines. This is all very much in the style of most stream processing platforms today, but I hope we'll continue to move closer as an industry to having our cake and eating it: ingesting everything in real time while serving any query (with joins) over the full dataset (either incrementally or ad hoc).

  • necubi a year ago

    In SQL we currently support specifying watermarks based on SQL expressions, so it can be a bit more expressive than just the basic fixed-delay watermark: https://doc.arroyo.dev/sql/ddl#options. For watermark-based systems, I think the ideal is something that allows users to express either max latencies or statistical completeness (like, wait up to 1 minute or until an estimated 99.5% of the data is there).
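
    As a rough sketch (the field and option names here are illustrative rather than exact; the DDL docs linked above have the real options), an expression-based watermark might be declared along these lines:

      CREATE TABLE events (
          id TEXT,
          event_time TIMESTAMP,
          -- illustrative: hold the watermark 30 seconds behind the observed
          -- event time, so events up to 30 seconds late are still included
          watermark TIMESTAMP GENERATED ALWAYS AS (event_time - interval '30 seconds')
      ) WITH (
          connector = 'kafka',
          topic = 'events',
          format = 'json',
          event_time_field = 'event_time',
          watermark_field = 'watermark'
      );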

    Beyond that, there are systems that are more integrated end-to-end and can update as late-arriving data comes in (like Materialize), and I think those have their place. However, for many uses of stream processing what's important is taking action once the data is complete enough, and watermarks are a useful and pretty straightforward mechanism for that.

yevpats a year ago

Looks cool. What is the difference between this tool and Benthos (https://www.benthos.dev/)?

  • necubi a year ago

    There's a spectrum of streaming systems from simple, stateless processors to complex, stateful distributed processors.

    I'm not very familiar with Benthos, but it appears to be on the simple, stateless end. Systems like that can be great if they fit your use cases—they tend to be much easier to understand and operate.

    But they are not as flexible or powerful as stateful systems like Arroyo or Flink. For example, they cannot accurately compute aggregations, joins, or windows over long time ranges.
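
    For a concrete (and made-up) example, a query like the following has to remember every user it has seen over the trailing day, which is exactly the kind of state a stateless processor can't hold (window syntax illustrative):

      -- distinct users over a 24-hour window, updated every minute:
      -- the engine must keep a full day's worth of user ids in state
      SELECT
          count(distinct user_id) AS daily_users
      FROM page_views
      GROUP BY hop(interval '1 minute', interval '24 hours');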

sorenbs a year ago

This is a really exciting project! I recently learned about https://github.com/vmware/database-stream-processor which builds on a new theoretical foundation and claims to be 9x faster than Flink. It is also written in Rust, and there is a compiler from SQL to Rust executables. Can you comment on the differences?

  • sorenbs a year ago

    As an aside, we are evaluating different streaming engines to power data projections for Prisma.io. Excited to see support for Debezium coming in v0.4.0.

    Would be interesting to somehow make Arroyo run the Nexmark benchmark so we can clearly compare to Flink and DBSP: https://liveandletlearn.net/post/vmware-take-3-experience-wi...

    • jnewhouse a year ago

      Hi there! We actually already have a built-in Nexmark source, available out of the box; it's been pretty useful for developing new capabilities.

      I just read through the DBSP docs, and it looks like it's working in a similar space. The biggest differences in my mind are around distribution and reliability: Arroyo works across a cluster of machines and has built-in fault tolerance, while for DBSP that's still planned for the future.

      (I'm the co-creator of Arroyo, for context)

      • sorenbs a year ago

        Thank you!

        I'll try to get something set up to compare performance of the two on the same machine.

        • necubi a year ago

          That'd be great! We have versions of most of the Nexmark queries and some internal benchmarks vs Flink; we'd love to help if you're interested in benchmarking against more systems. Reach out at micah@arroyo.systems.

benjaminwootton a year ago

Between Flink, Spark, and KSQL, streaming is very JVM-centric. It is nice to see more non-JVM projects emerge.

I am not sure about your premise that the operations side is difficult. In Flink or Spark it tends to just be submitting a job to a cluster.

The harder barrier to entry is the functional style of transformation code. Even though other frameworks have it, I think the SQL API as a first-class citizen is the bigger differentiator.

jsty a year ago

In the watermarks documentation it mentions that events arriving after the watermark are dropped. Are there any plans to make this configurable (to disable dropping or trigger exception handling) and/or alertable?

I can think of quite a few use cases (particularly in finance) where we'd want late-arrivals to be recorded and possibly incorporated into later or revised results, not silently dropped on the floor.

  • necubi a year ago

    Yeah, currently late-arriving data is dropped, but we will be making this more configurable. We're currently working on what we call "update tables," which means being able to emit incremental changes to state as well as final results. Once that's in, we'll be able to give richer semantics around late-arriving data.
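
    To illustrate (the output shape here is just a sketch, not the final design), take a running aggregate over an unbounded stream:

      -- running count per URL over an unbounded stream
      SELECT url, count(*) AS views
      FROM page_views
      GROUP BY url;

      -- with update tables, instead of only a final row per window this could
      -- emit a changelog of updates as late events arrive, roughly like:
      --   {"op": "u", "url": "/home", "before": {"views": 41}, "after": {"views": 42}}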

gunnarmorling a year ago

Very interesting project, Arroyo has been on my watch list for a while! How would you say Arroyo compares to Apache Flink, i.e. what are the pros and cons? For instance, given it's implemented in Rust, I'd assume Arroyo's resource consumption might be lower?

(Disclaimer: I work for Decodable, where we build a SaaS based on Flink)

dangoodmanUT a year ago

Would love to know how you look at tools like Materialize in comparison.

fasteo a year ago

Slightly off-topic.

"Arroyo" is a Spanish word meaning creek, or stream

  • necubi a year ago

    Yep! In California where we live, it typically refers to a seasonal desert stream that can range from a trickle to a torrent. We chose it because Arroyo is a stream processor, and it's very good at autoscaling.

  • callmeed a year ago

    Correct

    I live in a California town called Arroyo Grande ("big creek")

    • fasteo a year ago

      My second surname is Arroyo. Probably my ancestors lived near a big creek ;)

KRAKRISMOTT a year ago

Very exciting! How is feature parity with Tinybird?

https://www.tinybird.co/

  • _peregrine_ a year ago

    I am not sure of the specifics on features, but I think the fundamental difference is that Arroyo is a stream processing engine, i.e., it doesn't have a database, whereas Tinybird has the statefulness afforded by ClickHouse as its primary data store. Arroyo would be more like Flink; Tinybird would be more like ClickHouse.

    Disclaimer: I work for Tinybird.

    • necubi a year ago

      Yes, that's mostly true. Arroyo is a stream processor like Flink. In Arroyo, you pre-register the queries you are interested in, and they will be compiled into streaming dataflow jobs that continuously execute as events come in. This means that we're not storing all of the raw events that come in for later querying.

      However, like a database we do have a serving layer (currently only in our cloud version due to its reliance on our distributed state backend: https://doc.arroyo.dev/connectors/state) so it is possible to query the results directly from Arroyo as well.

      Generally, you would want to use something like Arroyo when your data is too high-volume to reasonably store it all in a DB like ClickHouse, or your queries are too expensive to run on every read, since Arroyo incrementally computes the results of the query as events come in.

      There are also opportunities to use these systems together: Arroyo can pre-aggregate the high-volume raw data, which can then be inserted into a ClickHouse-based system for final processing along different dimensions.
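
      As a sketch of that pattern (the table and sink names are made up, and the syntax is only illustrative), the pre-aggregation half might look like:

        -- roll the raw event firehose up to per-minute counts in Arroyo,
        -- then write the much smaller result stream to a sink feeding ClickHouse
        INSERT INTO minute_rollups
        SELECT
            url,
            count(*) AS views
        FROM page_views
        GROUP BY
            tumble(interval '1 minute'),
            url;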

httgp a year ago

This looks great, and it’s very cool that it recommends Nomad to run it in production.

I wish more products would support (or at least document how to run on) Nomad.

  • necubi a year ago

    We're big fans of Nomad! It's what we use for scheduling in our cloud platform, due to its simplicity and scheduling speed. That said, we also have great support for Kubernetes, as that's what most folks will be running.

laurensr a year ago

Would Arroyo be an alternative to Confluent KSQL?

  • necubi a year ago

    Yes, Arroyo is an alternative to KSQL, although it's more in the design space of Flink.

    KSQL is pretty simple and easy to run if you already have Kafka, but it will be much more expensive and harder to scale due to its reliance on Kafka Streams for persistence and for shuffling data through the processing DAG.

    And with Confluent's embrace of Flink in the past year (https://www.confluent.io/blog/cloud-kafka-meets-cloud-flink-...) it's not clear that KSQL has much of a future.

  • kohlerm a year ago

    If you look at the page, I guess the answer is yes. Or more precisely, an alternative to Flink written in Rust.

trevyn a year ago

Any interest in redoing the web console in Rust? 8)

  • ukuina a year ago

    What would that accomplish?

    • tmikaeld a year ago

      I think OP was trying to be funny, like, everything being ported to Rust these days.

      • trevyn a year ago

        No, Arroyo is written in Rust, but the web front-end is React.