Lots of automatic performance optimizations on what is already a very fast engine. (I’ve stopped using Pandas)
I know most software folks feel some type of way about SQL (most don’t grok it beyond a simple SELECT) but this is one of the advantages of declarative languages like SQL and a plan-execute programming paradigm where a plan is created before instructions are run, making it amenable to plan optimization.
Maybe the syntax of the language could be improved (e.g. Linq) but conceptually SQL is historically when we’ve blundered into the right. Data operations are often done in sets rather than loops and it’s a worthwhile investment for software engineers to learn to think in this way if they want to work with data correctly at scale.
Stonebraker was right in that people who avoid SQL are doomed to reinvent it poorly.
Wes McKinney (author of Pandas, but also co-author of a host of other major data formats/tools like Ibis, Arrow, Parquet, Feather), wrote "10 Things I Hate About pandas"
It was one of his earlier projects, and it "stuck". It was one of the first popular dataframe libraries for Python and it filled that niche for years (alongside dplyr/data.table in R), especially during the big data/data science craze of the 2010s. Tons have been written on it, data scientists and other data folk were brought up on it, so it's the de facto standard in many pipelines.
Since then, we have moved on to better tools like Polars and DuckDB. You will still see a lot of Pandas code in the wild due to how prevalent it is, but if you were to start a new project, you might want to use more modern tools.
Pandas is kinda of analogous to jquery -- jquery was hugely influential during its time, but we have learned a lot since, and there are more modern options (React/Vue/Svelte).
Pandas has a rather unnatural API. For instance you have to name the DataFrame anytime you want to pull a column out of it, so filtering is e.g., `my_df[(my_df['col1'] == value) & (my_df['col2'] + my_df['col3'] > 42)]`. The indexing is also kind of a mess — there are like seven different styles. Row-wise mapping is a huge pain. And of course there is no optimization; it's all computed eagerly so you are responsible for your own optimizations.
Polars, on the other hand, lets you refer to columns like `col('col1')`, which starts to add up if your DataFrame has a long name. It has no row indexing; it's all done by filtering, which is conceptually very simple. Row-wise mapping is trivial. And there is an optimizer that runs before execution.
But more than that, polars has a very fluent API, whereas pandas relies heavily on statements that can't be chained; it really breaks the flow.
Yes, agreed. The API is a big inconsistent kludge, has many warts, and generally requires too much typing and memorization. The performance is subpar. There are some very annoying design choices wrt. implicit type coercion that don't jive with my personal preferences, which caused me recurring grief.
And to engage in some light gatekeeping, there sure is a lot of terrible pandas code out there written by people that have no business calling themselves programmers. I fully realize this can happen anywhere, but I'm never excited anymore to read a line of pandas.
The tl;dr is that Polars is faster and has cleaner syntax than Pandas. That said, there's more information (books, videos, discussion boards) and example code for Pandas. If you've got a small project then Pandas does fine. I use Pandas but it's not my day job. If it was, I'd probably switch to Polars.
Thanks! I've been wanting to add this since I first started out working on DuckDB almost two years ago but I finally managed to accumulate the time (and the skills required!) to finish it up over the summer. It still has a long way to go, support for indexes in extensions are pretty... raw, and we only push down constant filters into index scans (so no spatial index-join acceleration yet). But I think having a proper spatial index is one of those things that are kind of required to really elevate the spatial extension from being just a toy and I'm super stoked to work more on it during the next release cycle and all the new possibilities that it opens up.
I still need to update the docs and there's probably going to be a blog post on it in the future, but for now there are some more details in the PR https://github.com/duckdb/duckdb_spatial/pull/383
Love the expanded C API support! Also those performance improvements are massive!
Pushing through filters and the streaming optimization for fetchone() is great! This makes it more viable to use duckdb in smaller queries from python.
I'm pretty excited for variables too! I really wanted them for when I'm using the CLI. Same with query/query_table! I appreciate the push for features that make people's lives easier while also still improving performance.
Everyone who I've introduced duckdb to (at work or outside of work) eventually is blown away (some still have lingering SQL stigma)
Unfortunately ESRI will probably just build an application around it, promote the app like they created something new and sell it to their customers. Same thing they did with GDAL.
Has saved us a number of times when having to deploy at a remote client with limited on-prem customisation for security reasons (ie. no to installing a big Postgres or other RDBMS solution).
Powerful tooling; all local to the environment and the data being worked on; SQL, so it's pretty close to a drop-in replacement compared to our old solution. Really great stuff and I was very happy to see the project gain the confidence to hit 1.0 a while back and now 1.1.
I've been eyeing DuckDB for a metric collection hobby project. Quick benchmark showed promising query performance over SQLite (unsurprising considering DuckDB is column oriented), but quite a bit slower for inserts. Does anyone have experience using it as an "online" backend DB as opposed to a data analytics engine for interactive use? From what I gather they are trying to position themselves more in the latter use case.
Doing row-by-row inserts into DuckDB is really slow. Accumulating rows in an in-memory data structure and periodically batching them into something like an in-memory Arrow table, and then reading the Arrow table into DuckDB, is fast and has been tenable for my own use cases.
Depends on the scale of users you expect for your project. Generally I like to keep oltp and olap tools in their lanes, but if < 100 people are going to be using it probably doesn't matter. I doubt duckdb has any sort of acid guarantees, so thats something to keep in mind.
there is Ibis[0] as a fairly mature package. They recently adopted duckdb as the default execution engine and it can give you a nice python dataframe API ontop of duckdb, with hot-swappability towards heavier engines.
With tools like this providing a comprehensive python API and the ability to always fall back to raw SQL, i am not sure DuckDB devs should focus on the python API at all beyond basic (to_table, from_table) features.
Impressive progress and a real chance to shake up the data tool market, but still a way to go:
There is is still much to do especially on large table formats (iceberg/delta) and memory management when running on bigger boxes on cloud. Eg the elusive "Failed to allocate ..." bug[1] is an inhibitor to the claim that big data is dead[2]. As it is, we tried and abandoned DuckDB as a cheaper replacement for some databricks batch jobs.
This episode of Kris Jenkins' Developer Voices podcast talks with a couple authors of a new book on DuckDB, and does a great job of explaining the sorts of things that make it so unusual: https://www.youtube.com/watch?v=_nA3uDx1rlg
The C extensions API is a big big very big thing.
As someone who routinely write small PG extensions, I'd love to be able to kinda use the same code for multiple backend DBs.
And I guess lots of inspiration has come from all the efforts that embed DuckDb in Postgres as a miniOLAP
What sort of extensions do you find yourself writing? I think that is fascinating and thinking back, probably something I should have at least considered a few times.
I wanted to use ibis with Duck as the backend. But I am afraid the chicken and egg problem of using existing ML libraries defaulting to use pandas as the dataframe API. Does anyone have a workaround?
DuckDB understands Pandas and Polars dataframes so you can move back and forth seamlessly between them in-memory (with Polars, it's nearly zero-copy via an in-memory data structure).
df = pd.DataFrame({"a": [1,2,3], "b": [4,5,6]}) # New Pandas dataframe
df2 = duckdb.query("select 2*a as c, b from df") # DuckDB
df3 = df2.df() # materialize as Pandas, .pl() for Polars
ml_df = ML.func(df3) # pass dataframe into ML function
df4 = duckdb.query("select count(distinct x) from ml_df")
It's not super easy to port a library with domain-specific ML/DS algorithms built around Pandas, such as accepting Pandas dataframes as arguments to its functions and class constructors.
IEEE 754 defines the result of divisions and multiplications involving NaN and Inf, which are used extensively in graphics programming.
For example, 'select 1.0/(1.0/0.0)' becomes 0.0 again. This allows for inf/nans to be canceled out instead of having to do error checking and exception handling every time one crops up.
Lots of automatic performance optimizations on what is already a very fast engine. (I’ve stopped using Pandas)
I know most software folks feel some type of way about SQL (most don’t grok it beyond a simple SELECT) but this is one of the advantages of declarative languages like SQL and a plan-execute programming paradigm where a plan is created before instructions are run, making it amenable to plan optimization.
Maybe the syntax of the language could be improved (e.g. Linq) but conceptually SQL is historically when we’ve blundered into the right. Data operations are often done in sets rather than loops and it’s a worthwhile investment for software engineers to learn to think in this way if they want to work with data correctly at scale.
Stonebraker was right in that people who avoid SQL are doomed to reinvent it poorly.
Imo dataframe apis can be superior, take a look at polars. I personally much prefer it to SQL.
For duckdb, ibis looks like a pretty nice way of using the duckdb query engine.
Dataframe APIs, sure.
Pandas? Not even once.
Can you please elaborate what's wrong with Pandas? Im looking to use either Polars or Pandas in a project and looking for insights.
Wes McKinney (author of Pandas, but also co-author of a host of other major data formats/tools like Ibis, Arrow, Parquet, Feather), wrote "10 Things I Hate About pandas"
https://wesmckinney.com/blog/apache-arrow-pandas-internals/
It was one of his earlier projects, and it "stuck". It was one of the first popular dataframe libraries for Python and it filled that niche for years (alongside dplyr/data.table in R), especially during the big data/data science craze of the 2010s. Tons have been written on it, data scientists and other data folk were brought up on it, so it's the de facto standard in many pipelines.
Since then, we have moved on to better tools like Polars and DuckDB. You will still see a lot of Pandas code in the wild due to how prevalent it is, but if you were to start a new project, you might want to use more modern tools.
Pandas is kinda of analogous to jquery -- jquery was hugely influential during its time, but we have learned a lot since, and there are more modern options (React/Vue/Svelte).
Pandas has a rather unnatural API. For instance you have to name the DataFrame anytime you want to pull a column out of it, so filtering is e.g., `my_df[(my_df['col1'] == value) & (my_df['col2'] + my_df['col3'] > 42)]`. The indexing is also kind of a mess — there are like seven different styles. Row-wise mapping is a huge pain. And of course there is no optimization; it's all computed eagerly so you are responsible for your own optimizations.
Polars, on the other hand, lets you refer to columns like `col('col1')`, which starts to add up if your DataFrame has a long name. It has no row indexing; it's all done by filtering, which is conceptually very simple. Row-wise mapping is trivial. And there is an optimizer that runs before execution.
But more than that, polars has a very fluent API, whereas pandas relies heavily on statements that can't be chained; it really breaks the flow.
I don't find pandas intuitive (API simplicity), then you have the hard to debug issues and perf
Yes, agreed. The API is a big inconsistent kludge, has many warts, and generally requires too much typing and memorization. The performance is subpar. There are some very annoying design choices wrt. implicit type coercion that don't jive with my personal preferences, which caused me recurring grief.
And to engage in some light gatekeeping, there sure is a lot of terrible pandas code out there written by people that have no business calling themselves programmers. I fully realize this can happen anywhere, but I'm never excited anymore to read a line of pandas.
Related: Ibis (a portable Python dataframe library) dropping the pandas backend in favor of DuckDB for better performance and compatibility. [1]
--
1: https://news.ycombinator.com/item?id=41389806
Thank you everyone (all sibling posts) ! Very useful information
Not the OP.
The tl;dr is that Polars is faster and has cleaner syntax than Pandas. That said, there's more information (books, videos, discussion boards) and example code for Pandas. If you've got a small project then Pandas does fine. I use Pandas but it's not my day job. If it was, I'd probably switch to Polars.
Yeah, I agree that pandas has a horrible api. Polars is by and far the best one I've tried.
It maps to SQL semantics fairly cleanly, but is more expressive and composable.
I feel like R-Tree spatial indexes are potentially the most exciting new feature in this release, but they're buried right down at the bottom of the announcement: https://duckdb.org/2024/09/09/announcing-duckdb-110.html#r-t...
Thanks! I've been wanting to add this since I first started out working on DuckDB almost two years ago but I finally managed to accumulate the time (and the skills required!) to finish it up over the summer. It still has a long way to go, support for indexes in extensions are pretty... raw, and we only push down constant filters into index scans (so no spatial index-join acceleration yet). But I think having a proper spatial index is one of those things that are kind of required to really elevate the spatial extension from being just a toy and I'm super stoked to work more on it during the next release cycle and all the new possibilities that it opens up.
I still need to update the docs and there's probably going to be a blog post on it in the future, but for now there are some more details in the PR https://github.com/duckdb/duckdb_spatial/pull/383
Love the expanded C API support! Also those performance improvements are massive! Pushing through filters and the streaming optimization for fetchone() is great! This makes it more viable to use duckdb in smaller queries from python.
I'm pretty excited for variables too! I really wanted them for when I'm using the CLI. Same with query/query_table! I appreciate the push for features that make people's lives easier while also still improving performance.
Everyone who I've introduced duckdb to (at work or outside of work) eventually is blown away (some still have lingering SQL stigma)
Damn, GeoParquet and R-Tree for spatial indexes is huge!!! ESRI better watch their back!
Unfortunately ESRI will probably just build an application around it, promote the app like they created something new and sell it to their customers. Same thing they did with GDAL.
Big fan of DuckDB!
Has saved us a number of times when having to deploy at a remote client with limited on-prem customisation for security reasons (ie. no to installing a big Postgres or other RDBMS solution).
Powerful tooling; all local to the environment and the data being worked on; SQL, so it's pretty close to a drop-in replacement compared to our old solution. Really great stuff and I was very happy to see the project gain the confidence to hit 1.0 a while back and now 1.1.
Congrats to everyone!
I've been eyeing DuckDB for a metric collection hobby project. Quick benchmark showed promising query performance over SQLite (unsurprising considering DuckDB is column oriented), but quite a bit slower for inserts. Does anyone have experience using it as an "online" backend DB as opposed to a data analytics engine for interactive use? From what I gather they are trying to position themselves more in the latter use case.
Doing row-by-row inserts into DuckDB is really slow. Accumulating rows in an in-memory data structure and periodically batching them into something like an in-memory Arrow table, and then reading the Arrow table into DuckDB, is fast and has been tenable for my own use cases.
You can always use sqlite as your primary data store, and then directly query the sqlite database from duckdb whenever you need analytics.
Depends on the scale of users you expect for your project. Generally I like to keep oltp and olap tools in their lanes, but if < 100 people are going to be using it probably doesn't matter. I doubt duckdb has any sort of acid guarantees, so thats something to keep in mind.
DuckDB does have ACID guarantees and transactions but I'd not be surprised if they are rarely used (if at all).
Ref: https://duckdb.org/docs/sql/statements/transactions
In the concurrency documentation they explicitly specify that it's not designed for lots of small transactions
Concurrency: https://duckdb.org/docs/connect/concurrency
Congratulations to duckdb team! Can't wait to try some of the newly released features and performance improvements.
I am quite curious about the plans for python dataframe like API for duckdb, and python ecosystem in general.
there is Ibis[0] as a fairly mature package. They recently adopted duckdb as the default execution engine and it can give you a nice python dataframe API ontop of duckdb, with hot-swappability towards heavier engines.
With tools like this providing a comprehensive python API and the ability to always fall back to raw SQL, i am not sure DuckDB devs should focus on the python API at all beyond basic (to_table, from_table) features.
Impressive progress and a real chance to shake up the data tool market, but still a way to go: There is is still much to do especially on large table formats (iceberg/delta) and memory management when running on bigger boxes on cloud. Eg the elusive "Failed to allocate ..." bug[1] is an inhibitor to the claim that big data is dead[2]. As it is, we tried and abandoned DuckDB as a cheaper replacement for some databricks batch jobs.
[0] https://github.com/ibis-project/ibis [1] https://github.com/duckdb/duckdb/issues/12667, https://github.com/duckdb/duckdb/issues/9880, https://github.com/duckdb/duckdb/issues/12528 [2] https://motherduck.com/blog/big-data-is-dead/
The last I read, the Spark API was to become the focus point.
https://duckdb.org/docs/api/python/spark_api
Not sure what the current status is.
ref: https://github.com/duckdb/duckdb/issues/2000#issuecomment-18...
This episode of Kris Jenkins' Developer Voices podcast talks with a couple authors of a new book on DuckDB, and does a great job of explaining the sorts of things that make it so unusual: https://www.youtube.com/watch?v=_nA3uDx1rlg
The C extensions API is a big big very big thing. As someone who routinely write small PG extensions, I'd love to be able to kinda use the same code for multiple backend DBs. And I guess lots of inspiration has come from all the efforts that embed DuckDb in Postgres as a miniOLAP
What sort of extensions do you find yourself writing? I think that is fascinating and thinking back, probably something I should have at least considered a few times.
I wanted to use ibis with Duck as the backend. But I am afraid the chicken and egg problem of using existing ML libraries defaulting to use pandas as the dataframe API. Does anyone have a workaround?
DuckDB understands Pandas and Polars dataframes so you can move back and forth seamlessly between them in-memory (with Polars, it's nearly zero-copy via an in-memory data structure).
It's super easy to switch back and forth between dataframes and duckdb, so there's no lock in?
It's not super easy to port a library with domain-specific ML/DS algorithms built around Pandas, such as accepting Pandas dataframes as arguments to its functions and class constructors.
The namesake: https://en.wikipedia.org/wiki/Eaton%27s_pintail
Perfect timing. We are currently building a spike to validate migration to DuckDB. We are extremely optimistic based on community feedback.
Today I learned:
SELECT 1 / 0 AS var_name;
yields a double with value infinite. Which is SQL spec. Must be fun times to actually use that :-)
IEEE 754 defines the result of divisions and multiplications involving NaN and Inf, which are used extensively in graphics programming.
For example, 'select 1.0/(1.0/0.0)' becomes 0.0 again. This allows for inf/nans to be canceled out instead of having to do error checking and exception handling every time one crops up.
"In process analytics database" none the wiser