wenc 4 months ago

Lots of automatic performance optimizations on what is already a very fast engine. (I’ve stopped using Pandas)

I know most software folks feel some type of way about SQL (most don’t grok it beyond a simple SELECT) but this is one of the advantages of declarative languages like SQL and a plan-execute programming paradigm where a plan is created before instructions are run, making it amenable to plan optimization.

Maybe the syntax of the language could be improved (e.g. Linq) but conceptually SQL is historically when we’ve blundered into the right. Data operations are often done in sets rather than loops and it’s a worthwhile investment for software engineers to learn to think in this way if they want to work with data correctly at scale.

Stonebraker was right in that people who avoid SQL are doomed to reinvent it poorly.

  • theLiminator 4 months ago

    Imo dataframe apis can be superior, take a look at polars. I personally much prefer it to SQL.

    For duckdb, ibis looks like a pretty nice way of using the duckdb query engine.

    • isoprophlex 4 months ago

      Dataframe APIs, sure.

      Pandas? Not even once.

      • neeleshs 4 months ago

        Can you please elaborate what's wrong with Pandas? Im looking to use either Polars or Pandas in a project and looking for insights.

        • wenc 4 months ago

          Wes McKinney (author of Pandas, but also co-author of a host of other major data formats/tools like Ibis, Arrow, Parquet, Feather), wrote "10 Things I Hate About pandas"

          https://wesmckinney.com/blog/apache-arrow-pandas-internals/

          It was one of his earlier projects, and it "stuck". It was one of the first popular dataframe libraries for Python and it filled that niche for years (alongside dplyr/data.table in R), especially during the big data/data science craze of the 2010s. Tons have been written on it, data scientists and other data folk were brought up on it, so it's the de facto standard in many pipelines.

          Since then, we have moved on to better tools like Polars and DuckDB. You will still see a lot of Pandas code in the wild due to how prevalent it is, but if you were to start a new project, you might want to use more modern tools.

          Pandas is kinda of analogous to jquery -- jquery was hugely influential during its time, but we have learned a lot since, and there are more modern options (React/Vue/Svelte).

        • bobbylarrybobby 4 months ago

          Pandas has a rather unnatural API. For instance you have to name the DataFrame anytime you want to pull a column out of it, so filtering is e.g., `my_df[(my_df['col1'] == value) & (my_df['col2'] + my_df['col3'] > 42)]`. The indexing is also kind of a mess — there are like seven different styles. Row-wise mapping is a huge pain. And of course there is no optimization; it's all computed eagerly so you are responsible for your own optimizations.

          Polars, on the other hand, lets you refer to columns like `col('col1')`, which starts to add up if your DataFrame has a long name. It has no row indexing; it's all done by filtering, which is conceptually very simple. Row-wise mapping is trivial. And there is an optimizer that runs before execution.

          But more than that, polars has a very fluent API, whereas pandas relies heavily on statements that can't be chained; it really breaks the flow.

        • fsndz 4 months ago

          I don't find pandas intuitive (API simplicity), then you have the hard to debug issues and perf

          • isoprophlex 4 months ago

            Yes, agreed. The API is a big inconsistent kludge, has many warts, and generally requires too much typing and memorization. The performance is subpar. There are some very annoying design choices wrt. implicit type coercion that don't jive with my personal preferences, which caused me recurring grief.

            And to engage in some light gatekeeping, there sure is a lot of terrible pandas code out there written by people that have no business calling themselves programmers. I fully realize this can happen anywhere, but I'm never excited anymore to read a line of pandas.

        • neeleshs 4 months ago

          Thank you everyone (all sibling posts) ! Very useful information

        • fearmerchant 4 months ago

          Not the OP.

          The tl;dr is that Polars is faster and has cleaner syntax than Pandas. That said, there's more information (books, videos, discussion boards) and example code for Pandas. If you've got a small project then Pandas does fine. I use Pandas but it's not my day job. If it was, I'd probably switch to Polars.

      • theLiminator 4 months ago

        Yeah, I agree that pandas has a horrible api. Polars is by and far the best one I've tried.

        It maps to SQL semantics fairly cleanly, but is more expressive and composable.

simonw 4 months ago

I feel like R-Tree spatial indexes are potentially the most exciting new feature in this release, but they're buried right down at the bottom of the announcement: https://duckdb.org/2024/09/09/announcing-duckdb-110.html#r-t...

  • maxxen 4 months ago

    Thanks! I've been wanting to add this since I first started out working on DuckDB almost two years ago but I finally managed to accumulate the time (and the skills required!) to finish it up over the summer. It still has a long way to go, support for indexes in extensions are pretty... raw, and we only push down constant filters into index scans (so no spatial index-join acceleration yet). But I think having a proper spatial index is one of those things that are kind of required to really elevate the spatial extension from being just a toy and I'm super stoked to work more on it during the next release cycle and all the new possibilities that it opens up.

ashkankiani 4 months ago

Love the expanded C API support! Also those performance improvements are massive! Pushing through filters and the streaming optimization for fetchone() is great! This makes it more viable to use duckdb in smaller queries from python.

I'm pretty excited for variables too! I really wanted them for when I'm using the CLI. Same with query/query_table! I appreciate the push for features that make people's lives easier while also still improving performance.

Everyone who I've introduced duckdb to (at work or outside of work) eventually is blown away (some still have lingering SQL stigma)

ZeroCool2u 4 months ago

Damn, GeoParquet and R-Tree for spatial indexes is huge!!! ESRI better watch their back!

  • DonnyV 4 months ago

    Unfortunately ESRI will probably just build an application around it, promote the app like they created something new and sell it to their customers. Same thing they did with GDAL.

adwf 4 months ago

Big fan of DuckDB!

Has saved us a number of times when having to deploy at a remote client with limited on-prem customisation for security reasons (ie. no to installing a big Postgres or other RDBMS solution).

Powerful tooling; all local to the environment and the data being worked on; SQL, so it's pretty close to a drop-in replacement compared to our old solution. Really great stuff and I was very happy to see the project gain the confidence to hit 1.0 a while back and now 1.1.

Congrats to everyone!

beingflo 4 months ago

I've been eyeing DuckDB for a metric collection hobby project. Quick benchmark showed promising query performance over SQLite (unsurprising considering DuckDB is column oriented), but quite a bit slower for inserts. Does anyone have experience using it as an "online" backend DB as opposed to a data analytics engine for interactive use? From what I gather they are trying to position themselves more in the latter use case.

  • voidsnax 4 months ago

    Doing row-by-row inserts into DuckDB is really slow. Accumulating rows in an in-memory data structure and periodically batching them into something like an in-memory Arrow table, and then reading the Arrow table into DuckDB, is fast and has been tenable for my own use cases.

  • pantsforbirds 4 months ago

    You can always use sqlite as your primary data store, and then directly query the sqlite database from duckdb whenever you need analytics.

  • 89vision 4 months ago

    Depends on the scale of users you expect for your project. Generally I like to keep oltp and olap tools in their lanes, but if < 100 people are going to be using it probably doesn't matter. I doubt duckdb has any sort of acid guarantees, so thats something to keep in mind.

log4shell 4 months ago

Congratulations to duckdb team! Can't wait to try some of the newly released features and performance improvements.

I am quite curious about the plans for python dataframe like API for duckdb, and python ecosystem in general.

tlavoie 4 months ago

This episode of Kris Jenkins' Developer Voices podcast talks with a couple authors of a new book on DuckDB, and does a great job of explaining the sorts of things that make it so unusual: https://www.youtube.com/watch?v=_nA3uDx1rlg

fforflo 4 months ago

The C extensions API is a big big very big thing. As someone who routinely write small PG extensions, I'd love to be able to kinda use the same code for multiple backend DBs. And I guess lots of inspiration has come from all the efforts that embed DuckDb in Postgres as a miniOLAP

  • pantsforbirds 4 months ago

    What sort of extensions do you find yourself writing? I think that is fascinating and thinking back, probably something I should have at least considered a few times.

mrbonner 4 months ago

I wanted to use ibis with Duck as the backend. But I am afraid the chicken and egg problem of using existing ML libraries defaulting to use pandas as the dataframe API. Does anyone have a workaround?

  • wenc 4 months ago

    DuckDB understands Pandas and Polars dataframes so you can move back and forth seamlessly between them in-memory (with Polars, it's nearly zero-copy via an in-memory data structure).

        df = pd.DataFrame({"a": [1,2,3], "b": [4,5,6]}) # New Pandas dataframe
        df2 = duckdb.query("select 2*a as c, b from df")  # DuckDB 
        df3 = df2.df()  # materialize as Pandas, .pl() for Polars
        ml_df = ML.func(df3) # pass dataframe into ML function
        df4 = duckdb.query("select count(distinct x) from ml_df")
  • datadrivenangel 4 months ago

    It's super easy to switch back and forth between dataframes and duckdb, so there's no lock in?

    • leventov 4 months ago

      It's not super easy to port a library with domain-specific ML/DS algorithms built around Pandas, such as accepting Pandas dataframes as arguments to its functions and class constructors.

openplatypus 4 months ago

Perfect timing. We are currently building a spike to validate migration to DuckDB. We are extremely optimistic based on community feedback.

BurnGpuBurn 4 months ago

Today I learned:

SELECT 1 / 0 AS var_name;

yields a double with value infinite. Which is SQL spec. Must be fun times to actually use that :-)

  • ok123456 4 months ago

    IEEE 754 defines the result of divisions and multiplications involving NaN and Inf, which are used extensively in graphics programming.

    For example, 'select 1.0/(1.0/0.0)' becomes 0.0 again. This allows for inf/nans to be canceled out instead of having to do error checking and exception handling every time one crops up.

lakomen 4 months ago

"In process analytics database" none the wiser