hendiatris 11 hours ago

This is a huge challenge with Iceberg. I have found that there is substantial bang for your buck in tuning how parquet files are written, particularly in terms of row group size and column-level bloom filters. In addition to that, I make heavy use of the encoding options (dictionary/RLE) while denormalizing data into as few files as possible. This has allowed me to rely on DuckDB for querying terabytes of data at low cost and acceptable performance.
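
To make this concrete, these are the kinds of knobs I mean. A simplified sketch with PyArrow; the values are placeholders and the right ones depend entirely on your data:

```python
# Simplified sketch of Parquet write tuning (placeholder values, not recommendations).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "device_id": ["a"] * 500 + ["b"] * 500,   # low cardinality -> dictionary/RLE friendly
    "reading": list(range(1000)),
})

# Sorting on the low-cardinality column makes dictionary/RLE encoding far more effective.
table = table.sort_by("device_id")

pq.write_table(
    table,
    "readings.parquet",
    row_group_size=100_000,        # tune to your data volume and query selectivity
    compression="zstd",
    use_dictionary=["device_id"],  # dictionary-encode the repetitive column
)
# Column-level bloom filters are also set at write time, but how they are exposed
# depends on which Parquet writer (and version) you use.
```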

What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but haven't found anything. It would go a long way toward showing where the bottleneck is for a given query.

  • jasonjmcghee 7 hours ago

    Have you written about your Parquet strategy anywhere? Or do you have suggested reading related to the tuning you've done? Super interested.

    • indoordin0saur 6 hours ago

      Also very interested in the Parquet tuning. I have been building my data lake, and most of the optimization I do is just efficient partitioning.

      • hendiatris 4 hours ago

        I will write something up when the dust settles; I'm still testing things out. It's a project where the data is fairly standardized but there is about a petabyte to deal with, so I think it makes sense to invest in efficiency at the lower level rather than throwing tons of resources at it. That has meant a custom parser for the input data written in Rust, lots of analysis of the statistics of the data, etc. It has been a different approach to data engineering and one that I hope we see more of.

        Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...

mritchie712 5 hours ago

This is a bit overblown.

Is Iceberg "easy" to set up? No.

Can you get set up in a week? Yes.

If you really need a datalake, spending a week setting it up is not so bad. We have a guide[0] here that will get you started in under an hour.

For smaller (e.g. under 10 TB) data where you don't need real-time, DuckDB is becoming a really solid option. Here's one setup[1] we've played around with using Arrow Flight.

If you don't want to mess with any of this, we[2] spin it all up for you.

0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws

1 - https://www.definite.app/blog/duck-takes-flight

2 - https://www.definite.app/

  • simlevesque 5 hours ago

    I think Iceberg can work in real time but the current implementations make it impossible.

    I have a vision for a way to make it work; I made another comment here. Your blog posts were helpful; I dug a bit into the Duck Takes Flight code in Python and Rust.

  • pid-1 4 hours ago

    If you're already in AWS, why wouldn't you use AWS Glue Catalog + AWS SDK for pandas + Athena?

    You can set up a data lake, save data, and start running queries in like 10 minutes with this setup.
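
    A rough sketch of that path with the AWS SDK for pandas (awswrangler); the bucket, database, and table names below are placeholders, and Glue/Athena permissions are assumed to be in place:

    ```python
    # Rough sketch with awswrangler; bucket/database/table names are placeholders
    # and the Glue database is assumed to already exist.
    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"sensor_id": [1, 2], "reading": [21.7, 19.3]})

    # Write Parquet to S3 and register the table in the Glue Catalog.
    wr.s3.to_parquet(
        df=df,
        path="s3://my-bucket/lake/readings/",
        dataset=True,
        database="my_db",
        table="readings",
    )

    # Query it back through Athena.
    result = wr.athena.read_sql_query("SELECT * FROM readings", database="my_db")
    ```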

  • whalesalad 3 hours ago

    Heads up: the logo on your site needs to be 2x'd in pixel density; it comes across as blurry on HiDPI displays. Or convert it to an SVG/vector.

datax2 an hour ago

"Hadoop’s meteoric rise led many organizations to implement it without understanding its complexities, often resulting in underutilized clusters or over-engineered architectures. Iceberg is walking a similar path."

This pain is too real, and too close to home. I've seen this outcome turn the entire business off of consuming their data via Hadoop, because it turns into a wasteland of delayed deliveries, broken datasets, ops teams who cannot scale, and architects overselling overly robust designs.

I've tried to scale Hadoop down to the business user with visual ETL tools like Alteryx, but there again compatibility between Alteryx and Hadoop sucks via ODBC connectors. I came from an AWS-based stack into a poorly leapfrogged data stack, and it's hard not to pull my hair out between the business struggling to use it and infra + ops not keeping up. Now these teams want to push to Iceberg or BigQuery while ignoring the mountains of tech debt they have created.

Don't get me wrong, Hadoop isn't a bad idea; it's just complex and a time suck. Unless you have the time to dedicate to properly deploying these solutions, which most businesses do not, your implementation will suffer and your business will suffer.

"While the parallels to Hadoop are striking, we also have the opportunity to avoid its pitfalls." no one in IT learns from their failures unless they are writing the checks, most will flip before they feel the pain.

robertkoss 3 hours ago

This article is just shameless advertising for Estuary Flow, a company the author works for. "Operational Maturity", as if Iceberg, Delta or Hudi are not mature. These are battle-tested frameworks that have been in production for years. The "small files problem" is not really a problem, because every framework supports some way of compacting smaller files. Just run a nightly job that compacts the small files and you're good to go.

  • prpl 2 hours ago

    AWS can do it for you with S3 tables.

simlevesque 5 hours ago

I'm working on an alternative Iceberg client that works better in write-heavy use cases. Instead of writing many small files, it keeps rewriting the same data file until it reaches 1 MB, but under a new name each time. Then I update the manifest to the new filename and checksum. I keep old files on disk for 60 seconds to allow pending queries to finish. I'm also working on auto compaction: when I have ten 1 MB files I compact them, same with ten 10 MB files, and so on...
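
Roughly, the write path looks like this (a toy sketch, not the actual client; the manifest update and file layout are heavily simplified):

```python
# Toy sketch of the rewrite-and-swap write path (simplified, no error handling).
import hashlib
import os
import time
import uuid

TARGET_SIZE = 1 * 1024 * 1024   # keep rewriting the data file until it reaches ~1 MB
GRACE_PERIOD_S = 60             # keep superseded files around for pending queries

def write_batch(current_rows: bytes, new_rows: bytes, table_dir: str) -> str:
    """Rewrite the growing data file under a fresh name, then swap the manifest."""
    data = current_rows + new_rows
    new_path = os.path.join(table_dir, f"data-{uuid.uuid4()}.parquet")
    with open(new_path, "wb") as f:
        f.write(data)

    checksum = hashlib.sha256(data).hexdigest()
    update_manifest(table_dir, new_path, checksum)  # point the table at the new file
    return new_path

def should_roll_over(data: bytes) -> bool:
    # Once the file reaches ~1 MB, start a new one and let compaction pick it up later.
    return len(data) >= TARGET_SIZE

def update_manifest(table_dir: str, filename: str, checksum: str) -> None:
    # Placeholder: the real client would rewrite the Iceberg manifest/metadata to
    # reference the new filename and checksum, then commit a new snapshot.
    ...

def expire_old_files(superseded: dict[str, float]) -> None:
    """Delete superseded files once they are older than the grace period."""
    now = time.time()
    for path, superseded_at in list(superseded.items()):
        if now - superseded_at > GRACE_PERIOD_S:
            os.remove(path)
            del superseded[path]
```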

I feel like this could be a game changer for the ecosystem. It's more CPU- and network-heavy for writes, but the reads are always fast. And the writes are still faster than pyiceberg.

I want to hear opinions, or reasons why this could never work.

  • thom 5 hours ago

    Interesting. My personal feeling is that we're slowly headed to a world where we can have our cake and eat it: fast bulk ingestion, fast OLAP, fast OLTP, low latency, all together in the same datastore. I'm hoping we just get to collapse whole complex data platforms into a single consistent store with great developer experience, and never look back.

    • ndm000 5 hours ago

      I’ve felt the same way. It’s so inefficient to have two patterns - OLAP and OLTP - both using SQL interfaces but requiring syncing between systems. There are some physical limits at play though. OLAP will always take less processing and disk usage if the data it needs is all right next to each other (columnar storage) where as OLTP’s need for fast writes usually means row based storage is more efficient. I think the solution would be one system that stores data consistently both ways and knows when to use which method for a given query.

      • thom 4 hours ago

        In a sense, OLAP is just a series of indexing strategies that takes OLTP data and formats it for particular use cases (sometimes with eventual consistency). Some of these indexing strategies in enterprises today involve building out entire bespoke platforms to extract and transform the data. Incremental view maintenance is a step in the right direction - tools like Materialize give you good performance to keep calculated data up to date, and also break out of the streaming world of only paying attention to recent data. But you need to close the loop and also be able to do massive crunchy queries on top of that. I have no doubt we'll get there, really exciting times.

        • ndm000 3 hours ago

          Completely agree. All of the pieces are there; it's just waiting to be acted upon. I haven't seen any of the major players really doubling down on this, but it would be so compelling.

    • simlevesque 5 hours ago

      I think it's possible too, and the Iceberg spec allows it, but the implementations are not suited for every use case.

  • mritchie712 5 hours ago

    nice! anywhere we can follow your progress?

    • simlevesque 5 hours ago

      Not right now, sadly. I have some work obligations taking my time, but I can't wait to share more.

      I'm using a basic implementation that's not backed by iceberg, just Parquet files in hive partitions that I can query using DuckDB.

Gasp0de 9 hours ago

Does anyone have a good alternative for storing large amounts of very small files that need to be individually queryable? We are dealing with a large volume of sensor readings that we need to query per sensor and per timespan, and we are hitting the problem mentioned in the article: storing millions of small files in S3 is expensive.

  • this_user 7 hours ago

    Do you absolutely have to write the data to files directly? If not, then using a time series database might be the better option. Most of them are pretty much designed for workloads with large numbers of append operations. You could always export to individual files later on if you need it.

    Another option if you have enough local storage would be to use something like JuiceFS that creates a virtual file system where the files are initially written to the local cache before JuiceFS writes the data to your S3 provider as larger chunks.

    SeaweedFS can do something similar if you configure it the right way. But both options require that you have enough storage outside of your object storage.

  • themgt 3 hours ago

    I've only played with it a bit, but Nvidia's AIStore project seems underappreciated: "lightweight, built-from-scratch storage stack tailored for AI applications" + S3 compatible.

    https://github.com/NVIDIA/aistore

  • alchemist1e9 7 hours ago

    https://github.com/mxmlnkn/ratarmount

    > To use all fsspec features, either install via pip install ratarmount[fsspec] or pip install ratarmount[fsspec]. It should also suffice to simply pip install fsspec if ratarmountcore is already installed.

  • paulsutter 8 hours ago

    If you want to keep them in S3, consolidate into sorted Parquet files. You get random access to row groups, and only the columns you need are read, so it's very efficient. DuckDB can both build and access these files efficiently. You could compact files hourly/nightly/weekly, whatever fits.
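
    A minimal sketch of that compaction step with DuckDB (paths and sort columns are made up; assumes the httpfs extension and S3 credentials are configured):

    ```python
    # Minimal DuckDB compaction sketch (paths/columns are made up).
    import duckdb

    con = duckdb.connect()

    # Read all the small files, sort so related rows land in the same row groups,
    # and write one consolidated Parquet file.
    con.execute("""
        COPY (
            SELECT *
            FROM read_parquet('s3://my-bucket/raw/*.parquet')
            ORDER BY sensor_id, ts
        )
        TO 's3://my-bucket/compacted/batch-0001.parquet'
        (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000)
    """)
    ```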

    Of course, for a simpler solution you could also use Aurora for a clean, scalable Postgres that can survive zone failures.

    • Gasp0de 8 hours ago

      The problem is that the initial writing is already so expensive. I guess we'd have to write multiple sensors into the same file instead of having one file per sensor per interval. I'll look into Parquet access options; if we could write 10k sensors into one file but still read a single sensor from that file, that could work.

      • hendiatris 4 hours ago

        You may be able to get close with sufficiently small row groups, but you will have to run some tests. You can do this in a few hours of work by taking some sensor data, sorting it by the identifier, and then writing it to Parquet with one row group per sensor. You can do this with the ParquetWriter class in PyArrow, or something else that gives you fine-grained control over how the file is written. I just checked and saw that you can have around 7 million row groups per file, so you should be fine.
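
        Something along these lines (a quick sketch with a toy table; swap in your real data and tune from there):

        ```python
        # Quick sketch: one row group per sensor via PyArrow's ParquetWriter.
        import pyarrow as pa
        import pyarrow.compute as pc
        import pyarrow.parquet as pq

        # Toy stand-in for real sensor data, sorted so each sensor is contiguous.
        table = pa.table({
            "sensor_id": [1, 1, 2, 2, 3],
            "ts":        [10, 20, 10, 20, 10],
            "value":     [0.1, 0.2, 0.3, 0.4, 0.5],
        }).sort_by("sensor_id")

        with pq.ParquetWriter("sensors.parquet", table.schema) as writer:
            for sensor in table["sensor_id"].unique():
                chunk = table.filter(pc.equal(table["sensor_id"], sensor))
                # Each write_table call becomes its own row group for chunks
                # below the default row-group size cap.
                writer.write_table(chunk)
        ```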

        Then spin up DuckDB and do some performance tests. I'm not sure this will work; there is some overhead in reading Parquet, which is why small files and small row groups are discouraged.

      • bloomingkales 5 hours ago

        Something like Redis instead? [sensorid-timerange] = value. Your key is [sensorid-timerange] to get the values for that sensor and that time range.

        No more files. You might be able to avoid per-usage pricing just by hosting this on a regular VPS.
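
        A minimal illustration with redis-py (the key layout is just an example):

        ```python
        # Illustration only: bucket readings per sensor per hour under one key.
        import redis

        r = redis.Redis()

        def record(sensor_id: str, ts_hour: str, reading: float) -> None:
            # e.g. key "sensor:1234:2024-06-01T12"
            r.rpush(f"sensor:{sensor_id}:{ts_hour}", reading)

        def read_range(sensor_id: str, hours: list[str]) -> list[list[bytes]]:
            # Fetch all readings for one sensor across the requested hour buckets.
            return [r.lrange(f"sensor:{sensor_id}:{h}", 0, -1) for h in hours]
        ```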

        • Gasp0de 5 hours ago

          We use Redis for buffering over a certain time period, and then we write the data for one sensor for that period to S3. However, we fill up large Redis clusters pretty fast, so we can only buffer for a shortish period.

zhousun 3 hours ago

The only data stack Iceberg (or the lakehouse) will never replace is OLTP systems; for high-concurrency updates, optimistic concurrency control on an object store is simply a no-go.

Iceberg out of the box is NOT good at streaming use cases: unlike formats such as Hudi or Paimon, the table format does not have the concept of a merge/index. However, the beauty of Iceberg is that it is very unopinionated, so it is indeed possible to design an engine that stream-writes to Iceberg. As far as I know, this is how engines like Upsolver were implemented:

1. Keep an in-memory buffer to track incoming rows before flushing a version to Iceberg (every 10s to a few minutes).

2. Build an indexing structure so you can write position deletes / deletion vectors instead of equality deletes.

3. The writer also tries to merge small files and optimize the table.
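
A toy sketch of just the buffering part (point 1) using pyiceberg; this is not any particular engine's code, and the catalog/table names are assumptions:

```python
# Toy sketch of micro-batched writes to Iceberg with pyiceberg (>= 0.6).
# The catalog name "default" and table "db.events" are assumptions.
import time
import pyarrow as pa
from pyiceberg.catalog import load_catalog

FLUSH_INTERVAL_S = 10
buffer: list[dict] = []

catalog = load_catalog("default")
table = catalog.load_table("db.events")

def ingest(row: dict) -> None:
    buffer.append(row)

def flush_loop() -> None:
    global buffer
    while True:
        time.sleep(FLUSH_INTERVAL_S)
        if not buffer:
            continue
        batch, buffer = pa.Table.from_pylist(buffer), []
        table.append(batch)  # each flush commits one new Iceberg snapshot
```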

And stay tuned: we at https://www.mooncake.dev/ are working on a solution to mirror a Postgres table to Iceberg and keep them always up to date.

alexmorley 8 hours ago

Most of these issues will ring true for lots of folks using Iceberg at the moment. But this does not:

> Yet, competing table formats like Delta Lake and Hudi mirror this fragmentation. [ ... ] > Just as Spark emerged as the dominant engine in the Hadoop ecosystem, a dominant table format and catalog may appear in the Iceberg era.

I think extremely few people are making bets on any other open source table format now; that consolidation already happened in 2023-2024 (see e.g. Databricks, who have their own competing format, leaning heavily into Iceberg, or the adoption from all of the major data warehouse providers).

  • twoodfin 8 hours ago

    Microsoft is right now making a huge bet on Delta by way of their “Microsoft Fabric” initiative (as always with Microsoft: Is it a product? Is it a branding scheme? Yes.)

    They seem to be the only vendor crazy enough to try to fast-follow Databricks, who is clearly driving the increasingly elaborate and sophisticated Delta ecosystem (check the GitHub traffic…)

    But Microsoft + Databricks is a lot of momentum for Delta.

    On the merits of open & simple, I agree, better for everyone if Iceberg wins out—as Iceberg and not as some Frankenstandard mashed together with Delta by the force of 1,000 Databricks engineers.

    • datadrivenangel 7 hours ago

      The only reason Microsoft is using Delta is to emphasize to CTOs and investors that Fabric is as good as Databricks, even when that is obviously false to anyone who has smelled the evaporative scent of vaporware before.

      • twoodfin 7 hours ago

        Very different business, of course, but Databricks v. Fabric reminds me a lot of Slack v. Teams.

        Regardless of the relative merits now, I think everyone agrees that a few years ago Slack was clearly superior. Microsoft could certainly have bought Slack instead of pumping probably billions into development, marketing, and discounts to destroy them.

        I think Microsoft could and would consider buying Databricks—$80–100B is a lot, but not record-shattering.

        If I were them, though, I’d spend a few billion competing as an experiment, first.

        • foobiekr 6 hours ago

          Anti-trust is the reason a lot of the kinds of deals you’re talking about don’t happen.

          • twoodfin 4 hours ago

            I agree. If the anti-trust regime had been different Microsoft would have bought Databricks years ago. Satya Nadella has surely been tapping his foot watching their valuation grow and grow.

            The Trump folks have given mixed messages on the Biden-era FTC; I'd put decent odds that, with the right tap dancing (sigh), Microsoft could make a blockbuster like this in the B2B space work.

      • esafak 7 hours ago

        Microsoft's gonna Microsoft.

theyinwhy 2 hours ago

What's a good alternative? Google BigQuery?

paulsutter 9 hours ago

Does this feel about 3x too verbose, like it’s generated?

  • jasonjmcghee 7 hours ago

    Idk if it's the verbosity but yes, reads as generated to me. Specifically sounds like ChatGPT's writing.