sakras 3 days ago

Giving it a quick look, seems like they've addressed a lot of the shortcomings of Parquet which is very exciting. In no particular order:

- Parquet metadata is Thrift, but with comments saying "if this field exists, this other field must exist", and no code actually verifying the fact, so I'm pretty sure you could feed it bogus Thrift metadata and crash the reader.

- Parquet metadata must be deserialized, meaning you have to allocate a buffer, read the metadata bytes, and then keep dynamically allocating a whole bunch of stuff as you parse them, since you don't know the size of the materialized metadata up front! Too many heap allocations! This file format's Flatbuffers approach seems to solve this, since you can interpret the Flatbuffer bytes in place (toy sketch at the end of this list).

- The encodings are much more powerful. I think a lot of people in the database community have been saying for a long time that we need composable/recursive lightweight encodings. BtrBlocks was the first such open format that I remember, and then FastLanes followed. Both were much better than Parquet by itself, so I'm glad ideas from those two formats are being taken up.

- Parquet did the Dremel record-shredding thing which just made my brain explode and I'm glad they got rid of it. It seemed to needlessly complicate the format with no real benefit.

- Parquet datapages might contain different numbers of rows, so you have to scan the whole ColumnChunk to find the row you want. Here it seems like you can just jump to the DataPage (IOUnit) you want.

- They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.
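To make the metadata point concrete, here's a toy contrast in Python (not the actual Thrift or Flatbuffers code, just the shape of the difference), pretending the metadata is a million little-endian int64 row counts:

    import struct
    import numpy as np

    # Pretend the serialized metadata is one million little-endian int64s.
    buf = np.arange(1_000_000, dtype="<i8").tobytes()

    # Thrift-style: parse everything into heap-allocated Python objects.
    parsed = [x for (x,) in struct.iter_unpack("<q", buf)]

    # Flatbuffers-style: wrap the bytes and index them in place, zero copies.
    view = np.frombuffer(buf, dtype="<i8")
    assert parsed[42] == view[42]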

Overall great improvement, I'm looking forward to this taking over the data analytics space.

  • progval 2 days ago

    > - They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.

    "Heavyweight compression" as in zstd and brotli? That's very useful for columns of non-repeated strings. I get compression ratios in the order of 1% on some of those columns, because they are mostly ASCII and have lots of common substrings.

  • dan-robertson 2 days ago

    I think the wasm compiler is going to bring in more dependencies than the ‘heavy’ compression would have.

    I think that more expensive compression may have made more of a difference 15 years ago when cpu was more plentiful compared to network or disk bandwidth.

  • miohtama 3 days ago

    For compression, has the world settled with zstd now?

    • dan-robertson 2 days ago

      I think it’s a pretty common choice when you want compression in a new format or protocol. It works well when you compress chunks of your data rather than one large file, e.g. when you want to maintain some kind of index or random access. Similarly, if you have many chunks then you can parallelise decompression (I’m not sure any kind of parallelism support should have been built into the zstd format itself, though it is useful for command-line uses).

      A big problem for some people is that Java support is hard, as it isn’t portable, so e.g. making a Java web server compress its responses with zstd isn’t so easy.

      • oblio 2 days ago

        Java can just use native libraries, there are plenty of Java projects that do that.

        It's not like it's 1999 and there is still some Sun dogma against doing this.

        • dan-robertson a day ago

          Sure, I don't want to make a big deal about this but I have observed Java projects choosing to not support zstd for portability (or software packaging) reasons.

          • oblio a day ago

            Well, convenience is also a factor in some cases. Much easier to schlep a "pure-Java" jar around.

    • craftkiller 2 days ago

      Depends on the use-case. For transparent filesystem compression I would still recommend lz4 over zstd because speed matters more than compression ratio in that use case.

    • lionkor 2 days ago

      Most definitely not settled, but it's a good default

  • MrBuddyCasino 3 days ago

    > I'm looking forward to this taking over the data analytics space

    Parquet is surprisingly arcane. There are a lot of unpleasant and poorly documented details one has to be aware of in order to use it efficiently.

  • theamk 2 days ago

    They added wasm dependency though.

    Build teams, weep in fear...

anigbrowl 3 days ago

> Wes McKinney

Sold

For those who don't know: Wes McKinney is the creator of Pandas, the go-to tabular analysis library for Python. That gives his format widespread buy-in from the outset, as well as a couple of decades of Caring About The Problem, which makes his insights unusually valuable.

  • snthpy 3 days ago

    Andy Pavlo also deserves a shout out; he's an authority on databases and lives a "data oriented lifestyle". See his two "What goes around comes around ..." papers for great overviews of the past and future 50 years in databases respectively (as well as all the amazing CMU Seminar Series). I'm excited to see him involved in this.

    My apologies to the Chinese coauthors who I'm not familiar with.

    Bookmarked for thorough reading!

    • whage 2 days ago

      I've been looking for an article that I think I found 10+ years ago on the web about a very similar topic, something like "the evolution of database systems". It was a very well written article that I just skimmed through and planned to read properly but could never find it again. I remember it had hand-drawn-style diagrams of database architectures with blue backgrounds, sometimes squiggly lines and maybe even some bees on them (could be just my mind mixing it with something). I'd be eternally grateful if someone here could find it.

  • snthpy 3 days ago

    I'm a big fan of Wes' work and Pandas was incredibly influential. However, technically it wasn't his best work. As selling points go, I think the Arrow data format is much stronger and more influential on the data ecosystem as a whole; see DataFusion, etc...

    That said, now let me see what F3 is actually about (and yes, your comment is what actually made me want to click through to the link) ...

  • geodel 3 days ago

    Also creator of Apache Arrow. A core component of modern data analytics.

  • wodenokoto 3 days ago

    His work on Parquet probably stands out as a better appeal to authority

  • nialse 3 days ago

    Mixing data and code is a classic security mistake. Having one somewhat known individual involved doesn’t magically make it less of a mistake.

    • snthpy 3 days ago

      I was also concerned about the wasted overhead. However I guess it's just there for compatibility (since space is cheap) and for common encodings you'll be able to skip reading it with range requests and use your trusted codec to decode the data. Smart move imho.

      • nialse 3 days ago

        I’m not concerned about the overhead; there are always more and larger pieces of iron. Still, it's not a good idea to mix executable code with data.

        • chme 2 days ago

          It really depends on the order of priorities. If the overall goal is to allow digital archaeologists to make sense of some file they found, it would be prudent to give them some instructions on how it is decoded.

          I just hope that people will not just execute that code in an unconfined environment.

          • nialse 2 days ago

            Hope is not an adequate security best practice. ;)

digdugdirk 3 days ago

Forgive me for not understanding the difference between columnar storage and otherwise, but why is this so cool? Is the big win that you can send around custom little vector embedding databases with a built in sandbox?

I get the "step change/foundational building block" vibe from the paper - and the project name itself implies more than a little "je ne sais quoi" - but unfortunately I only understand a few sentences per page. The pictures are pretty though, and the colour choices are tasteful yet bold. Two thumbs up from the easily swayed.

  • gopalv 3 days ago

    > Is the big win that you can send around custom little vector embedding databases with a built in sandbox?

    No, this is a compatibility layer for future encoding changes.

    For example, ORCv2 has never shipped because we tried to bundle all the new features into a new format version, ship all the writers with the features disabled, then ship all the readers with support and then finally flip the writers to write the new format.

    Specifically, there was a new flipped bit version of float encoding which sent the exponent, mantissa and sign as integers for maximum compression - this would've been so much easier to ship if I could ship a wasm shim with the new file and skip the year+ wait for all readers to support it.
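    For anyone curious, the trick looks roughly like this (a numpy sketch, not the actual ORC encoder): split the doubles into sign/exponent/mantissa streams, each of which compresses far better on its own.

      import numpy as np

      vals = np.array([1.0, 1.5, 2.0, 2.5, 3.0], dtype=np.float64)
      bits = vals.view(np.uint64)

      # IEEE-754 double: 1 sign bit, 11 exponent bits, 52 mantissa bits.
      SIGN_SHIFT, EXP_SHIFT = np.uint64(63), np.uint64(52)
      EXP_MASK, MANT_MASK = np.uint64(0x7FF), np.uint64((1 << 52) - 1)

      sign     = bits >> SIGN_SHIFT                 # stream of 0/1
      exponent = (bits >> EXP_SHIFT) & EXP_MASK     # small, clustered integers
      mantissa = bits & MANT_MASK                   # often lots of trailing zeros

      # Each stream is now RLE/delta/bit-packing friendly on its own;
      # reassembly is just the shifts in reverse.
      rebuilt = (sign << SIGN_SHIFT) | (exponent << EXP_SHIFT) | mantissa
      assert np.array_equal(rebuilt.view(np.float64), vals)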

    We'd have made progress with the format, but we'd also be able to deprecate a reader impl in code without losing compatibility if the older files carried their own information.

    Today, something like Spark's variant type would benefit from this - the sub-columnarization that it does would be so much easier to ship as bytecode instead of as an interpreter that contains support for all possible recombinations of the split-up columns.

    PS: having spent a lot of nights tweaking tpc-h with ORC and fixing OOMs in the writer, it warms my heart to see it sort of hold up those bits in the benchmark

    • RyanHamilton 2 days ago

      I'm less confident. Your description highlights a real problem, but this particular solution looks like an attempt to shoehorn a technical fix onto a political/people problem. It feels like one of those great ideas that, years later, results in thousands of different decoders, breakages, and a nightmare to maintain. Then someone starts an initiative to move decoding out of the bundle and to instead just define the data format.

      Sometimes the best option is to do the hard political work: improve the standard and get everyone moving with it. People have pushed Parquet and Arrow, which are absolutely great technologies that I use regularly, but 8 years after someone asked how to write Parquet in Java, the best answer is to use DuckDB: https://stackoverflow.com/questions/47355038/how-to-generate...

      Not having a good Parquet writer for Java shows a poor attempt at pushing forward a standard. Similarly, Arrow has problems in Java land. If they can't be bothered to consider how to actually implement and roll out standards for a top-5 language, I'm not sure throwing WASM into the mix will fix it.

  • willtemperley 2 days ago

    Columnar storage is great because a complex nested schema can be decomposed into its leaf values and stored as primitives. You can directly access leaf values, avoiding a ton of IO and parsing. Note that all these formats are actually partitioned into groups of rows at the top level.

    One big win here is that it's possible to get Apache Arrow buffers directly from the data pages - either by using the provided WASM or bringing a native decoder.

    In Parquet this is currently very complicated. Parquet uses the Dremel encoding which stores primitive values alongside two streams of integers (repetition and definition levels) that drive a state machine constructed from the schema to reconstruct records. Even getting those integer streams is hard - Parquet has settled on “RLE” which is a mixture of bit-packing and run-length encoding and the reference implementation uses 74,000 lines of generated code just for the bit-packing part.
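    For a sense of the hybrid encoding, here's a rough sketch of the decode loop based on my reading of the Parquet spec (hedged, not production code; real implementations unroll this per bit width, which is where the mountains of generated code come from):

      def read_uleb128(buf, pos):
          # Each run starts with a ULEB128 varint header.
          result, shift = 0, 0
          while True:
              byte = buf[pos]
              pos += 1
              result |= (byte & 0x7F) << shift
              if not byte & 0x80:
                  return result, pos
              shift += 7

      def decode_rle_bitpacked_hybrid(buf, bit_width, num_values):
          values, pos = [], 0
          mask = (1 << bit_width) - 1
          byte_width = (bit_width + 7) // 8
          while len(values) < num_values:
              header, pos = read_uleb128(buf, pos)
              if header & 1:                      # bit-packed run: groups of 8 values
                  groups = header >> 1
                  nbytes = groups * bit_width
                  packed = int.from_bytes(buf[pos:pos + nbytes], "little")
                  pos += nbytes
                  values += [(packed >> (i * bit_width)) & mask
                             for i in range(groups * 8)]
              else:                               # RLE run: one value, repeated
                  run_len = header >> 1
                  value = int.from_bytes(buf[pos:pos + byte_width], "little")
                  pos += byte_width
                  values += [value] * run_len
          return values[:num_values]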

    So to get Arrow buffers from Parquet is a significant amount of work. F3 should make this much easier and future-proof.

    One of the suggested wins here is random access to metadata. When using GeoParquet I index the metadata in SQLite, otherwise it would take about 10 minutes as opposed to a few milliseconds to run a spatial query on e.g. Overture Maps - I'd need to parse the footer of ~500 files meaning ~150MB of Thrift would need to be parsed and queried.

    However the choice of Google's Flatbuffers is an odd one. Memory safety with FlatBuffers is a known problem [1]. I seriously doubt that generated code that must mitigate these threats will show any real-world performance benefits. Actually - why not just embed a SQLite database?

    [1] https://rustsec.org/advisories/RUSTSEC-2021-0122.html

  • bbminner 3 days ago

    Afaik the win of columnar storage comes from the fact that you can very quickly scan an entire column across all rows, making very efficient use of OS buffering etc., so queries like select a where b = 'x' are very quick.

    • Someone 3 days ago

      > so queries like select a where b = 'x' are very quick

      I wouldn’t say “very quick”. They only need to read and look at the data for columns a and b, whereas, with a row-oriented approach, with storage being block-based, you will read additional data, often the entire dataset.

      That’s faster, but for large datasets you need an index to make things “very quick”. This format supports that, but whether to have that is orthogonal to being row/column oriented.

      • mr_toad 2 days ago

        Sum(x) is a better example. Indexing x won’t help when you need all the values.

    • rovr138 3 days ago

      Another useful one is aggregations. Think sum(), concat(), max(), etc. You can operate on the column.

      This is in contrast to row based. You have to scan the full row, to get a column. Think how you'd usually read a CSV (read line, parse line).
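      A small illustration with pyarrow versus the stdlib csv module (the file names are made up):

        import csv
        import pyarrow.compute as pc
        import pyarrow.parquet as pq

        # Columnar: only the "size" column is read off disk, then reduced in one pass.
        sizes = pq.read_table("events.parquet", columns=["size"])["size"]
        total = pc.sum(sizes).as_py()

        # Row-based: every row has to be read and parsed just to reach one field.
        with open("events.csv", newline="") as f:
            total_csv = sum(int(row["size"]) for row in csv.DictReader(f))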

    • PhilippGille 2 days ago

      When b = 'x' is true for many rows and you select * or multiple columns, then it's the opposite, because reading all row data is slower in column based data structures than in row based ones.

      IMO it's easier to explain in terms of workload:

      - OLTP (T = transactional workloads): row based, for operating on rows
      - OLAP (A = analytical workloads): column based, for operating on columns (sum/min/max/...)

theamk 3 days ago

this is messed up

> The decoding performance slowdown of Wasm is minimal (10–30%) compared to a native implementation.

so... you take a 10-30% performance hit _right away_, and you perpetually give up any opportunity to improve the decoder in the future. And you also give up any advanced decoding functions other than "decode whole block and store into memory".

I have no idea why would anyone do this. If you care about speed, then wasm is not going to cut it. If you don't care about speed, you don't need super-fancy encoding algorithms, just use any of the well-known ones.

  • apavlo 3 days ago

    > so... you take 10%-30% performance hit _right away_, and you perpetually give up any opportunities to improve the decoder in the future.

    The WASM is meant as a backup. If you have the native decoder installed (e.g., as a crate), then a system will prefer to use that. Otherwise, fall back to WASM. A 10-30% performance hit is worth it over not being able to read a file at all.

    • Certhas 3 days ago

      It even says so right in the abstract:

      "Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable."

      The idea that software I write today can decode a data file written in ten years using new encodings is quite appealing.

      And the idea that new software written to make use of the new encodings doesn't have to carry the burden of implementing the whole history of encoders for backwards compatibility likewise.

      • silvestrov 2 days ago

        Now you have code stored in your database, and you don't know what it will do when you execute it.

        Sounds very much like the security pain from macros in Excel and Microsoft Word that could do anything.

        This is why most PDF readers will ignore any javascript embedded inside PDF files.

        • tsukikage 2 days ago

          It gets even better further down the paper!

          "In case users prefer native decoding speed over Wasm, F3 plans to offer an option to associate a URL with each Wasm binary, pointing to source code or a precompiled library."

          • discreteevent 2 days ago

            They are not suggesting that the code at the url would be automatically downloaded. It would be up to you to get the code and build it into your application like any other library.

        • Certhas 2 days ago

          Is this relevant in practice? Say I go to a website to download some data, but a malicious actor has injected an evil decoder (that does what exactly?). They could just have injected the wasm into the website I am visiting to get the data!

          In fact, wasm was explicitly designed for me to run unverified wasm blobs from random sources safely on my computer.

        • mr_toad 2 days ago

          Excel, Word and PDF readers weren’t properly sandboxed.

  • xyzzy_plugh 3 days ago

    I kind of agree with you, but there's more to the picture.

    The situation you describe is kind of already the case with various approaches to compression. For example, perhaps we decide to bitpack instead of use the generic compressor. Or change compressors entirely.

    This sort of thing exists without WASM, and it means you have to "transcode" i.e. rewrite the file after updating your software with the new techniques.

    With WASM, it's the same. You just rewrite the file.

    I do agree that this pushes the costs of iteration up the stack in a vastly less efficient way. Overall this seems way more expensive, very unclear that future proofing is worth it. I've worked with exabyte-scale systems and re-encoding swaths of data regularly would not be good.

tomnicholas1 3 days ago

The pitch for this sounds very similar to the pitch for Vortex (i.e. obviating the need to create a new format every time a shift occurs in data processing and computing by providing a data organization structure and a general-purpose API to allow developers to add new encoding schemes easily).

But I'm not totally clear what the relationship between F3 and Vortex is. It says their prototype uses the encoding implementation in Vortex, but does not use the Vortex type system?

  • apavlo 3 days ago

    The backstory is complicated. The plan was to establish a consortium between CMU, Tsinghua, Meta, CWI, VoltronData, Nvidia, and SpiralDB to unify behind a single file format. But that fell through after CMU's lawyers freaked out over Meta's NDA stuff to get access to a preview of Velox Nimble. IANAL, but Meta's NDA seemed reasonable to me. So the plan fell through after about a year, and then everyone released their own format:

    → Meta's Nimble: https://github.com/facebookincubator/nimble

    → CWI's FastLanes: https://github.com/cwida/FastLanes

    → SpiralDB's Vortex: https://vortex.dev

    → CMU + Tsinghua F3: https://github.com/future-file-format/f3

    On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea came as a suggestion from Hannes@DuckDB to Wes McKinney (a co-author with us). We just used Vortex's implementations since they were in Rust and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.

    I note that the Germans also released their own file format this year that also uses WASM. But they WASM-ify the entire file rather than individual column groups:

    → Germans: https://github.com/AnyBlox

    • rancar2 3 days ago

      Andrew, it’s always great to read the background from the author on how (and even why!) this all played out. This comment is incredibly helpful for understanding the context of why all these multiple formats were born.

    • Centigonal a day ago

      If I could ask you to speculate for a second, how do you think we will go from here to a clear successor to Parquet?

      Will one of the new formats absorb the others' features? Will there be a format war a la iceberg vs delta lake vs hudi? Will there be a new consortium now that everyone's formats are out in the wild?

    • digdugdirk 3 days ago

      ... Are you saying that there's 5 competing "universal" file format projects? Each with different non-compatible approaches? Is this a laughing/crying thing, or a "lots of interesting paths to explore" thing?

      Also, back on topic - is your file format encryptable via that WASM embedding?

    • tomnicholas1 2 days ago

      Thank you for the explanation! But what a mess.

      I would love to bring these benefits to the multidimensional array world, via integration with the Zarr/Icechunk formats somehow (which I work on). But this fragmentation of formats makes it very hard to know where to start.

Someone 3 days ago

The embedded WASM makes this more extensible, but still, they have to choose, and freeze forever, interfaces that the WASM blocks can implement. Chances are that will, at some time, become a pain point.

They also have to specify the WASM version. In 6 years, we already have 3 versions there (https://webassembly.org/specs/). Are those 100% backwards compatible?

mgaunard 2 days ago

Requiring any implementation that wants to read the data to be able to run WebAssembly means the format is ill-suited to any environment that aims to reduce dependencies or bloat.

  • aDyslecticCrow 2 days ago

    Wasm is pretty simple; don't confuse it with the "web" part of the acronym. And as others point out, Wasm is a backup: you can use native bindings without going through Wasm, which provide better performance.

    • mgaunard 2 days ago

      It needs a compiler to convert it to machine code, and it also has a non-trivial runtime...

      • Bigpet 2 days ago

        I don't know that you need to compile; you might want to, but interpreting seems reasonable if size/complexity is a concern.

        Is the runtime really that large? I know wasm 2.0, with garbage collection and exceptions, is a bit of a beast, but wasm 1.0? What's needed (I'm speaking from a place of ignorance here, I haven't implemented a WASM runtime)? Some contiguous memory, a stack machine, IEEE float math and some utf-8 operations. I think you can add some reasonable limitations like only a single module and a handful of available imports relevant to the domain.

        I know that feature creep would almost inevitably follow, but if someone cares about minimizing complexity it seems possible.

      • drob518 2 days ago

        If you view WASM as a worst case fallback, then you could implement a simple interpreter and transcode the file to something that your program has native decoders for. Yes, that might be slow, but at least the data remains usable. At least, that’s my understanding. I don’t have a dog in the hunt.

  • Certhas 2 days ago

    If you have a native decoder you don't need to run WASM. It's literally in the abstract.

    • Someone 2 days ago

      But given a file in this format, you cannot know whether your native decoder can read it without running WASM; the file you read may use a feature that was invented after your library was written.

      • ko27 2 days ago

        Of course you can, you can literally read the version from the file itself.

lifthrasiir 3 days ago

It seems like one of the first file formats to embed WebAssembly modules. Is there any other prior work? I'm specifically interested from the compression perspective, as a well-chosen WebAssembly preprocessor can greatly boost the compression ratio. (In fact, I'm currently working on such a file format, just in case.)

  • ianbicking 2 days ago

    Alan Kay described [1] what he considers the first object-oriented system, made in the 60s by an unknown programmer. It was a tape-based storage system, where the "format" of the tape was a set of routines to read, write, etc. at known offsets on the tape.

    So, prior art! :)

    [1] https://www.cs.tufts.edu/comp/150FP/archive/alan-kay/smallta... (page 4)

esafak 3 days ago

How do they prevent people from embedding malicious payloads in the WebAssembly?

  • aeonfox 3 days ago

    By sandboxing:

    > We first discuss the implementation considerations of the input to the Wasm-side Init() API call. The isolated linear memory space of Wasm instance is referred to as guest, while the program’s address space running the Wasm instance is referred to as host. The input to a Wasm instance consists of the contiguous bytes of an EncUnit copied from the host’s memory into the guest’s memory, plus any additional runtime options.

    > Although research has shown the importance of minimizing the number of memory copies in analytical workloads, we consider the memory copy while passing input to Wasm decoders hard to avoid for several reasons. First, the sandboxed linear memory restricts the guest to accessing only its own memory. Prior work has modified Wasm runtimes to allow access to host memory for reduced copying, but such changes compromise Wasm’s security guarantees

    • theamk 3 days ago

      It's "the future work". But possibly, allowlists might help:

      > Security concerns. Despite the sandbox design of Wasm, it still has vulnerabilities, especially with the discovery of new attack techniques. [...] We believe there are opportunities for future work to improve Wasm security. One approach is for creators of Wasm decoding kernels to register their Wasm modules in a central repository to get the Wasm modules verified and tamper-resistant.

      • aeonfox 3 days ago

        The only issue the article seems to raise is that their solution isn't that optimal because there's redundant copying of the input data into the sandbox, but this enables the sandbox to be secure as the Wasm code can't modify data outside of its sandbox. I'd assume the memory is protected at the CPU level, something akin to virtualisation. But then there's maybe some side-channel attack it could use to extract outside data somehow?

    • mannyv 3 days ago

      Because sandboxing has worked so well in preventing security issues?

    • flockonus 3 days ago

      For one, sandboxing can't solve the halting problem.

      • aeonfox 3 days ago

        So you're suggesting a denial of service? I doubt the sandbox runs on cooperative scheduling. Ultimately the VM can be run for some number of cycles at a time: enough to get work done, but few enough to let other programs run, and if the decoder doesn't seem to be making any progress the user or an automated process can terminate it. Something like this already happens when JS code takes up too much time in a browser and a little pop-down lets you stop it.

      • catlifeonmars 3 days ago

        You can if you bound the program in time and space. In particular, if you put a timeout on a computation, you know for a fact it will halt at some point before the timeout.
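        For example, a minimal stdlib-only sketch of bounding an untrusted decode call in time and memory (POSIX-only; a real Wasm runtime would typically use fuel metering or an epoch deadline, and you'd want this in a subprocess since setrlimit applies process-wide):

          import resource
          import signal

          def _timeout(signum, frame):
              raise TimeoutError("decoder exceeded its time budget")

          def run_bounded(fn, *args, seconds=5, max_bytes=256 * 1024 * 1024):
              # Cap address space (space bound) and wall-clock time (time bound).
              resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
              signal.signal(signal.SIGALRM, _timeout)
              signal.alarm(seconds)
              try:
                  return fn(*args)
              finally:
                  signal.alarm(0)   # always cancel the pending alarm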

  • waterproof 2 days ago

    On top of security concerns, you're now shipping potentially buggy code with every dataset. And you're relying on adherence to semver to match local binary unpacker implementations to particular wasm unpackers.

    I see why you're doing it, but it also opens up a whole avenue of new types of bugs. Now the dataset itself could have consistency issues if, say, the unpacker produces different data based on the order it's called, and there will be a concept of bugfixes to the unpackers - how will we roll those out?

    Fascinating.

coderlens 2 days ago

Kinda off topic but I'm really curious:

Why is there a chess move mentioned towards the end of the paper? (page 24)

CMU-DB vs. MIT-DB: #1 pe4

  • waterproof 2 days ago

    Sounds like they're trying to start a chess match with their rival researchers, inside of a sequence of papers. #1 pe4 would indicate a classic king's pawn opening (pawn to e4) [1]

    Clever.

    [1](https://chesspathways.com/chess-openings/kings-pawn-opening/)

    • coderlens 12 hours ago

      Interesting. This is the first time I'm seeing this inside a paper. I would love to see the entire sequence if it has been done before, in other papers.

fallat 3 days ago

A format requiring a program to decode is nuts. Might as well bundle 7zip with every zip file.

  • coderatlarge 3 days ago

    not a bad idea if your average file size is a terabyte and there’s never a sure way to get a clean binary.

  • totetsu 2 days ago

    Didn’t this used to be done? I’m sure I remember finding .exe zip files and being surprised it wasn’t just a virus.

  • mr_toad 2 days ago

    Self-extracting executables (zip files with the decoder included) have been around for decades.

  • iceboundrock 2 days ago

    7zip has a feature called `7-Zip self-extracting (SFX) archive`.

fijiaarone 3 days ago

Nothing screams "future proof" like WASM.

  • DecoPerson 3 days ago

    Old websites still run very well.

    Is there any reason to believe that a major new browser tech, WASM, will ever have support dropped for its early versions?

    Even if the old versions are not as loved as the newer ones (i.e., with an engine optimized for them and ready to execute them immediately), emulation methods work wonders and could easily be downloaded on-demand by browsers needing to run "old WASM".

    I'm quite optimistic for the forwards-compatibility proposed here.

    • xyzzy_plugh 3 days ago

      Sure. Just recently Google was pushing to remove XSLT in Chrome. Flash is a bit of a different beast but it died long ago.

      However I don't think it matters too much. Here WASM is not targeting the browser, and there are many more runtimes for WASM, in many diverse languages, and they outnumber browsers significantly. It won't die easily.

    • Someone 3 days ago

      > Old websites still run very well.

      <blink> is gone. In C, gets is gone. It may take only one similar minor change to make huge amounts of data unreadable by newer WASM versions.

  • defanor 3 days ago

    Indeed, I guess in a while it is going to look pretty much like the older formats relying on obscure bytecode look now. Possibly the kind of data analytics it targets has some special requirements, but I would generally expect a simple format like CSV (or DSV, more generally) to be more future-proof.

    • yjftsjthsd-h 3 days ago

      > in a while it is going to look pretty much like the older formats relying on obscure bytecode look now

      Depends how obscure. You can play Z-machine games on pretty much anything now.

anonzzzies 2 days ago

Do we have a vetted 'optimal' format for columnar (this) and row-based (SQLite?) storage yet? I have to switch between the two all day; we have an in-memory format of our own that works, but nothing standard.

alfiedotwtf 2 days ago

Weird… for a format that says it’s a general future-proof file format, there’s not a single reference to cache-oblivious structures or other structures like VRF

gigatexal 3 days ago

Inertia is a thing. Parquet and ORC and all the myriad of other formats all exist with ecosystems and things. For this to succeed they’d have to seed the community with connectors and libraries and hooks into things to make it easy to use.

Plugins for DuckDB, for example, to read and write it. Heck, if you got Iceberg to use it instead of Parquet, that could be a win.

Sometimes tech doesn’t win on the merits because of how entrenched existing stuff is and the high cost of switching.

  • ozgrakkurt 2 days ago

    This is why parquet is becoming standard even though it is suboptimal and every single high performance db has its own internal format.

    If you need an optimized format you build it yourself; if you need a standard you use whatever everyone else is using, unless it's too bad.

    These new formats seem like they would only be useful as research to build on when creating a specific tool or database

trhway 3 days ago

The embedded decoder may as well be closing in on a full-blown SQL execution engine (since data size dwarfs executable size these days). Push that execution onto memory chips and into SSD/HDD controllers. I think something similar is happening in filesystem development too, where instead of straight access to raw data you may access some execution API over the data. A modern take on the IBM mainframe's filesystem database.

  • ozgrakkurt 2 days ago

    SSD controllers can barely keep up with advertised read/write workloads afaik, so it is hard to see how this would be useful.

albertzeyer 2 days ago

Why not store the data directly as Arrow files, to allow for mmapping? I see F3 also supports such zero-copy mmap, and skimming through the paper it actually seems to use Arrow buffers, so I wonder what the difference is to directly using Arrow files? (Arrow files are what HuggingFace datasets uses.)
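For reference, the zero-copy path with plain Arrow IPC files looks something like this in pyarrow ("data.arrow" is just a made-up path):

    import pyarrow as pa

    # Memory-map the Arrow IPC file; record batches reference the mapped
    # region directly instead of being copied into fresh buffers.
    source = pa.memory_map("data.arrow", "r")
    table = pa.ipc.open_file(source).read_all()
    print(table.num_rows, table.schema)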

  • thinkharderdev 2 days ago

    The main reason is that arrow files are not compressed at all. So storing everything as Arrow would increase storage size by 10-100x (depending on data)

  • dan-robertson 2 days ago

    I think mmapping is less useful when you are expecting to get your data over a network. Not sure though.

traceroute66 2 days ago

"file format for the future" .... asterisk, footer note "until something better comes along"

If I had a dollar for all the file formats that got deprecated or never took-off over the years...

bound008 3 days ago

The irony of being a PDF file

ikeashark 2 days ago

Anything by my favorite Andy Pavlo is good in my book

pyrolistical 3 days ago

The neat thing about this is that the client could substitute its own decoder when it detects a supported format.

This would allow them to use a decoder optimized for the current hardware, e.g. multi-threaded.

binary132 3 days ago

This feels like one of those ideas that seems like a good idea during a late-night LLM brainstorming sesh but not so good when you come back to it with a fresh brain in the morning.

  • edoceo 3 days ago

    LLM = Lager, lager, mead

ks2048 2 days ago

So, it’s columnar. Why not a format that allows columnar or row-based?

indigovole 2 days ago

What kind of embedded WASM malware are we going to be defending against with this?

  • twoodfin 2 days ago

    The obvious concern would be data-dependent backdoors for malicious “decoding”, i.e. correctly decoding ordinary data, but manipulating the decoding of targeted data in some compromising way.

    That relies on some rather far-fetched assumptions about what the attacker might reasonably be able to control undetected, and what goals they might reasonably be able to achieve through such low-level data corruption.

    Maybe information leakage? Tweak some low-order float bits in the decoded results with high-order bits from data the decoder recognizes as “interesting”?

    • discreteevent 2 days ago

      What's the attack vector in this case? The Wasm is loaded from the file itself. If they can compromise the file, then it's cheaper to just compromise the data directly.

      • twoodfin 2 days ago

        What I’m imagining is essentially a supply chain attack: The victim (mistakenly) trusts the attacker to supply an encoder. The encoder appears to function normally, but in fact will subtly leak information smuggled in decoded values of the victim’s data.

        Far-fetched, indeed.

      • actionfromafar 2 days ago

        Providing an optional, optimized, native decoder which is much faster, but does something wicked when it sees the right data.

1oooqooq 3 days ago

data file format for the future [pdf]

:)

slashdave 2 days ago

I'm looking forward to the announcement of F4 next year.

DenisM 3 days ago

tldr

The proliferation of opensource file formats (i.e., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today

Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable.

  • NegativeK 3 days ago

    I know sandboxes for wasm are very advanced, but decades of file formats with built-in scripting are flashing the danger lights at me.

    • drob518 2 days ago

      “This time it’ll be different. Trust us.”

  • kijin 3 days ago

    Is that really necessary, though? Data files are useless without a program that knows how to utilize the data. Said program should already know how to decode data on the platform it's running.

    And if you're really working on an obscure platform, implementing a decoder for a file format is probably easier than implementing a full-blown wasm runtime for that platform.

    • magicalhippo 3 days ago

      > Data files are useless without a program that knows how to utilize the data.

      As I see it, the point is that the exact details of how the bits are encoded are not really interesting from the perspective of the program reading the data.

      Consider a program that reads CSV files and processes the data in them. First column contains a timestamp, second column contains a filename, third column contains a size.

      As long as there's a well-defined interface that the program can use to extract rows from a file, where each row contains one or more columns of data values and those data values have the correct data type, then the program doesn't really care about this coming from a CSV file. It could just as easily be a 7zip-compressed JSON file, or something else entirely.

      Now, granted, this file format isn't well-suited as a generic file format. After all, the decoding API they specify is returning data as Apache Arrow arrays. Probably not well-suited for all uses.
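      Something like this toy interface (names invented for illustration, not F3's actual API), where the consumer only ever sees typed rows:

        import csv
        from datetime import datetime
        from typing import Iterator, Protocol

        class RowSource(Protocol):
            def rows(self) -> Iterator[tuple]: ...

        class CsvSource:
            """Timestamp, filename, size - same columns as the CSV example above."""
            def __init__(self, path: str):
                self.path = path

            def rows(self) -> Iterator[tuple]:
                with open(self.path, newline="") as f:
                    for ts, name, size in csv.reader(f):
                        yield datetime.fromisoformat(ts), name, int(size)

        def total_size(source: RowSource) -> int:
            # Works unchanged for a 7zip-compressed-JSON source, a Parquet
            # source, or anything else that implements rows().
            return sum(size for _, _, size in source.rows())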

      • mbreese 3 days ago

        I think the counter argument here is that you're now including a CSV decoder in every CSV data file. At the data sizes we're talking about, this is negligible overhead, but it seems overly complicated to me. Almost like it's trying too hard to be clever.

        How many different storage format implementations will there realistically be?

        • catlifeonmars 3 days ago

          > How many different storage format implementations will there realistically be?

          Apparently an infinite number, if we go with the approach in the paper /s

          • magicalhippo 2 days ago

            It does open up the possibility for specialized compressors for the data in the file, which might be interesting for archiving where improved compression ratio is worth a lot.

            • catlifeonmars 2 days ago

              That makes sense. I think fundamentally you’re trading off space between the compressed data and the lookup tables stored in your decompression code. I can see that amortizing well if the compressed payloads are large or if there are a lot of payloads with the same distribution of sequences though.