I also emailed Gonzalo Navarro once to ask a question, and we had a great discussion and ended up writing a paper together about the answer. [1]
Another paper of his that I really like combines a few elegant ideas into a simple implementation of bitvector rank/select: https://users.dcc.uchile.cl/~gnavarro/ps/sea12.1.pdf
During this time I got really excited about succinct data structures and wrote a Rust library implementing many bitvector types, along with a wavelet matrix. [2]
My interest came from a data visualization perspective - I was curious if space-efficient data structures could fundamentally improve interactive exploration of large datasets on the client side. Happy to chat about that if anyone's curious.
[1] Paper: https://archive.yuri.is/pdfing/weighted_range_quantile_queri... though it's pretty hard to understand without some background context. I've been meaning to write a blog post explaining the core contribution, which is a simple tweak to one of Navarro's textbook data structures.
[2] The Rust version is here: https://github.com/yurivish/made-of-bits/tree/main/rust-play... and an earlier pure-JS implementation is here: https://github.com/yurivish/made-of-bits/tree/main
Reading a Gonzalo Navarro paper is like going for a walk, taking a shower, and having a wonderful coffee. It sets the mind on fire.
https://dblp.org/pid/n/GonzaloNavarro.html
I first heard of the concept of succinct data structures from Edward Kmett, a famous Haskeller behind many popular Haskell libraries. He gave a talk on succinct data structures a long time ago: http://youtu.be/uA0Z7_4J7u8
Note that succinct data structures may not be faster than conventional structures if your dataset fits in memory: http://www.cs.cmu.edu/~huanche1/slides/FST.pdf . Of course, for large datasets where storage access times dominate, succinct data structures win all around. In any case, succinct trees are works of art (if I recall correctly, https://arxiv.org/pdf/1805.11255 was a good exposition; just look at how that RMQ works)!
There's a relative of this in making in-memory node representation efficient for large structs that have a bunch of rarely-used fields: chunk up memory in units (most reasonably 8 bytes/pointer size), allocate offsets for rarely-used fields in ascending order, and then use bitmasks to indicate which fields are present. (Note the bits correspond to units, not fields; a 16-byte field would use 2 adjacent bits in the bitmask.)
The trick is that masking & popcount (both low-cycle CPU instructions in most modern CPUs¹) make this quite fast to access and thus suitable for in-memory representation.
The intended use is when presence of optional fields is known at allocation time and doesn't change afterwards, i.e. some object is built up into a dynamically allocated buffer whose size is shrunk down by omitting fields. Changing which fields are present afterwards requires reallocating the thing, which tends to make the entire pattern pointless.
¹ the real annoyance here is that almost all x86_64 CPUs have POPCNT, except the absolute earliest ones, but if you build e.g. some package for a Linux distribution without any CPU optimization flags it'll use a library fallback popcount routine rather than the instruction :(
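For anyone who hasn't seen the trick before, here's a minimal sketch of the layout being described (my own toy illustration with made-up names, not code from any particular implementation):

```rust
/// Toy version of the presence-bitmask layout described above.
/// Each set bit in `mask` marks an 8-byte unit that is actually stored;
/// present units are packed contiguously in `payload` in ascending bit order.
struct SparseRecord {
    mask: u64,         // bit i set => unit i is present
    payload: Vec<u64>, // only the present units, in order
}

impl SparseRecord {
    /// Fetch unit `i`, if present. Its offset in `payload` is the number of
    /// present units before it, i.e. a mask + popcount over the lower bits.
    fn get(&self, i: u32) -> Option<u64> {
        if self.mask & (1 << i) == 0 {
            return None; // field was omitted at allocation time
        }
        let offset = (self.mask & ((1u64 << i) - 1)).count_ones() as usize;
        Some(self.payload[offset])
    }
}
```

A single u64 mask covers structs of up to 64 units (512 bytes); wider structs just use a multi-word mask and sum the popcounts.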
Succinct data structures are very fun! If anyone is interested, I've implemented some of this in Zig: https://github.com/judofyr/zini. The main thing this implements is a minimal perfect hash function which uses less than 4 bits per element (and can respond to queries in ~50 ns). As a part of that I ended up implementing one of these indexes for constant-time select(n): https://github.com/judofyr/zini/blob/main/src/darray.zig.
It feels kinda magic implementing these algorithms because everything becomes so tiny!
For Java, C++, and Rust there is also https://sux.di.unimi.it/ maintained by Sebastiano Vigna, a professor from Italy.
Together with his student (I also helped a bit), he wrote a paper about RecSplit, a minimal perfect hash function (MPHF) algorithm I invented: https://arxiv.org/abs/1910.06416 - that algorithm uses around 1.56 bits per key, but it is quite slow. I presented the paper at a conference in January 2020, right before the pandemic.
The algorithm with the least memory usage (and much faster as well) is now Consensus-RecSplit: https://arxiv.org/abs/2502.05613 - it can do 1.44 bits per key, which is right at the theoretical minimum (meaning, there is no good way to shrink it further). At least a small part of my original invention is still used there, nice. The fastest current MPHF is probably PtrHash https://arxiv.org/abs/2502.15539 - both papers were published last month (February 2025) by the way.
Just spent my morning diving into succinct data structures after seeing this. The memory efficiency is incredible - especially the balanced-parentheses representation, which stores a full tree in just 2 bits per node! I've been working on a project parsing large (10GB+) XML files for scientific data analysis, and our current approach burns through RAM like crazy. Has anyone here successfully implemented these structures in production systems?
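To convince myself the 2-bits-per-node claim is real, I sketched the encoding myself (toy code, not taken from any of the libraries mentioned here): each node contributes exactly one open and one close parenthesis in DFS order.

```rust
// Toy balanced-parentheses encoding: one '(' (1) and one ')' (0) per node in
// DFS order, so an n-node tree takes exactly 2n bits.
fn encode_bp(children: &[Vec<usize>], node: usize, out: &mut Vec<u8>) {
    out.push(1); // open the node
    for &c in &children[node] {
        encode_bp(children, c, out); // encode each subtree
    }
    out.push(0); // close the node
}

fn main() {
    // node 0 has children 1 and 2; node 2 has children 3 and 4
    let children = vec![vec![1, 2], vec![], vec![3, 4], vec![], vec![]];
    let mut bits = Vec::new();
    encode_bp(&children, 0, &mut bits);
    assert_eq!(bits.len(), 2 * 5); // 5 nodes -> 10 bits: (()(()()))
    println!("{:?}", bits);        // [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
}
```

Navigation (parent, subtree size, and so on) is then done with rank/select and matching-parenthesis queries on that bitvector, which is the part I still need to dig into.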
The wavelet matrix concept seems particularly promising for our text-heavy workloads. I'm curious if the constant-time operations actually hold up under real-world conditions or if there are hidden performance cliffs.
This feels like one of those CS concepts that should be more widely known but somehow got overlooked by mainstream programming. Kind of like how bloom filters were obscure until suddenly every system was using them.
When I’ve dealt with huge files in .NET, the usual approach is to stream the file so that only some of it is in memory at once. This way you can process files of hundreds of GBs. Of course, if you truly need it all in memory at once (for some reason I genuinely can’t think of), then you’d need something else.
Does your language have the concept of streaming files?
If you're streaming something row-based like a CSV, or a zipped CSV, then that's usually easy.
But when you get to hierarchical data structures like JSON/protobuf there very often simply isn't a streaming library available. There's a library function to decode the whole thing into an object in memory, and that's all.
Nothing prevents streaming in theory, it's just far more complicated to write that library.
I really like the article, but it would benefit from some numbers or complexity estimates to get some intuitive sense of what the cost is.
Am I paying 30% overhead for this particular index or that wavelet matrix? Is it double the memory use? Or is it O(log N)? No idea! "doesn't use much more space" could mean a lot of different things!
"Succinct data structure" does have a very strict definition which probably answers some of your questions: https://en.wikipedia.org/wiki/Succinct_data_structure. It's all about being close to the theoretical minimum.
> Am I paying 30% overhead for this particular index or that wavelet matrix?
Nope! That would not fit the definition. That would be a "compact data structure" according to this definition.
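Roughly, and this is my paraphrase of the usual definitions rather than the article's wording: if the information-theoretic minimum is Z bits, "implicit" means Z + O(1) bits, "succinct" means Z + o(Z) bits (the overhead becomes negligible relative to Z as the data grows), and "compact" means O(Z) bits. A fixed 30% overhead is therefore compact but not succinct.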
My go-to library for succinct data structures is SDSL-Lite [0].
[0] https://github.com/simongog/sdsl-lite
I really love this space: Navarro's book is an excellent survey.
Erik Demaine has a few great lectures on succinct data structures too: L17 and L18 on https://courses.csail.mit.edu/6.851/spring12/lectures/
> This is a field that seems to have emerged in computer science relatively recently; many of the important data structures were invented in the last 25 years.
This is crazy!
The word count seems artificially increased in the post. Here's a succinct explanation: https://www.eecs.tufts.edu/~aloupis/comp150/projects/Succinc...
I didn't think that at all. In fact I found it very readable and prefer it over the drier presentation you linked. To each its own, I guess, but there's really no need to infer ulterior motives.
With advanced CS topics, it often works to search for "<topic> lecture notes".
Wow, this is really fascinating. I guess it all comes down to how it's doing select and rank in constant time, which is probably some clever bit arithmetic. I'll have to look into how that works.
Some of it is moving or has moved down to the instruction sets: https://vaibhavsagar.com/blog/2019/09/08/popcount/
I can speak a bit about one of the approaches: "Practical Entropy-Compressed Rank/Select Dictionary" by Daisuke Okanohara and Kunihiko Sadakane. It presents two variants: one for dense bitsets (i.e. more than 50% of the bits are set) and one for sparse ones.
The dense implementation partitions the bitset into "blocks" of a fixed size, so you can find the right block with `block[idx / block_size]`. Each block is then further split into sub-blocks, which narrow things down more. All of these are additional data structures (very much like an index) stored next to the regular bitset. You use the blocks/sub-blocks to find roughly where in the bitset you are, and then use an algorithm for finding the value within a single machine word.
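A rough sketch of the dense idea, heavily simplified (my own toy code with arbitrary block sizes, not the exact layout from the paper):

```rust
/// Toy rank index: cumulative popcounts per block (here 8 words = 512 bits)
/// plus per-word counts within the block, stored alongside the raw bitset.
struct RankBits {
    words: Vec<u64>,       // the plain bitset, 64 bits per word
    block_ranks: Vec<u32>, // set bits before each block
    sub_ranks: Vec<u16>,   // set bits before each word, within its block
}

impl RankBits {
    fn new(words: Vec<u64>) -> Self {
        let (mut block_ranks, mut sub_ranks) = (Vec::new(), Vec::new());
        let (mut total, mut in_block) = (0u32, 0u16);
        for (i, w) in words.iter().enumerate() {
            if i % 8 == 0 {
                block_ranks.push(total);
                in_block = 0;
            }
            sub_ranks.push(in_block);
            total += w.count_ones();
            in_block += w.count_ones() as u16;
        }
        RankBits { words, block_ranks, sub_ranks }
    }

    /// Number of set bits in positions [0, i); requires i < 64 * words.len().
    fn rank1(&self, i: usize) -> u32 {
        let word = i / 64;
        self.block_ranks[word / 8]        // whole blocks before this position
            + self.sub_ranks[word] as u32 // whole words within the block
            + (self.words[word] & ((1u64 << (i % 64)) - 1)).count_ones() // partial word
    }
}
```

Select is the trickier direction; that's where the sampling tricks in the paper (and the darray implementation linked elsewhere in the thread) come in.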
The sparse implementation treats the bitset as an ordered list of numbers (e.g. 100001001 is interpreted as 0, 5, 8) and stores those numbers using Elias-Fano encoding. The higher bits of the Elias-Fano encoding happen to be dense, so we can use the dense implementation from above on them and then combine the result with the lower bits.
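And a compressed-to-the-essentials sketch of the Elias-Fano split (again my own toy code; real implementations pack the low bits into a bit array rather than a Vec<u64>):

```rust
/// Toy Elias-Fano encoder for a sorted list of values below `universe`.
/// The low `l` bits of each value are stored verbatim; the high parts go into
/// a dense bitvector where value number i sets the bit at (high part + i).
fn elias_fano_encode(values: &[u64], universe: u64) -> (Vec<bool>, Vec<u64>, u32) {
    let n = values.len() as u64;
    let l = if n > 0 && universe > n { (universe / n).ilog2() } else { 0 };
    let mut high = vec![false; (n + (universe >> l) + 1) as usize];
    let mut low = Vec::with_capacity(values.len());
    for (i, &x) in values.iter().enumerate() {
        low.push(x & ((1u64 << l) - 1));    // low l bits, stored as-is
        high[(x >> l) as usize + i] = true; // high part, in "unary-ish" form
    }
    // To decode value i: ((select1(high, i) - i) << l) | low[i], so a dense
    // rank/select structure over `high` gives random access to the sparse set.
    (high, low, l)
}
```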
I'm also aware of https://arxiv.org/abs/1706.00990 which is more about how to do this most efficiently at a machine-word level.
"Engineering Compact Data Structures for Rank and Select Queries on Bit Vectors" is another quite recent paper which I haven't fully digested yet.
A much more detailed explanation: https://stackoverflow.com/questions/72580828/what-is-a-succi...
That IS excellent - thank you