I also emailed Gonzalo Navarro once to ask a question, and we had a great discussion and ended up writing a paper together about the answer. [1]
Another paper of his that I really like combines a few elegant ideas into a simple implementation of bitvector rank/select: https://users.dcc.uchile.cl/~gnavarro/ps/sea12.1.pdf
During this time I got really excited about succinct data structures and wrote a Rust library implementing many bitvector types and a wavelet matrix. [2]
My interest came from a data visualization perspective -- I was curious if space-efficient data structures could fundamentally improve the interactive exploration of large datasets on the client side. Happy to chat about that if anyone's curious.
[1] Paper: https://archive.yuri.is/pdfing/weighted_range_quantile_queri... though it's pretty hard to understand without some background context. I've been meaning to write a blog post explaining the core contribution, which is a simple tweak to one of Navarro's textbook data structures.
[2] The Rust version is here: https://github.com/yurivish/made-of-bits/tree/main/rust-play... and an earlier pure-JS implementation is here: https://github.com/yurivish/made-of-bits/tree/main
Reading a Gonzalo Navarro paper is like going for a walk, taking a shower, and having a wonderful coffee. It literally sets the mind on fire.
https://dblp.org/pid/n/GonzaloNavarro.html
Well not literally.
Many dictionaries now list one common use of “literally” as meaning “figuratively, with emphasis”. So literally officially sometimes now literally means figuratively.
I suspect some people are literally having conniption fits about this…
I’m sorry, but your comment mixes two different types of dictionaries. You talk about “official” meanings which would be a prescriptive dictionary telling you the way you are allowed to use a word. But the dictionaries that include “figuratively” in their definitions are clearly descriptive, presenting all the ways words are commonly used.
You can’t take a descriptive dictionary and then claim it is prescriptive.
There are no prescriptive dictionaries, at least not correct ones, for living languages.
IIRC both the OED and CED list figurative uses for the word; do you know any publications considered more authoritative than those for English? Webster too, for those who prefer simplified English.
I think French has prescriptive dictionaries (to varying degrees of success)
They have the Académie Française, which intends to control the language to an extent, in recent times focussing a lot on resisting the encroachment of English words and phrases, but IIRC their recommendations don't carry as much weight as many think and are often ignored even by government departments and other official French bodies.
The Académie do publish a dictionary every few decades though (there was a new edition recently), so there is a prescriptive dictionary for French, even though it carries little weight in reality.
French is the only living language to attempt it to this extent, though the existence of one is enough to make my “there are none for living languages” point incorrect. It is difficult to pin a language down until no one really speaks it day-to-day (so it doesn't evolve at the rates commonly used languages do).
https://en.wikipedia.org/wiki/Linguistic_prescription#Formal...
Very few of those have official force or cover much more than a subset of language properties (e.g. spelling rules), but definitely more than the "none" of my original assertion.
"prescriptive" does not mean "have legal force" though...
The Duden is prescriptive for German AFAIK.
Isn't this more of a cultural thing, that Germans seem to agree that it is authoritative and use it as a reference?
I'm not sure what would even make a dictionary prescriptive other than an explicit declaration that it is so or, ridiculously, a law declaring the same.
> There are no prescriptive dictionaries, at least not correct ones, for living languages.
there are no 100% correct descriptive dictionaries. Any prescriptive dictionary is automatically correct.
> Any prescriptive dictionary is automatically correct.
… in the view of their compilers.
I could write a prescriptive dictionary far more easily than I could get others to accept it as correct.
If you write a prescriptive dictionary it is correct because you are dictating the norms not describing what is real.
Yes you would have to be involved with a regulatory institution first
I'm sorry, can you point to such a prescriptive dictionary? People can talk however they please, and dictionaries are tasked with keeping up with the vernacular.
The "literally" ship sailed centuries ago. Sorry, but that battle has been lost. Even so-called "prescriptive" dictionaries would be categorically incorrect if they ignore nearly three centuries of common vernacular.
There aren’t prescriptive dictionaries for (American, at least) English.
But “official” is defined in descriptive dictionaries to include descriptive dictionaries.
Well, literally doesn't mean literally anymore--literally.
It never has, it always will. We've already lost a host of words that meant "I'm not exaggerating, I actually mean it": "really", "very", etc. I'm going to keep up the fight.
Since there are _literally_ people who use, and have been using for a while, the word without the same exact meaning as we both agree on... well.
Having said that, I will join you in this fight.
See also: exponentially.
Language is defined by its speakers, as basically a "vote". I'm going to keep voting for "literally" meaning "this actually happened" as long as it's practical, because 1) there are dozens of other ways to emphasize something 2) we need some way to say "this is not an exaggeration".
“Exponentially” and “quantum” are the only language hills I’d die on.
Why a quantum leap isn’t the length of an Ångstrom will always sadden me. I’m sure there are other scientific concepts you can use to describe a Great Leap Forward…
I think the Quantum Leap expression can also be understood as a "step" with no intermediate stages, i.e. very abrupt or transformative.
The more so that those things don't even figuratively set my mind on fire.
What about metaphorically?
the "technical note" link in the RLE bit vector section of the rust repo is broken (https://yuri.is/pdfing/weighted_range_quantile_queries.pdf 404s)
oh wait nvm just realised you linked a working archive link in your post... still worth updating the link in the repo for people who stumble upon it
Fixed, thanks!
These are the days I really love HN. Despite having been in this field for 30 some odd years, I'd never heard of "succinct data structures" until now. And had I not seen this post, maybe I never would have.
Is that important? Well, maybe. As I started digging in and looking for libraries that implement some of this stuff, I found that a popular graph processing library (JGraphT) uses a succinct data structure library (Sux4J) for working with large graphs. And working with graphs is something I've been doing a deep dive into lately. So yeah, if these data structures represent something that has practical applications in graph processing, then maybe finding this is important.
Certainly I'm glad I stumbled across this in either case; the topic strikes me as fascinating.
Note that succinct data structures may not be faster than conventional structures if your dataset fits in memory (http://www.cs.cmu.edu/~huanche1/slides/FST.pdf). Of course, for large datasets where storage access times dominate, succinct data structures win all around. In any case, succinct trees are works of art (if I recall correctly, https://arxiv.org/pdf/1805.11255 was a good exposition; just look at how that RMQ works)!
As an application grows the features start to interact. We tend to not be paying much attention to how the old code 'fits into memory' while we are writing new code that also has to fit into memory.
Once you have an entire system written the benefits of having four interacting features that each fit into a third of memory may be bigger than you think. And I don't know how well Intel's hardware level profiling information reveals that but I know for sure that the profiling tools that ship with most commercially viable programming languages never do. They blame issues on whoever touched the system last, and if that is an epicycle or a metastable situation then the apparent guilty party may be a frame-up, while the real culprit goes unpunished.
As an example: if your workflow has a step that eventually needs to use 40% of available memory to solve a problem, then GC will almost always trigger within that step of the process, rather than in the other 60% of memory being systematically wasted by heaps of inefficient code in the leadup to this step, because the top of the memory sawtooth will almost always occur within that part of the call graph. But because the 40% is your intrinsic complexity, people's brains shut off the moment a cost is attributed to unavoidable work, instead of the avoidable work that really caused the majority of the problem.
True, but it depends on what you mean by fitting in memory.
Succinct data structures are used in genomics (e.g. BWA, the MEGAHIT assembler) because N is so large that you're actually hitting asymptotic behavior.
For memory latency it can help by making your memory footprint fit in LLC or on a single node; cross-node NUMA latencies are typically enough to absolutely tank performance.
It can theoretically also help in massively parallel access situations where bandwidth becomes a limiting concern. Although, I intuit we would need near-memory hardware to perform the rank+select. Otherwise the latency of the multiple dependent accesses will kill your performance again, cf. the previous point.
With a lot of parallel accesses, bandwidth could also be an issue in conventional structures.
There's fits in memory, and there's fits in memory.
I have used various bit-packing schemes in order to keep data in memory. If you can keep your entire dataset in memory, it opens up different ways to tackle it. Succinct data structures look like a way to enable this.
It's not storage access times that kill you. You need to rearrange all your processing to work in batches or windows or use some kind of query API to do anything. Everything becomes much more painful.
I think it depends more on the ratio between access time and how often you use the data. Adding two arrays that fit in L1 is already limited by access time. On Zen3, we can add two 32-byte vectors per cycle, but only store one of these per cycle. For matrix multiplication, we can do the two additions per cycle (or really c <- a * b + c) because we have to do multiple operations once we have loaded the data into registers.
I can see it being useful for data sets of a few dozen MBs as well.
Memory is expensive. In the cloud especially. Using a succinct structure could enable cheaper computing for specific tasks. This benefits everyone.
I first heard of the concept of succinct data structures from Edward Kmett, a famous Haskeller behind many popular Haskell libraries. He gave a talk on succinct data structures a long time ago: http://youtu.be/uA0Z7_4J7u8
Ed is great. He's not just a Haskeller, but has also done interesting work in the likes of C++ and 6502 assembly amongst others.
His code on this seems to be https://hackage.haskell.org/package/structures
There are also HaskellWorks packages like https://hackage.haskell.org/package/hw-xml
He also has https://github.com/ekmett/succinct-binary
My Haskell attempt at Wavelet Matrices: https://github.com/jahaynes/waveletto
Whoa - I was there! This was at a small elm-lang meetup; Prezi hosted this while Evan was working there.
Thank you for sharing this link!
Way better detailed explanation: https://stackoverflow.com/questions/72580828/what-is-a-succi...
Yes, that's great. This part:
"Intuitively, a succinct data structure is one whose space usage equals the space needed to write out the data, plus something that grows more slowly than that. If you're familiar with little-o notation, a succinct data structure is one whose space usage is X + o(X), where X is the number of bits needed to write out the data itself."
Brings to mind COBS encoding, which does this for byte streams containing arbitrary-length "packets" or similar.
This is great! I love both how far you can push this and get meaningful improvements, and how it's totally overkill for anything we'll ever be able to implement on a physical computer. The hardware focused approach is to use popcnt for a base size of 512 (since cache architecture will make you fetch that much memory if you touch the original array anyway). We then can store 1 UInt16 prefix sum per 512 bits (n/32 bits overall), and if we have more than 2^16 bits in total, we can store UInt64 prefixes every 2^16 bits (n/1024 bits overall).
Theoretically, this approach uses O(n log n) bits as opposed to o(n) for the theoretical approach, but in practice, for <2^64 bools, the actual storage ends up being n/32 + n/1024, which is pretty hard to beat. The theoretical approach gets its wins from making extremely clever use of the difference between O(log log n) and O(1), but unfortunately for the foreseeable future, log n < 64 and log log n < 6, so all the subtlety gets swallowed up into the base case of a single popcnt instruction.
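To make the scheme above concrete, here is a rough Rust sketch (my own illustration, not code from any library mentioned in this thread): a u16 prefix count per 512-bit block, relative to its 2^16-bit superblock, plus a u64 absolute count per superblock, and a handful of popcnts to finish the query.

    struct RankBits {
        words: Vec<u64>,      // the raw bitvector, 64 bits per word
        block_rank: Vec<u16>, // rank within the superblock, one entry per 512-bit block
        super_rank: Vec<u64>, // absolute rank, one entry per 2^16-bit superblock
    }

    impl RankBits {
        fn new(words: Vec<u64>) -> Self {
            let (mut block_rank, mut super_rank) = (Vec::new(), Vec::new());
            let (mut total, mut in_super) = (0u64, 0u32);
            for (i, w) in words.iter().enumerate() {
                if i % 1024 == 0 {                    // 1024 words = 2^16 bits
                    super_rank.push(total);
                    in_super = 0;
                }
                if i % 8 == 0 {                       // 8 words = 512 bits
                    block_rank.push(in_super as u16); // at most 2^16 - 512 here, fits a u16
                }
                let ones = w.count_ones();
                total += ones as u64;
                in_super += ones;
            }
            RankBits { words, block_rank, super_rank }
        }

        // Number of set bits in positions [0, pos).
        fn rank1(&self, pos: usize) -> u64 {
            let (word, block, sup) = (pos / 64, pos / 512, pos / 65536);
            let mut r = self.super_rank[sup] + self.block_rank[block] as u64;
            // Whole words between the block boundary and pos: one popcnt each.
            for w in &self.words[block * 8..word] {
                r += w.count_ones() as u64;
            }
            // Partial word: mask off the bits at or above pos.
            let rem = pos % 64;
            if rem != 0 {
                r += (self.words[word] & ((1u64 << rem) - 1)).count_ones() as u64;
            }
            r
        }
    }

    fn main() {
        let bits = RankBits::new(vec![0b1011, u64::MAX, 0, 1]);
        assert_eq!(bits.rank1(4), 3);    // bits 0, 1, 3 of the first word
        assert_eq!(bits.rank1(128), 67); // 3 + 64
        println!("index: {} block u16s, {} superblock u64s",
                 bits.block_rank.len(), bits.super_rank.len());
    }

The index costs n/32 + n/1024 bits on top of the n data bits, matching the numbers above.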
This is great. A very understandable explanation, thanks for sharing!
That IS excellent - thank you
templatetypedef answers are the best answers. I know going in that (1) there are going to be surprising insights, and (2) I'm going to understand them fully.
Succinct data structures are very fun! If anyone is interested, I've implemented some of this in Zig: https://github.com/judofyr/zini. The main thing this implements is a minimal perfect hash function which uses less than 4 bits per element (and can respond to queries in ~50 ns). As a part of that I ended up implementing one of these indexes for constant-time select(n): https://github.com/judofyr/zini/blob/main/src/darray.zig.
It feels kinda magic implementing these algorithms because everything becomes so tiny!
For Java, C++, and Rust there is also https://sux.di.unimi.it/ maintained by Sebastiano Vigna, a professor from Italy.
Together with his student (I also helped a bit), he wrote a paper about RecSplit, a minimal perfect hash function (MPHF) algorithm I invented: https://arxiv.org/abs/1910.06416 - that algorithm uses around 1.56 bits per key, but it is quite slow. In January 2020 I presented the paper at a conference; that was right before the pandemic.
The algorithm with the least memory usage (and much faster as well) is now Consensus-RecSplit: https://arxiv.org/abs/2502.05613 - it can do 1.44 bits per key, which is right at the theoretical minimum (meaning, there is no good way to shrink it further). At least a small part of my original invention is still used there, nice. The fastest current MPHF is probably PtrHash https://arxiv.org/abs/2502.15539 - both papers were published last month (February 2025) by the way.
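For context on "right at the theoretical minimum": the known space lower bound for a minimal perfect hash function over n keys is about n log2(e) bits, i.e.

    \log_2 e = \frac{1}{\ln 2} \approx 1.4427 \text{ bits per key}

so 1.44 bits per key leaves essentially nothing to shave off.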
I'm working on making pthash faster and more practical. I can compile the data and code to C++, and efficiently store the keys as well, to be able to eliminate false positives.
https://github.com/rurban/pthash
Just spent my morning diving into succinct data structures after seeing this. The memory efficiency is incredible - especially the balanced parentheses tree representing a full node tree in just 2 bits per node! I've been working on a project parsing large (10GB+) XML files for scientific data analysis, and our current approach burns through RAM like crazy. Has anyone here successfully implemented these structures in production systems?
The wavelet matrix concept seems particularly promising for our text-heavy workloads. I'm curious if the constant-time operations actually hold up under real-world conditions or if there are hidden performance cliffs.
This feels like one of those CS concepts that should be more widely known but somehow got overlooked by mainstream programming. Kind of like how bloom filters were obscure until suddenly every system was using them.
When I’ve dealt with huge files in .NET, the usual approach is to stream the file such that only some of it is in memory at once. This way you can process files hundreds of GBs. Of course, if you truly need them all in memory at once for some reason I genuinely can’t think of, then you’d need something else.
Does your language have the concept of streaming files?
If you're streaming something row-based like a CSV, or a zipped CSV, then that's usually easy.
But when you get to hierarchical data structures like JSON/protobuf there very often simply isn't a streaming library available. There's a library function to decode the whole thing into an object in memory, and that's all.
Nothing prevents streaming in theory, it's just far more complicated to write that library.
protobuf sure, but streaming libraries for json (and xml, as in the parent) are extremely common. not harder (maybe even easier) than non-streaming to write, tho more cumbersome to use, so something you'd reach for only if you specifically need it ('cuz of memory constraints)
e.g. standard go json library https://pkg.go.dev/encoding/json#example-Decoder.Decode-Stre...
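And for anyone in Rust land, a comparable sketch using serde_json's StreamDeserializer, which pulls one top-level JSON value at a time off a reader (the Record type and file name here are made up for illustration):

    // Assumed Cargo.toml dependencies: serde (with "derive") and serde_json.
    use serde::Deserialize;
    use std::fs::File;
    use std::io::BufReader;

    #[derive(Debug, Deserialize)]
    struct Record {
        id: u64,
        name: String,
    }

    fn main() -> std::io::Result<()> {
        let reader = BufReader::new(File::open("records.json")?);
        // Iterates over a stream of whitespace-separated JSON values,
        // holding only one Record in memory at a time.
        for record in serde_json::Deserializer::from_reader(reader).into_iter::<Record>() {
            let record = record.expect("invalid JSON record");
            println!("{}: {}", record.id, record.name);
        }
        Ok(())
    }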
Yup. I don't remember streaming JSON being common in the early days but now it is. But the absence of streaming protobuf is what has killed me, when dealing with gigantic protobuf files from government agencies (ugh).
Heh, yeah. The protobuf people's expectation was if you had a really large dataset you'd wrap it up in your own mini-protocol of "sequence of protobuf messages". But of course that's way more friction, so in practice it will end up not getting done when it should be (plus also, it requires a certain amount of ability to predict the future).
Lesson for technologists: if you want to make the world a better place, arrange your tech such that the lowest-friction path is also the correct path.
(Another example: disastrous multi-byte UTF encodings [correct solution was more friction] vs basically successful UTF-8 [correct solution was less friction].)
I don't know if you're still dealing w/ this particular problem for protobufs, but based on my experience with thrift, a very similar library, there are probably some not too terrible ways you can kinda hack up the client-side parsing to be more streaming-ish...
nanopb is designed around streaming. It's limited in a few ways[1] but is designed for use on low-memory systems (microcontrollers) where the whole protobuf message won't necessarily fit into memory at once. Might not help for your use cases though, since it's a C library without a stable ABI.
[1] https://jpa.kapsi.fi/nanopb/docs/#features-and-limitations
In programming languages suitable for enterprise software development there are blessed streaming parsers for XML, because it's a rather common task.
It's very common that other programming languages have basic SAX parsers.
What are these languages that don't which you've encountered?
The low-hanging fruit in this area is to do partial unmarshalling or to use a SAX parser on a stream. It's likely you'll have to do this to retrieve data and put it in whatever succinct or otherwise efficient data structure you choose.
In Java, which I consider to have the best tooling for advanced XML applications, you'd look into JAXB on streams, StAX or SAX. On complicated and heavily nested XML it might take some effort and profiling to figure out the optimal state machines for exhaustive traversal, if that's what you do.
I'd also like to mention that XSLT is an often underappreciated approach.
There's a great blog post from the creator of RhymeBrain that talks about succinct tries: https://stevehanov.ca/blog/index.php?id=120
I'm pretty sure these were used to store the built in dictionaries on early mobile phones, especially for the implementation of T9 word and similar programs.
This might be a little over my head, but I'm not understanding how the balanced parentheses are conveying anything other than the topology of the tree structure. Are we not accounting for the bits required for a pointer in memory to an object? Or simply the bits required to guarantee the uniqueness of a node in the tree?
You store the structural information separately from the data. The data can be stored sequentially in some traversal order.
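As a tiny illustration of that separation (my own toy sketch in Rust, not anyone's library code): the parentheses carry only the topology, and the payloads live in a plain array in the same DFS order.

    struct Node {
        value: u32,
        children: Vec<Node>,
    }

    fn encode(node: &Node, parens: &mut Vec<bool>, values: &mut Vec<u32>) {
        parens.push(true);        // '(' = we enter this node
        values.push(node.value);  // data stored separately, in preorder
        for child in &node.children {
            encode(child, parens, values);
        }
        parens.push(false);       // ')' = we leave this node
    }

    fn main() {
        let tree = Node {
            value: 1,
            children: vec![
                Node { value: 2, children: vec![] },
                Node { value: 3, children: vec![] },
            ],
        };
        let (mut parens, mut values) = (Vec::new(), Vec::new());
        encode(&tree, &mut parens, &mut values);
        let rendered: String = parens.iter().map(|&b| if b { '(' } else { ')' }).collect();
        println!("{rendered} {:?}", values);
    }

For the three-node tree here this prints (()()) [1, 2, 3]: two structure bits per node, plus the data stored off to the side.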
He touches on indexes but doesn't really mention the implementation. This is about the primitives.
I really love this space: Navarro's book is an excellent survey.
Erik Demaine has a few great lectures on succinct data structures too: L17 and L18 on https://courses.csail.mit.edu/6.851/spring12/lectures/
There's a relative of this in making in-memory node representation efficient for large structs that have a bunch of rarely-used fields: chunk up memory in units (most reasonably 8 bytes/pointer size), allocate offsets for rarely-used fields in ascending order, and then use bitmasks to indicate which fields are present. (Note the bits correspond to units, not fields; a 16-byte field would use 2 adjacent bits in the bitmask.)
The trick is that masking & popcount (both low-cycle CPU instructions in most modern CPUs¹) make this quite fast to access and thus suitable for in-memory representation.
The intended use is when presence of optional fields is known at allocation time and doesn't change afterwards, i.e. some object is built up into a dynamically allocated buffer whose size is shrunk down by omitting fields. Changing which fields are present afterwards requires reallocating the thing, which tends to make the entire pattern pointless.
¹ the real annoyance here is that almost all x86_64 CPUs have POPCNT, except the absolute earliest ones, but if you build e.g. some package for a Linux distribution without any CPU optimization flags it'll use a library fallback popcount routine rather than the instruction :(
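A stripped-down Rust sketch of that pattern (my own simplification, with made-up names: every optional field takes exactly one 8-byte unit, so one mask bit per field; wider fields would simply own adjacent bits, as described above):

    struct SparseRecord {
        present: u64,    // bit i set => optional field i is present
        units: Vec<u64>, // one 8-byte unit per present field, in ascending field order
    }

    impl SparseRecord {
        fn get(&self, field: u32) -> Option<u64> {
            if self.present & (1u64 << field) == 0 {
                return None;
            }
            // Popcount of the mask bits below `field` = index of this field's unit.
            let below = self.present & ((1u64 << field) - 1);
            Some(self.units[below.count_ones() as usize])
        }
    }

    fn main() {
        // Fields 1 and 4 present, everything else omitted entirely.
        let rec = SparseRecord { present: (1 << 1) | (1 << 4), units: vec![111, 444] };
        assert_eq!(rec.get(1), Some(111));
        assert_eq!(rec.get(4), Some(444));
        assert_eq!(rec.get(2), None);
        println!("ok");
    }

The masked count_ones() call is the whole trick, and it is exactly the popcount the footnote is talking about.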
thankfully a number of distros are starting to ship packages for x86v2 by default (basically everything Core 2 and newer) which fixes this finally.
I really like the article, but it would benefit from some numbers or complexity estimates to get some intuitive sense of what the cost is.
Am I paying 30% overhead for this particular index or that wavelet matrix? Is it double the memory use? Or is it O(log N)? No idea! "doesn't use much more space" could mean a lot of different things!
"Succinct data structure" does have a very strict definition which probably answers some of your questions: https://en.wikipedia.org/wiki/Succinct_data_structure. It's all about being close to the theoretical minimum.
> Am I paying 30% overhead for this particular index or that wavelet matrix?
Nope! That would not fit the definition. That would be a "compact data structure" according to this definition.
You should be careful when using asymptotic bounds with more precision than about O(sqrt(n)). The bounds ignore constant factors, and the constant factors could be more significant than slowly growing non-constant factors for reasonable values of n.
It's also very common in algorithm design that improving the asymptotic bounds and making the algorithm faster (or the data structure smaller) are orthogonal (or even opposite) goals. Real computers have complex performance characteristics and fixed word lengths, and it's rarely a good idea to implement a theoretical algorithm exactly as described.
Succinct data structures often have a number of internal parameters. In a theoretical parameterization, those parameters may be described as being O(log n), O(log^2 n), or O(log log n). In a concrete implementation, it may be a good idea to use constant approximations for some nontrivial (such as 2x) performance gains. O(log n) could become 32 or 64, O(log^2 n) could become 1024 (or a power-of-2 multiple), and O(log log n) could become 4 or 8.
And then, if you parameterize the succinct data structure with these constants, the space overhead becomes a constant fraction.
Succinct data structures require the extra space (above the information-theoretical minimum) to be an additive term of o(n) bits (little O, not big O). That just means that the extra space grows more slowly than n, as n approaches infinity, so their ratio (extra space)/n approaches 0 in the limit.
That's a simplification. Succinct data structures are usually parameterized data structures. They have a number of parameters, such as block sizes and sampling rates, that govern the space overhead and query performance. The published version may use a parameterization that makes the space overhead sublinear while guaranteeing attractive asymptotic bounds for worst-case performance. But if you actually implement that, the performance is likely terrible.
Consider the rank data structure for bitvectors that is supposed to guarantee O(1) time queries with o(n) bits of space overhead. The bitvector is divided into superblocks of b1 bits and blocks of b2 bits. The top-level index, which stores the rank up to each superblock, uses O(n log n / b1) bits of space. The second-level index, which stores the rank within the superblock up to each block, uses O(n log b1 / b2) bits of space. And then you need to do linear search within the block, which is typically assumed to take O(b2 / log n) or O(b2 / w) time, where w is word size. If you choose b1 = log^2 n and b2 = log n, you get O(1) time with O(n log log n / log n) bits of space overhead. Which is technically sublinear but effectively indistinguishable from linear with realistic values of n.
Real-world implementations use constant values for the parameters. Typically the goal is to make the blocks and indexes align with cache lines and to replace arbitrary divisions with divisions by compile-time power-of-2 constants. Some values I've seen are (b1 = 512, b2 = 64) and (b1 = 65536, b2 = 512). In both cases, the overhead is linear.
And sometimes the implemented data structure is a simplification, because the nominally succinct data structure is too large and slow. For example, it's rare to see actual O(1)-time implementations of select on bitvectors. That would require three levels of indexes with many special cases. It's more common to use two levels of indexes, with queries that are almost always constant-time but have (poly)logarithmic worst cases with adversarial inputs.
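A back-of-the-envelope number for the constant parameterizations a couple of paragraphs up, assuming one 64-bit absolute counter per superblock and one ceil(log2 b1)-bit relative counter per block (real layouts pack things tighter than this):

    \text{overhead}(b_1, b_2) \approx \frac{64}{b_1} + \frac{\lceil \log_2 b_1 \rceil}{b_2}

    (b_1, b_2) = (512, 64):\quad 64/512 + 9/64 \approx 26.6\%
    (b_1, b_2) = (65536, 512):\quad 64/65536 + 16/512 \approx 3.2\%

Either way the extra bits are a fixed fraction of n, which is the sense in which the overhead is linear.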
This is really good information! Thanks for writing it up.
Honestly, I never actually "trust" the complexity analysis. Whenever I find a paper I immediately look for their specific results on an experiment, and if I can't find that I will assume the paper is only of theoretical interest. There's of course still a lot to learn from a purely theoretical paper, but I've always ended up being disappointed when I've implemented something which only had "good" asymptotic bounds.
You're absolutely right and my response completely missed your point, thanks for clarifying further.
Funny, I recently independently reinvented [1] the "balanced parentheses tree" and was very satisfied with its flexibility and speed: it seemed so simple and general I was surprised not to already know it by name. Now I do!
[1] https://pkg.go.dev/golang.org/x/tools/internal/astutil/curso...
My goto library for succinct data structures is SDSL-Lite [0].
[0] https://github.com/simongog/sdsl-lite
I'm sure it's out of date in some areas by now, but I have Navarro's textbook[1] and it's a great survey. (The only criticism that comes to mind is that it weirdly shortchanges Elias-Fano encoding[2], which is hugely important in practice but relegated to just an offhand sentence or two in the book.)
[1] https://www.cambridge.org/core/books/compact-data-structures...
[2] https://vigna.di.unimi.it/ftp/papers/QuasiSuccinctIndices.pd...
As an old-school signal processing person, "wavelet" and "FM" are throwing me for a loop. I can see FM is named for the authors. "wavelet" - I don't see at first glance what the name means or if it's related to the signal processing concept.
Yes, the 'wavelet tree' (and other wavelet thingies in succinct data structures) are rather unfortunately named. It ranks right up there with 'wavefunction collapse' in procedural generation.
Marisa trie is a really cool and useful succinct data structure (also mentioned in High Performance Python book): https://github.com/pytries/marisa-trie
They publish a benchmark for anyone interested: https://marisa-trie.readthedocs.io/en/latest/benchmarks.html
Summary for storing a list of 3M Russian words:
- ~60x less memory
- ~5-10x slower compared to hashmap
I also find Succinct Data Structures fascinating. I own Navarro's text book, and it's one of my favorite books to take to the coffee shop and try to understand. I also have a copy of Jacobson's Thesis.
Since information is rather scarce, here's my shameless plug for a related blog post I made, which includes a few links at the end to more resources https://denvaar.dev/posts/jacobsons_rank.html
> Ignorance can bring you far.
See also: the undergrad who found a breakthrough in hash tables recently and wasn't put off by a long-standing conjecture about what the bound on their performance was because he simply wasn't aware of it
As an aside, an FM index can be used to efficiently turn an LLM into an actual stochastic parrot (one that emits only substrings of some dataset). This is more useful than it sounds because you can use it for quoting from large corpora.
That tree format (using parentheses) is also known as Newick format: https://en.wikipedia.org/wiki/Newick_format
But you can also simply store the tree structure as a single array of ints (or longs) ... each node in the tree corresponds to a unique index position, and the value at that position is the index position of the node's parent (or its own position if it's a top-level node) ... good stuff.
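A tiny Rust sketch of that parent-array layout (my own toy example):

    fn main() {
        // Tree:   0
        //        / \
        //       1   2
        //           |
        //           3
        // Node i's entry holds its parent's index; the root points to itself.
        let parent = [0usize, 0, 0, 2];

        // Walk from a node up to the root.
        let mut node = 3;
        while parent[node] != node {
            print!("{} -> ", node);
            node = parent[node];
        }
        println!("{} (root)", node);

        // Children can be recovered with a single pass if needed.
        let mut children = vec![Vec::new(); parent.len()];
        for (i, &p) in parent.iter().enumerate() {
            if p != i {
                children[p].push(i);
            }
        }
        println!("{:?}", children); // [[1, 2], [], [3], []]
    }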
With advanced CS topics, it often works to search for "<topic> lecture notes".
How do succinct data structures do with vector operations on cpu?
Not sure if they are succinct, but the Apache arrow format encodes data in several ways that is compact in memory but also allows operations on these structures.
Wow, this is really fascinating. I guess it all comes down to how it's doing select and rank in constant time, which is probably some clever bit arithmetic. I'll have to look into how that works.
I can speak a bit about one of the approaches: "Practical Entropy-Compressed Rank/Select Dictionary" by Daisuke Okanohara and Kunihiko Sadakane. This presents two different variants: one for dense bitsets (i.e. more than 50% of the bits are set) and one for sparse ones.
The dense implementation is based around partitioning the bits into "blocks" of a given size, so that you can find the block by doing `block[idx / block_size]`. It then also groups each block into sub-blocks, which helps you even further. All of these are additional data structures (very much like an index) which are stored next to the regular bitset. You use the blocks/sub-blocks to find roughly where in the bitset you are and then use an algorithm for finding the value in a given machine word.
The sparse implementation treats the bitset as an ordered list of numbers (e.g. 100001001 is interpreted as 0, 5, 8) and then stores those numbers using Elias-Fano encoding. The higher bits of the Elias-Fano encoding happen to be dense, and hence we can use the previous dense implementation on those higher bits and then combine it with the lower bits.
I'm also aware of https://arxiv.org/abs/1706.00990 which is more about how to do this most efficiently at a machine-word level.
"Engineering Compact Data Structures for Rank and Select Queries on Bit Vectors" is another quite recent paper which I haven't fully digested yet.
Some of it is moving or has moved down to the instruction sets: https://vaibhavsagar.com/blog/2019/09/08/popcount/
One famously fun paper is "The LCA problem revisited"
https://ics.uci.edu/~eppstein/261/BenFar-LCA-00.pdf
For those who can't read, I recommend this talk about it:
https://youtu.be/4NXJm2T9Yks
Maybe a silly question, but has anyone used these in production? Or used libraries in production which are built on these structures?
I'm imagining a meeting about some project design, and thinking about how it'd go if someone suggested using parentheses to represent nodes of a tree. I imagine it'd get written off quickly. Not because it wouldn't work, but because of the complexity and learning curve involved.
The most emblematic application in real life is in bioinformatics. BWA and Bowtie are two widely used tools built upon them.
Why are you having project design meetings about details as low level as the in-memory representation of data in a program?
> This is a field that seems to have emerged in computer science relatively recently; many of the important data structures were invented in the last 25 years.
This is crazy!
Are succinct data structures good for "dynamic" applications, i.e. modifying the data often? Last time I looked into this field it seemed that all the popular structures were very static in nature and had costly updates.
Maybe there are more modern, advanced data structures / techniques, but yes, my understanding is that this is a common trade off (having your data structure be more static, with offline updates).
The word count seems artificially increased in the post. Here's a succinct explanation: https://www.eecs.tufts.edu/~aloupis/comp150/projects/Succinc...
I didn't think that at all. In fact I found it very readable and prefer it over the drier presentation you linked. To each its own, I guess, but there's really no need to infer ulterior motives.
Fascinating topic!
I've always gotten the 'ick' from XML/json etc. This is like my personal anti-nausea pill.