I also emailed Gonzalo Navarro once to ask a question, and we had a great discussion and ended up writing a paper together about the answer. [1]
Another paper of his that I really like combines a few elegant ideas into a simple implementation of bitvector rank/select: https://users.dcc.uchile.cl/~gnavarro/ps/sea12.1.pdf
During this time I got really excited about succinct data structures and wrote a Rust library implementing many bitvector types and a wavelet matrix. [2]
My interest came from a data visualization perspective -- I was curious if space-efficient data structures could fundamentally improve the interactive exploration of large datasets on the client side. Happy to chat about that if anyone's curious.
[1] Paper: https://archive.yuri.is/pdfing/weighted_range_quantile_queri... though it's pretty hard to understand without some background context. I've been meaning to write a blog post explaining the core contribution, which is a simple tweak to one of Navarro's textbook data structures.
[2] The Rust version is here: https://github.com/yurivish/made-of-bits/tree/main/rust-play... and an earlier pure-JS implementation is here: https://github.com/yurivish/made-of-bits/tree/main
Reading a Gonzalo Navarro paper is like going for a walk, taking a shower, and having a wonderful coffee. It literally sets the mind on fire.
https://dblp.org/pid/n/GonzaloNavarro.html
Well not literally.
Many dictionaries now list one common use of “literally” as meaning “figuratively, with emphasis”. So literally officially sometimes now literally means figuratively.
I suspect some people are literally having conniption fits about this…
I’m sorry, but your comment mixes two different types of dictionaries. You talk about “official” meanings which would be a prescriptive dictionary telling you the way you are allowed to use a word. But the dictionaries that include “figuratively” in their definitions are clearly descriptive, presenting all the ways words are commonly used.
You can’t take a descriptive dictionary and then claim it is prescriptive.
There are no prescriptive dictionaries, at least not correct ones, for living languages.
IIRC both the OED and CED list figurative uses for the word; do you know any publications considered more authoritative than those for English? Webster too, for those who prefer simplified English.
I think French has prescriptive dictionaries (to varying degrees of success)
They have the Académie Française, which intends to control the language to an extent, in recent times focussing a lot on resisting the encroachment of English words and phrases, but IIRC their recommendations don't carry as much weight as many think and are often ignored even by government departments and other official French bodies.
The Académie do publish a dictionary every few decades though (there was a new edition recently), so there is a prescriptive dictionary for French, even though it carries little weight in reality.
French is the only living language to attempt it to this extent, though the existence of one is enough to make my “there are none for living languages” point incorrect. It is difficult to pin a language down until no one really speaks it day-to-day (so it doesn't evolve at the rates commonly used languages do).
https://en.wikipedia.org/wiki/Linguistic_prescription#Formal...
Very few of those have official force or cover much more than a subset of language properties (e.g. spelling rules), but definitely more than the "none" of my original assertion.
"prescriptive" does not mean "have legal force" though...
The Duden is prescriptive for German AFAIK.
Isn't this more of a cultural thing, that Germans seem to agree that it is authoritative and use it as a reference?
I'm not sure what would even make a dictionary prescriptive other than an explicit declaration that it is so or, ridiculously, a law declaring the same.
> There are no prescriptive dictionaries, at least not correct ones, for living languages.
there are no 100% correct descriptive dictionaries. Any prescriptive dictionary is automatically correct.
> Any prescriptive dictionary is automatically correct.
… in the view of their compilers.
I could write a prescriptive dictionary far more easily than I could get others to accept it as correct.
If you write a prescriptive dictionary it is correct because you are dictating the norms not describing what is real.
Yes you would have to be involved with a regulatory institution first
I'm sorry, can you point to such a prescriptive dictionary? People can talk however they please, and dictionaries are tasked with keeping up with the vernacular.
The "literally" ship sailed centuries ago. Sorry, but that battle has been lost. Even so-called "prescriptive" dictionaries would be categorically incorrect if they ignore nearly three centuries of common vernacular.
There aren’t prescriptive dictionaries for (American, at least) English.
But “official” is defined in descriptive dictionaries to include descriptive dictionaries.
Well, literally doesn't mean literally anymore--literally.
It never has, it always will. We've already lost a host of words that meant "I'm not exaggerating, I actually mean it": "really", "very", etc. I'm going to keep up the fight.
Since there are _literally_ people who use, and have been using for a while, the word without the same exact meaning as we both agree on... well.
Having said that, I will join you in this fight.
See also: exponentially.
Language is defined by its speakers, as basically a "vote". I'm going to keep voting for "literally" meaning "this actually happened" as long as it's practical, because 1) there are dozens of other ways to emphasize something 2) we need some way to say "this is not an exaggeration".
“Exponentially” and “quantum” are the only language hills I’d die on.
Why a quantum leap isn’t the length of an Ångstrom will always sadden me. I’m sure there are other scientific concepts you can use to describe a Great Leap Forward…
I think the Quantum Leap expression can also be understood as a "step" with no intermediate stages, i.e. very abrupt or transformative.
The more so that those things don't even figuratively set my mind on fire.
What about metaphorically?
the "technical note" link in the RLE bit vector section of the rust repo is broken (https://yuri.is/pdfing/weighted_range_quantile_queries.pdf 404s)
oh wait nvm just realised you linked a working archive link in your post... still worth updating the link in the repo for people who stumble upon it
Fixed, thanks!
These are the days I really love HN. Despite having been in this field for 30 some odd years, I'd never heard of "succinct data structures" until now. And had I not seen this post, maybe I never would have.
Is that important? Well, maybe. As I started digging in and looking for libraries that implement some of this stuff, I found that a popular graph processing library (JGraphT) uses a succinct data structure library (Sux4J) for working with large graphs. And working with graphs is something I've been doing a deep dive into lately. So yeah, if these data structures represent something that has practical applications in graph processing, then maybe finding this is important.
Certainly I'm glad I stumbled across this in either case; the topic strikes me as fascinating.
Note that succinct data structures may not be faster than conventional structures if your dataset fits in memory (http://www.cs.cmu.edu/~huanche1/slides/FST.pdf). Of course, for large datasets where storage access times dominate, succinct data structures win all around. In any case, succinct trees are works of art (if I recall correctly, https://arxiv.org/pdf/1805.11255 was a good exposition; just look at how that RMQ works)!
As an application grows the features start to interact. We tend to not be paying much attention to how the old code 'fits into memory' while we are writing new code that also has to fit into memory.
Once you have an entire system written the benefits of having four interacting features that each fit into a third of memory may be bigger than you think. And I don't know how well Intel's hardware level profiling information reveals that but I know for sure that the profiling tools that ship with most commercially viable programming languages never do. They blame issues on whoever touched the system last, and if that is an epicycle or a metastable situation then the apparent guilty party may be a frame-up, while the real culprit goes unpunished.
As an example: if your workflow has a step that eventually needs to use 40% of available memory to solve a problem, then GC will almost always trigger within that step of the process, rather than in the other 60% of memory being systematically wasted by heaps of inefficient code in the leadup to this step, because the top of the memory sawtooth will almost always occur within that part of the call graph. But because the 40% is your intrinsic complexity, people's brains shut off the moment a cost is attributed to unavoidable work, instead of the avoidable work that really caused the majority of the problem.
True, but it depends on what you mean by fitting in memory.
Succinct data structures are used in genomics (e.g. BWA, the MEGAHIT assembler) because N is so large that you're actually hitting asymptotic behavior.
For memory latency it can help by making your memory footprint fit in LLC or on a single node; cross-node NUMA latencies are typically enough to absolutely tank performance.
It can theoretically also help in massively parallel access situations where bandwidth becomes a limiting concern. Although, I intuit we would need near-memory hardware to perform the rank+select. Otherwise the latency of the multiple dependent accesses will kill your performance again, cf. the previous point.
With a lot of parallel accesses, bandwidth could also be an issue in conventional structures.
There's fits in memory, and there's fits in memory.
I have used various bit-packing schemes in order to keep data in memory. If you can keep your entire dataset in memory, it opens up different ways to tackle it. Succinct data structures look like a way to enable this.
It's not storage access times that kill you. You need to rearrange all your processing to work in batches or windows or use some kind of query API to do anything. Everything becomes much more painful.
I think it depends more on the ratio between access time and how often you use the data. Adding two arrays that fit in L1 is already limited by access time. On Zen3, we can add two 32-byte vectors per cycle, but only store one of these per cycle. For matrix multiplication, we can do the two additions per cycle (or really c <- a * b + c) because we have to do multiple operations once we have loaded the data into registers.
I can see it being useful for data sets of a few dozen MBs as well.
Memory is expensive. In the cloud especially. Using a succinct structure could enable cheaper computing for specific tasks. This benefits everyone.
I first heard of the concept of succinct data structures from Edward Kmett, a famous Haskeller behind many popular Haskell libraries. He gave a talk on succinct data structures a long time ago: http://youtu.be/uA0Z7_4J7u8
Ed is great. He's not just a Haskeller, but has also done interesting work in the likes of C++ and 6502 assembly amongst others.
His code on this seems to be https://hackage.haskell.org/package/structures
There are also HaskellWorks packages like https://hackage.haskell.org/package/hw-xml
He also has https://github.com/ekmett/succinct-binary
My Haskell attempt at Wavelet Matrices: https://github.com/jahaynes/waveletto
Whoa - I was there! This was at a small elm-lang meetup; Prezi hosted this while Evan was working there.
Thank you for sharing this link!
Way better detailed explanation: https://stackoverflow.com/questions/72580828/what-is-a-succi...
Yes, that's great. This part:
"Intuitively, a succinct data structure is one whose space usage equals the space needed to write out the data, plus something that grows more slowly than that. If you're familiar with little-o notation, a succinct data structure is one whose space usage is X + o(X), where X is the number of bits needed to write out the data itself."
Brings to mind COBS encoding, which does this for byte streams containing arbitrary-length "packets" or similar.
This is great! I love both how far you can push this and get meaningful improvements, and how it's totally overkill for anything we'll ever be able to implement on a physical computer. The hardware focused approach is to use popcnt for a base size of 512 (since cache architecture will make you fetch that much memory if you touch the original array anyway). We then can store 1 UInt16 prefix sum per 512 bits (n/32 bits overall), and if we have more than 2^16 bits in total, we can store UInt64 prefixes every 2^16 bits (n/1024 bits overall).
Theoretically, this approach uses O(n log n) bits as opposed to o(n) for the theoretical approach, but in practice, for <2^64 bools, the actual storage ends up being n/32 + n/1024, which is pretty hard to beat. The theoretical approach gets its wins from making extremely clever use of the difference between O(log log n) and O(1), but unfortunately for the foreseeable future, log n < 64 and log log n < 6, so all the subtlety gets swallowed up into the base case of a single popcnt instruction.
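To make the scheme above concrete, here is a rough Rust sketch (my own illustration, not code from any library mentioned in this thread): a u16 prefix count per 512-bit block, relative to its 2^16-bit superblock, plus a u64 absolute count per superblock, and a handful of popcnts to finish the query.

    struct RankBits {
        words: Vec<u64>,      // the raw bitvector, 64 bits per word
        block_rank: Vec<u16>, // rank within the superblock, one entry per 512-bit block
        super_rank: Vec<u64>, // absolute rank, one entry per 2^16-bit superblock
    }

    impl RankBits {
        fn new(words: Vec<u64>) -> Self {
            let (mut block_rank, mut super_rank) = (Vec::new(), Vec::new());
            let (mut total, mut in_super) = (0u64, 0u32);
            for (i, w) in words.iter().enumerate() {
                if i % 1024 == 0 {                    // 1024 words = 2^16 bits
                    super_rank.push(total);
                    in_super = 0;
                }
                if i % 8 == 0 {                       // 8 words = 512 bits
                    block_rank.push(in_super as u16); // at most 2^16 - 512 here, fits a u16
                }
                let ones = w.count_ones();
                total += ones as u64;
                in_super += ones;
            }
            RankBits { words, block_rank, super_rank }
        }

        // Number of set bits in positions [0, pos).
        fn rank1(&self, pos: usize) -> u64 {
            let (word, block, sup) = (pos / 64, pos / 512, pos / 65536);
            let mut r = self.super_rank[sup] + self.block_rank[block] as u64;
            // Whole words between the block boundary and pos: one popcnt each.
            for w in &self.words[block * 8..word] {
                r += w.count_ones() as u64;
            }
            // Partial word: mask off the bits at or above pos.
            let rem = pos % 64;
            if rem != 0 {
                r += (self.words[word] & ((1u64 << rem) - 1)).count_ones() as u64;
            }
            r
        }
    }

    fn main() {
        let bits = RankBits::new(vec![0b1011, u64::MAX, 0, 1]);
        assert_eq!(bits.rank1(4), 3);    // bits 0, 1, 3 of the first word
        assert_eq!(bits.rank1(128), 67); // 3 + 64
        println!("index: {} block u16s, {} superblock u64s",
                 bits.block_rank.len(), bits.super_rank.len());
    }

The index costs n/32 + n/1024 bits on top of the n data bits, matching the numbers above.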
This is great. A very understandable explanation, thanks for sharing!
That IS excellent - thank you
templatetypedef answers are the best answers. I know going in that (1) there are going to be surprising insights, and (2) I'm going to understand them fully.
Succinct data structures are very fun! If anyone is interested, I've implemented some of this in Zig: https://github.com/judofyr/zini. The main thing this implements is a minimal perfect hash function which uses less than 4 bits per element (and can respond to queries in ~50 ns). As a part of that I ended up implementing one of these indexes for constant-time select(n): https://github.com/judofyr/zini/blob/main/src/darray.zig.
It feels kinda magic implementing these algorithms because everything becomes so tiny!
For Java, C++, and Rust there is also https://sux.di.unimi.it/ maintained by Sebastiano Vigna, a professor from Italy.
Together with his student (I also helped a bit), he wrote a paper about RecSplit, a minimal perfect hash function (MPHF) algorithm I invented: https://arxiv.org/abs/1910.06416 - that algorithm uses around 1.56 bits per key, but it is quite slow. In January 2020 I presented the paper at a conference; that was right before the pandemic.
The algorithm with the least memory usage (and much faster as well) is now Consensus-RecSplit: https://arxiv.org/abs/2502.05613 - it can do 1.44 bits per key, which is right at the theoretical minimum (meaning, there is no good way to shrink it further). At least a small part of my original invention is still used there, nice. The fastest current MPHF is probably PtrHash https://arxiv.org/abs/2502.15539 - both papers were published last month (February 2025) by the way.
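For context on "right at the theoretical minimum": the known space lower bound for a minimal perfect hash function over n keys is about n log2(e) bits, i.e.

    \log_2 e = \frac{1}{\ln 2} \approx 1.4427 \text{ bits per key}

so 1.44 bits per key leaves essentially nothing to shave off.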
I'm working on making pthash faster and more practical. I can compile the data and code to C++, and efficiently store the keys as well, to be able to eliminate false positives.
https://github.com/rurban/pthash
Just spent my morning diving into succinct data structures after seeing this. The memory efficiency is incredible - especially the balanced parentheses tree representing a full node tree in just 2 bits per node! I've been working on a project parsing large (10GB+) XML files for scientific data analysis, and our current approach burns through RAM like crazy. Has anyone here successfully implemented these structures in production systems?
The wavelet matrix concept seems particularly promising for our text-heavy workloads. I'm curious if the constant-time operations actually hold up under real-world conditions or if there are hidden performance cliffs.
This feels like one of those CS concepts that should be more widely known but somehow got overlooked by mainstream programming. Kind of like how bloom filters were obscure until suddenly every system was using them.
When I’ve dealt with huge files in .NET, the usual approach is to stream the file such that only some of it is in memory at once. This way you can process files hundreds of GBs. Of course, if you truly need them all in memory at once for some reason I genuinely can’t think of, then you’d need something else.
Does your language have the concept of streaming files?
If you're streaming something row-based like a CSV, or a zipped CSV, then that's usually easy.
But when you get to hierarchical data structures like JSON/protobuf there very often simply isn't a streaming library available. There's a library function to decode the whole thing into an object in memory, and that's all.
Nothing prevents streaming in theory, it's just far more complicated to write that library.
protobuf sure, but streaming libraries for json (and xml, as in the parent) are extremely common. not harder (maybe even easier) than non-streaming to write, tho more cumbersome to use, so something you'd reach for only if you specifically need it ('cuz of memory constraints)
e.g. standard go json library https://pkg.go.dev/encoding/json#example-Decoder.Decode-Stre...
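And for anyone in Rust land, a comparable sketch using serde_json's StreamDeserializer, which pulls one top-level JSON value at a time off a reader (the Record type and file name here are made up for illustration):

    // Assumed Cargo.toml dependencies: serde (with "derive") and serde_json.
    use serde::Deserialize;
    use std::fs::File;
    use std::io::BufReader;

    #[derive(Debug, Deserialize)]
    struct Record {
        id: u64,
        name: String,
    }

    fn main() -> std::io::Result<()> {
        let reader = BufReader::new(File::open("records.json")?);
        // Iterates over a stream of whitespace-separated JSON values,
        // holding only one Record in memory at a time.
        for record in serde_json::Deserializer::from_reader(reader).into_iter::<Record>() {
            let record = record.expect("invalid JSON record");
            println!("{}: {}", record.id, record.name);
        }
        Ok(())
    }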
Yup. I don't remember streaming JSON being common in the early days but now it is. But the absence of streaming protobuf is what has killed me, when dealing with gigantic protobuf files from government agencies (ugh).
Heh, yeah. The protobuf people's expectation was if you had a really large dataset you'd wrap it up in your own mini-protocol of "sequence of protobuf messages". But of course that's way more friction, so in practice it will end up not getting done when it should be (plus also, it requires a certain amount of ability to predict the future).
Lesson for technologists: if you want to make the world a better place, arrange your tech such that the lowest-friction path is also the correct path.
(Another example: disastrous multi-byte UTF encodings [correct solution was more friction] vs basically successful UTF-8 [correct solution was less friction].)
I don't know if you're still dealing w/ this particular problem for protobufs, but based on my experience with thrift, a very similar library, there are probably some not too terrible ways you can kinda hack up the client-side parsing to be more streaming-ish...
nanopb is designed around streaming. It's limited in a few ways[1] but is designed for use on low-memory systems (microcontrollers) where the whole protobuf message won't necessarily fit into memory at once. Might not help for your use cases though, since it's a C library without a stable ABI.
[1] https://jpa.kapsi.fi/nanopb/docs/#features-and-limitations
In programming languages suitable for enterprise software development there are blessed streaming parsers for XML, because it's a rather common task.
It's very common that other programming languages have basic SAX parsers.
What are these languages that don't which you've encountered?
The low-hanging fruit in this area is to do partial unmarshalling or to use a SAX parser on a stream. It's likely you'll have to do this to retrieve data and put it in whatever succinct or otherwise efficient data structure you choose.
In Java, which I consider to have the best tooling for advanced XML applications, you'd look into JAXB on streams, StAX or SAX. On complicated and heavily nested XML it might take some effort and profiling to figure out the optimal state machines for exhaustive traversal, if that's what you do.
I'd also like to mention that XSLT is an often underappreciated approach.
There's a great blog post from the creator of RhymeBrain that talks about succinct tries: https://stevehanov.ca/blog/index.php?id=120
I'm pretty sure these were used to store the built in dictionaries on early mobile phones, especially for the implementation of T9 word and similar programs.
This might be a little over my head, but I'm not understanding how the balanced parentheses are conveying anything other than the topology of the tree structure. Are we not accounting for the bits required for a pointer in memory to an object? Or simply the bits required to guarantee the uniqueness of a node in the tree?
You store the structural information separately from the data. The data can be stored sequentially in some traversal order.
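As a tiny illustration of that separation (my own toy sketch in Rust, not anyone's library code): the parentheses carry only the topology, and the payloads live in a plain array in the same DFS order.

    struct Node {
        value: u32,
        children: Vec<Node>,
    }

    fn encode(node: &Node, parens: &mut Vec<bool>, values: &mut Vec<u32>) {
        parens.push(true);        // '(' = we enter this node
        values.push(node.value);  // data stored separately, in preorder
        for child in &node.children {
            encode(child, parens, values);
        }
        parens.push(false);       // ')' = we leave this node
    }

    fn main() {
        let tree = Node {
            value: 1,
            children: vec![
                Node { value: 2, children: vec![] },
                Node { value: 3, children: vec![] },
            ],
        };
        let (mut parens, mut values) = (Vec::new(), Vec::new());
        encode(&tree, &mut parens, &mut values);
        let rendered: String = parens.iter().map(|&b| if b { '(' } else { ')' }).collect();
        println!("{rendered} {:?}", values);
    }

For the three-node tree here this prints (()()) [1, 2, 3]: two structure bits per node, plus the data stored off to the side.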
He touches on indexes but doesn't really mention the implementation. This is about the primitives.
I really love this space: Navarro's book is an excellent survey.
Erik Demaine has a few great lectures on succinct data structures too: L17 and L18 on https://courses.csail.mit.edu/6.851/spring12/lectures/
There's a relative of this in making in-memory node representation efficient for large structs that have a bunch of rarely-used fields: chunk up memory in units (most reasonably 8 bytes/pointer size), allocate offsets for rarely-used fields in ascending order, and then use bitmasks to indicate which fields are present. (Note the bits correspond to units, not fields; a 16-byte field would use 2 adjacent bits in the bitmask.)
The trick is that masking & popcount (both low-cycle CPU instructions in most modern CPUs¹) make this quite fast to access and thus suitable for in-memory representation.
The intended use is when presence of optional fields is known at allocation time and doesn't change afterwards, i.e. some object is built up into a dynamically allocated buffer whose size is shrunk down by omitting fields. Changing which fields are present afterwards requires reallocating the thing, which tends to make the entire pattern pointless.
¹ the real annoyance here is that almost all x86_64 CPUs have POPCNT, except the absolute earliest ones, but if you build e.g. some package for a Linux distribution without any CPU optimization flags it'll use a library fallback popcount routine rather than the instruction :(
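A stripped-down Rust sketch of that pattern (my own simplification, with made-up names: every optional field takes exactly one 8-byte unit, so one mask bit per field; wider fields would simply own adjacent bits, as described above):

    struct SparseRecord {
        present: u64,    // bit i set => optional field i is present
        units: Vec<u64>, // one 8-byte unit per present field, in ascending field order
    }

    impl SparseRecord {
        fn get(&self, field: u32) -> Option<u64> {
            if self.present & (1u64 << field) == 0 {
                return None;
            }
            // Popcount of the mask bits below `field` = index of this field's unit.
            let below = self.present & ((1u64 << field) - 1);
            Some(self.units[below.count_ones() as usize])
        }
    }

    fn main() {
        // Fields 1 and 4 present, everything else omitted entirely.
        let rec = SparseRecord { present: (1 << 1) | (1 << 4), units: vec![111, 444] };
        assert_eq!(rec.get(1), Some(111));
        assert_eq!(rec.get(4), Some(444));
        assert_eq!(rec.get(2), None);
        println!("ok");
    }

The masked count_ones() call is the whole trick, and it is exactly the popcount the footnote is talking about.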
thankfully a number of distros are starting to ship packages for x86v2 by default (basically everything Core 2 and newer) which fixes this finally.
I really like the article, but it would benefit from some numbers or complexity estimates to get some intuitive sense of what the cost is.
Am I paying 30% overhead for this particular index or that wavelet matrix? Is it double the memory use? Or is it O(log N)? No idea! "doesn't use much more space" could mean a lot of different things!
"Succinct data structure" does have a very strict definition which probably answers some of your questions: https://en.wikipedia.org/wiki/Succinct_data_structure. It's all about being close to the theoretical minimum.
> Am I paying 30% overhead for this particular index or that wavelet matrix?
Nope! That would not fit the definition. That would be a "compact data structure" according to this definition.
You should be careful when using asymptotic bounds with more precision than about O(sqrt(n)). The bounds ignore constant factors, and the constant factors could be more significant than slowly growing non-constant factors for reasonable values of n.
It's also very common in algorithm design that improving the asymptotic bounds and making the algorithm faster (or the data structure smaller) are orthogonal (or even opposite) goals. Real computers have complex performance characteristics and fixed word lengths, and it's rarely a good idea to implement a theoretical algorithm exactly as described.
Succinct data structures often have a number of internal parameters. In a theoretical parameterization, those parameters may be described as being O(log n), O(log^2 n), or O(log log n). In a concrete implementation, it may be a good idea to use constant approximations for some nontrivial (such as 2x) performance gains. O(log n) could become 32 or 64, O(log^2 n) could become 1024 (or a power-of-2 multiple), and O(log log n) could become 4 or 8.
And then, if you parameterize the succinct data structure with these constants, the space overhead becomes a constant fraction.
Succinct data structures require the extra space (above the information-theoretical minimum) to be an additive term of o(n) bits (little O, not big O). That just means that the extra space grows more slowly than n, as n approaches infinity, so their ratio (extra space)/n approaches 0 in the limit.
That's a simplification. Succinct data structures are usually parameterized data structures. They have a number of parameters, such as block sizes and sampling rates, that govern the space overhead and query performance. The published version may use a parameterization that makes the space overhead sublinear while guaranteeing attractive asymptotic bounds for worst-case performance. But if you actually implement that, the performance is likely terrible.
Consider the rank data structure for bitvectors that is supposed to guarantee O(1) time queries with o(n) bits of space overhead. The bitvector is divided into superblocks of b1 bits and blocks of b2 bits. The top-level index, which stores the rank up to each superblock, uses O(n log n / b1) bits of space. The second-level index, which stores the rank within the superblock up to each block, uses O(n log b1 / b2) bits of space. And then you need to do linear search within the block, which is typically assumed to take O(b2 / log n) or O(b2 / w) time, where w is word size. If you choose b1 = log^2 n and b2 = log n, you get O(1) time with O(n log log n / log n) bits of space overhead. Which is technically sublinear but effectively indistinguishable from linear with realistic values of n.
Real-world implementations use constant values for the parameters. Typically the goal is to make the blocks and indexes align with cache lines and to replace arbitrary divisions with divisions by compile-time power-of-2 constants. Some values I've seen are (b1 = 512, b2 = 64) and (b1 = 65536, b2 = 512). In both cases, the overhead is linear.
And sometimes the implemented data structure is a simplification, because the nominally succinct data structure is too large and slow. For example, it's rare to see actual O(1)-time implementations of select on bitvectors. That would require three levels of indexes with many special cases. It's more common to use two levels of indexes, with queries that are almost always constant-time but have (poly)logarithmic worst cases with adversarial inputs.
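A back-of-the-envelope number for the constant parameterizations a couple of paragraphs up, assuming one 64-bit absolute counter per superblock and one ceil(log2 b1)-bit relative counter per block (real layouts pack things tighter than this):

    \text{overhead}(b_1, b_2) \approx \frac{64}{b_1} + \frac{\lceil \log_2 b_1 \rceil}{b_2}

    (b_1, b_2) = (512, 64):\quad 64/512 + 9/64 \approx 26.6\%
    (b_1, b_2) = (65536, 512):\quad 64/65536 + 16/512 \approx 3.2\%

Either way the extra bits are a fixed fraction of n, which is the sense in which the overhead is linear.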
This is really good information! Thanks for writing it up.
Honestly, I never actually "trust" the complexity analysis. Whenever I find a paper I immediately look for their specific results on an experiment, and if I can't find that I will assume the paper is only of theoretical interest. There's of course still a lot to learn from a purely theoretical paper, but I've always ended up being disappointed when I've implemented something which only had "good" asymptotic bounds.
You're absolutely right and my response completely missed your point, thanks for clarifying further.
Funny, I recently independently reinvented [1] the "balanced parentheses tree" and was very satisfied with its flexibility and speed: it seemed so simple and general I was surprised not to already know it by name. Now I do!
[1] https://pkg.go.dev/golang.org/x/tools/internal/astutil/curso...
My goto library for succinct data structures is SDSL-Lite [0].
[0] https://github.com/simongog/sdsl-lite
I'm sure it's out of date in some areas by now, but I have Navarro's textbook[1] and it's a great survey. (The only criticism that comes to mind is that it weirdly shortchanges Elias-Fano encoding[2], which is hugely important in practice but relegated to just an offhand sentence or two in the book.)
[1] https://www.cambridge.org/core/books/compact-data-structures...
[2] https://vigna.di.unimi.it/ftp/papers/QuasiSuccinctIndices.pd...
As an old-school signal processing person, "wavelet" and "FM" are throwing me for a loop. I can see FM is named for the authors. "wavelet" - I don't see at first glance what the name means or if it's related to the signal processing concept.
Yes, the 'wavelet tree' (and other wavelet thingies in succinct data structures) are rather unfortunately named. It ranks right up there with 'wavefunction collapse' in procedural generation.
Marisa trie is a really cool and useful succinct data structure (also mentioned in High Performance Python book): https://github.com/pytries/marisa-trie
They publish a benchmark for anyone interested: https://marisa-trie.readthedocs.io/en/latest/benchmarks.html
Summary for storing a list of 3M Russian words:
- ~60x less memory
- ~5-10x slower compared to hashmap
I also find Succinct Data Structures fascinating. I own Navarro's text book, and it's one of my favorite books to take to the coffee shop and try to understand. I also have a copy of Jacobson's Thesis.
Since information is rather scarce, here's my shameless plug for a related blog post I made, which includes a few links at the end to more resources https://denvaar.dev/posts/jacobsons_rank.html
> Ignorance can bring you far.
See also: the undergrad who found a breakthrough in hash tables recently and wasn't put off by a long-standing conjecture about what the bound on their performance was because he simply wasn't aware of it
As an aside, an FM index can be used to efficiently turn an LLM into an actual stochastic parrot (one that emits only substrings of some dataset). This is more useful than it sounds because you can use it for quoting from large corpora.
That tree format (using parentheses) is also known as Newick format: https://en.wikipedia.org/wiki/Newick_format
But you can also simply store the tree structure as a single array of ints (or longs) ... each node in the tree corresponds to a unique index position, and the value at that position is the index position of the node's parent (or its own position if it's a top-level node) ... good stuff.
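A tiny Rust sketch of that parent-array layout (my own toy example):

    fn main() {
        // Tree:   0
        //        / \
        //       1   2
        //           |
        //           3
        // Node i's entry holds its parent's index; the root points to itself.
        let parent = [0usize, 0, 0, 2];

        // Walk from a node up to the root.
        let mut node = 3;
        while parent[node] != node {
            print!("{} -> ", node);
            node = parent[node];
        }
        println!("{} (root)", node);

        // Children can be recovered with a single pass if needed.
        let mut children = vec![Vec::new(); parent.len()];
        for (i, &p) in parent.iter().enumerate() {
            if p != i {
                children[p].push(i);
            }
        }
        println!("{:?}", children); // [[1, 2], [], [3], []]
    }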
With advanced CS topics, it often works to search for "<topic> lecture notes".
How do succinct data structures do with vector operations on cpu?
Not sure if they are succinct, but the Apache arrow format encodes data in several ways that is compact in memory but also allows operations on these structures.
Wow, this is really fascinating. I guess it all comes down to how it's doing select and rank in constant time, which is probably some clever bit arithmetic. I'll have to look into how that works.
I can speak a bit about one of the approaches: "Practical Entropy-Compressed Rank/Select Dictionary" by Daisuke Okanohara and Kunihiko Sadakane. This presents two different variants: one for dense bitsets (i.e. more than 50% of the bits are set) and one for sparse ones.
The dense implementation is based around partitioning the bits into "blocks" of a given size, so that you can find the block by doing `block[idx / block_size]`. It then also groups each block into sub-blocks, which helps you even further. All of these are additional data structures (very much like an index) which are stored next to the regular bitset. You use the blocks/sub-blocks to find roughly where in the bitset you are and then use an algorithm for finding the value in a given machine word.
The sparse implementation treats the bitset as an ordered list of numbers (e.g. 100001001 is interpreted as 0, 5, 8) and then stores those numbers using Elias-Fano encoding. The higher bits of the Elias-Fano encoding happen to be dense, and hence we can use the previous dense implementation on those higher bits and then combine it with the lower bits.
I'm also aware of https://arxiv.org/abs/1706.00990 which is more about how to do this most efficiently at a machine-word level.
"Engineering Compact Data Structures for Rank and Select Queries on Bit Vectors" is another quite recent paper which I haven't fully digested yet.
Some of it is moving or has moved down to the instruction sets: https://vaibhavsagar.com/blog/2019/09/08/popcount/
One famously fun paper is "The LCA problem revisited"
https://ics.uci.edu/~eppstein/261/BenFar-LCA-00.pdf
For those who can't read, I recommend this talk about it:
https://youtu.be/4NXJm2T9Yks
Maybe a silly question, but has anyone used these in production? Or used libraries in production which are built on these structures?
I'm imagining a meeting about some project design, and thinking about how it'd go if someone suggested using parentheses to represent nodes of a tree. I imagine it'd get written off quickly. Not because it wouldn't work, but because of the complexity and learning curve involved.
The most emblematic application in real life is in bioinformatics. BWA and Bowtie are two widely used tools built upon them.
Why are you having project design meetings about details as low level as the in-memory representation of data in a program?
> This is a field that seems to have emerged in computer science relatively recently; many of the important data structures were invented in the last 25 years.
This is crazy!
Are succinct data structures good for "dynamic" applications, i.e. modifying the data often? Last time I looked into this field it seemed that all the popular structures were very static in nature and had costly updates.
Maybe there are more modern, advanced data structures / techniques, but yes, my understanding is that this is a common trade off (having your data structure be more static, with offline updates).
The word count seems artificially increased in the post. Here's a succinct explanation: https://www.eecs.tufts.edu/~aloupis/comp150/projects/Succinc...
I didn't think that at all. In fact I found it very readable and prefer it over the drier presentation you linked. To each its own, I guess, but there's really no need to infer ulterior motives.
Fascinating topic!
I've always gotten the 'ick' from XML/json etc. This is like my personal anti-nausea pill.