c-linkage 14 hours ago

Interning strings saves a ton of space. I wish more programmers would use it.

Back in the 32-bit days I was working on a large (multi GB) distributed Oracle database system but I couldn't use transaction log shipping (don't ask). Database A was the "main" system and database B was the "replica". To keep them in sync I needed a program that would compare the two databases and then generate an update script that would make B look like A.

Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.

The first attempt by a junior dev was implemented in C#. Not only was it terribly slow, but it also ran out of memory.

I wrote a new version in C that interned all strings using a hash table and a custom bump-allocator. I also exported every field in the database as a string, so I didn't have to deal with native types. Using this technique meant that a database record could be represented as a plain array of pointers to the interned strings.

Since each string was only recorded once, and every field was a pointer to a string, two database records with the same values would by definition point to the same strings. Comparing database rows was as easy as doing a memcmp() on the two pointer arrays, one being a record from database A and the other the record from database B.
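
A minimal sketch of the idea (in Rust rather than the original C; the Pool type and rows_equal helper are illustrative, not the actual code):

    // Each distinct string is stored exactly once; a record is an array of
    // pointers into the pool, so row equality reduces to pointer equality.
    use std::collections::HashSet;
    use std::sync::Arc;

    #[derive(Default)]
    struct Pool {
        strings: HashSet<Arc<str>>,
    }

    impl Pool {
        fn intern(&mut self, s: &str) -> Arc<str> {
            if let Some(existing) = self.strings.get(s) {
                return existing.clone();
            }
            let owned: Arc<str> = Arc::from(s);
            self.strings.insert(owned.clone());
            owned
        }
    }

    // The moral equivalent of the memcmp() over two pointer arrays.
    fn rows_equal(a: &[Arc<str>], b: &[Arc<str>]) -> bool {
        a.len() == b.len() && a.iter().zip(b).all(|(x, y)| Arc::ptr_eq(x, y))
    }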

Not only was the system incredibly fast, but it never took more than 150MB of memory to run.

  • gopalv 14 hours ago

    This is usually the real reason interning gets used: to avoid long string comparisons, rather than to save memory as such.

    Interned strings tend not to have a good cleanup mechanism in a system that churns through a lot of them, so they often end up using more memory as data patterns evolve.

    I use the same trick when parsing JSON, where a large set of rows tends to have the same keys repeated, and the conversion to columnar form is easier if the keys are interned.

    • o11c 13 hours ago

      If your language supports good strong/weak references and containers thereof, cleaning up dead interned strings isn't hard. I'm not aware of any language that provides this out-of-the-box, unfortunately.

      Why do so many languages make weak references such second-class citizens? Why do all containers suck so much that you have to implement your own (and that's hoping the language is actually efficient enough to let you)?

      • rscho 12 hours ago

        > not aware of any language that provides this out-of-the-box, unfortunately.

        Many lisps. Racket, for example.

      • wongarsu 12 hours ago

        > I'm not aware of any language that provides this out-of-the-box, unfortunately

        The currently most prominent example would be Rust. Rc<T> is a simple generic container that implements reference counting, and any instance of it can be downgraded to a Weak<T>. Or Arc<T> and std::sync::Weak<T> if it needs to be thread safe.
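
        A tiny sketch of the mechanics (standard library only; the names are illustrative):

            use std::rc::{Rc, Weak};

            fn main() {
                // An interned value owned via Rc; an intern table could hold only
                // the Weak side and treat a failed upgrade() as "expired".
                let strong: Rc<str> = Rc::from("interned");
                let weak: Weak<str> = Rc::downgrade(&strong);
                assert!(weak.upgrade().is_some()); // still alive
                drop(strong);
                assert!(weak.upgrade().is_none()); // safe to clean up the table entry
            }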

        • o11c 11 hours ago

          I've done it in C++, so Rust is probably capable of it if you add enough layers of Rc, RefCell, and whatever else it requires to fit into its restricted worldview.

          Does Rust actually have container implementations that do all of the following:

          * When walking the container (either iterating or looking up, even through a non-mutable reference), call a user-provided predicate (not just builtin "weak" like many languages have via weakset/weakkeymap/weakvaluemap) to detect if a node should be considered "dead", and if so transparently remove the node. [In my experience this is relatively easy to add when you're implementing the container algorithms yourself, though I've never done it for bulk algorithms yet.]

          * When looking up a key (which may have different type or identity), the lookup returns the actual key the container had. [This may be impossible for container implementations that split the key.]

          • whytevuhuni 10 hours ago

            Mutating a container through a shared reference means the container has to be either single-threaded (not marked as Sync/Send) or thread-safe.

            The single-threaded ones are easy to make, but Rust will prevent you from sending them to another thread, which is probably something you want.

            For thread-safe things, look into the crossbeam crate, it has really good collections.

            One I worked with was dashmap, whose .retain() method [1] works through a shared map reference but runs a closure that sees each key, gets mutable access to the value, and decides whether to keep the pair.

            Its .get() [2] uses equality (so you can use a different object), but returns a reference to the original key-value pair. The .get_mut() will return it as mutable, but inside a guard that keeps the item locked until it goes out of scope.

            [1] https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html#m...

            [2] https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html#m...
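
            As a rough sketch (the Interner type is hypothetical; the DashMap calls are the ones described above), such a map could back a thread-safe interner:

                use dashmap::DashMap;
                use std::sync::{Arc, Weak};

                #[derive(Default)]
                struct Interner {
                    map: DashMap<String, Weak<str>>,
                }

                impl Interner {
                    fn intern(&self, s: &str) -> Arc<str> {
                        // .get() accepts &str even though keys are String, and the
                        // returned guard exposes the stored key/value pair.
                        if let Some(entry) = self.map.get(s) {
                            if let Some(live) = entry.value().upgrade() {
                                return live;
                            }
                        }
                        let fresh: Arc<str> = Arc::from(s);
                        self.map.insert(s.to_owned(), Arc::downgrade(&fresh));
                        fresh
                    }

                    // Batch cleanup via .retain(): drop entries whose last strong
                    // owner is gone.
                    fn prune(&self) {
                        self.map.retain(|_, weak| weak.strong_count() > 0);
                    }
                }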

            • o11c 9 hours ago

              `.retain()` unfortunately isn't what I'm talking about, since it's at least O(n) per call. It might be better if you can arrange to call it just once after a batch of operations, but that isn't always the case.

              "Delete as you go" usually† adds no space/time complexity to the operations you're already doing (though it does affect the constant factor). It does mean you're giving up on predictable destructor calls, but generally the "heavy" destructor was called by whoever made it expire (e.g. the death of the last remaining shared owner if that's what expiry is based on).

              † If many expirations happen at once, this can easily drop to merely amortized time. It's also possible for a container to offer, say, O(1) iteration via dedicated pointers but have a delete operation that's more complicated.
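
              For a flavor of "delete as you go" in a single-threaded weak interner (hypothetical sketch, std only), a dead entry found during an ordinary lookup is simply replaced on the spot, so cleanup rides along with work you were already doing:

                  use std::collections::HashMap;
                  use std::sync::{Arc, Weak};

                  #[derive(Default)]
                  struct WeakInterner {
                      map: HashMap<String, Weak<str>>,
                  }

                  impl WeakInterner {
                      fn get_or_intern(&mut self, s: &str) -> Arc<str> {
                          if let Some(live) = self.map.get(s).and_then(Weak::upgrade) {
                              return live;
                          }
                          // Missing or expired: inserting a fresh Arc both interns the
                          // string and overwrites any dead entry, with no separate
                          // O(n) retain() pass.
                          let fresh: Arc<str> = Arc::from(s);
                          self.map.insert(s.to_owned(), Arc::downgrade(&fresh));
                          fresh
                      }
                  }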

          • duped 10 hours ago

            To the first question, not really, and if it did it would be pretty fragile because of mutability requirements. It's fragile in C++ too because of iterator invalidation, Rust mostly turns that into a compiler error.

            To the second question, yes, it's super common.

            • o11c 9 hours ago

              I didn't have any problem with iterator/pointer invalidation (easy to merge into the same thing), since holding an active iterator inhibited expiration. It's possible to imagine an expiration system that doesn't automatically guarantee this but I didn't have one; the only non-weak-based expiry I had was based on timers, and it's sanest to only count time as elapsing at the heart of the event loop.

      • ayuhito 7 hours ago

        Go recently added weak pointers and a string interning package to its standard library, which is an interesting read.

        [0] https://go.dev/blog/unique

        • ignoramous 6 hours ago

          > string interning package to its standard library

          TFA literally says interning isn't there in Go yet.

            While the unique package is useful, Make is admittedly not quite like Intern for strings, since the Handle[T] is required to keep a string from being deleted from the internal map. This means you need to modify your code to retain handles as well as strings.
          
          > an interesting read

          Tailscale's attempt at implementing "unique" was quite... something: https://tailscale.com/blog/netaddr-new-ip-type-for-go

          • ayuhito 5 hours ago

            The Tailscale article taught me a lot! Thanks for sharing.

      • jdougan 11 hours ago

        > not aware of any language that provides this out-of-the-box, unfortunately.

        Smalltalk has this. Class "Symbol".

    • kazinator 13 hours ago

      In Lisp, interning isn't used only for saving on string comparisons. It's the basis of the symbol abstraction; or perhaps not the basis, but interning is the only way symbols can correspond to a printed representation. Without interning, A and A would not be the same object.

      A symbol being an object with a durable identity is important because it can have properties other than just the string which gives it a name.

  • kccqzy 13 hours ago

    Here's another perspective: I was once asked to improve a custom data caching system that took too much memory even though it used string interning. (At that time, eviction had to be based on factors other than memory use.) String interning certainly helped with memory use, but it wasn't enough. Eventually my solution was to compress each record of the dataset in memory. I found that this saved more memory than interning individual strings within each record. At the time I picked Google's snappy compression algorithm since it was already a dependency of the project, but these days I might pick zstd with a negative compression level, or lz4.

    This just goes to show that if your primary goal is to save memory, you should consider storing data compressed and decompressing on the fly when it's used. Modern compression algorithms are good and fast on moderately sized textual data; you might be surprised how fast decompression is. There are of course other benefits to interning, like using pointer equality as string equality, but those weren't a factor in my project.
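
    For a concrete flavor (a sketch only; my project used snappy, and the zstd crate with a negative level is just one way to do the same thing in Rust):

        // Compress each serialized record individually and decompress on read.
        // -3 is a negative zstd level that trades ratio for speed.
        fn compress_record(serialized: &[u8]) -> std::io::Result<Vec<u8>> {
            zstd::encode_all(serialized, -3)
        }

        fn decompress_record(blob: &[u8]) -> std::io::Result<Vec<u8>> {
            zstd::decode_all(blob)
        }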

    • nostrademons 13 hours ago

      Many compression algorithms are effectively string interning that works on general-purpose binary data: they adaptively pick the most-repeated common substrings and assign them the smallest bit representations. That's why formats like XML and JSON compress so well: all those repeated string keys get stored once and then become sub-byte entries in a lookup table.

      • kccqzy 11 hours ago

        Good point! And since compression can use sub-byte entries in its lookup table, it's no wonder that a simplistic string interning solution using pointer-sized entries in a lookup table doesn't reduce memory as effectively.

    • zerd 9 hours ago

      We did a very similar thing in a previous project. We used protostuff to serialize each object and snappy (at the time) to compress it, plus a small proxy class to access it that decompressed on demand. It cost a few microseconds on read but saved 4x the space (20GiB -> 5GiB IIRC). This was after interning etc. I'm sure you can save a lot more space if you compress batches of objects and store them column-oriented, but it will take a bit longer to decompress individual objects.

  • viraptor 13 hours ago

    So, a fun story about how interning numbers can go wrong: when compiling Apple's UI designs from the XML-based xib format to a binary nib, their compiler uses lots of interning. It's pretty cool for avoiding 20 copies of an empty string, for example. But then they messed up and did the same thing with numbers while ignoring the type, which means that if you have the value 5.0 as a size somewhere, your sequentially assigned ids will be: 1, 2, 3, 4, 5.0, 6, ...

  • MrLeap 10 hours ago

    > Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.

    Floating point weirdsies between systems are a commonly encountered quirk in multiplayer gamedev, and the source of many recommendations to do physics in fixed point.

    • nyrikki 10 hours ago

      In Oracle, a FLOAT defaults to a binary precision of 126 and is stored in up to 22 bytes.

      Probably something to do with ANSI history.

      It is even more painful than IEEE 754 ambiguity.

  • kazinator 13 hours ago

    > I wish more programmers would use it.

    For instance, people writing blogs to explain Lisp, who then present some Javascript crud in which symbols are just character strings.

trevor-e 13 hours ago

Great write-up!

Apologies if this side-tracks the conversation, but why do we as an industry complicate the naming of such techniques? I don't remember what "string interning" is off the top of my head, but a "string lookup table" is immediately obvious. Maybe I'm wrong though and this is a commonly understood name. I guess interning captures both creating the lookup table + using the index in the related data structures. We used this technique at KAYAK to help efficiently construct the list of possible flight combinations for N x M depart/return legs.

  • SnowflakeOnIce 11 hours ago

    "String interning" is discussed in the Java Language Spec, going back to the 1990s, and wasn't novel there. It goes back at least to Lisp implementations from the 1970s. Probably even earlier than that.

  • umanwizard 12 hours ago

    Interning is a very common name. "String lookup table" would be too broad, it would mean anything that associates a string with a key.

  • mannyv 13 hours ago

    So interning is really just normalization/dictionarying your strings.

    • cb321 13 hours ago

      "pointerizing" might be another name. It would be utterly unsurprising if the database/sql world had more than one other name as well. Usually, ideas that go back to the 1950s at the dawn of data representation have many names.

      Maybe worth noting, you only need the dictionary in one direction (converting an expanded string to a pointer by looking it up). In the other direction, you just need to indirect the pointer and most people don't call an address space indexed by integers a "dictionary" (though, I guess the OG Javascript did merge these two ideas).

      Also notable, how wide a pointer you need (8-bit..64-bit) and how often you go in each direction just depends upon workload. This one-direction aspect kind of matters since sometimes your data set is so static that you can just convert to pointers up-front. Then your data analysis doesn't even need the expanded->pointer dictionary. With larger pointers (like 16..64-bit offsets into a file), you just need the file (and whatever data and/or files reference it via the pointers).

      Personally, I've run into many "enum-scale" scenarios where 8-bits was enough address space. E.g., there are <256 countries in the world, last I checked. You might need another layer of indirection if you cannot bound the length of per-country data, though.
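
      A sketch of that enum-scale case (hypothetical types, not from the article):

          use std::collections::HashMap;

          // Fewer than 256 distinct values, so a u8 "pointer" is enough.
          #[derive(Default)]
          struct SmallDict {
              to_id: HashMap<String, u8>, // expanded string -> id (the only direction needing a dictionary)
              values: Vec<String>,        // id -> expanded string (plain indexing)
          }

          impl SmallDict {
              fn intern(&mut self, s: &str) -> u8 {
                  if let Some(&id) = self.to_id.get(s) {
                      return id;
                  }
                  let id = u8::try_from(self.values.len()).expect("more than 256 distinct values");
                  self.values.push(s.to_owned());
                  self.to_id.insert(s.to_owned(), id);
                  id
              }

              fn expand(&self, id: u8) -> &str {
                  &self.values[id as usize]
              }
          }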

      As mentioned elsewhere in the thread (at least by nostrademons: https://news.ycombinator.com/item?id=43245421 but probably elsewhere), data compression algorithms generalize this from 8..64 bits to "'best' bits per token", but they need to measure up front what 'best' might mean. That domain of research/implementation may also have its own terminology. I don't know what to do about "nobody talking to each other", either in this eentsy microcosm or more generally. Society can be tough. :-)

Diggsey 14 hours ago

I have a Rust crate (ijson) which implements string interning and a generally much more efficient in-memory JSON representation. It probably doesn't compete with the author's specialized implementation, but if you just want a plug and play solution it's pretty good.

saghm 14 hours ago

I did something similar to reduce the memory overhead of our Rust backend at a recent job; I noticed that when dealing with certain requests, we were using quite a lot of memory on duplicates of the same strings when processing things, so I made something I called an "alias cache" type that could be used to obtain an alias of an existing string if needed, and then introduced those to several places in the codebase where it was most helpful. This strategy also allowed some flexibility on how long strings were "cached" for, which let me tweak which layers of processing should share a cache versus creating their own (and letting the old one drop). Our use case definitely didn't have room for interning to use 2000x less memory, but I think in some of the pathological cases it reduced the max memory usage to around 50% of what it was before without significantly affecting runtime, which I was pretty pleased with!

est an hour ago

Do column stores do this automatically?

3abiton 14 hours ago

Very fun read, Guillaume! I never find the courage to work on my "this seems like an interesting weekend idea" projects because I fear the time commitment, but I often find these HN write-ups somehow fulfilling.

Someone 13 hours ago

A problem with explicit interning is that libraries, when creating objects, cannot make an informed decision about whether to trade some runtime performance for an expected decrease in memory usage.

And that decrease is not even guaranteed. Many interning libraries never free the objects they create, so they can keep lots of objects around that are never needed again, and thus increase memory usage.

I think the ideal API would somehow a) be simple and b) let the program communicate what it desires from the system. As a first step, the system has to know how much memory the program is willing to use, how much longer it expects to run, and for how long the program plans to retain the data it is creating now.

Simple examples:

- if the program is shutting down, and there’s a megabyte of memory available, it’s likely detrimental to intern any new data.

- in a “json grep” tool, interning json keys likely isn’t worth it, as most data will be discarded soon.

That's probably at least a dozen PhDs' worth of research, and likely not attainable, though.

uhgrippa 13 hours ago

Very interesting, I particularly enjoyed the handling of memory reduction with the Interned<T> comparison. How do you go about finding open source projects which expose fascinating data such as this one?

xelxebar 7 hours ago

Man, lately I've been amazed by how many performance tricks like these are just bog standard APL. For example, string interning looks like this:

    strs←∪input_strings    ⍝ Build
    strs⍪←⊂'new string'    ⍝ Append
    strs⍳⊂'some string'    ⍝ Search
    f←strs∘⍳               ⍝ Hash table creation
One extra advantage on top of standard interning is that this also uses an incarnation of tiny pointers[0], automatically saving 2X memory generally. Better yet, standard APL data design will often obviate the need for most string searches, letting you get away with integer comparisons instead.

I'm starting to really see how data-oriented design tends to make straightforward code both parsimonious and highly performant.

[0]:https://news.ycombinator.com/item?id=43023634

Horffupolde 14 hours ago

What about loading it into ClickHouse?

  • wiredfool 11 hours ago

    Yeah, I've got some ClickHouse tables that store similar data (GBFS), and I've got an amortized size of around 1 byte/row.

wdb 13 hours ago

As it's already in Rust, is the next step leveraging Apache DataFusion?

hu3 13 hours ago

From the looks of it, interning here means creating references to strings that repeat. Strings are large, so if you can store a reference to one instead, you save space.

kajecounterhack 10 hours ago

Years ago at Google, the permissions system (OG ganpati, for the googlers) was having major performance issues related to OOM in Java. IIRC there was an epic thread where folks discussed potential resolutions including, this being Java, GC tuning. But in the end, string interning was what helped a ton to defer the scaling issues until something else could be done (mostly a rewrite, smh). I mostly remember this as being a shocking solution to me. Part of me imagined there'd be some fancy scalable solution instead of single-binary optimization, but at the end of the day sometimes you just gotta intern some strings.

6r17 12 hours ago

First time I've read about interning, and this is really cool; kudos for the article and the usage. Thanks for sharing!

Zezima 14 hours ago

An amazing example of Rust's abilities. More impressed by the relentless iterative approach taken by the author. Nicely done.

jcgrillo 14 hours ago

It's really shocking to see how wasteful a big blob of JSON is. This is such a great illustration of how much room there is for improvement.

  • throitallaway 14 hours ago

    The thing JSON is best at is human readability. I would say human composability as well, but YAML and others are better for that. As a machine data exchange and storage format, JSON is extremely inefficient.

    • travisjungroth 13 hours ago

      I really dislike writing YAML. It might be a skill issue (as the kids say), but it just never landed. I'd rather write JSON, even with all its verbosity.

    • jcgrillo 13 hours ago

      It never ceases to amaze me how good computer people are at solving problems which are not problems (readability of serialized representation) at the expense of multiple orders of magnitude more compute and storage.

henry700 14 hours ago

Amazing to do this in a weekend and keep enough track of it to write a great article within three weeks.

Aachen 12 hours ago

I'd never heard of interning before (besides being an intern at a job, perhaps), so I gave the post a read and... isn't this called deduplicating in plain English? I didn't understand what the jargon adds; what makes this special?

  • coder543 11 hours ago

    https://en.wikipedia.org/wiki/String_interning

    It's an established term in software. I would speculate that the word "intern" is short for "internalizing" here.

    • olddustytrail 8 hours ago

      I think it's from the Lisp "intern" function, which creates an internal symbol if it doesn't already exist and there's no external symbol of the same name.

valand 13 hours ago

Fun fact, V8 interns strings! But it doesn't seem to intern big ones.

  • nostrademons 13 hours ago

    Many VMs do. The JVM was one of the first to do this extensively - all strings are in a "constant pool" as part of the .class format, and then they're referenced by index in instructions. Python does it for strings that appear in source code, and you can force it for runtime data with sys.intern().

lukeweston1234 14 hours ago

Great post, one of my more favorite reads in the last few weeks.

vtuulos 13 hours ago

if you want to see similar tricks applied in Python (with a JIT compiler for query-time optimization), take a look at this fun deck that I presented a long time ago: https://tuulos.github.io/sf-python-meetup-sep-2013

we were able to handle trillion+ datapoints with relatively modest machines - definitely a useful approach if you are ready to do some bit twiddling

  • ciupicri 12 hours ago

    Just use `s = sys.intern(s)` for every string and be done with it. Or something like:

        _my_intern_dict = {}
        my_intern = lambda x: _my_intern_dict.setdefault(x, x)
        s = my_intern(s)
    
    Just make sure to delete _my_intern_dict when it's not needed anymore.