Bijou64: A variable-length integer encoding

www.inkandswitch.com

46 points by justinweiss 1 hour ago

omoikane 3 minutes ago

UTF-8 has the same issue ("overlength encoding") where multiple representations are possible the same code point. Someone proposed a similar tweak to remove the overlapping ranges by adjusting the base offset for byte sequences that are longer than 1. That was discussed here:

https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)

This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.

boricj 7 minutes ago

I'm working on a C++ library at work that binds wire data and application data through token and model layers, which includes among other things a fair amount of tokenizers/composers for various formats (JSON, CBOR, BSON, CSV...).

This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.

LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.

I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) to prevent infinitely long documents from hogging my tokenizers indefinitely.

nine_k 10 minutes ago

In short: instead of a truly indefinite-length solution with a signal bit on the current byte saying whether to check the next byte, this uses a counter. Values 0x0 to 0xF7 are one-byte integers, 0xF8 to 0xFF use the upper 5 bits as a counter for the number of subsequent bytes. This limits the maximum magnitude to slightly less than 2 ^ 264 (almost all 33-byte values), which seems to be okay for practical computations. The proposed standard limits the supported size to u64 though.

The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.

> ...adversarial input, which is rarely in the test suite.

This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.

stebalien 16 minutes ago

I've used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte).

The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.

[1]: https://github.com/multiformats/multicodec

HansHamster 6 minutes ago

It feels a bit unfair to say that it is faster by being able to tell the total length from the first byte and capping it at 64 bit, while some of the other formats can store arbitrarily large integers. I guess you could use another variable length encoding for the prefix at the cost of some performance and using even more space...

willtemperley 11 minutes ago

Maybe someone can explain why an encoder would ever create the padding bytes allowed in LEB128. I contributed the parser for LEB128 in apple/swift-binary-parsing and I’m still none the wiser. I’m genuinely mystified.

Chaosvex 9 minutes ago

You wouldn't. It's a strange argument that can be countered with, "maybe don't do that?"
- willtemperley 6 minutes ago
  
  So why does the spec allow it? Like a good engineer I read the spec and tested against the over-wide example encodings given.
layer8 5 minutes ago

The issue is that non-unique encodings are an attack vector, because parsers may in practice behave differently for noncanonical (or nominally invalid) encodings.
esrauch 4 minutes ago

Let's say you are writing into a byte[] and have a LEB128 length-prefix followed by a payload, but that determining the length actually involves nontrivial encoding work. For example, you have a UTF16 string and want to write out a UTF8 string, you want to go over the characters and write them out, but the UTF8 length is not known without doing all of that work.
If you can choose a fixed number of bytes for the length prefix, you can skip that number, do the encoding and find out the length, and then come back and fill in the length-prefix after.
But you actually don't know how many bytes it will take without doing all of the work to know the payload length (since larger payloads take more bytes to represent the length).
If you allow overlong representation you can reserve a few bytes and sometimes it'll just be the effective no-op bytes. If you don't, you won't be able to.

RedShift1 22 minutes ago

This seems quite convoluted just to avoid the "0 can be represented in more than one way" problem.

nine_k 19 minutes ago

It allows finding out the length (and allocating memory) after reading the first byte.
ahoka 16 minutes ago

I think it's neat.
ape4 9 minutes ago

Comparing a number to zero is something that's done a lot
- Chaosvex 5 minutes ago
  
  True but also not particularly relevant?