Protocol buffers suck, but so does everything else. Name another serialization declaration format that both (a) defines which changes can be made backwards-compatibly, and (b) has a linter that enforces backwards-compatible changes.
Just with those two criteria you’re down to, like, six formats at most, of which Protocol Buffers is the most widely used.
And I know the article says no one uses the backwards compatible stuff but that’s bizarre to me – setting up N clients and a server that use protocol buffers to communicate and then being able to add fields to the schema and then deploy the servers and clients in any order is way nicer than it is with some other formats that force you to babysit deployment order.
The reason why protos suck is because remote procedure calls suck, and protos expose that suckage instead of trying to hide it until you trip on it. I hope the people working on protos, and other alternatives, continue to improve them, but they’re not worse than not using them today.
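To make the deploy-in-any-order point concrete, here's an illustrative before/after sketch (message and field names made up; shown as two snapshots of the same message, not one file):

```proto
// v1, already deployed everywhere.
message Order {
  string id = 1;
  int64 amount_cents = 2;
}

// v2: add a field with a previously unused tag number. Old binaries
// ignore tag 3 on the wire; new binaries see the default value ("")
// when talking to old peers. Either side can be deployed first.
message Order {
  string id = 1;
  int64 amount_cents = 2;
  string coupon_code = 3;
}
```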
> Typical offers a new solution ("asymmetric" fields) to the classic problem of how to safely add or remove fields in record types without breaking compatibility. The concept of asymmetric fields also solves the dual problem of how to preserve compatibility when adding or removing cases in sum types.
That's a nice idea... But I believe the design direction of protocol buffers was to make everything `optional`, because `required` tends to bite you later, when you realize it should actually be optional.
My understanding is that asymmetric fields provide a migration path in case that happens, as stated in the docs:
> Unlike optional fields, an asymmetric field can safely be promoted to required and vice versa.
> [...]
> Suppose we now want to remove a required field. It may be unsafe to delete the field directly, since then clients might stop setting it before servers can handle its absence. But we can demote it to asymmetric, which forces servers to consider it optional and handle its potential absence, even though clients are still required to set it. Once that change has been rolled out (at least to servers), we can confidently delete the field (or demote it to optional), as the servers no longer rely on it.
This seems interesting. I'm still not sure if `required` is a good thing to have (for persistent data like logs, you cannot really guarantee a field's presence without schema versioning baked into the file itself), but for intermediate wire use cases, this will help.
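For anyone curious, here's a sketch of the demotion dance from the quoted docs as a Typical schema. The syntax is written from memory of Typical's docs and the field names are invented, so double-check against the real documentation:

```
# Step 1: the field is required.
struct CreateUserRequest {
  email: String = 0
  referrer: String = 1
}

# Step 2: demote it to asymmetric. Clients must still set it, but
# servers must now tolerate its absence.
struct CreateUserRequest {
  email: String = 0
  asymmetric referrer: String = 1
}

# Step 3: once step 2 is rolled out everywhere, delete the field
# (or demote it to optional).
struct CreateUserRequest {
  email: String = 0
}
```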
I've never heard of Typical but the fact they didn't repeat protobuf's sin regarding varint encoding (or use leb128 encoding...) makes me very interested! Thank you for sharing, I'm going to have to give it a spin.
Seems like a lot of effort to avoid adding a message version field. I’m not a web guy, so maybe I’m missing the point here, but I always embed a schema version field in my data.
The point is that it's hard to prevent asymmetry in message versions if you are working with many communicating systems. Let's say four services inter-communicate with some protocol: it is extremely annoying to impose a deployment order where the producer of a message type is the last to upgrade the message schema, as this causes unnecessary dependencies between the release trains of these services. At the same time, one cannot simply say "I don't know this message version, I will disregard it", because in live systems this means the systems go out of sync, data is lost, stuff breaks, etc.
There are probably more issues I haven't mentioned, but long story short: in live, interconnected systems it becomes important to have intelligent message versioning, i.e. a version number is not enough.
I think I see what you’re getting at? My mental model is client and server, but you’re implying a more complex topology where no one service is uniquely a server or a client. You’d like to insert a new version at an arbitrary position in the graph without worrying about dependencies or the operational complexity of doing a phased deployment. The result is that you try to maintain a principled, constructive ambiguity around the message schema, hence asymmetrical fields? I guess I’m still unconvinced and I may have started the argument wrong, but I can see a reasonable person doing it that way.
Yes, that's a big part, but even bigger is just the alignment of teams.
Imagine team A building feature XYZ
Team B is building TUV
one of those features in each team deals with messages, the others are unrelated.
At some point in time, both teams have to deploy.
If you have to sync them up just to get the protocol to work, that's extra complexity on top of the already complex work of the teams.
If you can ignore this, great!
It becomes even more complex with rolling updates, though: not all deployments of a service will have the new code immediately, because you want multiple instances to be online to scale on demand. This creates an immediate, necessary ambiguity in the question "which version does this service accept?", because it's not about the service anymore, but about the deployments.
Ah, I see. Team A would like to deploy a new version of a service. It used to accept messages with schema S, but the new version accepts only S’ and not S. So the only thing you can do is define S’ so that it is ambiguous with S. Team B uses Team A’s service but doesn’t want to have to coordinate deployments with Team A.
I think the key source of my confusion was Team A not being able to continue supporting schema S once the new version is released. That certainly makes the problem harder.
Idk, I generally think "magic numbers" are just extra effort. The main annoyance is adding if-statements everywhere on the version number instead of checking whether the data field you need is present.
It also really depends on the scope of the issue. Protos really excel at "rolling" updates and continuous changes instead of fixed APIs. For example, MicroserviceA calls MicroserviceB, but the teams do deployments at different times of the week. Constantly rolling the version number for each change is annoying vs. just checking for the new feature, especially if you can have several active versions at a time.
It also frees you from actually propagating a single version number everywhere. If you own a bunch of API endpoints, you either need to put the version in the URL, which impacts every endpoint at once, or you need to put it in the request/response of every one.
I think this is only a problem if you’re using a weak data interchange library that can’t use the schema number field to discriminate a union. Because you really shouldn’t have to write that if statement yourself.
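A minimal sketch of that idea in Python (the payload shapes and field names are invented for illustration): let a tiny dispatch table discriminate on the version field, so no caller writes the if-statements by hand.

```python
import json

# Hypothetical payload versions; each parser normalizes its version of
# the schema into the shape current code expects.
def _parse_v1(d: dict) -> dict:
    return {"name": d["name"], "tags": []}          # v1 had no tags field

def _parse_v2(d: dict) -> dict:
    return {"name": d["name"], "tags": d["tags"]}

PARSERS = {1: _parse_v1, 2: _parse_v2}

def parse(raw: str) -> dict:
    d = json.loads(raw)
    return PARSERS[d["version"]](d)  # the union is discriminated here, once
```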
We use protocol buffers on a game and we use the back compat stuff all the time.
We include a version number with each release of the game. If we change a proto we add new fields and deprecate old ones and increment the version. We use the version number to run a series of steps on each proto to upgrade old fields to new ones.
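That scheme (assuming I've understood it right) can be sketched as a chain of per-version upgrade steps; the field names here are invented for illustration:

```python
# Each step upgrades a record from version N to N+1; old saves are
# migrated by replaying every step past their recorded version.
def _v1_to_v2(d: dict) -> dict:
    d = dict(d)
    d["coins"] = d.pop("gold", 0)   # renamed field: gold -> coins
    return d

def _v2_to_v3(d: dict) -> dict:
    d = dict(d)
    d.setdefault("inventory", [])   # new field with a default
    return d

MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}
CURRENT_VERSION = 3

def upgrade(record: dict) -> dict:
    record = dict(record)           # don't mutate the caller's copy
    version = record.get("version", 1)
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version += 1
    record["version"] = CURRENT_VERSION
    return record
```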
This. Plus ASN.1 is pluggable as to encoding rules and has a large family of them:
- BER/DER/CER (TLV)
- OER and PER ("packed" -- no tags and no lengths wherever possible)
- XER (XML!)
- JER (JSON!)
- GSER (textual representation)
- you can add your own!

(One could add one based on XDR, which would look a lot like OER/PER in a way.)
ASN.1 also gives you a way to do things like formalize typed holes.
Not looking at ASN.1, not even its history and evolution, when creating PB was a crime.
The people who wrote PB clearly knew ASN.1. It was the most famous IDL at the time. Do you assume they just came in one morning and decided to write PB without taking a look at what already existed?
Anyway, as stated PB does more than ASN.1. It specifies both the description format and the encoding. PB is ready to be used out of the box. You have a compact IDL and a performant encoding format without having to think about anything. You have to remember that PB was designed for internal Google use as a tool to solve their problems, not as a generic solution.
ASN.1 is extremely unwieldy in comparison. It has accumulated a lot of cruft through the years. Plus, they don't provide a default implementation.
I agree that saying that no-one uses backwards compatible stuff is bizarre. Rolling deploys, being able to function with a mixed deployment is often worth the backwards compatibility overhead for many reasons.
In Java, you can accomplish some of this by using Jackson JSON serialization of plain objects, where there are several ways to make changes backwards-compatibly (e.g. in recent years, post-deserialization hooks can be used to handle more complex cases), which satisfies (a). For (b), there's no automatic linter. However, in practice, I found that writing tests that deserialize the prior release's serialized objects gets you pretty far along the line of regression protection for major changes. It was also pretty easy to write an automatic round-trip serialization tester to catch mistakes in the ser/deser chain. Finally, if you stay away from non-schemable ser/deser (such as a method that handles any property name), which can be enforced with a linter, you can output the JSON schema of your objects to committed source. Then, any time the generated schema changes, you can look for corresponding test coverage in code reviews.
I know that’s not the same as an automatic linter, but it gets you pretty far in practice. It does not absolve you from cross-release/upgrade testing, because serialization backwards-compatibility does not catch all backwards-compatibility bugs.
Additionally, Jackson has many techniques, such as unwrapping objects, which let you execute more complicated refactoring backwards-compatibly, such as extracting a set of fields into a sub-object.
I like that the same schema can be used to interact with your SPA web clients for your domain objects, giving you nice inspectable JSON. Things serialized to unprivileged clients can be filtered with views, such that sensitive fields are never serialized, for example.
You can generate TypeScript objects from this schema or generate clients for other languages (e.g. with Swagger). Granted it won’t port your custom migration deserialization hooks automatically, so you will either have to stay within a subset of backwards-compatible changes, or add custom code for each client.
You can also serialize your RPC comms to a binary format, such as Smile, which uses back-references for property names, should you need to reduce on-the-wire size.
It’s also nice to be able to define Jackson mix-ins to serialize classes from other libraries’ code or code that you can’t modify.
This is always the thing to look for: "What are the alternatives, and why aren't there better ones?"
I don't understand most use cases of protobufs, including ones that informed their design. I use it for ESP-hosted, to communicate between two MCUs. It is the highest-friction serialization protocol I've seen, and is not very byte-efficient.
Maybe something like the specialized serialization libraries (bincode, postcard etc) would be easier? But I suspect I'm missing something about the abstraction that applies to networked systems, beyond serialization.
> Name another serialization declaration format that both (a) defines which changes can be made backwards-compatibly, and (b) has a linter that enforces backwards compatible changes.
The article covers this in the section "The Lie of Backwards- and Forwards-Compatibility." My experience working with protocol buffers matches what the author describes in this section.
Protobufs are better but not best. Still, by far, the easiest thing to use and the safest is actual APIs. Like, in your application. Interfaces and stuff.
Obviously if your thing HAS to communicate over the network that's one thing, but a lot of applications don't. The distributed system micro service stuff is a choice.
Guys, distributed systems are hard. The extremely low API visibility, combined with fragile network calls and unsafe, poorly specified API versioning, means your stuff is going to break, a lot.
Want a version-controlled API? Just write an interface in C# or PHP or whatever.
This sort of comment doesn't add anything to the discussion unless you are able to point out what you believe to be the best. It reads as an unnecessary and unsubstantiated put-down.
> And I know the article says no one uses the backwards compatible stuff but that’s bizarre to me – setting up N clients and a server that use protocol buffers to communicate and then being able to add fields to the schema and then deploy the servers and clients in any order is way nicer than it is with some other formats that force you to babysit deployment order.
Yet the author has the audacity to call the authors of protobuf (originally Jeff Dean et al) "amateurs."
> Just with those two criteria you’re down to, like, six formats at most, of which Protocol Buffers is the most widely used.
What I dislike most about blog posts like this is that, although the blogger is very opinionated and critical of many things, the post dates back to 2018, protobuf is still dominant, and in all these years the blogger apparently never put together something they felt was a better way to solve the problem. It's perfectly fine to feel strongly about a topic. But investing so much energy criticizing, and even personally attacking, whoever contributed to the project feels pointless: an exercise in self-promotion at the expense of shit-talking. Either put together something that implements your vision and rights some wrongs, or don't go out of your way to put people down. Not cool.
The blog post leads with the personal assertion that protobuf is "ad-hoc and built by amateurs". Therefore I doubt that JSON, a data serialization language designed by trimming most of JavaScript out and meant to be parsed with eval(), would meet that opinionated high bar.
Also, JSON is a data interchange language, with no support for types beyond the notoriously ill-defined primitives. In contrast, protobuf is a data serialization language that supports specifying types. This means that for JSON to come close to meeting the requirements protobuf meets, it would need to be paired with schema-validation frameworks and custom configurable parsers, which it definitely does not ship with.
You must be young. XML and XML Schemas existed before JSON or Protobuf, and people ditched them for a good reason and JSON took over.
Protobuf is just another take on the old RPC/Java Beans/etc. paradigm, in a binary format. Yes, it is more data-efficient than JSON, but it is a PITA to work with and debug.
> You must be young. XML and XML Schemas existed before JSON or Protobuf, and people ditched them for a good reason and JSON took over.
I'm not sure you got the point. It's irrelevant how old JSON or XML (a non sequitur) are. The point is that one of the main features and selling points of protobuf is strong typing and model validation implemented at the parsing level. JSON does not support any of these, and you need to onboard more than one ad-hoc tool to have a shot at feature parity, which goes against the blogger's opinionated position on the topic.
What about Cap’n Proto https://capnproto.org/ ? (Don't know much about these things myself, but it's a name that usually comes up in these discussions.)
Cap'n'proto is not very nice to work with in C++, and I'd discourage anyone from using it from other programming languages, the implementations are just not there yet. We use both cnp and protobufs at work, and I vastly prefer protobufs, even for C++. I only wish they stayed the hell away from abseil, though.
The thing is a huge pain to manage as a dependency, especially if you wander away from the official google-approved way of doing things. Protobuf went from a breeze to use to the single most common source of build issues in our cross-platform project the moment they added this dependency. It's so bad that many distros and package managers keep the pre-abseil version as a separate package, and many just prefer to get stuck with it rather than upgrade. Same with other google libraries that added abseil as a dependency, as far as I'm aware
I prefer a little builtin backwards (and forwards!) compatibility (by always enforcing a length for each object, to be zero-padded or truncated as needed), but yes "don't fear adding new types" is an important lesson.
Protobufs aren’t new. They’re really just RPC over HTTPS. I used DCE-RPC in 1997, which had an IDL. I believe CORBA used an IDL as well, although I personally did not use it. There have been other attempts, like EJB, etc., which are pretty much the same paradigm.
The biggest plus with protobuf is the social/financial side and not the technology side. It’s open source and free from proprietary hacks like previous solutions.
Apart from that, distributed systems of which rpc is a sub topic are hard in general. So the expectation would be that it sucks.
Backwards compatibility is just not an issue in self-describing structures like JSON, Java serialization, and (dating myself) Hessian. You can add fields and you can remove fields. That's enough to allow seamless migrations.
It's only positional protocols that have this problem.
You can remove JSON fields, at the cost of breaking clients at runtime that expect those fields. Of course the same can happen with any deserialization library, but protobufs at least make it more explicit, and you may also be able to more easily track down consumers using older versions.
For the missing case, whenever I use json, I always start with a sane default struct, then overwrite those with the externally provided values. If a field is missing, it will be handled reasonably.
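A small Python sketch of that pattern (the config fields are invented for illustration):

```python
import json

# Start from a sane default struct, then overwrite with whatever the
# external JSON actually provides; missing fields fall back silently.
DEFAULTS = {"retries": 3, "timeout_s": 30.0, "verbose": False}

def load_config(raw: str) -> dict:
    cfg = dict(DEFAULTS)
    cfg.update(json.loads(raw))
    return cfg

# A payload from an older writer that predates "timeout_s" still loads:
cfg = load_config('{"retries": 5}')
```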
> Name another serialization declaration format that both (a) defines which changes can be made backwards-compatibly, and (b) has a linter that enforces backwards compatible changes.
TLV style binary formats are all you need. The “Type” in that acronym is a 32-bit number which you can use to version all of your stuff so that files are backwards compatible. Software that reads these should read all versions of a particular type and write only the latest version.
Code for TLV is easy to write and to read, which makes viewing programs easy. TLV data is fast for computers to write and to read.
Protobuf is overused because people are fucking scared to death to write binary data. They don’t trust themselves to do it, which is just nonsense to me. It’s easy. It’s reliable. It’s fast.
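For anyone who hasn't written one, a minimal TLV reader/writer really is only a few lines. This sketch uses little-endian 32-bit type and length fields, as described above:

```python
import struct

def write_record(buf: bytearray, rtype: int, value: bytes) -> None:
    # 32-bit type (doubles as a version), 32-bit length, then the bytes.
    buf += struct.pack("<II", rtype, len(value)) + value

def read_records(data: bytes):
    offset = 0
    while offset < len(data):
        rtype, length = struct.unpack_from("<II", data, offset)
        offset += 8
        yield rtype, data[offset:offset + length]
        offset += length  # unknown types can be skipped for free

buf = bytearray()
write_record(buf, 1, b"hello")       # v1 of some record type
write_record(buf, 2, b"newer data")  # a type old readers simply skip
records = dict(read_records(bytes(buf)))
```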
A major value of protobuf is in its ecosystem of tools (codegen, lint, etc); it's not only an encoding. And you don't generally have to build or maintain any of it yourself, since it already exists and has significant industry investment.
Just FYI: an obligatory comment from the protobuf v2 designer.
Yeah, protobuf has lots of design mistakes, but this article is written by someone who does not understand the problem space. Most of the complexity of serialization comes from implementation compatibility between different timepoints. This significantly limits the design space.
To clarify: protobuf’s simplest change is adding a field to a message, so wrapping maps of maps, maps of fields, and oneof fields into messages makes these play to its strengths. It feels like over-engineering to turn your Inventory map of items into an Inventory message, but you will be grateful for it when you need a capacity field later.
>Most of the complexity of serialization comes from implementation compatibility between different timepoints.
The author talks about compatibility a fair bit, specifically the importance of distinguishing a field that wasn't set from one that was intentionally set to a default, and how protobuffs punted on this.
If you see some statements like below on the serialization topic:
> Make all fields in a message required. This makes messages product types.
> One possible argument here is that protobuffers will hold onto any information present in a message that they don't understand. In principle this means that it's nondestructive to route a message through an intermediary that doesn't understand this version of its schema. Surely that's a win, isn't it?
> Granted, on paper it's a cool feature. But I've never once seen an application that will actually preserve that property.
Then it is fair to raise eyebrows on the author's expertise. And please don't ask if I'm attached to protobuf; I can roast protocol buffers for their wrong designs for hours. It is just that the author makes a series of wrong claims, presumably due to their bias toward principled type systems and inexperience working on large-scale systems.
> If you see some statements like below on the serialization topic:
> Make all fields in a message required. This makes messages product types.
> Then it is fair to raise eyebrows on the author's expertise.
It's fair to raise eyebrows on your expertise, since required fields don't contribute to b/w incompatibility at all, as every real-world protocol has a mandatory required version number that's tied to a direct parsing strategy with strictly defined algebra, both for shrinking (removing data fragments) and growing (introducing data fragments) payloads. Zero-values and optionality in protobuf are one version of that algebra; it's the most inferior one, subject to lossy protocol upgrades, and the easiest one for amateurs to design. Then there's the next level, where the protocol upgrade is defined in terms of bijective functions and other elements of symmetric groups that can tell you whether a newly announced data change can be carried forward (new required field) or dropped (removed field), as long as both the sending and receiving ends are able to derive new compound structures from previously defined pervasive types (the things protobuf calls oneofs and messages, for example).
What you describe using many completely unnecessary mathematical terms is not only not found in “every real-world protocol”, but is in fact virtually absent from the overwhelming majority of actually used protocols, with the notable exception of the kind of protocol that gets a four-digit-numbered RFC document describing it. Believe it or not, but in the software industry, nobody is defining a new “version number” with “strictly defined algebra” when they want to add a new field to a communication protocol between two internal backend services.
> What you describe using many completely unnecessary mathematical terms
Unnecessary for you, surely.
> Believe it or not, but in the software industry, nobody is defining a new “version number” with “strictly defined algebra” when they want to add a new field to an communication protocol between two internal backend services.
Name a protocol that doesn't have a version number, or that lacks a defined algebra in the form of the spec clarifications that accompany each new version. The word "strictly" in "strictly defined algebra" has to do with the fact that you cannot evolve a protocol without publishing the changed spec; that is, you're strictly obliged to publish a spec, even a loosely defined one, with lots of omissions and zero-values. That's the inferior algebra protobuf has, but you can go on thinking it's unnecessary and doesn't exist.
Instead of just handwaving about whether it's necessary or not, why not point to any protocol that relies on that attribute, and we can then evaluate how important that protocol is?
Yeah. And for anyone curious about the actual content hidden under the jargon-kludge-FP-nerd parent comment, here's my attempt at deciphering it.
They seem to be saying that you have to publish code that can change a type from schema A to schema B... And back, whenever you make a schema B. This is the "algebra". The "and back" part makes it bijective. You do this at the level of your core primitive types so that it's reused everywhere. This is what they meant by "pervasive" and it ties into the whole symmetric groups thing.
Finally, it seems like when you're making a lossy change, where a bijection isn't possible, they want you to make it incompatible. i.e, if you replaced address with city, then you cannot decode the message in code that expects address.
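If I've deciphered it right, the shape of the idea in code is a pair of inverse converters shipped with each schema change. The example change here is invented (renaming a field, which is losslessly reversible):

```python
def upgrade(a: dict) -> dict:
    # schema A -> schema B: rename "addr" to "address"
    b = dict(a)
    b["address"] = b.pop("addr")
    return b

def downgrade(b: dict) -> dict:
    # schema B -> schema A: the exact inverse
    a = dict(b)
    a["addr"] = a.pop("address")
    return a

msg = {"addr": "12 Main St", "name": "Ada"}
assert downgrade(upgrade(msg)) == msg  # bijective, so the change is safe
```

A change like replacing address with city has no such inverse, which on this view is exactly when the versions must be treated as incompatible.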
> since required fields don't contribute to b/w incompatibility at all, as every real-world protocol has a mandatory required version number that's tied to a direct parsing strategy with strictly defined algebra
I know at least 10 different tech companies with billion-dollar revenues that do not fit your description. This comment makes me wonder if you have any experience working on real-world distributed systems. Oh, and I'm pretty sure you did not read Kenton's comment; he already addressed your point precisely:
> This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.
I recommend doing your homework before making such a strong argument. Reading a 5-minute-long comment is not that hard. You can avoid a lot of shame by doing so.
> Granted, on paper it’s a cool feature. But I’ve never once seen an application that will actually preserve that property.
Chances are, the author literally used software that does it as he wrote these words. This feature is critical to how Chrome Sync works. You wouldn’t want to lose synced state if you use an older browser version on another device that doesn’t recognize the unknown fields and silently drops them. This is so important that at some point Chrome literally forked protobuf library so that unknown fields are preserved even if you are using protobuf lite mode.
I'm starting to wonder if some of those bad design decisions are symptoms of a larger "cultural bias" at Google. Specifically the "No Compositionality" point: It reminds me of similar bad designs in Go, CSS and the web platform at large.
The pattern seems to be that generalized, user-composable solutions are discouraged in favor of a myriad of special constructs that satisfy whatever concrete use cases seem relevant for the designers in the moment.
This works for a while and reduces the complexity of the language upfront, while delivering results - but over time, the designs devolve into a rat's nest of hyperspecific design features with awkward and unintuitive restrictions.
Eventually, the designers might give up and add more general constructs to the language - but those feel tacked on and have to coexist with specific features that can't be removed anymore.
It works both ways. General constructs tend to become overly abstract and you end up with sneaky errors in different places due to a minor change to an abstraction.
Like the old adage, this is just a matter of preference. Good software engineering requires, first and foremost, great discipline, regardless of the path or tool you choose.
If there are errors in implementation of general constructs, they tend to be visible at their every use, and get rapidly fixed.
Some general constructs are better than others, because they have an algebraic theory behind them, and sometimes that theory has already been researched for a few hundred years.
For example, the product/coproduct types mentioned in the article are quite close to the addition and multiplication we all learned in school, and obey the same laws.
So there are several levels at which the choice of ad-hoc constructs is wrong, and in the end the only valid reason to choose them is time constraints.
If they had 24 years to figure out how to do it properly, but they didn't, the technology is just dead.
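The school-arithmetic correspondence can be checked concretely for finite types: counting inhabitants, products multiply and coproducts add.

```python
from itertools import product

A = [False, True]             # a type with 2 inhabitants
B = ["red", "green", "blue"]  # a type with 3 inhabitants

pairs = list(product(A, B))                             # product type A * B
tagged = [("L", a) for a in A] + [("R", b) for b in B]  # coproduct A + B

assert len(pairs) == len(A) * len(B)   # 2 * 3 = 6
assert len(tagged) == len(A) + len(B)  # 2 + 3 = 5
```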
I've certainly run into cases where small changes in general systems led to hard-to-detect bugs, which took a great deal of investigation to figure out. Not all failures are catastrophic.
The technology is quite alive, which is why it hasn't been 'fixed' - changing the wheels on a moving car, and all that. The actual disappointment is that a better alternative hasn't taken off in the six years since this post was written... If it's so easy, where are the alternatives?
> This works for a while and reduces the complexity of the language upfront, while delivering results - but over time, the designs devolve into a rat's nest of hyperspecific design features with awkward and unintuitive restrictions.
I share the author's sentiment. I hate these things.
True story: trying to reverse engineer macOS Photos.app sqlite database format to extract human-readable location data from an image.
I eventually figured it out, but it was:

- a base64-encoded
- binary plist format
- with one field containing a ProtoBuffer
- which contained another ProtoBuffer
- which contained a unicode string
- which contained improperly encoded data (for example, U+2013 EN DASH was encoded as \342\200\223)
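Incidentally, that last layer is recoverable: \342\200\223 is just the octal spelling of the UTF-8 bytes 0xE2 0x80 0x93 for U+2013. Assuming the escapes appear as literal backslash-octal text in an otherwise ASCII string, undoing it is short:

```python
import re

def decode_octal_escapes(s: str) -> str:
    # Turn each \NNN octal escape into its raw byte value...
    raw = re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), s)
    # ...then reinterpret those byte values as UTF-8.
    return raw.encode("latin-1").decode("utf-8")

decode_octal_escapes(r"2013\342\200\2232015")  # "2013–2015", en dash restored
```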
Sure you can look at it[1], but you're not expected to look at Apple Photos database. The computer is.
Write a correct JSON parser, compare with protobuf on various metrics, and then we can talk.
[1]: Although, to be fair, I am older than the kids whose first programming language was JavaScript, so I do not think of the JSON object format (property names in quotes, integers that need to be wrapped as strings to be safe, no comma after the last entry; to be fair, that last one is a problem in writing, not reading, JSON) as the most natural thing.
I'm also "older" but I don't think that means anything.
> Sure you can look at it[1], but you're not expected to look at Apple Photos database.
How else are you supposed to figure it out? If you're older then you know that you can't rely on the existence or correctness of documentation. Being able to look at JSON and understand it as a human on the wire is a huge advantage. JSON being pretty simple in structure is an advantage. I don't see a problem with quoting property names! As for large integers and datetimes, yes, that could be much better designed. But that's true of every protocol and file format that has any success.
JSON parsers and writers are common and plentiful and are far less crazy than any complete XML parser/writer library.
> Being able to look at JSON and understand it as a human on the wire is huge advantage
I don’t think this is a given at all. It depends on the context, and I think it’s often overvalued. A lot of the time, performance matters more. Even if human readability were the only thing that mattered, I would still not count JSON as the winner: realistically, you will have to pipe it to jq, and you’d do the same for any other serialization format. Inside Google, where proto is prevalent, that is just as easy, if not more convenient.
The point is that how hard or easy it is for an app’s end user to decipher its file database is not a design goal for the serialization library chosen by the Apple Photos developers here. The constraints and requirements are all on different axes.
The JSON version would have also had the wrong encoding; all formats are just a framing for data fed in from code written by a human. In the Mac's case, the en dash will always be an issue, because that's just what macOS decided on intentionally.
I mean... you can nest-encode stuff in any serial format. You're not describing a problem either intrinsic or unique to Protobuf, you're just seeing the development org chart manifested into a data structure.
Good points. This wasn't entirely a protobuf-specific issue so much as a (likely hierarchical and historical) set of bad decisions to use it at all.
Using protobuffers for a few KB of metadata, when the photo library otherwise takes multiple GB of data, is just penny-wise, pound-foolish.
Of course, even my preference for a simple JSON string would be problematic: data in a database really should be stored properly normalized to a separate table and fields.
My guess is that protobuffers did play a role here in causing this poor design. I imagine this scenario:
- Photos.app wants to look up location data
- the server returns structured data in a ProtoBuffer
- there's no easy or reasonable way to map a protobuf to database fields (one point of TFA)
- Surrender! just store the binary blob in SQLITE and let the next poor sod deal with it
You have to take into account the fact that iPhoto app has had many iterations. The binary plist stuff is very likely the native NSArchive "object archiving (serialization)" that is done by Obj-C libraries. They probably started using protobuf at some point later after iCloud. I suspect the unicode crap you are facing may even predate Cocoaization of the app (they probably used Carbon API).
So it would make it a set of historical decisions, but I am not convinced they are necessarily bad decisions given the constraints. Each layer is likely responsible for handling edge cases in the application that you and I are not privy to.
That's horrendous. For some reason I imagine Apple's software to be much cleaner, but I guess that's just the marketing getting to my head. Under the hood it's still the same spaghetti.
Yeah, the problem is Apple and all the other contemporary tech companies have engineers bounce around between them all the time, and they take their habits with them.
At some point there becomes a critical mass of xooglers in an org, and when a new use case happens no one bothers to ask “how is serialization typically done in Apple frameworks”, they just go with what they know. And then you get protobuf serialization inside a plist. (A plist being the vanilla “normal” serialization format at Apple. Protobuf inside a plist is a sign that somebody was shoehorning what they’re comfortable with into the code.)
There are a lot of great comments on these old threads, and I don't think there's a lot of new science in this field since 2018, so the old threads might be a better read than today's.
I don't know if the author is right or wrong; I've never dealt with protobufs professionally. But I recently implemented them for a hobby project and it was kind of a game-changer.
At some stage with every ESP or Arduino project, I want to send and receive data, i.e. telemetry and control messages. A lot of people use ad-hoc protocols or HTTP/JSON, but I decided to try the nanopb library. I ended up with a relatively neat solution that just uses UDP packets. For my purposes a single packet has plenty of space, and I can easily extend this approach in the future. I know I'm not the first person to do this but I'll probably keep using protobufs until something better comes along, because the ecosystem exists and I can focus on the stuff I consider to be fun.
Embedded/constrained UDP is where the protobuf wire format (but not Google's libraries) rocks: IoT over cellular and such, where you need to fit everything into a single datagram (the number of roundtrips is what determines power consumption). As to those who say "UDP is unreliable" - what you do is implement ARQ at the application level. Just like TCP does it, except you don't have to waste roundtrips on the SYN/SYN-ACK/ACK handshake nor waste bytes on sending data that are no longer relevant.
Varints for the win. Send time series as columns of varint arrays - delta or RLE compression becomes quite straightforward. And as a bonus I can just implement new fields in the device and deploy right away - the server-side support can wait until we actually need it.
No, flatbuffers/cap'n'proto are unacceptably big because of fixed layout. No, CBOR is an absolute no go - why on earth would you waste precious bytes on schema every time? No, general-purpose compression like gzip wouldn't do much on such a small size, it will probably make things worse. Yes, ASN is supposed to be the right solution - but there is no full-featured implementation that doesn't cost $$$$ and the whole thing is just too damn bloated.
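The delta-plus-varint trick is small enough to sketch from scratch. A minimal base-128 varint with ZigZag for signed deltas - the function names and the 64-bit assumption are mine, not from any particular library:

```python
def zigzag(n: int) -> int:
    # Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3
    # (assumes values fit in 64 bits, as protobuf's sint64 does)
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    # Base-128 varint: 7 payload bits per byte, high bit = "more bytes follow"
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_series(samples: list[int]) -> bytes:
    # Delta-encode, then varint each delta: readings that change slowly
    # cost one byte each after the first absolute value
    out, prev = bytearray(), 0
    for s in samples:
        out += varint(zigzag(s - prev))
        prev = s
    return bytes(out)
```

Three Unix timestamps one second apart encode to 7 bytes here: 5 for the first absolute value, then 1 per delta.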
Kinda fun that it sucks for what it is supposed to do, but actually shines elsewhere.
> Yes, ASN is supposed to be the right solution - but there is no full-featured implementation that doesn't cost $$$$ and the whole thing is just too damn bloated.
Oh for crying out loud! PB had ZERO tooling available when it was created! It would have been much easier to create ASN.1 tooling w/ OER/PER for some suitable subset of ASN.1 in 2001 than it was to a) create an IDL, b) create an encoding, and c) write tooling for N programming languages.
In fact, one thing one could have done is write a transpiler from the IDL to an AST that does all linting, analysis, and linking, and which one can then use to drive codegen for N languages. Or even better: have the transpiler produce a byte-coded representation of the modules and then for each programming language you only need to codegen the types but not the codecs -- instead for each language you need only write the interpreter for the byte-coded modules. I know because I've extended and maintained an [open source] ASN.1 compiler that fucking does [some of] these things.
Stop spreading this idea that ASN.1 is bloated. It's not. You can cut it down for your purposes. There are only 4 specifications for the language itself, of which the base one (X.680) is enough for almost everything (the others, X.681, X.682, and X.683, are mainly for parameterized types and formal typed hole specifications [the ASN.1 "information object system"], which are awesome but you can live without). And these are some of the best-written and most-readable specifications ever written by any standards development organization -- they are a great gift from a few to all of mankind.
> It would have been much easier to create ASN.1 tooling w/ OER/PER and for some suitable subset of ASN.1 in 2001
Just by looking at your past comments - I agree that if Google had reused ASN.1, we would have lived in a better world. But the sad reality now is that PB has tons of FOSS tooling and ASN.1 barely any (is there any free embedded-grade implementation other than asn1c?), and figuring out what features you can use without having to pledge your kidney and soul to Nokalva is a bit hard.
I tried playing with ASN.1 before settling on protobuf. I don't recall which compiler I used, but I immediately figured out that apparently the datetime datatype is not supported, and the generated C code was a bloated mess (so is Google's protobuf - but not nanopb). Protobuf, on the other hand, was quite straightforward about what is and is not supported. So us mortals who aren't Google and have a hard time justifying writing serdes from scratch gotta use what's available.
> Stop spreading this idea that ASN.1 is bloated
"Bloated" might be the wrong word - but it is large and it's damn hard for someone designing a new application to figure out which part is safe to use, because most sources focus on using it for decoding existing protocols.
> why on earth would you waste precious bytes on schema every time
cbor doesn't prescribe sending schema, in fact there is no schema, like json.
i just switched from protobuf to cbor because i needed better streaming support and find it quite delightful. losing the protobuf schema hurts a bit, but the amount of boilerplate code is actually less than what i had before with nanopb (embedded context). on top of it, i am saving approx. 20% in message size compared to protobuf bc i am using mostly arrays with fixed position parameters.
> cbor doesn't prescribe sending schema, in fact there is no schema, like json.
You are right, I must have confused CBOR with BSON where you send field names as strings.
>on top of it, i am saving approx. 20% in message size compared to protobuf bc i am using mostly arrays with fixed position parameters
Arrays with fixed positions are always going to be the most compact format, but that means you essentially give up on schema evolution. Also, when you have a large structure (e.g. a full set of device state and settings) where most of the fields only change infrequently, it makes sense to only send what's changed, and then TLV is significantly better.
Other than ASN.1 PER, is there any other widely used encoding format that isn't self-describing? Using TLV certainly adds flexibility around schema evolution, but I feel like collectively we are wasting a fair amount of bytes because of it...
Cap'n'proto doesn't have tags, but it wastes even more bytes in favor of speed. Then again, omitting tags only saves space if you are sending all the fields every time. PER uses a bitmap, which is still a bit wasteful on large sparse structs.
Using protobuf is practical enough in embedded. This person isn't the first and won't be the last. Way faster than JSON, way slower than C structs.
However protobuf is ridiculously interchangeable and there are serializers for every language. So you can get your interfaces fleshed out early in a project without having to worry that someone will have a hard time ingesting it later on.
Yes it's a pain how an empty array is a valid instance of every message type, but at least the fields that you remember to send are strongly typed. And field optionality gives you a fighting chance that your software can still speak to the unit that hasn't been updated in the field for the last five years.
On the embedded side, nanopb has worked well for us. I'm not missing having to hand-maintain ad-hoc command parsers on the embedded side, nor working around quirks and bugs of those parsers on the desktop side.
The reasons for that line get at a fundamental tension. As David Wheeler famously said, "All problems in computer science can be solved by another level of indirection, except for the problem of too many indirections."
Over time we accumulate cleverer and cleverer abstractions. And any abstraction that we've internalized, we stop seeing. It just becomes how we want to do things, and we have no sense of what cost we are imposing with others. Because all abstractions leak. And all abstractions pose a barrier for the maintenance programmer.
All of which leads to the problem that Brian Kernighan warned about with, "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?" Except that the person who will have to debug it is probably a maintenance programmer who doesn't know your abstractions.
One of the key pieces of wisdom that show through Google's approaches is that our industry's tendency towards abstraction is toxic. As much as any particular abstraction is powerful, allowing too many becomes its own problem. This is why, for example, Go was designed to strongly discourage over-abstraction.
Protobufs do exactly what it says on the tin. As long as you are using them in the straightforward way which they are intended for, they work great. All of his complaints boil down to, "I tried to do some meta-manipulation to generate new abstractions, and the design said I couldn't."
That isn't the result of them being written by amateurs. That's the result of them being written to incorporate a piece of engineering wisdom that most programmers think that they are smart enough to ignore. (My past self was definitely one of those programmers.)
Can the technology be abused? Do people do stupid things with them? Are there things that you might want to do that you can't? Absolutely. But if you KISS, they work great. And the more you keep it simple, the better they work. I consider that an incentive towards creating better engineered designs.
I think you nailed it. So many complaints about Go for example basically come down to "it didn't let me create X abstraction" and that's basically the point.
IMO it's a pretty reasonable claim about experience level, not intelligence, and isn't at all an ad hominem attack because it's referring directly to the fundamental design choices of protocol buffers and thus is not at all a fallacy of irrelevance.
Whatever else Jeff Dean and Sanjay Ghemawat are, and whatever mistakes they made in designing protobufs, they are not amateurs.
Not long after they designed and implemented protobuffers, they shared the ACM prize in computing, as well as many other similar honors. And the honors keep stacking up.
None of this means that protobufs are perfect (or even good), but it does mean they weren't amateurs when they did it.
Yeah, let's pretend that type algebra doesn't exist, and even if it does exist then it's not useful and definitely isn't practical in data protocols. Let's believe that the authors of protobuf considered everything, and since they aren't amateurs (by the virtue of having worked on protobuf at Google, presumably), every elaborated opinion that draws them as amateurs at applying type algebra in data protocol designs is a personal ad-hominem attack.
They're not amateurs by virtue of being some of the most senior engineers ever to work at Google. You don't get to play the "ad hominem" card while calling them names. This whole thread is embarrassing.
Ok, "some of the most senior engineers ever to work at Google" don't seem to know that static bounds checking doesn't require dependent types: https://news.ycombinator.com/item?id=45150008
> You don't get to play the "ad hominem" card while calling them names
The entire article explains it at length why there's the impression, it's not ad-hominem.
It's a terrible attitude and I agree that sort of thing shouldn't be (and generally isn't) tolerated for long in a professional environment.
That said the article is full of technical detail and voices several serious shortcomings of protobuf that I've encountered myself, along with suggestions as to how it could be done better. It's a shame it comes packaged with unwarranted personal attacks.
Imagine calling google amateurs, and then the only code you write has a first-year student error: failing to distinguish the assignment operator from the comparison operator.
There's a class of rant on the internet where programmers complain about increasingly foundational tech instead of admitting skill issues. If you go far enough down that hole, you end up rewriting the kernel in Rust.
I'm afraid that this is a case of someone imagining that there are Platonic ideal concepts that don't evolve over time, that programs are perfectible. But people are not immortal and everything is always changing.
I almost burst out in laughter when the article argued that you should reuse types in preference to inlining definitions. If you've ever felt the pain of needing to split something up, you would not be so eager to reuse. In a codebase with a single process, it's pretty trivial to refactor to split things apart; you can make one CL and be done. In a system with persistence and distribution, it's a lot more awkward.
That whole meaning of data vs representation thing. There's fundamentally a truth in the correspondence. As a program evolves, its understanding of its domain increases, and the fidelity of its internal representations increase too, by becoming more specific, more differentiated, more nuanced. But the old data doesn't go away. You don't get to fill in detail for data that was gathered in older times. Sometimes, the referents don't even exist any more. Everything is optional; what was one field may become two fields in the future, with split responsibilities, increased fidelity to the domain.
Yeah, oneof fields can't be repeated, but you can just wrap them in a message. It's not as pretty, but I've never had any issues with this.
The fact that the author is arguing for making all fields required means they don't understand the reasoning for why all fields are optional. Required fields break systems when there are proto mismatches (there are postmortems outlining this).
I'm not sure why this post gets boosted every few years- and unfortunately (as many have pointed out) the author demonstrates here that they do not understand distributed system design, nor how to use protocol buffers. I have found them to be one of the most useful tools in modern software development when used correctly. Not only are they much faster than JSON, they prevent the inevitable redefinition of nearly identical code across a large number of repos (which is what i've seen in 95% of corporate codebases that eschew tooling such as this). Sure, there are alternatives to protocol buffers, but I have not seen them gain widespread adoption yet.
Protobuf's original sin was failing to distinguish zero/false from undefined/unset/nil. Confusion around the semantics of a zero value are the root of most proto-related bugs I've come across. At the same time, that very characteristic of protobuf makes its on-wire form really efficient in a lot of cases.
Nearly every other complaint is solved by wrapping things in messages (sorry, product types). I don't get the enum limitation on map keys; that complaint is fair.
Protobuf eliminates truckloads of stupid serialization/deserialization code that, in my embedded world, almost always has to be hand-written otherwise. If there was a tool that automatically spat out matching C, Kotlin, and Swift parsers from CDDL, I'd certainly give it a shot.
> Protobuf's original sin was failing to distinguish zero/false from undefined/unset/nil.
It's only proto3 that doesn't distinguish between zero and unset by default. Both the earlier and later versions support it.
Proto3 was a giant pile of poop in most respects, including removing support for field presence. They eventually put it back in as a per-field opt-in property, but by then the damage was done.
A huge unforced mistake, but I don't think a change made after the library had existed for 15 years and reverted qualifies as an "original sin".
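The zero-versus-unset ambiguity is visible right in the wire format. A minimal sketch of proto3-style scalar encoding (helper names are mine; this covers only varint-typed fields):

```python
def varint(n: int) -> bytes:
    # Base-128 varint, as used by protobuf
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def int_field(field_no: int, value: int) -> bytes:
    # proto3 semantics for a plain int32/int64 field (no `optional` keyword):
    # the default value, zero, is simply never put on the wire
    if value == 0:
        return b""
    return varint((field_no << 3) | 0) + varint(value)  # tag (wire type 0) + value
```

`int_field(1, 7)` yields `b'\x08\x07'`, while `int_field(1, 0)` yields no bytes at all, so a decoder has no way to tell an explicit zero from a field nobody ever set.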
Agreed the CDDL to codegen pipeline / tooling is the biggest thing holding back CBOR at the moment.
Some solutions do exist like here’s a C one[1] which maybe you could throw in some WASI / WASM compilation and get “somewhat” idiomatic bindings in a bunch of languages.
Here’s another for Rust [2] but I’m sure I’ve seen a bunch of others around. I think what’s missing is a unified protoc style binary with language specific plugins.
> Protobuffers correspond to the data you want to send over the wire, which is often related but not identical to the actual data the application would like to work with
This sums up a lot of the issues I’ve seen with protobuf as well. It’s not an expressive enough language to be the core data model, yet people use it that way.
In general, if you don’t have extreme network needs, then protobuf seems to cause more harm than good. I’ve watched Go teams spend months of time implementing proto based systems with little to no gain over just REST.
On the other hand, ASN.1 is very expressive and can cover pretty much anything, but Protobuf was created because people thought ASN.1 was too complex. I guess we can't have both.
If people thought ASN.1 was too big, all they had to do was create a small profile of it large enough for the task at hand.
X.680 is fairly small. Require AUTOMATIC TAGs, remove manual tagging, remove REAL and EMBEDDED PDV and such things, and what you're left with is pretty small.
Things people say who know very little about ASN.1:
- it's bloated! (it's not)
- it's had lots of vulnerabilities! (mainly in hand-coded codecs)
- it's expensive (it's not -- it's free and has been for two decades)
- it's ugly (well, sure, but so is PB's IDL)
- the language is context-dependent, making it harder to write a parser for (this is quite true, but so what, it's not that big a deal)
The vulnerabilities were only ever in implementations, and almost entirely in cases of hand-coded codecs, and the thing that made many of these vulnerabilities possible was the use of tag-length-value encoding rules (BER/DER/CER), which, ironically, Protocol Buffers bloody well uses too.
If you have different objections to ASN.1, please list them.
> * Make all fields in a message required. This makes messages product types.
Meanwhile in the capnproto FAQ:
>How do I make a field “required”, like in Protocol Buffers?
>You don’t. You may find this surprising, but the “required” keyword in Protocol Buffers turned out to be a horrible mistake.
I recommend reading the rest of the FAQ [0], but if you are in a hurry: fixed-schema protocols like protobuffers do not let you remove fields the way self-describing formats such as JSON do. Removing fields, or switching them from required to optional, is an ABI-breaking change. Nobody wants to update all servers and all clients simultaneously. At that point, you would be better off defining a new API endpoint and deprecating the old one.
The capnproto FAQ also brings up the fact that validation should be handled at the application level rather than the ABI level.
I lost the plot here when the author argued that repeated fields should be implemented as in the pure lambda calculus...
Most of the other issues in the article can be solved by wrapping things in more messages. Not great, not terrible.
As with the tightly-coupled issues with Go, I'll keep waiting for a better approach any decade now. In the meantime, both tools (for their glaring imperfections) work well enough, solve real business use cases, and have a massive ecosystem moat that makes them easy to work with.
They didn't. Pure lambda calculus would have been "a function that when applied to a number encoded as a function, extracts that value".
They did it essentially as a linked list, C-strings, or UTF-8 characters: "current data, and is there more (next pointer, non-null byte, continuation bit set)?" They also noted that it could have this semantics without necessarily following this implementation encoding, though that seems like a dodge to me; length-prefixed array is a perfectly fine primitive to have, and shouldn't be inferred from something that can map to it.
I recently realized that I can use MessagePack with a static schema defined in the code, and even pre-defined numeric field IDs, essentially replacing Protobuf for my use cases. I saw MessagePack as an alternative to JSON, with loose message structure, but it's actually a nice binary format and can be used more effectively than that. So now I enjoy things like tagged unions (in Zig/Python) and other types that are awkward to express in Protobuf. I settled on single-character field names, for compatibility with msgspec, and I'm pretty happy with it. Still super-compact messages with a predictable schema that are fast to parse, because I know which fields to expect.
I went into this article expecting to agree with part of it. I came away agreeing with all of it. And I want to point out that Go also shares some of these catastrophic data decisions (automatic struct zero values that silently do the wrong thing by default).
We got bit by a default value in a DMS task where the target column didn't exist so the data wasn't replicated and the default value was "this work needs to be done."
This is not pb nor go. A sensible default of invalid state would have caught this. So would an error and crash. Either would have been better than corrupt data.
So, that target column had the wrong name, meaning data intended for the column never arrived, causing the default value in the database to be used. That default was an integer that mapped to "this work item still needs to be processed," which led to double-processing the record after the DMS migration.
With these serialization libraries, do any of them have a facility that allows you to specify a wire format and an application format, with recipes for converting one to the other?
I haven't used these very seriously, but a problem I had a while back was that the wire format was not what the applications wanted to use, while a good application format was too space-inefficient for the wire.
As far as I could see there was not a great way to do this. You could rewrite wire<->app converter in every app, or have a converter program and now you essentially have two wire formats and need to put this extra program and data movement into workflows, or write a library and maintain bindings for all your languages.
The way to do this starts with not hard-wiring the code generation step.
Instead, make codegen a function of BOTH a data schema object and a code template (eg expressed in Jinja2 template language - or ZeroMQ GSL where I first saw this approach). The codegen stage is then simply the application of the template to the data schema to produce a code artifact.
The templates are written assuming the data schema is provided following a meta-schema (eg JSON Schema for a data schema in JSON). One can develop, eg per-language templates to produce serialization code or intra-language converters between serialization forms (on wire) and application friendly forms. The extra effort to develop a template for a particular target is amortized as it will work across all data schemas that adhere to a common meta-schema.
The "codegen" stage can of course be given non "code" templates to produce, eg, reference documentation about the data schema in different formats like HTML, text, nroff/man, etc.
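A toy version of that pipeline, using Python's stdlib string.Template in place of Jinja2/GSL (the schema and templates here are invented for illustration):

```python
from string import Template

# Data schema (would normally be loaded from JSON and validated
# against a meta-schema such as JSON Schema)
schema = {"name": "Point", "fields": [("x", "double"), ("y", "double")]}

# Per-language templates; codegen is just template applied to schema
c_struct = Template("struct $name {\n$members};\n")
c_member = Template("    $ctype $fname;\n")

def codegen(schema: dict) -> str:
    members = "".join(
        c_member.substitute(ctype=t, fname=f) for f, t in schema["fields"]
    )
    return c_struct.substitute(name=schema["name"], members=members)

print(codegen(schema))
```

Swapping in a different template yields serialization code, converters, or documentation for the same schema, which is the amortization argument above.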
If you care about network bandwidth you can compress before sending, as virtually all web applications do. Then you don't need to worry much about the space efficiency of the application format.
Of the wire format, you mean? I compress it and still need to care about the space efficiency of the wire format beyond that. The compression ratio does improve a lot when we skip our own compact encoding, but the end result is still significantly larger. It also becomes significantly slower, because there is more data to process, which is possibly the bigger problem.
It's probably not like most web applications; it's hardware data loggers that produce hundreds of millions to billions of events per second (each with a minimum of about 4 bytes of wire format and a maximum of roughly 500 bytes).
Sometimes you are integrating with system that already use proto though. I recently wrote a tiny, dependency-free, practical protobuf (proto3) encoder/decoder. For those situations where you need just a little bit of protobuf in your project, and don't want to bother with the whole proto ecosystem of codegen and deps: https://github.com/allanrbo/pb.py
> Maintain a separate type that describes the data you actually want, and ensure that the two evolve simultaneously.
I don't actually want to do this, because then you have N + 1 implementations of each data type, where N = number of programming languages touching the data, and + 1 for the proto implementation.
What I personally want to do is use a language-agnostic IDL to describe the types that my programs use. Within Google you can even do things like just store them in the database.
The practical alternative is to use JSON everywhere, possibly with some additional tooling to generate code from a JSON schema. JSON is IMO not as nice to work with. The fact that it's also slower probably doesn't matter to most codebases.
> I don't actually want to do this, because then you have N + 1 implementations of each data type, where N = number of programming languages touching the data, and + 1 for the proto implementation.
I think this is exactly what you end up with using protobuf. You have an IDL that describes the interface types but then protoc generates language-specific types that are horrible so you end up converting the generated types to some internal type that is easier to use.
Ideally if you have an IDL that is more expressive then the code generator can create more "natural" data structures in the target language. I haven't used it a ton, but when I have used thrift the generated code has been 100x better than what protoc generates. I've been able to actually model my domain in the thrift IDL and end up with types that look like what I would have written by hand so I don't need to create a parallel set of types as a separate domain model.
> The practical alternative is to use JSON everywhere, possibly with some additional tooling to generate code from a JSON schema.
Protobuf has a bidirectional JSON mapping that works reasonably well for a lot of use cases.
I have used it to skip the protobuf wire format all together and just use protobuf for the IDL and multi-language binding, both of which IMO are far better than JSON-Schema.
JSON-Schema is definitely more powerful though, letting you do things like field-level constraints. I'd love to see tooling that paired the best of both.
Something like MessagePack or CBOR, and if you want versioning, just have a version field at the start. You don't require a schema to pack/unpack, which I personally think is a good thing.
Support across languages etc. is much less mature, but I find the Thrift serialization format to be much nicer than protobuf. The codegen somehow manages to produce types that look like types I would actually write, compared to the monstrosities that protoc generates.
Always initializing with a default and no algebraic types is an always loaded foot gun. I wonder if the people behind golang took inspiration from this.
The simplest way to understand Go is that it is a language that integrates some of Google's best C++ features (their lightweight threads and other multithreading primitives are the highlights).
Beyond that it is a very simple language. But yes, 100%, for better and worse, it is deeply inspired by Google's codebase and needs
The crappy system that everyone ends up using is better than the perfectly designed system that's only seen in academic papers. Javascript is the poster-child of Worse is Better. Protobuffs are a PITA, but they are widely used and getting new adoption in industry.
https://en.wikipedia.org/wiki/Worse_is_better
I worked at a company that had their own homegrown Protobuf alternative which would add friction to life constantly. Especially if you had the audacity to build anything that wasn't meant to live in the company monorepo (your Python script is now a Docker image that takes 30 minutes to build).
One day I got annoyed enough to dig for the original proposal and like 99.9% of initiatives like this, it was predicated on:
- building a list of existing solutions
- building an overly exhaustive list, of every facet of the problem to be solved
- declare that no existing solution hits every point on your inflated list
- "we must build it ourselves."
It's such a tired playbook, but it works so often unfortunately.
The person who architects and sells it gets points for "impact", then eventually moves onto the next company.
In the meantime the problem being solved evolves and grows (as products and businesses tend to), the homegrown solution no longer solves anything perfectly, and everyone is still stuck dragging along said solution, seemingly forever.
-
Usually eventually someone will get tired enough of the homegrown solution and rightfully question why they're dragging it along, and if you're lucky it gets replaced with something sane.
If you're unlucky that person also uses it as justification to build a new in-house solution (we're built the old one after all), and you replay the loop.
In the case of serialization though, that's not always doable. This company was storing petabytes (if not exabytes) of data in the format for example.
The author makes good arguments; I wish they'd offered some alternatives.
Despite issues, protobufs solve real problems and (imo) bring more value than cost to a project. In particular, I'd much rather work with protobufs and their generated ser/de than untyped json
the complaints about the Protobuf type system being not flexible enough are also really funny to read.
fundamentally, the author refuses to contend with the fact that the context in which Protobufs are used -- millions of messages strewn around random databases and files, read and written by software using different versions of libraries -- is NOT the same scenario where you get to design your types once and then EVERYTHING that ever touches those types is forced through a type checker.
again, this betrays a certain degree of amateurishness on the author's part.
> is NOT the same scenario where you get to design your types once and then EVERYTHING that ever touches those types is forced through a type checker.
the author never claimed the types had to be designed only once; he claimed that the schema evolution model protobuf chose is inadequate for the purpose of lossless evolution.
> type algebra either doesn't exist or impractical because only PL theorists know about it, not Kenton.
Hi I'm Kenton. I, too, was enamored with advanced PL theory in college. Designed and implemented my own purely-functional programming language. Still wish someone would figure out a working version of dependent types for real-world use, mainly so we could prove array bounds-safety without runtime checks.
In two decades building real-world complex systems, though, I've found that getting PL theory right is rarely the highest-leverage way to address the real problems of software engineering.
> Still wish someone would figure out a working version of dependent types for real-world use, mainly so we could prove array bounds-safety without runtime checks.
Hi Kenton, I'm not sure what kind of PL theory you studied in college, but "array bounds-safety without runtime checks" doesn't require dependent types. It's being proven with several available SMT solvers as of right now; just ask the LLVM folks about their "LLVM_ENABLE_Z3_SOLVER" compiler flag, the one people build their real-world solutions on.
By the way, you don't have to say "real-world" in every comment to appeal to your google years as a token of "real-world vs the rest of you". "But my team at google wouldn't use it", or something along that line, right?
Throwing a theorem-prover at the problem, unaided by developer hints, is not realistic in a large codebase. You need annotations that let you say "this array's size is the same as that array's" or "this integer is within the bounds of that array" -- that's dependent types.
> Throwing a theorem-prover at the problem, unaided by developer hints, is not realistic in a large codebase.
Please, Kenton, don't move your goalpost. Annotations, whether they come directly from a developer or from IR metadata, don't suddenly make a provided SAT constraint a "dependent type" component of your type system; it takes a bit more than that. Let's not miss the "types" in "dependent types". You don't modify the type systems of your languages to run SAT solvers in large codebases.
Truly, if you believe that annotations for the purpose of static bounds checking "is not realistic in a large codebase", I've got "google/pytype" and the entire Python community to sell to you.
I was suspicious of this piece. I'm not familiar with protobufs, but I'm aware they exist for efficiency on the wire. The fact that he only really talks about type systems, and not the before vs. after effect on the wire, was disappointing and also made me doubt whether this was a good piece.
I agree with the author that protobuf is bad and I ran into many of the issues mentioned. It's pretty much mandatory to add version fields to do backwards compatibility properly.
Recently, however, I had the displeasure of working with FlatBuffers. It's worse.
I'm more than a little curious what event caused such a strong objection to protobuffers. :D
I do tend to agree that they are bad. I also agree that people put a little too much credence in "came from Google." I can't bring myself to have this much anger towards it. Had to have been something that sparked this.
I'm just a frontend developer so most of my exposure is just as an API consumer and not someone working on the service side of things. That said:
A few years ago I moved to a large company where protobufs were the standard way APIs were defined. When I first started working with the generated TypeScript code, I was confused as to why almost all fields on generated object types were marked as optional. I assumed it was due to the way people were choosing to define the API at first, but then I learned this was an intentional design choice on the part of protobufs.
We ended up having to write our own code to parse the responses from the "helpfully" generated TypeScript client. This meant we also had to handle rejecting nonsensical responses where an actually required field wasn't present, which is exactly the sort of thing I'd want generated clients to do. I would expect to do some transformation myself, but not to that degree. The generated client was essentially useless to us, and the protocol's looseness offered no discernible benefit over any other API format I've used.
I imagine some of my other complaints could be solved with better codegen tools, but I think fundamentally the looseness of the type system is a fatal issue for me.
It used to be that there was no official TypeScript protobuf generator from Google and third-party generators sucked. Using protobufs from web browser or in nodejs was painful.
A couple of years ago Connect released a very good generator for TypeScript; we use it in production and it's great:
Yeah, as soon as you have a moderately complex type the generated code is basically useless. Honestly, ~80% of my gripes about protocol buffers could be alleviated by just allowing me to mark a message field as required.
Proto2 let you do this; the "required" keyword was removed because of the problems it introduces when evolving the schema in a system with many users you don't necessarily control. Say you want to add a new required field: if your system receives messages from clients, some of them may be sending you old data without the field, and now the parse step fails because it detects a missing field. If you ever want to remove a required field you have the opposite problem: there will be components that must have those fields present just to satisfy the parser, even if they're only interested in some other fields.
Philosophically, checking that a field is required or not is data validation and doesn't have anything to do with serialization. You can't specify that an integer falls into a certain valid range or that a string has a valid number of characters or is the correct format (e.g. if it's supposed to be an email or a phone number). The application code needs to do that kind of validation anyway. If something really is required then that should be the application's responsibility to deal with it appropriately if it's missing.
> Philosophically, checking that a field is required or not is data validation and doesn't have anything to do with serialization
But protocol buffers are not just a serialization format; they're an interface definition language. And not being able to communicate whether a field is required is very limiting. Sometimes things are required to process a message. If you need to add a new field but still be able to process older versions of the message where the field wasn't required (or didn't exist), you can just add it as optional.
I understand that in some situations you have very hard compatibility requirements and it makes sense to make everything optional and deal with it in application code, but adding a required attribute to fields doesn't stop you from doing that. You can still just make everything optional. You can even add a CI lint that prevents people from merging code with required fields. But making required fields illegal at the interface definition level just strikes me as killing a fly with a bazooka.
> Philosophically, checking that a field is required or not is data validation and doesn't have anything to do with serialization.
My issue is that people seem to like to use protobuf to describe the shape of APIs rather than just something to handle serialization. I think it's very bad at describing API shapes.
I think it is somewhat of a natural failure of DRY taken to the extreme? People seem to want to get it so that they describe the API in a way that is then generated for clients and implementations.
It is amusing, in many ways. This is specifically part of what WSDL aspired to, but people were betrayed by the big companies not having a common ground for what shapes they would support in a description.
> Let's say you want to add a new required field, if your system receives messages from clients some clients may be sending you old data without the field and now the parse step fails because it detects a missing field.
A parser doesn't inherently have to fail (compatibility mode), or lose the new field (passthrough mode), or allow divergence (strict mode). The fact that capnproto/parser authors don't realize that the same single protocol can operate in three different scenarios at the same time (strictly speaking: at boundaries vs. in middleware) should not lead you to think there are problems with required fields in protocols. This is one of the most bizarre kinds of FUD in the industry.
Hi, I'm the apparently-FUD-spreading Cap'n Proto author.
Sure! You could certainly imagine extending Protobuf or Cap'n Proto with a way to specify validation that only happens when you explicitly request it. You'd then have separate functions to parse vs. to validate a message, and then you can perform strict validation at the endpoints but skip it in middleware.
This is a perfectly valid feature idea which many people have entertained and even implemented successfully. But I tend to think it's not worth trying to have this in the schema language, because in order to support every kind of validation you might want, you end up needing a complete programming language. Plus, different components might have different requirements and therefore need different validation (e.g. middleware vs. endpoints). In the end I think it is better to write any validation functions in your actual programming language. But I can certainly see where people might disagree.
It gets super frustrating to have to empty/null check fields everywhere you use them, especially for fields that are effectively required for the message to make sense.
A very common example I see is Vec3 (just x, y, z). In proto2 you should be checking for the presence of x,y,z every time you use them, and when you do that in math equations, the incessant existence checks completely obscure the math. Really, you want to validate the presence of these fields during the parse. But in practice, what I see is either just assuming the fields exist in code and crashing on null, or admitting that protos are too clunky to use, and immediately converting every proto into a mirror internal type. It really feels like there's a major design gap here.
Don't get me started on the moronic design of proto3, where every time you see Vec3(0,0,0) you get to wonder whether it's the right value or mistakenly unset.
> It gets super frustrating to have to empty/null check fields everywhere you use them, especially for fields that are effectively required for the message to make sense.
That's why Protobuf and Cap'n Proto have default values. You should not bother checking for presence of fields that are always supposed to be there. If the sender forgot to set a field, then they get the default value. That's their problem.
> just assuming the fields exist in code and crashing on null
There shouldn't be any nulls you can crash on. If your protobuf implementation is returning null rather than a default value, it's a bad implementation, not just frustrating to use but arguably insecure. No implementation of mine ever worked that way, for sure.
Sadly, the default values are an even bigger source of bugs. We just caught another one at $work where a field was never being filled in, but the default values made it look fine. It caused hidden failures later on.
It's an incredibly frustrating "feature" to deal with, and causes lots of problems in proto3.
What happens if you mark a field as required and then you need to delete it in the future? You can't because if someone stored that proto somewhere and is no longer seeing the field, you just broke their code.
If you need to deserialize an old version then it's not a problem. The unknown field is just ignored during deserialization. The problem is adding a required field since some clients might be sending the old value during the rollout.
But in some situations you can be pretty confident that a field will be required always. And if you turn out to be wrong then it's not a huge deal. You add the new field as optional first (with all upgraded clients setting the value) and then once that is rolled out you make it required.
And if a field is in fact semantically required (like the API cannot process a request without the data in a field) then making it optional at the interface level doesn't really solve anything. The message will get deserialized but if the field is not set it's just an immediate error which doesn't seem much worse to me than a deserialization error.
1. Then it's not really required if it can be ignored.
2. This is the problem: software (and protos) can live for a long time. They might be used by other clients elsewhere that you don't control. What you thought was required might not be required 10 years down the line. What you "think" is not a huge deal then becomes a huge deal and can cause downtime.
3. You're mixing business logic and over-the-wire field requirements. If a field is required for an interface to function, you should be checking it anyway and returning the correct error. How does that change with proto supporting required?
> Then it's not really required if it can be ignored.
It can be required in v2 but not in v1 which was my point. If the client is running v2 while the server is still on v1 temporarily, then there is no problem. The server just ignores the new field until it is upgraded.
> This is the problem: software (and protos) can live for a long time. They might be used by other clients elsewhere that you don't control. What you thought was required might not be required 10 years down the line. What you "think" is not a huge deal then becomes a huge deal and can cause downtime.
Part of this is just that trying to create a format that is suitable both as an rpc wire serialization format and ALSO a format suitable for long term storage leads to something that is not great for either use case. But even taking that into account, RDBMS have been dealing with this problem for decades and every RDBMS lets you define fields as non-nullable.
> If a field is required for an interface to function, you should be checking it anyway and returning the correct error. How does that change with proto supporting required?
That's my point: you have to do that check in code, which clutters the implementation with validation noise. And you often can't use the wire message in your internal domain model, since you'd have to do that defensive null-check everywhere the object is used.
Aside from that, protocol buffers are an interface definition language so should be able to encode some of the validation logic at least (make invalid states unrepresentable and all that). If you are just looking at the proto IDL you have no way of knowing whether a field is really required or not because there is no way to specify that.
I mean, this is essentially the same lesson that database admins learn with nullable fields. Often it isn't the "deleting one is hard" so much as "adding one can be costly."
It isn't that you can't do it. But the code side of the equation is the cheap side.
To add to the sibling, I've seen this with Java enums a lot. People rush to consume the value as its enum type as fast as they can. That works well as long as the value is never retrieved from stored data. As soon as it is, you lose the ability to add new values in a rolling-release way. It can be very frustrating to know that we can't push a new producer of a value before we first change all consumers, even if all consumers already use switch statements with default clauses to exhaustively cover behavior.
But this is something you should be able to handle on a case-by-case basis. If you have a type which is stored durably as protobuf then adding required fields is much harder. But if you are just dealing with transient rpc messages then it can be done relatively easily in a two step process. First you add the field as optional and then once all producers are upgraded (and setting the new field), make it required. It's annoying for sure but still seems better than having everything optional always and needing to deal with that in application code everywhere.
Largely true. If you are at Google scale, odds are you have mixed fleets deployed, such that it is a bit of an involved process. But it is well defined and doable. I think a lot of us would rather not do a dance we don't have to do?
Sure, you just have to balance that against the cost of a poorly specified API interface. The errors because clients aren't clear on what is really required or not, what they should consider an error if it is not defined, etc. And of course all the boilerplate code that you have to write to convert the interface model to an internal domain model you can actually use inside your code.
I've used them almost daily for 15 years. They are way down the list of things I'd want improved. It has been interesting to see the protobuffers killers die out every few years though
I feel like I could have written an article like this at various points. Probably while spending two hours trying to figure out a way to represent some protobuf type in a sane way internally.
As a developer I always see "came from Google" as a yellow flag.
Too often I find something mildly interesting, but then realize that in order for me to try to use it I need to set up a personal mirror of half of Google's tech stack to even get it to start.
He says that in the article; he had to work on a "compiler" project that was much harder than it should have been because of protobuf's design choices.
Yeah, I saw that. I took that as something that happened in the past, though. Certainly colored a lot of the thinking, but feels like something more immediate had to have happened. :D
I like the problems that Protobuf solves, just not the way it solves them.
Protobuf as a language feels clunky. The “type before identifier” syntax looks ancient and Java-esque.
The tools are clunky too. protoc is full of gotchas, and for something as simple as validation, you need to add a zillion plugins and memorize their invocation flags.
From tooling to workflow to generated code, it’s full of Google-isms and can be awkward to use at times.
That said, the serialization format is solid, and the backward-compatibility paradigms are genuinely useful. Buf adds some niceties to the tooling and makes it more tolerable. There’s nothing else that solves all the problems Protobuf solves.
almost the entire purpose of anything like protocol buffers is to provide a safe mechanism for backwards-compatible forward changes -- "no one uses that stuff"?? what a weird and broken take
The "no enums as map keys" thing enrages me constantly. Every protobuf project I've ever worked with either has stringly-typed maps all over the place because of this, or has to write its own function to parse Map<String, V> into Map<K, V> from the enums and then remember to call that right after deserialization, completely defeating the purpose of autogenerated types and deserializers. Why does Google put up with this? Surely it's the same inside their codebase.
Maps are not a good fit for a wire protocol in my experience. Different languages often have different quirks around them, and they're non-trivial to represent in a type-safe way.
If a Map is truly necessary I find it better to just send a repeated Message { Key K, Value V } and then convert that to a map in the receiving end.
I believe that the reason for this limitation is that not all languages can represent open enums cleanly to gracefully handle unknown enums upon schema skew.
It could be; it looks like there was some version misalignment:
The maps syntax is only supported starting from v3.0.0. The "proto2" in the doc is referring to the syntax version, not protobuf release version. v3.0.0 supports both proto2 syntax and proto3 syntax while v2.6.1 only supports proto2 syntax. For all users, it's recommended to use v3.0.0-beta-1 instead of v2.6.1.
https://stackoverflow.com/questions/50241452/using-maps-in-p...
Protobuf's main design goal is to make space-optimized binary tag-length-value encoding easy. The mentality is kinda like "who cares what the API looks like as long as it can support anything you want to do with TLV encoding and has great performance." Things like oneofs and maps are best understood as slightly different ways of creating TLV fields in a message, rather than pieces of a comprehensive modern type system. The provided types are simply the necessary and sufficient elements to model any fuller type system using TLV.
I've written several screeds in the comments here on HN about protobufs being terrible over the past few years. Basically the creators of PB ignored ASN.1 and built a bad version of mid-1980s ASN.1 and DER.
Tag-length-value (TLV) encodings are just overly verbose for no good reason. They are _NOT_ "self-describing", and one does not need everything tagged to support extensibility. Even where one does need tags, tag assignments can be fully automatic and need not be exposed to the module designer. Anyone with a modicum of time spent researching how ASN.1 handles extensibility with non-TLV encoding rules knows these things. The entire arc of ASN.1's evolution over two plus decades was all about extensibility and non-TLV encoding rules!
And yes, ASN.1 started with the same premise as PB, but 40 years ago. Thus it's terribly egregious that PB's designers did not learn any lessons at all from ASN.1!
Near as I can tell, PB's designers thought they knew about encodings but didn't, and near as I can tell they refused to look at ASN.1 and such because of the lack of tooling for ASN.1, though of course there was even less tooling for PB since it didn't exist yet.
Avro (and others) has its own set of problems as well.
For messaging, JSON, used in the same way and with the same versioning practices as we have established for evolving schemas in REST APIs, has never failed me.
It seems to me that all these rigid type systems for remote procedure calls introduce more problems than they really solve and bring unnecessary complexity.
Sure, there are tradeoffs with flexible JSON, but its simplicity beats the potential advantages we get from systems like Avro or Protobuf.
> This insane list of restrictions is the result of unprincipled design choices and bolting on features after the fact
I'm not very upset that protobuf evolved to be slightly more ergonomic. Bolting on features after you build the prototype is how you improve things.
Unfortunately, they really did design themselves into a corner (not unlike python 2). Again, I can't be too upset. They didn't have the benefit of hindsight or other high performance libraries that we have today.
I've created several IDL compilers addressing all issues of protobuf and others.
This particular one provides strongest backward compatibility guarantees with automatic conversion derivation where possible: https://github.com/7mind/baboon
Protobuf is dated, it's not that hard to make better things.
We thought for a long time about using protobufs in our product [1] and in the end we went with JSON-RPC 2.0 over BLE, base64 for bigger chunks. Yeah, you still need to pass sample format and decode manually. The overhead is fine tho, debugging is way easier (also pulling in all of protobuf just wasn't fun).
Type system fans are so irritating. The author doesn't engage with the point of protocol buffers, which is that they are thin adapters between the union of things that common languages can represent with their type systems and a reasonably efficient marshaling scheme that can be compact on the wire.
But why do you need serialization?
Because the data structure on disk is not the same as in memory.
Arthur Whitney's k/q/kdb+ solved this problem by making them the same.
An array has the same format in memory and on disk, so there is no serialization,
and even better, you can mmap files into memory, so you don't need cache!
He also removed the capability to define a structure, forcing you to use a dictionary (structure) of arrays instead of an array of structures.
Forget on-disk. Different CPUs represent basic data types with different in-memory representations (endianness). Furthermore, different CPUs have different capabilities with respect to how data must be aligned in memory in order to read or write it (aligned/unaligned access); at least historically, unaligned access could fault your process. Then there's the problem, which you allude to, that different programming languages use different data layouts (or often a non-standardised layout). If you want communication within a system comprising heterogeneous CPUs and/or languages, you need to standardise a wire format and/or provide a translation layer, aka serialisation.
If you mostly write software with Go you'll likely enjoy working with protocol buffers. If you use the Python or Ruby wrappers you'd wish you had picked another tech.
The generated types in Go are horrible to work with. You can't store instances of them anywhere, or pass them by value, because they contain a bunch of state and pointers (including a [0]sync.Mutex just to explicitly prohibit copying). So you have to pass around pointers at all times, making ownership and lifetime much more complicated than it needs to be. A message definition ends up generating something like this:
type Example struct {
    state protoimpl.MessageState
    xxx_hidden_Value1 int32
    xxx_hidden_Value2 float64
    xxx_hidden_unknownFields protoimpl.UnknownFields
    sizeCache protoimpl.SizeCache
}
For [place of work] where we use protobuf I ended up making a plugin to generate structs that don't do any of the nonsense (essentially automating Option 1 in the article):
type ExamplePOD struct {
    Value1 int32
    Value2 float64
}
I actually really strongly prefer 0 being identical to unset. If you have an unset state then you have to check if the field is unset every time you use it. Using 0 allows you to make all of your code "just work" when you pass 0 to it so you don't need to check at all.
It's like how in go most structs don't have a constructor, they just use the 0 value.
Also oneof is made that way so that it is backwards compatible to add a new field and make it a oneof with an existing field. Not everything needs to be pure functional programming.
Protobuffers suck as a core data model. My take? Use them as a serialization and interchange format, nothing more.
> This puts us in the uncomfortable position of needing to choose between one of three bad alternatives:
I don’t think there is a good system out there that works for both serialization and data models. I’d say it’s a mostly unsolved problem. I think I am happy with protobufs. I know that I have to fight against them contaminating the codebase—basically, your code that uses protobufs is code that directly communicates over raw RPC or directly serializes data to/from storage, and protobufs shouldn’t escape into higher-level code.
But, and this is a big but, you want that anyway. You probably WANT your serialization to be able to evolve independently of your application logic, and the easy way to do that is to use different types for each. You write application logic using types that have all sorts of validation (in the "parse, don't validate" sense) and your serialization layer uses looser validation. This looser validation is nice because you often end up with e.g. buggy code getting shipped that writes invalid data, and if you have a loose serialization layer that just preserves structure (like proto or json), you at least have a good way to munge it into the right shape.
Evolving serialized types has been such a massive pain at a lot of workplaces and the ad-hoc systems I've seen often get pulled into adopting some of the same design choices as protos, like "optional fields everywhere" and "unknown fields are ok". Partly it may be because a lot of ex-Google employees are inevitably hanging around on your team, but partly because some of those design tradeoffs (not ALL of them, just some of them) are really useful long-term, and if you stick around, you may come to the same conclusion.
In the end I mostly want something that's a little more efficient and a little more typed than JSON, and protos fit the bill. I can put my full efforts into safety and the "correct" representation at a different layer, and yes, people will fuck it up and contaminate the code base with protos, but I can fix that or live with it.
Not specific to protobufs, but a lot of people/projects, especially if doing MVC, push the models in the API layer all the way down the stack until they become the domain, instead of keeping a loose coupling between the domain and the serialization format. In the old days we used to have DTOs for separation, but they went out of fashion.
Agreed, it's interesting to see so many people complaining when they are just misunderstanding / misusing protobufs entirely. Sure the implementation could be better but it's not a huge problem.
Hello. I didn't invent Protocol Buffers, but I did write version 2 and was responsible for open sourcing it. I believe I am the author of the "manifesto" entitled "required considered harmful" mentioned in the footnote. Note that I mostly haven't touched Protobufs since I left Google in early 2013, but I have created Cap'n Proto since then, which I imagine this guy would criticize in similar ways.
This article appears to be written by a programming language design theorist who, unfortunately, does not understand (or, perhaps, does not value) practical software engineering. Type theory is a lot of fun to think about, but being simple and elegant from a type theory perspective does not necessarily translate to real value in real systems. Protobuf has undoubtedly, empirically proven its real value in real systems, despite its admittedly large number of warts.
The main thing that the author of this article does not seem to understand -- and, indeed, many PL theorists seem to miss -- is that the main challenge in real-world software engineering is not writing code but changing code once it is written and deployed. In general, type systems can be both helpful and harmful when it comes to changing code -- type systems are invaluable for detecting problems introduced by a change, but an overly-rigid type system can be a hindrance if it means common types of changes are difficult to make.
This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.
I don't have time to address all the author's points, so let me choose a few that I think are representative of the misunderstanding.
> Make all fields in a message required. This makes messages product types.
> Promote oneof fields to instead be standalone data types. These are coproduct types.
This seems to miss the point of optional fields. Optional fields are not primarily about nullability but about compatibility. Protobuf's single most important feature is the ability to add new fields over time while maintaining compatibility. This has proven -- in real practice, not in theory -- to be an extremely powerful way to allow protocol evolution. It allows developers to build new features with minimal work.
Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto. In practice, you want to declare all fields optional to give yourself maximum flexibility for change.
The author dismisses this later on:
> What protobuffers are is permissive. They manage to not shit the bed when receiving messages from the past or from the future because they make absolutely no promises about what your data will look like. Everything is optional! But if you need it anyway, protobuffers will happily cook up and serve you something that typechecks, regardless of whether or not it's meaningful.
In real world practice, the permissiveness of Protocol Buffers has proven to be a powerful way to allow for protocols to change over time.
Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is. Certainly the usual proposals I see seem like steps backwards. I'd love to be proven wrong, but not on the basis of perceived elegance and simplicity, but rather in real-world use.
> oneof fields can't be repeated.
(background: A "oneof" is essentially a tagged union -- a "sum type" for type theorists. A "repeated field" is an array.)
Two things:
1. It's that way because the "oneof" pattern long-predates the "oneof" language construct. A "oneof" is actually syntax sugar for a bunch of "optional" fields where exactly one is expected to be filled in. Lots of protocols used this pattern before I added "oneof" to the language, and I wanted those protocols to be able to upgrade to the new construct without breaking compatibility.
You might argue that this is a side-effect of a system evolving over time rather than being designed, and you'd be right. However, there is no such thing as a successful system which was designed perfectly upfront. All successful systems become successful by evolving, and thus you will always see this kind of wart in anything that works well. You should want a system that thinks about its existing users when creating new features, because once you adopt it, you'll be an existing user.
2. You actually do not want a oneof field to be repeated!
Here's the problem: Say you have your repeated "oneof" representing an array of values where each value can be one of 10 different types. For a concrete example, let's say you're writing a parser and they represent tokens (number, identifier, string, operator, etc.).
Now, at some point later on, you realize there's some additional piece of data you want to attach to every element. In our example, it could be that you now want to record the original source location (line and column number) where the token appeared.
How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.
In every single case where you might want a repeated oneof, you always want to wrap it in a message (product type), and then repeat that. That's exactly what you can do with the existing design.
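The wrapper pattern described above can be sketched like this (names hypothetical): the `repeated` applies to a message, the message contains the `oneof`, and later additions hang off the message rather than requiring a parallel array:

```proto
message Token {
  oneof kind {
    double number = 1;
    string identifier = 2;
    string str = 3;
    string operator = 4;
  }
  // Added later, without breaking compatibility:
  int32 line = 5;
  int32 column = 6;
}

message TokenStream {
  repeated Token tokens = 1;
}
```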
The author's complaints about several other features have similar stories.
> One possible argument here is that protobuffers will hold onto any information present in a message that they don't understand. In principle this means that it's nondestructive to route a message through an intermediary that doesn't understand this version of its schema. Surely that's a win, isn't it?
> Granted, on paper it's a cool feature. But I've never once seen an application that will actually preserve that property.
OK, well, I've worked on lots of systems -- across three different companies -- where this feature is essential.
To me, it seems that version-change safety and the usefulness of the generated code constitute a design tradeoff: if you mark a field as required, then the generated data structures can skip using Option/pointers, and this very common form of validation is generated for free. If you disallow marking a field as required, then all fields must be checked for existence, even ones required for a system to function, which is quite a burden and leads to developers writing their own types anyway as a place to put their validated data. If data is required to be present for an app to function, then why can't I be given the tools to express this, and benefit from the constraints applied to the data model?
Most of the time when I would like to use a schema-driven, efficient data format and code generation tool, the data contract doesn't change frequently. And when it does, assuming it's a backwards-incompatible change, I think I would be happy to generate a MyDataV2 message along with GetMyDataV2 method, allow existing clients to keep using the original version, and allow new or existing clients to use the newly supported structures at their leisure. Meanwhile, everyone that shares my schema can have much more idiomatic generated code, and in the most common cases won't have to write their own data types or be stuck with a bunch of `if data.x != null {` statements.
Protobufs are an amazing tool, but I think there is a need for a simpler tool which supports a restricted set of use cases cleanly and allows for wider expression of data models.
> Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time
how often? as practiced by who, and where?
> 2. You actually do not want a oneof field to be repeated!
> How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.
Nice, "explain to me how you're going to implement a backward-compatible SUM in the spec-parser that doesn't have the notions needed. Ha! You can't! Told you so!"
> But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.
Not really: `oneof token` is isomorphic to `oneof (token unit)`, and going from the former to the latter doesn't require any binary encoding change at all, if the encoding is optimal. Getting from `oneof (token unit)` to `oneof (token { linepos })`, depending on the binary encoding format you design, doesn't require changes to the parser's runtime either, as long as the parser takes into account that `unit` is isomorphic to the zero-arity product `{}`. Since both `{}` and `{ linepos }` can be represented with fixed positional addressing, you get your values back in a backward-compatible way, under one specific condition: the parser library API provides `repeated (oneof <A>)` as a non-materialised stream of values `<A>`, so that the exact interpretation of `<A>` happens at the user's calling site, according to the stated protocol spec. If the spec says `<A> = token`, then `list (repeated (oneof (token { linepos })))` is isomorphic to `list (repeated (oneof token))` in the deployed version of the protocol that knows nothing about line positions, so my endpoints can send you any of:
* Version 0: [len][oneof_bincode][token_arr]
* Version 1: [len][oneof_bincode_sum][token_arr][unit]
* Version 2: [len][oneof_bincode_sumprod][token_arr][prod_arr]
* Version 3: [len][oneof_bincode_sumprod_sparse][token_arr][presence_arr][prod_arr]
This was my experience in Google Search infrastructure circa 2005-2010. This was a system with dozens of teams and hundreds of developers all pushing their data through a common message bus.
It happened all the damned time and caused multiple real outages (from overzealous validation), along with a lot of tech debt involving having to initialize fields with dummy data because they weren't used anymore but still required.
Reports from other large teams at google, e.g. gmail, indicated they had the same problems.
> Nice, "explain to me how you're going to implement a backward-compatible SUM in the spec-parser that doesn't have the notions needed. Ha! You can't! Told you so!"
Sure sure, we could expand the type system to support some way of adding a new tag to every element of the repeated oneof, implement it, teach everyone how that works, etc.
Or we could just tell people to wrap the thing in a `message`. It works fine already, and everyone already understands how to do it. No new cognitive load is created, and no time is wasted chasing theoretical purity that provides no actual real-world benefit.
> This was my experience in Google Search infrastructure circa 2005-2010 [...]
> Reports from other large teams at google
> teach everyone how that works, etc.
> Or we could just tell people to wrap the thing in a `message`
It really sounds like a self-inflicted internal Google issue. Can you address the part where I mention the isomorphism of (oneof token) and (oneof (token {})), and clarify what exactly you think you'd have to teach other engineers to do, if your protocol's encoders and decoders took this property into account?
I don't recall exactly (I've shelved my mapping projects for the moment), but isn't OpenStreetMap's core data distribution format based on protobuffers?
i used protobuffers a lot at $previous_job and i agree with the entire article. i feel the author’s pain in my bones. protobuffers are so awful i can’t imagine google associating itself with such an amateur, ad hoc, ill-defined, user hostile, time wasting piece of shit.
the fact that protobuffers weren't immediately relegated to the dustbin shows just how low the bar is for serialization formats.
Google claimed Protobuffers are the solution but Google's planetary engineers clearly have ZERO respect for the mixed-endian remote systems keeping the galactic federation afloat with their cheap CORBA knockoff. It's like, sure which Plan 9 mainframe do you want to connect to like we all live on planet Google. Like hello???
I too was using PBs a lot, as they are quite popular in the Go world. But I came to the conclusion that they and gRPC are more trouble than they are worth. I switched to JSON and HTTP "REST", plus websockets if I need streaming, and am as happy as I could be.
I get the API interoperability between various languages when one wants to build a client with a strict schema, but in reality this is more theory than practice.
In essence, anyone who subscribes to YAGNI understands that PB and gRPC are a big no-no.
PS: if you need a binary format, just use CBOR or msgpack. Otherwise, the beauty of JSON is that it's human-readable and easily parseable, so even if you lack access to the original schema, you can still EASILY process the data and UNDERSTAND it as well.
I am very partial to msgpack. It has routinely met or exceeded my performance needs and doesn’t depend on weird code generation, and is super easy to set up.
Something that I don’t see talked about much with msgpack, but I think is cool: if your project doesn’t span across multiple languages, you can actually embed those language semantics into your encoder with extensions.
For example, Clojure’s port of msgpack lets you use Clojure keywords out of the box, and they parse correctly without issue. You can also have it work with sets.
Obviously you could define some kind of mapping yourself and use any binary format to do this (ultimately the [en|de]coder is just using regular msgpack constructs behind the scenes), but I have always had to do that manually, while with msgpack the libraries seem to readily embrace it.
Indeed, the support is widespread across languages. On the other hand, using compression like basic gzip for HTTP responses turns the text format into a binary format anyway, and with HTTP/2 or HTTP/3 there is no overhead like there would be with HTTP/1. So in the end, the binary aspect of these encoders might be obsolete for this use case, as long as one uses compression.
If you opt for non-human-readable wire-formats it better be because of very important reasons. Something about measuring performance and operational costs.
If you need to exchange data with other systems that you don't control, a simple format like JSON is vastly superior.
You are restricted to handing over tree-like structures. That is a good thing as your consumers will have no problems reading tree-like structures.
It also makes it very simple for each consumer/producer to coerce this data into structs or objects as they please and that make sense to their usage of the data.
You have to validate the data anyhow (you do validate data received from the outside world, don't you?), so throwing in coercion is honestly the smallest of your problems.
You only need to touch your data coercion if someone decides to send you data in a different shape.
For tree-like structures it is simple to add new things and stay backwards compatible.
Adding a spec on top of your data shapes that can potentially help consumers generate client code is a cherry on top of it and an orthogonal concern.
Making as few assumptions as possible about how your consumers deal with your data is a Good Thing(tm) that enabled such useful (still?) things as the WWW.
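A minimal sketch of that validate-then-coerce step (field names hypothetical): unknown keys from the producer are simply ignored, which keeps the consumer forward-compatible, and everything downstream works with our own struct rather than raw JSON:

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    quantity: int

def coerce_order(data: dict) -> Order:
    # Validate the outside world's JSON, then coerce it into our own struct.
    # Keys we don't know about are ignored, so producers can add fields freely.
    if not isinstance(data.get("order_id"), str):
        raise ValueError("order_id must be a string")
    if not isinstance(data.get("quantity"), int):
        raise ValueError("quantity must be an integer")
    return Order(order_id=data["order_id"], quantity=data["quantity"])
```

A producer that later adds `"priority": "high"` to its payload changes nothing here; only a change to the fields we actually read forces us to touch this code.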
Protocol buffers suck but so does everything else. Name another serialization declaration format that both (a) defines which changes can be made backwards-compatibly, and (b) has a linter that enforces backwards-compatible changes.
Just with those two criteria you’re down to, like, six formats at most, of which Protocol Buffers is the most widely used.
And I know the article says no one uses the backwards compatible stuff but that’s bizarre to me – setting up N clients and a server that use protocol buffers to communicate and then being able to add fields to the schema and then deploy the servers and clients in any order is way nicer than it is with some other formats that force you to babysit deployment order.
The reason why protos suck is because remote procedure calls suck, and protos expose that suckage instead of trying to hide it until you trip on it. I hope the people working on protos, and other alternatives, continue to improve them, but they’re not worse than not using them today.
Not widely used but I like Typical's approach
https://github.com/stepchowfun/typical
> Typical offers a new solution ("asymmetric" fields) to the classic problem of how to safely add or remove fields in record types without breaking compatibility. The concept of asymmetric fields also solves the dual problem of how to preserve compatibility when adding or removing cases in sum types.
More direct link to the juicy bit: https://github.com/stepchowfun/typical?tab=readme-ov-file#as...
An asymmetric field in a struct is considered required for the writer, but optional for the reader.
That's a nice idea... But I believe the design direction of proto buffers was to make everything `optional`, because `required` tends to bite you later when you realize it should actually be optional.
My understanding is that asymmetric fields provide a migration path in case that happens, as stated in the docs:
> Unlike optional fields, an asymmetric field can safely be promoted to required and vice versa.
> [...]
> Suppose we now want to remove a required field. It may be unsafe to delete the field directly, since then clients might stop setting it before servers can handle its absence. But we can demote it to asymmetric, which forces servers to consider it optional and handle its potential absence, even though clients are still required to set it. Once that change has been rolled out (at least to servers), we can confidently delete the field (or demote it to optional), as the servers no longer rely on it.
This seems interesting. Still not sure if `required` is a good thing to have (for persistent data like logs you cannot really guarantee some field's presence without schema versioning baked into the file itself), but for intermediate wire use cases this will help.
I've never heard of Typical, but the fact that they didn't repeat protobuf's sin regarding varint encoding (or use LEB128 encoding...) makes me very interested! Thank you for sharing, I'm going to have to give it a spin.
It looks similar to how the vint64 lib encodes varints. The total length of the varint can be determined from the first byte alone.
I advocated for PrefixVarint (which seems equivalent to vint64) for WebAssembly, but it was decided against, in favor of LEB128: https://github.com/WebAssembly/design/issues/601
The recent CREL format for ELF also uses the more established LEB128: https://news.ycombinator.com/item?id=41222021
At this point I don't feel like I have a clear opinion about whether PrefixVarint is worth it, compared with LEB128.
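To make the tradeoff concrete, here is a toy sketch of both styles (this is not the exact WebAssembly LEB128 or vint64 bit layout, just the general shape): LEB128 spends a continuation bit in every byte, so a reader must scan to find the end, while a prefix scheme encodes the total length in the first byte alone:

```python
def leb128_encode(n: int) -> bytes:
    # Unsigned LEB128: 7 data bits per byte, high bit set means "more follows".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def prefix_encode(n: int) -> bytes:
    # Toy prefix-varint: the first byte starts with (k-1) one bits followed by
    # a zero, so the total length k is known after reading one byte.
    for k in range(1, 9):          # supports values up to 2**56 - 1 here
        if n < 1 << (7 * k):
            break
    else:
        raise ValueError("too large for this sketch")
    prefix = ((1 << (k - 1)) - 1) << (9 - k)
    first = prefix | (n >> (8 * (k - 1)))          # high bits ride in byte 0
    rest = n & ((1 << (8 * (k - 1))) - 1)          # low bits follow, little-endian
    return bytes([first]) + rest.to_bytes(k - 1, "little")

def prefix_decode(buf: bytes) -> int:
    k, mask = 1, 0x80
    while buf[0] & mask:           # count leading ones to learn the length
        k += 1
        mask >>= 1
    hi = buf[0] & ((1 << (8 - k)) - 1)
    return (hi << (8 * (k - 1))) | int.from_bytes(buf[1:k], "little")
```

With `leb128_encode(300)` you get `b"\xac\x02"`, and you only learn it is two bytes after inspecting both; with `prefix_encode(300)` the first byte already says "two bytes total", which is what makes branch-free or SIMD decoding easier.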
Just remember that XML was more established than JSON for a long time.
This actually looks quite interesting.
Seems like a lot of effort to avoid adding a message version field. I’m not a web guy, so maybe I’m missing the point here, but I always embed a schema version field in my data.
I get that.
The point is that it's hard to prevent asymmetry in message versions if you are working with many communicating systems. Let's say four services inter-communicate with some protocol: it is extremely annoying to impose a deployment order where the producer of a message type is the last to upgrade the message schema, as this causes unnecessary dependencies between the release trains of these services. At the same time, one cannot simply say "I don't know this message version, I will disregard it," because in live systems this means the systems go out of sync, data is lost, stuff breaks, etc.
There are probably more issues I haven't mentioned, but long story short: in live, interconnected systems it becomes important to have intelligent message versioning, i.e. a version number is not enough.
I think I see what you’re getting at? My mental model is client and server, but you’re implying a more complex topology where no one service is uniquely a server or a client. You’d like to insert a new version at an arbitrary position in the graph without worrying about dependencies or the operational complexity of doing a phased deployment. The result is that you try to maintain a principled, constructive ambiguity around the message schema, hence asymmetrical fields? I guess I’m still unconvinced and I may have started the argument wrong, but I can see a reasonable person doing it that way.
Yes, that's a big part, but even bigger is just the alignment of teams.
Imagine team A is building feature XYZ and team B is building TUV.
One of those features in each team deals with messages; the others are unrelated. At some point in time, both teams have to deploy.
If you have to sync them up just to get the protocol to work, that's extra complexity in the already complex work of the teams.
If you can ignore this, great!
It becomes even more complex with rolling updates, though: not all deployments of a service will have the new code immediately, because you want multiple instances online to scale on demand. This creates an immediate, necessary ambiguity in the question "which version does this service accept?", because it's not about the service anymore, but about the deployments.
Ah, I see. Team A would like to deploy a new version of a service. It used to accept messages with schema S, but the new version accepts only S’ and not S. So the only thing you can do is define S’ so that it is ambiguous with S. Team B uses Team A’s service but doesn’t want to have to coordinate deployments with Team A.
I think the key source of my confusion was Team A not being able to continue supporting schema S once the new version is released. That certainly makes the problem harder.
Exactly!
Idk, I generally think "magic numbers" are just extra effort. The main annoyance is adding if statements everywhere on the version number instead of checking that the data field you need is present.
It also really depends on the scope of the issue. Protos really excel at “rolling” updates and continuous changes instead of fixed APIs. For example, MicroserviceA calls MicroserviceB, but the teams do deployments different times of the week. Constant rolling of the version number for each change is annoying vs just checking for the new feature. Especially if you could have several active versions at a time.
It also frees you from actually propagating a single version number everywhere. If you own a bunch of API endpoints, you either need to put the version in the URL, which impacts every endpoint at once, or you need to put it in the request/response of every one.
I think this is only a problem if you’re using a weak data interchange library that can’t use the schema number field to discriminate a union. Because you really shouldn’t have to write that if statement yourself.
I'm really hoping Typical will catch on, as I quite like the design. One important gap right now is the lack of Go and Python support.
We use protocol buffers on a game and we use the back compat stuff all the time.
We include a version number with each release of the game. If we change a proto we add new fields and deprecate old ones and increment the version. We use the version number to run a series of steps on each proto to upgrade old fields to new ones.
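That step-per-version upgrade pipeline might look something like this (a hypothetical sketch operating on decoded saves as dicts rather than real proto objects): each step knows only how to move one version forward, and old saves are walked through every step in order.

```python
def upgrade_v1_to_v2(save: dict) -> dict:
    # v2 renamed "name" to "display_name".
    save = dict(save)
    save["display_name"] = save.pop("name", "")
    return save

def upgrade_v2_to_v3(save: dict) -> dict:
    # v3 replaced a single high score with a score history list.
    save = dict(save)
    save["scores"] = [save.pop("score", 0)]
    return save

UPGRADE_STEPS = {1: upgrade_v1_to_v2, 2: upgrade_v2_to_v3}
LATEST_VERSION = 3

def upgrade(save: dict) -> dict:
    version = save.get("version", 1)
    while version < LATEST_VERSION:
        save = UPGRADE_STEPS[version](save)
        version += 1
    return {**save, "version": LATEST_VERSION}
```

The nice property is that adding version 4 means writing exactly one new step; protobuf's unknown-field tolerance covers the wire, and this covers the semantics.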
> We use the version number to run a series of steps on each proto to upgrade old fields to new ones
It sounds like you've built your own back-compat functionality on top of protobuf?
The only functionality protobuf is giving you here is optional-by-default (and mandatory version numbers, but most wire formats require that)
Yeah, I’d probably say something more like, “we leverage protobuf built-ins to make a slightly more advanced back-compat system”
We do rename deprecated fields and often give new fields their names. We rely on the field number to make that work.
> We do rename deprecated fields and often give new fields their names. We rely on the field number to make that work.
Why share names? Wouldn't it be safer to, well, not?
ASN.1 implements message versioning in an extremely precise way. Implementing a linter would be trivial.
This. Plus, ASN.1 is pluggable as to encoding rules and has a large family of them.
ASN.1 also gives you a way to do things like formalize typed holes. Not looking at ASN.1, not even its history and evolution, when creating PB was a crime.
The people who wrote PB clearly knew ASN.1. It was the most famous IDL at the time. Do you assume they just came one morning and decided to write PB without taking a look at what existed?
Anyway, as stated PB does more than ASN.1. It specifies both the description format and the encoding. PB is ready to be used out of the box. You have a compact IDL and a performant encoding format without having to think about anything. You have to remember that PB was designed for internal Google use as a tool to solve their problems, not as a generic solution.
ASN.1 is extremely unwieldy in comparison. It has accumulated a lot of cruft through the years. Plus, they don’t provide a default implementation.
I agree that saying that no-one uses backwards compatible stuff is bizarre. Rolling deploys, being able to function with a mixed deployment is often worth the backwards compatibility overhead for many reasons.
In Java, you can accomplish some of this using Jackson JSON serialization of plain objects. There are several ways changes can be made backwards-compatibly (e.g., in recent years, post-deserialization hooks can handle more complex cases), which satisfies (a). For (b), there’s no automatic linter. However, in practice, I found that writing tests that deserialize the prior release’s serialized objects gets you pretty far along the line of regression protection for major changes. It was also pretty easy to write an automatic round-trip serialization tester to catch mistakes in the ser/deser chain. Finally, if you stay away from non-schemable ser/deser (such as a method that handles any property name), which can be enforced with a linter, you can output the JSON schema of your objects to committed source. Then any time the generated schema changes, you can look for corresponding test coverage in code reviews.
I know that’s not the same as an automatic linter, but it gets you pretty far in practice. It does not absolve you from cross-release/upgrade testing, because serialization backwards-compatibility does not catch all backwards-compatibility bugs.
Additionally, Jackson has many techniques, such as unwrapping objects, which let you execute more complicated refactoring backwards-compatibly, such as extracting a set of fields into a sub-object.
I like that the same schema can be used to interact with your SPA web clients for your domain objects, giving you nice inspectable JSON. Things serialized to unprivileged clients can be filtered with views, such that sensitive fields are never serialized, for example.
You can generate TypeScript objects from this schema or generate clients for other languages (e.g. with Swagger). Granted it won’t port your custom migration deserialization hooks automatically, so you will either have to stay within a subset of backwards-compatible changes, or add custom code for each client.
You can also serialize your RPC comms to a binary format, such as Smile, which uses back-references for property names, should you need to reduce on-the-wire size.
It’s also nice to be able to define Jackson mix-ins to serialize classes from other libraries’ code or code that you can’t modify.
This is always the thing to look for: "What are the alternatives?", and why aren't there better ones?
I don't understand most use cases of protobufs, including ones that informed their design. I use it for ESP-hosted, to communicate between two MCUs. It is the highest-friction serialization protocol I've seen, and is not very byte-efficient.
Maybe something like the specialized serialization libraries (bincode, postcard etc) would be easier? But I suspect I'm missing something about the abstraction that applies to networked systems, beyond serialization.
> Name another serialization declaration format that both (a) defines which changes can be make backwards-compatibly, and (b) has a linter that enforces backwards compatible changes.
The article covers this in the section "The Lie of Backwards- and Forwards-Compatibility." My experience working with protocol buffers matches what the author describes in this section.
Exactly. I think of protobuffers like I think of Java or Go: at least they weren’t writing it in C++.
Dragging your org away from using poorly specified json is often worth these papercuts IMO.
Protobufs are better but not best. Still, by far, the easiest thing to use and the safest is actual APIs. Like, in your application. Interfaces and stuff.
Obviously if your thing HAS to communicate over the network that's one thing, but a lot of applications don't. The distributed system micro service stuff is a choice.
Guys, distributed systems are hard. The extremely low API visibility, combined with fragile network calls and unsafe, poorly specified API versioning, means your stuff is going to break, a lot.
Want a version-controlled API? Just write an interface in C# or PHP or whatever.
> Protobufs are better but not best.
This sort of comment doesn't add anything to the discussion unless you are able to point out what you believe to be the best. It reads as an unnecessary and unsubstantiated put-down.
The original RPC code, from which Google derived their protobuf stuff was written in (pre-ANSI) C at Sun Microsystems.
> And I know the article says no one uses the backwards compatible stuff but that’s bizarre to me – setting up N clients and a server that use protocol buffers to communicate and then being able to add fields to the schema and then deploy the servers and clients in any order is way nicer than it is with some other formats that force you to babysit deployment order.
Yet the author has the audacity to call the authors of protobuf (originally Jeff Dean et al) "amateurs."
As someone who has written many mapreduce jobs over years old protobufs I can confidently report the backwards compatibility made it possible at all.
> Just with those two criteria you’re down to, like, six formats at most, of which Protocol Buffers is the most widely used.
What I dislike the most about blog posts like this is that, although the blogger is very opinionated and critical of many things, the post dates back to 2018, protobuf is still dominant, and apparently in all these years the blogger never put together something they felt was a better way to solve the problem. It's perfectly fine to feel strongly about a topic. However, investing so much energy in criticizing, and even throwing personal attacks at whoever contributed to the project, feels pointless: an exercise in self-promotion at the expense of shit-talking. Either put something together that implements your vision and rights some wrongs, or don't go out of your way to put people down. Not cool.
JSON exists, and when compressed it is pretty efficient (though not as efficient as protobuf).
For client facing protocol Protobufs is a nightmare to use. For Machine to Machine services, it is ok-ish, yet personally I still don't like it.
When I was at Spotify we ditched it for client side apis (server to mobile/web), and never looked back. No one liked working with it.
> JSON exists (...)
The blog post leads with the assertion that protobuf is "ad-hoc and built by amateurs". Therefore I doubt that JSON, a data serialization language designed by trimming most of JavaScript out so it could be parsed with eval(), would meet that opinionated high bar.
Also, JSON is a data interchange language, and has no support for types beyond the notoriously ill-defined primitives. In contrast, protobuf is a data serialization language which supports specifying types. This means that for JSON to come close to meeting the requirements met by protobuf, it would need to be paired with schema validation frameworks and custom configurable parsers, which it definitely does not cover by itself.
You must be young. XML and XML Schemas existed before JSON or Protobuf, and people ditched them for a good reason and JSON took over.
Protobuf is just another take on the old RPC / Java Beans style of binary format. Yes, it is more efficient data-wise than JSON, but it is a PITA to work with and debug.
> You must be young. XML and XML Schemas existed before JSON or Protobuf, and people ditched them for a good reason and JSON took over.
I'm not sure you got the point. It's irrelevant how old JSON or XML (a non sequitur) are. The point is that one of the main features and selling points of protobuf is strong typing and model validation implemented at the parsing level. JSON does not support any of these, and you need to onboard more than one ad-hoc tool to have a shot at feature parity, which goes against the blogger's opinionated position on the topic.
Not that I love it -- but SBE (Simple Binary Encoding) is a _decent_ solution in the realm of backwards/forwards compatibility.
Flatbuffers satisfies those requirements and doesn’t have varint shenanigans.
What about Cap’n Proto https://capnproto.org/ ? (Don't know much about these things myself, but it's a name that usually comes up in these discussions.)
Cap'n'proto is not very nice to work with in C++, and I'd discourage anyone from using it from other programming languages, the implementations are just not there yet. We use both cnp and protobufs at work, and I vastly prefer protobufs, even for C++. I only wish they stayed the hell away from abseil, though.
I always thought people had a positive view on abseil, never used it myself other than when tinkering on random projects. What's the main issue?
The thing is a huge pain to manage as a dependency, especially if you wander away from the official google-approved way of doing things. Protobuf went from a breeze to use to the single most common source of build issues in our cross-platform project the moment they added this dependency. It's so bad that many distros and package managers keep the pre-abseil version as a separate package, and many just prefer to get stuck with it rather than upgrade. Same with other google libraries that added abseil as a dependency, as far as I'm aware
I like abseil besides the compile times. Not having to specialize my own hash when using maps is nice.
I'd rather they just used the abseil headers they needed with the abseil license at the top than make it a build dependency.
The concept of a package is antithetical to C++ and no amount of tooling can fix that.
abseil is not header-only, though
Skill issue
But you can't trust flatbuffers sent from unknown senders.
https://github.com/dfinity/candid/blob/master/spec/Candid.md
in the systems I built I didn't bother with backwards compatibility.
If you make any change, it's a new message type.
For compatibility you can coerce the new message to the old message and dual-publish.
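A minimal sketch of that coercion pattern (Python; the message types and topic names here are invented for illustration):

```python
from dataclasses import dataclass

# Hypothetical message types: V2 added a "currency" field to V1.
@dataclass
class OrderV1:
    order_id: str
    amount_cents: int

@dataclass
class OrderV2:
    order_id: str
    amount_cents: int
    currency: str

def coerce_to_v1(msg: OrderV2) -> OrderV1:
    # Drop the fields the old message type doesn't know about.
    return OrderV1(order_id=msg.order_id, amount_cents=msg.amount_cents)

def dual_publish(msg: OrderV2, publish) -> None:
    # Old consumers keep reading the v1 topic; new ones read v2.
    publish("orders.v2", msg)
    publish("orders.v1", coerce_to_v1(msg))
```

Once the last v1 consumer is gone, you delete `coerce_to_v1` and the second publish.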
I prefer a little builtin backwards (and forwards!) compatibility (by always enforcing a length for each object, to be zero-padded or truncated as needed), but yes "don't fear adding new types" is an important lesson.
That only works if you control all the clients.
Dual-publishing makes it transparent to older clients.
Obviously you need to track when the old clients have been moved over so you can eventually retire the dual-publishing.
You could also do the conversion on the receiving side without a-priori information, but that would be extremely slow.
Protobufs aren’t new. They’re really just RPC over HTTP. I used DCE RPC in 1997, which had an IDL. I believe CORBA used an IDL as well, although I personally did not use it. There have been other attempts like EJB, etc., which are pretty much the same paradigm.
The biggest plus with protobuf is the social/financial side and not the technology side. It’s open source and free from proprietary hacks like previous solutions.
Apart from that, distributed systems of which rpc is a sub topic are hard in general. So the expectation would be that it sucks.
Backwards compatibility is just not an issue in self-describing structures like JSON, Java serialization, and (dating myself) Hessian. You can add fields and you can remove fields. That's enough to allow seamless migrations.
It's only positional protocols that have this problem.
You can remove JSON fields at the cost of breaking your clients at runtime that expect those fields. Of course the same can happen with any deserialization libraries, but protobufs at least make it more explicit - and you may also be more easily able to track down consumers using older versions.
For the missing case: whenever I use JSON, I always start with a sane default struct, then overwrite those defaults with the externally provided values. If a field is missing, it will be handled reasonably.
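That defaults-first approach looks something like this (Python; the field names are made up):

```python
import json

# Sane defaults for every known field; unknown incoming fields are ignored.
DEFAULTS = {"host": "localhost", "port": 8080, "tls": False}

def parse_config(raw: str) -> dict:
    cfg = dict(DEFAULTS)  # start from the default struct...
    incoming = json.loads(raw)
    # ...then overwrite with whatever the sender actually provided
    cfg.update({k: v for k, v in incoming.items() if k in DEFAULTS})
    return cfg
```

Missing fields fall back to defaults, and fields from a newer sender are silently dropped, which gives you forward compatibility for free.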
At the cost of much larger payloads.
With gzip encoding... not really.
> Name another serialization declaration format that both (a) defines which changes can be make backwards-compatibly, and (b) has a linter that enforces backwards compatible changes.
ASCII text (tongue in cheek here)
TLV style binary formats are all you need. The “Type” in that acronym is a 32-bit number which you can use to version all of your stuff so that files are backwards compatible. Software that reads these should read all versions of a particular type and write only the latest version.
Code for TLV is easy to write and to read, which makes viewing programs easy. TLV data is fast for computers to write and to read.
Protobuf is overused because people are fucking scared to death to write binary data. They don’t trust themselves to do it, which is just nonsense to me. It’s easy. It’s reliable. It’s fast.
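For the curious, a bare-bones version of that scheme (Python sketch; the little-endian 32-bit layout is just one arbitrary choice):

```python
import struct

# TLV framing: 32-bit type tag, 32-bit length, then the payload bytes.
# Readers skip record types they don't recognize, which is what keeps
# old files readable by new software and vice versa.

def write_record(buf: bytearray, rtype: int, payload: bytes) -> None:
    buf += struct.pack("<II", rtype, len(payload)) + payload

def read_records(data: bytes):
    off = 0
    while off < len(data):
        rtype, length = struct.unpack_from("<II", data, off)
        off += 8
        yield rtype, data[off:off + length]
        off += length
```

Versioning is then just a convention on the type number, e.g. the high half is the record kind and the low half is its version.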
Protobuf is typically serialised using a TLV-style encoding.
https://protobuf.dev/programming-guides/encoding/
A major value of protobuf is in its ecosystem of tools (codegen, lint, etc); it's not only an encoding. And you don't generally have to build or maintain any of it yourself, since it already exists and has significant industry investment.
Real ones know that serialization is what sucks.
https://news.ycombinator.com/item?id=18190005
Just FYI: an obligatory comment from the protobuf v2 designer.
Yeah, protobuf has lots of design mistakes, but this article is written by someone who does not understand the problem space. Most of the complexity of serialization comes from implementation compatibility between different timepoints. This significantly limits the design space.
Relatedly, most of the author's concerns are solved by wrapping things in a message.
> oneof fields can’t be repeated.
Wrap oneof field in message which can be repeated
> map fields cannot be repeated.
Wrap in message which can contain repeated fields
> map values cannot be other maps.
Wrap map in message which can be a value
Perhaps this is slightly inconvenient/un-ergonomic, but the author is positioning these things as "protos fundamentally can't do this".
To clarify: Protobuf’s simplest change is adding a field to a message, so wrapping maps of maps, maps of fields, and oneof fields into a message makes these play to its strengths. It feels like over-engineering to turn your Inventory map of items into an Inventory message, but you will be grateful for it when you need a capacity field later.
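For anyone who hasn't seen the wrapping trick, it looks roughly like this (an illustrative schema, not taken from the article):

```protobuf
syntax = "proto3";

// A oneof can't be repeated directly, but its wrapper message can.
message Item {
  oneof kind {
    string sword = 1;
    string potion = 2;
  }
}

// Map values can't be maps, but they can be messages containing maps.
message Inventory {
  map<string, Item> items = 1;
  // ...and there's room to add a capacity field here later.
}

message Character {
  repeated Item loot = 1;           // "repeated oneof", via the wrapper
  map<string, Inventory> bags = 2;  // "map of maps", via the wrapper
}
```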
>Most of the complexity of serialization comes from implementation compatibility between different timepoints.
The author talks about compatibility a fair bit, specifically the importance of distinguishing a field that wasn't set from one that was intentionally set to a default, and how protobuffs punted on this.
What do you think they don't understand?
If you see some statements like below on the serialization topic:
> Make all fields in a message required. This makes messages product types.
> One possible argument here is that protobuffers will hold onto any information present in a message that they don't understand. In principle this means that it's nondestructive to route a message through an intermediary that doesn't understand this version of its schema. Surely that's a win, isn't it?
> Granted, on paper it's a cool feature. But I've never once seen an application that will actually preserve that property.
Then it is fair to raise eyebrows on the author's expertise. And please don't ask if I'm attached to protobuf; I can roast protocol buffers for their wrong designs for hours. It is just that the author makes a series of wrong claims, presumably due to their bias toward principled type systems and inexperience working on large-scale systems.
> If you see some statements like below on the serialization topic:
> Make all fields in a message required. This makes messages product types.
> Then it is fair to raise eyebrows on the author's expertise.
It's fair to raise eyebrows on your expertise, since required fields don't contribute to b/w incompatibility at all, as every real-world protocol has a mandatory required version number that's tied to a direct parsing strategy with strictly defined algebra, both for shrinking (removing data fragments) and growing (introducing data fragments) payloads. Zero-values and optionality in protobuf are one version of that algebra; it's the most inferior one, subject to lossy protocol upgrades, and the easiest one for amateurs to design. Then there's the next level, where the protocol upgrade is defined in terms of bijective functions and other elements of symmetric groups that can tell you whether a newly announced data change can be carried forward (new required field) or dropped (removed field), as long as both the sending and receiving ends are able to derive new compound structures from previously defined pervasive types (the things protobuf calls oneofs and messages, for example).
What you describe using many completely unnecessary mathematical terms is not only not found in “every real-world protocol”, but is in fact virtually absent from the overwhelming majority of actually used protocols, with the notable exception of the kind of protocol that gets a four-digit-numbered RFC document describing it. Believe it or not, but in the software industry, nobody is defining a new “version number” with “strictly defined algebra” when they want to add a new field to a communication protocol between two internal backend services.
> What you describe using many completely unnecessary mathematical terms
Unnecessary for you, surely.
> Believe it or not, but in the software industry, nobody is defining a new “version number” with “strictly defined algebra” when they want to add a new field to a communication protocol between two internal backend services.
Name a protocol that doesn't have a respective version number, or that lacks the defined algebra in the form of the spec clarifications that accompany each new version. The word "strictly" in "strictly defined algebra" has to do with the fact that you cannot evolve a protocol without strictly publishing the changed spec; that is, you're strictly obliged to publish a spec, even a loosely defined one with lots of omissions and zero-values. That's the inferior algebra for protobuf, but you can keep thinking it is unnecessary and doesn't exist.
Instead of just handwaving about whether it's necessary or not, why not point to any protocol that relies on that attribute, and we can then evaluate how important that protocol is?
Yeah. And for anyone curious about the actual content hidden under the jargon-kludge-FP-nerd parent comment, here's my attempt at deciphering it.
They seem to be saying that you have to publish code that can change a type from schema A to schema B... And back, whenever you make a schema B. This is the "algebra". The "and back" part makes it bijective. You do this at the level of your core primitive types so that it's reused everywhere. This is what they meant by "pervasive" and it ties into the whole symmetric groups thing.
Finally, it seems like when you're making a lossy change, where a bijection isn't possible, they want you to make it incompatible. i.e, if you replaced address with city, then you cannot decode the message in code that expects address.
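If that reading is right, the idea in code would be something like this (Python; the record shape and function names are invented for illustration):

```python
# Every schema change ships an upgrade/downgrade pair, and compatible
# changes are exactly the ones where the round trip is lossless.

def upgrade_v1_to_v2(rec: dict) -> dict:
    # New field whose value is derivable from existing ones.
    return {**rec, "full_name": f"{rec['first']} {rec['last']}"}

def downgrade_v2_to_v1(rec: dict) -> dict:
    # Drop the derived field to recover the original record.
    return {k: v for k, v in rec.items() if k != "full_name"}

def change_is_compatible(rec: dict) -> bool:
    return downgrade_v2_to_v1(upgrade_v1_to_v2(rec)) == rec
```

Replacing `address` with `city` has no such lossless pair, so under this scheme it would be flagged as an incompatible change.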
> since required fields don't contribute to b/w incompatibility at all, as every real-world protocol has a mandatory required version number that's tied to a direct parsing strategy with strictly defined algebra
I know at least 10 different tech companies with billion-dollar revenues that do not fit your description. This comment makes me wonder if you have any experience working on real-world distributed systems. Oh, and I'm pretty sure you did not read Kenton's comment; he already precisely addressed your point:
> This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.
I recommend you do your homework before making such a strong argument. Reading a 5-minute-long comment is not that hard. You can avoid a lot of shame by doing so.
Is this satire?
Granted, on paper it’s a cool feature. But I’ve never once seen an application that will actually preserve that property.
Chances are, the author literally used software that does it as he wrote these words. This feature is critical to how Chrome Sync works. You wouldn’t want to lose synced state if you use an older browser version on another device that doesn’t recognize the unknown fields and silently drops them. This is so important that at some point Chrome literally forked protobuf library so that unknown fields are preserved even if you are using protobuf lite mode.
I'm starting to wonder if some of those bad design decisions are symptoms of a larger "cultural bias" at Google. Specifically the "No Compositionality" point: It reminds me of similar bad designs in Go, CSS and the web platform at large.
The pattern seems to be that generalized, user-composable solutions are discouraged in favor of a myriad of special constructs that satisfy whatever concrete use cases seem relevant for the designers in the moment.
This works for a while and reduces the complexity of the language upfront, while delivering results - but over time, the designs devolve into a rat's nest of hyperspecific design features with awkward and unintuitive restrictions.
Eventually, the designers might give up and add more general constructs to the language - but those feel tacked on and have to coexist with specific features that can't be removed anymore.
It works both ways. General constructs tend to become overly abstract and you end up with sneaky errors in different places due to a minor change to an abstraction.
Like the old adage, this is just a matter of preference. Good software engineering requires, first and foremost, great discipline, regardless of the path or tool you choose.
If there are errors in implementation of general constructs, they tend to be visible at their every use, and get rapidly fixed.
Some general constructs are better than the others, because they have an algebraic theory behind them, and sometimes that theory was already researched for a few hundred years.
For example, product/coproduct types mentioned in the article are quite close to addition and multiplication that we've all learned in school, and obey the same laws.
So there are several levels where the choice of ad-hoc constructs is wrong, and in the end the only valid reason to choose them is time constraints.
If they had 24 years to figure out how to do it properly, but they didn't, the technology is just dead.
Hm, that's idealistic...
I've certainly run into cases where small changes in general systems led to hard-to-detect bugs, which took a great deal of investigation to figure out. Not all failures are catastrophic.
The technology is quite alive, which is why it hasn't been 'fixed' - changing the wheels on a moving car, and all that. The actual disappointment is that a better alternative hasn't taken off in the six years since this post was written... If it's so easy, where are the alternatives?
> This works for a while and reduces the complexity of the language upfront, while delivering results - but over time, the designs devolve into a rat's nest of hyperspecific design features with awkward and unintuitive restrictions.
But that's true for almost anything, though.
I share the author's sentiment. I hate these things.
True story: trying to reverse engineer macOS Photos.app sqlite database format to extract human-readable location data from an image.
I eventually figured it out, but it was:
A base64 encoded Binary Plist format with one field containing a ProtoBuffer which contained another protobuffer which contained a unicode string which contained improperly encoded data (for example, U+2013 EN DASH was encoded as \342\200\223)
This could have been a simple JSON string.
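For what it's worth, the mangled dashes are mechanically recoverable: the literal text "\342\200\223" is just the octal spelling of the UTF-8 bytes 0xE2 0x80 0x93 for U+2013. A quick Python round trip (illustrative):

```python
def fix_octal_utf8(s: str) -> str:
    # "\342\200\223" (with literal backslashes) -> raw bytes -> the dash
    return (s.encode("ascii")           # keep the literal backslashes
             .decode("unicode_escape")  # \342 -> chr(0xE2), etc.
             .encode("latin-1")         # code points back to raw bytes
             .decode("utf-8"))          # decode as the UTF-8 it always was
```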
> This could have been a simple JSON string.
There's nothing "simple" about parsing JSON as a serialization format.
Having attempted writing a JSON parser from scratch and a protobuf parser from scratch and only completing one of them, I disagree.
Except that most often you can just look at it and figure it out.
Sure you can look at it[1], but you're not expected to look at Apple Photos database. The computer is.
Write a correct JSON parser, compare with protobuf on various metrics, and then we can talk.
[1]: although to be fair, I am older than the kids whose first programming language was JavaScript, so I do not think of the JSON object format (property names in quotes, integers that need to be wrapped as strings to be safe, no comma allowed after the last entry - to be fair, this last one is a problem when writing, not reading, JSON) as the most natural thing
I'm also "older" but I don't think that means anything.
> Sure you can look at it[1], but you're not expected to look at Apple Photos database.
How else are you supposed to figure it out? If you're older, then you know that you can't rely on the existence or correctness of documentation. Being able to look at JSON and understand it as a human on the wire is a huge advantage. JSON being pretty simple in structure is an advantage. I don't see a problem with quoting property names! As for large integers and datetimes, yes, that could be much better designed. But that's true of every protocol and file format that has had any success.
JSON parsers and writers are common and plentiful and are far less crazy than any complete XML parser/writer library.
> Being able to look at JSON and understand it as a human on the wire is a huge advantage
I don’t think this is a given at all. Depends on the context. I think it’s often overvalued. A lot of times the performance matters more. If human readability was the only thing that mattered, I would still not count JSON as the winner. You will have to pipe it to jq, realistically. You’d do the same for any other serialization format too. Inside Google where proto is prevalent, that is just as easy if not more convenient.
The point is how hard or easy it is for an app’s end user to decipher its file database is not a design goal for the serialization library chosen by Apple Photos developers here. The constraints and requirements are all on different axis.
Sure, but unless you want to embed an LLM in every JSON library, computers can't do that.
https://github.com/RhetTbull/osxphotos
The JSON version would have also had the wrong encoding - all formats are just a framing for data fed in from code written by a human. In mac's case, em dash will always be an issue because that's just what Mac decided on intentionally.
I mean... you can nest-encode stuff in any serial format. You're not describing a problem either intrinsic or unique to Protobuf, you're just seeing the development org chart manifested into a data structure.
Good point. This wasn't entirely a protobuf-specific issue so much as a (likely hierarchical and historical) set of bad decisions to use it at all.
Using Protobuffers for a few KB of metadata, when the photo library otherwise takes multiple GB of data, is just penny-wise, pound-foolish.
Of course, even my preference for a simple JSON string would be problematic: data in a database really should be stored properly normalized to a separate table and fields.
My guess is that protobuffers did play a role here in causing this poor design. I imagine this scenario:
- Photos.app wants to look up location data
- the server returns structured data in a ProtoBuffer
- there's no easy or reasonable way to map a protobuf to database fields (one point of TFA)
- Surrender! just store the binary blob in SQLITE and let the next poor sod deal with it
You have to take into account the fact that iPhoto app has had many iterations. The binary plist stuff is very likely the native NSArchive "object archiving (serialization)" that is done by Obj-C libraries. They probably started using protobuf at some point later after iCloud. I suspect the unicode crap you are facing may even predate Cocoaization of the app (they probably used Carbon API).
So it would make it a set of historical decisions, but I am not convinced they are necessarily bad decisions given the constraints. Each layer is likely responsible for handling edge cases in the application that you and I are not privy to.
If that's any consolation, in the current version's schema they are just plain ZLATITUDE FLOAT, ZLONGITUDE FLOAT in the ZASSET table.
That's horrendous. For some reason I imagine Apple's software to be much cleaner, but I guess that's just the marketing getting to my head. Under the hood it's still the same spaghetti.
Yeah, the problem is Apple and all the other contemporary tech companies have engineers bounce around between them all the time, and they take their habits with them.
At some point there becomes a critical mass of xooglers in an org, and when a new use case happens no one bothers to ask “how is serialization typically done in Apple frameworks”, they just go with what they know. And then you get protobuf serialization inside a plist. (A plist being the vanilla “normal” serialization format at Apple. Protobuf inside a plist is a sign that somebody was shoehorning what they’re comfortable with into the code.)
Discussed many times over the years:
https://news.ycombinator.com/item?id=18188519 (299 comments)
https://news.ycombinator.com/item?id=21871514 (215 comments)
https://news.ycombinator.com/item?id=35281561 (59 comments)
There are a lot of great comments on these old threads, and I don't think there's a lot of new science in this field since 2018, so the old threads might be a better read than today's.
Here's a fun one:
https://news.ycombinator.com/item?id=21873926
I don't know if the author is right or wrong; I've never dealt with protobufs professionally. But I recently implemented them for a hobby project and it was kind of a game-changer.
At some stage with every ESP or Arduino project, I want to send and receive data, i.e. telemetry and control messages. A lot of people use ad-hoc protocols or HTTP/JSON, but I decided to try the nanopb library. I ended up with a relatively neat solution that just uses UDP packets. For my purposes a single packet has plenty of space, and I can easily extend this approach in the future. I know I'm not the first person to do this but I'll probably keep using protobufs until something better comes along, because the ecosystem exists and I can focus on the stuff I consider to be fun.
Embedded/constrained UDP is where protobuf wire format (but not google's libraries) rocks: IoT over cellular and such, where you need to fit everything into a single datagram (number of roundtrips is what determines power consumption). As to those who say "UDP is unreliable" - what you do is you implement ARQ on the application level. Just like TCP does it, except you don't have to waste roundtrips on SYN-SYN-ACK handshake nor waste bytes on sending data that are no longer relevant.
Varints for the win. Send time series as columns of varint arrays; delta or RLE compression becomes quite straightforward. And as a bonus, I can just implement new fields in the device and deploy right away - the server-side support can wait until we actually need it.
No, flatbuffers/cap'n'proto are unacceptably big because of their fixed layout. No, CBOR is an absolute no-go - why on earth would you waste precious bytes on schema every time? No, general-purpose compression like gzip wouldn't do much at such a small size; it will probably make things worse. Yes, ASN.1 is supposed to be the right solution - but there is no full-featured implementation that doesn't cost $$$$ and the whole thing is just too damn bloated.
Kinda fun that it sucks for what it is supposed to do, but actually shines elsewhere.
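The delta-plus-varint trick the parent describes can be sketched in a few lines (Python, illustrative only; a real encoder would also ZigZag-encode so negative deltas stay small):

```python
# Protobuf-style base-128 varints: 7 payload bits per byte, high bit = "more".
def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varints(data: bytes):
    n, shift = 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            yield n
            n, shift = 0, 0

def encode_series(samples):
    # Delta-encode first: near-constant series become streams of tiny varints.
    prev, out = 0, bytearray()
    for s in samples:
        out += encode_varint(s - prev)
        prev = s
    return bytes(out)
```

A slowly changing sensor reading compresses to roughly one byte per sample this way, which is what makes single-datagram telemetry workable.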
> Yes, ASN.1 is supposed to be the right solution - but there is no full-featured implementation that doesn't cost $$$$ and the whole thing is just too damn bloated.
Oh, for crying out loud! PB had ZERO tooling available when it was created! It would have been much easier to create ASN.1 tooling w/ OER/PER and for some suitable subset of ASN.1 in 2001 than it was to a) create an IDL, b) create an encoding, and c) write tooling for N programming languages.
In fact, one thing one could have done is write a transpiler from the IDL to an AST that does all linting, analysis, and linking, and which one can then use to drive codegen for N languages. Or even better: have the transpiler produce a byte-coded representation of the modules and then for each programming language you only need to codegen the types but not the codecs -- instead for each language you need only write the interpreter for the byte-coded modules. I know because I've extended and maintained an [open source] ASN.1 compiler that fucking does [some of] these things.
Stop spreading this idea that ASN.1 is bloated. It's not. You can cut it down for your purposes. There are only 4 specifications for the language itself, of which the base one (X.680) is enough for almost everything (the others, X.681, X.682, and X.683, are mainly for parameterized types and formal typed-hole specifications [the ASN.1 "information object system"], which are awesome but you can live without). And these are some of the best-written and most readable specifications ever produced by any standards development organization - they are a great gift from a few to all of mankind.
> It would have been much easier to create ASN.1 tooling w/ OER/PER and for some suitable subset of ASN.1 in 2001
Just by looking at your past comments - I agree that if Google had reused ASN.1, we would be living in a better world. But the sad reality now is that PB has tons of FOSS tooling and ASN.1 barely any (is there any free embedded-grade implementation other than asn1c?), and figuring out which features you can use without having to pledge your kidney and soul to Nokalva is a bit hard.
I tried playing with ASN.1 before settling on protobuf. I don't recall which compiler I used, but I immediately figured out that apparently the datetime datatype was not supported, and the generated C code was a bloated mess (so is Google's protobuf - but not nanopb). Protobuf, on the other hand, was quite straightforward about what is and is not supported. So us mortals who aren't Google and have a hard time justifying writing serdes from scratch gotta use what's available.
> Stop spreading this idea that ASN.1 is bloated
"Bloated" might be the wrong word - but it is large and it's damn hard for someone designing a new application to figure out which part is safe to use, because most sources focus on using it for decoding existing protocols.
For sure PB is a fact of life now. A regrettable fact of life, but perhaps a lesson (that few will heed).
> why on earth would you waste precious bytes on schema every time
cbor doesn't prescribe sending schema, in fact there is no schema, like json.
I just switched from protobuf to CBOR because I needed better streaming support, and I find it quite delightful. Losing the protobuf schema hurts a bit, but the amount of boilerplate code is actually less than what I had before with nanopb (embedded context). On top of that, I am saving approx. 20% in message size compared to protobuf because I am mostly using arrays with fixed-position parameters.
> cbor doesn't prescribe sending schema, in fact there is no schema, like json.
You are right, I must have confused CBOR with BSON where you send field names as strings.
> On top of that, I am saving approx. 20% in message size compared to protobuf because I am mostly using arrays with fixed-position parameters
Arrays with fixed positions are always going to be the most compact format, but that means you essentially give up on schema evolution. Also, when you have a large structure (e.g. a full set of device state and settings) where most of the fields change only infrequently, it makes sense to send only what's changed, and then TLV is significantly better.
Other than ASN.1 PER, is there any other widely used encoding format that isn't self-describing? Using TLV certainly adds flexibility around schema evolution, but I feel like collectively we are wasting a fair amount of bytes because of it...
Cap'n'proto doesn't have tags, but it wastes even more bytes in favor of speed. Then again, omitting tags only saves space if you are sending all the fields every time. PER uses a bitmap, which is still a bit wasteful on large sparse structs.
OER (related to PER)
XDR (ONC RPC, NFS)
MS RPC (DCE RPC w/ tweaks)
Flat Buffers
You can also send JSON over UDP. Wiz smart bulbs do this for communication.
https://github.com/sbidy/pywizlight?tab=readme-ov-file#examp...
And since it's UDP, if it's lost, it's lost. And since it's not standard HTTP/JSON, nobody will have a clue in a year and won't be able to decode it.
To learn and play with, it's fine; otherwise, why complicate life?
Using protobuf is practical enough in embedded. This person isn't the first and won't be the last. Way faster than JSON, way slower than C structs.
However protobuf is ridiculously interchangeable and there are serializers for every language. So you can get your interfaces fleshed out early in a project without having to worry that someone will have a hard time ingesting it later on.
Yes it's a pain how an empty array is a valid instance of every message type, but at least the fields that you remember to send are strongly typed. And field optionality gives you a fighting chance that your software can still speak to the unit that hasn't been updated in the field for the last five years.
On the embedded side, nanopb has worked well for us. I'm not missing having to hand maintain ad-hoc command parsers on the embedded side, nor working around quirks and bugs of those parsers on the desktop side
Before the first line even ends, you get "They’re clearly written by amateurs".
This is rage bait, not worth the read.
The reasons for that line get at a fundamental tension. As David Wheeler famously said, "All problems in computer science can be solved by another level of indirection, except for the problem of too many indirections."
Over time we accumulate cleverer and cleverer abstractions. And any abstraction that we've internalized, we stop seeing. It just becomes how we want to do things, and we have no sense of what cost we are imposing with others. Because all abstractions leak. And all abstractions pose a barrier for the maintenance programmer.
All of which leads to the problem that Brian Kernighan warned about with, "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?" Except that the person who will have to debug it is probably a maintenance programmer who doesn't know your abstractions.
One of the key pieces of wisdom that show through Google's approaches is that our industry's tendency towards abstraction is toxic. As much as any particular abstraction is powerful, allowing too many becomes its own problem. This is why, for example, Go was designed to strongly discourage over-abstraction.
Protobufs do exactly what it says on the tin. As long as you are using them in the straightforward way which they are intended for, they work great. All of his complaints boil down to, "I tried to do some meta-manipulation to generate new abstractions, and the design said I couldn't."
That isn't the result of them being written by amateurs. That's the result of them being written to incorporate a piece of engineering wisdom that most programmers think that they are smart enough to ignore. (My past self was definitely one of those programmers.)
Can the technology be abused? Do people do stupid things with them? Are there things that you might want to do that you can't? Absolutely. But if you KISS, they work great. And the more you keep it simple, the better they work. I consider that an incentive towards creating better engineered designs.
I think you nailed it. So many complaints about Go for example basically come down to "it didn't let me create X abstraction" and that's basically the point.
The best way to get your point across is by starting with ad-hominem attacks to assert your superior intelligence.
IMO it's a pretty reasonable claim about experience level, not intelligence, and isn't at all an ad hominem attack because it's referring directly to the fundamental design choices of protocol buffers and thus is not at all a fallacy of irrelevance.
Whatever else Jeff Dean and Sanjay Ghemawat are, and whatever mistakes they made in designing protobufs, they are not amateurs.
Not long after they designed and implemented protobuffers, they shared the ACM prize in computing, as well as many other similar honors. And the honors keep stacking up.
None of this means that protobufs are perfect (or even good), but it does mean they weren't amateurs when they did it.
https://en.wikipedia.org/wiki/Jeff_Dean
https://en.wikipedia.org/wiki/Sanjay_Ghemawat
I disagree, unless you are in the majority.
Is this in reference to the blogpost, the comment above, or your own comment? Cause it honestly works for all of them.
[dead]
Yeah, let's pretend that type algebra doesn't exist, and even if it does exist then it's not useful and definitely isn't practical in data protocols. Let's believe that the authors of protobuf considered everything, and since they aren't amateurs (by the virtue of having worked on protobuf at Google, presumably), every elaborated opinion that draws them as amateurs at applying type algebra in data protocol designs is a personal ad-hominem attack.
They're not amateurs by virtue of being some of the most senior engineers ever to work at Google. You don't get to play the "ad hominem" card while calling them names. This whole thread is embarrassing.
Ok, "some of the most senior engineers ever to work at Google" don't seem to know that static bounds checking don't require dependent types: https://news.ycombinator.com/item?id=45150008
> You don't get to play the "ad hominem" card while calling them names
The entire article explains it at length why there's the impression, it's not ad-hominem.
[flagged]
If only the article offered both detailed analyses of the problems and also solutions. Wait, it does! You should try reading it.
Where's the download link for the solution? I must have missed it.
it does not
Yep, the article opens with a Hall of Fame-grade compound fallacy: a strawman refutation of a hypothetical ad hominem that nobody has argued.
You can kinda see how this author got bounced out of several major tech firms in one year or less, each, according to their linkedin.
It's a terrible attitude and I agree that sort of thing shouldn't be (and generally isn't) tolerated for long in a professional environment.
That said the article is full of technical detail and voices several serious shortcomings of protobuf that I've encountered myself, along with suggestions as to how it could be done better. It's a shame it comes packaged with unwarranted personal attacks.
Yeah, there is a lot of snark in the article which undermines their argument.
> if (m_foo = null)
Imagine calling Google amateurs, and then the only code you write contains a first-year student error: failing to distinguish the assignment operator from the comparison operator.
There's a class of rant on the internet where programmers complain about increasingly foundational tech instead of admitting skill issues. If you go far deep into that hole, you end up rewriting the kernel in Rust.
It's written by amateurs, but solves problems that only Google (one of the biggest/most advanced tech companies in the world) has.
I'm afraid that this is a case of someone imagining that there are Platonic ideal concepts that don't evolve over time, that programs are perfectible. But people are not immortal and everything is always changing.
I almost burst out in laughter when the article argued that you should reuse types in preference to inlining definitions. If you've ever felt the pain of needing to split something up, you would not be so eager to reuse. In a codebase with a single process, it's pretty trivial to refactor to split things apart; you can make one CL and be done. In a system with persistence and distribution, it's a lot more awkward.
That whole meaning of data vs representation thing. There's fundamentally a truth in the correspondence. As a program evolves, its understanding of its domain increases, and the fidelity of its internal representations increase too, by becoming more specific, more differentiated, more nuanced. But the old data doesn't go away. You don't get to fill in detail for data that was gathered in older times. Sometimes, the referents don't even exist any more. Everything is optional; what was one field may become two fields in the future, with split responsibilities, increased fidelity to the domain.
Yeah, oneOf fields can be repeated but you can just wrap them in a message. It's not as pretty but I've never had any issues with this.
The fact that the author is arguing for making all fields required means they don't understand the reasoning for why all fields are optional. Required fields break systems when proto versions mismatch (there are postmortems outlining this).
I'm not sure why this post gets boosted every few years- and unfortunately (as many have pointed out) the author demonstrates here that they do not understand distributed system design, nor how to use protocol buffers. I have found them to be one of the most useful tools in modern software development when used correctly. Not only are they much faster than JSON, they prevent the inevitable redefinition of nearly identical code across a large number of repos (which is what i've seen in 95% of corporate codebases that eschew tooling such as this). Sure, there are alternatives to protocol buffers, but I have not seen them gain widespread adoption yet.
Protobuf's original sin was failing to distinguish zero/false from undefined/unset/nil. Confusion around the semantics of a zero value are the root of most proto-related bugs I've come across. At the same time, that very characteristic of protobuf makes its on-wire form really efficient in a lot of cases.
Nearly every other complaint is solved by wrapping things in messages (sorry, product types). Don't get the enum limitation on map keys, that complaint is fair.
Protobuf eliminates truckloads of stupid serialization/deserialization code that, in my embedded world, almost always has to be hand-written otherwise. If there was a tool that automatically spat out matching C, Kotlin, and Swift parsers from CDDL, I'd certainly give it a shot.
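The zero-vs-unset ambiguity described above can be shown with a plain dict standing in for a decoded proto3 message (the field names are hypothetical; this sketches the semantics, not the actual protobuf API): a field the sender never set and a field explicitly set to its zero value decode identically.

```python
# Stand-in for a decoded proto3 message: absent fields fall back
# to their type's zero value, just as proto3 decoding does.
DEFAULTS = {"count": 0, "enabled": False}

def get_field(decoded: dict, name: str):
    # Missing on the wire? You get the zero value -- indistinguishable
    # from a field the sender deliberately set to zero.
    return decoded.get(name, DEFAULTS[name])

explicitly_zero = {"count": 0}  # sender set count = 0 on purpose
never_sent = {}                 # sender omitted count entirely

assert get_field(explicitly_zero, "count") == get_field(never_sent, "count")
```

This is exactly why "was it zero, or was it never set?" bugs keep surfacing: the receiver has no way to tell the two cases apart.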
> Protobuf's original sin was failing to distinguish zero/false from undefined/unset/nil.
It's only proto3 that doesn't distinguish between zero and unset by default. Both the earlier and later versions support it.
Proto3 was a giant pile of poop in most respects, including removing support for field presence. They eventually put it back in as a per-field opt-in property, but by then the damage was done.
A huge unforced mistake, but I don't think a change made after the library had existed for 15 years and reverted qualifies as an "original sin".
Agreed the CDDL to codegen pipeline / tooling is the biggest thing holding back CBOR at the moment.
Some solutions do exist. Here's a C one [1]; maybe you could throw in some WASI/WASM compilation and get “somewhat” idiomatic bindings in a bunch of languages.
Here’s another for Rust [2] but I’m sure I’ve seen a bunch of others around. I think what’s missing is a unified protoc style binary with language specific plugins.
[1] https://github.com/NordicSemiconductor/zcbor
[2] https://github.com/dcSpark/cddl-codegen
> Protobuffers correspond to the data you want to send over the wire, which is often related but not identical to the actual data the application would like to work with
This sums up a lot of the issues I’ve seen with protobuf as well. It’s not an expressive enough language to be the core data model, yet people use it that way.
In general, if you don’t have extreme network needs, then protobuf seems to cause more harm than good. I’ve watched Go teams spend months of time implementing proto based systems with little to no gain over just REST.
Protobuf is independent from REST. You can have either one. Or both. Or neither. One has nothing to do with the other.
yes I fully understand that, the point is a lot of teams focus on protobuf for their network layer
[dead]
On the other hand, ASN.1 is very expressive and can cover pretty much anything, but Protobuff was created because people thought ASN.1 is too complex. I guess we can't have both.
If people thought ASN.1 was too big, all they had to do was create a small profile of it large enough for the task at hand.
X.680 is fairly small. Require AUTOMATIC TAGs, remove manual tagging, remove REAL and EMBEDDED PDV and such things, and what you're left with is pretty small.
"Those who cannot remember the past are condemned to repeat it" -- George Santayana
Oh, I remember ASN.1 very well, and I would not want to repeat it again.
Protobufs have lots of problems, but at least they are better than ASN.1!
Details please.
Things people say who know very little about ASN.1:
- it's bloated! (it's not)
- it's had lots of vulnerabilities! (mainly in hand-coded codecs)
- it's expensive (it's not -- it's free and has been for two decades)
- it's ugly (well, sure, but so is PB's IDL)
- the language is context-dependent, making it harder to write a parser for (this is quite true, but so what, it's not that big a deal)
The vulnerabilities were only ever in implementations, almost entirely in cases of hand-coded codecs, and the thing that made many of these vulnerabilities possible was the use of tag-length-value encoding rules (BER/DER/CER), which, ironically, Protocol Buffers bloody well uses too.
If you have different objections to ASN.1, please list them.
>The solution is as follows:
> * Make all fields in a message required. This makes messages product types.
Meanwhile in the capnproto FAQ:
>How do I make a field “required”, like in Protocol Buffers?
>You don’t. You may find this surprising, but the “required” keyword in Protocol Buffers turned out to be a horrible mistake.
I recommend reading the rest of the FAQ [0], but if you are in a hurry: fixed-schema protocols like protobuffers do not let you remove fields the way self-describing formats such as JSON do. Removing fields or switching them from required to optional is an ABI-breaking change. Nobody wants to update all servers and all clients simultaneously. At that point, you would be better off defining a new API endpoint and deprecating the old one.
The capnproto faq article also brings up the fact that validation should be handled on the application level rather than the ABI level.
[0] https://capnproto.org/faq.html
I lost the plot here when the author argued that repeated fields should be implemented as in the pure lambda calculus...
Most of the other issues in the article can be solved be wrapping things in more messages. Not great, not terrible.
As with the tightly-coupled issues with Go, I'll keep waiting for a better approach any decade now. In the meantime, both tools (for their glaring imperfections) work well enough, solve real business use cases, and have a massive ecosystem moat that makes them easy to work with.
They didn't. Pure lambda calculus would have been "a function that when applied to a number encoded as a function, extracts that value".
They did it essentially as a linked list, C-strings, or UTF-8 characters: "current data, and is there more (next pointer, non-null byte, continuation bit set)?" They also noted that it could have this semantics without necessarily following this implementation encoding, though that seems like a dodge to me; length-prefixed array is a perfectly fine primitive to have, and shouldn't be inferred from something that can map to it.
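That "current data, and is there more?" scheme with a continuation bit is, incidentally, exactly how protobuf encodes varints on the wire: each byte carries 7 payload bits, with the high bit flagging that another byte follows. A minimal sketch:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a protobuf-style varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F      # low 7 payload bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: this is the last byte
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Decode a varint from the start of `data`."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:  # continuation bit clear: done
            break
    return result

# 300 = 0b10_0101100 -> two bytes: 0xAC (continuation + 0101100), 0x02
assert encode_varint(300) == b"\xac\x02"
assert decode_varint(b"\xac\x02") == 300
```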
I recently made a realization, that I can use MessagePack with a static schema defined in the code, and even pre-defined numeric field IDs, essentially replacing Protobuf for my use cases. I saw MessagePack as an alternative for JSON, with loose message structure, but it's actually a nice binary format and can be used more effectively than that. So now I enjoy things like tagged unions (in Zig/Python), and other types that are awkward to express in Protobuf. I settled on single character field names, for compatibility with msgspec, and I'm pretty happy with it. Still super compact messages with predictable schema, that are fast to parse, because I know which fields to expect.
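A minimal sketch of that setup (the `Event` type and its fields are hypothetical, and `json` stands in for msgpack here, since only the packing layer differs): the schema lives in code, and single-character keys keep the wire form compact.

```python
import json
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: int
    name: str

    # Static schema defined in code: field -> single-character wire key,
    # for compactness and msgspec compatibility.
    _keys = {"timestamp": "t", "name": "n"}

    def pack(self) -> bytes:
        # Swap json.dumps for msgpack.packb to get the binary format.
        return json.dumps({k: getattr(self, f) for f, k in self._keys.items()}).encode()

    @classmethod
    def unpack(cls, data: bytes) -> "Event":
        obj = json.loads(data)
        return cls(**{f: obj[k] for f, k in cls._keys.items()})

e = Event(timestamp=1700000000, name="boot")
assert Event.unpack(e.pack()) == e
```

Because the receiver knows which keys to expect, parsing stays fast and the messages stay small, without a separate IDL or codegen step.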
I went into this article expecting to agree with part of it. I came away agreeing with all of it. And I want to point out that Go also shares some of these catastrophic data decisions (automatic struct zero values that silently do the wrong thing by default).
We got bit by a default value in a DMS task where the target column didn't exist so the data wasn't replicated and the default value was "this work needs to be done."
This is not pb nor go. A sensible default of invalid state would have caught this. So would an error and crash. Either would have been better than corrupt data.
You mean aws dms insterted the string literal “this work needs to be done” into your db?
So, that target column had the wrong name, meaning data intended for the column never arrived, causing the default value in the database to be used, which was an integer that mapped to "this work item needs to be processed still," which led to double-processing the record post-DMS migration.
With these serialization libraries, do any of them have a facility that allows you to specify a wire format and an application format, with recipes for converting one to the other?
I haven't used these very seriously, but a problem I had a while back was that the wire format was not what the applications wanted to use, while a good application format was too space-inefficient for the wire.
As far as I could see there was not a great way to do this. You could rewrite wire<->app converter in every app, or have a converter program and now you essentially have two wire formats and need to put this extra program and data movement into workflows, or write a library and maintain bindings for all your languages.
>> You could rewrite wire<->app converter in every app
This is what Google does. We joke that our entire jobs are "convert protobuf A into protobuf B".
The way to do this starts with not hard-wiring the code generation step.
Instead, make codegen a function of BOTH a data schema object and a code template (eg expressed in Jinja2 template language - or ZeroMQ GSL where I first saw this approach). The codegen stage is then simply the application of the template to the data schema to produce a code artifact.
The templates are written assuming the data schema is provided following a meta-schema (eg JSON Schema for a data schema in JSON). One can develop, eg per-language templates to produce serialization code or intra-language converters between serialization forms (on wire) and application friendly forms. The extra effort to develop a template for a particular target is amortized as it will work across all data schemas that adhere to a common meta-schema.
The "codegen" stage can of course be given non "code" templates to produce, eg, reference documentation about the data schema in different formats like HTML, text, nroff/man, etc.
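A minimal sketch of this approach, using the stdlib's `string.Template` as a stand-in for Jinja2/GSL (the schema shape and names here are illustrative): the codegen stage is just a function of a schema object and a template, so the same schema can be fed to different templates to emit code, docs, or converters.

```python
from string import Template

# Data schema following a simple meta-schema: a name plus (field, type) pairs.
schema = {"name": "Point", "fields": [("x", "int"), ("y", "int")]}

# Per-target templates -- here, a C struct. A docs template or a
# serializer template would consume the exact same schema object.
struct_tmpl = Template("struct $name {\n$body};\n")
field_tmpl = Template("    $ctype $fname;\n")

def render(schema: dict) -> str:
    body = "".join(field_tmpl.substitute(ctype=t, fname=f)
                   for f, t in schema["fields"])
    return struct_tmpl.substitute(name=schema["name"], body=body)

print(render(schema))
# struct Point {
#     int x;
#     int y;
# };
```

The leverage comes from amortization: one template per target works across every schema that adheres to the common meta-schema.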
I didn't recognize the GSL citation, so for others:
- https://github.com/zeromq/gsl/blob/v4.1.5/examples/fsm_c.gsl
whew, this readme has everything
- XML in, text out: https://github.com/zeromq/gsl#:~:text=feed%20it%20some%20dat...
- a whole section on software engineering https://github.com/zeromq/gsl#model-oriented-programming
- they support COBOL https://github.com/zeromq/gsl#cobol
- and then a project 11 years old with "we're going to document these functions one day" https://github.com/zeromq/gsl#global-functions
What a journey that was
If you care about network bandwidth you can compress before sending, as virtually all web applications do. Then you don't need to worry much about the space efficiency of the application format.
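A toy illustration of the point, using zlib on a deliberately verbose JSON payload (the record fields are made up): generic compression claws back much of the redundancy of a verbose application format.

```python
import json
import zlib

# 100 verbose, repetitive records -- the kind of redundancy a
# human-friendly application format tends to produce.
verbose = json.dumps(
    [{"temperature_celsius": 21.5, "sensor_name": "s1"}] * 100
).encode()

compressed = zlib.compress(verbose)

# Repetitive structure compresses extremely well.
assert len(compressed) < len(verbose) // 4
assert zlib.decompress(compressed) == verbose
```

This is not free, of course: compression costs CPU on both ends, and (as the reply below notes) for extreme event rates the decompressed volume still has to be processed.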
Of the wire format, you mean? I compress it and still need to care about the space efficiency of the wire format beyond that. The compression ratio does improve a lot when we skip our own encoding, but the end result is still significantly larger. It also becomes significantly slower because there is more data to process, which is possibly the bigger problem.
It's probably not like most web applications: it's hardware data loggers that produce hundreds of millions to billions of events per second (each with a minimum of about 4 bytes of wire format and a maximum of roughly 500 bytes).
Sometimes you are integrating with system that already use proto though. I recently wrote a tiny, dependency-free, practical protobuf (proto3) encoder/decoder. For those situations where you need just a little bit of protobuf in your project, and don't want to bother with the whole proto ecosystem of codegen and deps: https://github.com/allanrbo/pb.py
> Maintain a separate type that describes the data you actually want, and ensure that the two evolve simultaneously.
I don't actually want to do this, because then you have N + 1 implementations of each data type, where N = number of programming languages touching the data, and + 1 for the proto implementation.
What I personally want to do is use a language-agnostic IDL to describe the types that my programs use. Within Google you can even do things like just store them in the database.
The practical alternative is to use JSON everywhere, possibly with some additional tooling to generate code from a JSON schema. JSON is IMO not as nice to work with. The fact that it's also slower probably doesn't matter to most codebases.
> I don't actually want to do this, because then you have N + 1 implementations of each data type, where N = number of programming languages touching the data, and + 1 for the proto implementation.
I think this is exactly what you end up with using protobuf. You have an IDL that describes the interface types but then protoc generates language-specific types that are horrible so you end up converting the generated types to some internal type that is easier to use.
Ideally if you have an IDL that is more expressive then the code generator can create more "natural" data structures in the target language. I haven't used it a ton, but when I have used thrift the generated code has been 100x better than what protoc generates. I've been able to actually model my domain in the thrift IDL and end up with types that look like what I would have written by hand so I don't need to create a parallel set of types as a separate domain model.
> The practical alternative is to use JSON everywhere, possibly with some additional tooling to generate code from a JSON schema.
Protobuf has a bidirectional JSON mapping that works reasonably well for a lot of use cases.
I have used it to skip the protobuf wire format all together and just use protobuf for the IDL and multi-language binding, both of which IMO are far better than JSON-Schema.
JSON-Schema is definitely more powerful though, letting you do things like field-level constraints. I'd love to see a tool that paired the best of both.
It is a 7 year old article without specifying alternatives to an "already solved problem."
So HN, what are the best alternatives available today and why?
Something like MessagePack or CBOR, and if you want versioning, just have a version field at the start. You don't require a schema to pack/unpack, which I personally think is a good thing.
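The version-field idea can be sketched in a few lines. Here `struct` packs a one-byte version prefix ahead of a schemaless payload, with `json` standing in for MessagePack/CBOR (the `pack`/`unpack` names are illustrative):

```python
import json
import struct

def pack(version: int, payload: dict) -> bytes:
    # One unsigned byte of version, then the schemaless payload.
    return struct.pack("B", version) + json.dumps(payload).encode()

def unpack(data: bytes) -> tuple[int, dict]:
    # Read the version first; dispatch on it before touching the payload.
    version, = struct.unpack_from("B", data)
    return version, json.loads(data[1:])

msg = pack(2, {"id": 7})
assert unpack(msg) == (2, {"id": 7})
```

A receiver can then branch on the version to handle old senders, which is the manual equivalent of what schema-evolution rules give you in protobuf.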
> You don't require a schema to pack/unpack
Then it hardly solves the same problem Protobuf solves.
Arrow is also becoming a good contender, with the extra benefit it is better optimized for data batches.
Support across languages etc is much less mature but I find thrift serialization format to be much nicer than protobuf. The codegen somehow manages to produce types that look like types I would actually write compared to the monstrosities that protoc generates.
There are none, protobufs are great.
Depends. ASN.1 is a beast and another industry standard, but unfortunately the best tooling is closed source.
There was ZERO PB tooling in 2000. Just write it for ASN.1 instead.
Mentioned above: https://github.com/stepchowfun/typical
CBOR is probably the best and most standards compliant thing out there that I’m aware of.
It’s the new default in a lot of IOT specs, it’s the backbone for deep space communication networks etc..
Maintains interoperability with JSON. Is very much battle tested in very challenging environments.
Always initializing with a default and no algebraic types is an always loaded foot gun. I wonder if the people behind golang took inspiration from this.
The simplest way to understand Go is that it is a language that integrates some of Google's best C++ features (their lightweight threads and other multithreading primitives are the highlights).
Beyond that it is a very simple language. But yes, 100%, for better and worse, it is deeply inspired by Google's codebase and needs
The crappy system that everyone ends up using is better than the perfectly designed system that's only seen in academic papers. Javascript is the poster-child of Worse is Better. Protobuffs are a PITA, but they are widely used and getting new adoption in industry. https://en.wikipedia.org/wiki/Worse_is_better
I worked at a company that had their own homegrown Protobuf alternative which would add friction to life constantly. Especially if you had the audacity to build anything that wasn't meant to live in the company monorepo (your Python script is now a Docker image that takes 30 minutes to build).
One day I got annoyed enough to dig for the original proposal and like 99.9% of initiatives like this, it was predicated on:
- building a list of existing solutions
- building an overly exhaustive list of every facet of the problem to be solved
- declare that no existing solution hits every point on your inflated list
- "we must build it ourselves."
It's such a tired playbook, but it works so often unfortunately.
The person who architects and sells it gets points for "impact", then eventually moves onto the next company.
In the meantime the problem being solved evolves and grows (as products and businesses tend to), the homegrown solution no longer solves anything perfectly, and everyone is still stuck dragging along said solution, seemingly forever.
-
Usually eventually someone will get tired enough of the homegrown solution and rightfully question why they're dragging it along, and if you're lucky it gets replaced with something sane.
If you're unlucky, that person also uses it as justification to build a new in-house solution (we built the old one, after all), and you replay the loop.
In the case of serialization though, that's not always doable. This company was storing petabytes (if not exabytes) of data in the format for example.
The author makes good arguments; I wish they'd offered some alternatives.
Despite issues, protobufs solve real problems and (imo) bring more value than cost to a project. In particular, I'd much rather work with protobufs and their generated ser/de than untyped json
> Make all fields in a message required.
funnily enough, this line alone reveals the author to be an amateur in the problem space they are writing so confidently about.
the complaints about the Protobuf type system being not flexible enough are also really funny to read.
fundamentally, the author refuses to contend with the fact that the context in which Protobufs are used -- millions of messages strewn around random databases and files, read and written by software using different versions of libraries -- is NOT the same scenario where you get to design your types once and then EVERYTHING that ever touches those types is forced through a type checker.
again, this betrays a certain degree of amateurishness on the author's part.
Kenton has already provided a good explanation here: https://news.ycombinator.com/item?id=45140590
> is NOT the same scenario where you get to design your types once and then EVERYTHING that ever touches those types is forced through a type checker.
the author never claimed the types had to be designed only once, he claimed that schema evolution chosen by protobuf is inadequate for the purpose of lossless evolution.
> Kenton has already provided a good explanation here: https://news.ycombinator.com/item?id=45140590
TLDR: yada-yada [...] protobuf is practical, type algebra either doesn't exist or impractical because only PL theorists know about it, not Kenton.
> type algebra either doesn't exist or impractical because only PL theorists know about it, not Kenton.
Hi I'm Kenton. I, too, was enamored with advanced PL theory in college. Designed and implemented my own purely-functional programming language. Still wish someone would figure out a working version of dependent types for real-world use, mainly so we could prove array bounds-safety without runtime checks.
In two decades building real-world complex systems, though, I've found that getting PL theory right is rarely the highest-leverage way to address the real problems of software engineering.
> Still wish someone would figure out a working version of dependent types for real-world use, mainly so we could prove array bounds-safety without runtime checks.
Hi Kenton, I'm not sure what kind of PL theory you studied in college, but "array bounds-safety without runtime checks" don't require dependent types. They are being proven with several available SMT solvers as of right now, just ask LLVM folks with their "LLVM_ENABLE_Z3_SOLVER" compiler flag, the one that people build their real-world solutions on.
By the way, you don't have to say "real-world" in every comment to appeal to your google years as a token of "real-world vs the rest of you". "But my team at google wouldn't use it", or something along that line, right?
https://ats-lang.sourceforge.net/DOCUMENT/INT2PROGINATS/HTML...
Throwing a theorem-prover at the problem, unaided by developer hints, is not realistic in a large codebase. You need annotations that let you say "this array's size is is the same as that array" or "this integer is within the bounds of that array" -- that's dependent types.
> Throwing a theorem-prover at the problem, unaided by developer hints, is not realistic in a large codebase.
Please, Kenton, don't move your goalpost. Annotations, whether they come directly from a developer, or from IR meta, don't make a provided SAT-constraint suddenly a "dependent type" component of your type system, it needs a bit more than that. Let's not miss the "types" in "dependent types". You don't modify type systems of your languages to run SAT solvers in large codebases.
Truly, if you believe that annotations for the purpose of static bounds checking "is not realistic in a large codebase", I've got "google/pytype" and the entire Python community to sell to you.
I sniffed this. I am not familiar with protobufs, but I'm aware they are for efficiency on the wire. The fact that he only really talks about type systems, and not the before-vs-after effect on the wire, was disappointing and also made me doubt whether this was a good piece.
> Your guess is as good as mine for why an enum can’t be used as a map key.
I filed an issue requesting this and it was denied with an explanation:
https://github.com/protocolbuffers/protobuf/issues/7791#issu...
> Contrast this behavior against message types. While scalar fields are dumb, the behavior for message fields is outright insane.
The reason messages are initialized is that you can easily set a deep property path:
```
message SomeY { string example = 1; }
message SomeX { SomeY y = 1; }
```
later, in Java:
```
SomeX.Builder some = SomeX.newBuilder();
some.getYBuilder().setExample("hello"); // no NPE: the sub-builder is created on demand
```
in Kotlin this syntax makes even more sense:
```
val some = someX {
    y = someY { example = "hello" }
}
```
> It’s impossible to differentiate a field that was missing in a protobuffer from one that was assigned to the default value.
This is purportedly fixed in proto3 and latest SDK copies (IIRC)
I agree with the author that protobuf is bad and I ran into many of the issues mentioned. It's pretty much mandatory to add version fields to do backwards compatibility properly.
Recently, however, I had the displeasure of working with FlatBuffers. It's worse.
Out of interest why not make the version part of say the URL?
That one was used to implement save data in a game.
I'm more than a little curious what event caused such a strong objection to protobuffers. :D
I do tend to agree that they are bad. I also agree that people put a little too much credence in "came from Google." I can't bring myself to have this much anger towards it. Had to have been something that sparked this.
I'm just a frontend developer so most of my exposure is just as an API consumer and not someone working on the service side of things. That said:
A few years ago I moved to a large company where protobufs were the standard way APIs were defined. When I first started working with the generated TypeScript code, I was confused as to why almost all fields on generated object types were marked as optional. I assumed it was due to the way people were choosing to define the API at first, but then I learned this was an intentional design choice on the part of protobufs.
We ended up having to write our own code to parse the responses from the "helpfully" generated TypeScript client's responses. This meant we had to also handle rejecting nonsensical responses where an actually required field wasn't present, which is exactly the sort of thing I'd want generated clients to do. I would expect having to do some transformation myself, but not to that degree. The generated client was essentially useless to us, and the protocol's looseness offered no discernible benefit over any other API format I've used.
I imagine some of my other complaints could be solved with better codegen tools, but I think fundamentally the looseness of the type system is a fatal issue for me.
It used to be that there was no official TypeScript protobuf generator from Google and third-party generators sucked. Using protobufs from web browser or in nodejs was painful.
Couple years ago Connect released very good generator for TypeScript, we use in in production and it's great:
https://github.com/connectrpc/connect-es
Yeah, as soon as you have a moderately complex type the generated code is basically useless. Honestly, ~80% of my gripes about protocol buffers could be alleviated by just allowing me to mark a message field as required.
Proto2 let you do this and the "required" keyword was removed because of the problems it introduces when evolving the schema in a system with many users that you don't necessarily control. Let's say you want to add a new required field, if your system receives messages from clients some clients may be sending you old data without the field and now the parse step fails because it detects a missing field. If you ever want to remove a required field you have the opposite problem, there will components that have to have those fields present just to satisfy the parser even if they're only interested in some other fields.
Philosophically, checking that a field is required or not is data validation and doesn't have anything to do with serialization. You can't specify that an integer falls into a certain valid range or that a string has a valid number of characters or is the correct format (e.g. if it's supposed to be an email or a phone number). The application code needs to do that kind of validation anyway. If something really is required then that should be the application's responsibility to deal with it appropriately if it's missing.
The Cap'n Proto docs also describe why being able to declare required fields is a bad idea: https://capnproto.org/faq.html#how-do-i-make-a-field-require...
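A minimal sketch of that separation, with hypothetical field names (`email`, `age`): the parser accepts whatever subset of fields arrives, and a separate `validate` step enforces what this particular application requires.

```python
def parse(raw: dict) -> dict:
    # Serialization layer: accept any subset of known fields.
    # Never fails on a missing field.
    return {k: raw[k] for k in ("email", "age") if k in raw}

def validate(msg: dict) -> list[str]:
    # Application layer: presence and range checks live here,
    # where they can differ per endpoint and evolve freely.
    errors = []
    if "email" not in msg or "@" not in msg["email"]:
        errors.append("email missing or malformed")
    if "age" in msg and not (0 <= msg["age"] < 150):
        errors.append("age out of range")
    return errors

assert validate(parse({"email": "a@b.c", "age": 30})) == []
assert validate(parse({"age": 200})) == [
    "email missing or malformed", "age out of range"]
```

Middleware that only forwards messages can call `parse` and skip `validate` entirely, which is the evolution-friendly behavior the comment above describes.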
> Philosophically, checking that a field is required or not is data validation and doesn't have anything to do with serialization
But protocol buffers is not just a serialization format it is an interface definition language. And not being able to communicate that a field is required or not is very limiting. Sometimes things are required to process a message. If you need to add a new field but be able to process older versions of the message where the field wasn't required (or didn't exist) then you can just add it as optional.
I understand that in some situations you have very hard compatibility requirements and it makes sense to make everything optional and deal with it in application code, but adding a required attribute to fields doesn't stop you from doing that. You can still just make everything optional. You can even add a CI lint that prevents people from merging code with required fields. But making required fields illegal at the interface definition level just strikes me as killing a fly with a bazooka.
> Philosophically, checking that a field is required or not is data validation and doesn't have anything to do with serialization.
My issue is that people seem to like to use protobuf to describe the shape of APIs rather than just something to handle serialization. I think it's very bad at the describing API shapes.
I think it is somewhat of a natural failure of DRY taken to the extreme? People seem to want to get it so that they describe the API in a way that is then generated for clients and implementations.
It is amusing, in many ways. This is specifically part of what WSDL aspired to, but people were betrayed by the big companies not having a common ground for what shapes they would support in a description.
> Let's say you want to add a new required field, if your system receives messages from clients some clients may be sending you old data without the field and now the parse step fails because it detects a missing field.
A parser doesn't inherently have to fail (compatibility mode), nor lose the new field (a passthrough mode), nor allow divergence (strict mode). The fact that capnproto/parser authors don't realize that the same single protocol can operate in three different scenarios at the same time (strictly speaking: at boundaries vs. in middleware) should not lead you to think there are problems with required fields in protocols. This is one of the most bizarre kinds of FUD in the industry.
Hi, I'm the apparently-FUD-spreading Cap'n Proto author.
Sure! You could certainly imagine extending Protobuf or Cap'n Proto with a way to specify validation that only happens when you explicitly request it. You'd then have separate functions to parse vs. to validate a message, and then you can perform strict validation at the endpoints but skip it in middleware.
This is a perfectly valid feature idea which many people have entertained and even implemented successfully. But I tend to think it's not worth trying to have this in the schema language, because in order to support every kind of validation you might want, you end up needing a complete programming language. Plus, different components might have different requirements and therefore need different validation (e.g. middleware vs. endpoints). In the end I think it is better to write any validation functions in your actual programming language. But I can certainly see where people might disagree.
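To make the split concrete, here's a toy Python sketch of what "parse vs. validate as separate functions" could look like. The function names and field names (`user_id`, `amount`) are hypothetical, not real protobuf or Cap'n Proto APIs:

```python
# Toy sketch (not a real protobuf API): parsing is permissive and
# structural; validation is a separate, explicitly requested step.

def parse(fields):
    """'Parse' a message: accept whatever fields are present,
    preserving unknown ones untouched."""
    return dict(fields)

def validate_endpoint(msg):
    """Endpoint-specific validation, applied only where it's needed."""
    missing = [f for f in ("user_id", "amount") if f not in msg]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return msg

# Middleware just parses and forwards -- unknown and missing fields
# pass through untouched, so schema skew doesn't break the pipeline.
msg = parse([("user_id", 42), ("future_field", "kept")])

# Only the endpoint opts into strict validation.
try:
    validate_endpoint(msg)
except ValueError as e:
    print(e)  # 'amount' is missing here, so strict validation fails
```

The point is that strictness lives at the call site, not in the schema: the same message object flows through lenient middleware and is only rejected where a component actually depends on the field.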
It gets super frustrating to have to empty/null check fields everywhere you use them, especially for fields that are effectively required for the message to make sense.
A very common example I see is Vec3 (just x, y, z). In proto2 you should be checking for the presence of x,y,z every time you use them, and when you do that in math equations, the incessant existence checks completely obscure the math. Really, you want to validate the presence of these fields during the parse. But in practice, what I see is either just assuming the fields exist in code and crashing on null, or admitting that protos are too clunky to use, and immediately converting every proto into a mirror internal type. It really feels like there's a major design gap here.
Don't get me started on the moronic design of proto3, where every time you see Vec3(0,0,0) you get to wonder whether it's the right value or mistakenly unset.
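As a rough illustration of the complaint above, here's a hypothetical Python sketch of what "presence-checked" Vec3 math ends up looking like, with the messages modeled as plain dicts rather than real generated proto classes:

```python
# Hypothetical sketch: with explicit field presence, every coordinate
# access needs a has-check, which buries the one line of actual math.

def dot(a, b):
    # The "safe" version front-loads existence checks on every field...
    for v in (a, b):
        for axis in ("x", "y", "z"):
            if axis not in v:
                raise ValueError(f"Vec3 missing {axis}")
    # ...and only then does the math we actually care about.
    return a["x"] * b["x"] + a["y"] * b["y"] + a["z"] * b["z"]

print(dot({"x": 1, "y": 2, "z": 3}, {"x": 4, "y": 5, "z": 6}))  # 32
```

Validating once at the parse boundary (and using a plain internal Vec3 type afterward) would let every equation after that point be a single line.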
> It gets super frustrating to have to empty/null check fields everywhere you use them, especially for fields that are effectively required for the message to make sense.
That's why Protobuf and Cap'n Proto have default values. You should not bother checking for presence of fields that are always supposed to be there. If the sender forgot to set a field, then they get the default value. That's their problem.
> just assuming the fields exist in code and crashing on null
There shouldn't be any nulls you can crash on. If your protobuf implementation is returning null rather than a default value, it's a bad implementation, not just frustrating to use but arguably insecure. No implementation of mine ever worked that way, for sure.
Sadly, the default values are an even bigger source of bugs. We just caught another one at $work where a field was never being filled in, but the default values made it look fine. It caused hidden failures later on.
It's an incredibly frustrating "feature" to deal with, and causes lots of problems in proto3.
You can still verify presence explicitly if you want, with the `has` methods.
But if you don't check, it should return a default value rather than null. You don't want your server to crash on bad input.
You think you do but you really don't.
What happens if you mark a field as required and then you need to delete it in the future? You can't because if someone stored that proto somewhere and is no longer seeing the field, you just broke their code.
If you need to deserialize an old version then it's not a problem. The unknown field is just ignored during deserialization. The problem is adding a required field since some clients might be sending the old value during the rollout.
But in some situations you can be pretty confident that a field will be required always. And if you turn out to be wrong then it's not a huge deal. You add the new field as optional first (with all upgraded clients setting the value) and then once that is rolled out you make it required.
And if a field is in fact semantically required (like the API cannot process a request without the data in a field) then making it optional at the interface level doesn't really solve anything. The message will get deserialized but if the field is not set it's just an immediate error which doesn't seem much worse to me than a deserialization error.
1. Then it's not really required if it can be ignored.
2. This is the problem: software (and protos) can live for a long time. They might be used by other clients elsewhere that you don't control. What you thought was required may, 10 years down the line, not be anymore. What you "think" is not a huge deal then becomes a huge deal and can cause downtime.
3. You're mixing business logic and over-the-wire field requirements. If a message is required for an interface to function, you should be checking it anyway and returning the correct error. How does that change with proto supporting required?
> Then it's not really required if it can be ignored.
It can be required in v2 but not in v1 which was my point. If the client is running v2 while the server is still on v1 temporarily, then there is no problem. The server just ignores the new field until it is upgraded.
> This is the problem: software (and protos) can live for a long time. They might be used by other clients elsewhere that you don't control. What you thought was required may, 10 years down the line, not be anymore. What you "think" is not a huge deal then becomes a huge deal and can cause downtime.
Part of this is just that trying to create a format that is suitable both as an rpc wire serialization format and ALSO a format suitable for long term storage leads to something that is not great for either use case. But even taking that into account, RDBMS have been dealing with this problem for decades and every RDBMS lets you define fields as non-nullable.
> If a message is required for an interface to function, you should be checking it anyway and returning the correct error. How is that change with proto supporting require?
That's my point, you have to do that check in code which clutters the implementation with validation noise. That and you often can't use the wire message in your internal domain model since you now have to do that defensive null-check everywhere the object is used.
Aside from that, protocol buffers are an interface definition language so should be able to encode some of the validation logic at least (make invalid states unrepresentable and all that). If you are just looking at the proto IDL you have no way of knowing whether a field is really required or not because there is no way to specify that.
Maybe you don’t delete it then?
I mean, this is essentially the same lesson that database admins learn with nullable fields. Often it isn't the "deleting one is hard" so much as "adding one can be costly."
It isn't that you can't do it. But the code side of the equation is the cheap side.
To add to the sibling, I've seen this with Java enums a lot. People will add it so that the value is consumed using the enum as fast as they can. This works well as long as the value is not retrieved from data. As soon as you do that, you lose the ability to add to the possible values in a rolling release way. It can be very frustrating to know that we can't push a new producer of a value before we first change all consumers. Even if all consumers already use switch statements with default clauses to exhaustively cover behavior.
But this is something you should be able to handle on a case-by-case basis. If you have a type which is stored durably as protobuf then adding required fields is much harder. But if you are just dealing with transient rpc messages then it can be done relatively easily in a two step process. First you add the field as optional and then once all producers are upgraded (and setting the new field), make it required. It's annoying for sure but still seems better than having everything optional always and needing to deal with that in application code everywhere.
Largely true. If you are at Google scale, odds are you have mixed fleets deployed, such that it is a bit of an involved process. But it is well defined and doable. I think a lot of us would rather not do a dance we don't have to do?
Sure, you just have to balance that against the cost of a poorly specified API interface. The errors because clients aren't clear on what is really required or not, what they should consider an error if it is not defined, etc. And of course all the boilerplate code that you have to write to convert the interface model to an internal domain model you can actually use inside your code.
I've used them almost daily for 15 years. They are way down the list of things I'd want improved. It has been interesting to see the protobuf killers die out every few years, though.
I feel like I could have written an article like this at various points. Probably while spending two hours trying to figure out a way to represent some protobuf type in a sane way internally.
As a developer I always see "came from Google" as a yellow flag.
Too often I find something mildly interesting, but then realize that in order for me to try to use it I need to set up a personal mirror of half of Google's tech stack to even get it to start.
He says that in the article; he had to work on a "compiler" project that was much harder than it should have been because of protobuf's design choices.
Yeah, I saw that. I took that as something that happened in the past, though. Certainly colored a lot of the thinking, but feels like something more immediate had to have happened. :D
Among other things, I don't like that they won't support nullable getters/setters:
https://protobuf.dev/design-decisions/nullable-getters-sette...
I like the problems that Protobuf solves, just not the way it solves them.
Protobuf as a language feels clunky. The “type before identifier” syntax looks ancient and Java-esque.
The tools are clunky too. protoc is full of gotchas, and for something as simple as validation, you need to add a zillion plugins and memorize their invocation flags.
From tooling to workflow to generated code, it’s full of Google-isms and can be awkward to use at times.
That said, the serialization format is solid, and the backward-compatibility paradigms are genuinely useful. Buf adds some niceties to the tooling and makes it more tolerable. There’s nothing else that solves all the problems Protobuf solves.
almost the entire purpose of anything like protocol buffers is to provide a safe mechanism for backwards-compatible forward changes -- "no one uses that stuff"?? what a weird and broken take
The "no enums as map keys" thing enrages me constantly. Every protobuf project I've ever worked with either has stringly-typed maps all over the place because of this, or has to write its own function to parse Map<String, V> into Map<K, V> from the enums and then remember to call that right after deserialization, completely defeating the purpose of autogenerated types and deserializers. Why does Google put up with this? Surely it's the same inside their codebase.
Maps are not a good fit for a wire protocol in my experience. Different languages often have different quirks around them, and they're non-trivial to represent in a type-safe way.
If a Map is truly necessary I find it better to just send a repeated Message { Key K, Value V } and then convert that to a map in the receiving end.
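A rough sketch of that receiving-end conversion in Python, using plain tuples for the `repeated Message { Key K, Value V }` entries (the `Color` enum and field shapes here are made up for illustration):

```python
from enum import Enum

class Color(Enum):  # hypothetical enum key type
    RED = 1
    GREEN = 2

# The wire carries `repeated Message { key, value }` as a plain list of
# pairs; the receiver folds it into a typed map, recovering enum keys.
def entries_to_map(entries):
    result = {}
    for key_num, value in entries:
        result[Color(key_num)] = value  # raises ValueError on unknown keys
    return result

m = entries_to_map([(1, "stop"), (2, "go")])
print(m[Color.RED])  # stop
```

This also forces you to decide explicitly what to do with unknown enum values or duplicate keys on receipt, instead of inheriting whatever quirks your language's generated map type has.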
I believe that the reason for this limitation is that not all languages can represent open enums cleanly to gracefully handle unknown enums upon schema skew.
And v1 and v2 protos didn't even have maps.
Also, why use a string as a key and not an int?
proto2 absolutely supported the map type.
It could be; it looks like there was some version misalignment:
> The maps syntax is only supported starting from v3.0.0. The "proto2" in the doc is referring to the syntax version, not protobuf release version. v3.0.0 supports both proto2 syntax and proto3 syntax while v2.6.1 only supports proto2 syntax. For all users, it's recommended to use v3.0.0-beta-1 instead of v2.6.1.

https://stackoverflow.com/questions/50241452/using-maps-in-p...
Protobuf's main design goal is to make space-optimized binary tag-length-value encoding easy. The mentality is kinda like "who cares what the API looks like as long as it can support anything you want to do with TLV encoding and has great performance." Things like oneofs and maps are best understood as slightly different ways of creating TLV fields in a message, rather than pieces of a comprehensive modern type system. The provided types are simply the necessary and sufficient elements to model any fuller type system using TLV.
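For readers unfamiliar with the encoding, here's a minimal stdlib-Python sketch of protobuf-style TLV: each field starts with a varint tag packing the field number and wire type, and length-delimited payloads carry an explicit length (the values below mirror the classic encoding-docs example of field 1 = 150):

```python
# Minimal sketch of protobuf-style tag-length-value encoding.
# A field begins with a varint tag: (field_number << 3) | wire_type.
# Wire type 0 is a varint scalar; wire type 2 is length-delimited bytes.

def varint(n):
    # Base-128 little-endian: 7 bits per byte, MSB set on continuation.
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        out.append(bits | (0x80 if n else 0))
        if not n:
            return bytes(out)

def field(number, wire_type, payload):
    return varint((number << 3) | wire_type) + payload

# field 1 = varint 150, field 2 = bytes "hi"
msg = field(1, 0, varint(150)) + field(2, 2, varint(2) + b"hi")
print(msg.hex())  # "08960112026869"
```

Every field paying for its own tag (and, for strings/messages, a length prefix) is the verbosity the TLV critics below are pointing at; it's also exactly what makes skipping unknown fields trivial.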
Yes but the point is that nobody outside of super big tech has a need to optimize a few bytes here and there at the expense of atrocious devx.
I've written several screeds in the comments here on HN about protobufs being terrible over the past few years. Basically the creators of PB ignored ASN.1 and built a bad version of mid-1980s ASN.1 and DER.
Tag-length-value (TLV) encodings are just overly verbose for no good reason. They are _NOT_ "self-describing", and one does not need everything tagged to support extensibility. Even where one does need tags, tag assignments can be fully automatic and need not be exposed to the module designer. Anyone with a modicum of time spent researching how ASN.1 handles extensibility with non-TLV encoding rules knows these things. The entire arc of ASN.1's evolution over two plus decades was all about extensibility and non-TLV encoding rules!
And yes, ASN.1 started with the same premise as PB, but 40 years ago. Thus it's terribly egregious that PB's designers did not learn any lessons at all from ASN.1!
Near as I can tell, PB's designers thought they knew about encodings, but didn't, and they refused to look at ASN.1 and such because of the lack of tooling for ASN.1 -- but of course there was even less tooling for PB, since it didn't exist yet.
It's all exasperating.
Avro (and others) has its own set of problems as well.
For messaging, JSON, used in the same way and with the same versioning practices as we have established for evolving schemas in REST APIs, has never failed me.
It seems to me that all these rigid type systems for remote procedure calls introduce more problems than they really solve and bring unnecessary complexity.
Sure, there are tradeoffs with flexible JSON, but its simplicity beats the potential advantages we get from systems like Avro or Protobuf.
> This insane list of restrictions is the result of unprincipled design choices and bolting on features after the fact
I'm not very upset that protobuf evolved to be slightly more ergonomic. Bolting on features after you build the prototype is how you improve things.
Unfortunately, they really did design themselves into a corner (not unlike python 2). Again, I can't be too upset. They didn't have the benefit of hindsight or other high performance libraries that we have today.
Even the low level implementation of protobuffers is pretty uninspiring.
It adds a lot of space overhead, especially for structs only used once, yet it's not self-describing either.
It doesn't solve many of the problems related to schema changes either.
Quite frankly, too many people buy into it because it came from Google and it's supposed to be some sort of divinely inspired thing.
JSON, ASN.1, and even rigid C structs start to look a lot better.
I've created several IDL compilers addressing all issues of protobuf and others.
This particular one provides strongest backward compatibility guarantees with automatic conversion derivation where possible: https://github.com/7mind/baboon
Protobuf is dated, it's not that hard to make better things.
We thought for a long time about using protobufs in our product [1] and in the end we went with JSON-RPC 2.0 over BLE, base64 for bigger chunks. Yeah, you still need to pass sample format and decode manually. The overhead is fine tho, debugging is way easier (also pulling in all of protobuf just wasn't fun).
[1] aidlab.com/aidlab-2
Type system fans are so irritating. The author doesn't engage with the point of protocol buffers, which is that they are thin adapters between the union of things that common languages can represent with their type systems and a reasonably efficient marshaling scheme that can be compact on the wire.
I really liked the typography/layout of the page, reminds me of gwern.net. But people will probably complain about serif fonts regardless
But why do you need serialization? Because the data structure on disk is not the same as in memory. Arthur Whitney's k/q/kdb+ solved this problem by making them the same. An array has the same format in memory and on disk, so there is no serialization, and even better, you can mmap files into memory, so you don't need cache!
He also removed the capability to define a structure, forcing you to use a dictionary (structure) of arrays instead of an array of structures.
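The "same format in memory and on disk" idea can be sketched with the Python stdlib alone: write a native-layout array to a file, then mmap it back with no decode step. (This is an illustration of the concept, not how kdb+ is implemented, and it assumes reader and writer share endianness and float width.)

```python
# Sketch of the k/kdb+ idea in Python terms: persist an array in its
# in-memory layout, then mmap it back with zero parsing.
import array
import mmap
import os
import tempfile

vals = array.array("d", [1.0, 2.0, 3.0])  # native float64 layout

path = os.path.join(tempfile.mkdtemp(), "col.bin")
with open(path, "wb") as f:
    f.write(vals.tobytes())  # "serialization" is a plain memory copy

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # No decode step: reinterpret the mapped bytes as the same array type.
    loaded = array.array("d")
    loaded.frombytes(mm)

print(loaded[1])  # 2.0
```

The caveat is the one raised in the replies below: this only works when every reader agrees on the in-memory representation, which is exactly the assumption heterogeneous systems can't make.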
Forget on-disk. Different CPUs represent basic data types with different in-memory representations (endianness). Furthermore, different CPUs have different capabilities with respect to how data must be aligned in memory in order to read or write it (aligned/unaligned access); at least historically, unaligned access could fault your process. Then there's the problem, which you allude to, that different programming languages use different data layouts (often a non-standardised layout). If you want communication within a system comprising heterogeneous CPUs and/or languages, you need to standardise a wire format and/or provide a translation layer, aka serialisation.
> But why do you need serialization? Because the data structure on disk is not the same as in memory.
Not always - in browser applications for example, there is no way to directly access the disk, nevermind mmap().
The author is right, but it could have been worse too. At least they were not using JSON for serialization.
lols, the weird protobuf initialization semantics have caused so many OMGs. Even on my team they led to various hard-to-debug bugs.
It's a lesson most people learn the hard way after using PBs for a few months.
I just wish protobuf had proper delta compression out of the box
If you mostly write software with Go you'll likely enjoy working with protocol buffers. If you use the Python or Ruby wrappers you'd wish you had picked another tech.
The generated types in Go are horrible to work with. You can't store instances of them anywhere, or pass them by value, because they contain a bunch of state and pointers (including a [0]sync.Mutex just to explicitly prohibit copying). So you have to pass around pointers at all times, making ownership and lifetime much more complicated than they need to be.

For [place of work], where we use protobuf, I ended up making a plugin to generate plain structs that don't do any of that nonsense (essentially automating Option 1 in the article), with converters between the two versions.

I actually really strongly prefer 0 being identical to unset. If you have an unset state, then you have to check whether the field is unset every time you use it. Using 0 allows you to make all of your code "just work" when you pass 0 to it, so you don't need to check at all.
It's like how in go most structs don't have a constructor, they just use the 0 value.
Also oneof is made that way so that it is backwards compatible to add a new field and make it a oneof with an existing field. Not everything needs to be pure functional programming.
(2018)
Indeed. Discussed before at https://news.ycombinator.com/item?id=35281561
https://hn.algolia.com/?q=Protobuffers+Are+Wrong
Protobuffers suck as a core data model. My take? Use them as a serialization and interchange format, nothing more.
> This puts us in the uncomfortable position of needing to choose between one of three bad alternatives:
I don’t think there is a good system out there that works for both serialization and data models. I’d say it’s a mostly unsolved problem. I think I am happy with protobufs. I know that I have to fight against them contaminating the codebase—basically, your code that uses protobufs is code that directly communicates over raw RPC or directly serializes data to/from storage, and protobufs shouldn’t escape into higher-level code.
But, and this is a big but, you want that anyway. You probably WANT your serialization to be able to evolve independently of your application logic, and the easy way to do that is to use different types for each. You write application logic using types that have all sorts of validation (in the "parse, don't validate" sense) and your serialization layer uses looser validation. This looser validation is nice because you often end up with e.g. buggy code getting shipped that writes invalid data, and if you have a loose serialization layer that just preseves structure (like proto or json), you at least have a good way to munge it into the right shape.
Evolving serialized types has been such a massive pain at a lot of workplaces and the ad-hoc systems I've seen often get pulled into adopting some of the same design choices as protos, like "optional fields everywhere" and "unknown fields are ok". Partly it may be because a lot of ex-Google employees are inevitably hanging around on your team, but partly because some of those design tradeoffs (not ALL of them, just some of them) are really useful long-term, and if you stick around, you may come to the same conclusion.
In the end I mostly want something that's a little more efficient and a little more typed than JSON, and protos fit the bill. I can put my full efforts into safety and the "correct" representation at a different layer, and yes, people will fuck it up and contaminate the code base with protos, but I can fix that or live with it.
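A small sketch of the boundary the comment above describes, in the "parse, don't validate" sense: the wire layer stays a loose dict, and a strict domain type enforces invariants at construction. All names here (`Order`, `order_from_wire`) are hypothetical:

```python
# Sketch: keep wire types out of application logic. A permissive
# wire-level dict is converted at the boundary into a strict domain
# type; everything past this point can trust the invariants.
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:  # domain type: invariants enforced at construction
    order_id: int
    quantity: int

    def __post_init__(self):
        if self.quantity <= 0:
            raise ValueError("quantity must be positive")

def order_from_wire(msg: dict) -> Order:
    # The wire layer tolerates extra/unknown fields and loose types;
    # the boundary conversion is where strictness lives.
    return Order(order_id=int(msg["order_id"]), quantity=int(msg["quantity"]))

o = order_from_wire({"order_id": "7", "quantity": 3, "future_field": True})
print(o)  # Order(order_id=7, quantity=3)
```

Because the two layers evolve independently, buggy or stale data on the wire can still be read structurally and then repaired or rejected at this one conversion point, rather than crashing deep inside application code.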
> My take? Use them as a serialization and interchange format, nothing more.
Isn't that exactly what they're intended for? I'm confused how anyone would even think to use them any other way.
Not specific to protobufs, but a lot of people/projects, especially if doing MVC, push the models in the API layer all the way down the stack and they become the domain, instead of having a loose coupling between the domain and the serialization format. In the old days we used DTOs for separation, but they went out of fashion.
Agreed, it's interesting to see so many people complaining when they are just misunderstanding / misusing protobufs entirely. Sure the implementation could be better but it's not a huge problem.
Like the author said, their usage in practice often creeps outside that.
Previous discussions:
* https://news.ycombinator.com/item?id=18188519
* https://hn.algolia.com/?q=%22Protobuffers+Are+Wrong%22
I guess I'll, once again, copy/paste the comment I made when this was first posted: https://news.ycombinator.com/item?id=18190005
--------
Hello. I didn't invent Protocol Buffers, but I did write version 2 and was responsible for open sourcing it. I believe I am the author of the "manifesto" entitled "required considered harmful" mentioned in the footnote. Note that I mostly haven't touched Protobufs since I left Google in early 2013, but I have created Cap'n Proto since then, which I imagine this guy would criticize in similar ways.
This article appears to be written by a programming language design theorist who, unfortunately, does not understand (or, perhaps, does not value) practical software engineering. Type theory is a lot of fun to think about, but being simple and elegant from a type theory perspective does not necessarily translate to real value in real systems. Protobuf has undoubtedly, empirically proven its real value in real systems, despite its admittedly large number of warts.
The main thing that the author of this article does not seem to understand -- and, indeed, many PL theorists seem to miss -- is that the main challenge in real-world software engineering is not writing code but changing code once it is written and deployed. In general, type systems can be both helpful and harmful when it comes to changing code -- type systems are invaluable for detecting problems introduced by a change, but an overly-rigid type system can be a hindrance if it means common types of changes are difficult to make.
This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.
I don't have time to address all the author's points, so let me choose a few that I think are representative of the misunderstanding.
> Make all fields in a message required. This makes messages product types.
> Promote oneof fields to instead be standalone data types. These are coproduct types.
This seems to miss the point of optional fields. Optional fields are not primarily about nullability but about compatibility. Protobuf's single most important feature is the ability to add new fields over time while maintaining compatibility. This has proven -- in real practice, not in theory -- to be an extremely powerful way to allow protocol evolution. It allows developers to build new features with minimal work.
Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto. In practice, you want to declare all fields optional to give yourself maximum flexibility for change.
The author dismisses this later on:
> What protobuffers are is permissive. They manage to not shit the bed when receiving messages from the past or from the future because they make absolutely no promises about what your data will look like. Everything is optional! But if you need it anyway, protobuffers will happily cook up and serve you something that typechecks, regardless of whether or not it's meaningful.
In real world practice, the permissiveness of Protocol Buffers has proven to be a powerful way to allow for protocols to change over time.
Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is. Certainly the usual proposals I see seem like steps backwards. I'd love to be proven wrong, but not on the basis of perceived elegance and simplicity, but rather in real-world use.
> oneof fields can't be repeated.
(background: A "oneof" is essentially a tagged union -- a "sum type" for type theorists. A "repeated field" is an array.)
Two things:
1. It's that way because the "oneof" pattern long-predates the "oneof" language construct. A "oneof" is actually syntax sugar for a bunch of "optional" fields where exactly one is expected to be filled in. Lots of protocols used this pattern before I added "oneof" to the language, and I wanted those protocols to be able to upgrade to the new construct without breaking compatibility.
You might argue that this is a side-effect of a system evolving over time rather than being designed, and you'd be right. However, there is no such thing as a successful system which was designed perfectly upfront. All successful systems become successful by evolving, and thus you will always see this kind of wart in anything that works well. You should want a system that thinks about its existing users when creating new features, because once you adopt it, you'll be an existing user.
2. You actually do not want a oneof field to be repeated!
Here's the problem: Say you have your repeated "oneof" representing an array of values where each value can be one of 10 different types. For a concrete example, let's say you're writing a parser and they represent tokens (number, identifier, string, operator, etc.).
Now, at some point later on, you realize there's some additional piece of data you want to attach to every element. In our example, it could be that you now want to record the original source location (line and column number) where the token appeared.
How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.
In every single case where you might want a repeated oneof, you always want to wrap it in a message (product type), and then repeat that. That's exactly what you can do with the existing design.
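The evolution story can be shown with plain data. Here each array element is a wrapper "message" (a dict) holding the oneof, so adding source positions later is just a new optional field that old consumers ignore; the token shapes are invented for illustration:

```python
# Sketch: model each array element as a message containing the oneof,
# rather than as the bare oneof itself.

# v1: each token is a wrapper message holding only the oneof payload.
tokens_v1 = [{"kind": ("number", 42)}, {"kind": ("ident", "x")}]

# v2: the wrapper grows a new optional field; v1 readers ignore it,
# with no parallel array needed.
tokens_v2 = [{"kind": ("number", 42), "pos": (1, 5)},
             {"kind": ("ident", "x"), "pos": (1, 8)}]

def kinds(tokens):
    # A v1-era consumer: only reads the oneof, works on both versions.
    return [t["kind"][0] for t in tokens]

print(kinds(tokens_v1) == kinds(tokens_v2))  # True
```

Had v1 been a bare repeated oneof (just the `("number", 42)` tuples), there would be no place to attach `pos` without a parallel array, which is exactly the trap described above.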
The author's complaints about several other features have similar stories.
> One possible argument here is that protobuffers will hold onto any information present in a message that they don't understand. In principle this means that it's nondestructive to route a message through an intermediary that doesn't understand this version of its schema. Surely that's a win, isn't it?
> Granted, on paper it's a cool feature. But I've never once seen an application that will actually preserve that property.
OK, well, I've worked on lots of systems -- across three different companies -- where this feature is essential.
To me, it seems that version-change-safety and the usefulness of the generated code constitute a design tradeoff: If you mark a field as required, then the generated data structures can skip using Option/pointers, and this very common form of validation can be generated for free. If you disallow marking a field as required, then all fields must be checked for existence, even ones required for a system to function, which is quite a burden and will lead to developers having to write their own types anyway as a place to put their validated data into. If data is required to be present for an app to function, then why can't I be given the tools to express this, and benefit from the constraints applied to the data model?
Most of the time when I would like to use a schema-driven, efficient data format and code generation tool, the data contract doesn't change frequently. And when it does, assuming it's a backwards-incompatible change, I think I would be happy to generate a MyDataV2 message along with GetMyDataV2 method, allow existing clients to keep using the original version, and allow new or existing clients to use the newly supported structures at their leisure. Meanwhile, everyone that shares my schema can have much more idiomatic generated code, and in the most common cases won't have to write their own data types or be stuck with a bunch of `if data.x != null {` statements.
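A sketch of that versioning approach in proto/gRPC terms (service and message names are invented for the example):

```proto
service MyService {
  // Existing clients keep calling this, unchanged.
  rpc GetMyData(GetMyDataRequest) returns (MyData);

  // New or migrated clients opt in to the incompatible shape
  // at their leisure.
  rpc GetMyDataV2(GetMyDataV2Request) returns (MyDataV2);
}
```

Both methods can coexist on the server for as long as the old clients are around, so nothing has to be deployed in lockstep.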
Protobufs are an amazing tool, but I think there is a need for a simpler tool which supports a restricted set of use cases cleanly and allows for wider expression of data models.
> I guess I'll, once again, copy/paste the comment I made when this was first posted
I had missed it those other times, and it's super interesting. So thank you for copy/pasting it once again :-).
> Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time
how often? as practiced by who, and where?
> 2. You actually do not want a oneof field to be repeated!
> How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.
Nice, "explain to me how you're going to implement a backward-compatible SUM in the spec-parser that doesn't have the notions needed. Ha! You can't! Told you so!"
> But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.
Not really. `oneof token` is isomorphic to `oneof (token unit)`, and going from the former to the latter doesn't require any change to the binary encoding at all, if the encoding is optimal. Going from `oneof (token unit)` to `oneof (token { linepos })`, depending on the binary encoding format you design, doesn't require changes to the parser's runtime either, as long as the parser takes into account that `unit` is isomorphic to the zero-arity product `{}`. And since both `{}` and `{ linepos }` can be represented with fixed positional addressing, you get your values back in a backward-compatible way, under one specific condition: the parser library's API provides `repeated (oneof <A>)` as a non-materialised stream of values <A>, so that the exact interpretation of <A> happens at the user's call site, according to the stated protocol spec. If the spec says `<A> = token`, then `list (repeated (oneof (token { linepos })))` is isomorphic to `list (repeated (oneof token))` in the deployed version of the protocol that knows nothing about line positions, so my endpoints can send you either form.
> how often? as practiced by who, and where?
This was my experience in Google Search infrastructure circa 2005-2010. This was a system with dozens of teams and hundreds of developers all pushing their data through a common message bus.
It happened all the damned time and caused multiple real outages (from overzealous validation), along with a lot of tech debt involving having to initialize fields with dummy data because they weren't used anymore but still required.
Reports from other large teams at google, e.g. gmail, indicated they had the same problems.
> Nice, "explain to me how you're going to implement a backward-compatible SUM in the spec-parser that doesn't have the notions needed. Ha! You can't! Told you so!"
Sure sure, we could expand the type system to support some way of adding a new tag to every element of the repeated oneof, implement it, teach everyone how that works, etc.
Or we could just tell people to wrap the thing in a `message`. It works fine already, and everyone already understands how to do it. No new cognitive load is created, and no time is wasted chasing theoretical purity that provides no actual real-world benefit.
> This was my experience in Google Search infrastructure circa 2005-2010 [...]
> Reports from other large teams at google
> teach everyone how that works, etc.
> Or we could just tell people to wrap the thing in a `message`
It really sounds like a self-inflicted internal Google issue. Can you address the part where I mention the isomorphism of (oneof token) and (oneof (token {})), and clarify what exactly you think you'd have to teach other engineers to do, if your protocol's encoders and decoders took this property into account?
you are absolutely right!
what alternative do we have? sending json and base64 strings
Or XML. Maybe C structures as stored in memory?
I see your age
yeah, c structures sounds about right
Well, worse is better.
Should have a (2018) callout
The ignorance on display in this post is truly breathtaking. I’m impressed something can be this wrong.
I don't recall exactly (I've shelved my mapping projects for the moment), but isn't OpenStreetMap's core data distribution format based on protobuffers?
persuasive or pervasive?
"you're the worst serialization/config format I've ever heard of"
i used protobuffers a lot at $previous_job and i agree with the entire article. i feel the author’s pain in my bones. protobuffers are so awful i can’t imagine google associating itself with such an amateur, ad hoc, ill-defined, user hostile, time wasting piece of shit.
the fact that protobuffers wasn’t immediately relegated to the dustbin shows just how low the bar is for serialization formats.
What do you use?
Google claimed Protobuffers are the solution but Google's planetary engineers clearly have ZERO respect for the mixed-endian remote systems keeping the galactic federation afloat with their cheap CORBA knockoff. It's like, sure which Plan 9 mainframe do you want to connect to like we all live on planet Google. Like hello???
I too was using PBs a lot, as they are quite popular in the Go world. But I came to the conclusion that they and gRPC are more trouble than they are worth. I switched to JSON, HTTP "REST", and websockets if I need streaming, and am as happy as I could be.
I get the API interoperability between various languages when one wants to build a client with a strict schema, but in reality this is more theory than real life.
In essence, anyone who subscribes to YAGNI understands that PB and gRPC are a big no-no.
PS: if you need a binary format, just use CBOR or msgpack. Otherwise, the beauty of JSON is that it is human-readable and easily parseable, so even if you lack access to the original schema, you can still EASILY process the data and UNDERSTAND it as well.
I am very partial to msgpack. It has routinely met or exceeded my performance needs and doesn’t depend on weird code generation, and is super easy to set up.
Something that I don’t see talked about much with msgpack, but I think is cool: if your project doesn’t span across multiple languages, you can actually embed those language semantics into your encoder with extensions.
For example, in Clojure's port of msgpack, you can use Clojure keywords out of the box and they will parse correctly without issue. You can also have it work with sets.
Obviously you could define some kind of mapping yourself and use any binary format to do this, ultimately the [en|de]coder is just using regular msgpack constructs behind the scenes, but i have always had to do that manually while with msgpack it seems like the libraries readily embrace it.
Indeed, the support is widespread across languages. OTOH, using compression, like basic gzip, for HTTP responses turns the text format into a binary format, and with HTTP/2 or HTTP/3 there is no overhead like there would be with HTTP/1. So in the end, the binary aspect of these encoders might be obsolete for this use case, as long as one uses compression.
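The compression point is easy to check with the stdlib (the exact ratio will vary with the payload, of course):

```python
import gzip
import json

# A repetitive JSON payload, as typical API responses tend to be.
payload = json.dumps(
    [{"id": i, "name": f"user-{i}"} for i in range(1000)]
).encode()

compressed = gzip.compress(payload)

# The bytes on the wire are binary and much smaller than the JSON text;
# the human-readable form only exists at the endpoints.
assert len(compressed) < len(payload)
assert gzip.decompress(compressed) == payload
```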
If you opt for non-human-readable wire-formats it better be because of very important reasons. Something about measuring performance and operational costs.
If you need to exchange data with other systems that you don't control, a simple format like JSON is vastly superior. You are restricted to handing over tree-like structures. That is a good thing as your consumers will have no problems reading tree-like structures.
It also makes it very simple for each consumer/producer to coerce this data into structs or objects as they please and as makes sense for their usage of the data.
You have to validate the data anyhow (you do validate data received from the outside world, don't you?), so throwing in coercion is honestly the smallest of your problems.
You only need to touch your data coercion if someone decides to send you data in a different shape. For tree-like structures it is simple to add new things and stay backwards compatible.
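A minimal sketch of that validate-then-coerce step using only the stdlib (the field names are made up):

```python
import json
from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str

def parse_user(raw: str) -> User:
    """Validate data from the outside world, then coerce it into the
    struct this consumer actually cares about. Unknown keys in the
    tree are simply ignored, which is what keeps us compatible with
    senders that add new fields later."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(obj.get("id"), int) or not isinstance(obj.get("name"), str):
        raise ValueError("missing or mistyped required fields")
    return User(id=obj["id"], name=obj["name"])
```

A sender adding a key doesn't break anything: `parse_user('{"id": 1, "name": "ada", "new_field": true}')` still yields the same `User`.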
Adding a spec on top of your data shapes that can potentially help consumers generate client code is a cherry on top of it and an orthogonal concern.
Making as few assumptions as possible about how your consumers deal with your data is a Good Thing(tm) that enabled such useful (still?) things as the WWW.