Distributed systems programming has stalled

287 points by shadaj 5 months ago

bsnnkv 5 months ago

Last month I switched from a role working on a distributed system (FAANG) to a role working on embedded software which runs on cards in data center racks.

I was in my last role for a year, and 90%+ of my time was spent investigating things that went "missing" at one of many failure points between one of the many distributed components.

I wrote less than 200 lines of code that year and I experienced the highest level of burnout in my professional career.

The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it. Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".

So far the culture in my new embedded (Rust, fwiw) position is the complete opposite. If you're burnt out working on distributed systems and you care about some of the same things that I do, it's worth giving embedded software dev a shot.

alabastervlog 5 months ago

I've found the rush to distributed computing when it's not strictly necessary kinda baffling. The costs in complexity are extreme. I can't imagine the median company doing this stuff is actually getting either better uptime or performance out of it—sure, it maybe recovers better if something breaks, maybe if you did everything right and regularly test that stuff (approximately nobody does though), but there's also so very much more crap that can break in the first place.
Plus: far worse performance ("but it scales smoothly" OK but your max probable scale, which I'll admit does seem high on paper if you've not done much of this stuff before, can fit on one mid-size server, you've just forgotten how powerful computers are because you've been in cloud-land too long...) and crazy-high costs for related hardware(-equivalents), resources, and services.
All because we're afraid to shell into an actual server and tail a log, I guess? I don't know what else it could be aside from some allergy to doing things the "old way"? I dunno man, seems way simpler and less likely to waste my whole day trying to figure out why, in fact, the logs I need weren't fucking collected in the first place, or got buried some damn corner of our Cloud I'll never find without writing a 20-line "log query" in some awful language I never use for anything else, in some shitty web dashboard.
Fewer, or cheaper, personnel? I've never seen cloud transitions do anything but the opposite.
It's like the whole industry went collectively insane at the same time.
[EDIT] Oh, and I forgot, for everything you gain in cloud capabilities it seems like you lose two or three things that are feasible when you're running your own servers. Simple shit that's just "add two lines to the nginx config and do an apt-install" becomes three sprints of custom work or whatever, or just doesn't happen because it'd be too expensive. I don't get why someone would give that stuff up unless they really, really had to.
[EDIT EDIT] I get that this rant is more about "the cloud" than distributed systems per se, but trying to build "cloud native" is the way that most orgs accidentally end up dealing with distributed systems in a much bigger way than they have to.
- whstl 5 months ago
  
  I share your opinions, and really enjoyed your rant.
  But it's funny. The transition to distributed/cloud feels like the rush to OOP early in my career. All of a sudden there were certain developers who would claim it was impossible to ship features in procedural codebases, and then proceed to make a fucking mess out of everything using classes, completely misunderstanding what they were selling.
  It is also not unlike what Web-MVC felt like in the mid-2000s. Suddenly everything that came before was considered complete trash by some people that started appearing around me. Then the same people disparaging the old ways started building super rigid CRUD apps with mountains of boilerplate.
  (Probably the only thing I was immediately on board with was the transition from desktop to web, because it actually solved more problems than it created. IMO, IME and YYMV)
  Later we also had React and Docker.
  I'm not salty or anything: I also tried and became proficient in all of those things. Including microservices and the cloud. But it was more out of market pressure than out of personal preference. And like you said, it has a place when it's strictly necessary.
  But now I finally do mostly procedural programming, in Go, in single servers.
  
  sakesun 5 months ago
  
  Your comment inspire me to brush up my Delphi skill.
- dekhn 5 months ago
  
  I am always happy when I can take a system that is based on distributed computing, and convert it to a stateless single machine job that runs just as quickly but does not have the complexity associated with distributed computing.
  Reccently I was going to do a fairly big download of a dataset (45T) and when I first looked at it, figured I could shard the file list and run a bunch of parallel loaders on our cluster.
  Instead, I made a VM with 120TB storage (using AWS with FSX) and ran a single instance of git clone for several days (unattended; just periodically checking in to make sure that git was still running). The storage was more than 2X the dataset size because git LFS requires 2X disk space. A single multithreaded git process was able to download at 350MB/sec and it finished at the predicted time (about 3 days). Then I used 'aws sync' to copy the data back to s3, writing at over 1GB/sec. When I copied the data between two buckets, the rate was 3GB/sec.
  That said, there are things we simply can't do without distributed computing because there are strong limits on how many CPUs and local storage can be connected to a single memory address space.
  
  achierius 5 months ago
  
  My wheelhouse is lower on the stack, so I'm curious as to what you mean by "stateless single machine job" -- do you just mean that it runs from start to end, without options for suspension/migration/resumption/etc.?
  
  dekhn 5 months ago
  
  it's a pretty generic term but in my mind I was thinking of a job that ran on a machine with remote attached storage (EBS, S3, etc); the state I meant was local storage.
- jimbokun 5 months ago
  
  Distributed or not is a very binary function. If you can run in one large server, great, just write everything in non-distributed fashion.
  But once you need that second server, everything about your application needs to work in distributed fashion.
  
  th0ma5 5 months ago
  
  I wish I could upvote you again. The complexity balloons when you try to adapt something that wasn't distributed, and often things can be way simpler and more robust if you start with a distributed concept.
  
  CogitoCogito 5 months ago
  
  I couldn't disagree more. My principle is to write systems extremely simply and then distribute portions of it as it becomes necessary. Almost always it never becomes necessary and the rare cases it does, it is entirely straight forward to do so unless you have an over-complicated design. I don't think I've ever seen it done well when done in the opposite direction. It's always cost more in time and effort and resulted in something worse.
  
  th0ma5 4 months ago
  
  Tons of vendors offer cloud first, distributed deployments. Erlang is distributed by default. Spark is distributed by default. Most databases are distributed by default.
  
  nonc3dr3w 4 months ago
  
  [dead]
  
  nonc3dr3w 4 months ago
  
  [dead]
- FpUser 5 months ago
  
  This is part of what I do for living. C++ backend software running on real hardware which is currently insanely powerful. There is of course spare standby in case things go South. Works like a charm and I have yet to have a client that scratched it anywhere close to overloading server.
  I understand that it can not deal with FAANG scale problems, but those are relevant only to a small subset of businesses.
  
  intelVISA 5 months ago
  
  The highly profitable, self-inflicted problem of using 200 QPS Python frameworks everywhere.
- tayo42 5 months ago
  
  This rant misses two things that people always miss
  On distributed. Qps scaling isn't the only reason and I suspect rarely the reason. It's mostly driven by availability needs.
  It's also driven my organizational structure and teams. Two teams don't need to be fighting over the same server to deploy their code. So it gets broken out into services with clear api boundaries.
  And ssh to servers might be fine for you. But systems and access are designed to protect the bottom tier of employees that will mess things up when they tweak things manually. And tweaking things by hand isn't reproducible when they break.
  
  Karrot_Kream 5 months ago
  
  Horizontal scaling is also a huge cost savings. If you can run your application with a tiny VM most of the time and scale it up when things get hot, then you save money. If you know your service is used during business hours you can provision extra capacity during business hours and release that capacity during off hours.
- icedchai 4 months ago
  
  I've seen this as well. A relatively simple application becomes a mess of terraform configuration for CloudFront, Lambda, API Gateway, S3, RDS and a half dozen other lesser services because someone had an obsession with "serverless." And performance is worse. And there's as much Terraform as there is actually application code.
- motorest 5 months ago
  
  > I've found the rush to distributed computing when it's not strictly necessary kinda baffling.
  I'm not entirely sure you understand the problem domain, or even the high-level problem. The is or ever was a "rush" to distributed computing.
  What you actually have is this global epifany that having multiple computers communicating over a network to do something actually has a name, and it's called distributed computing.
  This means that we had (and still have) guys like you who look at distributed systems and somehow do not understand they are looking at distributed systems. They don't understand that mundane things like a mobile app supporting authentication or someone opening a webpage or email is a distributed system. They don't understand that the discussion on monolith vs microservices is orthogonal to the topic of distributed systems.
  So the people railing against distributed systems are essentially complaining about their own ignorance and failure to actually understand the high-level problem.
  You have two options: acknowledge that, unless you're writing a desktop app that does nothing over a network, odds are every single application you touch is a node in a distributed system, or keep fooling yourself into believing it isn't. I mean, if a webpage fails to load then you just hit F5, right? And if your app just fails to fetch something from a service you just restart it, right? That can't possibly be a distributed system, and those scenarios can't possibly be mitigated by basic distributed computing strategies, isn't it?
  Everything is simple to those who do not understand the problem, and those who do are just making things up.
  
  lucyjojo 5 months ago
  
  you and the guy you are answering too are not talking the same language (technically yes but you are putting different meanings to the same words).
  this would lead to a pointless conversation, if it were to ever happen.
  
  motorest 4 months ago
  
  > you and the guy you are answering too are not talking the same language (technically yes but you are putting different meanings to the same words).
  That's the point, isn't it? It's simply wrong to assert that there's a rush to distributed systems when they are already ubiquitous in the real world, even if this comes as a surprise to people like OP. Get acquainted with the definition of distributed computing, and look at reality.
  The only epiphany taking place is people looking at distributed systems and thinking that, yes, perhaps they should be treated as distributed systems. Perhaps the interfaces between multiple microservices are points of failure, but replacing them with a monolith does not make it less of a distributed system. Worse, taking down your monolith is also a failure mode, one with higher severity. How do you mitigate that failure mode? Well, educate yourself about distributed computing.
  If you look at a distributed system and call it something other than distributed system, are you really speaking a different language, or are you simply misguided?
- throwawaymaths 5 months ago
  
  the minute you have a client (browser, e.g.) and a server you're doing a distributed system and you should be thinking a little bit about edge cases like loss of connection, incomplete tx. a lot of the goto protocols (tcp, http, even stuff like s3) are built with the complexities of distributed systems in mind so for most basic cases, a little thought goes a long way. but you get weird shit happening all the time (that may be tolerable) if you don't put any effort into it.
- ahartmetz 5 months ago
  
  >It's like the whole industry went collectively insane at the same time.
  Welcome to computing.
  - OOP will solve all of our problems
  - P2P will solve all of our problems
  - XML will solve all of our problems
  - SOAP will solve all of our problems
  - VMs will solve all of our problems
  - Ruby on Rails and by extension dynamically typed languages will solve all of our problems
  - Docker [etc...]
  - Functional programming
  - node.js
  - Cloud
  - Kubernetes
  - Statically typed languages
  - "Serverless"
  - Rust?
  - AI
  Some have more merit (IMO notably FP, static typing and Rust), some less (notably XML and SOAP)...
- Karrot_Kream 5 months ago
  
  [flagged]
jasonjayr 5 months ago

> Whenever I would bring up this gap I would be told that we can't spent time and wait for people to create "magic tools".
That sounds like an awful organizational ethos. 30hrs to make a "magic tool" to save 300hrs across the organization sounds like a no-brainer to anyone paying attention. It sounds like they didn't even want to invest in out-sourced "magic tools" to help either.
- bsnnkv 5 months ago
  
  The real kicker is that it wasn't even management saying this, it was "senior" developers on the team.
  I wonder if these roles tend to attract people who get the most job enjoyment and satisfaction out of the (manual) investigation aspect; it might explain some of the reluctance to adopting or creating more sophisticated observability tooling.
  
  zelphirkalt 5 months ago
  
  Senior doesn't always mean smarter or more experienced or anything really. It just all depends on the company and its culture. It can also mean "worked for longer" (which is not equal to more experienced, as you can famously have 10 times 1y experience, instead of 10y experience) and "more aligned with how management at the company acts".
  
  bongodongobob 5 months ago
  
  I'd probably take 10x 1y experience. Where I'm at now, everyone has been with the company 10-40 years. They think the way they do things is the only way because they've never seen anything else. I have many stories similar to the parent. They are a decade behind in their monitoring tooling, if it even exists at all. It's so frustrating when you know there are better ways.
  
  HPsquared 5 months ago
  
  "10x1y" means someone did the same thing for 10 years with no change or personal development. The learning stopped after the first year which then repeated Groundhog Day style.
  
  bongodongobob 5 months ago
  
  Ah, I misunderstood.
  
  Nevermark 5 months ago
  
  I see a dual. Between 10x1 workers and 1x10 workers working at 10x1 companies.
  Either way, doing the same kinds of things, the same kind of ways, more than a few times, is an automation/tool/practice improvement opportunity lost.
  I have yet to complete a single project I couldn't do much better, differently, if I were to do something similar again. Not everything is high creative, but software is such a complex balancing act/value terrain. Every project should deliver some new wisdom, however modest.
  
  lazystar 5 months ago
  
  another term for the phenomena is the "expert beginner" trap.
  
  fuzztester 5 months ago
  
  I have heard it as 20 versus 1, but it is the same thing.
  also called by some other names, including NIH syndrome, protecting your turf, we do it this way around here, our culture, etc.
  
  Henchman21 5 months ago
  
  IME, “senior” often means “who is left after the brain-drain & layoffs are done” when you’re at a medium sized company that isn’t prominent.
  
  getnormality 5 months ago
  
  I feel so seen.
  
  the_sleaze_ 5 months ago
  
  _To play devils advocate_: It could've sounded like the "new guy" came in and decided he needed to rewrite everything; bring in new xyx; steer the ship. New guy could even have been stepping directly on the toes of those senior developers who had fought and won wars to get were they are now.
  In my -very- humble opinion, you should wait at least a year before making big swinging changes or recommendations, most importantly in any big company.
  
  jiggawatts 5 months ago
  
  In my less humble opinion: the only honest and objective review you’ll get about a system is from a new hire for about a month. Measure the “what the fucks per hour” as a barometer of how bad your org is and how deep a hole it has dug itself into.
  After that honeymoon period, all but the most autistic people will learn the organisational politics, keep their head down, and “play the game” to be assigned trivial menial tasks in some unimportant corner of the system. At that point, only after two beers will they give their closest colleagues their true opinion.
  I’ve seen this play out over and over, organisation after organisation.
  The corollary is that you yourself are not immune to this effect and will grow accustomed to almost any amount of insanity. You too will find yourself saying sentences like “oh, it always has been like this” and “don’t try to change that” or “that’s the responsibility of another team” even though you know full well they’re barely even aware of what that thing is, let alone maintaining it in a responsible fashion.
  PS: This is my purpose in a nutshell as a consultant. I turn up and provide my unvarnished opinion, without being even aware of what I’m “not supposed to say” because “it upsets that psychotic manager”. I’ll be gone before I have any personal political consequences, but the report document will remain, pointing the finger at people that would normally try to bite it off.
  
  the_sleaze_ 5 months ago
  
  This is quite a defensive posture. In my current role I've been able to see an incredible raft of insanity, not be obtuse or arrogant enough to dismiss solutions or the intelligence of those who made them, but literally make a communal list of refactor candidates. Then slowly but surely wrangle people and political capital to my side to eventually change them. Years later we still have cruft leftover but there are many many projects, some multi-year, which are now complete.
  I also see a single-mindedness to specific technical implementations where a more mature view would be to see tech as a business and us less as artisans than blue collar workers.
  > steve jobs on "You're right, but it doesn't matter" https://www.youtube.com/watch?v=oeqPrUmVz-o
  
  jiggawatts 5 months ago
  
  Your attitude is commendable! It’s what a true leader should do, and you deserve to be promoted for it.
  My comment was a statistical observation of what typically happens in ordinary organisations without a strong-willed, technically capable leader at the helm.
  Disclaimer: Also, I have a biased view, because as a consultant I will generally only turn up if there is something already wrong with an organisation that insiders are unable to fix.
  
  disqard 5 months ago
  
  This rings true in my experience across different orgs, teams, in the tech industry.
  FWIW, academia has off-the-charts levels of "wtf" that newcomers will point out, though it's even more ossified than corporate culture, and they don't hire consultants to come in and fix things :)
  
  fc417fc802 5 months ago
  
  Not sure which specific field you have in mind there but many parts of academia also have off the charts levels of, as GP put it, "the most autistic people". Outside of the university bureaucracy (which is its own separate thing) nearly all of the "wtf" that I encountered there had good reasons behind it. Often simply "we don't have the cash" but also frequently things that seemed weird or wrong at first glance but were actually better given the goals in that specific case.
  Interfacing with IT, who thought they knew the "right" way to do everything but in reality had little to no understanding of our constraints, was always interesting.
  
  Jach 5 months ago
  
  There's also immense resistance to figuring out how to code something if an approach isn't at once obvious. Hence "magic". Sometimes a "spike doc" can convince people. My favorite second-hand instance of this was a MS employee insisting that a fast rendering terminal emulator was so hard as to require "an entire doctoral research project in performant terminal emulation".
  
  scottlamb 5 months ago
  
  > I wonder if these roles tend to attract people who get the most job enjoyment and satisfaction out of the (manual) investigation aspect; it might explain some of the reluctance to adopting or creating more sophisticated observability tooling.
  That's weird. I love debugging, and so I'm always trying to learn new ways to do it better. I mean, how can it be any other way? How can someone love something and be that committed to sucking at it?
  
  whstl 5 months ago
  
  I saw a case like this recently, and the fact is that the team responsible was completely burned out and was just doing anything to avoid people from giving them more work, but they also didn't trust anyone else to do it.
  One of the engineers just quit on the spot for a better paid position, the other was demoted and is currently under heavy depression last I heard from him.
  
  jbreckmckye 5 months ago
  
  Why would people who are good at [scarce, valuable skill] and get paid [many bananas] to practice it want to even imagine a world where that skill is now redundant? ;-)
  
  filoleg 5 months ago
  
  The real skill is “problem-solving”, not “doing lots of specific manual steps that could be automated and made easier.”
  Unfortunately, some people confuse the two and believe they are paid to do the latter, not the former, simply because others look at those steps and go “wtf, we could make that hell more pleasant and easier to deal with”.
  In the same vein, “creating perceived job security for yourself by willing to continuously deal with stupid bs that others rightfully aren’t interested in wasting time on.”
  Sadly, you are ultimately right though, as misguided self-interest often tends to win over well-meant proposals.
  
  fc417fc802 5 months ago
  
  If the goal is ensuring a future stream of bananas then can you really say the behavior is misguided?
- cmrdporcupine 5 months ago
  
  Consider that there is a class of human motivation / work culture that considers "figuring it out" to be the point of the job and just accepts or embraces complexity as "that's what I'm paid to do" and gets an ego-satisfaction from it. Why admit weakness? I can read the logs by timestamp and resolve the confusions from the CAP theorem from there!
  Excessive drawing of boxes and lines, and the production of systems around them becomes a kind of Glass Bead Game. "I'm paid to build abstractions and then figure out how to keep them glued together!" Likewise, recomposing events in your head from logs, or from side effects -- that's somehow the marker of being good at your job.
  The same kind of motivation underlies people who eschew or disparage GUI debuggers (log statements should be good enough or you're not a real programmer), too.
  Investing in observability tools means admitting that the complexity might overwhelm you.
  As an older software engineer the complexity overwhelmed me a long time ago and I strongly believe in making the machines do analysis work so I don't have to. Observability is a huge part of that.
  Also many people need to be shown what observability tools / frameworks can do for them, as they may not have had prior exposure.
  And back to the topic of the whole thread, too: can we back up and admit that distributed systems is questionable as an end in itself? It's a means to an end, and distributing something should be considered only as an approach when a simpler, monolithic system (that is easier to reasona bout) no longer suffices.
  Finally I find that the original authors of systems are generally not the ones interested in building out observability hooks and tools because for them the way the system works (or doesn't work) is naturally intuitive because of their experience writing it.
lumost 5 months ago

Anecdotally, I see a major under appreciation for just how fast and efficient modern hardware is in the distributed systems community.
I’ve seen a great many engineers become so used to provisioning compute that they forget that the same “service” can be deployed in multiple places. Or jump to building an orchestration component when a simple single process job would do the trick.
intelVISA 5 months ago

Distributed systems always ends up a dumping ground of failed tech solutions to deep org dysfunction.
Weak tech leadership? Let's "fix" that with some microservices.
Now it's FUBAR? Conceal it with some cloud native horrors, sacrifice a revolving door of 'smart' disempowered engineers to keep the theater going til you can jump to the next target.
Funny because dis sys is pretty solved since Lamport, 40+ years ago.
- whstl 5 months ago
  
  I suffered through this in two companies and man, it isn't easy.
  First one was a multi-billion-Unicorn had everything converted to microservices, with everything customized in Kubernetes. One day I even had to fix a few bugs in the service mesh because the guy who wrote it left and I was the only person not fighting fires able to write the language it was in. I left right after the backend-of-the-frontend failed to sustain traffic during a month where they literally had zero customers (Corona).
  At the second one there was a mandate to rewrite everything to microservices and it took another team 5 months to migrate a single 100-line class I wrote into a microservice. It just wasn't meant to be. Then the only guy who knows how the infrastructure works got burnout after being yelled at too many times and then got demoted, and last I heard is at home with depression.
  Weak leadership doesn't even begin to describe it, especially the second.
  But remembering it is a nice reminder that a job is just a means of getting a payment.
- rbjorklin 5 months ago
  
  Would you mind sharing some more specific information/references to Lamport’s work?
  
  vitus 5 months ago
  
  The three big papers: clocks [0], Paxos [1], Byzantine generals [2].
  [0] https://lamport.azurewebsites.net/pubs/time-clocks.pdf
  [1] https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf
  [2] https://lamport.azurewebsites.net/pubs/byz.pdf
  Or, if you prefer wiki articles:
  https://en.wikipedia.org/wiki/Lamport_timestamp
  https://en.wikipedia.org/wiki/Paxos_(computer_science)
  https://en.wikipedia.org/wiki/Byzantine_fault
  I don't know that I would call it "solved", but he certainly contributed a huge amount to the field.
  
  madhadron 5 months ago
  
  Lamport's website has his collected works. The paper to start with is "Time, clocks, and the ordering of events in a distributed system." Read it closely all the way to the end. Everyone seems to miss the last couple sections for some reason.
bob1029 5 months ago

> Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".
I've never once been granted explicit permission to try a different path without being burdened by a mountain of constraints that ultimately render the effort pointless.
If you want to try a new thing, just build it. No one is going to encourage you to shoot holes through things that they hang their own egos from.
- DrFalkyn 5 months ago
  
  Hope you can justify that during sprint planning / standup
  
  bob1029 5 months ago
  
  If you are going to just build it in the absence of explicit buy-in, you certainly shouldn't spend time on the standup talking about it. Wait until your idea is completely formed and then drop a 5 minute demo on the team.
  It can be challenging to push through to a completed demo without someone cheering you on every morning. I find this to be helpful more than hurtful if we are interested in the greater good. If you want to go against the grain (everyone else on the team), then you need to be really sure before you start wasting everyone else's time. Prove it to yourself first.
fatnoah 5 months ago

> The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it.
One of the most significant "triumphs" of my technical career came at a startup where I started as a Principal Engineer and left as the VP Engineering. When I started, we had nightly outages requiring Engineering on-call, and by the time I left, no one could remember a recent issue that required Engineers to wake up.
It was a ton of work and required a strong investment in quality & resilience, but even bigger impact was from observability. We couldn't afford APM, so we took a very deliberate approach to what we logged and how, and stuffed it into an ELK stack for reporting. The immediate benefit was a drastic reduction in time to diagnose issues, and effectively let our small operations team triage issues and easily identify app vs. infra issues almost immediately. Additionally, it was much easier to identify and mitigate fragility in our code and infra.
The net result was an increase in availability from 98.5% to 99.995%, and I think observability contributed to at least half of that.
fra 5 months ago

As someone who builds observability tools for embedded software, I am flabbergasted that you're finding a more tools-friendly culture in embedded than in distributed systems!
Most hardware companies have zero observability, and haven't yet seen the light ("our code doesn't really have bugs" is a quote I hear multiple times a week!).
- whstl 5 months ago
  
  It's probably a "grass is greener" situation.
  My experience with mid-size to enterprise is having lots of observability and observability-adjacent tools purchased but not properly configured. Or the completely wrong tools for the job being used.
  A few I've seen recently: Grafana running on local Docker of developers because of lack of permissions in the production version (the cherry on top: the CTO himself installed this on the PMs computers), Prometheus integration implemented by dev team but env variables still missing after a couple years, several thousand a month being paid to Datadog but nothing being done with the data nor with the dog.
  On startups it's surprisingly different, IME. But as soon as you "elect" a group to be administrator of a certain tool or some resource needed by those tools, you're doomed.
anonzzzies 5 months ago

I really love embedded work; at least it gives you the feeling that you have control over things. Not everything being confused and black boxed where you have to burn a goat to make it work, sometimes.
- porridgeraisin 5 months ago
  
  > where you have to burn a goat to make it work, sometimes.
  Or talk to a goat, sometimes
  https://modernfarmer.com/2014/05/successful-video-game-devel...
EtCepeyd 5 months ago

This resonates a lot with me.
Distributed systems require insanely hard math at the bottom (paxos, raft, gossip, vector clocks, ...) It's not how the human brain works natively -- we can learn abstract thinking, but it's very hard. Embedded systems sometimes require the parallelization of some hot spots, but those are more like the exception AIUI, and you have a lot more control over things; everything is more local and sequential. Even data race free multi-threaded programming in modern C and C++ is incredibly annoying; I dislike dealing with both an explicit mesh of peers, and with a leaky abstraction that lies that threads are "symmetric" (as in SMP) while in reality there's a complicated messaging network underneath. Embedded is simpler, and it seems to require less that practitioners become advanced mathematicians for day to day work.
- AlotOfReading 5 months ago
  
  Most embedded systems are distributed systems these days, there's simply a cultural barrier that prevents most practitioners from fully grappling with that fact. A lot of systems I've worked on have benefited from copying ideas invented by distributed systems folks working on networking stuff 20 years ago.
  
  DanielHB 5 months ago
  
  I worked in an IoT platform that consisted of 3 embedded CPUs and one linux board. The kicker was that the linux board could only talk directly to one of the chips, but had to be capable of updating the software running on all of them.
  That platform was parallelizable of up to 6 of its kind in a master-slave configuration (so the platform in the physical position 1 would assume the "master role" for a total of 18 embedded chips and 6 linux boards) on top of having optionally one more box with one more CPU in it for managing some other stuff and integrating with each of our clients hardware. Each client had a different integration, but at least they mostly integrated with us, not the other way around.
  Yeah it was MUCH more complex than your average cloud. Of course the original designers didn't even bother to make a common network protocol for the messages, so each point of communication not only used a different binary format, they also used different wire formats (CAN bus, Modbus and ethernet).
  But at least you didn't need to know kubernetes, just a bunch of custom stuff that wasn't well documented. Oh yeah and don't forget the boot loaders for each embedded CPU, we had to update the bootloaders so many times...
  The only saving grace is that a lot of the system could rely on the literal physical security because you need to have physical access (and a crane) to reach most of the system. Pretty much only the linux boards had to have high security standards and that was not that complicated to lock down (besides maintaining a custom yocto distribution that is).
  
  AlotOfReading 5 months ago
  
  Many automotive systems have >100 processors scattered around the vehicle, maybe a dozen of which are "important". I'm amazed they ever work given the quality of the code running on them.
  
  DanielHB 5 months ago
  
  A LOT of QA
  
  zootboy 5 months ago
  
  Indeed. I've been building systems that orchestrate batteries and power sources. Turns out, it's a difficult problem to temporally align data points produced by separate components that don't share any sort of common clock source. Just take the latest power supply current reading and subtract the latest battery current reading to get load current? Oops, they don't line up, and now you get bizarre values (like negative load power) when there's a fast load transient.
  Even more fun when multiple devices share a single communication bus, so you're basically guaranteed to not get temporally-aligned readings from all of the devices.
  
  szvsw 5 months ago
  
  I run a small SaaS side hustle where the core value proposition of the product - at least what got us our first customers, even if they did not realize what was happening under the hood - is, essentially, an implementation of NTP running over HTTPS that can be run on some odd devices and sync those devices to mobile phones via a front end app and backend server. There’s some other CMS stuff that makes it easy for the various customers to serve their content to their customers’ devices, but at the end of the day our core trade secret is just using a roll-your-own NTP implementation… I love how NTP is just the tip of the iceberg when it comes to the wicked problem of aligning clocks. This is all just to say - I feel your pain, but also not really since it sounds like you are dealing with higher precision and greater challenges than I ever had to!
  Here’s a great podcast on the topic which you will surely like!
  https://signalsandthreads.com/clock-synchronization/
  And a related HN thread in case you missed it:
  https://news.ycombinator.com/item?id=39298652
  
  zootboy 5 months ago
  
  The ultimate frustration is when you have no real ability to fix the core problem. NTP (and its 'roided-up cousin PTP) are great, but they require a degree of control and influence over the end devices that I just don't have. No amount of pleading will get a battery vendor to implement NTP in their BMS firmware, and I don't have nearly enough stacks of cash to wave around to commission a custom firmware. So I'm pretty much stuck with the "black box cat herding" technique of interoperation.
  
  szvsw 5 months ago
  
  Yeah, that makes sense. We are lucky in that we get to deploy our code to the devices. It’s not really “embedded” in the sense most people use as these are essentially sandboxed Linux devices that only run applications written in a programming language specific to these devices which is similar to Lua/python but the scripts get turned into byte code at boot IIRC, but none the less very powerful/fast.
  You work on BMS stuff? That’s cool- a little bit outside my domain (I do energy modeling research for buildings) but have been to some fun talks semi-recently about BMs/BAS/telemetry in buildings etc. The whole landscape seems like a real mess there.
  FYI that podcast I linked has some interesting discussion about some issues with PTP over NTP- worth listening to for sure.
  
  anitil 5 months ago
  
  Yes even 'simple' devices these days will have devices (ADC/SPI etc) running in parallel often using DMA, multiple semi-independent clocks, possibly nested interrupts etc. Oh and the UART for some reason always, always has bugs, so hopefully you're using multiple levels of error checking.
  
  zootboy 5 months ago
  
  Yeah, it was a "fun" surprise to discover the errata sheet for the microcontroller I was working with after beating my head against the wall trying to figure out why it doesn't do what the reference manual says it should do. It's especially "fun" when the errata is "The hardware flow control doesn't work. Like, at all. Just don't even try."
  
  anitil 5 months ago
  
  The thing that would break my brain is that the errata is a pdf that you get from .... some link, somewhere
- motorest 5 months ago
  
  > Distributed systems require insanely hard math at the bottom (paxos, raft, gossip, vector clocks, ...) It's not how the human brain works natively -- we can learn abstract thinking, but it's very hard.
  I think this take is misguided. Most of the systems nowadays, specially those involving any sort of network cals, are already distributed systems. Yet, the amount of systems go even close to touching fancy consensus algorithms is very very limited. If you are in a position to design a system and you hear "Paxos" coming out of your mouth, that's the moment you need to step back and think about what you are doing. Odds are you are creating your own problems, and then blaming the tools.
  
  yodsanklai 5 months ago
  
  I remember when I prepared for system design interviews in FAANG, I was anxious I would get asked about Paxos (which I learned at school). Now that I'm working there, never heard about Paxos or fancy distributed algorithms. We rely on various high-level services for deployment, partitioning, monitoring, logging, service discovery, storage...
  And Paxos doesn't require much maths. It's pretty tricky to consider all possible interleavings, but in term of maths, it's really basic discrete maths.
  
  motorest 5 months ago
  
  I'm starting to believe these talks of fancy high-complexity solutions come from people who desperately try to come up with convoluted problems they create for themselves only to be able to say they did a fancy high-complexity solution. Instead of going with obvious simple reliable solutions, they opt for convoluted high-complexity unreliable hacks. Then, when they are confronted by the mess they created for themselves, they hide behind the high-complexity of their solution, as if the problem was the solution itself and not making the misjudged call to adopt it.
  It's so funny how all of a sudden every single company absolutely must implement Paxos. No exception. Your average senior engineer at a FANG working with global deployments doesn't come close to even hearing about it, but these guys somehow absolutely must have Paxos. Funny.
  
  convolvatron 5 months ago
  
  this is completely backwards. the tools may have some internal consistency guarantees, handle some classes of failures, etc. They are leaky abstractions that are partially correct. There were not collectively designed to handle all failures and consistent views no matter their composition.
  From the other direction, Paxos, two generals, serializability, etc. are not hard concepts at all. Implementing custome solutions in this space _is_ hard and prone to error, but the foundations are simple and sound.
  You seem to be claiming that you shouldn't need to understand the latter, that the former gives you everything you need. I would say that if you build systems using existing tools without even thinking about the latter, you're just signing up to handling preventable errors manually and treating this box that you own and black and inscrutable.
- Thaxll 5 months ago
  
  It does not requires any math because 99.9% of the time the issue is not in the low level implementation but in the business logic that the dev did.
  No one goes to review the transaction engine of Postgress.
  
  EtCepeyd 5 months ago
  
  I tend to disagree.
  - You work on postgres: you have to deal with the transaction engine's internals.
  - You work in enterprise application intergration (EAI): you have ten legacy systems that inevitably don't all interoperate with any one specific transaction manager product. Thus, you have to build adapters, message routing and propagation, gateways, at-least-once-but-idempotent delivery, and similar stuff, yourself. SQL business logic will be part of it, but it will not solve the hard problems, and you still have to dig through multiple log files on multiple servers, hoping that you can rely on unique request IDs end-to-end (and that the timestamps across those multiple servers won't be overly contradictory).
  In other words: same challenges at either end of the spectrum.
  
  pfannkuchen 5 months ago
  
  Yeah this is kind of an abstraction failure of the infrastructure. Ideally the surface visible to the user should be simple across the entire spectrum of use cases. In some very, very rare cases one necessarily has to spelunk under the facade and know something about the internals, but for some reason it seems to happen much more often in the real world. I think people often don't put enough effort into making their system model fit with the native model of the infrastructure, and instead torture the infrastructure interface (often including the "break glass" parts) to fit into their a priori system model.
- toast0 5 months ago
  
  That's true, but you can do a lot of that once, and then get on with your life, if you build the right structures. I've gotten a huge amount of mileage from consensus to decide where to send reads/writes to, then everyone sends their reads/writes for the same piece of data to the same place; that place does the application logic where it's simple, and sends the result back. If you don't get the result back in time, bubble it up to the end-user application and it may retry or not, depending.
  This is built upon a framework of the network is either working or the server team / ops team is paged and will be actively trying to figure it out. It doesn't work nearly as well if you work in an environment where the network is consistently slightly broken.
- PaulDavisThe1st 5 months ago
  
  > Even data race free multi-threaded programming in modern C and C++ is incredibly annoying; I dislike dealing with both an explicit mesh of peers, and with a leaky abstraction that lies that threads are "symmetric" (as in SMP) while in reality there's a complicated messaging network underneath.
  If you're using traditional (p)threads-derived APIs to get work done on a message passing system, I'd say you're using the wrong API.
  More likely, I don't understand what you might mean here.
  
  EtCepeyd 5 months ago
  
  Sorry, I figure I ended up spewing a bit of gibberish.
  - By "explicit mesh of peers", I referred to atomics, and the modern (C11 and later) memory model. The memory model, for example as written up in the C11 and later standards, is impenetrable. While the atomics interfaces do resemble a messaging passing system between threads, and therefore seem to match the underlying hardware closely, they are discomforting because their foundation, the memory model, is in fact laid out in the PhD dissertation of Mark John Batty, "The C11 and C++11 Concurrency Model" -- 400+ pages! <https://www.cl.cam.ac.uk/~pes20/papers/topic.c11.group_abstr...>
  - By "leaky abstraction", I mean the stronger posix threads / standard C threads interfaces. They are more intuitive and safer, but are more distant from the hardware, so people sometimes frown at them for being expensive.
beoberha 5 months ago

Yep - I’ve very much been living the former for almost a decade now. It is especially difficult when the components stretch across organizations. It doesn’t quite address what the author here is getting at, but it does make me believe that this new programming model will come from academia and not industry.
lelanthran 5 months ago

I spent the majority of my career as an embedded dev. There are ... different ... challenges, and I'm not so sure that I would want to go back to it.
It pays poorly, the tooling more often than not sucks (more than once I've had to do some sort of stub for an out-of-date gcc), observability is non-existent unless you're looking at a device on your desk, in which case your observability tool is an oscilloscope (or bus pirate type of device, if you're lucky in having the lower layers completely free of bugs).
The datasheets/application notes are almost always incomplete, with errata (in a different document) telling you "Yeah, that application note is wrong, don't do that".
The required math background can be strict as well: R/F, analog ... basically anything interesting you want to do requires a solid grounding in undergrad maths.
I went independent about 2 years ago. You know what pays better and has less work? Line of business applications. I've delivered maybe two handfuls of LoB applications but only one embedded system, and my experience with doing that as a contractor is that I won't take an embedded contract anymore unless it's a client I've already done work for, or if the client is willing to pay 75% upfront, and they agree to a special hourly rate that takes into account my need for maintaining all my own equipment.
alfiedotwtf 5 months ago

I have talked to many people in the Embedded space doing Rust, and every single one of them had the biggest grin while talking about work. Sounds like you’ll have fun :)
ithkuil 5 months ago

10 years ago I went on a similar journey. I left faang to work on a startup working on embedded firmware for esp8266. The lack of tooling was very frustrating. I ended up writing a gdb stub (before espressif released one) and a malloc debugger (via serial port) just to manage to get shit done.
bryanlarsen 5 months ago

I think you were unlucky in your distributed system job and lucky in your embedded job. Embedded is filled with crappy 3rd party and in-house tooling, far more so than distributed, in my experience. That crappiness perhaps leads to a higher likelihood to spend time on them, but it doesn't have to.
Embedded does give you a greater feeling of control. When things aren't working, it's much more likely to be your own fault.
sly010 5 months ago

I don't disagree, but funny that I recently made a point to someone that modern consumer embedded systems (with multiple MCUs connected with buses and sometimes shared memory) are basically small distributed systems, because partial restarts are common and the start/restart order of the MCUs is not very well defined. At least in the space I am working in. (Needless to say we use C, not rust)
literallyroy 5 months ago

How did you make that transition/find a position? Were you already using Rust in a previous role?
yolovoe 5 months ago

Is the “card” work EC2 Nitro by any chance? Sounds similar to what I used to do
junon 5 months ago

Can concur, I also switched mostly to firmware and have enjoyed it much more. Though Rust firmware jobs are hard to come by.
fons 5 months ago

Would you mind disclosing your current employer? I am also interested in moving to an embedded systems role.
Scramblejams 5 months ago

I've often heard embedded is a nightmare of slapdashery. Any tips for finding shops that do it right?
- AlotOfReading 5 months ago
  
  It's not foolproof, but I've found there's a strong correlation between product margin and the sanity of the dev experience.
- DanielHB 5 months ago
  
  There is inherent complexity and self-inflicted complexity, they tend to go hand in hand but self-inflicted complexity can be exacerbated in bad projects. A lot of embedded software is just inherent complex, cars for example.
- api 5 months ago
  
  A lot of times it is, but it's not your fault. It's the fault of vendors and/or third party code you have to use.
bagels 5 months ago

Which company? Doesn't sound like the infra org I was in at a FAANG
englishspot 5 months ago

curious as to how you made that transition. seems like that'd be tough in today's job market.
DaiPlusPlus 5 months ago

> I switched from a role working on a distributed system [...] to embedded software which runs on cards in data center racks
Would you agree that, technically (or philosophically?) that both roles involved distributed systems (e.g. the world-wide-web of web-servers and web-browsers exists as a single distributed system) - unless your embedded boxes weren't doing any network IO at all?
...which makes me genuinely curious exactly what your aforementioned distributed-system role was about and what aspects of distributed-computing theory were involved.
im_down_w_otp 5 months ago

We built a bunch of tools & technology for leveraging observability (docs.auxon.io) to do V&V, stress testing, auto root-cause analysis, etc. in clusters of embedded development (all of it built in Rust too :waves: ), since the same challenges exist for folks building vehicle platforms, lunar rovers, drones, etc. Both within a single system as well as across fleets of systems. Many embedded developers are actually distributed systems developers... they just don't think of it that way.
It's often quite a challenge to get that class of engineer to adopt things that give them visibility and data to track things down as well. Sometimes it's just a capability/experience gap and sometimes it's just over indexing on a perception of time getting to a solution vs. the time wasted on repeated problems and yak shavings.

gklitt 5 months ago

This is outside my area of expertise, but the post sounds like it’s asking for “choreographic programming”, where you can write an algorithm in a single function while reasoning explicitly about how it gets distributed:

https://en.m.wikipedia.org/wiki/Choreographic_programming

I’m curious to what extent the work in that area meets the need.

shadaj 5 months ago

You caught me! That's what my next post is about :)
- lachlan_gray 5 months ago
  
  This could be a fun example to work with :p
  https://en.m.wikipedia.org/wiki/Shakespeare_Programming_Lang...
  
  shadaj 5 months ago
  
  You might enjoy my first ever blog post from ~10 years ago, when I first learned about distributed systems: https://www.shadaj.me/writing/romeo-juliet-and-reactive-prog...
- ashton314 5 months ago
  
  I see from your bio that you are a PhD student. What are you doing with choreographies? (I’m in this space too.)
roadbuster 5 months ago

How does "choreographic programming" differ from the actor model?
- LegionMammal978 4 months ago
  
  From what I can tell, the important distinction is that all actors (and their messages) are described alongside each other, instead of being described separately. There are many implementations of the actor model, but most of them are the 'static-location architectures' that TFA talks about.

rectang 5 months ago

Ten years ago, I had lunch with Patricia Shanahan, who worked for Sun on multi-core CPUs several decades ago (before taking a post-career turn volunteering at the ASF which is where I met her). There was a striking similarity between the problems that Sun had been concerned with back then and the problems of the distributed systems that power so much the world today.

Some time has passed since then — and yet, most people still develop software using sequential programming models, thinking about concurrency occasionally.

It is a durable paradigm. There has been no revolution of the sort that the author of this post yearns for. If "Distributed Systems Programming Has Stalled", it stalled a long time ago, and perhaps for good reasons.

EtCepeyd 5 months ago

> and perhaps for good reasons
For the very good reason that the underlying math is insanely complicated and tiresome for mere practitioners (which, although I have a background in math, I openly aim to be).
For example, even if you assume sequential consistency (which is an expensive assumption) in a C or C++ language multi-threaded program, reasoning about the program isn't easy. And once you consider barriers, atomics, load-acqire/store-release explicitly, the "SMP" (shared memory) proposition falls apart, and you can't avoid programming for a message passing system, with independent actors -- be those separate networked servers, or separate CPUs on a board. I claim that struggling with async messaging between independent peers as a baseline is not why most people get interested in programming.
Our systems (= normal motherboards on one and, and networked peer to peer systems on the other end) have become so concurrent that doing nearly anything efficiently nowadays requires us to think about messaging between peers, and that's very-very foreign to our traditional, sequential, imperative programming languages. (It's also foreign to how most of us think.)
Thus, I certainly don't want a simple (but leaky) software / programming abstraction that hides the underlying hardware complexity; instead, I want the hardware to be simple (as little internally-distributed as possible), so that the simplicity of the (sequential, imperative) programming language then reflect and match the hardware well. I think this can only be found in embedded nowadays (if at all), which is why I think many are drawn to embedded recently.
- hinkley 5 months ago
  
  I think SaaS and multicore hardware are evolving together because a queue of unrelated, partially ordered tasks running in parallel is a hell of a lot easier to think about than trying to leverage 6-128 cores to keep from ending up with a single user process that’s wasting 84-99% of available resources. Most people are not equipped to contend with Amdahl’s Law. Carving 5% out of the sequential part of a calculation is quickly becoming more time efficient than taking 50% out of the parallel parts, and we’ve spent 40 years beating the urge to reach for 1-4% improvements out of people. When people find out I got a 30% improvement by doing 8+6+4+4+3+2+1.5+1.5 they quickly find someplace else to be. The person who did the compressed pointer work on v8 to make it as fast as 64 bit pointers is the only other person in over a decade I’ve seen document working this way. If you’re reading this we should do lunch.
  So because we discovered a lucrative, embarrassingly parallel problem domain that’s what basically the entire industry has been doing for 15 years, since multicore became unavoidable. We have web services and compilers being multi-core and not a lot in between. How many video games still run like three threads and each of those for completely distinct tasks?
  
  gue5t 5 months ago
  
  Personally I've been inspired by nnethercote's logs (https://nnethercote.github.io/) of incremental single-digit percentage performance improvements to rustc over the past several years. The serial portion of compilers is still quite significant and efforts to e.g. parallelize the entire rustc frontend are heroic slogs that have run into subtle semantic problems (deadlocks and races) that have made it very hard to land them. Not to disparage those working on that approach, but it is really difficult! Meanwhile, dozens of small speedups accumulate to really significant performance improvements over time.
  
  linkregister 5 months ago
  
  > 8+6+4+4+3+2+1.5+1.5
  What is this referring to? It sounds like a fascinating problem.
  
  EtCepeyd 5 months ago
  
  >> When people find out I got a 30% improvement by doing 8+6+4+4+3+2+1.5+1.5
  > What is this referring to?
  30 = 8+6+4+4+3+2+1.5+1.5
- gmadsen 5 months ago
  
  I know c++ has a lack luster implementation, but do coroutines and channels solve some of these complaints? although not inherently multithreaded, many things shouldn't be multithreaded , just paused. and channels insteaded of shared memory can control order
  
  hinkley 5 months ago
  
  Coroutines basically make the same observation as transmit windows in TCP/IP: you don’t send data as fast as you can if the other end can’t process it, but also if you send one at a time then you’re going to be twiddling your fingers an awful lot. So you send ten, or twenty, and you wait for signs of progress before you send more.
  On coroutines it’s not the network but the L1 cache. You’re better off running a function a dozen times and then running another than running each in turn.
  
  gmadsen 5 months ago
  
  fair enough, that was the design choice c++ went with to not break ABI and have moveable coroutine handles
  rust accepted the tradeoff and can do pure stack async,
  there are things you can do in c++ to not get the dynamic allocation to heap, but it requires a custom allocator + predefining size of coroutines.
  https://pigweed.dev/docs/blog/05-coroutines.html
  
  EtCepeyd 5 months ago
  
  I've found both explicit future/promise management and coroutines difficult (even irritating) to reason about. Co-routines look simpler at the surface (than explicit future chaining), and so their the syntax is less atrocious, but there are nasty traps. For example:
  https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines...
- vacuity 5 months ago
  
  I think trying to shoehorn everything into sequential, imperative code is a mistake. The burden of performance should be on the programmer's cognitive load, aided where possible by the computer. Hardware should indeed be simple, but not molded to current assumptions. It's indeed true that concurrency of various fashions and the attempts at standardizing it are taxing on programmers. However, I posit this is largely essential complexity and we should accept that big problems deserve focus and commitment. People malign frameworks and standards (obligatory https://xckd.com/927), but the answer is not shying away from them but rather leveraging them while being flexible.
- cmrdporcupine 5 months ago
  
  What we need is for formal verification tools (for linearizability, etc.) to be far more understood and common.
hinkley 5 months ago

I think the underlying premise of Cloud is:
Pay a 100% premium on compute resources in order to pretend the 8 Fallacies of Distributed Computing don’t exist.
I sat out the beginning of Cloud and was shocked at how completely absent they are from conversations within the space. When the hangover hits it’ll be ugly. The Devil always gets his due.
jimbokun 5 months ago

The author critiques having sequential code executing on individual nodes, uninformed by the larger distributed algorithm in which they play a part.
However, I think there are great advantages to that style. It’s easier to analyze and test the sequential code for correctness. Then it writes a Kafka message or makes an HTTP call and doesn’t need to be concerned with whatever is handling the next step in the process.
Then assembling the sequential components once they are all working individually is a much simpler task.
bigmutant 5 months ago

The fundamental problems are communication lag and lack of information about why issues occur (encapsulated by the Byzantine Generals problem). I like to imagine trying to build a fault-tolerant, reliable system for the Solar System. Would the techniques we use today (retries, timeouts, etc) really be adequate given that lag is upwards of hours instead of milliseconds? But that's the crux of these systems, coordination (mostly) works because systems are close together (same board, at most same DC)
shadaj 5 months ago

Stay tuned for the next blog post for one potential answer :) My PhD has been focused on this gap!
- rectang 5 months ago
  
  As a programmer, I hope that your answer continues to abstract away the problems of concurrency from me, the way that CPU designers have managed, so that I can still think sequentially except when I need to. (And as a senior engineer, you need to — developing reliable concurrent systems is like pilots landing planes in bad weather, part of the job.)
  
  hinkley 5 months ago
  
  I was doing some Java code recently after spending a decade in async code and boy that first few minutes was like jumping into a cold pool. Took me a moment to switch gears back to everything is blocking and that function just takes 500ms sometimes, waiting for IO.

hinkley 5 months ago

I don’t think there’s anyone in the Elixir community who wouldn’t love it if companies would figure out that everyone is writing software that contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Erlang, and start hiring Elixir or Gleam devs.

The future is here, but it is not evenly distributed.

ikety 5 months ago

It's so odd seeing people dissuade others for implementing "niche" languages like Elixir or Gleam. If you post a job opportunity with these languages, I guarantee you will be swamped with qualified candidates that are very passionate and excited to work with these languages full time.
- hinkley 5 months ago
  
  At this point I’m worried that because elixir is over 10 years old that it’ll never arrive. But then Python is older than Java and here we are.
rramadass 5 months ago

I decided long ago (after having implemented various protocols and shared-memory multi-threaded code) that what i like best is to use Erlang as "the fabric" for the graph of distributed computing and C/C++ for heavy lifting at any node.
- hinkley 5 months ago
  
  I’m hoping the JIT will finally start to change that.
jimbokun 5 months ago

Yes, my sense reading the article was the user is reinventing Erlang.
ignoramous 5 months ago

> writing software that contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Erlang
Since you were at AWS (?), you'd know that Erlang did get its shot at distributed systems there. I'm unsure what went wrong, but if not c/c++, it was all JVM based languages soon after that.
- hinkley 5 months ago
  
  No I worked a contract in the retail side and would not wish that job on anyone. My most recent favorite boss works there now and I haven’t even said hi because I’m afraid he’ll offer me a job.
- worthless-trash 4 months ago
  
  I have noticed that in corporate, langauges dont fail on technical merit, but its fashion merits.
seivan 5 months ago

[dead]

bigmutant 5 months ago

Good resources for understanding Distributed Systems:

- MIT course with Robert Morris (of Morris Worm fame): https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39...

- Martin Kleppmann (author of DDIA): https://www.youtube.com/watch?v=UEAMfLPZZhE&list=PLeKd45zvjc...

If you can work through the above (and DDIA), you'll have a solid understanding of the issues in Distributed System, like Consensus, Causality, Split Brain, etc. You'll also gain a critical eye of Cloud Services and be able to articulate their drawbacks (ex: did you know that replication to DynamoDB Secondary Indexes is eventually consistent? What effects can that have on your applications?)

ignoramous 5 months ago

> Robert Morris (of Morris Worm fame)
(of Y Combinator fame, too)

dingnuts 5 months ago

>The static-location model seems like the right place to start, since it is at least capable of expressing all the types of distributed systems we might want to implement, even if the programming model offers us little help in reasoning about the distribution. We were missing two things that the arbitrary-location model offered:

> Writing logic that spans several machines right next to each other, in a single function

> Surfacing semantic information on distributed behavior such as message reordering, retries, and serialization formats across network boundaries

Aren't these features offered by Erlang?

shadaj 5 months ago

Erlang (is great but) is still much closer to the static-location (Actors) paradigm than what I’m aspiring for. For example, if you have stateful calculations, they are typically implemented as isolated (static-location) loops that aren’t textually co-located with the message senders.
chuckledog 5 months ago

Great point. Erlang is still going strong, in fact WhatsApp is implemented in Erlang
prophesi 5 months ago

Yep, the words fault tolerance and distributed computing immediately brings to my mind Erlang/Elixir.

tracnar 5 months ago

The unison programming language does foray into a truly distributed programming language: https://www.unison-lang.org/

aDyslecticCrow 5 months ago

Functional programming languages already have alot of powerful concepts for distributed programming. Loads of the distributed programming techniques used elsewhere are often taken from an obscure fp language from years prior. Erlang comes to mind as still quite uniquely distributed without non fp comparison
Unison seems to build on it further. Very cool

KaiserPro 5 months ago

Distributed systems are hard, as well all know.

However the number of people that actually need a distributed system is pretty small. With the rise of kubernetes, the number of people who've not been burnt by going distributed when they didn't need to has rapidly dropped.

You go distributed either because you are desperate, or because you think it would be fun. K8s takes the fun out of most things.

Moreover, with machines suddenly getting vast IO improvements, the need for going distributed is much less than it was 10 years. (yes i know there is fault tolerance, but that adds another dimension of pain.)

sd9 5 months ago

> the number of people who've not been burnt by going distributed when they didn't need to has rapidly dropped
Gosh, this was hard to parse! I’m still not sure I’ve got it. Do you mean “kubernetes has caused more people to suffer due to going distributed unnecessarily”, or something else?
- boarush 5 months ago
  
  Had me confused for a second too, but I think it is the former that they meant.
  K8s has unneeded complexity which is really not required at even decent enough scales, if you've put in enough effort to architect a solution that makes the right calls for your business.
  
  KaiserPro 5 months ago
  
  yeah sorry, double negatives.
  People got burnt by kubernetes, and that pissed in the well of enthusiasm for experimenting with distributed systems
  
  DrFalkyn 5 months ago
  
  Because people, especially Devops, thought k8s was some magic, when all it really does is makes the mechanics easier
  If you’re architecture is poor k8s won’t help you
bormaj 5 months ago

Any specific pitfalls to avoid with K8s? I've used it to some degree of success in a production environment, but I keep deployments relatively simple.
- KaiserPro 5 months ago
  
  Its a spectrum rather than a binary thing, however you are asking the right questions!
  One of the things that is most powerful about K8s is that it gives you a lot of primitives to build things with. This is also its biggest drawback.
  If you are running real physical infrastructure and want to run several hundreds of "services" (as in software, not k8s services) then kubernetes is probably a good fit, but you have a storage and secrets problem to solve as well.
  On the cloud, unless you're using a managed service, its almost certainly easier to either use lambdas (for low traffic services) or one of the many managed docker hosting services they have.
  Some of them are even K8s API compatible.
  but _why_?
  At its heart, k8s is a "run this thing here with these resources" system. AWS also does this, so duplicating it costs time and money. For most people the benefit of running ~20 services < 5 dbs and storage on k8s is negative. Its a steep learning curve, very large attack surface (You need ot secure the instance and then k8s permissions) and its an extra layer of things to maintain. For example, running a DB on k8s is perfoectly possible, and there are bunch of patterns you can follow. But you're on the hook for persistence, backup and recovery. managed DBs are more expensive to run, but they cost 0 engineer hours to implement.
  BUT
  You do get access to helm, which means that you can copypasta mostly working systems into your cluster. (but again like running untrusted docker images, thats not a great thing to do.)
  The other thing to note is the networking scheme is badshit crazy and working with ipv6 is still tricky.

margorczynski 5 months ago

Distributed systems are cool but most people don't really get how much complexity it introduces which leads them to fad-driven decisions like using Event Sourcing where there is no fundamental need to use it. I've seen projects getting burned because of the complexity and overhead it introduces where "simpler" approaches worked well and were easy to extend/fix. Hard to find and fix bugs, much slower feature addition and lots of more goodies the blogs with toy examples don't speak about.

nine_k 5 months ago

The best recipe I know is to start from a modular monolith [1] and split it when and if you need to scale way past a few dozen nodes.
Event sourcing is a logical structure; you can implement it with SQLite or even flat files, locally, if you your problem domain is served well by it. Adding Kafka as the first step is most likely a costly overkill.
[1]: https://awesome-architecture.com/modular-monolith/
- margorczynski 5 months ago
  
  What you're speaking of is a need/usability-based design and extension where you design the solution with certain "safety valves" that let you scale it up when needed.
  This is in contrast to the fad-driven design and over-engineering that I'm speaking of (here I simply used ES as an example) that is usually introduced because someone in power saw a blog post or 1h talk and it looked cool. And Kafka will be used because it is the most "scalable" and shiny solution, there is no pros-vs-cons analysis.
rjbwork 5 months ago

If the choice has already been made to do a distributed system (outside of the engineer's control...), is a choice to use Event Sourcing by the engineer then a good idea?
mrkeen 5 months ago

In my experience:
1) We are surrounded by distributed systems all the time. When we buy and sell B2B software, we don't know what's stored in our partners databases, they don't know what's in ours. Who should ask whom, and when? If the data sources disagree, whose is correct? Just being given access to a REST API and a couple of webhooks is all you need to be in full distributed systems land.
2) I honestly do not know of a better approach than event-sourcing (i.e. replicated state machine) to coordinate among multiple masters like this. The only technique I can think of that comes close is Paxos - which does not depend on events. But then the first thing I would do if I only had Paxos, would be to use it to bootstrap some kind of event system on top of it.
Even the non-event-sourcing technologies like DBs use events (journals, write-ahead-logs, sstables, etc.) in their own implementation. (However that does not imply that you're getting events 'for free' by using these systems.)
My co-workers do not put any alternatives forward. Reading a database, deciding what action to do, and then carrying out said action is basically the working definition of a race-condition. Bankers and accountants had this figured out thousands of years ago: a bank can't send a wagon across the country with queries like "How much money is in Joe's account?" wait a week for the reply, and then send a second wagon saying "Update Joe's account so it has $36.43 in it now". It's laughable. But now that we have 50-150ms latencies, we feel comfortable doing GETs and POSTs (with a million times more traffic) and somehow think we're not going to get our numbers wrong.
Like, what's an alternative? I have a shiny billion-dollar fully-ACID SQL db with my customer accounts in them. And my SAAS partner bank also has that technology. Put forward literally any idea other than events that will let us coordinate their accounts such that they're not able to double-spend money, or are prevented from spending money if a node is down. I want an alternative to event sourcing.
- margorczynski 5 months ago
  
  Again - do not fixate on the ES thing as it was put forward only as an example. You're presenting a case when for the given scenario after analysis and weighting the alternatives this is the most optimal solution where I'm speaking about introducing unnecessary complexity just because the tech is cool and trendy.

sanity 5 months ago

The article makes great points about why distributed programming has stalled, but I think there's still room for innovation—especially in how we handle state consistency in decentralized systems.

In Freenet[1], we’ve been exploring a novel approach to consistency that avoids the usual trade-offs between strong consistency and availability. Instead of treating state as a single evolving object, we model updates as summarizable deltas—each a commutative monoid—allowing peers to merge state independently in any order while achieving eventual consistency.

This eliminates the need for heavyweight consensus protocols while still ensuring nodes converge on a consistent view of the data. More details here: https://freenet.org/news/summary-delta-sync/

Would love to hear thoughts from others working on similar problems!

[1] https://freenet.org/

Karrot_Kream 5 months ago

Haven't read the post yet (I should, I have been vaguely following y'all along but obviously not close enough!) How is this different from delta-based CRDTs? I've built (admittedly toy) CRDTs as DAGs that ship deltas using lattice operations and it's really not that hard to have it work. There's already CRDT based distributed stores out there. How is this any different?
- sanity 5 months ago
  
  Good question! Freenet is a decentralized key-value store, but unlike traditional KV stores, the keys are WebAssembly (WASM) contracts. These contracts define not just what values (i.e., data or state) are valid for that key but also when and how the value can be mutated. They also specify how to efficiently synchronize the value across peers using summaries and deltas.
  Each contract determines how state changes are validated, summarized, and merged, meaning you can efficiently implement almost any CRDT mechanism in WASM on top of Freenet. Another key difference is that Freenet is an observable KV store, allowing you to subscribe to values and receive immediate updates when they change.
  
  Karrot_Kream 5 months ago
  
  That's really cool, now I have to read this post. Thanks!
  
  sanity 5 months ago
  
  Thanks!
  That article just scratches the surface, if you'd like a good overview of the entire system check out this talk: https://freenet.org/news/building-apps-video-talk/

herval 5 months ago

Throwing in my two cents on the LLM impact - I've been seeing an increasing number of systems where core part of the functionality is either LLMs or LLM-generated code (sometimes on the fly, sometimes cached for reuse). If you think distributed systems were difficult before, try to imagine a system where the code being executed _isn't even debuggable or repeatable_.

It feels like we're racing towards a level of complexity in software that's just impossible for humans to grasp.

klysm 5 months ago

That's okay though! We can just make LLMs grasp it!
- herval 5 months ago
  
  ironically or not, the best way to have LLMs be effective at writing valid code is when they work on microservices. Since the scope is smaller and the boundary is clear, tools like Cursor/Windsurf seem to make very few mistakes (compared to pointing them at your monorepo, where they usually end up completely wrong)
  
  klysm 4 months ago
  
  Is it then up to the human to specify the services and how they interact?
- moffkalast 5 months ago
  
  "There's always a larger model."

synergy20 5 months ago

I saw comments about embedded development, which I have been doing that for a long time, just want to make a point here: the pay has upper limits, you will be paid fine but will reach the pay limit very fast, and it will stay there for the rest of your career. they can swap someone in with that price tag to do whatever you are working on, because, after all, embedded devel is not rocket science.

cmrdporcupine 5 months ago

The problem with embedded is its proximity to EE which is frankly underpaid.
But it's also more that the "other" kind of SWE work -- "backend" etc is frankly overpaid because of the copious quantities of $$ dumped into it by VC and ad money.

sn9 5 months ago

It sounds like they're looking for the Rust project Hydro [0].

EDIT: Lol nvm the author is one of the authors of Hydro [1].

[0] https://github.com/hydro-project/hydro

[1] https://hydro.run/people

seivan 5 months ago

[dead]

jderick 5 months ago

Distributed systems are hard. I like the idea of "semantic locality." I think it can be achieved to some degree via abstraction. The code that runs across many machines does a lot of stuff but only a small fraction of that is actually involved in coordination. If you can abstract away those details you should end up with a much simpler protocol that can be modeled in a succinct way. Then you can verify your protocol much more easily. Formal methods have used tools such as spin (promela) or guarded commands (murphi) for modeling these kinds of systems. I'm sure you could do something similar with the lean theorem prover. The tricky part is mapping back and forth between your abstract system and the real one. Perhaps LLMs could help here.

I work on hardware and concurrency is a constant problem even at that low level. We use model checking tools which can help.

MisterTea 5 months ago

This has been an issue for quite some time. Rob Pike wrote about it going back - https://doc.cat-v.org/bell_labs/utah2000/

cmrdporcupine 5 months ago

Two things:

Distributed systems are difficult to reason about.

Computer hardware today is very powerful.

There is a yo-yo process in our industry over the last 50 years between centralization and distribution. We necessarily distribute when we hit the limits of what centralization can accomplish because in general centralization is easier to reason about.

When we hit those junctures, there's a flush of effort into distributed systems. The last major example of this I can think of was the 2000-2010 period, when MapReduce, "NoSQL" databases, Google's massive arrays of supposedly identical commodity grey boxes (not the case anymore), the High Scalability blog, etc. were the flavour of the time.

But then, frankly, mass adoption of SSDs, much more powerful computers, etc. made a lot of those things less necessary. The stuff that most people are doing doesn't require a high level of distributed systems sophistication.

Distributed systems are an interesting intellectual puzzle. But they should be a means to an end not an end in themselves.

tonyarkles 5 months ago

> But then, frankly, mass adoption of SSDs, much more powerful computers, etc. made a lot of those things less necessary. The stuff that most people are doing doesn't require a high level of distributed systems sophistication.
I did my MSc in Distributed Systems and it was always funny (to me) to ask a super simple question when someone was presenting distributed system performance metrics that they'd captured to compare how a system scaled across multiple systems: how long does it take your laptop to process the same dataset? No one ever seemed to have that data.
And then the (in)famous COST paper came out and validated the question I'd been asking for years: https://www.usenix.org/system/files/conference/hotos15/hotos...
- cmrdporcupine 5 months ago
  
  “You can have a second computer once you’ve shown you know how to use the first one.” –Paul Barham
  Wow I love that.
  Many people in our profession didn't seem to really notice when the number of IOPS on predominant storage media went from under 200 to well over 100,000 in a matter of just a few years.
  I remember evaluating and using clusters of stuff like Cassandra back in the late 00s because it just wasn't possible to push enough data to disk to keep up with traffic on a single machine. It's such an insanely different scenario now.
  
  tonyarkles 5 months ago
  
  My not-super-humble opinion is that people didn’t notice because SSDs became mainstream/cheap around the same time cloud migration got popular. Lots of VPS providers offer pretty mediocre IOPS and disk bandwidth on the lower tiers; I’d argue disproportionately so. A $300 desktop from Costco with 8GB of RAM and a 500GB SSD is going to kick the crap out of most 8GB RAM VPSes for IO performance. So… right when rack mounted servers could affordably provide insane amounts of IO performance, we all quit buying rack mount servers and didn’t notice how much worse off we are with VPSes.
spratzt 5 months ago

I would go even further and argue that vast majority of businesses will never need to think about distributed systems. Modern hardware makes them irrelevant to all but the most niche of applications.
- th0ma5 5 months ago
  
  I had a longer comment elsewhere but to me this says that the distribution is happening somewhere and what you're also saying is that companies have to decide how much they want or care to control it.
  
  spratzt 5 months ago
  
  No. The issue is whether you NEED to not whether you want to.
  10 to 15 years ago, you could argue, however implausibly, that hardware constraints meant vertical scaling was impossible, and you were forced to adopt a distributed architecture. Subsequent improvement in hardware performance, means that in 2025, vertical scaling is perfect acceptable in nearly all areas, relegating distributed architecture to the most niche and marginal applications. The type of applications that the vast majority of businesses will never encounter.
  
  th0ma5 4 months ago
  
  There is essentially no tooling for this and vendors all default to distributed patterns. Either you directly control the scaling or you're relinquishing it.
porridgeraisin 5 months ago

I think the reason that distributed systems still are the go-to choice for many software teams is to do with people/career expectations/careers orienting themselves around distributed systems over the time period you mentioned. It will take a while for it to re-orient, and then distributed systems might become a fad again ;) An example of this is typical promotion incentives being easier to get in microservice teams, thereby incentivising people to organize the team/architecture in that way.
- cmrdporcupine 5 months ago
  
  Honestly, I am more cynical and just think people are always looking for ways to make their jobs more interesting than they actually are.
  
  EtCepeyd 5 months ago
  
  FWIW, at least one other comment seems to correlate job complexity with job security: https://news.ycombinator.com/item?id=43197623
  
  cmrdporcupine 5 months ago
  
  I work in embedded and it's absolutely not "less complex"
  
  davemp 5 months ago
  
  For sure. Maybe less accidental complexity. Embedded has been doing asynchronous IO since before it was cool. You also have distributed computing if you’re doing something like sensor networks.
  
  Nevermark 5 months ago
  
  > Honestly, I am more cynical and just think people are always looking for ways to make their jobs more interesting than they actually are.
  When you frame the problem that way, unnecessary complexity seems like part of a healthy solution path. /h
  Companies get reliability benefits from slack, but creative people abhor wasted slack. Some basic business strategy/wisdom for maintaining/managing creative slack is needed.

Karrot_Kream 5 months ago

When I was graduating from my Masters (failed PhD :) this overview of various programming models is generally how I thought of things.

I've been writing distributed code now in industry for a long time and in practice, having worked at a some pretty high-scale tech companies over the years, most shops tend to favor static-location style models. As the post states, it's due largely to control and performance. Scaling external-distribution systems has been difficult everywhere I've seen it tried and usually ends up creating a few knowledgeable owners of a system with high bus-factor. Scaling tends to work fine until it doesn't and these discontinuous, sharp edges are very very painful as they're hard to predict and allocate resourcing for.

Are external-distribution systems dead ends then? Even if they can achieve high theoretical performance, operation of these systems tends to be very difficult. Another problem I find with external-distribution systems is that there's a lot of hidden complexity in just connecting, reading, and writing to them. So you want to talk to a distributed relational DB, okay, but are you using a threaded concurrency model or an async concurrency model? You probably want a connection pool so that TCP HOL blocking doesn't tank your throughput. But if you're using threads, how do you map your threads to the connections in the pool? The pool itself represents a bottleneck as well. How do you monitor the status of this pool? Tools like Istio strive to standardize this a little bit but fundamentally we're working with 3 domains here just to write to the external-distribution system itself: the runtime/language's concurrency model, the underlying RPC stack, and the ingress point for the external-distribution system.

Does anyone have strong stories of scaling an external-distribution system that worked well? I'd be very curious. I agree that progress here has stalled significantly. But I find myself designing big distributed architecture after big distributed architecture continuing to use my deep experience of architecting these systems to build static-location systems because if I'm already dealing with scaling pains and cross-domain concerns, I may as well rip off the band-aid and be explicit about crossing execution domains.

kodablah 5 months ago

> Just like the external-distribution model, arbitrary-location architectures often come with a performance cost. Durable execution systems typically snapshot their state to a persistent store between every step.

This is not true by most definitions of "snapshot". Most (all?) durable execution systems use event sourcing and therefore it's effectively an immutable event log. And it's only events that have external side effects enough to rebuild the state, not all state. While technically this is not free, it's much more optimal than the traditional definition of capturing and storing a "snapshot".

> But this simplicity comes at a significant cost: control. By letting the runtime decide how the code is distributed [...] we don’t want to give up: Explicit control over placement of logic on machines, with the ability to perform local, atomic computations

Not all durable execution systems require you to give this up completely. Temporal (disclaimer: my employer) allows grouping of logical work by task queue which many users use to pick locations of work, even so far as a task queue per physical resource which is very common for those wanting that explicit control. Also there are primitives for executing short, local operations within workflows assuming that's what is meant there.

nchammas 5 months ago

There is an old project out of Berkeley called BOOM [1] that developed a language for distributed programming called Bloom [2].

I don't know enough about it to map it to the author's distributed programming paradigms, but the Bloom features page [3] is interesting:

> disorderly programming: Traditional languages like Java and C are based on the von Neumann model, where a program counter steps through individual instructions in order. Distributed systems don’t work like that. Much of the pain in traditional distributed programming comes from this mismatch: programmers are expected to bridge from an ordered programming model into a disordered reality that executes their code. Bloom was designed to match–and exploit–the disorderly reality of distributed systems. Bloom programmers write programs made up of unordered collections of statements, and are given constructs to impose order when needed.

[1]: https://boom.cs.berkeley.edu

[2]: http://bloom-lang.net/index.html

[3]: http://bloom-lang.net/features/

jmhucb 5 months ago

Good pattern matching. Bloom is a predecessor project to the OP's PhD thesis work :-) This area takes time and many good ideas to mature, but as the post hints, progress is being made.

th0ma5 5 months ago

Since multicore processing a ton of software you use or create is distributed you have to ask if you want to be in control of how it is distributed or not. If you want it easy and let the library figure it out then you have to accept the topological ideas it has. For instance H2O is a great machine learning package that even has its own transparent multi core processing. If you want to go across machines it has its own cluster built in. You can also install it into Hadoop, Spark, etc but once you start going that direction you're more and more on the hook for what that means and if it even is more effective for your problem and what your distributed strategy should be.

Things like re-entrant idempotence, software transactional memory, copy on write, CRDTs etc are going to have waste and overhead but can vastly simplify conceptually the ongoing development and maintenance of even non-distributed efforts in my opinion, and we keep having the room to eat the overhead.

There's a ton of bias against this for good reasons that the non distributed concepts still just work without any hassle but we'd be less in the mud in a fundamental way of we learned to let go of non-eventual consistency.

gregw2 5 months ago

What I noticed missing in this analysis of distributed systems programming was a recognition/discussion of how distributed databases (or datalakes) decoupling storage from compute have changed the art of the possible.

In the old days of databases, if you put all your data in one place, you could scale up (SMP) but scaling out (MPP) really was challenging. Nowdays, you (iceberg), or a DB vendor (Snowflake, Databricks, BigQuery, even BigTable, etc), put all your data on S3/GCS/ADLS and you can scale out compute to read traffic as much as you want (as long as you accept something like a snapshot isolation read level and traffic is largely read-only or writes are distributed across your tables and not all to one big table.)

You can now share data across your different compute nodes or applications/systems by managing permissions pointers managed via a cloud metadata/catalog service. You can get microservice databases without each having completely separate datastores in a way.

hintymad 5 months ago

This reminds me of Rob Pike's article "Systems Software Research is Irrelevant," written about 15 years ago. Perhaps many systems have matured to a point where any improvement appears incremental to engineers, so the conviction to develop a new programming model isn't strong enough. Or perhaps we're in a temporary plateau, and a groundbreaking tool will emerge in a few years.

Regarding Laddad's point, building tools native to distributed systems programming might be intrinsically difficult. It's not for lack of trying. We've invented numerous algebras, calculi, programming models, and experimental programming languages over the past decades, yet somehow none has really taken off. If anything, I'd venture to assert that object storage, perhaps including Amazon DynamoDB, has changed the landscape of programming distributed systems. These two systems, which optimize for throughput and reliability, make programming distributed systems much easier. Want a queue system? Build on top of S3. Want a database? Focus on query engines and outsource storage to S3. Want a task queue? Just poll DDB tables. Want to exchange states en masse? Use S3. The list goes on.

Internally to S3, I think the biggest achievement is that S3 can use scalability to its advantage. Adding a new machine makes S3 cheaper, faster, and more reliable. Unfortunately, this involves multiple moving parts and is therefore difficult to abstract into a tool. Perhaps an arbitrarily scalable metadata service is what everyone could benefit from? Case in point, Meta's warm storage can scale to multiple exabytes with a flat namespace. Reading the paper, I realized that many designs in the warm storage are standard, and the real magic lies in its metadata management, which happens to be outsourced to Meta's ZippyDB. Meanwhile, open-source solutions often boast about their scalability, but in reality, all known ones have certain limits, usually no more than 100PBs or a few thousand nodes.

jbarham 5 months ago

> This reminds me of Rob Pike's article "Systems Software Research is Irrelevant," written about 15 years ago.
25 years ago: http://herpolhode.com/rob/utah2000.pdf (Time flies.)
- hintymad 5 months ago
  
  Wow, time flies. Thanks!

nyrikki 5 months ago

> Distributed SQL Engines

This is what I see holding some applications back.

The relational model is flexible and sufficient for many needs but the ACID model is responsible for much of the complexity in some more recent solutions.

While only usable for one-to-many relationships, the hierarchical model would significantly help in some of the common areas like financial transactions.

Think IBM IMS fastpath, and the related channel model.

But it seems every neo paradime either initially hampers itself, or grows to be constrained by Codd's normalization rules, which result in transitive closure a the cost of independence.

As we have examples like Ceph's radios, Kafka etc...if you view the hierarchical file path model as being intrinsic to that parent child relationship we could be distributed.

Perhaps materialized views could be leveraged to allow for SQL queries without turning the fast path into a distributed monolith.

SQL is a multi tool, and sometimes you just need to use a specific tool.

taeric 5 months ago

I'm not clear what the proposal here is? It specifically eschews tooling as a driver in the solution, but why? Wouldn't tooling be one of the most likely areas to get solid progress, as you could make tooling and point it at existing products.

Would be interesting to see comparisons to other domains. Surely you could look at things like water processing plants to see how they build and maintain massive structures that do coordinated work between parts of it? Power generation plants. Assembly factories. Do we not have good artifacts for how these things are designed and reasoned about?

mgraczyk 5 months ago

The author is missing information about LLMs. In the "Obligatory LLM Section" he focuses on distributed systems that use LLMs.

But almost all of the new innovation I'm familiar with in distributed systems is about training LLMs. I wouldn't say the programming techniques are "new" in the way this post is describing them, but the specifics are pretty different from building a database or data pipeline engine (less message oriented, more heavily pipelined, more low level programming, etc)

riku_iki 5 months ago

> But almost all of the new innovation I'm familiar with in distributed systems is about training LLMs
I think database space is still hot topic with many unsolved problems.

sriram_malhar 5 months ago

The X10 language captures the notion of location of computation as a language primitive ("place")

http://x10-lang.org

tayo42 5 months ago

I agreee it has stalled, I think for almost everyone what the author considers bandaid is practical enough that isn't a need for innovation. Distributed systems is more or less solved imo.

Agree with another commenter, observability tools do suck. I think that's true in general for software beyond a certain amount of complexity. Storing large amounts of data for observability is expensive.

ptmcc 5 months ago

I'm very impressed with the quality of some observability tools like Datadog, which do many good things either automatically or very easily. The usability is leaps and bounds ahead of things like New Relic or the manual instrumentation intensive open source tools. But yes, the costs are insane and require some diligence to keep from running too wild, like most SaaS products these days.
But ultimately we pay it because it gives us incredibly valuable insights and has saved us countless hours in incident response, debugging, and performance profiling. It's lowered my stress level significantly.

sakesun 5 months ago

I have been told and believed that we are getting very close to the ultimate answer throughout my career since CORBA/DCOM.

Just learn that there is another discontinued attempt https://serviceweaver.dev/

anonymousDan 5 months ago

This article is just word salad. In what way does Redis 'abstract distribution' for example?

anacrolix 5 months ago

I've been trying to explain this to people for 8 years. All of our existing languages side step the problem. Developers are writing distributed systems every day but seem oblivious to the fact their tools aren't helping at all.

hinkley 5 months ago

I am so appalled every time I ask a group of devs and get the same answer that I’ve just stopped asking. How many of you took a distributed programming class in college? And it turns out yet again that not only am I the only one, but that none of them recollect it even being in the course catalog.

For me it was a required elective (you must take at least one of these 2-3 classes). And I went to college while web browsers were being invented.

When Cloud this and Cloud that started every university should have added it to the program. What the fuck is going on with colleges?

shermantanktop 5 months ago

My .02 is that any topic sufficiently important shouldn't be left to colleges. You can force feed a complex topic to a bunch of undergrads, but they will forget 95% of it, and 5 years later they'll say "ohhh, I think I have a textbook on that in my parents' basement."
The reality is that most of this profession is learned on the job, and college acts as a filter at the start of the funnel. If someone is not capable of picking up the Paxos paper, then having had someone tell them about it 5 years ago when they had a hangover won't help.
- hinkley 5 months ago
  
  I am 100% convinced we could delete the compiler class from college curricula and replace it with distributed computing and the world would be a better place.
  
  bigmutant 5 months ago
  
  Def agree. Most people will never touch an Abstract Syntax Tree or even Expression Trees. Almost everyone working in back-end will use Cloud Services, will make mistakes based on assumptions of what they provide
  
  EtCepeyd 5 months ago
  
  I was studying for my MSc in CS some 25 years ago. Our curriculum included both automata/formal languages (multiple courses over multiple semesters) and parallel programming.
  The latter course (a) was built on a mathematical formalism that had been developed at the university proper and not used anywhere else, (b) used PVM: <https://www.netlib.org/pvm3/>, <https://en.wikipedia.org/wiki/Parallel_Virtual_Machine>, for labs.
  Since then, I've repeatedly felt that I've seriously benefited from my formal languages courses, while the same couldn't be said about my parallel programming studies. PVM is dead technology (I think it must have counted as "nearly dead" right when we were using it). And the only aspect I recall about the formal parallel stuff is that it resembles nothing that I've read or seen about distributed and/or concurrent programming ever since.
  A funny old memory regarding PVM. (This was a time when we used landlines with 56 kbit/s modems and pppd to dial in to university servers.) I bought a cheap second computer just so I could actually "distribute" PVM over a "cluster". For connecting both machines, I used linux's PLIP implementation. I didn't have money for two ethernet cards. IIRC, PLIP allowed for 40 kbyte/s transfers! <https://en.wikipedia.org/wiki/Parallel_Line_Internet_Protoco...>
  
  bigmutant 5 months ago
  
  Sure, I did the same, BS/MS with a focus on Compilers/Programming Languages. It's been personally gratifying to understand programming "end-to-end" and to solve some tricky problems, but 99% of folks aren't going to hit those problems. There are tons of people interacting with Cloud Services every day that aren't aware of the basic issues like:
  - Consistency models (can I really count on data being there? What do I have to do to make sure that stale reads/write conflicts don't occur?)
  - Transactions (this has really fallen off, especially in larger companies outside of BI/Analytics)
  - Causality (how can I handle write conflicts at the App Layer? Are there Data Structures ie CDTs that can help in certain cases?)
  Even basic things like "use system time/monotonic clocks to measure elapsed time instead of wall-clock time" aren't well known, I've personally corrected dozens of CRs for this. Yes this can be built in to libs, AI agents etc but it never seems to actually be, and I see the same issues repeated over-and-over. So something is missing at the education layer
klysm 5 months ago

Definitely a problem, but I think there's always a gap here. What are colleges optimizing for with computer science programs? I would wager there is an incentive problem at the core which is causing these gaps to occur.
daedrdev 5 months ago

Same situation here. It was one of like 6 options and I had to take 2 of them. I found that I learned a lot from the class, but I was literally one of 7 people taking it that semester in a massive university.
- hinkley 5 months ago
  
  If I could remember the three logic+set theory and the one DC computing classes and forgot the rest of my college career, I could still do my job. Not as well mind you, but I could do it. I would miss graph theory but survive.
  I could just about create an associates degree around that and my graduates would run circles around any code camp you could name.

lifeisstillgood 5 months ago

This is a massive coming issue - I am not sure “distributed” can be exactly replaced with “parallel processing” but it’s close

So to simplify, from 1985 to 2005 ish you could keep sequential software exactly the same and it just ran faster each new hardware generation. One CPU but transistors got smaller and (hand wavy, on chip ram, pipelining )

Then roughly around 2010 single CPUs just stopped magically doubling. You got more cores, but that meant parallel or distributed programming - your software that in 1995 served 100 people was the same serving 10,000 people in 2000. But in 2015 we needed new coding - we got NOSQL and map reduce and facebook data centres.

But the hardware kept growing

TSMC now has wafer scale chips with 900,000 cores - but my non parallel, on distributed code won’t run 1 million times faster - Amdahls law just won’t let me

So yeah - no one wants to buy new chips with a million cores because you aren’t going to get the speed ups - why buy an expensive data centre full of 100x cores if you can’t sell them at 100x usage.

rstuart4133 5 months ago

> Although static-location architectures offer developers the most low-level control over their system, in practice they are difficult to implement robustly without distributed systems expertise.

This is the understatement of the article. There are two insanely difficult things to get right in computers. One is cryptography, and other is distributed systems. I'd argue the latter is harder.

The reason simple enough to understand. In any program the programmer has to carry in his head every piece of state that is accessible at any given point, the invariants that apply to that state, and the code responsible for modifying that state while preserving the invariants. In sequential programs the code that can modify the shared state is restricted to inner loops and functions you call, and you have to verify every modification preserves the invariants. It's a lot. The hidden enemy is aliasing, and you'll find entire books written on the counter measures like immutable objects, function programming, and locks. Coordinating all this is so hard only a small percentage of the population can program large systems. I guess you are thinking "but of a lot of people here can do that". True, but we are a tiny percentage.

In distributed systems those blessed restrictions a single execution thread gives us on what code can access shared state goes out the window. Every line that could read or write the shared state has to be considered, whether its adjacent or not, whether you called it here or not. The state interactions explode in the same way interactions between qubits explode. Both explode beyond the capability of human minds to assemble them all in one place. You have to start forming theorems and formulating proofs.

That worst part is newbie programmers are not usually aware this explosion has taken place. That's why experienced software engineers give the following advice on threads: just don't. You don't have a feel for what will happen, your code will appear to work when you test it while being rabbit warren of disastrous bugs that will likely never be fixed. It's why Linux RCU author Paul McKenney is still not confident his code is correct, despite being one of the greatest concurrent programming minds on the planet. It's why Paxos is hard to understand despite being relatively simple.

Expecting an above average programmer to work on a distributed system and not introduce bugs without leaning on one of one of the "but it is inefficient" tools he lists is an impossible dream. A merely experienced average has no hope. It's hard. Only a tiny, tiny fraction of the programmers on the planet can pull it off kind of hard.

ConanRus 5 months ago

No Erlang mention? Sad.

cruelmathlord 5 months ago

it has been a while since I've seen innovation in this arena. My guess is that other domains of programming have eaten its lunch

thway15269037 5 months ago

Oh god, even this article has AI and LLM section in it. When I thought distributed system design could not get any worse, someone actually pitched AI slop in it.

God I want to dig a cave and live in it.

Nevermark 5 months ago

> When I thought distributed system design could not get any worse, someone actually pitched AI slop in it.
I am not sure that pointing out that today's models are going to be MUCH worse at reasoning about distributed code than serial code is "pitching".
Conversely, pointing out that the reason they are so bad at distributed is the lack of related information locality, the same problem humans often have, puts a reasonable second underline on the value of more locality in our development artifacts.

hackburg 5 months ago

[dead]

samonurofil 5 months ago

[dead]

lincpa 4 months ago

[dead]