In my 3 years old laptop, system memory (dual channel DDR4-3200) delivers about 50 GB / second. You have measured 50M elements per second, with 8 bytes per element translates to 400 MB / second. If your hardware is similar, your implementation did less than 1% of theoretical bottleneck.
When your data is indeed big and the performance actually matters, consider doing something completely different.
In the real world, your application is running in a container among hundreds or thousands of other containers. The system's resources are also probably being managed by a hypervisor. The CPU is shared among N tenants _and_ is overcommitted. It's not that much to ask to optimize where you can, when it isn't unreasonably difficult.
More importantly, this attitude is precisely why software sucks today. "[CPU, Memory, Disk] is cheap, engineering time isn't." Fuck that. Bad engineering time is _incredibly_ expensive. This is an excuse to not spend time learning the ins and outs of your language, and your hardware.
It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech, which is baffling considering how incredibly broad and deep the industry is. Instead, everyone clamors to learn the new Javascript abstraction that lets them get even further away from the reality of what the magic boxes are doing.
> It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech
This doesn't match my observations. In many fields, training is limited and hides the details. We train workers to repeat specific tasks and they excel at them. But they don't have conceptual understanding. Any explanation of why things are done the way they are is surface-level. You can see it when a procedure fails for some reason. They way to deal with is to 1) do basic checks 2) try again and if it still fails 3) delegate to someone else. Nobody is going to troubleshoot or optimize tasks unless that's their main job.
It happens in construction, in the kitchens, on the assembly lines, and in the offices. It happens because it gets the job done.
There are a lot of developers who learn how to do a task and never question whether the task could be automated, either away or to reduce errors. They just do the same mundane task four hours a week in perpetuity.
That’s the part that frustrates me. Your job is to automate things. So why aren’t you automating things?
You're right. My wording was flat-out wrong, and I apologize. A more accurate sentiment (IMO) would be "that this happens in tech is baffling, considering..."
Yes, but in most fields, conceptual understanding doesn't matter for most tasks.
That's the problem with software. A brick-layer doesn't need to understand structural stability, but in software every task is structure validation, and people expect brick-layers to be productive.
> A brick-layer doesn't need to understand structural stability
Maybe a “junior” brick layer doesn’t need to, as much as a junior engineer doesn’t need to understand the ins and outs of their language. But a senior brick layer, or an architect, needs to understand more of the details so that they can set out the plan for the junior.
Some apple-pickers think the point of picking apples is to prove how good you are at climbing trees. They'll never not be mad at pluckers who just pluck the nearest apple, and the people who reward them for the apples, and the world that doesn't understand the pluckers picked their apples wrong, and didn't even earn them, and they're not even real apples.
A farmer who understands the value of crop rotation is completely right to be upset when other farmers are monocropping and killing the soil, or when they squander the local aquifer. It's directly impacting the sustainability of their industry.
Somehow, the trees keep getting exponentially bigger, and yet the pies we’re getting are actually kinda just getting smaller and worse. Maybe just picking the nearest apples isn’t working out as well as they say it is.
Climbing a ladder takes more time, so in order to get the apples to market faster, the apple tree owners keep the pickers focused on only picking the apples within arm's reach.
The owners also disallow the use of ladders because the pool of candidates to hire remains bigger.
It's one thing if you know the optimal (or at least approaching it) solution, and deliberately use something else for other reasons. At least thought went into it.
I'm not upset at this simply for purity's sake; it directly impacts me in my day-to-day job. A hundred devs individually decide that good enough is good enough (deliberately or otherwise), and suddenly my infrastructure is dealing with traffic and query patterns it shouldn't have to, and then I have to explain what network bandwidth limits are, and why it's absurd that we hit them in the first place.
In general, what I'm railing against is the continued push towards simplification and abstraction, while not requiring the understanding of what lies beneath. The hyperscalers certainly aren't trying to foster that attitude in-house – someone has to build the abstractions in the first place, after all, and keep datacenters humming. Yet they're happily shoveling it out, because it's a win-win for them: fewer people have the skills necessary to compete (or even to simply not use their services), and more people are able to use their services.
It's like the contractor (or subcontractor) who built the house you live in. They don't have to pay your monthly heating or cooling bills, so they often do whatever is fastest and cheapest for them.
Who cares if they didn't insulate a wall correctly or put an inefficient system in? If they can get it by the inspector, they will do it. If they can save themselves $100, that is all that matters to them. Who cares if it adds $10 to each month's utility bills for the next 30 years? They got their money and are now long gone.
Most people don't care about their job, they just want to put in minimum viable effort, get paid and go home to their family. Typically they do this because putting more effort didn't bring them anything (not even some recognition of efforts) or even backfired. It's applicable to all jobs, not only in tech.
I get that. My comment has nothing to do with people's pride in their work. I was commenting that there's not enough user demand for better preforming software because users don't seem to mind the software hellscape we currently live in.
Agree with the sentiment. However it's hard to stay curious, even harder to stay up-to-date.
I liked fiddling with storage for a while, got really into it, deepened my knowledge about it. A couple years later I realized everything else (networking, architectures, languages) developed so much, mot of my (non-basic) knowledge was obsolete. Picking up where I left off with all technologies is incredibly hard and caused fatigue.
Now I'm at a point where I have the feeling I don't know nothing about anything. It's factually not true, but my gut feeling tells this. Would I be younger, this would trigger a lot of anxiety. Thankfully I can janfle this by now.
That's understandable. I'm into databases (both professionally and personally), so I get to touch a wide variety of things. Following a query the whole way is pretty neat. Typed query --> client --> wire protocol --> network --> server --> DB frontend --> DB planner --> DB storage engine --> OS --> Memory / Disk. `blktrace` is super interesting to watch commands being sent to the disk.
When you are deeply involved in tech both personally and professionally you are probably very passionate about this and makes sense that you'd only look closely at this field and think "people are so deeply incurious. This seems to only happen in tech".
Tech is also one of the (if not the) most dynamic and fast evolving field a normal person will ever touch. Curiosity in tech can drain every single bit of free time and energy you have and you will hardly keep up with the progress, maybe barely scratch the surface. But people's available free time and energy wanes and curiosity is a collateral victim.
I've painfully gone through the entire cycle of this, including the bit of resurgence later on when you have a combination of free time but less energy. What I can say is that this absolutely does not happen just in tech. If anything tech is flooded with people with more curiosity than almost any other field.
> When you are deeply involved in tech both personally and professionally you are probably very passionate about this and makes sense that you'd only look closely at this field and think "people are so deeply incurious. This seems to only happen in tech".
Good point. I commented in a sibling post to the same effect.
I've certainly felt the personal strain of time sink and procrastination in my homelab. It's currently running k3os, which has been dead for about four years now, because everything I want is still running, and I never seem have the motivation on the weekend to yell at my laptop when I could be playing board games.
> including the bit of resurgence later on when you have a combination of free time but less energy.
I'm guessing that will happen in another decade or so, when my kids are grown.
Life is very different for many people and I think we just need to build empathy for people who treat a job as just a job. If they deliver and are not unkind about it there's nothing wrong about not going above and beyond the bare minimum, which is what they are paid for.
As I commented above, a large part of my umbrage stems from the impact these decisions have on my job. I dislike having to work harder because others didn't want to go above the bare minimum.
This isn't unique to any one company, nor my personal experience. At my last job, my team was initially human triaging practically all alerts, because we had the rare ability to read logs. I spoke to someone at Slack once who was stuck doing the same. That's an absurdly low expected level of competence.
You don't _have_ to work harder, as evidenced by those people who do the bare minimum. You just care about your work more than people who pay you for it (or the manager hired to manage you), which is the cause of your frustration here IMO.
I care about my teammates. Letting them down by not working to my full potential makes me feel bad. I think this is embedded into my pysche from years of military experience and then shift work.
> people are so deeply incurious. This seems to only happen in tech
It happens everywhere. If anything, techies are more curious than the average Joe. How many fellow programmers can you nerd-snipe with a comment that makes them say, “Well, that’s strange…”
I can somewhat echo some of the statements here and provide my own experience that is similar.
I spend a decent amount of time writing decent C++ code. My friends in other parts of the industry are writing certainly better C++ code than me because they are doing it in environments that are more constricted in various ways. In either case, I do spend my time catching up a bit and would consider myself a competent C++21 programmer in some ways.
My experience and my conversations lead me to understand there is so much left on the table with even the most basic implementations. When I implement it correctly in C++ we get close to some of the theoretical limits for the hardware for some workloads, compared to something that is literally 1% as fast running in NodeJS.
Wit that said, for so many situations I cannot justify the time and complexity to use C++ for many projects. At least for the stage most projects are in. In theory this optimization can happen later, but it never really does because the horizontal (or sometimes even vertical) scaling kicks in and we're all just willing to throw a few more dollars at the problem instead of re-engineering it. Sure, some of the really big companies like Netflix find a decent reason from time to time to invest the engineering time squeeze out those numbers, but it's becoming the exception and not the rule.
I've also found that not having some optimization mindset from the get go limits your product to achieve only a local maxima of performance in the end. It might not even be a good local maxima.
It's best to have at least some optimization considerations from the start. I'm saying some because too much is a problem.
> In the real world, your application is running in a container among hundreds or thousands of other containers
I mean, that’s an engineering decision too. In my day job we’re capturing, pre-processing, running inference on, and post-processing about 500Mpx/s worth of live image data at about 80ms/frame end-to-end at the edge. The processor SoM costs about $3000/unit and uses about 50W running flat out. The retail cost of our overall product is two orders of magnitude more than what the processor is worth but it incurs zero recurring costs for us.
Edit: and it’s got 64GB of Unified RAM that I’ve got all to myself :)
I was wondering if someone from a different sub-industry would disagree here :D
That sounds like a very interesting job, with quite different requirements and constraints from what I'm used to. One day, I'll get a job where application latency is critical, and optimizations matter deeply. Undoubtedly I'll find something else to be upset about, but at least it'll be a new complaint.
It’s hard to learn the ins and outs of dozens of programming languages. One doesn’t usually just use one or two PL over their entire career. I have worked professionally with at least PHP, Java, Python, JS, Go, and Ruby. That without taking into account the respective libraries and frameworks (and without taking into account as well the myriad of other components like dbs, web servers, caches, etc.)
It sounds like an excuse, I know. The true is I just don’t have that much time.
> In the real world, your application is running in a container among hundreds or thousands of other containers. The system's resources are also probably being managed by a hypervisor. The CPU is shared among N tenants _and_ is overcommitted. It's not that much to ask to optimize where you can, when it isn't unreasonably difficult.
And the real irony is that those programmers not giving a crap about performances and justifying it are resting on the shoulders of giants who went out of their way to have these kernels, hypervisors, containers, passthroughs, memory usage, etc. be as performant as they can.
Same for security.
Then you get the "I'm not in the business of writing fast code" and "I'm not a security company, I'm selling X and Y".
But they all, on a daily basis, use actual proper software written by people who know about these things.
In the real world, your application is running in a container among hundreds or thousands of other containers. The system's resources are also probably being managed by a hypervisor. The CPU is shared among N tenants _and_ is overcommitted.
When I read this, I thought the rest of your post would go entirely differently. As in, I immediately agreed with you, only for you to "turn this 180" (according to how I think about this at least :) )
It's not that much to ask to optimize where you can, when it isn't unreasonably difficult.
You take the above to mean we should optimize such that L2 cache is used as per the post as much as possible. Optimize the wazoo out of things.
But how does that even help in any way, when the CPU this runs on is like you said shared among N tenants and your carefully optimized L2 cache access is still going to be miss because another tenant got "your" CPU in between?
If you're paying for bare metal i.e. have the whole instance for yourself, by all means, if your domain actually requires you to use the system in such a way (things like high frequency trading come to mind), then optimize like that!
If you're running on seventeen levels of hypervisors that destroy any careful optimization in a random fashion anyway, then what's the point even? (non rhetorical question!)
It's a version of the tragedy of the commons. Yes, your software bloat isn't the main factor keeping you slow, and cleaning it up might be barely perceptible.
But... all the other software makes the same decision. Then, suddenly, everything on the computer is running at 10% efficiency.
In a complex, multitasking environment, keeping your code clean benefits other tasks as much or more than your own. It should be considered a responsibility.
I would agree with the sentiment but not necessarily the conclusion.
Like, don't use (the equivalent of) Stooge Sort (for your particular situation).
But unless you are in a very particular situation, it should be OK for everyone to "just use your language's built-in sort function" (hopefully that's a thing and you don't use something where you need to roll your own even in 2024) assuming that it uses a sane default like quicksort or merge sort that will work perfectly fine for most regular common situations.
Another example might be to not stupidly build slow algorithms saying "this won't see much data anyhow" (yes, I've seen that in PRs unfortunately) when a very simple hash lookup will ensure it's fast all the time. But it should be totally fine to assume that your language's hash table implementation is fine for common situations and you don't need to optimize anything unless you're a very special case.
> "[CPU, Memory, Disk] is cheap, engineering time isn't." Fuck that.
It is. It's absurdly cheap. I ensure I check the amount of time it would take for me to make a performance improvement against the runtime costs of my functions. It's rarely worth the extra effort.
Seriously, until you get into the millions of records per second level, you're almost never benefited. You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
> Bad engineering time is _incredibly_ expensive.
Engineering time is expensive. Period. It speaks to the need to minimize it.
> This is an excuse to not spend time learning the ins and outs of your language, and your hardware.
All of which will change in a few years, which is fine, if you're also committing to keeping _all that code_ up to date right along with it. Otherwise you end up with an obscure mess that you have to unwind 5 years of context to understand and fix again.
Complexity and available mental contexts are forgotten costs. If your language even has that many "ins and outs" to begin with you may want to reconsider that.
> You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
"You" is both singular and plural, which is often the problem with this thinking.
Is it worth spending a month of engineering time to make a page load in 50ms instead of 2s? Seems like a lot of engineering time for a noticeable but somewhat minor improvement.
But now, what if you have a million users who do this operation 100x/day? Absolutely worth it!
For example, I sure wish atlassian would spend a tiny bit of effort into making jira faster. Even if it is 1 second per ticket, since I'm viewing 100+ tickets per day that adds up. And there's many hundreds of us at the company doing the same thing, it really adds up.
Nit:
50ms vs 2000ms is 40x speed increase, i.e. ~1.5 order of magnitude.
I still keep words of my database optimization lecturer who said that by his experience optimization below 1 OOM aren’t worth it and most „good ones” are 3+
> Absolutely worth it!
Long reaching assumption. Even the biggest companies have limited resources (even if vast). Would you rather improve load times by 2x (from 500ms to 250ms) or improve checkout reliability from 99% to 99.5%? And there is much more to consider on some levels (e.g. planning for thermal efficiency is fun).
> I still keep words of my database optimization lecturer who said that by his experience optimization below 1 OOM aren’t worth it and most „good ones” are 3+
Like everything, it depends. Is the query gating a lot of other things, especially things that can be done in parallel? Shaving 10 ms off might very well be meaningful. Is it a large OLAP query, and the owning team has SLAs that depend on it? Going from 60 --> 55 minutes might actually matter.
The two biggest performance-related issues with RDBMS that I deal with, aside from indexing choices, are over-selection (why on earth do ORMs default to SELECT * ?), and poor JOIN strategies. Admittedly the latter is often a result of poor schema design, but for example, the existence of semi/anti-joins seems to be uncommon knowledge.
If SLAs are involved then I’d argue it’s not about optimization but about business goals, which unsurprisingly take precedence.
But there is another case that is very similar: threshold passing (or how I like to call it - waterfalling). Small inefficiencies add up and at some point a small slowdowns reach critical mass and some significant system breaks everything else.
When system was designed by competent engineers huge optimizations aren’t easy, so it’s shaving couple millis here and couple millis there. But, as in the first case, I’d categorize it as a performance failure.
Yeah, so the pathological incentives to build bad products are just insane. It’s more evidence against the whole market-Darwinian hypothesis being good for the consumer. “Fitness” doesn’t equal (let alone define) product quality.
That said, most developers (I assume, by most I mean myself) never work with code where the work they do has any serious impact on performance; it's layers on top of code that should be fast, but the work I do is implementing business logic and screens, in which case readability vastly trumps performance.
I mean right now I'm writing a line of code that can be represented as a nested ternary or a function call with some more written-out if/elses. The ternary outperforms the function call 2:1 or more, but either one can be executed hundreds of times a second BUT will only be executed once or twice per page view. It's not worth the tradeoff, even if this line of code will be executed hundreds of millions of times overall.
> You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
I'm not talking about increased complexity, I'm talking about extremely basic things that take zero extra time, like using the correct data structure. For example, in Python:
In [8]: a = array("i", (x for x in range(1_000_000)))
...: l = [x for x in range(1_000_000)]
...: d = deque(l)
...: for x in (a, l, d):
...: print(f"{sys.getsizeof(x) / 2**20} MiB")
...:
3.902385711669922 MiB
8.057334899902344 MiB
7.868537902832031 MiB
Very similar structures, with very different memory requirements and access speeds. I can count on one hand with no fingers the number of times I've seen an array used.
Or knowing that `random.randint` is remarkably slow compared to `random.random()`, which can matter in a hot loop:
In [10]: %timeit math.floor(random.random() * 1_000_000)
31.9 ns ± 0.138 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [11]: %timeit random.randint(0, 1_000_000)
163 ns ± 0.653 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
> All of which will change in a few years, which is fine, if you're also committing to keeping _all that code_ up to date right along with it.
With the exception of list comprehension over large ranges slowing down from 3.11 --> now, I don't think there's been much in Python that's become dramatically worse such that you would need to refactor it later (I gather the Javascript community does this ritual every quarter or so). Anything being deprecated has years of warning.
> Very similar structures, with very different memory requirements and access speeds. I can count on one hand with no fingers the number of times I've seen an array used.
That is obvious when you actually check the access speed of arrays and find out it is about half that of lists on small integers (under 256), and worse on non-small integers. That is literally the opposite trade off of what you want in 99.99% of cases.
Deques are even less of a consideration, they’re unrolled linked lists so random access is impossible and iteration is slower, you use a deque when you need a deque (or at least a fifo), aka when you need to routinely manipulate the head of the collection.
It depends on your constraints. If you’re limited by RAM, arrays make a lot of sense for certain applications. If you need Python’s buffer protocol, again, they make a lot of sense.
As to deques, yes, they have specific uses, and being slightly smaller isn’t usually a selling point for them. My point was that I have seen many cases where an incorrect data structure was used, because a list or dict was “good enough.” And sure, they generally are, but if the language ships with other options, why wouldn’t you explore those?
This will need to happen 7.6 million times to save me 1 CPU second. On AWS lambda with 1GB of memory this will cost you a whopping: $0.0000166667.
The point is, you're not even wrong, but there are vanishingly few cases where it would actually matter to the bottom line in practice. You're taking an absolutist point of view to a discipline which thoroughly rejects it.
This is what I love about the cloud. It forces you to confront what your efforts are actually worth by placing a specific value on all of these commodities. In my experience they're often worth very little given that none of us have the scale of problems where this would show actual returns.
Reaching the scale where it shows actual returns isn't all that difficult. You need it to happen 7.6 million times to save 1 CPU second, but each CPU core can execute it nearly that many times every second.
Probably you don't leave it generating only random numbers all day, but suppose you do generate a good few, so that it's 1% of your total compute budget, and you have only a modest load, using on average four CPU cores at any given time. Then saving that amount of computation will have saved you something like $15/year in compute, recurring. Which isn't actually that bad a return for ten seconds worth of choosing the right function.
There are also a lot of even fairly small entities for which four cores is peanuts and they're running a hundred or a thousand at once, which quickly turns up the price.
And even the things with internet scale aren't all that rare. Suppose you're making a contribution to the mainline Linux kernel. It will run on billions of devices, possibly for decades. Even if it doesn't run very often, that's still a lot of cycles, and some of the kernel code does run very often. Likewise code in popular web browsers, javascript on popular websites or in popular libraries, etc.
You don't have to work for Google to make a contribution to zlib and that kind of stuff has the weight of the world on it.
Sure, but the cumulative effects of pervasive mediocre to bad decisions do add up. And it isn't just about cloud compute cost. Your own time is stolen by the slow ci jobs that you inevitably get stuck waiting for. For me, I prioritize my own personal
happiness in my work and this mindset taken too far makes me unhappy.
Yes but if you want to do things in a less obvious way you should be aware of the downsides, such as bias in your random numbers. Also making sure you watch out for off by one errors.
Stolen the number to show this off well from a bug report somewhere:
random_counter = Counter()
for i in range(10_000_000):
result = floor(random() * 6755399441055744) % 3
random_counter[result] += 1
print("floor method", random_counter.most_common(3))
randint_counter = Counter()
for i in range(10_000_000):
result = randint(0, 6755399441055743) % 3
randint_counter[result] += 1
print("randint method", randint_counter.most_common(3))
Python is 100% the wrong language to worry about this in. If your hot loops are in python and you care about performance, you should be rewriting them in another language.
Agreed; I used it partially because TFA used it to demonstrate ideas, and partially because I’m very familiar with it.
But you’re correct, of course. When I need something to go faster in Python, I write it in C. If it’s more than a small section, then a full rewrite is reasonable.
I was curious why randint was slower than random and found a good SO answer [0] (it's like chatgpt by humans for you youngsters out there), the gist of it is that `random()` calls a C function directly (which I presume goes straight to a syscall), whereas `randint` is implemented in Python and has a load of preconditions / defensive programming (which I presume is executed at every invocation, so seven potential branches etc it has to check every time).
Of course, if it's not in a tight loop and you need a whole integer between a range then it's the most developer-ergonomic way to get it. If you do need more performance but want to keep the ergonomics, writing your own random-int-between-two-numbers is fairly straightforward but it'll take some time.
Ok, but neither of these matter until you know they matter. Seriously. Like, yes, it's nice they exist and that they are available for when you want them, but I would generally advise people to use a list or random.randint, if only because I value their confidence with them over the 2x performance win, because most workloads are not simply just a single array or random number generator loop. And, to be clear, I work on performance professionally: most of my job is not making things as fast as possible, but considering the tradeoffs that go into writing that code. I understand your example as showing off an interesting performance story but in the real world most workloads are more complex than what can be solved with using a rare but drop-in API.
> until you get into the millions of records per second level, you're almost never benefited
Yeah, but the software landscape is very diverse.
On my job (CAM/CAE) I often handle data structures with gigabytes of data. Worse, unlike e.g. multimedia frameworks many algorithms operating on these numbers are global i.e. can’t be represented as a pipeline which splits data into independent chunks and processes them sequentially.
Making performance critical functions twice as fast might saves hours of time for a single end user in a single day.
> It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech,
I mean software engineering and computer science from the start is built on abstraction, it makes sense that it attracts people who find taking the abstraction as reality and ignoring the underlying "real" layers to be attractive.
(Also it's not "real" until you get down to... voltages? Electrons? I don't know... the whole thing about our whole endeavor being built of abstraction is there are so many layers. Nobody thinks everyone has to routinely go ALL THE WAY down, but I get your point that being more curious than many tend to be about a couple layers down is helpful, with how far down a couple layers is depending on the nature of your work. I'm not saying it's not helpful, I'm suggesting possibly why -- because we select for people who swim happily in abstractions, who love em even. And the success of all of CS and software engineering is based on the ability for it to often work. nobody really has to be curious about voltages when writing a python script to read and write from storage)
Nope. If a program is meant to run on the entire CPU, and if it's meant to use all the memory, then that's how it will run. This isn't a question of modernity or availability of tools or programmers' salary.
Or, to put it differently, using more or less memory / CPU isn't really an indication of anything about a program. Sometimes you mean to use all that's available, other times you mean to use as little as you can. There's no way to tell which one should be done w/o knowing your goals.
For example, the entire free memory on Linux is a free real estate for filesystem caching. So that memory is not wasted, if the system can help it. So, having your entire memory used (to support filesystem cache) wouldn't be really a performance problem (quite the contrary, it would look like the system is doing a good job at caching whatever reads your applications are making).
That's good food for thought. My hardware is quite old, but I agree that the numbers seem somewhat suspicious.
I believe that RAM can only be RMW'd in cache lines, so modifying just 8 bytes still requires 64 bytes to be transmitted. I'm assuming the 50 GB/s throughput is half-duplex, and 25 GB/s over 64 bytes is ~400 Melems/s, somewhat closer to my result.
I tried using non-temporal stores in the straightforward algorithm, but to my surprise, this led to a significant decrease of performance across all input lengths.
> When your data is indeed big and the performance actually matters, consider doing something completely different.
I'm not sure what you mean by this. Scaling across machines is just ignoring the problem. What do you mean by "something completely different"?
> I believe that RAM can only be RMW'd in cache lines, so modifying just 8 bytes still requires 64 bytes to be transmitted.
Ages ago I worked with several different memory controllers, and it depends on the memory controller, cache, and MMU configuration.
Plenty of systems do require cacheline updates, then modifying 8B requires reading one or two cachelines, updating them, and writing them out eventually.
Some caches track cacheline validity with a bit per byte. This enables a CPU to set a book out in memory without fetching the cacheline. The cache may then try to burst read that line from the memory controller, but if it doesn't get around to it before deciding to flush that cacheline, it may issue a single byte write to the controller. The controller can then issue a makes DRAM write to the memory, which will update only certain bytes in the DRAM column. However, this still takes about as long as sending the full cacheline but it offloads the read-modify.
Validity per byte is also useful to implement hit under miss.
I bet on newer, bigger systems tricks like this are less useful since the memory busses are faster and wider today.
Non-temporal stores will not help you if all you do are those accesses. The point of using them is that you don't want them pulled into the caches and pushing out everything else.
> What do you mean by "something completely different"?
I mean compute the exact thing you need to compute, optimizing for performance.
I’m pretty sure your current code is spending vast majority of time doing malloc / memcpy while moving input elements from the input array into per-bucket collections.
To aggregate the data the way you want, you do not have to do any of that. You only need a single number per bucket to compute the minimum. Updating that accumulator is very cheap as there’s no memory allocations involved: just load, compare, and store the updated number.
Moreover, that algorithm is easily to parallelize. Compute an array with these per-bucket minimums on different threads, wait for completion of all tasks, then aggregate multiple arrays computed by each thread into a single array with global minimums for each bucket. Then compute the final sum.
> In my 3 years old laptop, system memory (dual channel DDR4-3200) delivers about 50 GB / second.
That's almost certainly for (mostly) sequential access.
When you just want a couple bytes here and there, and access isn't pipelined and prefetch doesn't accelerate your use case, the real world bandwidth is going to be significantly less.
I don’t understand Rust but if the code is doing what’s written there, “simple multiplicative hash and perform a simple analysis on the buckets – say, compute the sum of minimums among buckets” it’s possible to improve substantially.
They don’t need to move these megabytes of elements between collections. I would split the input across a few CPU cores, compute per-bucket minimums in parallel on each core, when all completed aggregate minimums across all cores, then compute sum of the results.
Multiplicative hash and the final sum should be possible to vectorize on most computers. Updating per-bucket minimum is probably impossible to vectorize (technically AVX512 set has the required instructions but I’m not sure these are efficient) but there’re much fewer buckets than input numbers, which means arrays with per-bucket minimums are likely to fit in caches.
Isn't the point of the person you replied to that the article author wasn't able to eliminate latency because if they were they'd be constrained by throughput but they are not?
To/from what sort of devices could the GPU read or write at 1TB/sec, besides main memory?
The fastest consumer SSDs top out at several GB/sec (I guess with massive hardware RAID they could be faster, but not sure if they’d be 1TB/sec fast).
Even a network adapter that does 10 Gbit/sec is only recently becoming slightly more common for the average consumer. Not sure if any consumer adapters in the 10 or 1Tbit/sec range exist at all.
Consumer? Even in the "if-you-need-to-ask-what-it-costs-you-can't-afford-it" world of frontier HPC systems, were only getting teasers of 0.8 Tbit/sec NICs this year.
As you say, only the GPU, maybe RAM, and the NIC will be able to churn data at these speeds. There is a reason why Mellanox (Nvidia) has developed GPUDirect RDMA, so the GPU and NIC can talk directly to each other.
Main memory is an important thing to have fast, though. The faster (and lower-wallclock-latency) it is, the less time your system spends waiting around when it needs to swap things in and out of it. It's my understanding that programs that need to be fast (like many video games) take pains to preemptively load data into RAM from disk, and (when appropriate for the program) from main RAM into VRAM. If main RAM's transfer speed was equal to or greater than VRAM's, and it access latency was a small fraction of a frame render time, (presumably) some of that preloading complexity could go away.
> I guess with massive hardware RAID they could be faster...
This section of the comment is for folks who haven't been paying attention to how fast storage has gotten: It's nowhere near 1TB per second, but...
I have four 4TB SATA-attached Crucial MX500s set up in LVM2 RAID 0. This array is a bit faster than a 10gbit link. (That is, I get 1.5GByte/s transfer rate off of the thing.) Even a single non-garbage U.2-attached (or (barf) M.2-attached) device can saturate a 10Gbit link.
I'm reading the latest SSDs do 14 GB/s of sequential reading and/or 15 million IOPS; transfer speed wise that's close to the highest end DDR3 (2007) and the lowest end DDR4 (2014). SSDs are definitely not memory speed fast yet (definitely not for random access) but definitely getting closer.
> Not sure if any consumer adapters in the 10 or 1Tbit/sec range exist at all.
Further, what exactly is a consumer going to plug a 1Tbit/sec adapter into? Your little home ATT fiber connection isn’t going to come close to using that available bandwidth.
It actually almost never does. To see that you'd need to benchmark. It's pretty difficult get good utilization on GPU on either compute or memory bandwidth side. A lot of kernels irretrievably fuck up both. You need long, coalesced reads/writes, and judicious use of the memory hierarchy, or else everything gets very slow very quickly.
I mean they can only do that if you have hundreds of threads all making coalesced writes. Typical CPU workloads look nothing like that; if you pointer chase on the GPU you are going to get absurdly bad performance.
> Cache is seen as an optimization for small data: if it fits in L2, it’s going to be processed faster
Nobody worth their salt believes just this and nothing else.
Yes, if the data fits entirely into a given cache, that's a nice case that's easy to reason about. No matter what access pattern is applied to the data, it doesn't matter because it's in the cache.
Hopefully everyone working with caches understands that they provide a potential speedup when not everything fits into the cache, and that this depends on the pattern of access (mainly, does it exhibit "locality"). Moreover, this case is extremely important.
The article gives an example of exactly that: improving the locality of access.
If you don't know this, you don't know one of the first facts about caching.
There is something else to know about: you can't tell by size alone whether a given data set will fit into a cache. The problem is that caches are not always fully associative. In a set associative cache, a given block of data cannot be stored in any cache line: it is assigned to a small set of possible cache lines. Then within a set, the cache lines are dynamically allocated and tagged. A given bunch of working which appears to be just smaller than the cache might be arranged in such a poor way in memory that it doesn't map to all of the cache's sets. And so, it actually does not fit into the cache.
>Nobody worth their salt believes just this and nothing else.... and that this depends on the pattern of access (mainly, does it exhibit "locality").... If you don't know this, you don't know one of the first facts about caching.
Not necessarily specific to this issue, but I've found that surprisingly many people out there are not "worth their salt" in areas where you'd really expect them to be.
That's true, perhaps my wording is off. I believe that the devil in the details. Sure, knowing that better access patterns result in better performance is common. But the fact that the access pattern can be optimized when the problem is literally "access RAM at these random locations" is counterintuitive, IMO.
Not sure what "Myth" the author thinks they are exposing.
Jim Gray wrote "RAM is the new disk" at least in 2006, probably earlier, so 20 years ago. And people have been saying "RAM is the new tape" for quite some time now as well.
This was a great read, thanks. OP, readers might benefit from having it explicitly mentioned that while the pseudocode is in Python, actual Python code will likely not benefit from such an approach because of how memory is fragmented in the standard Python implementation - e.g. this discussion: https://stackoverflow.com/questions/49786441/python-get-proc...
I am tempted to actually test this (i.e. see if there's a speedup in Python), but I don't have the time right now.
That's not the problem at hand. Python is good for pseudocode. But not if you want to talk about cache misses because in the pseudocode written with higher level languages a lot of details on how memory is accessed are opaque.
Again, what would you suggest instead? If you can't guess that `list` is supposed to represent an array of consecutive elements, I have trouble thinking of a language that'll make that clear without being exceedingly verbose for no reason.
A bit later, in the article, you'll see that memory patterns in allocating the arrays have a role. A role that was hidden initially.
> Again, what would you suggest instead?
The answer is inside you. You have only to search for it.
Or, if you really want to extort me the obvious: any low level language (even not implementing every detail but calling imaginary functions whose name suggest what they are doing). This exercise will show you for instance that you'll have to immediately choose if to append_to_dynamic_array() or add_to_linked_list().
> This exercise will show you for instance that you'll have to immediately choose if to append_to_dynamic_array() or add_to_linked_list().
Linked lists, that's what you're worried about? `[...]` is not a linked list in Python, in fact I don't know any imperative language where it's something other than a dynamic array/vector. I can only assume someone who doesn't understand or intuit this is arguing in bad faith, especially when taking your attitude into account. What did I do to deserve being talked down to?
> any low level language
Like this?
std::vector<std::vector<element_t>> groups(n_groups);
for (auto&& element : elements) {
groups[element.group].push_back(std::move(element));
}
std::sort(elements.begin(), elements.end(), [](const auto& a, const auto& b) {
return a.group < b.group;
});
std::vector<std::vector<element_t>> groups;
for (auto group_elements : group_by(std::move(elements), [](const auto& element) {
return element.group;
})) {
groups.push_back(std::move(group_elements));
}
Is it worth it? If you don't know `list` is a dynamic array in Python, how will you know `std::vector` is a dynamic array in C++? Not to mention the syntax is terrible. C would be just as bad. Using Rust would get people angry about Rust evangelism. The only winning move is not to play.
// Create n empty arrays.
DynamicArray* groups[n_groups];
for (int i = 0; i < n_groups; i++) {
groups[i] = create_empty_array();
}
// For each element in elements[], append it to its group's array.
for (int i = 0; i < n_elements; i++) {
append_to_dynamic_array(groups[elements[i].group], elements[i].value);
}
antirez: You already got it wrong. You've got a double indirection, with `groups[i]` pointing to a heap-allocated instance of `DynamicArray`, which supposedly stores the actual heap pointer.
It's not about the language being low-level or high-level. It's about understanding the basics of memory layout. If the pseudocode being in Python is an obstacle for you, the problem is not in Python, but in your (lack of) intuition.
I wrote the first random semantic that came in mind, in the pseudocode in C above. The fact you believe it is not the right one is a proof that in C I can express the exact semantic, and in Python I can't (because everything is hidden inside how the Python interpreter can or can not implement a given thing).
So, modify my C code to match what you have in mind, if you wish. But the point is that low level code will specify in a much clearer way what's going on with memory, and in a piece of code that is about memory layout / access, that's a good idea. Otherwise I consider Python a good language for pseudocode. Just: not in this case.
Have a nice day, I'll stop replying since would be useless to continue. P.S. I didn't downvote you.
C is exactly at the right level. More high level and human readable than bare assembly, yet without the complexity and abstractions of Rust or C++ or opaqueness of Python.
C was suggested, but I find Go to be good too for this purpose, you get some of the low level memory access (stack vs heap stuff) but don't need to worry about memory allocation or whatnot. Great for showing the differences in algorithms.
It resembles sequential access memory with relatively fast seeks and a cache (ok, multiple layers of caches). Reading sequential, predictable addresses gives you much more throughput than random access, and reading a value that was recently accessed (or adjacent to such) is much lower latency than something that was not. There's further wrinkles in multicore systems as well, because then accesses to memory recently written to by another core can be slower again.
It resembles a hierarchy wherein a small fraction of memory - a "cache" region - can be accessed much faster than the rest. With careful planning, the programmer can increase the odds that the necessary information for the next step of an algorithm, at any given point, is within that small region. (This is a simplification; there are actually several layers of cache between a CPU and the most general part of the RAM.)
I can only conclude the word myth was chosen for attention. Modern CPU memory systems (with or without their caches) certainly resemble idealized RAMs.
It's a little surprising that this works at all, since the partitioning step in the radix sort is itself the same kind of sharding operation. But I guess that's because it allows for working on smaller pieces at a time that fit in whatever cache they need to.
> Python can't really reserve space for lists, but pretend `reserve` did that anyway.
FWIW, you can pre-fill a list with e.g. `None` values, and then replacing those values won't cause resizes or relocations (of the list's memory - the elements are still indirected). But of course you'd then need to keep track of the count of "real" elements yourself.
But of course, tricks like this are going to be counterproductive in Python anyway, because of all the indirection inherent in the system. They'll also get worse if you actually do have objects (unless perhaps they're statically, constantly sized, and in a language that can avoid indirection in that case) with a "group ID" attribute, rather than integers to which you can apply some hash function. Some test results, trying to approximate the Rust code in Python but with such objects:
It's important to understand the domain. For Pythonistas, simple really is better than complex. And identifying the bottlenecks and moving them to a faster language is also important (Numpy isn't successful by accident). The naive result here is still, if I'm understanding the graphs right, dozens of times worse than what any of the Rust code achieves.
(Edit: merely "several" times worse. I forgot that I told `timeit` to run 10 iterations over the same input, so each test processes ~10M elements.)
> It's a little surprising that this works at all, since the partitioning step in the radix sort is itself the same kind of sharding operation.
The key property here is the number of groups. When sharding data to n groups, only n locations have to be stored in cache -- the tails of the groups. In my radix sort implementation, this is just 256 locations, which works well with cache.
Windows starts doing awful things under slight memory pressure like 80% usage.
There are cascading failures like high ram usage -> swap+compression -> power+thermal limits -> desktop responsiveness is unreasonably bad (i.e. >5 minutes to open task manager to kill a single firefox.exe using >10G memory caused by a single misbehaving javascript among many tabs)
This is an incredibly common scenario for laptops that cost less than $1000.
Free memory means things will probably not go to shit in the next 10 minutes.
Linux is a whole different story with different sharp edges.
I find this article particularly hard to read. The author states:
> I’m using Python as pseudocode; pretend I used your favorite low-level language)
But then the code is full of Pythonic constructs alien to any other low-level language (and, I suspect, also to most casual Python users) that spoil all the fun. I found myself spending more time parsing the specific syntax than grokking the intent of the code.
I think using more explicit loops and less lambdas would have made the article more meaningful.
All modern CPUs have powerful hardware prefetchers, which are insufficiently documented.
On any modern CPU (i.e. from the last 20 years), explicit prefetching is very seldom useful.
What is important is to organize your memory accesses in such orders so that they will occur in one of the patterns that triggers the hardware prefetcher, which will then take care to provide the data on time.
The patterns recognized by the hardware prefetchers vary from CPU to CPU, but all of them include the accesses where the addresses are in arithmetic progressions, going either forwards or backwards, so usually all array operations will be accelerated automatically by the hardware prefetchers.
I agree that they can be powerful but I guess, as always, it doesn't apply in the general case. For example, AFAIK hw prefetcher runs at 32B or 64B per cycle which basically means that you're only going to see a benefit from it if the code you're running is CPU bound, no?
Radix sorting is great, as this article points out. But I followed the links and didn't see any mention of the most obvious optimization that makes it even better on modern systems.
If you do MSB for the first pass, then the subsequent passes within each bucket are completely independent :) So... Choose a large radix and do one MSB pass, then parallelize the LSB processing of each bucket. This is fantastic on a GPU. You might think that you would want many thousands of kernels running simultaneously, but in actuality each SM (and the Apple equivalent) has a thread-local cache, so you really only want a couple hundred simultaneous kernels at the most. I've written CUDA and Metal versions and they are fast. And you don't even have to be very good at GPU programming to get there. It's very basic. AoCP has the pseudocode.
I was expecting something about NUMA performance, temporal memory access instructions, shared versus global versus register memory on GPUs, SRAM, and so on. There's an article about all these things waiting to be written. This article is instead about memory-hierarchy-aware access pattern optimization, which is important, but not the whole story.
I sometimes get the feeling it would be better to just start treating RAM as secondary storage, with read/write access through a (hardware implemented) io_uring style API. Is there anything along those lines out there?
CXL RAM is something like this on the physical side. It's RAM over a PCIe connection, and PCIe is basically a very fast, serial, point to point network. However as far as I'm aware the "API" to software makes it looks just like regular, local RAM.
I'D love to understand the differences between the "devices" named A, Y, M in the performance measurement, referring to "(A, Y, M indicate different devices)"
Lately, "nulls should not be in your language" sounds like "reality should be simpler". Queue angry fist-shaking at the complexities of real life and a wistful longing for a logical, functional, mathematical world. In reality, sometimes out of 8 questions, you only know the answer to 7, and the missing answer could be any of the 8. So you can:
1. Prevent the user from doing anything until they go find that 8th answer, which cascades into your tool not helping anyone until they have a complete, validated, full, historical data set.
2. Subdivide those 8 questions into having independent existence, which cascades into your entire application being a giant key/value store, and all the difficulties of null are simply replaced with the difficulties of "this key/value pair does not exist yet".
3. Add a sentinel value for "I don't know this yet", while feeling good about yourself because that sentinel value is not technically null, while introducing all the same exact issues that null causes, plus a bunch of infrastructure you have to write yourself. Basically: reimplement null, but worse.
4. Make the answers nullable.
It'd be like claiming that an IDE should simply not allow syntax errors to be entered at all. Errors would be completely impossible! Except at some point your user needs to actually write the thing, and you've just abandoned the idea of helping them. So they write it in some other editor, and then paste it into your IDE. Or you embrace the fact that incomplete work is OK.
Yes, nulls are generally way over-used. But disallowing them entirely is a fool's errand. Source: I've been that fool.
In before: "You should just be using Maybe<> everywhere" -- "Maybe" is just another word for "nullable". The only difference is the level of support you get from the compiler / type-checker. So that argument is "the compiler should help you get null right", which I completely agree with! There's still work to be done getting type checkers better at null. But that's a far cry from "null should not be in your language" and equating the use of nulls to a doctor not washing his hands.
Of course nobody is seriously arguing about ditching the notion of a sentinel value that indicates absence of value. The question, rather, is how to best handle that case.
> In before: "You should just be using Maybe<> everywhere" -- "Maybe" is just another word for "nullable". The only difference is the level of support you get from the compiler / type-checker.
The difference is that in languages with null references/pointers, all references/pointers are implicitly nullable - that is, it is indeed like "using Maybe<> everywhere". But when you have proper option types, you use Maybe<> not everywhere, but only where you know nulls can actually occur.
> The difference is ... [whether] all references/pointers are implicitly nullable [or not]
Right so we're narrowing down the actual problem here, which revolves around implicit vs. explicit, compiler support vs. on-your-own, static vs. dynamic typing. I agree with you that I strongly prefer explicit over implicit nullability, assuming it fits within the idioms of the language. But this is a stronger condition than static typing, so to believe "implicit nulls" is as irresponsible as a doctor not washing their hands, you* have to believe that everyone using dynamic typing is even more irresponsible than that. There's a place for dynamically typed languages, there's a place for implicit nulls, there's a place for strict type- and null-checking, there's a place for mathematically proving code correct. Engineering is always about tradeoffs -- it's way too simplistic to just say "null bad".
* I know it wasn't literally you arguing that; I'm still partly addressing mrkeen here.
It really depends on your definition of static typing. The thing that always made implicit nullability such a mess is precisely that it breaks the subtype relationship - that is, given some (reference) type Foo, you can have a null value of that, but none of the operations that normally apply to Foo apply to that null Foo value! In other words, it's not actually of type Foo, at least not under any sane definition of what a "type" is. So it was always a hole in static typing, and experience with all the grief caused by this hole is why we're moving away from it now, and why even the person who originally introduced this approach in the first place believes it was a mistake.
But, yes, using dynamic typing would be even more irresponsible than using implicit nulls. And this is also something we've learned from experience - note how historically dynamically typed languages are acquiring some form of bolt-on statically checked typing (TypeScript, Python type annotations etc), and how quickly this is becoming the preferred way to write new code.
Completely agree with most of that, especially the problem that allowing nulls means that all of your values are now suspect.
Just picking up on the term "sentinel" here, which I believe refers to using an actual member of the set to represent something else - which I am totally against,
e.g.
int itemCount = countItemsIn() - countItemsOut();
The rationale goes: no-one would ever think -1 is a valid count of items, so we can use it as an error sentinel.
Then when countItemsIn() returns 1 and countItemsOut() fails, we deduce that itemCount is 2.
So I guess a sentinal is a kind of dual to a null (which is a non-member of a set pretending to be a member)
Anaesthetists also undergo many years of training prior to taking post. Typically they've got say 10 years training before they actually do the job, much of it specialised to this job in particular (and the rest in general or intensive care medicine)
If you're lucky a Software Engineer has a three year degree in CS, and probably only a semester at best was studying "Software Engineering" and even that might focus on something you don't care about, such as formal methods.
It is entirely possible that your junior engineers have never maintained a sizeable codebase for more than a few weeks, have never co-operated on software with more than a handful of other people, and have never used most of the common software libraries you use every day, regardless of whether these are in-house (so how could they) or widely available.
For example maybe you do lots of web apps fronting SQL Server databases in C#. Your new hire has six months of C#, they half-remember a course doing SQL on an Oracle database, and all their web knowledge is in Javascript. Do they know version control? Kinda. Have they used a test framework before? Well, they did in Java but never in C#.
The "All Your Tests Are Terrible" talk begins by pointing out that probably they're strictly wrong because you don't have any fucking tests. All of the rest of the talk is about the hideous tests that you're unhappy to find because you forget that somebody could just not have bothered to write any tests at all.
> All of the rest of the talk is about the hideous tests that you're unhappy to find because you forget that somebody could just not have bothered to write any tests at all.
At many times in my professional "career", I've found myself wishing that whoever wrote the test I'm staring at just hadn't bothered. "Tests that say and/or look like they test one thing, but test something entirely different." [0] and "Tests that actually test nothing at all and never fail when the code they claim to test is changed out from under them." are two of the many categories of tests I wish folks would just never have wasted their time writing.
[0] To be clear, I usually find tests of this type to claim to be testing something useful, but actually be testing something that's not worth testing.
Exemplar code in python, shows the benefit in rust. o---kaaaaay.
So the outcome is "for this compiled rust, around 1m records it gets better"
But you didn't actually prove a general case, you proved "for a good optimising rust compiler" didn't you?
Maybe I'm over-thinking it. Maybe this does just become simple, working-set-locality stuff.
I could take the lesson: for less than 10m things, on modern hardware with more than 4 cores and more than 4GB stop worrying and just code in the idiom which works for you.
Uh? The post very clearly says: "I’m using Python as pseudocode; pretend I used your favorite low-level language". The goal is to show what's possible when you do your best, of course I'm not going to use Python for anything beyond demonstrating ideas.
> But you didn't actually prove a general case, you proved "for a good optimising rust compiler" didn't you?
Again, that was never my goal. I chose Rust because I've stumbled upon this problem while working on a Rust library. I could've chosen C++, or C, or maybe even Go -- the result would've been the same, and I checked codegen to make sure of that.
> I could take the lesson: for less than 10m things, on modern hardware with more than 4 cores and more than 4GB stop worrying and just code in the idiom which works for you.
The number of cores and RAM capacity has nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway. If your main takeaway from "here's how to process large data" was "I don't need to worry about this for small data", well, I don't know what to say.
Large and Small are so contextual. I'm processing 350m events/day in 24h splits, and I managed to stop worrying about locality of reference because I'm the sole occupant of the machine. When I did worry about it, I found radix tree, awk hash and perl/python hash/dict pretty much all occupied much the same space and time but a tuned C implementation got 2-3x faster than any of them. Somebody else pointed out memory resident for most of this would be faster still but you have to then work to process 24h of things against a single memory instance. Which means buying into IPC to get the data "into" that memory.
It interested me you didn't show the idea in rust. That was the only point I was making: Python as pseudocode to think things in documents is fine with me.
But overall, I liked your outcome. I just think it's important to remember large and small are very contextual. Your large case looks to me to be >5m things and for an awful lot of people doing stupid things, 5m is bigger than they'll ever see. If the target was only people who routinely deal in hundreds of millions of things, then sure.
> The number of cores and RAM capacity has nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway.
What matters is locality since that allows prefetch to mask latency. If you have this, then you are in a good place even if your data does not fit in the L3 cache. What you did demonstrates the benefits that locality gives from the effect on prefetch. Fitting in L3 cache helps, but not as much as prefetch does. If you do not believe me, test a random access pattern on things in L3 cache vs a sequential access pattern. The sequential access pattern will win every time, because L3 cache is relatively slow and prefetch masks that latency.
I have seen options for disabling prefetch and cache as BIOS options (although I only remember the option for disabling cache in ancient systems). If you could get one, you could do some experiments to see which will matter more.
In my 3 years old laptop, system memory (dual channel DDR4-3200) delivers about 50 GB / second. You have measured 50M elements per second, with 8 bytes per element translates to 400 MB / second. If your hardware is similar, your implementation did less than 1% of theoretical bottleneck.
When your data is indeed big and the performance actually matters, consider doing something completely different.
In the real world, your application is running in a container among hundreds or thousands of other containers. The system's resources are also probably being managed by a hypervisor. The CPU is shared among N tenants _and_ is overcommitted. It's not that much to ask to optimize where you can, when it isn't unreasonably difficult.
More importantly, this attitude is precisely why software sucks today. "[CPU, Memory, Disk] is cheap, engineering time isn't." Fuck that. Bad engineering time is _incredibly_ expensive. This is an excuse to not spend time learning the ins and outs of your language, and your hardware.
It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech, which is baffling considering how incredibly broad and deep the industry is. Instead, everyone clamors to learn the new Javascript abstraction that lets them get even further away from the reality of what the magic boxes are doing.
> It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech
This doesn't match my observations. In many fields, training is limited and hides the details. We train workers to repeat specific tasks and they excel at them. But they don't have conceptual understanding. Any explanation of why things are done the way they are is surface-level. You can see it when a procedure fails for some reason. They way to deal with is to 1) do basic checks 2) try again and if it still fails 3) delegate to someone else. Nobody is going to troubleshoot or optimize tasks unless that's their main job.
It happens in construction, in the kitchens, on the assembly lines, and in the offices. It happens because it gets the job done.
There are a lot of developers who learn how to do a task and never question whether the task could be automated, either away or to reduce errors. They just do the same mundane task four hours a week in perpetuity.
That’s the part that frustrates me. Your job is to automate things. So why aren’t you automating things?
You're right. My wording was flat-out wrong, and I apologize. A more accurate sentiment (IMO) would be "that this happens in tech is baffling, considering..."
Yes, but in most fields, conceptual understanding doesn't matter for most tasks.
That's the problem with software. A brick-layer doesn't need to understand structural stability, but in software every task is structure validation, and people expect brick-layers to be productive.
> A brick-layer doesn't need to understand structural stability
Maybe a “junior” brick layer doesn’t need to, as much as a junior engineer doesn’t need to understand the ins and outs of their language. But a senior brick layer, or an architect, needs to understand more of the details so that they can set out the plan for the junior.
Eh... What language do brick layers work with? You mean their spoken language?
Also: "a senior brick layer, or an architect"
those are complete different professions.
Some apple-pickers think the point of picking apples is to prove how good you are at climbing trees. They'll never not be mad at pluckers who just pluck the nearest apple, and the people who reward them for the apples, and the world that doesn't understand the pluckers picked their apples wrong, and didn't even earn them, and they're not even real apples.
Maybe a better analogy would be farming.
A farmer who understands the value of crop rotation is completely right to be upset when other farmers are monocropping and killing the soil, or when they squander the local aquifer. It's directly impacting the sustainability of their industry.
Somehow, the trees keep getting exponentially bigger, and yet the pies we’re getting are actually kinda just getting smaller and worse. Maybe just picking the nearest apples isn’t working out as well as they say it is.
The part missing from this analogy is that the low-hanging fruit pickers are slowly killing the trees.
Why are fruit pickers hanging from trees at all?
Just get a panking pole already!
Climbing a ladder takes more time, so in order to get the apples to market faster, the apple tree owners keep the pickers focused on only picking the apples within arm's reach.
The owners also disallow the use of ladders because the pool of candidates to hire remains bigger.
And the highest 90% of apples remain unpicked.
Having recently been to an apple orchard...
- Apple trees are deliberately bred to be short and wide.
- An apple-picking-stick has a sort of basket on the end, which allows an apple to be selected and pulled off from 1-2 meters away.
What lessons can be learned?
- Using the right tools improves your yield, safely.
- Having the data in a convenient place means you can use it more readily.
It's one thing if you know the optimal (or at least approaching it) solution, and deliberately use something else for other reasons. At least thought went into it.
I'm not upset at this simply for purity's sake; it directly impacts me in my day-to-day job. A hundred devs individually decide that good enough is good enough (deliberately or otherwise), and suddenly my infrastructure is dealing with traffic and query patterns it shouldn't have to, and then I have to explain what network bandwidth limits are, and why it's absurd that we hit them in the first place.
In general, what I'm railing against is the continued push towards simplification and abstraction, while not requiring the understanding of what lies beneath. The hyperscalers certainly aren't trying to foster that attitude in-house – someone has to build the abstractions in the first place, after all, and keep datacenters humming. Yet they're happily shoveling it out, because it's a win-win for them: fewer people have the skills necessary to compete (or even to simply not use their services), and more people are able to use their services.
It's like the contractor (or subcontractor) who built the house you live in. They don't have to pay your monthly heating or cooling bills, so they often do whatever is fastest and cheapest for them.
Who cares if they didn't insulate a wall correctly or put an inefficient system in? If they can get it by the inspector, they will do it. If they can save themselves $100, that is all that matters to them. Who cares if it adds $10 to each month's utility bills for the next 30 years? They got their money and are now long gone.
To get a certain behaviour, incentivize that behaviour.
This is what the mythical Full Stack and DevOps movements were supposed to do. They did not.
> it directly impacts me in my day-to-day job
It directly impacts anyone that uses software everyday. Most people don't seem to understand and/or care how poorly the software they use runs.
Most people don't care about their job, they just want to put in minimum viable effort, get paid and go home to their family. Typically they do this because putting more effort didn't bring them anything (not even some recognition of efforts) or even backfired. It's applicable to all jobs, not only in tech.
I get that. My comment has nothing to do with people's pride in their work. I was commenting that there's not enough user demand for better preforming software because users don't seem to mind the software hellscape we currently live in.
You can always just throw money at the performance problem. Cloud is infinite scale, right? Right?!
Right :) Until enough misperforming software is in. Then you start hitting cloud limits and then people start asking: what? how?
So, I guess all I am saying is that you are not alone in these concerns.
Sir, your comment is poetry. I commend you.
Agree with the sentiment. However it's hard to stay curious, even harder to stay up-to-date.
I liked fiddling with storage for a while, got really into it, deepened my knowledge about it. A couple years later I realized everything else (networking, architectures, languages) developed so much, mot of my (non-basic) knowledge was obsolete. Picking up where I left off with all technologies is incredibly hard and caused fatigue.
Now I'm at a point where I have the feeling I don't know nothing about anything. It's factually not true, but my gut feeling tells this. Would I be younger, this would trigger a lot of anxiety. Thankfully I can janfle this by now.
That's understandable. I'm into databases (both professionally and personally), so I get to touch a wide variety of things. Following a query the whole way is pretty neat. Typed query --> client --> wire protocol --> network --> server --> DB frontend --> DB planner --> DB storage engine --> OS --> Memory / Disk. `blktrace` is super interesting to watch commands being sent to the disk.
When you are deeply involved in tech both personally and professionally you are probably very passionate about this and makes sense that you'd only look closely at this field and think "people are so deeply incurious. This seems to only happen in tech".
Tech is also one of the (if not the) most dynamic and fast evolving field a normal person will ever touch. Curiosity in tech can drain every single bit of free time and energy you have and you will hardly keep up with the progress, maybe barely scratch the surface. But people's available free time and energy wanes and curiosity is a collateral victim.
I've painfully gone through the entire cycle of this, including the bit of resurgence later on when you have a combination of free time but less energy. What I can say is that this absolutely does not happen just in tech. If anything tech is flooded with people with more curiosity than almost any other field.
> When you are deeply involved in tech both personally and professionally you are probably very passionate about this and makes sense that you'd only look closely at this field and think "people are so deeply incurious. This seems to only happen in tech".
Good point. I commented in a sibling post to the same effect.
I've certainly felt the personal strain of time sink and procrastination in my homelab. It's currently running k3os, which has been dead for about four years now, because everything I want is still running, and I never seem have the motivation on the weekend to yell at my laptop when I could be playing board games.
> including the bit of resurgence later on when you have a combination of free time but less energy.
I'm guessing that will happen in another decade or so, when my kids are grown.
Life is very different for many people and I think we just need to build empathy for people who treat a job as just a job. If they deliver and are not unkind about it there's nothing wrong about not going above and beyond the bare minimum, which is what they are paid for.
As I commented above, a large part of my umbrage stems from the impact these decisions have on my job. I dislike having to work harder because others didn't want to go above the bare minimum.
This isn't unique to any one company, nor my personal experience. At my last job, my team was initially human triaging practically all alerts, because we had the rare ability to read logs. I spoke to someone at Slack once who was stuck doing the same. That's an absurdly low expected level of competence.
You don't _have_ to work harder, as evidenced by those people who do the bare minimum. You just care about your work more than people who pay you for it (or the manager hired to manage you), which is the cause of your frustration here IMO.
I care about my teammates. Letting them down by not working to my full potential makes me feel bad. I think this is embedded into my pysche from years of military experience and then shift work.
I enjoy work, but I care about people.
> people are so deeply incurious. This seems to only happen in tech
It happens everywhere. If anything, techies are more curious than the average Joe. How many fellow programmers can you nerd-snipe with a comment that makes them say, “Well, that’s strange…”
I can somewhat echo some of the statements here and provide my own experience that is similar.
I spend a decent amount of time writing decent C++ code. My friends in other parts of the industry are writing certainly better C++ code than me because they are doing it in environments that are more constricted in various ways. In either case, I do spend my time catching up a bit and would consider myself a competent C++21 programmer in some ways.
My experience and my conversations lead me to understand there is so much left on the table with even the most basic implementations. When I implement it correctly in C++ we get close to some of the theoretical limits for the hardware for some workloads, compared to something that is literally 1% as fast running in NodeJS.
Wit that said, for so many situations I cannot justify the time and complexity to use C++ for many projects. At least for the stage most projects are in. In theory this optimization can happen later, but it never really does because the horizontal (or sometimes even vertical) scaling kicks in and we're all just willing to throw a few more dollars at the problem instead of re-engineering it. Sure, some of the really big companies like Netflix find a decent reason from time to time to invest the engineering time squeeze out those numbers, but it's becoming the exception and not the rule.
I've also found that not having some optimization mindset from the get go limits your product to achieve only a local maxima of performance in the end. It might not even be a good local maxima.
It's best to have at least some optimization considerations from the start. I'm saying some because too much is a problem.
>C++21
There's C++20 and C++23. No C++21.
Sorry, I meant to type C++11
> In the real world, your application is running in a container among hundreds or thousands of other containers
I mean, that’s an engineering decision too. In my day job we’re capturing, pre-processing, running inference on, and post-processing about 500Mpx/s worth of live image data at about 80ms/frame end-to-end at the edge. The processor SoM costs about $3000/unit and uses about 50W running flat out. The retail cost of our overall product is two orders of magnitude more than what the processor is worth but it incurs zero recurring costs for us.
Edit: and it’s got 64GB of Unified RAM that I’ve got all to myself :)
I was wondering if someone from a different sub-industry would disagree here :D
That sounds like a very interesting job, with quite different requirements and constraints from what I'm used to. One day, I'll get a job where application latency is critical, and optimizations matter deeply. Undoubtedly I'll find something else to be upset about, but at least it'll be a new complaint.
> Undoubtedly I’ll find something else to be upset about
Vendor SDKs. You’ll be upset about that I guarantee :)
It’s hard to learn the ins and outs of dozens of programming languages. One doesn’t usually just use one or two PL over their entire career. I have worked professionally with at least PHP, Java, Python, JS, Go, and Ruby. That without taking into account the respective libraries and frameworks (and without taking into account as well the myriad of other components like dbs, web servers, caches, etc.)
It sounds like an excuse, I know. The true is I just don’t have that much time.
> In the real world, your application is running in a container among hundreds or thousands of other containers. The system's resources are also probably being managed by a hypervisor. The CPU is shared among N tenants _and_ is overcommitted. It's not that much to ask to optimize where you can, when it isn't unreasonably difficult.
And the real irony is that those programmers not giving a crap about performances and justifying it are resting on the shoulders of giants who went out of their way to have these kernels, hypervisors, containers, passthroughs, memory usage, etc. be as performant as they can.
Same for security.
Then you get the "I'm not in the business of writing fast code" and "I'm not a security company, I'm selling X and Y".
But they all, on a daily basis, use actual proper software written by people who know about these things.
But how does that even help in any way, when the CPU this runs on is like you said shared among N tenants and your carefully optimized L2 cache access is still going to be miss because another tenant got "your" CPU in between?
If you're paying for bare metal i.e. have the whole instance for yourself, by all means, if your domain actually requires you to use the system in such a way (things like high frequency trading come to mind), then optimize like that!
If you're running on seventeen levels of hypervisors that destroy any careful optimization in a random fashion anyway, then what's the point even? (non rhetorical question!)
It's a version of the tragedy of the commons. Yes, your software bloat isn't the main factor keeping you slow, and cleaning it up might be barely perceptible.
But... all the other software makes the same decision. Then, suddenly, everything on the computer is running at 10% efficiency.
In a complex, multitasking environment, keeping your code clean benefits other tasks as much or more than your own. It should be considered a responsibility.
I would agree with the sentiment but not necessarily the conclusion.
Like, don't use (the equivalent of) Stooge Sort (for your particular situation).
But unless you are in a very particular situation, it should be OK for everyone to "just use your language's built-in sort function" (hopefully that's a thing and you don't use something where you need to roll your own even in 2024) assuming that it uses a sane default like quicksort or merge sort that will work perfectly fine for most regular common situations.
Another example might be to not stupidly build slow algorithms saying "this won't see much data anyhow" (yes, I've seen that in PRs unfortunately) when a very simple hash lookup will ensure it's fast all the time. But it should be totally fine to assume that your language's hash table implementation is fine for common situations and you don't need to optimize anything unless you're a very special case.
> "[CPU, Memory, Disk] is cheap, engineering time isn't." Fuck that.
It is. It's absurdly cheap. I ensure I check the amount of time it would take for me to make a performance improvement against the runtime costs of my functions. It's rarely worth the extra effort.
Seriously, until you get into the millions of records per second level, you're almost never benefited. You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
> Bad engineering time is _incredibly_ expensive.
Engineering time is expensive. Period. It speaks to the need to minimize it.
> This is an excuse to not spend time learning the ins and outs of your language, and your hardware.
All of which will change in a few years, which is fine, if you're also committing to keeping _all that code_ up to date right along with it. Otherwise you end up with an obscure mess that you have to unwind 5 years of context to understand and fix again.
Complexity and available mental contexts are forgotten costs. If your language even has that many "ins and outs" to begin with you may want to reconsider that.
> You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
"You" is both singular and plural, which is often the problem with this thinking.
Is it worth spending a month of engineering time to make a page load in 50ms instead of 2s? Seems like a lot of engineering time for a noticeable but somewhat minor improvement.
But now, what if you have a million users who do this operation 100x/day? Absolutely worth it!
For example, I sure wish atlassian would spend a tiny bit of effort into making jira faster. Even if it is 1 second per ticket, since I'm viewing 100+ tickets per day that adds up. And there's many hundreds of us at the company doing the same thing, it really adds up.
Nit: 50ms vs 2000ms is 40x speed increase, i.e. ~1.5 order of magnitude.
I still keep words of my database optimization lecturer who said that by his experience optimization below 1 OOM aren’t worth it and most „good ones” are 3+
> Absolutely worth it!
Long reaching assumption. Even the biggest companies have limited resources (even if vast). Would you rather improve load times by 2x (from 500ms to 250ms) or improve checkout reliability from 99% to 99.5%? And there is much more to consider on some levels (e.g. planning for thermal efficiency is fun).
Software development is always a game of choice.
> I still keep words of my database optimization lecturer who said that by his experience optimization below 1 OOM aren’t worth it and most „good ones” are 3+
Like everything, it depends. Is the query gating a lot of other things, especially things that can be done in parallel? Shaving 10 ms off might very well be meaningful. Is it a large OLAP query, and the owning team has SLAs that depend on it? Going from 60 --> 55 minutes might actually matter.
The two biggest performance-related issues with RDBMS that I deal with, aside from indexing choices, are over-selection (why on earth do ORMs default to SELECT * ?), and poor JOIN strategies. Admittedly the latter is often a result of poor schema design, but for example, the existence of semi/anti-joins seems to be uncommon knowledge.
If SLAs are involved then I’d argue it’s not about optimization but about business goals, which unsurprisingly take precedence.
But there is another case that is very similar: threshold passing (or how I like to call it - waterfalling). Small inefficiencies add up and at some point a small slowdowns reach critical mass and some significant system breaks everything else.
When system was designed by competent engineers huge optimizations aren’t easy, so it’s shaving couple millis here and couple millis there. But, as in the first case, I’d categorize it as a performance failure.
Most of the time they just move the expensive processing to the user's browser so they don't have to pay for it :)
Jira is incredibly slow, almost unusable. So awful.
More features = more customers = more revenue
More data collection = more revenue
More crap layered on = everything is slower
Everything sucks = more support demand = supply price leverage = more revenue
Enterprise software is necessarily slow and complicated, and not for your benefit.
I’m not quite following your point. It sounds like you’re agreeing that Jira sucks?
Yup
Yeah, so the pathological incentives to build bad products are just insane. It’s more evidence against the whole market-Darwinian hypothesis being good for the consumer. “Fitness” doesn’t equal (let alone define) product quality.
> 50ms instead of 2s
In the past I believe Google was very adament that page load time perception was very important to other metrics.
You're probably not going to achieve that with the kind of optimization described in this article though.
As always, https://xkcd.com/1205/ is a great reference to keep in mind.
That said, most developers (I assume, by most I mean myself) never work with code where the work they do has any serious impact on performance; it's layers on top of code that should be fast, but the work I do is implementing business logic and screens, in which case readability vastly trumps performance.
I mean right now I'm writing a line of code that can be represented as a nested ternary or a function call with some more written-out if/elses. The ternary outperforms the function call 2:1 or more, but either one can be executed hundreds of times a second BUT will only be executed once or twice per page view. It's not worth the tradeoff, even if this line of code will be executed hundreds of millions of times overall.
> You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
I'm not talking about increased complexity, I'm talking about extremely basic things that take zero extra time, like using the correct data structure. For example, in Python:
Very similar structures, with very different memory requirements and access speeds. I can count on one hand with no fingers the number of times I've seen an array used.Or knowing that `random.randint` is remarkably slow compared to `random.random()`, which can matter in a hot loop:
> All of which will change in a few years, which is fine, if you're also committing to keeping _all that code_ up to date right along with it.With the exception of list comprehension over large ranges slowing down from 3.11 --> now, I don't think there's been much in Python that's become dramatically worse such that you would need to refactor it later (I gather the Javascript community does this ritual every quarter or so). Anything being deprecated has years of warning.
> Very similar structures, with very different memory requirements and access speeds. I can count on one hand with no fingers the number of times I've seen an array used.
That is obvious when you actually check the access speed of arrays and find out it is about half that of lists on small integers (under 256), and worse on non-small integers. That is literally the opposite trade off of what you want in 99.99% of cases.
Deques are even less of a consideration, they’re unrolled linked lists so random access is impossible and iteration is slower, you use a deque when you need a deque (or at least a fifo), aka when you need to routinely manipulate the head of the collection.
It depends on your constraints. If you’re limited by RAM, arrays make a lot of sense for certain applications. If you need Python’s buffer protocol, again, they make a lot of sense.
As to deques, yes, they have specific uses, and being slightly smaller isn’t usually a selling point for them. My point was that I have seen many cases where an incorrect data structure was used, because a list or dict was “good enough.” And sure, they generally are, but if the language ships with other options, why wouldn’t you explore those?
> which can matter in a hot loop:
163ns - 31.9ns == 131.1ns
This will need to happen 7.6 million times to save me 1 CPU second. On AWS lambda with 1GB of memory this will cost you a whopping: $0.0000166667.
The point is, you're not even wrong, but there are vanishingly few cases where it would actually matter to the bottom line in practice. You're taking an absolutist point of view to a discipline which thoroughly rejects it.
This is what I love about the cloud. It forces you to confront what your efforts are actually worth by placing a specific value on all of these commodities. In my experience they're often worth very little given that none of us have the scale of problems where this would show actual returns.
Reaching the scale where it shows actual returns isn't all that difficult. You need it to happen 7.6 million times to save 1 CPU second, but each CPU core can execute it nearly that many times every second.
Probably you don't leave it generating only random numbers all day, but suppose you do generate a good few, so that it's 1% of your total compute budget, and you have only a modest load, using on average four CPU cores at any given time. Then saving that amount of computation will have saved you something like $15/year in compute, recurring. Which isn't actually that bad a return for ten seconds worth of choosing the right function.
There are also a lot of even fairly small entities for which four cores is peanuts and they're running a hundred or a thousand at once, which quickly turns up the price.
And even the things with internet scale aren't all that rare. Suppose you're making a contribution to the mainline Linux kernel. It will run on billions of devices, possibly for decades. Even if it doesn't run very often, that's still a lot of cycles, and some of the kernel code does run very often. Likewise code in popular web browsers, javascript on popular websites or in popular libraries, etc.
You don't have to work for Google to make a contribution to zlib and that kind of stuff has the weight of the world on it.
Sure, but the cumulative effects of pervasive mediocre to bad decisions do add up. And it isn't just about cloud compute cost. Your own time is stolen by the slow ci jobs that you inevitably get stuck waiting for. For me, I prioritize my own personal happiness in my work and this mindset taken too far makes me unhappy.
Yes but if you want to do things in a less obvious way you should be aware of the downsides, such as bias in your random numbers. Also making sure you watch out for off by one errors.
Stolen the number to show this off well from a bug report somewhere:
Result https://bugs.python.org/issue9025Have you ran this in any modern version of Python? It’s been fixed for a long time.
3.10 so I redid it on 3.13.1, same results.
Ugh, I was checking `randrange` (as the bug mentions), not `random`. I stand corrected.
Ah yeah sorry I should have mentioned it wasn't the same, I used it as it has a nice number that shows the bias to a pretty extreme degree.
Python is 100% the wrong language to worry about this in. If your hot loops are in python and you care about performance, you should be rewriting them in another language.
Agreed; I used it partially because TFA used it to demonstrate ideas, and partially because I’m very familiar with it.
But you’re correct, of course. When I need something to go faster in Python, I write it in C. If it’s more than a small section, then a full rewrite is reasonable.
Even if so, their point still stands. It's a tiny change that grants huge speedups.
I was curious why randint was slower than random and found a good SO answer [0] (it's like chatgpt by humans for you youngsters out there), the gist of it is that `random()` calls a C function directly (which I presume goes straight to a syscall), whereas `randint` is implemented in Python and has a load of preconditions / defensive programming (which I presume is executed at every invocation, so seven potential branches etc it has to check every time).
Of course, if it's not in a tight loop and you need a whole integer between a range then it's the most developer-ergonomic way to get it. If you do need more performance but want to keep the ergonomics, writing your own random-int-between-two-numbers is fairly straightforward but it'll take some time.
[0] https://stackoverflow.com/a/58126026/204840
Ok, but neither of these matter until you know they matter. Seriously. Like, yes, it's nice they exist and that they are available for when you want them, but I would generally advise people to use a list or random.randint, if only because I value their confidence with them over the 2x performance win, because most workloads are not simply just a single array or random number generator loop. And, to be clear, I work on performance professionally: most of my job is not making things as fast as possible, but considering the tradeoffs that go into writing that code. I understand your example as showing off an interesting performance story but in the real world most workloads are more complex than what can be solved with using a rare but drop-in API.
You are saying that the potential gains are less than an order of magnitude. That mkes them a pretty hard sell in most instances.
> until you get into the millions of records per second level, you're almost never benefited
Yeah, but the software landscape is very diverse.
On my job (CAM/CAE) I often handle data structures with gigabytes of data. Worse, unlike e.g. multimedia frameworks many algorithms operating on these numbers are global i.e. can’t be represented as a pipeline which splits data into independent chunks and processes them sequentially.
Making performance critical functions twice as fast might saves hours of time for a single end user in a single day.
> It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech,
I mean software engineering and computer science from the start is built on abstraction, it makes sense that it attracts people who find taking the abstraction as reality and ignoring the underlying "real" layers to be attractive.
(Also it's not "real" until you get down to... voltages? Electrons? I don't know... the whole thing about our whole endeavor being built of abstraction is there are so many layers. Nobody thinks everyone has to routinely go ALL THE WAY down, but I get your point that being more curious than many tend to be about a couple layers down is helpful, with how far down a couple layers is depending on the nature of your work. I'm not saying it's not helpful, I'm suggesting possibly why -- because we select for people who swim happily in abstractions, who love em even. And the success of all of CS and software engineering is based on the ability for it to often work. nobody really has to be curious about voltages when writing a python script to read and write from storage)
It is interesting that at the lowest levels everything in fact is analog, only turning 'digital' on subjective decisions defining as such.
Nope. If a program is meant to run on the entire CPU, and if it's meant to use all the memory, then that's how it will run. This isn't a question of modernity or availability of tools or programmers' salary.
Or, to put it differently, using more or less memory / CPU isn't really an indication of anything about a program. Sometimes you mean to use all that's available, other times you mean to use as little as you can. There's no way to tell which one should be done w/o knowing your goals.
For example, the entire free memory on Linux is a free real estate for filesystem caching. So that memory is not wasted, if the system can help it. So, having your entire memory used (to support filesystem cache) wouldn't be really a performance problem (quite the contrary, it would look like the system is doing a good job at caching whatever reads your applications are making).
That's good food for thought. My hardware is quite old, but I agree that the numbers seem somewhat suspicious.
I believe that RAM can only be RMW'd in cache lines, so modifying just 8 bytes still requires 64 bytes to be transmitted. I'm assuming the 50 GB/s throughput is half-duplex, and 25 GB/s over 64 bytes is ~400 Melems/s, somewhat closer to my result.
I tried using non-temporal stores in the straightforward algorithm, but to my surprise, this led to a significant decrease of performance across all input lengths.
> When your data is indeed big and the performance actually matters, consider doing something completely different.
I'm not sure what you mean by this. Scaling across machines is just ignoring the problem. What do you mean by "something completely different"?
> I believe that RAM can only be RMW'd in cache lines, so modifying just 8 bytes still requires 64 bytes to be transmitted.
Ages ago I worked with several different memory controllers, and it depends on the memory controller, cache, and MMU configuration.
Plenty of systems do require cacheline updates, then modifying 8B requires reading one or two cachelines, updating them, and writing them out eventually.
Some caches track cacheline validity with a bit per byte. This enables a CPU to set a book out in memory without fetching the cacheline. The cache may then try to burst read that line from the memory controller, but if it doesn't get around to it before deciding to flush that cacheline, it may issue a single byte write to the controller. The controller can then issue a makes DRAM write to the memory, which will update only certain bytes in the DRAM column. However, this still takes about as long as sending the full cacheline but it offloads the read-modify.
Validity per byte is also useful to implement hit under miss.
I bet on newer, bigger systems tricks like this are less useful since the memory busses are faster and wider today.
Non-temporal stores will not help you if all you do are those accesses. The point of using them is that you don't want them pulled into the caches and pushing out everything else.
> What do you mean by "something completely different"?
I mean compute the exact thing you need to compute, optimizing for performance.
I’m pretty sure your current code is spending vast majority of time doing malloc / memcpy while moving input elements from the input array into per-bucket collections.
To aggregate the data the way you want, you do not have to do any of that. You only need a single number per bucket to compute the minimum. Updating that accumulator is very cheap as there’s no memory allocations involved: just load, compare, and store the updated number.
Moreover, that algorithm is easily to parallelize. Compute an array with these per-bucket minimums on different threads, wait for completion of all tasks, then aggregate multiple arrays computed by each thread into a single array with global minimums for each bucket. Then compute the final sum.
> In my 3 years old laptop, system memory (dual channel DDR4-3200) delivers about 50 GB / second.
That's almost certainly for (mostly) sequential access.
When you just want a couple bytes here and there, and access isn't pipelined and prefetch doesn't accelerate your use case, the real world bandwidth is going to be significantly less.
The input data is sequential.
I don’t understand Rust but if the code is doing what’s written there, “simple multiplicative hash and perform a simple analysis on the buckets – say, compute the sum of minimums among buckets” it’s possible to improve substantially.
They don’t need to move these megabytes of elements between collections. I would split the input across a few CPU cores, compute per-bucket minimums in parallel on each core, when all completed aggregate minimums across all cores, then compute sum of the results.
Multiplicative hash and the final sum should be possible to vectorize on most computers. Updating per-bucket minimum is probably impossible to vectorize (technically AVX512 set has the required instructions but I’m not sure these are efficient) but there’re much fewer buckets than input numbers, which means arrays with per-bucket minimums are likely to fit in caches.
cache misses are slow because of latency, not because of throughput.
Isn't the point of the person you replied to that the article author wasn't able to eliminate latency because if they were they'd be constrained by throughput but they are not?
cache misses can absolutely screw throughput because you may need to load more cache lines
And my GPU delivers 1TB/s. Massive difference <3
I wish we could get those kind of speeds on system RAM.
To/from what sort of devices could the GPU read or write at 1TB/sec, besides main memory?
The fastest consumer SSDs top out at several GB/sec (I guess with massive hardware RAID they could be faster, but not sure if they’d be 1TB/sec fast).
Even a network adapter that does 10 Gbit/sec is only recently becoming slightly more common for the average consumer. Not sure if any consumer adapters in the 10 or 1Tbit/sec range exist at all.
Consumer? Even in the "if-you-need-to-ask-what-it-costs-you-can't-afford-it" world of frontier HPC systems, were only getting teasers of 0.8 Tbit/sec NICs this year.
As you say, only the GPU, maybe RAM, and the NIC will be able to churn data at these speeds. There is a reason why Mellanox (Nvidia) has developed GPUDirect RDMA, so the GPU and NIC can talk directly to each other.
https://www.servethehome.com/this-is-the-next-gen-nvidia-con...
> ...besides main memory?
Main memory is an important thing to have fast, though. The faster (and lower-wallclock-latency) it is, the less time your system spends waiting around when it needs to swap things in and out of it. It's my understanding that programs that need to be fast (like many video games) take pains to preemptively load data into RAM from disk, and (when appropriate for the program) from main RAM into VRAM. If main RAM's transfer speed was equal to or greater than VRAM's, and it access latency was a small fraction of a frame render time, (presumably) some of that preloading complexity could go away.
> I guess with massive hardware RAID they could be faster...
This section of the comment is for folks who haven't been paying attention to how fast storage has gotten: It's nowhere near 1TB per second, but...
I have four 4TB SATA-attached Crucial MX500s set up in LVM2 RAID 0. This array is a bit faster than a 10gbit link. (That is, I get 1.5GByte/s transfer rate off of the thing.) Even a single non-garbage U.2-attached (or (barf) M.2-attached) device can saturate a 10Gbit link.
I'm reading the latest SSDs do 14 GB/s of sequential reading and/or 15 million IOPS; transfer speed wise that's close to the highest end DDR3 (2007) and the lowest end DDR4 (2014). SSDs are definitely not memory speed fast yet (definitely not for random access) but definitely getting closer.
Yeah that was intel's optane idea. To make SSDs so fast they could act as memory.
Didn't quite work out though.
> To/from what sort of devices could the GPU read or write at 1TB/sec, besides main memory?
The 1TB/sec is between the GPU and the GPU memory, which is the bottleneck.
You don’t need that much bandwidth for loading inputs and outputting results, just for random access during compute.
> Not sure if any consumer adapters in the 10 or 1Tbit/sec range exist at all.
Further, what exactly is a consumer going to plug a 1Tbit/sec adapter into? Your little home ATT fiber connection isn’t going to come close to using that available bandwidth.
> Further, what exactly is a consumer going to plug a 1Tbit/sec adapter into?
Another similarly-equipped machine on the LAN, and the switch(es) between them.
It actually almost never does. To see that you'd need to benchmark. It's pretty difficult get good utilization on GPU on either compute or memory bandwidth side. A lot of kernels irretrievably fuck up both. You need long, coalesced reads/writes, and judicious use of the memory hierarchy, or else everything gets very slow very quickly.
I mean they can only do that if you have hundreds of threads all making coalesced writes. Typical CPU workloads look nothing like that; if you pointer chase on the GPU you are going to get absurdly bad performance.
> Cache is seen as an optimization for small data: if it fits in L2, it’s going to be processed faster
Nobody worth their salt believes just this and nothing else.
Yes, if the data fits entirely into a given cache, that's a nice case that's easy to reason about. No matter what access pattern is applied to the data, it doesn't matter because it's in the cache.
Hopefully everyone working with caches understands that they provide a potential speedup when not everything fits into the cache, and that this depends on the pattern of access (mainly, does it exhibit "locality"). Moreover, this case is extremely important.
The article gives an example of exactly that: improving the locality of access.
If you don't know this, you don't know one of the first facts about caching.
There is something else to know about: you can't tell by size alone whether a given data set will fit into a cache. The problem is that caches are not always fully associative. In a set associative cache, a given block of data cannot be stored in any cache line: it is assigned to a small set of possible cache lines. Then within a set, the cache lines are dynamically allocated and tagged. A given bunch of working which appears to be just smaller than the cache might be arranged in such a poor way in memory that it doesn't map to all of the cache's sets. And so, it actually does not fit into the cache.
>Nobody worth their salt believes just this and nothing else.... and that this depends on the pattern of access (mainly, does it exhibit "locality").... If you don't know this, you don't know one of the first facts about caching.
Not necessarily specific to this issue, but I've found that surprisingly many people out there are not "worth their salt" in areas where you'd really expect them to be.
Assuming incompetence cannot be a general strategy, but there are many surprising ways to get jobs and pass exams.
That's true, perhaps my wording is off. I believe that the devil in the details. Sure, knowing that better access patterns result in better performance is common. But the fact that the access pattern can be optimized when the problem is literally "access RAM at these random locations" is counterintuitive, IMO.
When you have locality, prefetch can mask the latency of getting the next object, regardless of whether everything fits in cache.
There are actually some algorithms specifically designed to optimize usage of cache resources without knowing the specific features of the cache.
"Cache Oblivious algorithms" https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
Not sure what "Myth" the author thinks they are exposing.
Jim Gray wrote "RAM is the new disk" at least in 2006, probably earlier, so 20 years ago. And people have been saying "RAM is the new tape" for quite some time now as well.
This was a great read, thanks. OP, readers might benefit from having it explicitly mentioned that while the pseudocode is in Python, actual Python code will likely not benefit from such an approach because of how memory is fragmented in the standard Python implementation - e.g. this discussion: https://stackoverflow.com/questions/49786441/python-get-proc...
I am tempted to actually test this (i.e. see if there's a speedup in Python), but I don't have the time right now.
I should have looked at the comments before I started writing the code. I'd have replied to you otherwise :) (see https://news.ycombinator.com/item?id=42459055 .)
Haha thanks
[dead]
Using Python pseudo code in this context is hardly understandable.
What concise and readable language would you suggest instead?
That's not the problem at hand. Python is good for pseudocode. But not if you want to talk about cache misses because in the pseudocode written with higher level languages a lot of details on how memory is accessed are opaque.
Again, what would you suggest instead? If you can't guess that `list` is supposed to represent an array of consecutive elements, I have trouble thinking of a language that'll make that clear without being exceedingly verbose for no reason.
A bit later, in the article, you'll see that memory patterns in allocating the arrays have a role. A role that was hidden initially.
> Again, what would you suggest instead?
The answer is inside you. You have only to search for it. Or, if you really want to extort me the obvious: any low level language (even not implementing every detail but calling imaginary functions whose name suggest what they are doing). This exercise will show you for instance that you'll have to immediately choose if to append_to_dynamic_array() or add_to_linked_list().
> This exercise will show you for instance that you'll have to immediately choose if to append_to_dynamic_array() or add_to_linked_list().
Linked lists, that's what you're worried about? `[...]` is not a linked list in Python, in fact I don't know any imperative language where it's something other than a dynamic array/vector. I can only assume someone who doesn't understand or intuit this is arguing in bad faith, especially when taking your attitude into account. What did I do to deserve being talked down to?
> any low level language
Like this?
Is it worth it? If you don't know `list` is a dynamic array in Python, how will you know `std::vector` is a dynamic array in C++? Not to mention the syntax is terrible. C would be just as bad. Using Rust would get people angry about Rust evangelism. The only winning move is not to play.antirez: You already got it wrong. You've got a double indirection, with `groups[i]` pointing to a heap-allocated instance of `DynamicArray`, which supposedly stores the actual heap pointer.
It's not about the language being low-level or high-level. It's about understanding the basics of memory layout. If the pseudocode being in Python is an obstacle for you, the problem is not in Python, but in your (lack of) intuition.
I wrote the first random semantic that came in mind, in the pseudocode in C above. The fact you believe it is not the right one is a proof that in C I can express the exact semantic, and in Python I can't (because everything is hidden inside how the Python interpreter can or can not implement a given thing).
So, modify my C code to match what you have in mind, if you wish. But the point is that low level code will specify in a much clearer way what's going on with memory, and in a piece of code that is about memory layout / access, that's a good idea. Otherwise I consider Python a good language for pseudocode. Just: not in this case.
Have a nice day, I'll stop replying since would be useless to continue. P.S. I didn't downvote you.
If only there were a better syntax for abstractions over machine code. :)
C is exactly at the right level. More high level and human readable than bare assembly, yet without the complexity and abstractions of Rust or C++ or opaqueness of Python.
C was suggested, but I find Go to be good too for this purpose, you get some of the low level memory access (stack vs heap stuff) but don't need to worry about memory allocation or whatnot. Great for showing the differences in algorithms.
For an article about caching and performance C is the lingua franca
> The RAM myth is a belief that modern computer memory resembles perfect random-access memory. Cache is seen as an optimization for small data
The post doesn't seem to answer what the memory actually resembles. If it's not resembling a random access memory, then what is it resembling?
It resembles sequential access memory with relatively fast seeks and a cache (ok, multiple layers of caches). Reading sequential, predictable addresses gives you much more throughput than random access, and reading a value that was recently accessed (or adjacent to such) is much lower latency than something that was not. There's further wrinkles in multicore systems as well, because then accesses to memory recently written to by another core can be slower again.
It resembles a hierarchy wherein a small fraction of memory - a "cache" region - can be accessed much faster than the rest. With careful planning, the programmer can increase the odds that the necessary information for the next step of an algorithm, at any given point, is within that small region. (This is a simplification; there are actually several layers of cache between a CPU and the most general part of the RAM.)
I can only conclude the word myth was chosen for attention. Modern CPU memory systems (with or without their caches) certainly resemble idealized RAMs.
It's a little surprising that this works at all, since the partitioning step in the radix sort is itself the same kind of sharding operation. But I guess that's because it allows for working on smaller pieces at a time that fit in whatever cache they need to.
> Python can't really reserve space for lists, but pretend `reserve` did that anyway.
FWIW, you can pre-fill a list with e.g. `None` values, and then replacing those values won't cause resizes or relocations (of the list's memory - the elements are still indirected). But of course you'd then need to keep track of the count of "real" elements yourself.
But of course, tricks like this are going to be counterproductive in Python anyway, because of all the indirection inherent in the system. They'll also get worse if you actually do have objects (unless perhaps they're statically, constantly sized, and in a language that can avoid indirection in that case) with a "group ID" attribute, rather than integers to which you can apply some hash function. Some test results, trying to approximate the Rust code in Python but with such objects:
https://gist.github.com/zahlman/c1d2e98eac57cbb853ce2af515fe...
And as I expected, the results on my (admittedly underpowered) machine are terrible (output is time in seconds):
It's important to understand the domain. For Pythonistas, simple really is better than complex. And identifying the bottlenecks and moving them to a faster language is also important (Numpy isn't successful by accident). The naive result here is still, if I'm understanding the graphs right, dozens of times worse than what any of the Rust code achieves.(Edit: merely "several" times worse. I forgot that I told `timeit` to run 10 iterations over the same input, so each test processes ~10M elements.)
> It's a little surprising that this works at all, since the partitioning step in the radix sort is itself the same kind of sharding operation.
The key property here is the number of groups. When sharding data to n groups, only n locations have to be stored in cache -- the tails of the groups. In my radix sort implementation, this is just 256 locations, which works well with cache.
Actual benchmarks and graphs. Thank you so much for that.
RAM Myth #2: Free memory is good thing. A billion windows "power" users have etched this into canon.
Windows starts doing awful things under slight memory pressure like 80% usage.
There are cascading failures like high ram usage -> swap+compression -> power+thermal limits -> desktop responsiveness is unreasonably bad (i.e. >5 minutes to open task manager to kill a single firefox.exe using >10G memory caused by a single misbehaving javascript among many tabs)
This is an incredibly common scenario for laptops that cost less than $1000.
Free memory means things will probably not go to shit in the next 10 minutes.
Linux is a whole different story with different sharp edges.
I find this article particularly hard to read. The author states:
> I’m using Python as pseudocode; pretend I used your favorite low-level language)
But then the code is full of Pythonic constructs alien to any other low-level language (and, I suspect, also to most casual Python users) that spoil all the fun. I found myself spending more time parsing the specific syntax than grokking the intent of the code.
I think using more explicit loops and less lambdas would have made the article more meaningful.
> The only way to prevent cache misses is to make the memory accesses more ordered.
Or prefetch. Not enough people know about prefetch.
All modern CPUs have powerful hardware prefetchers, which are insufficiently documented.
On any modern CPU (i.e. from the last 20 years), explicit prefetching is very seldom useful.
What is important is to organize your memory accesses in such orders so that they will occur in one of the patterns that triggers the hardware prefetcher, which will then take care to provide the data on time.
The patterns recognized by the hardware prefetchers vary from CPU to CPU, but all of them include the accesses where the addresses are in arithmetic progressions, going either forwards or backwards, so usually all array operations will be accelerated automatically by the hardware prefetchers.
I agree that they can be powerful but I guess, as always, it doesn't apply in the general case. For example, AFAIK hw prefetcher runs at 32B or 64B per cycle which basically means that you're only going to see a benefit from it if the code you're running is CPU bound, no?
There are situations where it's faster to prefetch manually than to rearrange and rely on the hardware prefetcher. Yes, even on modern CPU's.
Radix sorting is great, as this article points out. But I followed the links and didn't see any mention of the most obvious optimization that makes it even better on modern systems.
If you do MSB for the first pass, then the subsequent passes within each bucket are completely independent :) So... Choose a large radix and do one MSB pass, then parallelize the LSB processing of each bucket. This is fantastic on a GPU. You might think that you would want many thousands of kernels running simultaneously, but in actuality each SM (and the Apple equivalent) has a thread-local cache, so you really only want a couple hundred simultaneous kernels at the most. I've written CUDA and Metal versions and they are fast. And you don't even have to be very good at GPU programming to get there. It's very basic. AoCP has the pseudocode.
I was expecting something about NUMA performance, temporal memory access instructions, shared versus global versus register memory on GPUs, SRAM, and so on. There's an article about all these things waiting to be written. This article is instead about memory-hierarchy-aware access pattern optimization, which is important, but not the whole story.
I sometimes get the feeling it would be better to just start treating RAM as secondary storage, with read/write access through a (hardware implemented) io_uring style API. Is there anything along those lines out there?
CXL RAM is something like this on the physical side. It's RAM over a PCIe connection, and PCIe is basically a very fast, serial, point to point network. However as far as I'm aware the "API" to software makes it looks just like regular, local RAM.
https://en.wikipedia.org/wiki/Compute_Express_Link
> Sorting costs some time, but in return, removes cache misses from the for loop entirely.
Mmm. Can someone explain this? To avoid n cache misses in the for loop, we accept nlog(n) cache misses for sorting?
I'D love to understand the differences between the "devices" named A, Y, M in the performance measurement, referring to "(A, Y, M indicate different devices)"
any pointers appreciated
Tried running the code. Crashes with 'attempt to add with overflow'
Anaesthetists are required to undergo days of retraining per year because the field, and the safety numbers, keep moving.
To do this researchers publish findings and professionals aggregate those into useful training materials
Who / where is the useful training materials for software devs? It really cannot be just blogs.
That’s a VC investment that has ROI across the industry - tell a16z how to measure that and we are quids in
"nulls should not be in your language" is my version of "doctors should wash their hands"
"Immutability-first, with mutation-as-a-special-case" is my "accountants should not use erasers"
"make illegal states unrepresentable" is my "hospitals should sterilise their operating rooms and equipment"
As a field, we are very far from reaching broad agreement on the left side (which I consider the basics). So what would the training materials teach?
Lately, "nulls should not be in your language" sounds like "reality should be simpler". Queue angry fist-shaking at the complexities of real life and a wistful longing for a logical, functional, mathematical world. In reality, sometimes out of 8 questions, you only know the answer to 7, and the missing answer could be any of the 8. So you can:
1. Prevent the user from doing anything until they go find that 8th answer, which cascades into your tool not helping anyone until they have a complete, validated, full, historical data set.
2. Subdivide those 8 questions into having independent existence, which cascades into your entire application being a giant key/value store, and all the difficulties of null are simply replaced with the difficulties of "this key/value pair does not exist yet".
3. Add a sentinel value for "I don't know this yet", while feeling good about yourself because that sentinel value is not technically null, while introducing all the same exact issues that null causes, plus a bunch of infrastructure you have to write yourself. Basically: reimplement null, but worse.
4. Make the answers nullable.
It'd be like claiming that an IDE should simply not allow syntax errors to be entered at all. Errors would be completely impossible! Except at some point your user needs to actually write the thing, and you've just abandoned the idea of helping them. So they write it in some other editor, and then paste it into your IDE. Or you embrace the fact that incomplete work is OK.
Yes, nulls are generally way over-used. But disallowing them entirely is a fool's errand. Source: I've been that fool.
In before: "You should just be using Maybe<> everywhere" -- "Maybe" is just another word for "nullable". The only difference is the level of support you get from the compiler / type-checker. So that argument is "the compiler should help you get null right", which I completely agree with! There's still work to be done getting type checkers better at null. But that's a far cry from "null should not be in your language" and equating the use of nulls to a doctor not washing his hands.
Of course nobody is seriously arguing about ditching the notion of a sentinel value that indicates absence of value. The question, rather, is how to best handle that case.
> In before: "You should just be using Maybe<> everywhere" -- "Maybe" is just another word for "nullable". The only difference is the level of support you get from the compiler / type-checker.
The difference is that in languages with null references/pointers, all references/pointers are implicitly nullable - that is, it is indeed like "using Maybe<> everywhere". But when you have proper option types, you use Maybe<> not everywhere, but only where you know nulls can actually occur.
> The difference is ... [whether] all references/pointers are implicitly nullable [or not]
Right so we're narrowing down the actual problem here, which revolves around implicit vs. explicit, compiler support vs. on-your-own, static vs. dynamic typing. I agree with you that I strongly prefer explicit over implicit nullability, assuming it fits within the idioms of the language. But this is a stronger condition than static typing, so to believe "implicit nulls" is as irresponsible as a doctor not washing their hands, you* have to believe that everyone using dynamic typing is even more irresponsible than that. There's a place for dynamically typed languages, there's a place for implicit nulls, there's a place for strict type- and null-checking, there's a place for mathematically proving code correct. Engineering is always about tradeoffs -- it's way too simplistic to just say "null bad".
* I know it wasn't literally you arguing that; I'm still partly addressing mrkeen here.
It really depends on your definition of static typing. The thing that always made implicit nullability such a mess is precisely that it breaks the subtype relationship - that is, given some (reference) type Foo, you can have a null value of that, but none of the operations that normally apply to Foo apply to that null Foo value! In other words, it's not actually of type Foo, at least not under any sane definition of what a "type" is. So it was always a hole in static typing, and experience with all the grief caused by this hole is why we're moving away from it now, and why even the person who originally introduced this approach in the first place believes it was a mistake.
But, yes, using dynamic typing would be even more irresponsible than using implicit nulls. And this is also something we've learned from experience - note how historically dynamically typed languages are acquiring some form of bolt-on statically checked typing (TypeScript, Python type annotations etc), and how quickly this is becoming the preferred way to write new code.
Completely agree with most of that, especially the problem that allowing nulls means that all of your values are now suspect.
Just picking up on the term "sentinel" here, which I believe refers to using an actual member of the set to represent something else - which I am totally against,
e.g.
The rationale goes: no-one would ever think -1 is a valid count of items, so we can use it as an error sentinel.Then when countItemsIn() returns 1 and countItemsOut() fails, we deduce that itemCount is 2.
So I guess a sentinal is a kind of dual to a null (which is a non-member of a set pretending to be a member)
Anaesthetists also undergo many years of training prior to taking post. Typically they've got say 10 years training before they actually do the job, much of it specialised to this job in particular (and the rest in general or intensive care medicine)
If you're lucky a Software Engineer has a three year degree in CS, and probably only a semester at best was studying "Software Engineering" and even that might focus on something you don't care about, such as formal methods.
It is entirely possible that your junior engineers have never maintained a sizeable codebase for more than a few weeks, have never co-operated on software with more than a handful of other people, and have never used most of the common software libraries you use every day, regardless of whether these are in-house (so how could they) or widely available.
For example maybe you do lots of web apps fronting SQL Server databases in C#. Your new hire has six months of C#, they half-remember a course doing SQL on an Oracle database, and all their web knowledge is in Javascript. Do they know version control? Kinda. Have they used a test framework before? Well, they did in Java but never in C#.
The "All Your Tests Are Terrible" talk begins by pointing out that probably they're strictly wrong because you don't have any fucking tests. All of the rest of the talk is about the hideous tests that you're unhappy to find because you forget that somebody could just not have bothered to write any tests at all.
> All of the rest of the talk is about the hideous tests that you're unhappy to find because you forget that somebody could just not have bothered to write any tests at all.
At many times in my professional "career", I've found myself wishing that whoever wrote the test I'm staring at just hadn't bothered. "Tests that say and/or look like they test one thing, but test something entirely different." [0] and "Tests that actually test nothing at all and never fail when the code they claim to test is changed out from under them." are two of the many categories of tests I wish folks would just never have wasted their time writing.
[0] To be clear, I usually find tests of this type to claim to be testing something useful, but actually be testing something that's not worth testing.
Hey, I resent and agree with this!
> Who / where is the useful training materials for software devs? It really cannot be just blogs.
Why can't it be blogs, books, and word-of-mouth?
If you're a programmer (as your HN profile suggests that you are), you already know how little formal training we receive.
I think I am saying we should fund more training materials and spread the word on where they are to be found.
Maybe
But something !
Tl;dr: cache is the new RAM; RAM is the new disk; disk is the new tape.
Exemplar code in python, shows the benefit in rust. o---kaaaaay.
So the outcome is "for this compiled rust, around 1m records it gets better"
But you didn't actually prove a general case, you proved "for a good optimising rust compiler" didn't you?
Maybe I'm over-thinking it. Maybe this does just become simple, working-set-locality stuff.
I could take the lesson: for less than 10m things, on modern hardware with more than 4 cores and more than 4GB stop worrying and just code in the idiom which works for you.
Uh? The post very clearly says: "I’m using Python as pseudocode; pretend I used your favorite low-level language". The goal is to show what's possible when you do your best, of course I'm not going to use Python for anything beyond demonstrating ideas.
> But you didn't actually prove a general case, you proved "for a good optimising rust compiler" didn't you?
Again, that was never my goal. I chose Rust because I've stumbled upon this problem while working on a Rust library. I could've chosen C++, or C, or maybe even Go -- the result would've been the same, and I checked codegen to make sure of that.
> I could take the lesson: for less than 10m things, on modern hardware with more than 4 cores and more than 4GB stop worrying and just code in the idiom which works for you.
The number of cores and RAM capacity has nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway. If your main takeaway from "here's how to process large data" was "I don't need to worry about this for small data", well, I don't know what to say.
Large and Small are so contextual. I'm processing 350m events/day in 24h splits, and I managed to stop worrying about locality of reference because I'm the sole occupant of the machine. When I did worry about it, I found radix tree, awk hash and perl/python hash/dict pretty much all occupied much the same space and time but a tuned C implementation got 2-3x faster than any of them. Somebody else pointed out memory resident for most of this would be faster still but you have to then work to process 24h of things against a single memory instance. Which means buying into IPC to get the data "into" that memory.
It interested me you didn't show the idea in rust. That was the only point I was making: Python as pseudocode to think things in documents is fine with me.
But overall, I liked your outcome. I just think it's important to remember large and small are very contextual. Your large case looks to me to be >5m things and for an awful lot of people doing stupid things, 5m is bigger than they'll ever see. If the target was only people who routinely deal in hundreds of millions of things, then sure.
> The number of cores and RAM capacity has nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway.
What matters is locality since that allows prefetch to mask latency. If you have this, then you are in a good place even if your data does not fit in the L3 cache. What you did demonstrates the benefits that locality gives from the effect on prefetch. Fitting in L3 cache helps, but not as much as prefetch does. If you do not believe me, test a random access pattern on things in L3 cache vs a sequential access pattern. The sequential access pattern will win every time, because L3 cache is relatively slow and prefetch masks that latency.
I have seen options for disabling prefetch and cache as BIOS options (although I only remember the option for disabling cache in ancient systems). If you could get one, you could do some experiments to see which will matter more.