The Apple GPU and the impossible bug

965 points by stefan_ 4 years ago

> Why the duplication? I have not yet observed Metal using different programs for each.

I'm guessing whoever designed the system wasn't sure whether they would ever need to be different, and designed it so that they could be. It turned out that they didn't need to be, but it was either more work than it was worth to change it (considering that simply passing the same parameter twice is trivial), or they wanted to leave the flexibility in the system in case it's needed in future.

I've definitely had APIs like this in a few places in my code before.

pocak 4 years ago

I don't understand why the programs are the same. The partial render store program has to write out both the color and the depth buffer, while the final render store should only write out color and throw away depth.
- kimixa 4 years ago
  
  Possibly pixel local storage - I think this can be accessed with extended raster order groups and image blocks in Metal.
  https://developer.apple.com/documentation/metal/resource_fun...
  E.g. in their example in the link above for deferred rendering (figure 4), the multiple G-buffers won't actually need to leave the on-chip tile buffer - unless there's a partial render before the final shading shader is run.
- plekter 4 years ago
  
  I think multisampling may be the answer.
  For partial rendering all samples must be written out, but for the final one you can resolve(average) them before writeout.
- hansihe 4 years ago
  
  Not necessarily, other render passes could need the depth data later.
  
  pocak 4 years ago
  
  Right, I had the article's bunny test program on my mind, which looks like it has only one pass.
  In OpenGL, the driver would have to scan the following commands to see if it can discard the depth data. If it doesn't see the depth buffer get cleared, it has to be conservative and save the data. I assume mobile GPU drivers in general do make the effort to do this optimization, as the bandwidth savings are significant.
  In Vulkan, the application explicitly specifies which attachment (i.e. stencil, depth, color buffer) must be persisted at the end of a render pass, and which need not. So that maps nicely to the "final render flush program".
  The quote is about Metal, though, which I'm not familiar with, but a sibling comment points out it's similar to Vulkan in this aspect.
  So that leaves me wondering: did Rosenzweig happen to only try Metal apps that always use MTLStoreAction.store in passes that overflow the TVB, or is the Metal driver skipping a useful optimization, or neither? E.g. because the hardware has another control for this?
  
  johntb86 4 years ago
  
  Most likely that would depend on what storeAction is set to: https://developer.apple.com/documentation/metal/mtlrenderpas...
  
  Someone 4 years ago
  
  So it seems it allows for optimization. If you know you don’t need everything, one of the steps can do less than the other.
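
A minimal sketch of the store decision discussed in this subthread, written as a Python toy rather than real driver code. The names `StoreAction`, `Attachment` and `flush_tile` are hypothetical and only loosely modelled on the store actions mentioned above. The assumption is the one made in the comments: a partial render has to write back everything needed to resume the tile, while the final render only writes what the application asked to keep, and can resolve multisampled color by averaging.

```python
from dataclasses import dataclass
from enum import Enum


class StoreAction(Enum):
    STORE = "store"          # keep the attachment contents after the pass
    DONT_CARE = "dontCare"   # contents may be discarded
    RESOLVE = "resolve"      # average the samples, keep one value per pixel


@dataclass
class Attachment:
    name: str
    samples: list            # per-pixel sample values held in tile memory
    store_action: StoreAction


def flush_tile(attachments, partial):
    """Return what gets written to memory for one pixel of one tile."""
    written = {}
    for att in attachments:
        if partial:
            # Mid-pass flush: rendering resumes later, so every sample of
            # every attachment must survive, whatever the app requested.
            written[att.name] = list(att.samples)
        elif att.store_action is StoreAction.STORE:
            written[att.name] = list(att.samples)
        elif att.store_action is StoreAction.RESOLVE:
            written[att.name] = [sum(att.samples) / len(att.samples)]
        # DONT_CARE: nothing is written, saving bandwidth.
    return written
```

With a 4x multisampled color attachment set to resolve and a depth attachment set to don't-care, the partial flush writes eight values per pixel and the final flush writes one, which is the bandwidth difference the two store programs would be expected to reflect.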
jleahy 4 years ago

I think there are many cases where they need to be different, for example if you want to do any kind of post-processing on the final image.

VyseofArcadia 4 years ago

> Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a screaming fast desktop, but its unified memory and tiler GPU have roots in mobile phones.

PowerVR has its roots in a desktop video card with somewhat limited release and impact. It really took off when it was used in the Sega Dreamcast home console and the Sega Naomi arcade board. It was only later that people put them in phones.

wazoox 4 years ago

Unified memory was introduced by SGI with the O2 workstation in 1996, then they used it again with their x86 workstations SGI 320 and 540 in 1999. So it was a workstation-class technology before being a mobile one :)
- andrekandre 4 years ago
  
  even the n64 had unified memory way back in 1995
  
  nwallin 4 years ago
  
  The N64's unified memory model had a pretty big asterisk though. The system had only 4kB for textures out of 4MB of total RAM. And textures are what uses the most memory in a lot of games.
  
  klodolph 4 years ago
  
  That’s a somewhat misleading way to describe it. The N64 has 4K texture memory (TMEM), but you can use far more than 4K of texture during a frame—because you can load data into TMEM as many times as you like during a frame.
  In practice, you might think of TMEM like it’s a cache, it’s just that you have to manage this cache manually. You can use as much RAM as you like for textures.
  TMEM is also not part of main RAM, like the RSP’s DMEM and IMEM.
  
  ChuckNorris89 4 years ago
  
  N64 chip was also SGI designed
  
  djmips 4 years ago
  
  Then that SGI team broke out to form ArtX, developed the GameCube hardware, then were snatched up by ATI and went on to form the foundation of ATI, now AMD GPUs
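
klodolph's description of TMEM as a manually managed cache can be sketched as follows. This is a toy model in Python, not N64 code: the `Tmem` class and `draw_frame` function are hypothetical, and the only facts taken from the thread are the 4K size and the idea that software reloads TMEM as often as it likes within a frame.

```python
TMEM_SIZE = 4096  # bytes of texture memory, per the comment above


class Tmem:
    """Toy model of a small texture memory that software fills by hand."""

    def __init__(self, size=TMEM_SIZE):
        self.size = size
        self.loaded = None       # name of the texture currently resident
        self.bytes_loaded = 0    # running total of bytes copied in

    def load(self, name, data):
        if len(data) > self.size:
            raise ValueError(f"{name} does not fit in TMEM")
        self.loaded = name
        self.bytes_loaded += len(data)


def draw_frame(tmem, draws, textures):
    """draws is a list of texture names, one per batch of triangles."""
    for name in draws:
        if tmem.loaded != name:      # manual "cache miss": reload by hand
            tmem.load(name, textures[name])
        # ... rasterize the batch using the resident texture ...
```

A frame that draws with three different textures copies more than 4K through TMEM in total, even though no single texture exceeds it. Sorting batches by texture would reduce the number of reloads.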
robert_foss 4 years ago

But since it is a tiled rendering architecture, which is normal for mobile applications and not how desktop GPUs are architected, it would be fair to call it a mobile GPU.
- Veliladon 4 years ago
  
  Nvidia appears to be an immediate mode renderer to the user but has used a tiled rendering architecture under the hood since Maxwell.
  
  pushrax 4 years ago
  
  According to the sources I've read, it uses a tiled rasterizing architecture but it's not deferred in the same way as typical mobile TBDR that bins all vertexes before starting rasterization, deferring all rasterization after all vertex generation, and flushing each tile to the framebuffer once.
  NV seems to rasterize vertexes in small batches (i.e. immediately) but buffers the rasterizer output on die in tiles. There can still be significant overlap between vertex generation and rasterization. Those tiles are flushed to the framebuffer, potentially before they are fully rendered, and potentially multiple times per draw call depending on the vertex ordering. They do some primitive reordering to try to avoid flushing as much, but it's not a full deferred architecture.
  
  monocasa 4 years ago
  
  Nvidia's is a tile-based immediate mode rasterizer. It's more a cache friendly immediate renderer than a TBDR.
  
  djmips 4 years ago
  
  And Maxwell is used in the Nintendo Switch, I guess that makes it mobile! This is a mostly pointless debate.
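
The binning step that pushrax describes for a typical mobile TBDR (sort all geometry into tiles first, rasterize each tile afterwards) can be sketched roughly like this. It is an illustration only: the function name, the 32-pixel tile size and the bounding-box test are assumptions, and real hardware uses its own tile sizes and more precise coverage tests.

```python
from collections import defaultdict

TILE = 32  # tile edge in pixels; an assumed value for illustration


def bin_triangles(triangles, width, height, tile=TILE):
    """Map (tile_x, tile_y) -> indices of triangles whose bounding box touches it."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the framebuffer, then convert to tiles.
        tx0 = max(0, int(min(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        tx1 = min(max_tx, int(max(xs)) // tile)
        ty1 = min(max_ty, int(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)
```

Once every triangle is binned, each tile can be rendered entirely on chip and written out once. Bounding-box binning is conservative, so a tile may list a triangle that does not actually cover any of its pixels.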
tomc1985 4 years ago

I actually had one of those cards! The only games I could get it to work with were Half-Life, glQuake, and Jedi Knight, and the bilinear texture filtering had some odd artifacting IIRC
deaddodo 4 years ago

To be fair, the architecture used in the early “desktop” variants was quite different from the modern mobile ones (MBX/SGX and beyond); excepting the TBDR.
iforgotpassword 4 years ago

Was it the kyro 2? I had one of these but killed it by overclocking... Would make for a good retro system.
- smcl 4 years ago
  
  The Kyro and Kyro 2 were a little after the Dreamcast.

tambourine_man 4 years ago

Few things are more enjoyable than reading a good bug story, even when it's not one's area of expertise. Well done.

alimov 4 years ago

I had the same thought. I really enjoy following along and getting a glimpse into the thought process of people working through challenges.

bob1029 4 years ago

I really appreciate the writing and work that was done here.

It is amazing to me how complicated these systems have become. I am looking over the source for the single triangle demo. Most of this is just about getting information from point A to point B in memory. Over 500 lines worth of GPU protocol overhead... Granted, this is a one-time cost once you get it working, but it's still a lot to think about and manage over time.

I've written software rasterizers that fit neatly within 200 lines and provide very flexible pixel shading techniques. Certainly not capable of running a cyberpunk 2077 scene, but interactive framerates otherwise. In the good case, I can go from a dead stop to final frame buffer in <5 milliseconds. Can you even get the GPU to wake up in that amount of time?
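
For readers who have not seen one, a software rasterizer core really can be small. The following is a generic edge-function sketch in Python, not the rasterizer described in this comment; the function names and the `shade` callback are assumptions made for illustration, and it makes no attempt at speed.

```python
def edge(a, b, p):
    """Signed area of the parallelogram spanned by a->b and a->p."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, width, height, shade):
    """Fill a width x height framebuffer; shade(w0, w1, w2) returns a pixel value."""
    v0, v1, v2 = tri
    area = edge(v0, v1, v2)
    framebuffer = [[0] * width for _ in range(height)]
    if area == 0:
        return framebuffer  # degenerate triangle covers nothing
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel centre
            # Barycentric weights; all non-negative means p is inside.
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                framebuffer[y][x] = shade(w0, w1, w2)
    return framebuffer
```

The `shade` callback receives the barycentric weights, which is where the flexible pixel shading comes from: any per-vertex attribute can be interpolated as `w0 * a0 + w1 * a1 + w2 * a2`.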

mef 4 years ago

with great optimization comes great complexity
- ip26 4 years ago
  
  Is it just optimization? I would call it capability. An automobile is more complicated to operate than a skateboard.
jfim 4 years ago

> In the good case, I can go from a dead stop to final frame buffer in <5 milliseconds. Can you even get the GPU to wake up in that amount of time?
Considering the fact that there are 240 Hz monitors nowadays, which means that an entire frame must be rendered in about 4ms, it has to be possible.
- plekter 4 years ago
  
  Modern GPUs and driver stacks usually have more than one frame in flight. You have to output a frame every 4ms, but you do not need the latency from the start of the application rendering code to the frame being on screen to be 4ms - pipelining is allowed. But keeping that pipelining down to a minimum is also important, as it contributes to input lag, which gamers care about.
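
The arithmetic behind the throughput versus latency distinction above, as a small Python sketch. The model is a deliberate simplification and an assumption, not a measurement: it treats every frame in flight as occupying exactly one refresh interval.

```python
def frame_interval_ms(refresh_hz):
    """Time between presented frames."""
    return 1000.0 / refresh_hz


def input_to_display_latency_ms(refresh_hz, frames_in_flight):
    """Rough latency when each in-flight frame occupies one refresh interval."""
    return frames_in_flight * frame_interval_ms(refresh_hz)
```

At 240 Hz the interval is about 4.17 ms, but with three frames in flight the same model gives about 12.5 ms from the start of rendering to the screen, so a 240 Hz display does not by itself show that a GPU can go from idle to a finished frame in under 5 ms.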

Jasper_ 4 years ago

Huh, I always thought tilers re-ran their vertex shaders multiple times -- once with position-only to do binning, and then again when computing for all attributes with each tile; that's what the "forward tilers" like Adreno/Mali do. That's crazy they dump all geometry to main memory rather than keeping it in pipe. It explains why geometry is more of a limit on AGX/PVR than Adreno/Mali.

pocak 4 years ago

That's what I thought, too, until I saw ARM's Hot Chips 2016 slides. Page 24 shows that they write transformed positions to RAM, and later write varyings to RAM. That's for Bifrost, but it's implied Midgard is the same, except it doesn't filter out vertices from culled primitives.
That makes me wonder whether the other GPUs with position-only shading - Intel and Adreno - do the same.
As for PowerVR, I've never seen them described as position-only shaders - I think they've always done full vertex processing upfront.
edit: slides are at https://old.hotchips.org/wp-content/uploads/hc_archives/hc28...
- Jasper_ 4 years ago
  
  Mali's slides here still show them doing two vertex shading passes, one for positions, and again for other attributes. I'm guessing "memory" here means high-performance in-unit memory like TMEM, rather than a full frame's worth of data, but I'm not sure!
atq2119 4 years ago

I was under that impression as well. If they write out all attributes, what is really the remaining difference to a traditional immediate mode renderer? Nvidia reportedly has vertex attributes going through memory for many generations already (and they are at least partially tiled...).
I suppose the difference is whether the render target lives in the "SM" and is explicitly loaded and flushed (by a shader, no less!) or whether it lives in a separate hardware block that acts as a cache.
- Jasper_ 4 years ago
  
  NV has vertex attributes "in-pipe" (hence mesh shaders), and the appearance of a tiler is a misread, it's just a change to the macro-rasterizer about which quads get dispatched first, it's not a true tiler.
  The big difference is the end of the pipe, as mentioned; whether you have ROPs or whether your shader cores load/store from a framebuffer segment. Basically, whether or not framebuffer clears are expensive (assuming no fast-clear cheats), or free.

daenz 4 years ago

That image gave me flashbacks of gnarly shader debugging I did once. IIRC, I was dividing by zero in some very rare branch of a fragment shader, and it caused those black tiles to flicker in and out of existence. Excruciatingly painful to debug on a GPU.

paulmd 4 years ago

debugging in situations where there is no ability to halt and step, or in some cases even log, is extremely extremely tricky. Embedded is another domain where that's super common... or drivers or other peripherals.
there probably are tools these days for debugging shaders, potentially commercial packages if Nsight Studio doesn't have it, but yeah, that sort of thing isn't easy.

stefan_ 4 years ago

> The Tiled Vertex Buffer is the Parameter Buffer. PB is the PowerVR name, TVB is the public Apple name, and PB is still an internal Apple name.

Patent lawyers love this one silly trick.

robert_foss 4 years ago

Seeing how Apple licensed the full PowerVR hardware before, they probably currently have a license for whatever hardware they based their design on.
- kimixa 4 years ago
  
  They originally claimed they completely redesigned it and announced they were therefore going to drop the PowerVR architecture license - that was the reason for the stock price crash and Imagination Technologies sale in 2017.
  Then they have since scrubbed the internet of all such claims and to this day pay for an architecture license. I think it's similar to an ARM architecture license - where it's a license for any derived technology and patents rather than actually being given the RTL for powervr-designed cores.
  I worked at PowerVR during that time (I have Opinions, but will try to keep them to myself), and my understanding was that Apple hadn't actually taken new PowerVR RTL for a number of years and had significant internal redesigns of large units (e.g. the shader ISA was rather different from the PowerVR designs of the time), but presumably they still use enough of the derived tech and ideas that paying the architecture license is necessary. This transfer was only one way - we never saw anything internal about Apple's designs, so reverse engineering efforts like this are still interesting.
  And as someone who worked on the PowerVR cores (not the Apple derivatives), I can assure you that everything discussed in the original post is extremely familiar.
  
  girvo 4 years ago
  
  While I can understand why you might not want to in a public forum, I know I’d personally love to hear your Opinions on the matter!
  
  kimixa 4 years ago
  
  It's a small enough group that I'm likely personally identifiable anyway, so will be vague and try to stick to public info and my personal conjecture.
  Let's just say that the legal shenanigans of the time caused me to lose my job (part of the sale of Imagination Technologies required closing some countries offices to avoid more interference from various regulatory bodies). Judge bias accordingly.
  And all their noise about "Ground up redesign using no PowerVR tech" kinda conflicts with them still to this day paying for an architecture license - the very thing that they claimed they would be dropping in their press release that caused the imagination technologies share crash and corresponding sale. And this is without even going to court - they issued a press release then immediately relented (and have continued to relent for over 5 years now) at the slightest question. And then scrubbed all mention of that press release.
  My general suspicion is apple intended to game the market by intentionally dropping the share price and simply purchase PowerVR at a discount - but in the process pissed off enough people that they rejected the offer, even if it was "better" in terms of value. Or just let them go under and pick everything they want off the resulting fire sale - I heard rumors that apple had already put in an offer to purchase the company that was rejected, and under UK regulation a failed takeover attempt can't be re-attempted for some time, and much of this happened within that window (again, according to fuzzy scuttlebutt, nothing definite).
  That or the legal/C-suite of apple don't actually speak to the engineers of apple anymore - they honestly thought that it was a completely ground-up design that didn't derive anything from PowerVR tech, and just sent out the press release thinking "Why are we paying for this??" - then the engineers shuffled in saying that actually they couldn't put together anything better that wasn't a direct derivative, and their noise about a completely internally designed-from-scratch apple GPU was a bit of a stretch.
- pyb 4 years ago
  
  Apple's claim is that they designed it themselves. https://en.wikipedia.org/wiki/Talk:Apple_M1#[dubious_%E2%80%...
  
  gjsman-1000 4 years ago
  
  There's no reason that couldn't be a half-truth - it could be a PowerVR with certain components replaced, or even the entire GPU replaced but with PowerVR-like commands and structure for compatibility reasons. Kind of like how AMD designed their own x86 chip despite it being x86 (Intel's architecture).
  Also, if you read Hector Martin's tweets (he's doing the reverse-engineering), Apple replacing the actual logic while maintaining the "API" of sorts is not unheard of. It's what they do with ARM themselves - using their own ARM designs instead of the stock Cortex ones while maintaining ARM compatibility.*
  *Thus, Apple has a right to the name "Apple Silicon" because the chip is designed by Apple, and just happens to be ARM-compatible. Other chips from almost everyone else use stock ARM designs from ARM themselves. Otherwise, we might as well call AMD an "Intel design" because it's x86 by the same logic.
  
  rjsw 4 years ago
  
  > Apple replacing the actual logic while maintaining the "API" of sorts is not unheard of.
  They did this with ADB, early PowerPC systems contained a controller chip that has the same API that was implemented in software in the 6502 IOP coprocessor in the IIfx/Q900/Q950.
  
  quux 4 years ago
  
  Didn't Apple have a large or even dominant role in the design of the ARM64/AArch64 architecture? I remember reading somewhere that they developed ARM64 and essentially "gave it" to ARM who accepted but nobody could understand at the time why a 64 bit extension to ARM was needed so urgently, and why some of the details of the architecture had been designed the way they had. Years later with Apple Silicon it all became clear.
  
  kalleboo 4 years ago
  
  The source is a former Apple engineer (now at Nvidia apparently)
  https://twitter.com/stuntpants/status/1346470705446092811
  > arm64 is the Apple ISA, it was designed to enable Apple’s microarchitecture plans. There’s a reason Apple’s first 64 bit core (Cyclone) was years ahead of everyone else, and it isn’t just caches
  > Arm64 didn’t appear out of nowhere, Apple contracted ARM to design a new ISA for its purposes. When Apple began selling iPhones containing arm64 chips, ARM hadn’t even finished their own core design to license to others.
  > ARM designed a standard that serves its clients and gets feedback from them on ISA evolution. In 2010 few cared about a 64-bit ARM core. Samsung & Qualcomm, the biggest mobile vendors, were certainly caught unaware by it when Apple shipped in 2013.
  > > Samsung was the fab, but at that point they were already completely out of the design part. They likely found out that it was a 64 bit core from the diagnostics output. SEC and QCOM were aware of arm64 by then, but they hadn’t anticipated it entering the mobile market that soon.
  > Apple planned to go super-wide with low clocks, highly OoO, highly speculative. They needed an ISA to enable that, which ARM provided.
  > M1 performance is not so because of the ARM ISA, the ARM ISA is so because of Apple core performance plans a decade ago.
  > > ARMv8 is not arm64 (AArch64). The advantages over arm (AArch32) are huge. Arm is a nightmare of dependencies, almost every instruction can affect flow control, and must be executed and then dumped if its precondition is not met. Arm64 is made for reordering.
  
  travisgriggs 4 years ago
  
  > > M1 performance is not so because of the ARM ISA, the ARM ISA is so because of Apple core performance plans a decade ago.
  This is such an interesting counterpoint to the occasional “Just ship it” screed (just one yesterday I think?) we see on HN.
  I have to say, I find this long form delivery of tech to be enlightening. That kind of foresight has to mean some level of technical savviness at high decision making levels. Whereas many of us are caught at companies with short sighted/tech naive leadership who clamor to just ship it so we can start making money and recoup the money we’re losing on these expensive tech type developers.
  
  kif 4 years ago
  
  I think the "just ship it" method is necessary when you're small and starting out. Unless you are well funded, you couldn't afford to do what Apple did.
  
  zozbot234 4 years ago
  
  Either way, "designed a new ISA" really should be "came up with yet another cleaned-up MIPS RISC". Does it really matter who did the work?
  
  adrian_b 4 years ago
  
  AArch64 does not resemble MIPS at all (beyond the fact that both use fixed-length instructions and separate register-register and load-store instruction groups; these RISC principles had already been used in IBM 801 about 5 years before MIPS, and then they have been used in more than a dozen other CPU architectures, many of which are more similar to AArch64 than MIPS is).
  Therefore, there is no basis for saying that AArch64 is a cleaned-up MIPS-like ISA. Only RISC-V is a MIPS-like ISA.
  One of the few features of AArch64 that can be said to be similar to MIPS was its main mistake.
  In the initial ARMv8.0 version, the only means provided for implementing atomic operations was a load-and-reserve/store-conditional instruction pair.
  This kind of instruction has been popularized by MIPS II, but it had not been invented by MIPS, but by Jensen et al. (November 1987), for the S-1 AAP multiprocessor.
  While this instruction pair allows the implementation of lock-free/wait-free data structures, it can be extremely inefficient for implementing locks in systems with many cores (because progress is not guaranteed), so in the ARMv8.1 version the initial mistake has been corrected, by adding atomic instructions of the type fetch-and-op, besides the MIPS-like LL/SC pair.
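
A toy illustration of the difference adrian_b describes, in Python. This is not real hardware behaviour and the `Cell` class is hypothetical: it only models the idea that a store-conditional fails when another core wrote to the location after the load-linked, so an LL/SC loop can retry an unbounded number of times under contention, whereas a single fetch-and-op always completes.

```python
class Cell:
    """Toy single-word memory with LL/SC reservations (not real hardware)."""

    def __init__(self, value=0):
        self.value = value
        self.reserved_by = set()

    def load_linked(self, cpu):
        self.reserved_by.add(cpu)
        return self.value

    def store_conditional(self, cpu, value):
        if cpu not in self.reserved_by:
            return False             # reservation lost: caller must retry
        self.value = value
        self.reserved_by.clear()     # a successful store breaks all reservations
        return True

    def fetch_add(self, amount):
        """Single atomic fetch-and-op, modelled as one indivisible step."""
        old = self.value
        self.value = old + amount
        self.reserved_by.clear()
        return old


def llsc_fetch_add(cell, cpu, amount, interfere=None):
    """Emulate fetch-and-add with an LL/SC retry loop; returns (old, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        old = cell.load_linked(cpu)
        if interfere is not None:
            interfere(attempts)      # hook: another core may write here
        if cell.store_conditional(cpu, old + amount):
            return old, attempts
```

The `interfere` hook stands in for a second core. Every time it writes between the load-linked and the store-conditional, the first core loses its reservation and has to start over, which is the lack of guaranteed progress mentioned above.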
  
  zozbot234 4 years ago
  
  The complex features of ARM64 were not newly designed by and large, they were mostly carried over from ARM32 - mostly to take advantage of shared ARM32/ARM64 implementation. Much of the actual design work involved in ARM64 was simplification and things like adding a zero register to the ISA, which is pretty comparable to MIPS.
  
  adrian_b 4 years ago
  
  A zero register already existed in some computers with vacuum tubes, almost 70 years ago, this is not a new idea that can be attributed to MIPS or RISC.
  It is a good feature, which can reduce substantially the number of instructions that must be implemented, because many single-operand operations are just special cases of double-operand operations with one null operand.
  This is why it was used in many early computers, which had to be simple due to the limitations of their technology, and then it was used again in most RISC CPUs, which have been simplified intentionally (and not only in MIPS; among the more successful RISC ISAs also IBM POWER has it; only 32-bit ARM does not have it, due to its unusually low number of general-purpose registers, in comparison with the other RISC ISAs).
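
The point about a zero register shrinking the instruction set can be shown with a tiny interpreter. This is a sketch in Python with hypothetical names (`Machine`, `ZR`); the three aliases mirror how AArch64 assemblers treat register `MOV`, `NEG` and `CMP` as forms of `ORR`, `SUB` and `SUBS` that use the zero register.

```python
ZR = "zr"  # hypothetical name for the always-zero register


class Machine:
    def __init__(self):
        self.regs = {}
        self.zero_flag = False

    def read(self, reg):
        return 0 if reg == ZR else self.regs.get(reg, 0)

    def write(self, reg, value):
        if reg != ZR:                # writes to the zero register are dropped
            self.regs[reg] = value

    # The only "real" instructions: three-operand ORR and SUB.
    def orr(self, dst, a, b):
        self.write(dst, self.read(a) | self.read(b))

    def sub(self, dst, a, b, set_flags=False):
        result = self.read(a) - self.read(b)
        if set_flags:
            self.zero_flag = result == 0
        self.write(dst, result)

    # Single-operand and compare operations fall out as aliases.
    def mov(self, dst, src):
        self.orr(dst, ZR, src)

    def neg(self, dst, src):
        self.sub(dst, ZR, src)

    def cmp(self, a, b):
        self.sub(ZR, a, b, set_flags=True)
```

Only `orr` and `sub` do any work here; move, negate and compare need no decoding or execution logic of their own, which is the saving described above.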
  
  quux 4 years ago
  
  Thanks!
  
  dann0 4 years ago
  
  Apple, Acorn Computers and VLSI were founding partners of ARM, if I remember correctly.
  My StrongArm powered RiscPC was amazing for the time. It was strange that the contemporaneous Newton was powered by the same (and in some ways better) processor.
  The connection between ARM processors being used in desktop and mobile devices is in its early DNA.
  
  pyb 4 years ago
  
  I haven't followed the announcements CPU side - do Apple clearly claim that they designed their own CPU (with an ARM instruction set)?
  
  stephen_g 4 years ago
  
  They do, and their microarchitecture is unambiguously, hugely different to anything else (some details in 1). The last Apple Silicon chip to use a standard Arm design was the A5X, whereas they were using customised PowerVR GPUs until I think the A11.
  1. https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
  
  daneel_w 4 years ago
  
  They are one of a handful of companies that hold a license allowing them to both customize the reference core and to implement the Arm ISA through their own silicon design. Everyone else's SoCs all use the same Arm reference mask. Qualcomm also holds such a license, which is why their Snapdragon SoCs, just like Apple's A- and M-series, occupy a performance tier above everything else Arm.
  
  masklinn 4 years ago
  
  According to Hector Martin (the project lead of Asahi) in previous threads of the subject[0], Apple actually has an "architecture+" license which is completely exclusive to them, thanks to having literally been at the origins of ARM: not only can Apple implement the ISA on completely custom silicon rather than license ARM cores, they can customise the ISA (as in add instructions, as well as opt out of mandatory ISA features).
  [0] https://news.ycombinator.com/item?id=29798744
  
  happycube 4 years ago
  
  The only Qualcomm-designed 64-bit mobile core so far was the Kryo core in the 820. They then assigned that team to server chips (Centriq), then sacked the whole team when they felt they needed to cut cash flow to stave off Avago/Broadcom. The "Kryo" cores from the 835 on are rebadged/adjusted ARM cores.
  IMO the Kryo/820 wasn't a major failure; it turned out a lot better than the 810, which had A53/A57 cores.
  And then they decided they needed a mobile CPU team again and bought Nuvia for ~US$1 Billion.
  
  pyb 4 years ago
  
  Such a license is a big clue, but not quite what I was enquiring about...
  
  gjsman-1000 4 years ago
  
  Qualcomm did use their own design called Kryo for a little while, but is now focusing on cores designed by Nuvia, which they just bought, for the future.
  As for Apple, they've designed their own cores since the Apple A6 which used the Swift core. If you go to the Wikipedia page, you can actually see the names of their core designs, which they improve every year. For the M1 and A14, they use Firestorm High-Performance Cores and Icestorm Efficiency Cores. The A15 uses Avalanche and Blizzard. If you visit AnandTech, they have deep-dives on the technical details of many of Apple's core designs and how they differ from other core designs including stock ARM.
  The Apple A5 and earlier were stock ARM cores, the last one they used being Cortex A9.
  For this reason, Apple is about as much an ARM chip as AMD is an Intel chip. Technically compatible, implementation almost completely different. It's also why Apple calls it "Apple Silicon" and it is not just marketing, but actually justified just as much as AMD not calling their chips Intel derivatives.
  
  amaranth 4 years ago
  
  Kryo started as custom but flopped in the Snapdragon 820, so they moved to a "semi-custom" design; it's unclear how different it really is from the stock Cortex designs.
  
  GeekyBear 4 years ago
  
  > Qualcomm did use their own design called Kryo for a little while
  Before that, they had Scorpion and Krait, which were both quite successful 32-bit ARM-compatible cores at the time.
  Kryo started as an attempt to quickly launch a custom 64-bit ARM core, and the attempt failed badly enough that Qualcomm abandoned designing their own cores and turned to licensing semi-custom cores from ARM instead.
  
  paulmd 4 years ago
  
  To be blunt, you're asking about questions that could be solved with a quick google and you are coming off as a bit of a jerk asking for very specific citations with exact specific wording for basic facts like this that, again, could be solved by looking through the wikipedia for "apple silicon" and then bouncing to a specific source. People have answered your question and you're brushing them off because you want it answered in an exact specific way.
  https://en.wikipedia.org/wiki/Apple_silicon
  https://www.anandtech.com/show/7335/the-iphone-5s-review/2
  > NVIDIA and Samsung, up to this point, have gone the processor license route. They take ARM designed cores (e.g. Cortex A9, Cortex A15, Cortex A7) and integrate them into custom SoCs. In NVIDIA’s case the CPU cores are paired with NVIDIA’s own GPU, while Samsung licenses GPU designs from ARM and Imagination Technologies. Apple previously leveraged its ARM processor license as well. Until last year’s A6 SoC, all Apple SoCs leveraged CPU cores designed by and licensed from ARM.
  > With the A6 SoC however, Apple joined the ranks of Qualcomm with leveraging an ARM architecture license. At the heart of the A6 were a pair of Apple designed CPU cores that implemented the ARMv7-A ISA. I came to know these cores by their leaked codename: Swift.
  Yes, Apple has been designing and using non-reference cores since the A6 era, and were one of the first to the table with ARMv8 (apple engineers claim it was designed for them under contract to their specifications, but this part is difficult to verify with anything more than citations from individual engineers).
  I expect that Apple has said as much in their presentations somewhere, but if you're that keen on finding such an incredibly specific attribution, then knock yourself out. It'll be in an apple conference somewhere, like WWDC. They probably have said "apple-designed silicon" or "custom core" at some point, and that would be your citation - but they also sell products, not hardware, and they don't extensively talk about their architectures since they're not really the product, so you probably won't find a deep-dive like Anandtech from Apple directly where they say "we have 8-wide decode, 16-deep pipeline... etc" sorts of things.
  
  daneel_w 4 years ago
  
  The otherworldly performance-per-watt would be another.
- brian_herman 4 years ago
  
  Also lawyers that can keep it in court long enough for a redesign.

danw1979 4 years ago

Alyssa and the rest of the Asahi team are basically magicians as far as I can tell.

What amazing work and great writing that takes an absolute graphics layman (me) on a very technical journey yet it is still largely understandable.

sh33sh 4 years ago

Really enjoyed the way it was written

GeekyBear 4 years ago

Alyssa's writing style steps you through a technical mystery in a way that remains compelling even if you lack the domain knowledge to solve the mystery yourself.

ninju 4 years ago

> Comparing a trace from our driver to a trace from Metal, looking for any relevant difference, we eventually stumble on the configuration required to make depth buffer flushes work.

> And with that, we get our bunny.

So what was the configuration that needed to change? Don't leave us hanging!!!

lukasb 4 years ago

Can you imagine if you had her problem and found her post, and then realized she'd omitted that detail?
(Yes she tells you how to figure it out yourself)

quux 4 years ago

Impressive work and really interesting write up. Thanks!

dry_soup 4 years ago

Very interesting and easy to follow writeup, even for a graphics ignoramus like myself.

aosmith 4 years ago

This is present in a lot of Unreal Engine games running on Mac OS X too. Tomb Raider is a great example.

thanatos519 4 years ago

What an entertaining story!

542458 4 years ago

It's been said more than a few times in the past, but I cannot get over just how smart and motivated Alyssa Rosenzweig is - she's currently an undergraduate university student, and was leading the Panfrost project when she was still in high school! Every time I read something she wrote I'm astounded at how competent and eloquent she is.

nyanpasu64 4 years ago

It's surprising to me the deep contrast between this awe-inspiring deep technical wizardry, and the sometimes-incompetence of driver developers (at least the impression I get from a month spent reverse-engineering Windows drivers) and poor pay of embedded programmers (https://news.ycombinator.com/item?id=31364360). I don't know if striving to develop this kind of deep knowledge myself (though I don't know if I'll ever learn all the skills she has today) is a useful work skill; I get the impression that deep knowledge of how to optimize compilers/compute/apps/servers at the assembly/cache level pays very well (despite being much more similar to embedded compared to web/mobile or backend development).
azinman2 4 years ago

Does anyone know if she has a proper interview somewhere? I'd love to know how she got so technical in high school to be able to reverse engineer a GPU -- something I would have no idea how to start even with many more years experience (although admittedly I know very little about GPUs and don't do graphics work).
pciexpgpu 4 years ago

Undergrad? I thought she was some Staff SWE in an OSS company. Seriously impressive, and ought to give anyone imposter syndrome.
- gjsman-1000 4 years ago
  
  Well, Alyssa is, and works for Collabora while also being an undergrad.
- coverband 4 years ago
  
  I was about to post "very impressive", but that seems a huge understatement after finding out she's still in school...
aero-glide2 4 years ago

Have to admit, whenever I see people much younger than me do great things I get very depressed.
- kif 4 years ago
  
  I used to feel this way, too. However, every single one of us has their own unique circumstances.
  I can't give too many details unfortunately. But, there's a specific step I took in my career, which was completely random at the time. I was still a student, and I decided not to work somewhere. I resigned two weeks in. Had I not done that, I wouldn't be where I am today. My situation would be totally different.
  Yes, some people are very talented. But it does take quite a lot of work and dedication. And yes, sometimes you cannot afford to dedicate your time to learning something because life happens.
- ohgodplsno 4 years ago
  
  Be excited! This means amazing things are coming, from incredibly talented people. And even better when they put out their knowledge in public, in an easy to digest form, letting you learn from them.
- ip26 4 years ago
  
  I get that. But then I remember at that age, I was only just cobbling together my very first computer from the scrap bin. An honest comparison is nearly impossible.
- Shorel 4 years ago
  
  When I see young people do these things, I get happy and confident. Hopeful for a better future.
  When I see talent wasted in things like scientology, then I get depressed.
- cowvin 4 years ago
  
  No need to be depressed. It's not a competition between you. You can find inspiration in what others achieve and try to achieve more yourself.
- duckydude20 4 years ago
  
  me too, and idk how to cope up with this... i see younger guys than me creating os, and i am here achieved nothing in life... i feel so sad, so depressed, my mood flips so hard, sometimes i feel like just leaving everything and getting away. i know it's not a competition and i don't want to win this, i just want to point at one thing and say, this is created by me. That's all i want and nothing else... But i have nothing in hand... And this happens every time... and i get depressed and start crying...
- pimeys 4 years ago
  
  And for me, her existence is enough to keep me from getting depressed about my industry. Whatever she's doing is keeping my hopes up for computer engineering.
- hycaria 4 years ago
  
  I don’t really envy her because her personal life doesn’t seem as amazing as her cs skills. She wrote about it at some point, I’m pretty sure.
frostwarrior 4 years ago

While I was reading I was already thinking that. I can't believe how smart and an awesome developer she is.
lynguist 4 years ago

She finished high school in 2019 and I assume she was 18 at that time. She would be 21 now. I am in awe of her.
Here’s her CV: https://rosenzweig.io/resume.pdf