johnklos 3 days ago

This clarifies the dispute over AMD's settlement of the core count lawsuit a bit. I've always thought the one FPU per cluster of two cores is what decided the case, and observations like these continue to show that the two individual integer cores didn't share enough to starve each other of resources:

> When running two threads in the module, Bulldozer can provide twice as much L2 bandwidth. That suggests the L2 has separate paths to each thread’s load/store unit, and lines up with AMD’s publications.

I never knew that the 32 nm process had so many issues. They certainly were never as bad as the 90 nm issues, but perhaps after all the noise about 90 nm, problems with 32 nm just weren't worth writing about in the news.

I wonder if my specific needs just matched the Bulldozer CPU particularly well, because I didn't see any significant advantages in Sandy Bridge over Bulldozer. Then again, I was running four separate -J 2 compiles concurrently, which may've been a particularly good use case for Bulldozer.

I think I'll try out some new benchmarks using a modern OS and toolchain on both Sandy Bridge and Bulldozer to see how things have changed since 2011 / 2012.
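
A rough sketch of how that kind of workload (several independent jobs started at once) could be timed on each machine, assuming Python 3 is available on both. The `run_concurrent` helper and the placeholder jobs are illustrative assumptions, not the setup used back in 2011 / 2012:

```python
import subprocess
import sys
import time


def run_concurrent(commands):
    """Start every command at once, wait for all of them,
    and return (elapsed_seconds, exit_codes)."""
    start = time.perf_counter()
    # Launch all processes first so they really do overlap.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # Then wait for each one and collect its exit status.
    exit_codes = [p.wait() for p in procs]
    elapsed = time.perf_counter() - start
    return elapsed, exit_codes


if __name__ == "__main__":
    # Placeholder workload: four independent, trivial CPU-bound jobs.
    # Replace each entry with a real build command for an actual test.
    jobs = [[sys.executable, "-c", "sum(range(10**6))"] for _ in range(4)]
    elapsed, codes = run_concurrent(jobs)
    print(f"{len(jobs)} jobs finished in {elapsed:.2f} s, exit codes {codes}")
```

To approximate the four-compile case, each placeholder could be swapped for a parallel build of a separate source tree, and the same script run unchanged on both the Sandy Bridge and the Bulldozer box. Wall-clock time for the whole batch is the number to compare.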

chaimanmeow 3 days ago

̶g̶o̶o̶d̶ ̶l̶i̶n̶k̶ ̶a̶n̶d̶ ̶r̶e̶a̶d̶ ̶b̶u̶t̶ ̶d̶u̶p̶

My 2 cents,

I'd like to see Zen 2 merged with some CMT ingenuity for their low-power, high-efficiency branch of cores. Introduce some clustering on one of these Zen 2 CCXs, and arrange this cluster with SMT4 on the FPU side while keeping SMT2 on the integer side. This would be an analogue to the Piledriver family's SMT1/SMT2 int/fpu scheme.

  • Symmetry 2 days ago

    Clustering certainly has some advantages on low power cores but I wouldn't expect them to go with SMT on lower power cores. Adding SMT doesn't cost much die area but it adds a ton of difficulty to validation. I'd say for little cores you're generally better off with more, smaller, SMT1 cores than fewer, larger SMT2 cores. Where SMT really shines on consumer workloads is that sometimes you only have a few threads, which is why you want it on your big cores.

    And as an aside, Zen actually has the exact same arrangement of gates in both the bigger and smaller configuration, the smaller one just uses a more consistently small transistor size while the larger one scales transistor size to fanout more aggressively.

    One place where clustering is being used right now is in Arm's new A510, where two cores share an FPU and L2 but, unlike Bulldozer, not a front end.

  • gautamcgoel 3 days ago

    It's not a dup, that's Part I of the article, this is Part II.

    • chaimanmeow 3 days ago

      oh thanks and sorry! more to catch up on reading then!! sorry for my oops.

      • pvg 3 days ago

        follow-ons (usually) count as dupes, so it would be an HN dupe if it had significant discussion, but it doesn't, so it's not.