jhallenworld 2 months ago

It could be faster:

    lda #firstbyte
    sta dest+0
    lda #secondbyte
    sta dest+1
6 cycles per byte, so 166 KB/sec. But we can do better: there are only 256 different bytes, so group the stores by byte value so that you don't have to reload before each one:

    lda #0
    sta dest+10  ; All the places that get 0
    sta dest+29
    sta dest+42
    lda #1
    sta dest+3   ; All the places that get 1
    sta dest+82
Up to 250 KB/sec and less space. If the prior data in dest is known, it could be even faster: skip stores where the destination already has the correct data. And even faster: for video, skip stores which have a low Hamming distance from the prior bytes and put up with a bit of noise.
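
The two throughput figures follow from per-instruction cycle counts. A rough Python sketch of the arithmetic, assuming a 1 MHz clock (the thread does not state the clock rate) and the usual NMOS 6502 timings of 2 cycles for LDA immediate and 4 cycles for STA absolute; the function names are made up for illustration:

```python
# Assumptions (not stated in the thread): 1 MHz clock, NMOS 6502 timings.
CLOCK_HZ = 1_000_000
LDA_IMM = 2   # cycles for LDA #imm
STA_ABS = 4   # cycles for STA to a 16-bit address


def naive_cycles(data):
    """One LDA/STA pair for every byte."""
    return len(data) * (LDA_IMM + STA_ABS)


def grouped_cycles(data):
    """One LDA per distinct byte value, one STA per byte."""
    return len(set(data)) * LDA_IMM + len(data) * STA_ABS


def bytes_per_second(cycles, nbytes):
    """Throughput for copying nbytes in the given number of cycles."""
    return CLOCK_HZ * nbytes / cycles
```

With grouping, the LDA cost is paid at most 256 times, so for large buffers the rate approaches the 250 KB/sec limit set by the 4-cycle store.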
  • vikingerik 2 months ago

    Even faster: use the X register instead of A, because there's an increment instruction for it (INX) that takes only 2 cycles instead of 3 to do LDA with the next value.

    Also FYI, that 6 cycles to load/store each byte is only if you're writing to zero page (an 8-bit address, meaning the first 256 bytes of the address space.) Stores to a 16-bit address take one more cycle, 4 instead of 3, to read the operand when it's a two-byte address instead of one.

  • tedunangst 2 months ago

    How do you know which places get 0 bytes?

    • jhallenworld 2 months ago

      With pre-processing: sort the source data so that the same bytes (along with their offsets) are grouped together, then convert this to code. It could be useful for drawing things like fixed sprites quickly.
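
      The pre-processing step can be sketched in a few lines of Python. This is only an illustration, not the tool anyone in the thread used: the function name `generate_stores`, the `dest` label, and the optional `previous` buffer (for skipping stores that would not change anything) are all assumptions.

```python
from collections import defaultdict


def generate_stores(data, dest="dest", previous=None):
    """Return 6502 assembly lines that write `data` to `dest`,
    grouped so that each byte value is loaded only once."""
    offsets_by_value = defaultdict(list)
    for offset, value in enumerate(data):
        # Optionally skip stores whose destination already holds this byte.
        if previous is not None and previous[offset] == value:
            continue
        offsets_by_value[value].append(offset)

    lines = []
    for value in sorted(offsets_by_value):
        lines.append(f"    lda #{value}")
        for offset in offsets_by_value[value]:
            lines.append(f"    sta {dest}+{offset}")
    return lines
```

      Calling `print("\n".join(generate_stores(sprite_bytes)))` on a hypothetical `sprite_bytes` buffer would produce text in the same shape as the grouped listing at the top of the thread.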

      • rzzzt 2 months ago

        This presentation of "A Pig Quest" for the C64 describes similar methods for efficient bitmap copying (the timestamped part is more relevant, but I recommend watching the entire thing): https://youtu.be/I8CTKeh8N0Q?t=578

olliej 2 months ago

One thing I've always wondered is how fast you could make these chips go with modern fabrication - I know it's not necessarily just a trivial scale down design+scale up clock speed, but assuming the same fundamental design with necessary changes to keep things functioning (gate delays and stuff seem like they'd be the big changes?).

Is anyone here an EE who has access to the original circuit design, knows how to translate it to VHDL, Verilog, or what have you, and also just happens to have the money lying around to have TSMC tape out the design so we can find out? :D

  • toast0 2 months ago

    Your biggest limit, IMHO, is that the 6502 does a memory access every cycle (some of the accesses are unused), so you need your memory (and peripherals) to be able to respond fast enough, or you need to lengthen clock cycles when accessing slow peripherals. I think RAM access times tend to be in the tens of nanoseconds, so I'd guess you're going to have trouble scaling beyond 100 MHz without exotic RAM or a different design. I don't know enough about the internals to speculate whether they could manage higher clock speeds, but I'd guess it would be fine?
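
    The 100 MHz guess is just the reciprocal of the access time. A small Python sketch of that arithmetic, under the simplifying assumption that a memory access may use the entire clock cycle (a real 6502 bus leaves less than that); both function names are hypothetical:

```python
def max_clock_mhz(access_time_ns):
    """Highest clock (MHz) at which one access still fits in one cycle."""
    return 1000 / access_time_ns


def wait_states(clock_mhz, access_time_ns):
    """Extra cycles needed per access at the given clock (integer inputs)."""
    cycles = -(-(access_time_ns * clock_mhz) // 1000)  # ceiling division
    return max(0, cycles - 1)
```

    With 10 ns RAM this gives a 100 MHz ceiling, and one wait state per access if the clock is pushed to 200 MHz.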

    • vikingerik 2 months ago

      The 6502 can only address 2^16 bytes of memory, so you could just build that as a modern L1 cache. Heck, you might even be able to put 2^16 registers on the die these days. (You could say this falls under your "exotic ram" clause.)

      You could also do pipelining and parallel/out-of-order execution like modern processors, if you weren't worried about retaining cycle-for-cycle compatibility. You could even build it with a wider bus, so that you could read all of a 2/3/4 byte instruction in one cycle. Although at that point it's really a matter of semantics whether you call it a 6502 or a new processor with binary compatibility.

      • toast0 2 months ago

        > you could just build that as a modern L1 cache.

        I think this could work, although if that means the RAM is on die, setting up the memory map is much more complex (although maybe you just have a chip select pin for the on-die RAM, and have your address decoder circuitry select it when appropriate for your system).

        > You could also do pipelining and parallel/out-of-order execution like modern processors

        I don't think that fits the prompt of same fundamental design though.

        • flohofwoe 2 months ago

          > setting up the memory map is much more complex

          This is where the Z80 would come in handy with its separate 16-bit IO address space (accessed through special IO instructions and on the hardware level an IORQ output pin). The Z80 can also easily handle slow memory and IO devices through its WAIT input pin (but of course this just causes the CPU to slow down to the speed of the memory or IO device).

      • shadowofneptune 2 months ago

        The zero page is small enough to fit into a single vector register nowadays. Since the zero page is used like a register file, that alone could be a big advantage.

    • yjftsjthsd-h 2 months ago

      Could you create a 6502 with cache and pipelining? My understanding is that those are both implementation details that shouldn't affect the "userspace ABI" that code running on the chip can see, other than making performance nonuniform (because it slows for every cache miss and branch misprediction). Obviously that only helps with true memory access; I don't think there's anything to be done for true I/O.

      • vikingerik 2 months ago

        I/O and memory access are the same thing for the 6502. All it can do is read or write from a bus with 16 address bits and 8 data bits. The environment/motherboard might have memory that responds to some combinations of those address bits, and it might have something like say an NES PPU that responds to other combinations, by handling those writes as tile or sprite data and generating a video output signal.

        You could hypothetically make a 6502 with a wider or dual bus, so that it could read or write more than one byte per cycle or both at the same time, if whatever else is on the bus on the other ends of the reads/writes is compatible with that. In that way you could speed up executing 6502 instructions by more than just increasing clock speed.
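
        To make the "same bus, different responders" point concrete, here is a minimal Python sketch of address decoding. The `Bus` class, the address ranges, and the `video_writes` list standing in for a video chip are all invented for illustration; unmapped reads returning 0xFF is likewise an assumption.

```python
class Bus:
    """16 address bits, 8 data bits; devices claim address ranges."""

    def __init__(self):
        self.regions = []  # (start, end, read_fn, write_fn)

    def attach(self, start, end, read_fn, write_fn):
        self.regions.append((start, end, read_fn, write_fn))

    def _decode(self, addr):
        addr &= 0xFFFF
        for start, end, read_fn, write_fn in self.regions:
            if start <= addr <= end:
                return addr - start, read_fn, write_fn
        return None

    def read(self, addr):
        hit = self._decode(addr)
        if hit is None:
            return 0xFF  # nothing responds: assumed open-bus value
        offset, read_fn, _ = hit
        return read_fn(offset) & 0xFF

    def write(self, addr, value):
        hit = self._decode(addr)
        if hit is not None:
            offset, _, write_fn = hit
            write_fn(offset, value & 0xFF)


ram = bytearray(0x0800)
video_writes = []  # records (register, value) writes to the fake video chip

bus = Bus()
bus.attach(0x0000, 0x07FF, ram.__getitem__, ram.__setitem__)
bus.attach(0x2000, 0x2007, lambda reg: 0,
           lambda reg, value: video_writes.append((reg, value)))
```

        From the CPU's side, `bus.write(0x0010, 0x42)` and `bus.write(0x2006, 0x3F)` are the same operation; only the decoder decides that one lands in RAM and the other in a device register.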

      • flohofwoe 2 months ago

        Plugging such an "advanced 6502" into a traditional home computer design would break pretty much all games and scene demos, because those depend on "hard realtime" behaviour of the whole system down to single clock cycles (this is also why cycle-correct emulators are slower than emulators with less strict timing requirements - a cycle-correct emulator can only take shortcuts for behaviour that doesn't "leak" into the rest of the system).

      • ack_complete 2 months ago

        6502s or 65C02s were used with cache in many Apple II accelerators. Apple licensed one of them for the Apple IIc.

  • sgtnoodle 2 months ago

    It's pretty common to bake 8051 MCU cores into ASICs. They'll run at several hundred MHz. It seems kind of like lawnmower racing.

    You can get a 100 MHz FPGA-based 6502. http://www.e-basteln.de/computing/65f02/65f02/

    Since FPGAs are significantly slower than ASICs, it seems like a 1 GHz 6502 wouldn't be unreasonable. The problem is, what would you do with it? The CPU would be severely bottlenecked by its peripheral interfaces.

  • hotpotamus 2 months ago

    It's an interesting question. I get that taping one out is a joke, but wouldn't it be possible to implement one with an FPGA? I know almost nothing about it, but I know that people have essentially built out new Super Nintendos with them, so the 6502 should be doable.

  • krallja 2 months ago

    Western Design Center will happily license the soft core IP of the 65C02 for you to use in a larger chip design.

    • jhgb 2 months ago

      Surely you could design one of your own? The complexity seems low enough for a hobby project; it's not like trying to recreate an IBM Z CPU.

      • anonymousiam 2 months ago

        Yes, the complexity of the 6502 is low. In 1979 I took an "Introduction to Microprocessors" course (which had a bunch of digital design prerequisites). During one session, the instructor handed out 3x5 cards to the (25-30 or so) students. (Not all students got a card.) Each card represented some function or register of the 6502 that the student would emulate. The student who got the "bus" was the busiest and had to walk around moving data between the other students. The "clock" would send fetch/execute triggers, the instructions would come across the "bus" from memory into the "instruction decoder" and execution would commence.

        Obviously execution was slow, but it was a great exercise to show the class just how simple the 6502 was.

        • krallja 2 months ago

          What a fun paper computer! I wonder if those cards still exist somewhere.

          • anonymousiam 2 months ago

            Good question. You inspired me to see what I could find. The instructor was still associated with the college as of 2015, but he may have retired now. Google shows nothing associating him with 6502. I expect that the course curriculum has changed in the last 43+ years.

            A little cyber stalking shows that he sold his house in 2017: https://www.youtube.com/watch?v=inCBLQJja2U

            While thinking about him and the course, I remembered my final exam in that class, which I nearly failed. I had mastered the material maybe a little too well. The assignment was to program our KIM-1 (https://en.wikipedia.org/wiki/KIM-1) to drive a speaker with a 1 kHz square wave on a PIO port (bit 0) when a switch was closed between ground and bit 1 of the same PIO. So obviously you had to set up the data direction register of the PIO with bit 0 as output and bit 1 as input, and have a loop that would toggle the bit 0 output every 500 µs if the PIO bit 1 input was at logic-0. Even back then, I liked to optimize my code, and the instructor did not give us detailed requirements about the state of PIO output bit 0 when PIO input bit 1 was logic-1, other than to say that the tone would not be present. So instead of reading PIO bit 1 in the loop and performing a compare to branch one way or the other, I simply inverted PIO input bit 1 and stored it into the data direction register for PIO bit 0 on every loop iteration. Thus the tone would be present when bit 1 was at ground, but not present otherwise (because bit 0 had been defined as an input rather than an output, and was in a high impedance state). The instructor looked at my code and gave me an "F". I had to confront him after the class and he grudgingly changed my final grade to a "C".
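
            For anyone trying to follow the trick, here is a small Python model of one loop iteration. It is a reconstruction from the description above, not the original exam code, and it assumes a 1 in the data direction register makes the pin an output; the names `loop_step` and `speaker_level` are invented.

```python
OUT_BIT = 0x01  # PIO bit 0: speaker
IN_BIT = 0x02   # PIO bit 1: switch to ground


def loop_step(port_in, out_latch):
    """One pass of the tone loop, run every 500 microseconds.

    Returns the new data direction register value and output latch."""
    out_latch ^= OUT_BIT                      # always toggle the bit 0 latch
    switch_level = (port_in & IN_BIT) >> 1    # 0 = switch closed, 1 = open
    ddr = (switch_level ^ 1) & OUT_BIT        # bit 0 is an output only while closed
    return ddr, out_latch


def speaker_level(ddr, out_latch):
    """Level driven onto bit 0, or None when the pin is high impedance."""
    return (out_latch & OUT_BIT) if ddr & OUT_BIT else None
```

            The latch keeps toggling regardless of the switch; the switch only decides whether the pin actually drives the speaker, so no compare or branch is needed.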

            • krallja 2 months ago

              I don’t see the problem with your code?

      • krallja 2 months ago

        Yes, the original 6502 was laid out by hand on Mylar with pencils and X-acto knives by only a couple people. A motivated hobbyist could probably do it in a weekend in VHDL.

        The more interesting thing I think is that the existence of this official IP core implies that it is already out in production in millions of components, like the Tamagotchi[1] and Furby[2], presumably running at a very diverse array of clock frequencies. I wonder which product has the fastest WDC 6502?

        [1] https://hackaday.com/2013/05/24/tamagotchi-rom-dump-and-reve...

        [2] https://official-furby.fandom.com/wiki/Furby_(1998)/Technica...

dwheeler 2 months ago

You can also put the main loop in zero page, and increment the addresses directly by incrementing their zpage values. I would expect that to be slightly slower than total unrolling, but it wouldn't be as long and thus more practical.

GirishSharma643 2 months ago

For me, nothing is unclear on my Xiaomi Mi 9 (MIUI 12.5).

snoopy_telex 2 months ago

The images on that site are downright hostile to mobile devices. It’s completely unreadable.

  • colejohnson66 2 months ago

    They look perfectly fine for me on Firefox on an iPhone 13.

  • haneefmubarak 2 months ago

    Looks fine on Chrome on a Galaxy S10...