HocusLocus 2 months ago

I have lived my whole professional life with this being 'beyond obvious'... It's hard to imagine a generation where it's not. But then again, I did work with EBCDIC for a while, and we were reading and translating ASCII log tapes (ITT/Alcatel 1210 switch, phone calls, memory dumps).

I once got drunk with my elderly Unix supernerd friend and he was talking about TTYs and how his passwords contained embedded ^S and ^Q characters; he had traced the login process and learned they were just stalling the tty, not actually being used to construct the hash. No one else at the bar got the drift. He patched his system to use 'raw' instead of 'cooked' mode for login passwords. He also used backspaces ^? ^H as part of his passwords. He was a real security tiger. I miss him.

  • Eduard 2 months ago

    Regarding ^?: shouldn't that be ^_ instead?

dcminter 2 months ago

It doesn't seem to have been mentioned in the comments so far, but as a floppy-disk era developer I remember my mind was blown by the discovery that DEL was all-bits-set because this allowed a character on paper tape and punched card to be deleted by punching any un-punched holes!
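
A rough Python sketch of the same bit-level point (purely illustrative, nothing from the original comment):

    # "Deleting" a character on paper tape meant punching the remaining holes,
    # i.e. OR-ing whatever was there with all ones: every 7-bit code becomes DEL.
    def overpunch_delete(code: int) -> int:
        return code | 0x7F  # 0x7F == 0b1111111 == DEL

    assert overpunch_delete(ord('A')) == 0x7F  # 'A' (0x41) punched over -> DEL
    assert overpunch_delete(0x00) == 0x7F      # even an unpunched NUL -> DEL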

  • axblount 2 months ago

    Bit-level skeuomorphism! And since NUL is zero, does that mean the program ends wherever you stop punching? I've never used punch cards so I don't know how things were organized.

fix4fun 2 months ago

For me it was interesting that all digits in ASCII start with 0x3, e.g. 0x30 - '0', 0x31 - '1', ..., 0x39 - '9'. I thought it was accidental, but in fact it was intentional. It made it possible to build simple counting/accounting machines with minimal circuit logic using BCD (Binary Coded Decimal). That was a wow moment for me ;)
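
A tiny Python sketch of what that buys you (helper names are mine, purely illustrative):

    # Because '0'..'9' sit at 0x30..0x39, the low nibble of an ASCII digit
    # *is* its BCD value, and a mask/OR converts in either direction.
    def ascii_digit_to_bcd(ch: str) -> int:
        return ord(ch) & 0x0F        # '7' (0x37) -> 7

    def bcd_to_ascii_digit(n: int) -> str:
        return chr(0x30 | n)         # 7 -> '7' (0x37)

    assert ascii_digit_to_bcd('9') == 9
    assert bcd_to_ascii_digit(3) == '3'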

  • satiated_grue 2 months ago

    ASCII was started in 1960. A terminal then would have been a mostly-mechanical teletype (keyboard and printer, possibly with paper tape reader/punch), without much by way of "circuit logic". Think of it more as: a bit caused a physical shift of a linkage, so that the same remaining bits would hit the upper or lower part of a hammer, or a separate set of hammers.

    Look at the Teletype ASR-33, introduced in 1963.

    • fix4fun 2 months ago

      Yes, it's true the ASR-33 was the first application, but IBM had an impact on the ANSI/ASA committee and on ASCII standardisation. In 1963 the IBM System/360 used BCD for quick digit "parsing", both in the machine and in its peripherals. I remember it from some interview with an old IBM tech employee ;)

  • zahlman 2 months ago

    And this is exactly why I find the usual 16x8 at least as insightful as this proposed 32x4 (well, 4x32, but that's just a rotation).

  • kibwen 2 months ago

    I still wonder if it wouldn't have been better to let each digit be represented by its exact value, and then use the high end of the scale rather than the low end for the control characters. I suppose by 1970 they were already dealing with the legacy of backwards-compatibility, and people were already accustomed to 0x0 meaning something akin to null?

    • mmilunic 2 months ago

      Either way you would still need some check to ensure your digits are digits and not some other type of character. Having zeroed-out memory read as a bunch of NUL characters instead of something like “00000000” would probably be useful, since “000000” is sometimes legitimate user input.

    • gpvos 2 months ago

      NUL was often sent as padding to slow (printing) terminals. Although that was just before my time.

kazinator 2 months ago

This is by design, so that case conversion and folding is just a bit operation.

The idea that SOH/1 is "Ctrl-A" or ESC/27 is "Ctrl-[" is not part of ASCII; that idea comes from the way terminals provided access to the control characters, by a Ctrl key that just masked out a few bits.
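
A minimal Python sketch of that masking convention (the ctrl() helper is just for illustration, not any real terminal API):

    # The terminal-side trick: Ctrl keeps only the low five bits of the key,
    # so Ctrl-A -> 0x01 (SOH), Ctrl-[ -> 0x1B (ESC), Ctrl-M -> 0x0D (CR).
    def ctrl(key: str) -> int:
        return ord(key) & 0x1F

    assert ctrl('A') == 0x01   # SOH
    assert ctrl('[') == 0x1B   # ESC
    assert ctrl('M') == 0x0D   # CR, which is why Ctrl-M can stand in for Return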

  • muyuu 2 months ago

    I guess it's an age thing, but I thought this was really basic CS knowledge. But I can see why this may be much less relevant nowadays.

    • Cthulhu_ 2 months ago

      I've been in IT for decades but never knew that ctrl was (as easy as) masking some bits.

      • muyuu 2 months ago

        You can go back maybe 2 decades without this being very relevant, but not 3 given the low level scope that was expected in CS and EE back then.

      • kazinator 2 months ago

        I learned about it from 6502 machine language programming, from some example that did a simple bit manipulation to switch lower case to upper case. From that it became obvious that ASCII is divided into four banks of 32.

    • aa-jv 2 months ago

    Been an ASCII-naut since the '80s, so .. it's always amusing to see people type 'man ascii' for the first time, gaze upon its beauty, and wonder at its relevance, even still today ...

  • nine_k 2 months ago

    Yes, the diagram just shows the ASCII table for the old teletype 6-bit code (and the 5-bit code before it), with the two most significant bits spread over 4 columns to show the extension that happened while going 5→6→7 bits. It makes obvious what were very simple bit operations on very limited hardware 70–100 years ago.

    (I assume everybody knows that on mechanical typewriters and teletypes the "shift" key physically shifted the caret position upwards, so that a different glyph would be printed when hit by a typebar.)

taejavu 2 months ago

For whatever reason, there are extraordinarily few references that I come back to over and over, across the years and decades. This is one of them.

kazinator 2 months ago

If Unicode had used a full 32 bits from the start, it could have usefully reserved a few bits as flags that would divide it into subspaces, and could be easily tested.

Imagine a Unicode like this:

8:8:16

- 8 bits of flags.
- 8-bit script family code: 0 for the BMP.
- 16-bit plane for every script code and flag combination.

The flags could do useful things like indicate character display width, case, and other attributes (specific to a script code).

Unicode peaked too early and applied an economy of encoding which rings false now, in an age in which consumer devices have two-digit gigabytes of memory and multiple terabytes of storage, and high-definition video is streamed over the internet.
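
Something like this, purely as a hypothetical Python sketch of the 8:8:16 split above (none of it is real Unicode, obviously):

    # Hypothetical packing: 8 bits of flags, 8-bit script family, 16-bit plane.
    def pack(flags: int, script: int, plane: int) -> int:
        return (flags << 24) | (script << 16) | plane

    def unpack(cp: int):
        return cp >> 24, (cp >> 16) & 0xFF, cp & 0xFFFF

    assert unpack(pack(0b00000001, 0, 0x0041)) == (1, 0, 0x41)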

mbreese 2 months ago

I came across this a week ago when I was looking at some LLM generated code for a ToUpper() function. At some point I “knew” this relationship, but I didn’t really “grok” it until I read a function that converted lowercase ascii to uppercase by using a bitwise XOR with 0x20.

It makes sense, but it didn’t really hit me until recently. Now, I’m wondering what other hidden cleverness is there that used to be common knowledge, but is now lost in the abstractions.

  • Findecanor 2 months ago

    A similar bit-flipping trick was used to swap between numeric row + symbol keys on the keyboard, and the shifted symbols on the same keys. These bit-flips made it easier to construct the circuits for keyboards that output ASCII.

    I believe the layout of the shifted symbols on the numeric row was based on an early IBM Selectric typewriter for the US market. Then IBM went and changed it, and the latter is the origin of the ANSI keyboard layout we have now.

  • auselen 2 months ago

    xor should toggle?

    • munk-a 2 months ago

      That's correct; an unconditional toUpper would mask the bit off (AND with the complement of 0x20) rather than toggle it.

      • mbreese 2 months ago

        I left out that on the line before there was a check to make sure the input byte was between ‘a’ and ‘z’. That check ensures an already-uppercase char is never touched, and at that point XOR, AND with the inverted mask, or even subtracting 0x20 would all work. For some reason the LLM thought the XOR was faster.

        I honestly wouldn’t have thought anything of it if I hadn’t seen it written as `b ^ 0x20`.
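
        For the curious, a minimal sketch of the pattern being described (not the actual generated code, just the range check followed by the bit flip):

            # Once the range check guarantees the byte is 'a'..'z', flipping
            # bit 0x20 (XOR), clearing it (AND ~0x20), or subtracting 0x20
            # all give the same uppercase result.
            def to_upper(b: int) -> int:
                if ord('a') <= b <= ord('z'):
                    return b ^ 0x20
                return b

            assert to_upper(ord('q')) == ord('Q')
            assert to_upper(ord('Q')) == ord('Q')   # already upper: left alone
            assert to_upper(ord('!')) == ord('!')   # punctuation untouched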

rbanffy 2 months ago

This is also why the Teletype layout has parentheses on 8 and 9, unlike modern keyboards that have them on 9 and 0 (a layout popularised by the IBM Selectric). The original Apple IIs had this same layout, with a “bell” on top of the G.

  • Terretta 2 months ago

    What happened to this block and the keyboard key arrangement?

      ESC  [  {  11011
      FS   \  |  11100
      GS   ]  }  11101
    
    Also curious why the brace keys pair up as open and close, but the single and double curly quotes don't; they're stacked instead. Seems nuts every time I type Option-{ and Option-Shift-{ …
    • kazinator 2 months ago

      You're no longer talking about ASCII. ASCII has only a double quote, apostrophe (which doubles as a single quote) and backtick/backquote.

      Note on your Mac that the Option-{ and Option-}, with and without Shift, produce quotes which are all distinct from the characters produced by your '/" key! They are Unicode characters not in ASCII.

      In the ASCII standard (1977 version here: https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-197...) the example table shows a glyph for the double quote which is vertical: it is neither an opening nor closing quote.

      The apostrophe is shown as a closing quote, by slanting to the right; approximately a mirror image of the backtick. So it looks as though those two are intended to form an opening and closing pair. Except, in many terminal fonts, the apostrophe is just a vertical tick, like half of a double quote.

      The ' being vertical helps programming-language '...' literals not look weird.

    • jolmg 2 months ago

      > What happened to this block and the keyboard key arrangement?

      There's also these:

        | ASCII      | US keyboard |
        |------------+-------------|
        | 041/0x21 ! | 1 !         |
        | 042/0x22 " | 2 @         |
        | 043/0x23 # | 3 #         |
        | 044/0x24 $ | 4 $         |
        | 045/0x25 % | 5 %         |
        |            | 6 ^         |
        | 046/0x26 & | 7 &         |
dveeden2 2 months ago

Also easy to see why Ctrl-D works for exiting sessions.

unnah 2 months ago

If Ctrl sets bit 6 to 0, and Shift sets bit 5 to 1, the logical extension is to use Ctrl and Shift together to set the top bits to 01. Surely there must be a system somewhere that maps Ctrl-Shift-A to !, Ctrl-Shift-B to " etc.

  • maybewhenthesun 2 months ago

    It's more that shift flips that bit. Also I'd call them bit 0 and 1 and not 5 and 6 as 'normally' you count bits from the right (least significant to most significant). But there are lots of differences for 'normal' of course ('middle endian' :-P )

  • Leszek 2 months ago

    I guess in this system, you'd also type lowercase letters by holding shift?

seyz 2 months ago

This is why Ctrl+C is 0x03 and Ctrl+G is the bell. The columns aren't arbitrary. They're the control codes with bit 6 flipped. Once you see it, you can't unsee it. Best ASCII explainer I've read.

gpvos 2 months ago

Back in early times, I used to type ctrl-M in some situations because it could be easier to reach than the return key, depending on what I was typing.

renox 2 months ago

I still find it weird that they didn't put A, B, ... just after the digits; that would make binary-to-hexadecimal conversion more efficient.
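
A quick Python sketch of the annoyance (the helper is hypothetical, just to show the gap):

    # Because 'A' (0x41) does not directly follow '9' (0x39), turning a 4-bit
    # value into its ASCII hex digit needs a branch (or a +7 adjustment).
    def nibble_to_hex_char(n: int) -> str:
        return chr(n + 0x30) if n < 10 else chr(n - 10 + 0x41)

    assert ''.join(nibble_to_hex_char(n) for n in range(16)) == '0123456789ABCDEF'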

  • iguessthislldo 2 months ago

    Going off the timelines on Wikipedia, the first version of ASCII was published (1963) before the 0-9,A-F hex notation became widely used (>=1966):

    - https://en.wikipedia.org/wiki/ASCII#History

    - https://en.wikipedia.org/wiki/Hexadecimal#Cultural_history

    • jolmg 2 months ago

      The alphanumeric codepoints are well placed hexadecimally speaking, though. I don't imagine that was just an accident. For example, they could've put '0' at 050/0x28, but they put it at 060/0x30. That suggests to me that they did have hexadecimal in consideration.

      • kubanczyk 2 months ago

        It's more of a binary consideration than a hexadecimal one, if you think about it.

        If you have to prominently represent 10 things in binary, then it's neat to allocate a slot of size 16 and leave the remaining 6 entries as padding. Which is to say, it's neat to proceed from all zeroes:

            x x x x 0 0 0 0
            x x x x 0 0 0 1
            x x x x 0 0 1 0
            ....
            x x x x 1 1 1 1
        
        It's more of a cause for hexadecimal notation than an effect of it.
  • jolmg 2 months ago

    Currently 'A' is 0x41 and 0101, 'a' is 0x61 and 0141, and '0' is 0x30 and 060. These are fairly simple to remember for converting between alphanumerics and their codepoints. That seems more advantageous, especially if you might reasonably be looking at punch cards.

  • tgv 2 months ago

    [0-9A-Z] doesn't fit in 5 bits, which impedes shift/ctrl bits.

  • vanderZwan 2 months ago

    I'm not sure if our convention for hexadecimal notation is old enough to have been a consideration.

    EDIT: it would need to predate the 6-bit teletype codes that preceded ASCII.

  • kps 2 months ago

    They put : ; immediately after the digits because they were considered the least used of the major punctuation, so that they could be replaced by ‘digits’ 10 and 11 where desired.

    (I'm almost reluctant to spoil the fun for the kids these days, but https://en.wikipedia.org/wiki/%C2%A3sd )

gravifer 2 months ago

It really deserves some public documenting. The people designing a charset for ANSI must have tried to think everything through, even more so than with an 8-bit ISA, because the charset was going to be shared across typewriters.

ezekiel68 2 months ago

I love this stuff. It's the kind of lore that keeps getting forgotten and re-discovered by swathes of curious computer scientists over the years. So easy to assume many of the old artifacts (such as the ASCII table) had no rhyme or reason to them.

msarnoff 2 months ago

On early bit-paired keyboards with parallel 7-bit outputs, possibly going back to mechanical teletypes, I think holding Control literally tied the upper two bits to zero. (citation needed)

Also explains why there is no difference between Ctrl-x and Ctrl-Shift-x.

meken 2 months ago

Very cool.

Though the 01 column is a bit unsatisfying because it doesn’t seem to have any connection to its siblings.

y42 2 months ago

First I was like "What, but why? You don't save any space, so what's that exercise about?" Then I read it again and it blew my mind. I thought I knew everything about ASCII. What a fool I am; Socrates was right. Always.

mac3n 2 months ago

Anyone remember 005 ENQ (also called WRU, "who are you") and its effect on a teletype?

joshcorbin 2 months ago

Just wait until someone finally gets why CSI (aka the "other escape" from the 8-bit ANSI realm, now eternalized in the Unicode C1 block) is written ESC [ in 7-bit systems, such as the equally-now-eternal UTF-8 encoding.
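
For anyone who wants the spoiler, a small Python sketch of the relationship (as I understand ECMA-48: each C1 control 0x80..0x9F has a 7-bit form of ESC followed by the same code minus 0x40):

    # CSI is the C1 control 0x9B; 0x9B - 0x40 = 0x5B = '[', hence "ESC [".
    CSI, ESC = 0x9B, 0x1B

    def c1_as_7bit(c1: int) -> bytes:
        return bytes([ESC, c1 - 0x40])

    assert c1_as_7bit(CSI) == b'\x1b['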

timonoko 2 months ago

where does this character set come from? It looks different on xterm.

for x in range(0x0,0x20): print(chr(x),end=" ")

                    

  • voxelghost 2 months ago

    What are you trying to achieve? None of those characters are printable, and they're definitely not going to show up on the web.

        for x in range(0x0,0x20): print(f'({chr(x)})', end =' ')
        (0|) (1|) (2|) (3|) (4|) (5|) (6|) (7|) (8) (9| ) (10|
        ) (11|
              ) (12|
        ) (14|) (15|) (16|) (17|) (18|) (19|) (20|) (21|) (22|) (23|) (24|) (25|)    (26|␦) (27|8|) (29|) (30|) (31|)
    • timonoko 2 months ago

      Just asking why they have different icons in different environments? Maybe it is UTF-8 vs ISO-8859?

      • gschizas 2 months ago

        UTF-8 is not technically a character set (because it has way more than 256 characters). Characters 32-127 in UTF-8 are the same as ASCII, which is the same as in OEM/CP437 and ANSI/ISO-8859/CP1252.

        The characters in CP437 (and other OEM codepages) actually come from the ROM of the VGA (and EGA/CGA/MCGA/Hercules before them).

        What you are referring to is those (visually), right? I'm missing some characters in the first line, because HN drops them.

            0123456789abcdef
           0...♥♦♣♠•◘○◙..♪♫.
           1►◄↕‼¶§▬↨↑↓→←∟↔▲▼
        
        
        As far as I know, the equivalent control characters (characters 0-31) don't have any representation in CP1252, but that's also dependent on the font (since rendering of CP1252 is always done by Windows)

        As to their origin, originally the full CP437 character set was taken from Wang word processors. I don't know where Wang took it from, but they probably invented it themselves.

        EDIT: There's a more complete history here: https://www.os2museum.com/wp/weird-tales/

        EDIT 2: The CP437 character set didn't seem to come directly from Wang; it's just that they took some (a lot) of the characters from Wang word processor character sets. The positions of those "graphic" characters were decided by Microsoft when they made MS-DOS (at least according to Bill Gates).

        • timonoko 2 months ago

          On my screen there are indeed about thirty icons. When I executed the program on xterm, they were different, and when I pasted them into LibreOffice they were different again. And now it seems this shit is also different in every country.

          The world is broken.

      • rbanffy 2 months ago

        They shouldn't show as visual representations, but some "ASCII" charts show the IBM PC character set instead of the ASCII set. IIRC, up to 0xFF UTF-8 and 8859 are very close with the exceptions being the UTF-8 escapes for the longer characters.

        • gschizas 2 months ago

          Code points 0x80-0xFF aren't single bytes in the UTF-8 encoding. It's only the same up to 0x7F (127).

      • timonoko 2 months ago

        Opera AI solved the problem:

        If you want to use symbols for Mars and Venus, for example, they are not in range(0, 0x20). They are in the Miscellaneous Symbols block.

  • timonoko 2 months ago

    Ok this set does not even show on Android, just some boxes. Very strange.

Aardwolf 2 months ago

IMHO, ASCII wasted over 20 of its precious 128 values on control characters nobody ever needs (except perhaps in the first few years of its lifetime) and could easily have had a degree symbol, pilcrow sign, paragraph symbol, forward tick and other useful symbols instead :)

  • ogurechny 2 months ago

    Smaller, 6-bit code pages existed before and after that. They did not even have space for upper and lower case letters, but they had control characters. Those codes directly moved the paper, switched to the next punch card, or cut the punched tape on the receiving end, so you would want them if you ever had to send more than a single line of text (or a block of data), which most users did.

    The even smaller 5-bit Baudot code already had special characters to shift between two character sets and to discard the previous character. The Murray code, used for typewriter-based devices, introduced CR and LF, so they were quite frequently needed for way more than a few years.

  • mmooss 2 months ago

    It is interesting that, as a guess, we waste an average of ~5% of storage capacity for text (12.5% of Unicode's first byte, but many languages regularly use higher bytes of course).

    I don't fault the creators of ASCII - those control characters were probably needed at the time. The fault is ours for not moving on from the legacy technology. I think some non-ASCII/Unicode encodings did reuse the control character bytes. Why didn't Unicode implement that? I assume they were trying to be compatible with some existing encodings, but couldn't they have chosen the encodings that made use of the control character code points?

    If Unicode were to change it now (probably not happening, but imagine ...), what would they do with those 32 code points? We couldn't move other common characters over to them - those already have well-known, heavily used code points in Unicode, and also IIRC Unicode promises backward compatibility with prior versions.

    There still are scripts and glyphs not in Unicode, but those are mostly quite rare and effectively would continue to waste the space. Is there some set of characters that would be used and be a good fit? Duplicate the most commonly used codepoints above 8 bits, as a form of compression? Duplicate combining characters? Have a contest? Make it a private area - I imagine we could do that anyway, because I doubt most systems interpret those bytes now.

    Also, how much old data, which legitimately uses the ASCII control characters, would become unreadable?

  • bee_rider 2 months ago

    On top of the control symbols being useful, providing those symbols would have reduced the motivation for Unicode, right?

    ASCII did us all the favor of hitting a good stopping point and leaving the “infinity” solution to the future.

  • gpvos 2 months ago

    Maybe 32 was a bit much, but even fitting a useful set of control characters into, say, 16, would be tricky for me. For example, ^S and ^Q are still useful when text is scrolling by too fast.

  • zygentoma 2 months ago

    I started using the separator symbols (file, group, record, unit separator, ASCII 60-63 ... though mostly the last two) for CSV-like data to store in a database. Not looking back!

    • mmooss 2 months ago

      I've wanted to do that but don't you have compatibility problems? What can read/import files with those delimiters? Don't the people you are working with have problems?

    • gschizas 2 months ago

      ASCII 60-63 is just <=>?

      You probably mean 28-31 (∟↔▲▼, or ␜␝␞␟)

      Unless this is octal notation? But 0o60-0o63 in octal is 0123
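
      A hedged Python sketch of the idea from this sub-thread (assuming the field data itself never contains these control bytes):

          # Join fields with US (0x1F) and records with RS (0x1E), so commas
          # and newlines inside the data never need quoting or escaping.
          US, RS = '\x1f', '\x1e'

          def encode(rows):
              return RS.join(US.join(fields) for fields in rows)

          def decode(blob):
              return [record.split(US) for record in blob.split(RS)]

          rows = [['id', 'note'], ['1', 'hello, world\nsecond line']]
          assert decode(encode(rows)) == rows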

  • y42 2 months ago

    only that would have broken the whole thing back in the days ;)