The C Standard Library Function isspace() Depends on Locale

77 points by jandeboevrie 2 years ago

zokier 2 years ago

C locales are one of those optimistic features that we have inherited that in retrospect ended up being misguided and more trouble than they are worth; any program that really needs to deal with localization will probably end up pulling in something like ICU to deal with all sorts of cases, and on the other hand locales cause all sorts of weird issues for programs not really expecting to be localized. As a bonus, locale support incurs a heavy performance hit.

In this case it's extra-awkward to have an attempt of having unicode support on a function that takes a single char as an input; it can't actually handle arbitrary unicode codepoints anyway.

I feel a common theme with these sorts of things is the thinking that difficult problems can be made tractable by presenting a "simple" naive interface while fudging things behind the scenes. Those supposedly simple interfaces actually become complex to think about once you start asking difficult questions about correctness, error handling, and edge cases.

Someone 2 years ago

> In this case it's extra-awkward to have an attempt of having unicode support on a function that takes a single char as an input
Nitpick: it doesn’t take a char; it takes an int that must either be representable as unsigned char or be equal to EOF (https://en.cppreference.com/w/c/string/byte/isspace)
Given that description, I don’t think anybody attempted to have unicode support for isspace.
IMO, the bug is to call isspace for bytes extracted from utf-8 data.
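A minimal sketch of the safe calling pattern, assuming the input is a NUL-terminated byte string (possibly UTF-8). The helper name count_spaces is made up for illustration, and what it reports for bytes above 0x7F still depends on the current locale; the cast only guarantees that the argument stays in the range isspace is defined for.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts bytes that isspace() classifies as whitespace.
 * The cast matters: where char is signed, a byte such as 0xC3 would
 * otherwise be passed as a negative int, which is undefined behavior. */
static size_t count_spaces(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s))
            n++;
    }
    return n;
}
```

In the default "C" locale, none of the bytes of a multi-byte UTF-8 sequence are counted, since only the six standard whitespace characters qualify there.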
tialaramex 2 years ago

There are a few of these "optimistic features", and presumably there will be more in today's languages as the state of the art moves on.
Another existing example in C: The idea of "wide" characters and strings, which was intended to be maybe the right shape for UCS-2, but it turns out UCS-2 wasn't the future, UCS-4 (approximately UTF-32) is way too bloated to be reasonable, and so we're all going to speak UTF-8 which ironically C was already at least somewhat suitable for.
An example from Java: The Java concurrent memory model. As shipped Java 1.0 assumed the way forward is you put locks on data structures, that way concurrent access "just works" correctly. It's amazing. It's also ludicrously expensive and you're paying for it even if you don't ever do concurrent access. Today our locks are much cheaper, but they aren't so cheap that you can afford to just lock all your data structures like it's no big deal. On a modern system taking and giving back the lock costs one atomic operation and maybe a fence - cheap but far from free. So modern Java data structures are mostly just unsafe to use concurrently because that's the trade everybody else took in other languages.
A cursed C++ example: In the 1980s, unless you were a hash table expert, the hash table you probably knew how to write was a separate chaining hash table. Even by 1989, when ANSI C was standardised, experts might grudgingly admit this was a reasonable choice: there were promising alternatives, but hey, separate chaining works and it has well understood behaviour. Somehow though, C++ ended up standardising this bad old data structure as std::unordered_map not in 1998 (when it'd have seemed rather outmoded) but in 2011 (by which point it was beginning to smell bad).
On the other hand it's also easy to make the mistake of assuming something is a variable when it's actually locked down by the time you're standardising. 'A' is 65. It was 65 in ASCII in the 1960s and it's still 65 in UTF-8 (as a result of grandfathering the entire ASCII 7-bit code) today. If you carefully wrote code so that 'A' needn't be 65 that was a waste of your time. Clearly in 1965 it was not safe to just assume 'A' is 65, it might change, and in 2023 it's crazy to make that adjustable, it won't change, but somewhere it shaded over and I'm sure some people got caught.
- Dylan16807 2 years ago
  
  > but it turns out UCS-2 wasn't the future, UCS-4 (approximately UTF-32) is way too bloated to be reasonable, and so we're all going to speak UTF-8 which ironically C was already at least somewhat suitable for.
  For UCS-4 I think it's less about the bloat and more about realizing that it's a hard problem. Not only was going to 16 bits not a golden bullet, no size is a golden bullet because of various forms of code point combining. 2 bytes per character when we started preferring UCS-2 was a lot more expensive than 3-4 bytes per character when we stopped preferring it.
- kevin_thibedeau 2 years ago
  
  wchar_t has the big problem that its width is platform dependent. On some platforms it is 32 bits and can handle UCS-4. You just can't write portable code that uses that type.
  
  kps 2 years ago
  
  If a system supports Unicode, then wchar_t must be at least 21 bits, according to the C and C++ standards. Of course the 900lb gorilla has a 16-bit wchar_t, but the standards committees prefer to stick their fingers in their ears rather than deprecate it.
- kps 2 years ago
  
  C still does not require letters to be in order or contiguous. (In EBCDIC they're not contiguous.)
  
  account42 2 years ago
  
  I think today it's fair to say that if people want to have EBCDIC as the execution character set then it's their problem.
JohnFen 2 years ago

Localization and support for multiple character sets turns out to be an incredibly hard problem that remains unsolved.
Unicode is a horrible mess to deal with and I hate it with a passion, but as terrible as it is, nobody has come up with a better solution. I think it's because there literally is no good solution to the problem.
raxxorraxor 2 years ago

I think some of the "business languages" of MS often feature such translations. You suddenly have different quotes if you use another locale.
I had to bugfix something here and I almost went crazy. Don't remember the language, it was just a rather simple formula in some arcane tool. Didn't want to believe it...
- account42 2 years ago
  
  Haha it's not just their own business languages but also MS' support for standard formats: In locales where the comma is used as the decimal separator, Excel will by default use semicolons for "CSV" exports.
kps 2 years ago

Did locales have any pre-existing implementation, or did X3J11 invent them?
- kps 2 years ago
  
  Too late to edit… the earliest relevant public reference I have found is https://groups.google.com/g/comp.lang.c/c/Tw-pMLfC52M/m/VQIm... including the phrase “a presentation of the issues involved there (and hopefully a solution) are planned for the next X3J11 meeting” which to me suggests that X3J11 created locales themselves.
- wahern 2 years ago
  
  The OpenBSD man pages say that isspace et al. come from Version 7 Unix, which was released in 1979. The C89 rationale seems to suggest that they copied this pre-existing API.
  Rationale section 4.3 says[1],
  > Pains were taken to eliminate any ASCII dependencies from the definition of the character handling functions. One notable result of this policy was the elimination of the function isascii, both because of the name and because its function was hard to generalize.
  Rationale section 2.2.1 says[2],
  > The Committee ultimately came to remarkable unanimity on the subject of character set requirements. There was strong sentiment that C should not be tied to ASCII, despite its heritage and despite the precedent of Ada being defined in terms of ASCII. ...
  but it also makes clear that their goal wasn't a perfectly generalized locale framework, but something sufficient for writing a C compiler, i.e. to distinguish character classes relevant to C source code,
  > ... Rather, an implementation is required to provide a unique character code for each of the printable graphics used by C, and for each of the control codes representable by an escape sequence. (No particular graphic representation for any character is prescribed --- thus the common Japanese practice of using the glyph ¥ for the C character \ is perfectly legitimate.) Translation and execution environments may have different character sets, but each must meet this requirement in its own way. The goal is to ensure that a conforming implementation can translate a C translator written in C.
  [1] http://port70.net/~nsz/c/c89/rationale/d3.html#ASCII-4-3
  [2] http://port70.net/~nsz/c/c89/rationale/b.html#Ada-2-2-1
  
  kps 2 years ago
  
  I know isspace() and the rest of ctype.h date to V7; I was wondering about locales specifically. I'm not aware of any pre-C89 implementation of anything similar to that, so I'm wondering whether X3J11 adopted it from some system I've never heard of, or created it themselves. C89 was supposed to ‘codify common existing practice’, but internationalization politics may have overridden that.
  
  wahern 2 years ago
  
  This doesn't answer the question, but FWIW the earliest reference I could find to setlocale or related is from a 1986 journal article, Programmer's Journal, Volume 4, page 45:
  > In order to preserve the existing, large body of code, it was decided to place the burden of having a non-USA/ENGLISH environment only on those users who needed it. This will be done by making the C program startup environment default to a native "locale" of USA. Users wanting something else will need to call the setlocale function to set their locale to some other implementation-defined environment.
  See https://books.google.com/books?id=XV4qAAAAMAAJ&focus=searchw...
  So either the 1985 draft already had setlocale, or it must have been added shortly thereafter. However, Microsoft's 1986 reference manual doesn't mention setlocale, despite the introduction saying it was based on the draft standard. See https://www.os2museum.com/files/docs/msc40/ms-c-4.0-rtlref-1...
  IEEE Std 1003.1-1988 (i.e. POSIX), approved 1988, already copied setlocale from ANSI C, presumably based on a draft as ANSI C wasn't ratified until 1989. See https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub151-1.p...
  I can't find any implementation before Version 10 Unix from 1989, followed by BSD4.4-Net/2 from 1991. See https://minnie.tuhs.org/cgi-bin/utree.pl.
  Unfortunately none of the early committee notes or minutes seem to be available online. Hopefully someone in the know (possibly an original committee member) is lurking on HN and can fill-in the gaps.
  EDIT: I found a post by Doug Gwyn in comp.lang.c from November 1986 explaining setlocale:
  Path: utzoo!mnetor!seismo!brl-adm!brl-smoke!gwyn
  From: gwyn@brl-smoke.ARPA (Doug Gwyn )
  Newsgroups: comp.lang.c
  Subject: Re: sizeof(char)
  Message-ID: <5359@brl-smoke.ARPA>
  Date: Wed, 12-Nov-86 21:04:19 EST
  Article-I.D.: brl-smok.5359
  Posted: Wed Nov 12 21:04:19 1986
  Date-Received: Wed, 12-Nov-86 23:55:53 EST
  References: <4617@brl-smoke.ARPA> <657@dg_rtp.UUCP>
  Reply-To: gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>)
  Organization: Ballistic Research Lab (BRL), APG, MD.
  Lines: 61
  In article <9181@sun.uucp> guy@sun.uucp (Guy Harris) writes:
  >If it is indeed the case that there is more than one way of sorting text in,
  >say, Oriental languages, then either 1) "setlocale" is a poor name, because
  >it takes into account more than just the locale, or 2) it is a poor routine,
  >because it doesn't take into account more than just the locale.
  The name is short for "set locale-specific information", which reflects the main motivation for the function. There were several suggestions for the name, but we couldn't find one that we liked better, other than contractions of "set environment", which had to be rejected for the obvious reason.
  Actually, it WAS intended that setlocale() indeed mean "change or query the program's entire LOCALE or portions thereof", where the term "locale" was to be defined in section 1.5. However, something appears to have gone awry in the process of making this last-minute addition to the draft proposed standard document, since there are two sentences in the description of setlocale (section 4.4.1.1) that say almost the same thing using different words, and section 1.5 defines "locale-specific behavior" but not "locale".
  The general term "locale" is intended in the context of X3J11 to refer to a complete, orthogonal set of selections of conventions for items that are allowed to affect program operation based on nationality, culture, or language. Thus "locale" is not synonymous with "location".
  NB: I pasted only the first half of that post; the most relevant part.
  
  kps 2 years ago
  
  Thanks! I started working at a C compiler shop near the end of the '80s, and I don't recall knowing that this had been under way for so long already. (Though, we did freestanding targets, often with no conventional I/O, so locales would have been the last thing on our minds.)
  It would be nice if the X3J11 internal documents ended up in an online archive, given their historical importance.

wahern 2 years ago

The fact that Unicode codepoints were being passed to isspace instead of iswspace indicates the relevant code was already fubar'd.

> For example, isspace(0x01fe) is true. I can't figure out why this might be considered a whitespace character

Because the only valid values (independent of locale) that can be passed to isspace are 0 to UCHAR_MAX and -1/EOF, where UCHAR_MAX refers to unsigned char (usually 255), not Unicode character. Most implementations I've seen (glibc, musl, OpenBSD) index the passed value into a locale-specific array of length UCHAR_MAX + 1, possibly masking the index and/or return values. But TIL macOS (and possibly FreeBSD and NetBSD at some point, if not currently) had vestigial support for passing higher values as part of a presumably long-abandoned approach to I18N.

EDIT: FWIW, based on the glibc code (ctype/isctype.c),

    int
    __isctype (int ch, int mask)
    {
      return (((uint16_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_CLASS) + 128)
              [(int) (ch)] & mask);
    }

where isspace(c) seems to be translated to __isctype(c, _ISspace) there's a good chance the array is being overflowed. Without looking further (glibc isn't the easiest code to grok), I'd guess the array size is probably 128 + UCHAR_MAX with the offset of 128 (instead of 1) to handle the common case, especially on systems where char is signed, of people passing in negative values, though that only works for a locale like ASCII where -1/EOF and 255/(unsigned char)-1 aren't ambiguous.
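
To make that guess concrete, here is a toy version of the offset-table idea. It is not glibc's code: the table name, its 128 + 256 entry size, and the TOY_SPACE bit are all assumptions for illustration. The point is only that a table biased by 128 can answer for indices from -128 to 255, and a value like 0x01fe lands well outside it.

```c
#include <stdint.h>

#define TOY_SPACE 0x1u   /* hypothetical class bit */
#define TOY_BIAS  128    /* slots reserved for negative (signed char) inputs */

static uint16_t toy_table[TOY_BIAS + 256];

/* Mark the six C-locale whitespace characters; entries for negative
 * inputs are left at zero in this sketch. */
static void toy_init(void)
{
    static const unsigned char ws[] = { ' ', '\t', '\n', '\v', '\f', '\r' };
    for (unsigned i = 0; i < sizeof ws; i++)
        toy_table[TOY_BIAS + ws[i]] |= TOY_SPACE;
}

/* Returns 1 if ch is inside the range the biased table can answer for. */
static int toy_in_range(int ch)
{
    return ch >= -TOY_BIAS && ch <= 255;
}

/* Unlike a raw table lookup, this sketch checks the range first instead
 * of indexing out of bounds; -1 here means "bad request". */
static int toy_isspace(int ch)
{
    if (!toy_in_range(ch))
        return -1;
    return (toy_table[TOY_BIAS + ch] & TOY_SPACE) != 0;
}
```

Without the range check, toy_isspace(0x01fe) would read past the end of the table and return whatever bits happen to live there, which is one plausible way to get a surprising "true".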

JdeBP 2 years ago

FreeBSD layers narrow and wide character typing on top of a single common mechanism based upon 32-bit signed "runes". Basically, the 256 narrow characters are treated as the first 256 characters in the Unicode BMP, and an accident of implementation allows one to pass in other Unicode code points to the narrow character functions, given that the int and wint_t types are designed to be trivially convertible to a "rune".
dvh 2 years ago

Seems like, instead of a boolean result, a better API would return a tri-state: space, not space, bad request.
- TeMPOraL 2 years ago
  
  This is already the case; "bad request" is usually returned as SIGSEGV / 0xC0000005 and similar.
  
  asveikau 2 years ago
  
  No. Bad request here would be a recoverable, handleable error that doesn't tank the process. You are describing a failure mode for a bad pointer dereference, which generally should not be handled in-process and should take down the program.
  It is not reasonable for isspace(), a simple function call with no pointer arguments, to fault like that, absent unlikely events such as a hardware issue or a bad stack pointer.
  
  TeMPOraL 2 years ago
  
  > It is not reasonable for isspace(), a simple function call with no pointer arguments, to fault like that
  Well, if it internally uses a lookup table and doesn't do bounds checking, then it can fail this way, and that failure will propagate to the caller.
  > You are describing a failure mode for a bad pointer dereference, which generally should not be handled in-process and should take down the program.
  Maybe this is where we went wrong? Bad pointer access isn't fundamentally different from doing a bounds check and throwing an exception. Either of them will tank the process if not handled up the stack. But when you do a bounds check solely to prevent a bad pointer access crash, you're effectively re-doing the work the OS already does for you, paying a cost at runtime just to switch to a slightly different semantics. Maybe instead of doing that, we should have the platform provide better granularity of protection (as to e.g. avoid bad pointer access targeted to overwrite some other memory), and lean on it, instead of bolting extra layers of the same thing, just Invented Here on top?
  
  Quekid5 2 years ago
  
  (I might be being overly nitpicky.)
  > Well, if it internally uses a lookup table and doesn't do bounds checking, then it can fail this way, and that failure will propagate to the caller.
  We're talking UB here, AFAICT. So it might do literally anything, including propagating to the caller. Or not. Or doing something entirely different, like returning true (or false)... or deleting your files.
  One wishes that out-of-bounds access would guarantee a SIGSEGV, but here we are...
  > Bad pointer access isn't fundamentally different from doing a bounds check and throwing an exception.
  Performance is the difference. If you're doing bounds checks for every access you're going to be slower than a compiler which can prove (for a loop, say) that no OOB access can possibly occur during said loop and just remove the bounds checks.
  > Either of them will tank the process if not handled up the stack.
  Again, not true -- an exception has defined behavior, a stray pointer doesn't. (It could, conceivably, but it fundamentally doesn't in C.)
  
  asveikau 2 years ago
  
  > Either of them will tank the process if not handled up the stack.
  No. I think your mind must have its grooves set by high level languages to think this way. I was very deliberate when I said it's unwise to recover from bad pointer dereference in your own process. It is not safe and invites a world of pain. It isn't "an exception" that you can "catch and move on from". It is death. Yes you can install a handler for SIGSEGV or Win32 access violation. That doesn't make it a good idea.
  You can handle it safely, but only from the safety of another process. Or the kernel can handle it in a page fault interrupt for a user process. This is because your own address space is known good in those cases. But handling it in your own address space, I would not advise that.
  
  asveikau 2 years ago
  
  Too late to edit the post, but I wanted to add something I have done in a SIGSEGV handler in production, which is to log a stack trace without allocating any memory, then exit. Anything beyond that is pushing it.
  
  Quekid5 2 years ago
  
  Yeah, unless you have a very specific architecture and compiler, etc., in mind. "Handling" SIGSEGV is madness -- you're usually already deep in UB territory if a SIGSEGV happens.
  Re: Your use case, I believe the most recent C++ standard has a portable way to do stack traces, so yay!
  
  deredede 2 years ago
  
  You can't exactly lean on the OS to detect bad pointer references because it will only work some of the time.
  The OS doesn't know where your array ends, so the error would have to depend on whether the access is outside the memory allocated to the process. A function specification of "if the input is outside valid range, either return garbage or raise an exception" provides no value over "either return garbage or crash the program", because there is still the possibility of returning garbage.
  
  TeMPOraL 2 years ago
  
  > A function specification of "if the input is outside valid range, either return garbage or raise an exception" provides no value over "either return garbage or crash the program", because there is still the possibility of returning garbage.
  Right. Yes, I was being a little tongue-in-cheek with the original comment, but only a little - imagine if there was a way for the OS / underlying runtime to prevent returning garbage (possibly by eagerly raising an exception). In that case, a lot of our error handling would be duplicating the work done by the OS. Now, with the garbage result being a possibility, we're still duplicating some of the work - we just can't really avoid it, in a kind of "50% of checks are redundant, we just don't know which ones" way.
  I'm raising this as something to think about, that wasn't obvious to me until recently. I grew up dreading SIGSEGV and 0xC0000005. But recently, having no other choice but to trap some of those and similar exceptions (due to bugs in some proprietary third-party dependencies), I finally realized those are just error handling mechanisms, conceptually not any different from regular exceptions or Result<T, E> types. They're just implemented one layer below.
  
  deredede 2 years ago
  
  I don't think there is work duplication here, not in any meaningful way at least (and even in non meaningful ways I think you're also way off with your 50% figure - considering a 4kB page size and 64-bit pointer alignment, 0.2% would be more realistic).
  SIGSEGV is a safety/isolation feature. It is a byproduct of physics limitations: if we had infinite RAM, we would allocate a separate 2^64 address space to each process and SIGSEGV would not exist. There are also zero guarantees that you will ever get a SIGSEGV, even if you do horrible, no-good, very bad things: that is entirely dependent on the OS and hardware. This allows for efficient implementations because the OS/hardware can implement checks at whatever granularity they deem appropriate. On some OS/hardware combinations you may never get a SIGSEGV at all!
  On the other hand, ensuring errors on, amongst others, out-of-bounds array accesses, is much more expensive to do dynamically, and the OS can't really do it more efficiently than the compiler. In fact, it is the opposite: the compiler knows the specifics of the language and can exploit them to elide many boundary checks that the OS never could.
  
  JohnFen 2 years ago
  
  > Well, if it internally uses a lookup table and doesn't do bounds checking, then it can fail this way, and that failure will propagate to the caller.
  True, but I would say that wouldn't count as a reasonable implementation.
  
  account42 2 years ago
  
  It is absolutely reasonable for function invocations that violate the specification to result in program termination. Attempting to handle and recover from the error at runtime is a) a waste of resources and b) prevents future expansion of the function that makes the values legal.
- bitwize 2 years ago
  
  So... True, False, and File Not Found?
  
  teddyh 2 years ago
  
  Reference: https://thedailywtf.com/articles/What_Is_Truth_0x3f_
- rm445 2 years ago
  
  Reminds me of a joke in a John Meaney novel, a physicist discovers faster-than-light travel through a dimension with unusual properties, gaining inspiration from Java's booleans having three states (true, false and NullPointerException).
  
  Phrodo_00 2 years ago
  
  Hate to point it out, but java booleans can only be true or false. Booleans (capital B) can be true, false or null.
  
  dfox 2 years ago
  
  A method that is declared as only ever returning boolean can also raise any exception that is an instance of java.lang.RuntimeException (which includes NullPointerException). This somewhat bizarre design, which combines unchecked and checked exceptions, has been a source of major discussions since at least the late '90s and has much to do with the current concept of null-safety.
  
  im3w1l 2 years ago
  
  It can also raise Errors. I know because this one really poorly designed library I used would raise Errors when asking it to decode certain invalid files.
  
  hedora 2 years ago
  
  final Boolean foo = True
  …
  if (foo.booleanValue()) { }
  Can also NPE.
  
  hedora 2 years ago
  
  Yeah, but using a raw boolean or Boolean is not considered good practice. They’re up to at least four states now:
  null, None, Some(TRUE) and Some(FALSE)
  Of course, if you want more than that, you can have unboundedly many by invoking new Boolean(), which guarantees its return value is unique according to ==

kazinator 2 years ago

Scanning a floating-point value with strtod depends on locale. If it's in some locale where the decimal point is a comma, it may stop recognizing the standard 123.456+EE notation.

The fix is never to call setlocale(); calling setlocale is like asking "f___ my C program".

ISO C localization was designed back in the 1980s, when nobody had any real experience with localizing. In a greenfield C program, it's best to do it all yourself from scratch and stay away from C localization, so you can depend on strtod and isspace to do what they are supposed to.
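
If you cannot guarantee that nothing else in the process calls setlocale, one hedged alternative is to pin the "C" locale just for the parse. This is a sketch assuming a POSIX.1-2008 system (newlocale, uselocale, and freelocale are not ISO C, and some platforms declare them in a different header); parse_double_c is a made-up helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: parse a double using "C" locale rules no matter
 * what the process-wide locale is. Returns 0 on success, -1 on failure. */
static int parse_double_c(const char *s, double *out)
{
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;

    locale_t old = uselocale(c_loc);  /* affects the calling thread only */
    char *end;
    double v = strtod(s, &end);
    uselocale(old);                   /* restore before freeing c_loc */
    freelocale(c_loc);

    if (end == s)
        return -1;                    /* no digits consumed */
    *out = v;
    return 0;
}
```

Creating and freeing the locale object on every call keeps the sketch short; real code would more likely create it once and reuse it.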

pavlov 2 years ago

It's useful to remember that the 1980s approach to localization preceded the global Internet.
The way most software worked was that you bought it from a local reseller. It came localized for your country (perhaps by the reseller or importer rather than the original authors of the software), and then you'd use it to conduct your local business. Data interchange wasn't that common.
Desktop printers were hugely important because a hard copy was how you'd share anything. If you needed to get the information somewhere fast, you'd then fax it.
On the rare occasions when you did need to exchange files, you'd use floppies. Maybe you'd take your WordPerfect document to a professional print shop so they would do a layout using cutting-edge desktop publishing technology.
So the notion that somebody in Germany might receive American files, or vice versa, wasn't really a primary concern. It was considered far more important that the Germans, and everybody else, would be able to work with their data with the number formatting that was preferred (and sometimes legally mandated).
- orf 2 years ago
  
  Cool but that was over 40 years ago. Who cares, and why hasn’t it been improved since then?
  
  pvh 2 years ago
  
  To the former, curious people with an interest in how and why the world came to be as it is.
  To the latter, obviously "it" has improved, but ecosystem effects make certain changes very difficult and expensive to coordinate and what we see here is the scars from that process.
  Everything you see in the world grew out of things that came before, and was made by fallible people working with limited time, energy, and perspective.
  Honestly, I'm a bit surprised someone with a three letter handle wouldn't already recognize this. Surely you have been around here for a while.
  
  JdeBP 2 years ago
  
  It actually has been improved. See discussion of isspace_l() elsewhere on this page.
  
  nwellnhof 2 years ago
  
  Unfortunately, not all locale-dependent functions have an *_l version, at least on some platforms. glibc doesn't have sprintf_l, for example, unlike BSD and MSVCRT.
  
  JdeBP 2 years ago
  
  Luckily, sprintf() is irrelevant to the headlined case, which involves isspace_l(); that function is in the BSD, GNU, and even musl C libraries.
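  A minimal sketch of what using isspace_l() can look like, assuming a POSIX.1-2008 environment such as glibc or musl (other platforms may need a different header for locale_t). The helper name is_c_space is hypothetical.

  ```c
  #define _POSIX_C_SOURCE 200809L
  #include <ctype.h>
  #include <locale.h>

  /* Hypothetical helper: classify one byte against a fixed "C" locale
   * object, ignoring whatever setlocale() has done to the process.
   * Returns 1 or 0, or -1 if the locale object could not be created. */
  static int is_c_space(unsigned char byte)
  {
      locale_t c_loc = newlocale(LC_CTYPE_MASK, "C", (locale_t)0);
      if (c_loc == (locale_t)0)
          return -1;
      int r = isspace_l(byte, c_loc) != 0;
      freelocale(c_loc);
      return r;
  }
  ```

  In practice you would probably create the locale object once and keep it around rather than rebuilding it for every byte.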
  
  IshKebab 2 years ago
  
  I guess anyone that develops in C is happy living in the 80s amongst the footguns. Anyone who isn't has moved on to other languages where it has been improved.
ridiculous_fish 2 years ago

How would one implement locale-aware strtod() all by yourself from scratch?
- kazinator 2 years ago
  
  Use the locale-unaware strtod. If the locale's decimal separator character isn't . (period) then filter the output of that to replace the period with that character.
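  For parsing, the same idea runs in the other direction: filter the input rather than the output. A rough sketch, assuming a single-byte separator supplied by the caller and a strtod that currently expects '.' (i.e. LC_NUMERIC is still "C"); parse_with_sep and its 64-byte buffer are made up for illustration.

  ```c
  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical helper: parse text whose decimal separator is `sep`
   * by rewriting the first `sep` to '.' and calling plain strtod.
   * Assumes strtod itself expects '.', i.e. LC_NUMERIC is "C".
   * Returns 0 on success, -1 if the input is too long or not a number. */
  static int parse_with_sep(const char *s, char sep, double *out)
  {
      char buf[64];
      size_t len = strlen(s);
      if (len >= sizeof buf)
          return -1;
      memcpy(buf, s, len + 1);

      if (sep != '.') {
          /* A '.' in the input is not a decimal point in this convention;
           * leaving it in place would let strtod misread it, so reject. */
          if (strchr(buf, '.') != NULL)
              return -1;
          char *p = strchr(buf, sep);
          if (p != NULL)
              *p = '.';
      }

      char *end;
      double v = strtod(buf, &end);
      if (end == buf || *end != '\0')
          return -1;
      *out = v;
      return 0;
  }
  ```

  This ignores grouping separators and multi-byte decimal points, both of which a real locale-aware parser would have to deal with.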
bitwize 2 years ago

s/greenfield C program/greenfield program/
s/from C localization.*$/from C./

smcameron 2 years ago

And if you just don't want to be bothered by locales, it's not enough to call setlocale(LC_ALL, "C"); other libraries may call setlocale() behind your back (e.g. GTK2). So, what you have to do is, you have to break setlocale's arms off:

    #define _GNU_SOURCE
    
    #include <stdio.h>
    #include <dlfcn.h>
    #include <locale.h>
    
    typedef char *(*setlocale_prototype)(int category, const char *locale);
    
    static setlocale_prototype real_setlocale = NULL;
    
    static char *the_only_locale = "C";
    
    char *setlocale(__attribute__((unused)) int category,
            __attribute__((unused)) const char *locale)
    {
        char *msg;
    
        if (!real_setlocale) {
            *(void **) &real_setlocale = dlsym(RTLD_NEXT, "setlocale");
            msg = dlerror();
            if (msg) {
                fprintf(stderr, "Failed to override setlocale(): %s\n", msg);
                fflush(stderr);
                real_setlocale = NULL;
                return the_only_locale;
            }
        }
        /* C is the locale, and the locale shall be C */
        return real_setlocale(LC_ALL, "C");
    }

Now it doesn't matter if GTK or whatever library calls setlocale() with whatever weird parameters, the locale shall be "C".

0xr0kk3r 2 years ago

Another library can call setlocale and it will impact my code? I've never, ever written locale-specific code, so this is kind of a surprise.
- teddyh 2 years ago
  
  Locale is process-wide, so yes. But they shouldn’t; libraries should not call setlocale(). If one does, you should probably report it as a bug.
  It’s like if a library called chdir(); there are some things which libraries should not do, and setlocale() is one of them.
  
  JohnFen 2 years ago
  
  I would be a bit more nuanced than that. I think it's OK if libraries change process state as long as they restore it as it was before the library call returns. And as long as this behavior is clearly documented.
  
  teddyh 2 years ago
  
  Many state changes are not as reversible as one might think. Even chdir() can be non-reversible if the mountpoint has since changed. And if the locale has not been set, and then a library sets it, AFAIK you can’t “unset” it, only set it to something else.

nemothekid 2 years ago

mpv's locale rant is another "blog post" about the frustration of locale.

https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...

dfox 2 years ago

This touches on an issue that is part of how we got to this situation. What is the encoding of zero and possibly slash delimited blobs of bytes (filenames, argv, envp…) that you get from and pass to the kernel? There are parts of userspace that don't care, parts that assume that LC_CTYPE is the correct answer (GNU tools are usually in one of these categories), userspace that assumes that all that is UTF-8 regardless of LC_CTYPE and more or less by sheer luck does not fail completely when that is not true (Gnome), and then there is Python3, which tries to and fails at reconciling this mess with the NT world, where filenames have a weird but defined (and, funnily enough, dependent on the locale the mkfs was run with) UTF-16-like encoding.
- matheusmoreira 2 years ago
  
  > What is the encoding of zero and possibly slash delimited blobs of bytes (filenames, argv, envp…) that you get from and pass to the kernel?
  Bytes. Every number other than 0 and 47 is allowed. They should probably be treated as opaque blobs of binary data.
  
  ben0x539 2 years ago
  
  Well, sometimes the users ask to see them and they don't like numbers very much.
  
  account42 2 years ago
  
  Assuming UTF-8 for display works pretty well, as long as you still treat it as bytes for the actual file operations and don't corrupt the data because you expect valid UTF-8.
  Windows has the same problem where in most places the kernel will allow all WCHAR sequences (minus a short list of reserved ones) and gives you no guarantee that that's actually UTF-16.
Dwedit 2 years ago

This is the famous "shitfucked retarded legacy braindeath" post.

rwmj 2 years ago

See also a classic glibc bug: "[0-9] matches ¼ ١ ２〣 and others, but not ９ (and other nines)" (https://news.ycombinator.com/item?id=17557243)

Sadly they renamed the upstream bug report to something more sober (although didn't fix it).

david2ndaccount 2 years ago

Yep.

Because of the terribleness of locales, you should always just roll your own to handle ascii, or use an explicitly unicode-aware function. Anything in libc that relies on locale is unusable because of this.

Besides, isspace is such a trivial function that you don’t want to actually call an extern function (possibly even dynamically linked and thus having to hit the PLT) for it, you want something that is easily inlined.
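
A minimal sketch of such a roll-your-own check, assuming you only care about the six ASCII whitespace characters that isspace() accepts in the "C" locale. The name ascii_isspace is made up for illustration.

```c
#include <stdbool.h>

/* Locale-independent whitespace test. Covers only the six ASCII
 * characters that isspace() accepts in the "C" locale. */
static inline bool ascii_isspace(int c)
{
    return c == ' '  || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}
```

Being static inline in a header, it needs no call through the PLT, and it never returns true for bytes such as 0x85 or 0xA0 whatever the process locale is.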

nirvanis 2 years ago

Somewhat related tip: prepend LANG=C to many console commands such as grep to speed up many tools processing large files, as they will assume ASCII input (which is probably what you have in most cases)

seanhunter 2 years ago

If you care about speed you would probably be using ripgrep rather than grep anyway, but doesn’t `LANG=en_US.UTF-8` give a similar speed on modern systems without any compromise on consistency of sort ordering etc and support for extended characters?
- burntsushi 2 years ago
  
  For GNU grep in particular, no, using a UTF-8 locale can significantly slow it down:

      $ time LC_ALL=C grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
      3
      real    0.808
      user    0.744
      sys     0.063
      maxmem  10 MB
      faults  0

      $ time LC_ALL=en_US.UTF-8 grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
      4
      real    20.064
      user    19.982
      sys     0.077
      maxmem  10 MB
      faults  0

  Whereas ripgrep is just Unicode aware by default, and still about as fast as the ASCII-only variant of GNU grep above:

      $ time rg '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
      4
      real    1.163
      user    1.132
      sys     0.030
      maxmem  916 MB
      faults  0
  
  kps 2 years ago
  
  For grep, how much of the difference is due to '\w' having a different meaning between the two cases?
  
  burntsushi 2 years ago
  
  That's exactly the point. ripgrep uses the Unicode definition by default and so corresponds to what GNU grep is doing in the en_US.UTF-8 locale.
emmelaich 2 years ago

and set it for consistency of ordering (collation) between sort, join, tsort, look, etc.

eMSF 2 years ago

While writing a fancy word counter I learnt that glibc iswspace (or the glibc locale data) actually does not consider non-breaking spaces as, well, spaces even when using a Unicode locale. This apparently conforms to ISO 30112. (For example MSVCRT does do so.)

I happened to notice this via a result mismatch as GNU wc does count NBSPs as word separators. Even though it uses iswspace, it also additionally checks for a hard coded set of Unicode non-breaking spaces.

(I have to say I'm a bit surprised at getting voted hidden here. I thought this was mostly related to the topic at hand. I would of course gladly be corrected if mistaken about the details.)

JdeBP 2 years ago

The fix if one wants to stick to the standard library is to make use of

    locale_t posixctypelocale = newlocale(LC_CTYPE_MASK, "POSIX", NULL);

saved somewhere early on and then

    b = isspace_l(c, posixctypelocale);

and

    b = iswspace_l(wc, posixctypelocale);

whenever one needs them.
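
Put together, the pieces above might look roughly like the following. This is a sketch, not a vetted implementation: the wrapper name posix_isspace and the lazy, non-thread-safe initialization are assumptions made for brevity, and BSD-derived systems may need <xlocale.h> for the *_l declarations.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* ask for newlocale() and isspace_l() */
#endif
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: whitespace test pinned to the POSIX locale,
 * independent of whatever setlocale() has been called with. */
int posix_isspace(int c)
{
    static locale_t posix_ctype;  /* created on first use; not thread-safe */

    if (posix_ctype == (locale_t)0) {
        posix_ctype = newlocale(LC_CTYPE_MASK, "POSIX", (locale_t)0);
        if (posix_ctype == (locale_t)0)
            return 0;  /* locale creation failed: report "not a space" */
    }
    /* c must be representable as unsigned char or be EOF, as for isspace() */
    return isspace_l(c, posix_ctype) != 0;
}
```

A real program would more likely create the locale object once at startup and release it with freelocale() on exit.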

The irony is that systems based upon the BSD C library like MacOS and FreeBSD will have this.

* https://pubs.opengroup.org/onlinepubs/9699919799/functions/n...

* https://pubs.opengroup.org/onlinepubs/9699919799/functions/i...

* https://man.freebsd.org/cgi/man.cgi?query=xlocale&sektion=3

account42 2 years ago

That's quite an absurdly inefficient way to check if a byte is one of 6 fixed values.
- JdeBP 2 years ago
  
  It's a simple array index and bitmask in most implementations, which is usually more efficient than what 6 short-circuited comparisons compile into.
  
  account42 2 years ago
  
  Plus a function call (likely with GOT indirection) plus another pointer dereference to get from the locale handle to the array. Also you now need some initialization code to create the locale and store the handle somewhere (unless you want to make this even more ridiculously inefficient and create it each time).
  Also, five of those whitespace characters have contiguous byte values so you don't even need 6 comparisons. The compiler doesn't have to keep the short-circuiting here.
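
  To illustrate the contiguity point: '\t' through '\r' occupy 0x09 to 0x0D in ASCII, so a single unsigned range check can stand in for five of the comparisons. A sketch with made-up function names, assuming an ASCII execution character set:

  ```c
  #include <stdbool.h>

  /* Straightforward version: six explicit comparisons. */
  static bool isspace_six(unsigned char c)
  {
      return c == ' '  || c == '\t' || c == '\n' ||
             c == '\v' || c == '\f' || c == '\r';
  }

  /* '\t' (0x09) through '\r' (0x0D) are contiguous, so one subtraction
   * and an unsigned compare covers five of the six characters. Values
   * below '\t' wrap around to a large unsigned number and fail the test. */
  static bool isspace_range(unsigned char c)
  {
      return c == ' ' || (unsigned)(c - '\t') < 5u;
  }
  ```

  Both functions agree for every byte value; whether the second is actually faster than a table lookup depends on the compiler and target, so measure before assuming.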

djoldman 2 years ago

For implications in python, see for example:

https://docs.python.org/3/library/re.html#re.LOCALE

bawolff 2 years ago

> In this case, isspace() returns true for Unicode white-space values, which includes 0x85 = NEL = Next Line, and 0xA0 = NBSP = No-Break Space.

Those aren't even unicode (utf-8) bytes for those characters. They are the iso-8859-1 bytes. (E.g. nbsp is U+00A0 which has a byte representation of 0xC2 0xA0)
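
For anyone who wants to check the byte representation, here is a small sketch of a UTF-8 encoder. The function name utf8_encode is made up for illustration; it is not from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for surrogates and
 * values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Feeding it U+00A0 gives the two bytes 0xC2 0xA0, and U+0085 gives 0xC2 0x85, so a lone 0xA0 or 0x85 byte is only a continuation byte in UTF-8, never a whole character.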

kps 2 years ago

The argument to `isspace()` (in that implementation) is a Unicode code point, not an encoding byte.
- Sprocklem 2 years ago
  
  Right, but the only reasonable way to localize `isspace()` is to have it based on code units / encoding bytes, since that is what it will be used for. When people want to test a unicode code point they instead call `iswspace()` (or a non-standard but more-sane version thereof).
  
  kps 2 years ago
  
  My unverified assumption is that MacOS's behaviour here descends from NeXTStep, which used UCS-2.

emmelaich 2 years ago

I think I've mentioned it before, but the isspace() and similar man pages used to warn that they made sense only if ascii.

So the recommendation was to always do (isascii() && iswhatever()).

With the advent of locales they seem to have just omitted this rather than put in a warning or hint.
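
That old guard could be sketched like this. The explicit 7-bit range test plays the role of isascii(), which is a POSIX extension rather than ISO C; guarded_isspace is a made-up name, and the sketch assumes the locale's encoding is ASCII-compatible for values below 0x80.

```c
#include <ctype.h>

/* Consult isspace() only for 7-bit values. Anything outside 0..0x7F
 * (including EOF and high bytes such as 0xA0) is reported as
 * "not a space" without ever reaching the locale tables. */
int guarded_isspace(int c)
{
    return c >= 0 && c <= 0x7F && isspace(c) != 0;
}
```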

account42 2 years ago

So does whether words should be automatically capitalized in titles, and I say in the C locale they should not be.

otikik 2 years ago

At least its behavior is defined.