On Windows one can change how a date is rendered, without changing the locale. I need to look up if this is propagated to browsers.
Also, I hate DOB selectors which don't allow me to manually enter the date, and default to today, and don't have a year << arrow. Only month.
Now I need to click the < arrow at least (age - 1) * 12 times.
In general, I wish more websites would use native date / number / dropdown pickers.
Workday is the worst offender here.
Try clicking the year number and the month name. Those often show a pop up with less clicking required to get to where you want.
But those are unintuitive and still more time consuming than a masked numeric XXXX-XX-XX with a dropdown calendar. Which is what Windows provides anyways.
Is that analogous to setting LC_TIME, per application or temporary?
> On Windows one can change how a date is rendered
I think that’s true on all OSes
“People whose date formats break my system are weird outliers. They should have had solid, acceptable formats, like 平成10年8月1日.”
(With apologies to patio11.)
For context, in case anyone needs it, that's a common date format in Japan. Aside from using kanji characters, the big surprise to most of the rest of the world is that the largest epoch is specified as a royal era name[1], corresponding to the Japanese monarchy.
This parallels, and the remark about patio11 refers to, this article[2], which has since become famous on HN. It ends with a similar remark from the author's prior experience as an American expatriate in a less populous and less cosmopolitan part of Japan, when a clerk remarked that Patrick McKenzie was a troublesome name to have in Japan, and asked why he didn't change it to something convenient and ordinary like Tanaka Taro[3].
This has since become HN folklore.
[1] https://en.wikipedia.org/wiki/Japanese_era_name
[2] https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...
[3] https://news.ycombinator.com/item?id=6145768
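For anyone curious what handling that format involves, here is a minimal sketch in Python. The function name `parse_era_date` and the lookup table are made up for illustration. It assumes only the four most recent eras and does not check that a date actually falls inside its era (for example, it would accept a 平成 date after the era ended), so treat it as a rough illustration rather than a real parser.

```python
import re
from datetime import date

# Gregorian year that corresponds to year 1 of each era (illustrative table).
ERA_FIRST_YEAR = {
    "大正": 1912,  # Taisho
    "昭和": 1926,  # Showa
    "平成": 1989,  # Heisei
    "令和": 2019,  # Reiwa
}

ERA_DATE = re.compile(r"^(大正|昭和|平成|令和)(元|\d+)年(\d+)月(\d+)日$")


def parse_era_date(text: str) -> date:
    """Convert a date such as 平成10年8月1日 to a Gregorian date."""
    match = ERA_DATE.match(text)
    if match is None:
        raise ValueError(f"not an era-style date: {text!r}")
    era, era_year, month, day = match.groups()
    # 元年 ("first year") is the customary way to write year 1 of an era.
    year_in_era = 1 if era_year == "元" else int(era_year)
    return date(ERA_FIRST_YEAR[era] + year_in_era - 1, int(month), int(day))


print(parse_era_date("平成10年8月1日"))  # 1998-08-01
```

The era name is the part that cannot be computed: a new era needs a new table entry, which is why software older than the newest era needs some way to learn about it.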
Tiny correction: in 2010, I invented a thing parallel to something many well-educated Americans of my acquaintance believe with respect to the centrality of their experience, for the Falsehoods essay.
In 2012, a clerk actually asked my wife and me, when we got married, whether it wouldn't make more sense for me to change my name. Then he wouldn't have to spell Patrick McKenzie on the wedding paperwork, and, approximate quote, "I already have to get one name change form out for her so filling out a second one is no trouble at all."
There are two sources for Japanese eras on Windows:
* Hardcoded into the system libraries/.NET Framework
* The Windows Registry
And this allows things to continue working even if the software is older than the newest era.
There was a post on Language Log once scoffing at the stupidity of the Chinese for their feeling that Uyghurs should use normal Chinese names.
I asked a friend in Shanghai whether she thought Uyghur names should be allowed. She responded: "Huh? You don't get to choose what other people's names are. They tell you their names, and then you have to call them that.
And if you really want to think about weird names, there's a country called 'St. Vincent and the Grenadines'!"
(In other minority-names-in-China news, I have a Mongol friend whose name is Saruul. Her parents sinicized this as the three Mandarin syllables sha-ru-la, which look like a normal Chinese name. I don't actually know what she prefers to be called, but I suspect most people she knows call her Rula.)
Microsoft Excel is the worst offender here. When you're on a locale with , as a decimal point it's not able to read CSVs with . as a decimal point. It uses ; instead of , as a field separator.
Delegating parsing user input is a good idea, but sometimes the input methods you can rely on just don't cut it.
By the way: The international way to express a decimal separator is a (thin non-breaking) space. There's no misunderstanding possible.
> By the way: The international way to express a decimal separator is a (thin non-breaking) space. There's no misunderstanding possible.
According to whom? The CGPM recommends thin spaces as thousands separators, and either points or commas as decimal separators. NIST, ISO, etc. generally copy this, sometimes stipulating the decimal separator as one or the other.
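As a rough illustration of that convention (thin space for grouping, point or comma as the decimal marker), here is a small Python sketch. The helper name `format_grouped` is made up, and using U+202F NARROW NO-BREAK SPACE as the "thin non-breaking space" is my assumption. It also only groups the integer part, which is a simplification.

```python
NARROW_NBSP = "\u202f"  # NARROW NO-BREAK SPACE, assumed here as the thin space


def format_grouped(value: float, decimals: int = 2, decimal_mark: str = ".") -> str:
    """Group the integer part in threes with thin spaces; use the given decimal mark."""
    text = f"{value:,.{decimals}f}"  # e.g. "1,234,567.89"
    integer_part, _, fraction = text.partition(".")
    integer_part = integer_part.replace(",", NARROW_NBSP)
    # Only append the decimal mark when there are fractional digits to show.
    return integer_part + (decimal_mark + fraction if fraction else "")


print(format_grouped(1234567.891))                    # point as decimal marker
print(format_grouped(1234567.891, decimal_mark=","))  # comma as decimal marker
```

Either way, the space can only be read as grouping, so the remaining mark is unambiguous whichever one the writer prefers.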
This is a relic from the olden times when application data was rarely exchanged across locales, and people expected software to conform to the local conventions (and they largely still expect it). Microsoft never changed this because it would have broken (and would still break) a vast number of systems and workflows.
> decimal separator is a (thin non-breaking) space
I really hope you mean thousands separator ...
If you import the data from a file you can select the separator
You can in LibreOffice, but not Excel. You _must_ change your locale. I went down this rabbit hole a few weeks ago trying to open some Dutch datasets.
You can in Excel: https://superuser.com/a/407085
Also, I believe you can change the default separator in Settings somewhere.
I don't know if you noticed, but that answer is dated 2012 and it is currently 2025. As my comment about going down this rabbit hole "a few weeks ago" implies, all of those settings no longer exist.
I just opened Excel M365, loaded a CSV file via Data -> From Text/CSV, and could choose the delimiter.
I tested semicolon and comma, and in both cases the auto-detection selected the correct delimiter, but I could also change it to equals sign, space, tab, or a custom delimiter.
You can't change the decimal separator there, but you can change the field delimiter.
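Outside Excel, both choices can be stated explicitly instead of being inherited from the OS locale. A minimal Python sketch, where the sample data, the column names, and the helper `read_prices` are all made up for illustration:

```python
import csv
import io
from decimal import Decimal

# Hypothetical sample: a semicolon-delimited file written in a comma-decimal locale.
SAMPLE = "name;price\nwidget;1,5\ngadget;1004,25\n"


def read_prices(text: str, delimiter: str = ";", decimal_mark: str = ",") -> dict[str, Decimal]:
    """Parse name/price rows with the delimiter and decimal mark given explicitly."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    # Normalize the decimal mark before handing the text to Decimal.
    return {row["name"]: Decimal(row["price"].replace(decimal_mark, ".")) for row in reader}


print(read_prices(SAMPLE))
```

This only works because the caller knows how the file was written; nothing in the file itself says which convention applies.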
Take a moment to know that it’s *crucial* not to localize strings in health software services because it can lead to data leaks and performance degradation. It’s better to work with global APIs so you’re protected from all sorts of risks.
Performance degradation?
> Parsing Is Not a Science
It can and should be, though. I feel like we should have a separate word for parsing when the rules are not well-defined - something like "fuzzy parsing" (in a similar vein to fuzzy string comparison)
Renaming the problem doesn’t make it go away. It might be useful for identifying the subset of parsing which is problematic, but I think the article already achieves this well by specifying the subset of input under discussion.
It doesn't make the problem go away, but it makes it clear that parsing itself is not the problem
It's "scraping".
It’s called “guessing”.
OT: Why does almost every comment in this thread currently say “2 hours ago”, when they were probably written when this story was first featured, about 3 days ago?
Hovering over the time-ago item on the comment header displays the exact post time, and interestingly it shows times from Feb 16 (3 days ago) for many of the "2 hours ago" comments. Must be an artifact of some moderation tool.
There is a feature in HN that brings stories back to the front page, can't remember what it's called at the moment, but must be because of that.
How do people who use commas as decimals disambiguate 1,004 and 1.004 without changing the precision implied by number of decimal places?
You are trying to apply what you know versus what others know. It's no different than Fahrenheit vs Celsius or yard vs meter.
Personally, I think the MM/DD/YYYY format that is standard in the USA needs to die and be replaced with YYYY-MM-DD.
Same with 12-hour time, which should be replaced with 24-hour. As the saying goes, Americans use am and pm because they can't count past 12. AM and PM are a waste of code and display area. What fits in 2 characters takes up 5 characters.
No excuse for the date, but I get the 12/24 hour clock. 12h feels more natural when spoken. Where I'm from, we speak in 12h but always write down 24h. You'll see "20:00" written down, but you'll say "8 in the evening". We don't use am/pm, though.
It depends on the local norms. Where I live, 1,004 is decimal, 1 004 or 1'004 is 1004 which makes it even more clear than the en-US default. That is, the 1.004 variant is never used, and if it is, it is assumed to be a decimal (misspelling) of 1,004.
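To make the "it depends on the convention" point concrete, here is a small Python sketch. The function `parse_number` and its parameters are hypothetical; the caller has to say which marks mean what, and anything that does not fit is rejected rather than guessed. `Decimal` is used so that trailing zeros, and with them the implied precision, survive.

```python
from decimal import Decimal


def parse_number(text: str, decimal_mark: str, group_marks: str = "") -> Decimal:
    """Parse text under an explicitly stated convention; reject anything else."""
    for mark in group_marks:
        text = text.replace(mark, "")
    if decimal_mark != ".":
        # A leftover "." is not part of this convention, so refuse it
        # instead of silently treating it as a decimal point.
        if "." in text:
            raise ValueError(f"unexpected '.' in {text!r}")
        text = text.replace(decimal_mark, ".")
    return Decimal(text)


# Comma as decimal mark, space or apostrophe for grouping:
print(parse_number("1,004", decimal_mark=",", group_marks=" '"))  # 1.004
print(parse_number("1'004", decimal_mark=",", group_marks=" '"))  # 1004
# en-US style: point as decimal mark, comma for grouping:
print(parse_number("1,004", decimal_mark=".", group_marks=","))   # 1004
```

The same string "1,004" comes out as 1.004 or 1004 depending only on the arguments, which is the whole problem in miniature.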
So is there no such thing as "100,000,004" and having no idea if that's a decimal or a thousands separator?
Not sure where OP is from, but in my whereabouts „100,000,004“ wouldn’t show up in the wild. We use spaces to separate if really needed.
Ah makes sense. In Canada we were taught to use spaces to separate, and decimals for decimals. But being stuck so close to the US we end up with a mess of everything.
In Denmark, for example, the decimal point and comma are reversed in meaning, so you would not have a Danish 100,000,004 because that is an obvious non-Danish number.
You don't. It's ambiguous. Just like the string 01/03/2025 is if you don't know the source's locale.
But it can be worse. Los Angeles, Sunday, November 2, 2025, 2:00:00 am is ambiguous. Is it PST or PDT?
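The date-string half of this is easy to demonstrate. A tiny Python sketch using only the standard library, where the two format strings stand in for a month-first and a day-first locale:

```python
from datetime import datetime

raw = "01/03/2025"

# Same input, two equally valid readings depending on the assumed convention.
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_month_first.isoformat())  # 2025-01-03
print(as_day_first.isoformat())    # 2025-03-01
```

Both calls succeed without complaint, so the parser cannot tell you that you picked the wrong one.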
That's not an ambiguous date because Nov 2 2025 is PST, not PDT.
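For what it's worth, this can be checked with Python's `zoneinfo`, assuming the IANA time zone database is available (from the OS or the `tzdata` package). The helper `offsets` is made up for illustration. It asks for the UTC offset of a wall-clock time under both values of `fold`, and more than one result means the wall-clock time happens twice.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")


def offsets(wall: datetime) -> set[timedelta]:
    """UTC offsets a naive wall-clock time can map to (both values of fold)."""
    return {wall.replace(tzinfo=LA, fold=f).utcoffset() for f in (0, 1)}


# 2:00:00 am on Nov 2, 2025 maps to a single offset (UTC-8, i.e. PST).
print(len(offsets(datetime(2025, 11, 2, 2, 0))))   # 1
# 1:30 am on the same day occurs twice: once at UTC-7 and once at UTC-8.
print(len(offsets(datetime(2025, 11, 2, 1, 30))))  # 2
```

So on that day the repeated hour is 1:00 to 1:59 am, and a time in that range does need "PST" or "PDT" attached to be unambiguous.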
What I used to do is set the thousands separator to ' in the operating system settings. That made Excel read CSV files with 1,004 and 1.004 the same, as one and four thousandths. No one puts thousands separators in CSV files anyway, so that worked out. And it looked nice too.
In today's Windows 11 I can't find that setting. You can't set the thousands separator separately, not anywhere that I can find. It's a tragedy. I see Excel misreading CSV files all the time. I don't use Excel that much myself and I understand what's going on, so it doesn't affect me all that much directly, but for my Excel warrior colleagues, it's another matter.
You have to set it for the whole OS... it's like someone who works at Microsoft decided that you should never work on a file generated in a different locale.
If you need to apply a setting like that to a single application only, there are ways to create a program that will detour all the registry reads.
I'm sure there are -- but I'm not doing all of that for a one-off thing.
Wait, can you tell me how to set it for the whole OS? Then please do!
How do people who use periods disambiguate it? It’s simply ambiguous without context.
I mean, if that's how you write numbers, it's the same as disambiguating 1,004 and 1.004 if you use en-US norms.
Which is to say, you assume it's as written, unless context suggests otherwise.
The same way as people who use periods. 1,004.004