This problem is not limited to Arabic. Variants of the Arabic alphabet are used by Persian (including Iranian and Dari dialects), Mazanderani, Qashqai, Luri, Gilaki, Kurdish (excluding Kurds in Turkey), Talysh, Azerbaijani (in Iran), Pamir languages, Pashto, Urdu, Balochi, Sindhi (in Pakistan), Punjabi (in Pakistan), Uzbek (in Afghanistan), Turkmen (in Afghanistan), Saraiki, Hindko, Brahui, and languages spoken in Kashmir.
Whole languages are dying out because people are unable to express them properly on computers. Even the popular software that dominates among these speakers does little to improve their experience. For example, Urdu has traditionally been written in the Nastaliq form [1], but is usually rendered everywhere in the Naskh form [2]. There is no way to change this, for example, in Android without basically rooting it and changing the system fonts.
[1] https://en.wikipedia.org/wiki/Nastaliq
[2] https://en.wikipedia.org/wiki/Naskh_(script)
I don't know why people look down their noses at Arabizi.
For a while, Arabizi was wildly popular and universally used on feature phones. When mobiles became smarter, it was used less. Japanese has romaji and Mandarin has pinyin. Arabic's Arabizi would increase literacy rates and solve all these digital problems.
The vast majority of Japanese and Mandarin speakers are also not in favour of replacing their current writing systems (which give them a link to thousands of years of their own history) with simplified systems. I suspect it is the same for Arabic speakers.
Because people don't want to abandon hundreds or thousands of years of culture for a completely solvable problem.
I don’t know either, but I am aware that in glyph-based languages (and this article makes the case that Arabic has some glyph-like features), there is considerable social discussion about the equivalents, like pinyin. Detractors worry that sound-based approaches to writing (where the sounds are based on Latin / Western orthography) change something fundamental in people’s brains, as distinct from more native versions.
In Chinese, for instance, you can use a keyboard that combines radicals (parts of a character), or you can use a keyboard that combines phonemes. Those seem likely to change literally how you think in your language. There may be related concerns for Arabic.
That said, one of the complaints in the blog is that two different codepoint sequences can render as exactly the same letter / phrase / word. This is not a problem unique to Arabic in Unicode, and there are known approaches. I’m not a Unicode expert by any means, but I’d expect that more work on the tech stack for rectification (I’m sure there’s a technical Unicode word for this process of matching codepoints for, e.g., search and uniqueness of rendering) would likely be useful for Arabic, and would flow relatively seamlessly into many places.
(2017)
How much of this is still a problem with modern software/font stacks and HarfBuzz?
The fact that the article was able to show the correct version in regular text is a pretty good indicator that, if done correctly, those are more or less solved problems. I don't disagree that there are probably plenty of times when those mistakes are repeated and the solutions are not used widely enough (more often for Arabic scripts than for other languages), but even for 2017 it feels more like a set of anecdotal examples of what can go wrong when existing technical details are ignored. Those mistakes largely come down to whether someone who cares and understands both the language and the technology is involved, not to a lack of solutions. There are probably plenty of interesting edge cases that are not handled perfectly even though solutions for the basic cases exist, but the article doesn't come even close to discussing those technical details, especially if its only conclusion is "computers introduced more problems, notably because of Unicode".
> The inflexibility persisted and has arguably only become more aggravated in the 20th century
What about the 21st century? Digital printing can overlap characters just fine. And modern fonts support context-sensitive ligatures and glyph substitutions.
The second and third examples seem to be caused more by someone who doesn't understand the language copy-pasting stuff.
PDF -> that's just PDF being bad. Text and text search in PDFs tend to mess up even for English.
> with unicode number U+0623, but one can also type أ, which is an alif and a high hamza, represented by unicode numbers U+0627 and U+0654.
That's what Unicode normalization and locale settings are for. The same thing applies to a large fraction of Latin-based scripts other than English: anything that has letters with diacritic marks.
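To make the normalization point concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The helper name `same_text` is made up for illustration; the code points are the ones from the quoted passage. It only shows canonical equivalence for this one letter, not a complete answer for Arabic text handling.

```python
import unicodedata

# The two spellings from the quoted passage, written as escapes so the
# difference is visible in source form.
precomposed = "\u0623"       # ARABIC LETTER ALEF WITH HAMZA ABOVE
decomposed = "\u0627\u0654"  # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A raw code point comparison treats the two spellings as different...
raw_equal = precomposed == decomposed

# ...while comparing the normalized forms treats them as the same text.
normalized_equal = same_text(precomposed, decomposed)

print(raw_equal, normalized_equal)  # prints: False True
```

Whether a given application normalizes on input, on indexing, or only at comparison time is a design choice; the sketch only shows that the equivalence is already defined in the Unicode data.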
> for كثيره and كثيرة will in most cases yield different results
A similar thing happens in almost any non-English language, for example cafe and café, or ABC and ⒶⒷⒸ, although at least some systems handle it reasonably. I'm not sure how much of that is heuristics based on large data (hard to scale across software) and how much is good application of the Unicode character decomposition/normal form tables. Which Arabic letters lack appropriate Unicode decomposition (and other) tables, and what the best practices are for Unicode normalization/decomposition/locale handling in search (applicable to all languages), would be more interesting and actionable topics.
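As a rough illustration of where the decomposition tables help and where they stop, here is a sketch of a search-folding function built on Python's standard `unicodedata` module. The function name `fold_for_search` and the `extra_folds` parameter are assumptions made up for this example, and the HEH / TEH MARBUTA mapping is an application-level choice, not something Unicode defines.

```python
import unicodedata

def fold_for_search(text, extra_folds=None):
    """Hypothetical search key: NFKD, drop combining marks, casefold.

    extra_folds is an optional, application-specific mapping of single
    characters to replacements (an assumption, not part of Unicode).
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    folded = stripped.casefold()
    if extra_folds:
        folded = folded.translate(str.maketrans(extra_folds))
    return folded

# The Latin examples are covered by the standard decomposition tables.
print(fold_for_search("caf\u00e9") == fold_for_search("cafe"))  # prints: True
print(fold_for_search("\u24b6\u24b7\u24b8"))                    # prints: abc

# The quoted Arabic pair differs only in the last letter:
# HEH (U+0647) versus TEH MARBUTA (U+0629). Neither has a decomposition,
# so normalization alone keeps them distinct.
with_heh = "\u0643\u062b\u064a\u0631\u0647"
with_teh_marbuta = "\u0643\u062b\u064a\u0631\u0629"
print(fold_for_search(with_heh) == fold_for_search(with_teh_marbuta))  # prints: False

# Matching them takes an explicit extra rule chosen by the application.
arabic_folds = {"\u0629": "\u0647"}  # assumption: treat TEH MARBUTA as HEH
print(fold_for_search(with_heh, arabic_folds)
      == fold_for_search(with_teh_marbuta, arabic_folds))  # prints: True
```

Note that dropping combining marks is lossy: it also folds alef with hamza to a bare alef, which may or may not be what a given search feature wants.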
> Not even the simple idea of CJK has been implemented.
Many users of CJK languages would argue that CJK unification was a mistake. If different languages prefer different forms of a glyph, those forms would be better off as separate characters. Needing separate Chinese and Japanese fonts because CJK unified too much just introduces additional points of failure.