> As far as I know, there is nothing in Korean analogous to the English alphabet song.
We sing 가나다라마바사아자차카타파하 to exactly the same tune as the alphabet song up to P.
There is also a variant where you sing the actual names of the characters rather than just their sounds.
I wonder how databases deal with this. I imagine there's the appropriate LC_COLLATE, but would it have an impact on performance?
IIRC precomposed Hangul syllables are in Unicode[0] (although I'm not sure whether the precomposed characters are the canonical form or not), so you would likely apply the same functions to the precomposed Hangul block as you would to Basic Latin.
It's quite interesting because I once showed a non-technical Korean-speaking friend a picture of Unifont[1] and they thought that the precomposed Hangul Syllables block was quite daft. They likened it to "making a character for aa, ab, ac, ..."
[0] https://en.m.wikipedia.org/wiki/Hangul_Syllables
[1] http://unifoundry.com/pub/unifont/unifont-12.1.01/unifont-12...
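On the canonical-form question, here is a small sketch using Python's standard `unicodedata` module. It only shows that the precomposed syllable and its jamo sequence round-trip through NFD/NFC, and that plain code point order on precomposed syllables follows the usual South Korean dictionary order. The sample words are my own picks for illustration.

```python
import unicodedata

syllable = "한"  # a single precomposed code point, U+D55C

# NFD splits the syllable into conjoining jamo; NFC puts it back together.
decomposed = unicodedata.normalize("NFD", syllable)
recomposed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in decomposed])
print(recomposed == syllable)

# The precomposed block is laid out by initial consonant, then vowel,
# then final consonant, so a plain code point sort already works here.
words = ["하늘", "나무", "다리", "가방"]
print(sorted(words))
```

So the two forms are canonically equivalent, and which one you get depends on the normalization form you pick. If a column can contain both, normalizing before comparing is probably the safe move.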
I was under the impression that character blocks like 갋 and 괢 were not phonetically possible. I thought that for a double consonant you could only have the same consonant twice.
There's the Unicode Collation Algorithm, which can translate any Unicode string into a trivially comparable binary string (a sort key). You can map string -> collation string, and sort on that. It's not very expensive to compute, and you can store it explicitly as a separate column if you need to.
The only pain is that the collation algorithm is locale-dependent. It's not possible to have one universal mapping, because there are languages that have conflicting rules regarding the same characters (e.g. Swedish and German handling of umlauts), so if you're going to materialize/index the collation string, you'll have to choose who to disappoint.
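To make the "choose who to disappoint" part concrete, here is a toy sketch. It is not the real Unicode Collation Algorithm; `toy_sort_key` is a hypothetical helper I made up that hard-codes one rule per locale (German dictionary order files ö with o, Swedish puts å, ä, ö after z). A real system would get its sort keys from a proper collation library such as ICU.

```python
def toy_sort_key(word: str, locale: str) -> bytes:
    """Hypothetical toy sort key; a stand-in for a real collation key."""
    word = word.lower()
    if locale == "de":
        # German dictionary order: umlauts sort like their base vowels.
        table = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
        primary = word.translate(table)
        # Append the original spelling as a tie-breaker.
        return primary.encode("utf-8") + b"\x00" + word.encode("utf-8")
    if locale == "sv":
        # Swedish: å, ä, ö are separate letters after z.
        # "{", "|", "}" are the ASCII characters right after "z".
        table = str.maketrans({"å": "{", "ä": "|", "ö": "}"})
        return word.translate(table).encode("utf-8")
    raise ValueError(f"unsupported locale: {locale}")


words = ["zon", "öl", "ost"]
german = sorted(words, key=lambda w: toy_sort_key(w, "de"))
swedish = sorted(words, key=lambda w: toy_sort_key(w, "sv"))
print(german)   # ['öl', 'ost', 'zon']
print(swedish)  # ['ost', 'zon', 'öl']
```

The same three strings get two different orders, so a single materialized key column can only be right for one of the two audiences.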
A DB shouldn't really care, should it? It just needs any logical order to be defined, not necessarily one related to anything humans make use of.
One way to think about this is, orthographically, muhae is written mu_hae where _ sorts before any other letters.
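A rough sketch of that idea, using the standard arithmetic layout of the precomposed block (19 initials x 21 vowels x 28 final slots starting at U+AC00). The function names are mine, and the final index 0 ("no final consonant") plays the role of the `_`.

```python
HANGUL_BASE = 0xAC00
INITIALS = 19
MEDIALS = 21
FINALS = 28  # 27 final consonants plus slot 0 for "no final"


def jamo_indices(syllable: str) -> tuple[int, int, int]:
    """Split one precomposed syllable into (initial, medial, final) indices."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < INITIALS * MEDIALS * FINALS:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, MEDIALS * FINALS)
    medial, final = divmod(rest, FINALS)
    return initial, medial, final


def word_key(word: str) -> list[tuple[int, int, int]]:
    """Sort key for a word made only of precomposed syllables."""
    return [jamo_indices(ch) for ch in word]


print(jamo_indices("무"))  # (6, 13, 0): empty final slot
print(jamo_indices("묵"))  # (6, 13, 1): same syllable with a final consonant
print(sorted(["문", "묵", "무해"], key=word_key))
```

Since 무 has final index 0, 무해 compares lower than 묵 or 문 on the first syllable alone, before the ㅎ of the second syllable is ever looked at.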