81 points by danso 2 days ago
This article conflates the encoding used to store Burmese text with the font used to render it. Here's a chart showing the discrepancies between the two encodings:
And here's an FAQ from Unicode itself:
Old-timers (particularly outside the US) may remember the ISO 8859 debacle, where there were various encodings for primarily European languages using the same codepoints, causing tons of confusion:
No, the article doesn't conflate anything, the situation isn't comparable to ISO-8859-* at all. Zawgyi is merely a font that renders certain glyphs differently than specified by Unicode.
It isn't a real encoding, it's a nasty hack and that's what makes the transition difficult.
Once more: the encoding is what's used to store the data, the font is what's used to render it. If your data is not encoded correctly, Zawgyi won't show it either. Of course, the fact that Zawgyi isn't properly standardized doesn't help.
Also, most of the technical content of the article is gibberish. Exhibit A: "It made use of the visual typing and encoding method as one would write it on paper, rather than using logical linguistics and computer encoding conventions of Unicode."
The reporting here is so bad I'm honestly confused about whether anything at all is changing.
I wonder why this news outlet tried to report this specific piece of news at all, instead of leaving it to the specialist press.
Unicode doesn't define rendering:
> The Unicode Standard does not define glyph images. The standard defines how characters are interpreted, not how glyphs are rendered. The software or hardware-rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape, nor style of on-screen characters
What is the scale and potential difficulties involved for a migration like this? I hope some HN readers can inform us on the technical challenges.
I'm a native Sinhalese speaker, and we have many characters that were not included in Unicode until 10 or so years ago. To "fix" this, we created different font files, so that when you type "w" on an En-US keyboard, the font glyph is "අ", which is pronounced with an "A" sound. This worked out OK in places where we had control over the font, so for rich text documents this was less of a problem.
On websites, however, this text becomes a mess, because they use regular fonts with the correct glyph/code-point mapping. We were just abusing the fonts to get the characters we needed.
When Sinhalese characters made it into Unicode, we couldn't immediately convert the text, because you first need to check whether a document uses one of these botched fonts, and then do some serious replacing. That's difficult because, even with Unicode, we have diacritics, and certain glyphs need more than one code point to represent them.
One glorious regex replace, in theory, could perform a similar migration for Burmese as well; you just have to write it.
As a developer who's dealt with these issues briefly in the past: hopefully it really kicks off this time!
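A minimal sketch of what such a mapping-based conversion could look like. The mapping entries below are hypothetical placeholders, not a real Zawgyi or Sinhala table; a real table has hundreds of entries plus context-dependent rules, and as noted above, the hard part in practice is detecting which encoding a document uses in the first place.

```python
import re

# Hypothetical legacy-font -> Unicode mapping: keys are the characters the
# legacy font abused, values are the intended Unicode characters.
# These two entries are placeholders for illustration only.
LEGACY_TO_UNICODE = {
    "w": "\u0d85",  # the legacy font drew Sinhala 'අ' on the 'w' key
    "l": "\u0d9a",  # placeholder entry
}

# One alternation pattern, so the whole text converts in a single pass.
_pattern = re.compile("|".join(map(re.escape, LEGACY_TO_UNICODE)))

def convert(text: str) -> str:
    """Replace every legacy-encoded character with its Unicode equivalent."""
    return _pattern.sub(lambda m: LEGACY_TO_UNICODE[m.group(0)], text)

print(convert("w"))  # -> අ (U+0D85)
```

Characters outside the mapping pass through unchanged, which is exactly why mixed or misdetected documents end up half-converted.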
Sorry, but I don't quite understand what the problem here is. Is it about a different character set? Or encoding? Or glyph rendering? The article keeps using "font".
According to https://code.google.com/archive/p/zawgyi/wikis/WhyUnicode.wi..., there appears to be a mismatch between the corresponding code points of Zawgyi and Unicode. So changing the font used for rendering could conceivably result in corrupt text.
https://frontiermyanmar.net/en/features/battle-of-the-fonts (mentioned in the article being discussed) is much clearer.
Short version: in Burmese, the form a character takes depends on context. Zawgyi ‘solves’ that by having separate code points for the different forms, requiring the user to pick the right variant. The Unicode way is to make the (font + font renderer) pair smarter, just as a Unicode renderer displays “é” for the two code points “e” + combining acute accent.
Zawgyi also, necessarily, uses Unicode code points assigned for other characters to encode the variants.
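The “é” point is easy to demonstrate: the precomposed character and the base-letter-plus-combining-accent sequence are different code point sequences that should render identically, and NFC normalization folds one into the other. A quick Python check:

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point (U+00E9)
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

print(len(precomposed), len(combining))  # 1 2
# NFC normalization maps the two-code-point sequence to the single code point.
assert unicodedata.normalize("NFC", combining) == precomposed
```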
The shape of "lowercase sigma" depends on whether it's in the middle of a word or at the end. These are adjacent in address space.
ς and σ. I won't shout out their names. Is this the case in modern Greek too?
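The adjacency is easy to verify from the code points: word-final sigma (U+03C2) sits immediately before the regular form (U+03C3) in the Greek block.

```python
# GREEK SMALL LETTER FINAL SIGMA and GREEK SMALL LETTER SIGMA are neighbours.
final_sigma, sigma = "\u03c2", "\u03c3"
print(hex(ord(final_sigma)), hex(ord(sigma)))  # 0x3c2 0x3c3
assert ord(sigma) - ord(final_sigma) == 1
```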
Many of such warts in Unicode are for allowing round-tripping with 8-bit character encodings. I suspect that’s the case here, too. https://en.wikipedia.org/wiki/ISO/IEC_8859-7 has them, too.
That doesn’t explain why Unicode seems to have 27 (!) different “sigma” code points, though (https://en.wikipedia.org/wiki/Sigma#Character_encoding)
From the article it seems that Zawgyi rendered various Unicode codepoints (some of them reserved for the Burmese script and others from other scripts) in the order they appeared graphically. Although Burmese usually puts the vowel in the front of a consonant-vowel cluster, Unicode probably has it so that the codepoint corresponding to the vowel is encoded after the codepoint for the consonant. They're migrating by using a font that renders based on the Unicode order and keyboards that type in that order.
That's just what I got from the article and I have no other knowledge of the Burmese script so I may be wrong.
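As a concrete illustration of the storage-order point (assuming the standard Unicode Myanmar encoding): in the syllable မေ, the vowel sign is drawn to the left of the consonant, but Unicode stores the consonant first, in logical order.

```python
# Logical (Unicode) order: consonant MA (U+1019) first, then
# MYANMAR VOWEL SIGN E (U+1031), even though the vowel sign is
# rendered to the LEFT of the consonant.
syllable = "\u1019\u1031"  # မေ
print([f"U+{ord(c):04X}" for c in syllable])  # ['U+1019', 'U+1031']
# A visual-order encoding like Zawgyi stores the same glyphs the other
# way round, which is why reinterpreting the bytes under the wrong
# convention corrupts the text.
```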
And Japan is still using Shift-JIS...
Imagiŋe if all the "n"s became "ŋ" if you accideŋtally used the Helvetica British foŋt instead of Helvetica Americaŋ oŋ your website.
Han unification is a bit of a mess, but you can fix it with lang attributes or the equivalent ("Language" selection in office document formats, for example). You don't need to fake things with a custom font.
Except those are all application-specific and if you're the one writing the application, rendering different languages differently still seems to require messing around with fonts. (Or embedding an HTML renderer that does the messing for you.) I'm not sure whether there's any solution for OS-level strings that don't support changing the font (e.g. window titles, application names) beyond requiring the user to pick one language for their system and ignore all others.
Do variation selectors solve this problem? i.e., is there some way for me to include a variation selector in a filename or terminal output or something and have things render right?
I think variation selectors are intended for the case where even knowing the language is not enough to select the correct glyph, e.g. because they have been unified in a national standard despite different variants remaining in use. If one of those variants happens to be used in another country, you could of course use it to fix at least some cases, but I don't think all unified characters have a selector for each of their country-specific variants.
Even if they did, the Ideographic Variation Database doesn't exactly make it easy to use variation selectors for that purpose, because you only get an example demonstrating what the glyphs should look like. To find out which glyph (and hence variation selector) to use for a given language, you'd need an additional database.
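Mechanically, a variation selector is just an extra code point appended after the base character; whether anything honors it depends entirely on the font and renderer. A minimal illustration (the base character and selector here are chosen only for demonstration, not taken from a registered IVD sequence):

```python
base = "\u845b"            # 葛, a CJK ideograph with regional glyph variants
ivs = base + "\U000E0100"  # append VARIATION SELECTOR-17, the first IVS selector
print(len(base), len(ivs))  # 1 2 -- the selector is a separate code point
# Filenames and terminals will carry the selector along like any other
# character; whether it renders as intended is up to the font.
```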
Interesting to see this from a neighbor of India, the land of fonts such as Kruti Dev and its Devanagari hack, which are still regularly in use across the board.
Yes finally! I had been waiting for this moment