Episode 01: A Tale of Two Encodings

Listen directly on this page or visit the links in the top right to listen on your favorite podcast provider, Apple Podcasts, Spotify, or Stitcher.

People across the world connect with each other on the internet in hundreds of different languages. Technology has created some of the world’s biggest problems, but global connectivity is not one of them. Welcome to the 1 in 6 Engineers podcast. My name is Mingshi and in season 1, we will explore the biggest challenges and wins in internationalization.

There is a part of your phone that controls your ability to create and consume content on the internet. It’s not the battery life, whether it has 5G, or what apps you have installed. You have likely never given it a second of thought, but for millions of people in the country of Myanmar, this critical detail has shaped their internet experience for decades and will define their lives in the next decade. Today’s story is a tale of two encodings.

Part 1: What is Myanmar like?

Source: Audley Travel (link)

Myanmar is a country in Southeast Asia, also known as Burma. Today Myanmar is home to over 54 million people, making it slightly larger than the state of California. The 19th and 20th centuries were characterized by British influence and rule. Then during WWII, Myanmar was occupied by Japan. The country became a major battleground and up to 250,000 civilians died.

Aung San rose as the leader of the independence movement, and he is known as the Father of Modern Myanmar. He fought to free Myanmar from British and Japanese rule, but he was assassinated in 1947, before Myanmar finally won independence in 1948. His daughter Aung San Suu Kyi eventually became a prominent leader for democracy, and she won the Nobel Peace Prize in 1991 for her stand against the military dictatorship. Her fight is still ongoing to this day.

Myanmar is a country that is still very unstable, especially due to conflicts between the military junta and Aung San Suu Kyi’s party, the National League for Democracy. Myanmar is made up of states with very different ethnic and religious groups. There are many instances of civil conflicts, such as between the military and the ethnic minorities in Shan state, between the government and the Christian Kachin Independence Army, and between the government and the Rohingya Muslims. Aung San Suu Kyi was elected as the head of government in 2015, though her time in office has been controversial due to her apparent support of the military and its acts of genocide against the Rohingya Muslim minority group.

Aung San Suu Kyi was democratically reelected in November 2020, but on February 1, 2021, the military seized control of the government and arrested her, claiming election fraud. Since then, they’ve greatly restricted the ability of civilians to access the internet and committed horrific acts of violence. Due to the lack of connectivity to the internet and social media, the Myanmar people are struggling to be seen by the rest of the world, now more than ever. Unfortunately, this problem is nothing new.

Part 2: What is Unicode?

Going back to WWII, the idea of computers and networking began with the Allied forces needing a way to break the German codes. The Germans would send a secret message by using a confusing letter mapping, like A becomes D, B becomes G, and C becomes Z. This mapping would change every day, and the Allies needed a way to efficiently process potential mappings to intercept German communication. That processor became the blueprint for modern computers.

Today our computers use a similar method to encode words that we read and write. At a base level, computers only understand numbers, so A is 1, B is 2, C is 3. This is called text encoding, and in the early days of the internet, different countries had different encoding protocols, because they only needed to support their country’s language. So let’s just say that in Canada, A is 2, B is 3, and C is 4. If you were to send a message like “CAB” to your friend with American encoding, your computer would encode the message as “312” and your friend’s computer would decode the message as CAB. But what if you had another friend in Canada? Their computer would decode it as “B-?-A” because their system doesn’t have a mapping for the number 1, and 2 maps to A and 3 maps to B. Is this starting to make sense?
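
If you want to play with that example yourself, here is a minimal Python sketch. The "American" and "Canadian" tables are the made-up mappings from the story above, not real encodings, and the function names are just for illustration.

```python
# Toy encodings from the example above (made up for illustration).
AMERICAN = {"A": 1, "B": 2, "C": 3}
CANADIAN = {"A": 2, "B": 3, "C": 4}

def encode(text, table):
    """Turn each letter into its number under the given table."""
    return [table[ch] for ch in text]

def decode(numbers, table):
    """Turn numbers back into letters; numbers with no mapping become '?'."""
    reverse = {num: ch for ch, num in table.items()}
    return "".join(reverse.get(n, "?") for n in numbers)

message = encode("CAB", AMERICAN)
print(message)                     # [3, 1, 2]
print(decode(message, AMERICAN))   # CAB
print(decode(message, CANADIAN))   # B?A
```

The numbers on the wire never change. Only the table used to read them does, and that is the whole problem.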

In the early days of the internet, American universities and the Department of Defense were the primary users of the internet, and everyone used the same encoding rules in English, so there wasn’t a problem with mixing encodings. However, when the internet became widely available across the world in the early 1990s, the encoding problem became a big hindrance, especially for economic development and businesses.

Thus the Unicode Consortium was born in 1991. This organization made it their mission to create a universal standard on encoding that would enable any person to communicate with anyone else in any language on any computer or device. They have established standard character mappings for 154 scripts, which totals nearly 150,000 unique characters.

Each script gets a block of characters to use, and the Unicode Consortium determines which numbers map to which letters. So 0 to 127 are given to standard Latin characters, which includes all the letters we use, A-Z, capital and lowercase. Every computer contains the lookup table for which numbers map to which letters.

In 2001, Myanmar was given the block of characters mapping to 4096 to 4255, and the character mapping was designed. Here’s the codepoint mapping:

Source: Wikipedia (link)
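
You can peek at these numbers with Python's built-in `ord`, `chr`, and the standard `unicodedata` module. This is only a small sketch; the constant names `MYANMAR_START` and `MYANMAR_END` are mine, chosen to match the 4096 to 4255 range mentioned above.

```python
import unicodedata

# Every character has a number (its codepoint), and vice versa.
print(ord("A"))   # 65
print(chr(97))    # a

# The Myanmar block, written in hexadecimal as U+1000 to U+109F.
MYANMAR_START, MYANMAR_END = 0x1000, 0x109F
print(MYANMAR_START, MYANMAR_END)   # 4096 4255

# The first codepoint in the block is the Burmese letter "ka".
print(unicodedata.name(chr(MYANMAR_START)))   # MYANMAR LETTER KA
```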

Part 3: What are the languages of Myanmar?

The majority of Myanmar people can read and write Burmese, and minority groups also use the languages Shan, Mon, Karen, and Kachin. Most people do not understand English.

A sample of Burmese text. Notice the circular structure.

Source: Pinterest (link)

Burmese consists of a set of 33 consonants and 12 vowels. Like English, Burmese is written from left to right. Modern Burmese actually uses many loan words from English, and some others from Sanskrit, Hindi, and Mandarin. Unlike English, each single character is composed of multiple consonants and vowels. That means the letters on the keyboard need to be combined together to create a single character. It’s similar to when we press the letter ‘a’ on a keyboard and see lowercase ‘a’ on the screen, but we need to press ‘Shift+a’ to get a capital A. For every single character in Burmese, you need to type multiple consonants and vowels. If you type the character sequence wrong, you might get a typo, or you might get a garbled word, kind of like if you pressed a+Shift, you wouldn’t get a capital A. Burmese characters tend to be very circular, so mistyped characters will also appear like empty dotted circles.

Unicode Burmese was quite difficult for people to use. It’s hard for us as English speakers to really understand this struggle, because we have a very small set of letters that can be used together to make any word. Burmese word building has extra steps, because the set of standalone characters is much larger, so they don’t have a simple keyboard that can just type out all possibilities. It’s as if our keyboard were composed of lines and curves, so to make the letter ‘L’, you have to type a vertical line and then a horizontal line. But also, if you typed a horizontal line and then a vertical line, you would get the letter T.

Unicode Burmese initially had these kinds of rules, like you needed to type the characters in a certain order. There aren’t the same sort of rules in writing Burmese by hand, so people found it very confusing and difficult to learn. So of course, many people decided, “Okay this isn’t for me. I don’t need the internet.”

Seeing an opportunity, a group of developers created a new encoding system, using the same code block, but with new rules that were much more flexible and intuitive. They named this system Zawgyi. Zawgyi also has a keyboard with consonants and vowels as the keys, but it was much more visually focused, so there were some “repeated” characters that varied slightly based on their appearance. It’s as if our keyboard had separate keys for uppercase and lowercase letters.

This means that with Zawgyi, there are often numerous ways a single rendered character can be represented in Unicode code points. So when looking at two words that appear identical, the code points that represent them could be totally different. This causes a few problems, one of which is search. If you tried Googling for dogs, and you had to type in “d-o-g” and capital “d-o-g”, and every permutation of capital and lowercase d-o-g, you’d be pretty annoyed. For social media services like Facebook, content moderation is also a challenge, because it becomes much harder to flag hate speech and cyberbullying when there are so many ways to type out one keyword.
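
The Latin alphabet has a small-scale version of this "looks the same, stored differently" problem, which makes it easy to try out. The sketch below is only an analogy using the accented letter é. It is not Zawgyi, and the normalization step shown here does not repair Zawgyi text.

```python
import unicodedata

precomposed = "caf\u00e9"    # é stored as a single codepoint
combining = "cafe\u0301"     # e followed by a separate combining accent

# Both strings look identical on screen, but a naive search would miss one.
print(precomposed == combining)           # False
print(len(precomposed), len(combining))   # 4 5

def normalize(text):
    """Rewrite text into one canonical codepoint sequence (NFC)."""
    return unicodedata.normalize("NFC", text)

print(normalize(precomposed) == normalize(combining))   # True
```

Search engines and moderation tools rely on there being one canonical form like this, and that is what is missing when the same Burmese word can be typed many ways in Zawgyi.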

Regardless, Zawgyi is so popular now that some sources estimate that nearly 90% of people who own a device with internet connectivity have Zawgyi installed. Speaking of which, most people in Myanmar do not have personal computers, and internet cafes are popular ways for people to get online. When 3G mobile internet became available in 2015, the percentage of people in Myanmar who were regularly using the internet rose to 12.6%.

Encoding is a system-level setting, so people needed to install it on their computers and phones. News sites and tech-savvy bloggers adopted Zawgyi, and that influenced their readers to install Zawgyi as well. Beyond those early adopters, most Myanmar users prefer to buy their phones at a local phone shop. The phone shop owner or operator will usually install commonly used apps, such as Facebook, and even set up user accounts. If the phone doesn’t have Zawgyi installed, the operator will install that too. Mobile data is also very expensive, so people often never update their apps.

Part of the reason people are so reliant on their phone shops is due to low technical literacy. Today Myanmar has over 13 million internet users, including 11 million Facebook users. Most websites, such as Wikipedia, do not offer Zawgyi support, but since 2010, Facebook has been available to Zawgyi users, which has enabled it to basically be the main source of traffic on the internet. Out of a population of 54 million, 1 in 4 people have internet connectivity and 1 in 5 have a Facebook account.

Part 4: Zawgyi and Unicode

This is a big problem for sure, and the Myanmar government recognized it as one. Not using the same encoding as the rest of the world meant that anything coming out of Myanmar that was encoded with Zawgyi would also be unreadable to the rest of the world. It significantly slowed down technical literacy and entrepreneurship, because of the added layer of text encoding. You and I literally never have to really think about our text encoding because we get Unicode automatically everywhere. When the whole country is on Unicode, Myanmar will finally be on par with the rest of the world.

Luckily, the Unicode Consortium made some big improvements in the Unicode user experience, and the new updated Unicode encoding enabled people to type intuitively, just like Zawgyi, and it also gave them letters used in Myanmar’s secondary languages, like Shan, Mon, Kayah, and Karen.

On the backend, Unicode uses a deterministic ordering of codepoints, which means words that look the same are also represented with the same codepoints. That fixes the search issue I mentioned earlier, and it restores the ability to match hate speech. The new Unicode is also much more user friendly because it doesn’t restrict the user to inputting the consonants and vowels in a particular order, which was the main drawback of the old Unicode.

On October 1, 2019 the government mandated that everyone must switch to Unicode. They didn’t, however, provide any support for this transition, and much of the country still relies on Zawgyi. With nearly 90% of people needing to switch, only a small number of people have been able to switch independently, meaning they’ve been able to download and install Unicode on their phones. The government hasn’t offered official instructions, but companies like Facebook have kicked off efforts to educate people and support their switch. Myanmar technology users have also become more tech savvy, as seen in the rising popularity of Signal, an end-to-end encrypted messaging app that prevents outside parties from spying.

A comparison of the same codepoints rendered with Unicode and Zawgyi.

Source: Facebook Engineering (link)

This unfortunately has left Myanmar in an awkward state of transition. Now some people, who tend to have more technical expertise and also newer devices, are using Unicode but wanting to communicate with people on Zawgyi. How do they do it?

A simple solution that some newspapers and bloggers use is posting the same thing in both encodings. Often they’ll have a section that is posted in Unicode, and then pasted below is the same content that is Zawgyi encoded. That way people can read one or the other. It’s inefficient and doesn’t look too good, but it gets the job done. There are a number of popular converters, such as Rabbit Converter (https://www.rabbit-converter.org/Rabbit/), which converts Zawgyi encoded text to Unicode encoded text.

Other platforms like Facebook detect the encoding of the content (such as a message from friend to friend) and then they figure out which encoding is installed on the device, and finally they perform the conversion when necessary before displaying the content to the user.

Part 5: How do Unicode and Zawgyi users communicate?

Because Unicode and Zawgyi are just different ways to use the code block given to Myanmar, and remember there’s a limited number of codepoints, they actually map different characters to the same codepoints. Therefore, while it’s easy to tell if a sequence of codepoints is any type of Myanmar text just by looking at what the codepoints are, it’s impossible to tell from the codepoint range alone whether it’s using Unicode or Zawgyi encoding.
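
The "easy" half of that can be sketched in a few lines of Python. The function name `looks_like_myanmar` is just for illustration, and the range is the 4096 to 4255 block mentioned earlier.

```python
# The Myanmar code block: 4096 (U+1000) through 4255 (U+109F), inclusive.
MYANMAR_BLOCK = range(0x1000, 0x10A0)

def looks_like_myanmar(text):
    """True if any character falls inside the Myanmar code block."""
    return any(ord(ch) in MYANMAR_BLOCK for ch in text)

print(looks_like_myanmar("hello"))          # False
print(looks_like_myanmar("\u1019\u103c"))   # True
```

This check gives the same answer for Unicode and Zawgyi text, because both draw from the same block. It says nothing about which of the two was used.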

So in other words, they use different rules to act on the same codepoints, so when we use Unicode to decode the codepoints, the resulting text could look totally readable, and when we use Zawgyi to decode them, the resulting text makes absolutely no sense. The most similar experience I can think of without actually showing you anything is like when your computer doesn’t know the new emojis, so it just shows you a box. You know what I’m talking about? But in this case, mis-rendered Burmese characters show up as incomplete characters or total nonsense phrases.

Notice that given the same set of codepoints, Unicode renders them correctly, and Zawgyi renders gibberish (like the dotted circle, which is an unrendered placeholder). In this case the codepoints were made with Unicode encoding.
Source: Me :)

For now, let’s take two friends, Ursula and Zachary, who use an app called Tweetbook to message each other. Ursula’s phone has Unicode installed, and Zachary’s has Zawgyi. If they tried messaging each other, they’d each see their own messages totally fine, but the other’s would just look like incomplete nonsense words. And that’s because each phone is taking codepoints encoded with one mapping and decoding them with another one.

The solution has three key steps for enabling Unicode and Zawgyi users to understand each other. If Tweetbook is the message passer, then in order to convey Ursula’s messages to Zachary, it needs to know:

1/ Zachary’s device encoding

2/ Ursula’s message encoding

3/ how to convert between Unicode and Zawgyi.
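
Those three pieces can be wired together roughly like this. It is only a sketch of the decision logic: `detect_encoding` and `convert` are hypothetical functions that the caller supplies, and the stand-ins at the bottom exist only so the example runs.

```python
def deliver(message, device_encoding, detect_encoding, convert):
    """Return the text that should be shown on the recipient's device.

    detect_encoding(message) -> "unicode" or "zawgyi"   (hypothetical)
    convert(message, source, target) -> converted text  (hypothetical)
    """
    message_encoding = detect_encoding(message)
    if message_encoding == device_encoding:
        return message  # already readable on this device, leave it alone
    return convert(message, message_encoding, device_encoding)

# Stand-ins so the sketch runs; real detection and conversion are far more involved.
fake_detect = lambda text: "unicode"
fake_convert = lambda text, src, dst: f"[{src}->{dst}] {text}"

print(deliver("hi", "zawgyi", fake_detect, fake_convert))    # [unicode->zawgyi] hi
print(deliver("hi", "unicode", fake_detect, fake_convert))   # hi
```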

When Ursula sends Zachary a message, his phone receives a set of codepoints. First, to determine Zachary’s device encoding, we can take advantage of the fact that in one encoding, combining several codepoints will combine consonants and vowels to create a single character, while in the other encoding they do not. We can measure the pixel width of the text to determine which font encoding the device is using to render the message. Usually this is done in the HTML: an invisible or hidden element is rendered with a small piece of Burmese, and the renderer is able to measure how long that text looks.
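
The real measurement happens in the browser, so the sketch below only covers the decision that comes after it. Everything specific here is an assumption for illustration: the probe string (the letter ka, then U+1039, then ka again), the 1.5 threshold, and `measure_width`, a hypothetical function that would report the rendered width of a hidden element.

```python
PROBE = "\u1000\u1039\u1000"   # assumed probe: ka + U+1039 + ka
SINGLE = "\u1000"              # a single ka, used as the width yardstick

def guess_device_encoding(measure_width, threshold=1.5):
    """Guess the device font from rendered widths.

    measure_width(text) -> rendered width in pixels (hypothetical callable).
    Assumption: a Unicode font stacks the probe into roughly one letter's
    width, while a Zawgyi font draws the two letters side by side.
    """
    ratio = measure_width(PROBE) / measure_width(SINGLE)
    return "zawgyi" if ratio >= threshold else "unicode"

# Fake "fonts" so the sketch runs without a browser.
stacking_font = lambda text: 20.0
side_by_side_font = lambda text: 20.0 if len(text) == 1 else 40.0

print(guess_device_encoding(stacking_font))       # unicode
print(guess_device_encoding(side_by_side_font))   # zawgyi
```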

The next challenge is to identify the encoding of the message. Many methods have been explored, including a regular-expression approach and proprietary models, but the most successful and widely used today was created by researchers at Google: an open-source library called myanmar-tools that uses a machine learning algorithm (github). This model is trained on several megabytes of data from various Burmese text samples across the internet. The algorithm assesses the probability that a given message was encoded with Zawgyi or Unicode.
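
I won't try to reproduce the myanmar-tools model here. To give a feel for the simpler regular-expression approach, here is a toy check built on one assumption: Zawgyi stores the vowel sign U+1031 before its consonant, in the order it is drawn, while Unicode stores it after the consonant. The function name is made up, and a real detector weighs many more signals and returns a probability rather than a yes or no.

```python
import re

# Assumed rule: U+1031 at the very start of the text, or right after
# whitespace, has no consonant in front of it, so treat it as a Zawgyi hint.
ZAWGYI_HINT = re.compile(r"(?:^|\s)\u1031")

def looks_like_zawgyi(text):
    """True if the text contains the one Zawgyi hint this sketch knows about."""
    return bool(ZAWGYI_HINT.search(text))

print(looks_like_zawgyi("\u1031\u1000"))   # True  (vowel sign first)
print(looks_like_zawgyi("\u1000\u1031"))   # False (consonant first)
```

A result of False does not prove the text is Unicode. It only means this one hint was not found.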

The final problem is matching Ursula’s message encoding to Zachary’s device encoding. Like I mentioned earlier, Rabbit Converter is one solution (github), and Google’s myanmar-tools is another. There are other converters out there too, but none of them guarantee 100% accuracy. The implementation generally involves applying a large set of rules that re-encode the same characters. Because Unicode is deterministic, Unicode-to-Zawgyi conversion is theoretically perfect. Zawgyi-to-Unicode is a bit more difficult because Zawgyi can encode the same word multiple ways.
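
To make "a large set of rules" a little more concrete, here is a single illustrative rule in Python. It is not taken from Rabbit Converter or myanmar-tools. It assumes that Zawgyi stores the vowel sign U+1031 before its consonant and Unicode stores it after, and the function name is made up. A real converter chains many such rules and also remaps codepoints that mean different things in the two systems.

```python
import re

CONSONANT = "[\u1000-\u1021]"   # the basic Burmese consonant range

def reorder_e_vowel_to_unicode(text):
    """Move U+1031 from before its consonant (Zawgyi order) to after it."""
    return re.sub("\u1031(" + CONSONANT + ")", "\\1\u1031", text)

zawgyi_order = "\u1031\u1000"   # vowel sign, then ka
print(reorder_e_vowel_to_unicode(zawgyi_order) == "\u1000\u1031")   # True
```

Running a rule like this on text that is already Unicode can scramble it, which is why detecting the message encoding has to come first.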

Some apps and web browsers use this three-step approach to enable people with different device encodings to send each other messages, read each other’s content, and browse the internet. Many others have opted to be available only in Unicode.

Part 6: The future of Myanmar

Myanmar people tend to own Android devices from makers such as Xiaomi, Huawei, and Samsung that come with Zawgyi installed, and some of them have very old devices that do not even support the Unicode upgrade. Most people frankly don’t care about device encoding, but some people can’t even afford to. For massive apps like Facebook, dropping support for these people would be simply unfair and potentially dangerous, especially in the case of Facebook because they have been accused of inciting violence against the Rohingya people, a Muslim minority group of Myanmar who were the target of a military-backed genocide in 2016.

Source: Financial Times (link)

On February 1, 2021, following the democratic elections of November 2020, the elected leader Aung San Suu Kyi was arrested and the military declared a state of emergency. As part of the coup, the military banned social media networks such as Facebook, Twitter, and Instagram, and they turned off internet service entirely from 1:00AM to 9:00AM every day.

Technological literacy has become more and more important in Myanmar with the conflicts between the military state and the people. End-to-end encrypted messaging and apps like Signal and WhatsApp are a priority now for people who fear being detained by police for what they say online. Most people do know how to use VPNs to get around the bans on social media, but ultimately the gap between those who are tech savvy and those who aren’t is only going to get bigger. The disconnect between Zawgyi and Unicode inhibits large-scale organization efforts, and it risks leaving the most vulnerable groups of people behind.


M


“I see a beautiful city and a brilliant people rising from this abyss, and, in their struggles to be truly free, in their triumphs and defeats, through long years to come, I see the evil of this time and of the previous time of which this is the natural birth, gradually making expiation for itself and wearing out.” Charles Dickens, A Tale of Two Cities

Citations:

“Brief History of Zawgyi Font.” Lionslayer, 4 Dec. 2014, lionslayer.yoeyar.com/?p=1515.

Cuddy, Alice. “Myanmar Coup: What Is Happening and Why?” BBC News, BBC, 15 Mar. 2021, www.bbc.com/news/world-asia-55902070.

Frontier. “Battle of the Fonts.” Frontier Myanmar, 20 May 2020, www.frontiermyanmar.net/en/battle-of-the-fonts/.

Google. “Google/Myanmar-Tools.” GitHub, github.com/google/myanmar-tools/blob/master/README.md.

LaGrow, Nick, et al. “Integrating Autoconversion: Facebook's Path from Zawgyi to Unicode.” Facebook Engineering, 23 Mar. 2020, engineering.fb.com/2019/09/26/android/unicode-font-converter/.

“Myanmar Switch to Unicode to Take Two Years: App Developer.” The Myanmar Times, 21 Nov. 2019, www.mmtimes.com/news/myanmar-switch-unicode-take-two-years-app-developer.html.

“Rohingya Genocide.” Wikipedia, Wikimedia Foundation, 11 Mar. 2021, en.wikipedia.org/wiki/Rohingya_genocide.

“Unicode in, Zawgyi out: Modernity Finally Catches up in Myanmar's Digital World.” The Japan Times, web.archive.org/web/20200822105601if_/www.japantimes.co.jp/news/2019/09/27/business/tech/unicode-in-zawgyi-out-myanmar/#.X0D55bT7Q8M.

“Unified under One Font System as Myanmar Prepares to Migrate from Zawgyi to Unicode.” Rising Voices, 6 Sept. 2019, rising.globalvoices.org/blog/2019/09/06/unified-under-one-font-system-as-myanmar-prepares-to-migrate-from-zawgyi-to-unicode/.

Watkins, Justin. Why We Should Stop Zawgyi in Its Tracks. It Harms Others and Ourselves. Use Unicode! SOAS University of London, 1 Aug. 2016, www.themimu.info/sites/themimu.info/files/documents/Presentation_Why_Stop_Zawgyi_Use_Unicode_Phandeeyar_Aug2016.pdf.
