Why do we need Unicode?

MOHD RAZA
2 min read · Mar 1, 2022

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today’s strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn’t break any old browsers).

But for argument’s sake, let’s say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but it is not fine for Joe the software developer. Approximately half the world uses non-Latin characters, so using only ASCII is arguably inconsiderate to these people. On top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and the first 128 of those are identical to ASCII. In addition, the vast majority of commonly used characters have code points that fit in two bytes, in a region called the Basic Multilingual Plane (BMP), which spans U+0000 to U+FFFF. A character set alone does not say how those numbers are stored as bytes, so a character encoding is also needed. I will concentrate on two of them: UTF-8 and UTF-16.
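To make the idea of a code point concrete, here is a small Python sketch. The sample characters and the helper name `in_bmp` are my own picks for illustration, not part of any standard API; the only assumption is that the BMP covers code points up to U+FFFF.

```python
# Code points are plain integers: ord() gives the number for a character,
# chr() goes the other way.
samples = ["A", "é", "€", "汉", "𤭢"]  # illustrative characters only


def in_bmp(ch: str) -> bool:
    """Return True if the character's code point lies in the BMP (U+0000..U+FFFF)."""
    return ord(ch) <= 0xFFFF


for ch in samples:
    print(f"{ch!r}: U+{ord(ch):04X}, in BMP: {in_bmp(ch)}")

# Check the compatibility claims: decoding each byte value with the legacy
# codec should give the character with the same code point number.
latin1_matches = all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))
ascii_matches = all(bytes([i]).decode("ascii") == chr(i) for i in range(128))
print("First 256 code points match ISO-8859-1:", latin1_matches)
print("First 128 code points match ASCII:", ascii_matches)
```

Note that nothing here says how many bytes a character takes on disk. That depends on the encoding, not on the code point itself.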

  • UTF-8
      • a character encoding capable of encoding every Unicode code point
      • the code unit is 8 bits
      • uses one to four code units per code point
      • 00100100 for “$” (one code unit); 11000010 10100010 for “¢” (two code units); 11100010 10000010 10101100 for “€” (three code units)
  • UTF-16
      • another character encoding, also capable of encoding every Unicode code point
      • the code unit is 16 bits
      • uses one or two code units per code point
      • 00000000 00100100 for “$” (one code unit); 11011000 01010010 11011111 01100010 for “𤭢” (two code units)
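The bit patterns in the list above can be reproduced with Python's built-in codecs. This is only a sketch: the helper names `bits`, `utf8_bits`, `utf16_bits` and `code_units` are made up for this example, and I am assuming big-endian byte order for UTF-16 (the `utf-16-be` codec), since that is the order the bits are written in above.

```python
def bits(data: bytes) -> str:
    """Render bytes as space-separated groups of 8 binary digits."""
    return " ".join(f"{b:08b}" for b in data)


def utf8_bits(ch: str) -> str:
    return bits(ch.encode("utf-8"))


def utf16_bits(ch: str) -> str:
    # "utf-16-be" is big-endian and writes no byte order mark.
    return bits(ch.encode("utf-16-be"))


def code_units(ch: str) -> tuple[int, int]:
    """Return (number of UTF-8 code units, number of UTF-16 code units)."""
    utf8_units = len(ch.encode("utf-8"))            # one unit = 1 byte
    utf16_units = len(ch.encode("utf-16-be")) // 2  # one unit = 2 bytes
    return utf8_units, utf16_units


for ch in ["$", "¢", "€", "𤭢"]:
    u8, u16 = code_units(ch)
    print(f"{ch}  UTF-8  ({u8} unit(s)): {utf8_bits(ch)}")
    print(f"{ch}  UTF-16 ({u16} unit(s)): {utf16_bits(ch)}")
```

The same character can cost a different number of bytes in each encoding: “$” is one byte in UTF-8 but two in UTF-16, while “€” is three bytes in UTF-8 but two in UTF-16.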

Credit: Peter Mortensen, DPenner1
