Explore the UTF-16 character encoding standard—how it works, its variable-length scheme, surrogate pairs, byte order, and use cases in programming and SMS. Learn why UTF-16 is essential for representing all Unicode characters.
Understanding UTF-16 Character Encoding
UTF-16 (16-bit Unicode Transformation Format) is a widely used character encoding capable of representing every character in the Unicode standard, which includes over 1.1 million code points [1][7]. It is a variable-length encoding that uses one or two 16-bit code units per character, allowing it to cover all modern and historic scripts, emojis, and symbols.
Encoding Scheme
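- Basic Multilingual Plane (BMP): Characters with code points from U+0000 to U+FFFF are encoded using a single 16-bit unit (2 bytes). The range U+D800 to U+DFFF inside this plane is reserved for surrogates and does not encode characters on its own. The BMP covers most commonly used characters across many languages [1][6].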
- Supplementary Characters: Characters outside the BMP, ranging from U+10000 to U+10FFFF (including many emojis and historic scripts), are encoded using surrogate pairs. These pairs consist of two 16-bit units (totaling 4 bytes), where the first is a high surrogate (U+D800 to U+DBFF) and the second is a low surrogate (U+DC00 to U+DFFF) [1][7].
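The two cases above can be sketched in a few lines of Python. This is a minimal illustration rather than a production encoder; the function name `to_utf16_code_units` is made up for this example. It relies on the standard surrogate arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves.

```python
def to_utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (as integers) for one Unicode code point.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded on their own")
    if code_point <= 0xFFFF:
        # BMP character: a single 16-bit code unit
        return [code_point]
    # Supplementary character: split the 20-bit offset into two 10-bit halves
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high surrogate, U+D800 to U+DBFF
    low = 0xDC00 + (offset & 0x3FF)   # low surrogate, U+DC00 to U+DFFF
    return [high, low]


# U+1F600 (grinning face emoji) needs a surrogate pair
print([hex(u) for u in to_utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The result can be cross-checked against Python's built-in codec: `"\U0001F600".encode("utf-16-be")` yields the same two units as the bytes D8 3D DE 00.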
Variable Length Nature
UTF-16 is variable-length because it uses either one or two 16-bit units depending on the character. Most characters fit within a single 16-bit unit, but supplementary characters require two units, making UTF-16 efficient for most use cases while still supporting the full Unicode range [1][2].
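One practical consequence is that the number of characters in a string and the number of UTF-16 code units can differ. The sketch below assumes Python 3, where `len()` counts code points; the helper name `utf16_length` is invented for this example. It counts code units by encoding without a BOM and dividing the byte count by two.

```python
def utf16_length(text: str) -> int:
    """Count the 16-bit code units needed to store text in UTF-16.

    Hypothetical helper for illustration. "utf-16-le" is used so that
    no byte order mark is added to the output.
    """
    return len(text.encode("utf-16-le")) // 2


for sample in ["A", "\u20ac", "\U0001F600"]:
    # len() counts code points; utf16_length() counts 16-bit units
    print(repr(sample), len(sample), utf16_length(sample))
```

The Latin letter and the euro sign each take one unit, while the emoji counts as one code point but two units.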
Byte Order and Byte Order Mark (BOM)
UTF-16 can be serialized in either big-endian or little-endian byte order, often matching the system architecture [3]. To indicate the byte order of UTF-16 encoded text, a special marker called the Byte Order Mark (BOM), the code point U+FEFF, can be placed at the beginning of the text. Its byte sequence reveals the order:
- FE FF indicates big-endian
- FF FE indicates little-endian
This BOM helps software correctly interpret the byte sequence [3].
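BOM detection can be sketched with the constants in Python's standard `codecs` module. The function name `detect_utf16_byte_order` is hypothetical, and the check is deliberately simplified: it only looks at the first two bytes, so it would also match input that begins with a UTF-32 little-endian BOM (FF FE 00 00).

```python
import codecs
from typing import Optional


def detect_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the byte order signalled by a leading UTF-16 BOM, if any.

    Hypothetical helper for illustration; returns None when no BOM is found.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return None


big = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(detect_utf16_byte_order(big))      # big-endian
print(detect_utf16_byte_order(little))   # little-endian

# The generic "utf-16" codec reads the BOM and strips it while decoding
print(big.decode("utf-16"), little.decode("utf-16"))  # A A
```

When no BOM is present, the byte order has to come from elsewhere, such as a protocol field or an explicit codec name like `utf-16-le`.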
Use Cases of UTF-16
UTF-16 is prominently used in:
- Programming languages such as Java and C#, where strings are typically UTF-16 encoded [1][6].
- Operating systems like Microsoft Windows, which use UTF-16 as the native encoding for text [1].
- Some file formats and APIs that require support for a wide range of characters beyond the BMP [1].
Impact on SMS Messaging
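In SMS communication, when a message contains characters outside the GSM-7 character set (such as emojis or many non-Latin scripts), the message must be encoded using UCS-2, a fixed-length encoding that covers only the BMP, or UTF-16. Each 16-bit code unit occupies 2 bytes of the 140-byte SMS payload, reducing the maximum length of a single SMS to 70 code units (or 67 per part for concatenated messages, where a header takes up part of the payload), compared to 160 characters (153 per concatenated part) in GSM-7 encoding. Characters outside the BMP, such as most emojis, need a surrogate pair and therefore count as two units toward these limits.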
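The part limits above translate into a simple calculation. The following is a rough sketch, assuming the message has already been found to need 16-bit encoding (checking for GSM-7 compatibility is out of scope here). The function name `ucs2_sms_parts` and the constants are made up for this example, and the sketch ignores one refinement: real gateways generally avoid splitting a surrogate pair across two parts, which can occasionally require an extra part.

```python
import math

SINGLE_PART_UNITS = 70   # 16-bit units in a single, non-concatenated SMS
MULTI_PART_UNITS = 67    # 16-bit units per part of a concatenated SMS


def ucs2_sms_parts(text: str) -> int:
    """Estimate how many SMS parts a 16-bit encoded message needs.

    Hypothetical helper for illustration; counts UTF-16 code units, so
    each character outside the BMP contributes two units.
    """
    units = len(text.encode("utf-16-be")) // 2
    if units <= SINGLE_PART_UNITS:
        return 1
    return math.ceil(units / MULTI_PART_UNITS)


print(ucs2_sms_parts("\u0dc3" * 70))       # 1 (70 BMP characters fit in one part)
print(ucs2_sms_parts("\u0dc3" * 71))       # 2
print(ucs2_sms_parts("\U0001F600" * 36))   # 2 (36 emojis are 72 code units)
```

Counting code units rather than characters is the key design choice here, since 35 emojis already fill a single 70-unit message.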
Summary
UTF-16 is a versatile and widely adopted Unicode encoding that balances efficient storage and comprehensive character support. By encoding most characters in 2 bytes and supplementary characters in 4 bytes via surrogate pairs, UTF-16 enables software and systems to handle the full range of human languages and symbols, including emojis and historic scripts. Its use of byte order marks ensures compatibility across different hardware architectures, making it a robust choice for many programming environments and communication protocols.
This article provides a detailed overview of UTF-16 encoding for developers, linguists, and tech enthusiasts seeking to understand how modern text encoding works under the hood. For more insights on character encoding and Unicode standards, explore related topics on text.lk.
Check sources
1. https://en.wikipedia.org/wiki/UTF-16
2. https://stackoverflow.com/questions/2241348/what-are-unicode-utf-8-and-utf-16
3. https://lokalise.com/blog/what-is-character-encoding-exploring-unicode-utf-8-ascii-and-more/
4. https://www.youtube.com/watch?v=QCEqpd807z4
5. https://www.cnblogs.com/cbscan/articles/4123251.html
6. https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding-introduction
7. https://bsg.world/glossary/16-bit-unicode/
8. https://ssojet.com/character-encoding-decoding/utf-16-in-c/