The type used for simple JavaScript strings.
JavaScript strings expose characters as UCS2 code units. This is a fixed-size encoding that supports the Unicode
codepoints from U+000000 to U+00FFFF (the Basic Multilingual Plane, or BMP). Displaying larger codepoints is
a property of the environment, based on UTF-16 surrogate pairs. Unicode does not, and will never, assign
characters to the codepoints from U+00D800 to U+00DFFF. These spare codepoints allow UTF-16 to combine
code units from 0xd800 to 0xdfff in pairs (called surrogate pairs) to represent codepoints from the supplementary planes.
This transformation happens during the transition from code units to codepoints in UTF-16.
In UCS2, the code units from 0xd800 to 0xdfff directly produce codepoints in the range from U+00D800 to
U+00DFFF. The display may then merge these codepoints into higher codepoints during rendering.
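The UTF-16 transformation from a surrogate pair to a supplementary-plane codepoint can be sketched directly. This is a minimal illustration of the standard algorithm; the function name is illustrative and not part of this type's API:

```javascript
// A high surrogate H in [0xd800, 0xdbff] and a low surrogate L in [0xdc00, 0xdfff]
// combine into the codepoint 0x10000 + ((H - 0xd800) << 10) + (L - 0xdc00).
function combineSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

console.log(combineSurrogatePair(0xd834, 0xdd1e).toString(16)); // "1d11e" (U+1D11E)
```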
Let's take an example (all the numbers are in hexadecimal):
                                          +----+----+----+----+----+----+
Bytes                                     | 00 | 41 | d8 | 34 | dd | 1e |
                                          +----+----+----+----+----+----+
UTF-16BE code units                       | 0x0041  | 0xd834  | 0xdd1e  |
                                          +---------+---------+---------+
Codepoints (from UTF-16BE)                |  U+41   |      U+01D11E     |
                                          +---------+-------------------+
Displayed (from UTF-16BE)                 |    A    |         𝄞         |
                                          +---------+---------+---------+
UCS2 code units                           | 0x0041  | 0xd834  | 0xdd1e  |
                                          +---------+---------+---------+
Codepoints (from UCS2BE)                  |  U+41   | U+D834  | U+DD1E  |  <- This is what JavaScript sees
                                          +---------+---------+---------+
Displayed (from UCS2BE)                   |    A    |    �    |    �    |  <- This is what the user may see
                                          +---------+---------+---------+
Displayed (from UCS2BE with surrogates)   |    A    |         𝄞         |  <- This is what the user may see
                                          +---------+-------------------+
The most important takeaway is that codepoints outside of the BMP are a property of the display, not of
the JavaScript string.
This is the cause of multiple issues:
- Surrogate halves are exposed as distinct characters: "𝄞".length === 2
- Unmatched surrogate halves are allowed: "\ud834"
- Surrogate pairs in the wrong order are allowed: "\udd1e\ud834"
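Each of these issues is directly observable in any JavaScript engine:

```javascript
console.log("𝄞".length);         // 2: the surrogate halves count as two characters
console.log("𝄞".charCodeAt(0));  // 55348 (0xd834): the high surrogate half alone
console.log("\ud834".length);     // 1: an unmatched surrogate half is a valid string
console.log("\udd1e\ud834".length); // 2: a reversed pair is also a valid string
```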
If you need to support the full Unicode range by manipulating codepoints instead of UCS2 character codes, you may
want to use CodepointString or CodepointArray instead of Ucs2String.
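If you only need to walk a string codepoint by codepoint, note that the standard string iterator is already codepoint-aware and groups surrogate pairs for you. This sketch uses only built-in APIs and is independent of this library's types:

```javascript
// Array.from iterates by codepoint, so a surrogate pair yields a single entry.
const codepoints = Array.from("A𝄞", (c) => c.codePointAt(0));
console.log(codepoints.map((cp) => cp.toString(16))); // [ "41", "1d11e" ]
console.log("A𝄞".length); // 3: the same string is three UCS2 code units
```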
PS: This type does not deal with Unicode normalization either. Use CodepointString and CodepointArray if you need
it.