The article has good tips, but Unicode normalization is just the tip of the iceberg. It is almost always impossible to do what your users expect without locale information (different languages and locales sort and compare the same graphemes differently). "What do we mean when we say two strings are equal?" can be a surprisingly difficult question to answer, and it's a practical question, not a philosophical one.
By the way, try looking up the standardized Unicode casefolding algorithm sometime, it is a thing to behold.
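For a rough sketch of how slippery "equal" gets in Python alone: `str.casefold()` applies Unicode full case folding, `unicodedata.normalize()` handles the composed-vs-decomposed axis, and even then you haven't touched locale-aware collation (for that you'd need something like PyICU). A minimal sketch:

```python
import unicodedata

# Case folding: "ß" full-casefolds to "ss", so a casefold() comparison
# treats "Straße" and "strasse" as equal even though lower() does not.
print("Straße".lower() == "strasse".lower())        # False
print("Straße".casefold() == "strasse".casefold())  # True

# Normalization is a separate axis: "é" can be one codepoint or two.
a = "caf\u00e9"    # precomposed é
b = "cafe\u0301"   # "e" followed by a combining acute accent
print(a == b)                                        # False
print(unicodedata.normalize("NFC", a)
      == unicodedata.normalize("NFC", b))            # True
```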
the normalization doc is interesting too imo: https://unicode.org/reports/tr15/
in particular, the differences between NFC and NFKC are "fun", and rather meaningful in many cases. e.g. NFC says that "fi" and "ﬁ" are different and not equal, though the latter is just a ligature of the former and is literally identical in meaning. this applies to the "ﬃ" ligature too. halfwidth vs fullwidth forms (common in CJK text) are also "different" under NFC. NFKC makes those examples equal though... at the cost of saying "2⁵" is equal to "25".
language is fun!
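A quick illustration with the standard library's `unicodedata` (the fullwidth example here uses Latin letters rather than Han characters, but the principle is the same):

```python
import unicodedata

fi_ligature = "\ufb01"   # "ﬁ", LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", fi_ligature) == "fi")   # False: NFC keeps the ligature
print(unicodedata.normalize("NFKC", fi_ligature) == "fi")  # True: NFKC decomposes it

print(unicodedata.normalize("NFKC", "2\u2075"))  # "25" -- the superscript 5 is flattened

print(unicodedata.normalize("NFC", "ＡＢＣ"))   # "ＡＢＣ" (fullwidth forms preserved)
print(unicodedata.normalize("NFKC", "ＡＢＣ"))  # "ABC"  (folded to ASCII)
```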
Grapheme count is not a useful number. Even in a monospaced font, you’ll find that the grapheme count doesn’t give you a measurement of width since emoji will usually not be the same width as other characters.
Grapheme count (or rather, grapheme indexing) is necessary for text selection and cursor positioning.
Fortunately you can usually outsource this to a UI toolkit which can do it.
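To make the gap concrete, a small Python sketch (assuming the third-party `regex` and `wcwidth` packages, which stand in here for whatever segmentation and width machinery your toolkit actually uses):

```python
# pip install regex wcwidth
import regex                    # its \X pattern matches extended grapheme clusters
from wcwidth import wcswidth    # approximates terminal cell width

s = "\U0001F642x"               # slightly-smiling-face emoji followed by "x"
print(len(s))                   # 2 codepoints
print(wcswidth(s))              # 3 cells: the emoji is double-width in a monospaced grid

flag = "\U0001F1E8\U0001F1E6"   # 🇨🇦 -- two regional-indicator codepoints, one grapheme
print(flag[:1])                 # naive codepoint slicing cuts the flag in half
print(regex.findall(r"\X", flag))  # ['🇨🇦'] -- the unit a cursor or selection should move by
```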
Most of the UI toolkits defer to font-based layout calculation engines rather than grapheme counts. Grapheme counts are a handy approximation in many cases, but you aren't going to get truly accurate text selection or cursor positions without real geometry from a font and whatever layout engine is closest to it.
(Fonts may disagree on supported ligatures, for instance, or not support an emoji grapheme cluster and fall back to displaying its component emoji separately, or lay out multiple graphemes in two dimensions for a given script [CJK, Arabic, even math symbols, even before factoring in whether a font layout engine supports the optional but cool UnicodeMath [0]], or apply any number of other tweaks that distinguish encoding a block of text from displaying a block of text.)
[0] https://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.1....
For certain use-cases, but it's not like any of the other usual notions of text length are any better for what you want.
If all possible notions of length are footguns, maybe there should be no default "length" operation available.
Indeed. Several languages have debated dropping, or have already dropped, an easy-to-access "Length" count on strings, making it much more explicit whether you want "UTF-8 encoded length", "codepoint count", "grapheme count", "grapheme cluster count", or "laid-out font width".
Why endorse a bad winner when you can make more of the trade-offs more obvious and give programmers a better chance of asking for the right information instead of using the wrong information because it is the default and assuming it is correct?
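As a sketch of what "asking for the right information" might look like in Python (grapheme cluster counting goes through the third-party `regex` module here, since the standard library doesn't expose it):

```python
import regex   # third-party; \X matches one extended grapheme cluster

def utf8_encoded_length(s: str) -> int:
    return len(s.encode("utf-8"))

def codepoint_count(s: str) -> int:
    return len(s)   # Python 3's built-in len() counts codepoints

def grapheme_cluster_count(s: str) -> int:
    return len(regex.findall(r"\X", s))

s = "e\u0301\U0001F1E8\U0001F1E6"   # "é" as two codepoints, then a two-codepoint flag
print(utf8_encoded_length(s))       # 11 bytes
print(codepoint_count(s))           # 4 codepoints
print(grapheme_cluster_count(s))    # 2 grapheme clusters
```

Laid-out font width is deliberately missing from the sketch: as discussed above, that needs a font and a layout engine, not a string function.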
Frankly, the key takeaway to most problems people run into with Unicode is that there are very, very few operations that are universally well-defined for arbitrary user-provided text. Pretty much the moment you step outside the realm of "receive, copy, save, regurgitate", you're probably going to run into edge cases.
I’m going to trigger some PTSD with this…
UnicodeDecodeError
Unicode footguns, in Python
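For anyone who has managed to forget the shape of that particular footgun, it's what Python 3 raises the moment bytes that aren't valid in the assumed encoding hit `.decode()`:

```python
data = b"\xff\xfeh\x00i\x00"   # "hi" encoded as UTF-16 with a BOM
try:
    data.decode("utf-8")       # reading a UTF-16 file as if it were UTF-8
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```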
I've said this before and I'll say it again: Python 3 got rid of the wrong string type.
With `bytes` it was obvious that byte length was not the same as $whatever length, and that was really the only semi-common bug (and it was mostly limited to English speakers who were new to programming). All the other bugs came from blindly trusting `unicode`, whose bugs are far more subtle and numerous.
I strongly disagree. Python 2 had no bytes type to get rid of. It had a string type that could not handle code points above U+00FF at all, and could not handle code points above U+007F very well. In addition, Python 2 had a Unicode type, and the types would get automatically converted to each other and/or encoded/decoded, often incorrectly, and sometimes throwing runtime exceptions.
Python 3 introduced the bytes type that you like so much. It sounds like you would enjoy a Python 4 with only a bytes type and no string type, and presumably with a strong convention to only use UTF-8 or with required encoding arguments everywhere.
In both Python 2 and Python 3, you still have to learn how to handle grapheme clusters carefully.
Python 3 didn't get rid of bytes though. If you want to manipulate data as bytes you absolutely can do that.
https://docs.python.org/3/library/stdtypes.html#binary-seque...
"The core built-in types for manipulating binary data are bytes and bytearray."
Those are arrays of integers, not of bytes. Most bytes are character-ish, which only python2's bytes acknowledged.
Additionally, python2 supported a much richer set of operations on its bytes type than python3 does.
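A minimal sketch of the difference being described (the Python 2 behaviour is shown only in a comment, since it won't run under Python 3):

```python
b = b"abc"
print(b[0])                 # 97 -- indexing Python 3 bytes yields an int
print(b[0:1])               # b'a' -- you have to slice to get a one-byte bytes object
print(bytes([104, 105]))    # b'hi' -- and you can build bytes from a list of ints

# Under Python 2, "abc"[0] evaluated to the one-character string "a",
# which is the "character-ish" behaviour the parent comment is pointing at.
```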