
It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019)

DavidPiper

I think that string length is one of those things that people (including me) don't realise they never actually want. In a production system, I have never actually wanted string length. I have wanted:

- Number of bytes this will be stored as in the DB

- Number of monospaced font character blocks this string will take up on the screen

- Number of bytes that are actually being stored in memory

"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.

arcticbull

Taking this one step further -- there's no such thing as the context-free length of a string.

Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.

Refining your list, the things you usually want are:

- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).

- Number of code points when parsing.

- Number of grapheme clusters for advancing the cursor back and forth when editing.

- Bounding box in pixels or points for display with a given font.

Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.

It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
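For illustration, here's roughly how those different answers come apart for the title's emoji, sketched in Python (the grapheme-cluster count needs a UAX #29 segmentation library on top, since the stdlib doesn't ship one):

  s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # facepalm + skin tone + ZWJ + male sign + VS16
  len(s.encode("utf-8"))           # 17 bytes when serialized as UTF-8
  len(s.encode("utf-16-le")) // 2  # 7 UTF-16 code units (what JavaScript's .length reports)
  len(s)                           # 5 code points
  # grapheme clusters: 1 -- and the bounding box depends on the font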

ramses0

"Unicode is JPG for ASCII" is an incredibly great metaphor.

size(JPG) == bytes? sectors? colors? width? height? pixels? inches? dpi?

account42

> Number of code points when parsing.

You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.

josephg

It’s a bit of a niche use case, but I use the codepoint counts in CRDTs for collaborative text editing.

Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there are a whole lot of invalid states you can represent easily. E.g. maybe inserting “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.

Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
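To make the asymmetry concrete, a minimal Python sketch (not taken from any particular CRDT implementation): a code point index can be validated against a single integer, while a UTF-8 byte offset also has to be checked against the bytes so it doesn't land inside a multi-byte sequence.

  def codepoint_insert_ok(doc_len: int, index: int) -> bool:
      # Any integer up to the document's code point length is a valid position.
      return 0 <= index <= doc_len

  def utf8_insert_ok(doc: bytes, offset: int) -> bool:
      # Valid only if it doesn't split a code point: UTF-8 continuation
      # bytes have the bit pattern 10xxxxxx.
      if not 0 <= offset <= len(doc):
          return False
      return offset == len(doc) or (doc[offset] & 0xC0) != 0x80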

torstenvl

I really wish people would stop giving this bad advice, especially so stridently.

Like it or not, code points are how Unicode works. Telling people to ignore code points is telling people to ignore how data works. It's of the same philosophy that results in abstraction built on abstraction built on abstraction, with no understanding.

I vehemently dissent from this view.

null

[deleted]

baq

ASCII is very convenient when it fits in the solution space (it’d better be, it was designed for a reason), but in the global international connected computing world it doesn’t fit at all. The problem is all the tutorials, especially low level ones, assume ASCII so 1) you can print something to the console and 2) to avoid mentioning that strings are hard so folks don’t get discouraged.

Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.

craftkiller

> Notably Rust did the correct thing

In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:

  String.len() == number of bytes
  String.bytes().count() == number of bytes
  String.chars().count() == number of unicode scalar values
  String.graphemes().count() == number of graphemes (requires unicode-segmentation which is not in the stdlib)
  String.lines().count() == number of lines
Really my only complaint is I don't think String.len() should exist, it's too ambiguous. We should have to explicitly state what we want/mean via the iterators.

pron

Similar to Java:

   String.chars().count(), String.codePoints().count(), and, for historical reasons, String.getBytes(UTF-8).length

westurner

  String.graphemes().count()
That's a real nice API. (Similarly, Python has @ for matmul but there is no matmul implementation in the stdlib; NumPy provides one so that the `@` operator works.)

ugrapheme and ucwidth are one way to get the grapheme count from a string in Python.

It's probably possible to get the grapheme cluster count from a string containing emoji characters with ICU?
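One low-dependency route (a sketch assuming the third-party regex package, not the stdlib re module) is its \X pattern, which matches extended grapheme clusters; with a recent enough Unicode database the whole emoji ZWJ sequence comes back as a single cluster:

  import regex  # third-party: pip install regex

  s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
  print(len(s))                        # 5 code points
  print(len(regex.findall(r"\X", s)))  # 1 extended grapheme cluster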

bigstrat2003

> in the global international connected computing world it doesn’t fit at all.

Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to.

arp242

No one is "obsessing" over anything. Reality is there are very few cases where you can use a single 8-bit character set and not run in to problems sooner or later. Say your software is used only in Greece so you use ISO-8859-7 for Greek. That works fine, but now you want to talk to your customer Günther from Germany who has been living in Greece for the last five years, or Clément from France, or Seán from Ireland and oops, you can't.

Even plain English text can't be represented with plain ASCII (although ISO-8859-1 goes a long way).

There are some cases where just plain ASCII is okay, but there are quite few of them (and even those are somewhat controversial).

The solution is to just use UTF-8 everywhere. Or maybe UTF-16 if you really have to.

rileymat2

Except, this is a response to emoji support, which does have encoding issues even if your user base is in the US and only speaks English. Additionally, it is easy to have issues with data that your users use from other sources via copy and paste.

wat10000

Which audience makes it so you don’t have to worry about text encodings?

raverbashing

This is naive at best

Here's a better analogy: in the 70s "nobody planned" for names with 's in them. SQL injections, separators, "not in the alphabet", whatever. In the US. Where a lot of people with 's in their names live... Or double-barrelled names.

It's a much simpler problem and it still tripped up a lot of people.

And then you have to support a user with a "funny name" or a business with "weird characters", or you expand your startup to Canada/Mexico and lo and behold...

account42

> in the global international connected computing world it doesn’t fit at all.

I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code and many downsides to allowing non-ASCII characters outside string constants or comments.

simonask

This is American imperialism at its worst. I'm serious.

Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.

Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?

It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.

flohofwoe

ASCII is totally fine as encoding for the lower 127 UNICODE code points. If you need to go above those 127 code points, use a different encoding like UTF-8.

Just never ever use Extended ASCII (8-bits with codepages).

null

[deleted]

eru

Python 3 deals with this reasonably sensibly, too, I think. They use UTF-8 by default, but allow you to specify other encodings.

ynik

Python 3 internally uses UTF-32. When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8. "UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).
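A small sketch of where that default leaks in (the file name is just a placeholder): locale.getpreferredencoding() is what open() falls back to when no encoding is given, and passing encoding= explicitly sidesteps the system dependence.

  import locale, sys

  print(locale.getpreferredencoding(False))  # open()'s fallback; locale/OS dependent
  print(sys.flags.utf8_mode)                 # 1 when UTF-8 mode (-X utf8 / PYTHONUTF8=1) is on

  with open("notes.txt", "w", encoding="utf-8") as f:  # explicit encoding is portable
      f.write("\u00E9")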

xigoi

I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.

jlarocco

It's definitely worth thinking about the real problem, but I wouldn't say it's never helpful.

The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.
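As a rough Python sketch of that split (names follow the comment, reading "characters" as code points and "combined" as grapheme clusters; the cluster count leans on the third-party regex package since the stdlib has no segmentation support):

  import regex  # third-party, provides \X for extended grapheme clusters

  def length_in_bytes(s: str, encoding: str = "utf-8") -> int:
      return len(s.encode(encoding))

  def length_in_characters(s: str) -> int:
      return len(s)  # code points, in Python's string model

  def length_combined(s: str) -> int:
      return len(regex.findall(r"\X", s))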

xelxebar

> Number of monospaced font character blocks this string will take up on the screen

Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.

But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.

capitainenemo

FWIW, the cheap lazy way to get "number of bytes in DB" from JS is unescape(encodeURIComponent("ə̀")).length

xg15

It gets more complicated if you do substring operations.

If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.

arcticbull

Substring operations (and more generally the universe of operations where there is more than one string involved) are a whole other kettle of fish. Unicode, being a byte code format more than what you think of as a logical 'string' format, has multiple ways of representing the same strings.

If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
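A small Python illustration of that distinction: the two spellings of "café" below differ code point by code point, but are canonically equivalent once normalized.

  import unicodedata

  composed = "caf\u00E9"     # é as a single code point
  decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT

  print(composed == decomposed)                    # False: bitwise/code point comparison
  print(unicodedata.normalize("NFC", composed) ==
        unicodedata.normalize("NFC", decomposed))  # True: logical (canonical) equivalence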

jibal

"Unicode, being a byte code format"

UTF-8 is a byte code format; Unicode is not. In Python, where all strings are arrays of Unicode code points, substrings are likewise arrays of Unicode code points.

setr

I’m fairly positive the answer is trivially logical equivalence for pretty much any substring operation. I can’t imagine bitwise equivalence to ever be the “normal” use case, except to the implementer looking at it as a simpler/faster operation

I feel like if you’re looking for bitwise equivalence or similar, you should have to cast to some kind of byte array and access the corresponding operations accordingly

account42

> s.charAt(x) or s.codePointAt(x)

Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.

mseepgood

The values for x and y shouldn't come from your brain, though (with the exception of 0). They should come from previous index operations like s.indexOf(...) or s.search(regex), etc.

xg15

Indeed. Or s.length, whatever that represents.

bigstrat2003

I have never wanted any of the things you said. I have, on the other hand, always wanted the string length. I'm not saying that we shouldn't have methods like what you state, we should! But your statement that people don't actually want string length is untrue because it's overly broad.

wredcoll

Which length? Bytes? Code points? Graphemes? Pixels?

zahlman

> I have, on the other hand, always wanted the string length.

In an environment that supports advanced Unicode features, what exactly do you do with the string length?

guappa

What if you need to find 5 letter words to play wordle? Why do you care how many bytes they occupy or how large they are on screen?

xigoi

In the case of Wordle, you know the exact set of letters you’re going to be using, which easily determines how to compute length.

guappa

No no, I want to create tomorrow's puzzle.

taneq

If you're playing at this level, you need to define:

- letter

- word

- 5 :P

guappa

Eh, in Macedonian they have some letters that in Russian are just two separate letters.

dang

Related. Others? (Also, anybody know the answer to https://news.ycombinator.com/item?id=44987514?)

It’s not wrong that " ".length == 7 (2019) - https://news.ycombinator.com/item?id=36159443 - June 2023 (303 comments)

String length functions for single emoji characters evaluate to greater than 1 - https://news.ycombinator.com/item?id=26591373 - March 2021 (127 comments)

String Lengths in Unicode - https://news.ycombinator.com/item?id=20914184 - Sept 2019 (140 comments)

bstsb

ironic that unicode is stripped out of the post's title here, making it very much wrong ;)

for context, the actual post features an emoji with multiple unicode codepoints in between the quotes

dang

Ok, we've put Man Facepalming with Light Skin Tone back up there. I failed to find a way to avoid it.

Is there a way to represent this string with escaped codepoints? It would be both amusing and in HN's plaintext spirit to do it that way in the title above, but my Unicode is weak.

NobodyNada

That would be "\U0001F926\U0001F3FC\u200D\u2642\uFE0F" in Python's syntax, or "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}" in Rust or JavaScript.

Might be a little long for a title :)

dang

Thanks! Your second option is almost identical to Mlller's (https://news.ycombinator.com/item?id=44988801) but the extra curly braces make it not fit. Seems like they're droppable for characters below U+FFFF, so I've squeezed it in above.

Mlller

That would be …

  "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7
… for Javascript.

dang

I can actually fit that within HN's 80 char limit without having to drop the "(2019)" bit at the end, so let's give it a try and see what happens... thanks!

cmeacham98

Funny enough I clicked on the post wondering how it could possibly be that a single space was length 7.

ale42

Maybe it isn't a space, but a list of invisible Unicode chars...

yread

It could also be a byte length of a 3 byte UTF-8 BOM and then some stupid space character like f09d85b3

robin_reala

It’s U+0020, a standard space character.

c12

I did exactly the same, thinking that maybe it was invisible unicode characters or something I didn't know about.

timeon

Unintentional click-bait.

eastbound

It can be many Zero-Width Spaces, or a few Hair-Width Spaces.

You never know, when you don’t know CSS and try to align your pixels with spaces. Some programmers should start a trend where 1 tab = 3 hairline-width spaces (smaller than 1 char width).

Next up: The <half-br/> tag.

Moru

You laugh but my typewriter could do half-br 40 years ago. Was used for typing super/subscript.

null

[deleted]

zahlman

There's an awful lot of text in here but I'm not seeing a coherent argument that Python's approach is the worst, despite the author's assertion.

It especially makes no sense to me that counting the characters the implementation actually uses should be worse than counting UTF-16 code units, for an implementation that doesn't use surrogate pairs (and in fact only uses those code units to store out-of-band data via the "surrogateescape" error handler, or explicitly requested characters. N.B.: Lone surrogates are still valid characters, even though a sequence containing them is not a valid string.)

JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.
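A quick way to see that space optimization at work (exact byte counts vary by interpreter version and build): the per-character storage width grows with the widest code point in the string, but the length stays the same.

  import sys

  for s in ["aaaa", "\u00E9" * 4, "\U0001F926" * 4]:
      print(len(s), sys.getsizeof(s))  # same length, progressively wider storage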

estimator7292

Stuff like this makes me so glad that in my world strings are ALWAYS ASCII and one char is always one byte. Unicode simply doesn't exist and all string manipulation can be done with a straightforward for loop or whatever.

Dealing with wide strings sounds like hell to me. Right up there with timezones. I'm perfectly happy with plain C in the embedded world.

RcouF1uZ4gsC

That English can be well represented with ASCII may have contributed to America becoming an early computing powerhouse. You could actually do things like processing and sorting and doing case-insensitive comparisons on data like names and addresses very cheaply.

xg15

The article both argues that the "real" length from a user perspective is Extended Grapheme Clusters - and makes a case against using it, because it requires you to store the entire character database and may also change from one Unicode version to the next.

Therefore, people should use codepoints for things like length limits or database indexes.

But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?

If a newer Unicode version suddenly defines some sequences to be a single grapheme cluster where there were several ones before and my database index now suddenly points to the middle of that cluster, what would I do?

Seems to me, the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?

re

What do you mean by "use codepoints for ... database indexes"? I feel like you are drawing conclusions that the essay does not propose or support. (It doesn't say that you should use codepoints for length limits.)

> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?

Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?

xg15

I was referring to this part, in "Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?":

"For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into a Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.

You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."

You're right it doesn't say "codepoints" as an alternative solution. That was just my assumption as it would be the closest representation that does not depend on the character database.

But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.

> Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?

Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.

chrismorgan

> it doesn't say "codepoints" as an alternative solution. That was just my assumption …

On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8 and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)

> The problem will be the same if you have to reconstruct the grapheme clusters eventually.

In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing. Because that’s pretty much the only place EGCs are ever actually relevant.

> You don't want that if you e.g. have an index for fulltext search.

Your text search won’t be splitting by grapheme clusters, it’ll be doing word segmentation instead.

pron

In Java,

    " ".codePoints().count()
    ==> 5

    " ".chars().count()
    ==> 7

    " ".getBytes(UTF_8).length
    ==> 17

(HN doesn't render the emoji in comments, it seems)

osener

Python does an exceptionally bad job. After dragging the community through a 15-year transition to Python 3 in order to "fix" Unicode, we ended up with support that's worse than in languages that simply treat strings as raw bytes.

Some other fun examples: https://gist.github.com/ozanmakes/0624e805a13d2cebedfc81ea84...

mid-kid

Yeah I have no idea what is wrong with that. Python simply operates on arrays of codepoints, which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding. This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.

deathanatos

> which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding.

Which, to humor the parent, is also true of raw byte strings. One of the (valid) points raised by the gist is that `str` is not infallibly encodable to UTF-8, since it can contain values that are not valid Unicode.

> This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.

If I write,

  def foo(s: str) -> …:
… I want the input string to be Unicode. If I need "Unicode, or maybe with bullshit mixed in", that can be a different type, and then I can take

  def foo(s: UnicodeWithBullshit) -> …:

acuozzo

> Python simply operates on arrays of codepoints

But most programmers think in arrays of grapheme clusters, whether they know it or not.

zahlman

No, I'm not standing for that.

Python does it correctly and the results in that gist are expected. Characters are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogate characters is a feature: it would take extra time to validate this when it only really matters at encoding time. It also empowers the "surrogateescape" error handler, that in turn makes it possible to supply arbitrary bytes in command line arguments, even while providing strings to your program which make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps the invalid bytes to invalid unpaired surrogates.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
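For example, a minimal sketch of the surrogateescape round trip described above:

  raw = b"abc\xff"                                   # 0xFF can never appear in valid UTF-8
  s = raw.decode("utf-8", errors="surrogateescape")
  print(ascii(s))                                    # 'abc\udcff' -- a lone surrogate marks the bad byte
  print(s.encode("utf-8", errors="surrogateescape") == raw)  # True: the original bytes come back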

It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built-in and trivially created using a `b'...'` literal, or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as a sequence of byte (i.e., integer ranging 0..255 inclusive) values; when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.

Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and no longer get `UnicodeDecodeError` from attempting an encoding operation or vice-versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can actually use non-English characters in your code comments (and even identifier names, with restrictions) now and have it just work without declaring an encoding (since your text editor now almost certainly assumes that encoding in 2025). This also made it possible to easily read text files as text in any declared encoding, and get strings as a result, while also having universal newline mode work, and all without needing to reach for `io` or `codecs` standard libraries.

The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.

chrismorgan

Previous discussions:

https://news.ycombinator.com/item?id=36159443 (June 2023, 280 points, 303 comments; title got reemojied!)

https://news.ycombinator.com/item?id=26591373 (March 2021, 116 points, 127 comments)

https://news.ycombinator.com/item?id=20914184 (September 2019, 230 points, 140 comments)

I’m guessing this got posted by one who saw my comment https://news.ycombinator.com/item?id=44976046 today, though coincidence is possible. (Previous mention of the URL was 7 months ago.)

program

I did post this. I found it by chance, coming from this other post https://tonsky.me/blog/unicode/

kazinator

Why would I want this to be 17, if I'm representing strings as array of code points, rather than UTF-8?

TXR Lisp:

  1> (len " ")
  5
  2> (coded-length " ")
  17
(Trust me when I say that the emoji was there when I edited the comment.)

The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.

voidmain

I haven't thought about this deeply, but it seems to me that the evolution of unicode has left it unparseable (into extended grapheme clusters, which I guess are "characters") in a forwards compatible way. If so, it seems like we need a new encoding which actually delimits these (just as utf-8 delimits code points). Then the original sender determines what is a grapheme, and if they don't know, who does?

null

[deleted]