Wcwidth and the Unicode Mess

Published: Mon 07 February 2022

In software.

Recently I came across Darren Burns' "Terminal support for emoji", and I feel like I need to say a few things.

Imagine you're writing a shell, and you want to add a right prompt. That is, more prompt text, but on the right of the window.

How would you go about it?

In simple terms, what you need to do is:

  1. Get the prompt text
  2. Figure out how long it is
  3. Tell the terminal to move the cursor to X cells from the right edge
  4. Write the prompt text.

This seems simple enough. You get the text, you get its length, you get the terminal's width [1], and you tell the terminal to move the cursor to column width - length.
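As a rough sketch of those four steps in Python: the function name `right_prompt` is made up for illustration, and the cell width is passed in as a parameter because computing it is the whole problem. The escape sequences used are the common ones for save cursor (ESC 7), move to column (CSI n G) and restore cursor (ESC 8); I'm assuming a terminal that understands them.

```python
import shutil


def right_prompt(text: str, width: int, columns: int) -> str:
    """Build the string that draws `text` flush against the right edge.

    `width` is the number of cells `text` occupies (NOT len(text)),
    `columns` is the terminal width in cells.
    """
    if width > columns:
        # It does not fit; drawing it would wrap, so draw nothing.
        return ""
    # Columns are 1-based, so the last `width` cells start here.
    start_col = columns - width + 1
    # ESC 7 saves the cursor, CSI <n> G moves to column n, ESC 8 restores it.
    return f"\x1b7\x1b[{start_col}G{text}\x1b8"


if __name__ == "__main__":
    cols = shutil.get_terminal_size().columns
    prompt = "12:34"
    # len() only happens to be right here because the text is plain ASCII.
    print(right_prompt(prompt, len(prompt), cols), end="")
```

`shutil.get_terminal_size()` looks at `$COLUMNS` first and asks the terminal otherwise, so it covers both approaches from the footnote.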

If you get this wrong, at best you overcounted the length and end up not writing up to the edge. At worst you undercounted and you end up wrapping the text.

This would result in something like this staircase [2]:

[Image: A screenshot of a terminal featuring a staircase glitch of some text]

But if you got the length right, you would end up with the right prompt up to the right edge, and all would be good.

The only question is: How do you get the "length"?

String lengthn't

You don't actually want the "length". You want the "width". "Length" sounds like it's the number of "characters" or bytes or something, which isn't enough.

I'm going to spare you another explanation of the basics of Unicode. To be overly reductive: The basic unit in Unicode is the "codepoint". That's a thing that means something in the code. Many things you think of as "characters" map to a codepoint in Unicode. "A" is a codepoint, "a" another, so is " " (space). "🌶" is a codepoint named "Hot Pepper" and has the number U+1F336.
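To make the different kinds of "length" concrete, here is a small Python sketch. Python's `len()` counts codepoints; the byte and UTF-16 counts come from encoding the string. None of these three numbers is the width.

```python
pepper = "\U0001F336"  # HOT PEPPER, a single codepoint
text = "a" + pepper

# Python strings are sequences of codepoints, so len() counts codepoints.
codepoints = len(text)

# The same text takes a different number of units depending on the encoding.
utf8_bytes = len(text.encode("utf-8"))
utf16_units = len(text.encode("utf-16-le")) // 2

print(codepoints, utf8_bytes, utf16_units)
```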

Unicode codepoints differ in width, and not just for proportional width fonts (you know, like in a GUI). Even in a terminal, many are _narrow_ (meaning they take up 1 cell), some use up 2 cells and are _wide_, and many aren't printable at all and might have 0 width, or even remove characters (backspace!). And these are the simple cases.

The best definition we have of this is Unicode Standard Annex #11, "East Asian Width" [3] (originally published as Technical Report #11). It is updated and released along with the Unicode standard. There are also data files [4] that contain the necessary information for all the codepoints.
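If you just want to poke at this data, Python's `unicodedata` module exposes the East Asian Width property and the general category. The sketch below is a deliberately naive single-codepoint width function built on those two; the name `codepoint_width` and the exact category rules are my simplification, not what any particular wcwidth does, and the answers depend on the Unicode version bundled with your Python.

```python
import unicodedata


def codepoint_width(ch: str) -> int:
    """Naive width of ONE codepoint in terminal cells.

    Returns -1 for control characters, mirroring the wcwidth convention.
    """
    if ch == "\0":
        return 0
    category = unicodedata.category(ch)
    if category == "Cc":
        # Control characters (backspace, escape, ...) have no sensible width.
        return -1
    if category in ("Mn", "Me", "Cf"):
        # Combining marks and format characters: treated as zero width here.
        return 0
    # "W" (wide) and "F" (fullwidth) take two cells, everything else one.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1
```

For example, `codepoint_width("a")` gives 1 and a CJK ideograph gives 2. Ambiguous-width codepoints are lumped in with narrow here.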

These data files are hard to understand. One point where I believe Darren's post goes wrong is that "emoji presentation sequence" is a fairly specific thing, and doesn't, as best as I can tell, apply to the 🛥 motorboat codepoint on its own. It defaults to text presentation, which means the given width counts.

So what should I do?

There is a function called wcwidth that you can use to determine the width of a codepoint. Darren's blog post points to Markus Kuhn's well-known implementation.

The major problem with this is that it only supports up to Unicode 5 while the current version is 14, so it's woefully out of date. Don't use it. Even the version your libc ships should be more up to date than that, and that one is very likely also outdated, especially on the stable distributions people like to use.

Another possible version is the fish project's widecharwidth [5]. It's a Python script that parses the Unicode data files, interprets them and generates a public domain C++, Rust or JavaScript implementation of wcwidth. It also gives you some information that ordinary wcwidth doesn't, like whether a codepoint is of "ambiguous" width.

Ambiguous width codepoints are an annoyance because they don't have a defined width. Instead, terminals usually offer a setting to make them wide or narrow. Thankfully it's typically one setting for all of them instead of a separate width per codepoint [6], so this can be supported with a simple toggle in your application.
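Such a toggle could look roughly like this in Python. This is a sketch, not widecharwidth's API: `width_with_ambiguous` is a made-up name, it leans on `unicodedata.east_asian_width` reporting "A" for ambiguous codepoints, and it assumes it is only handed printable, non-combining codepoints.

```python
import unicodedata


def width_with_ambiguous(ch: str, ambiguous_is_wide: bool) -> int:
    """Width of one printable codepoint, with a single user-facing toggle
    that decides how every ambiguous-width codepoint is counted."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A":
        # Ambiguous: follow whatever the user says their terminal does.
        return 2 if ambiguous_is_wide else 1
    return 1


# U+00A7 SECTION SIGN is one of the ambiguous-width codepoints.
print(width_with_ambiguous("\u00a7", ambiguous_is_wide=False))
print(width_with_ambiguous("\u00a7", ambiguous_is_wide=True))
```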

So just use an up-to-date wcwidth and we're done?

Unfortunately, wcwidth is fundamentally limited. Remember how I said these are "simple cases" above?

Well, it turns out Unicode has codepoints that can combine to form glyphs, and the widths don't just add up. You can have an entire family glyph made up of separate family member codepoints, and the width of the combination will be 2. Since wcwidth only operates on a single codepoint, it can't express this.

Simpler combinations are _almost_ supportable. You can often get good end results by, for example, ascribing a width of 1 to U+FE0F, the "emoji variation selector". This is the codepoint that turns a sequence into an emoji presentation sequence. So U+1F6E5 MOTORBOAT has a width of 1, U+FE0F has a width of 1, and the combination has a width of 2. Of course, if a variation selector appears where it doesn't apply, this breaks. "a" followed by a variation selector simply renders as "a"; the variation selector poofs out of existence.

So for full Unicode support we would need a wcswidth function that knows about how codepoints combine, and that's a larger task. Also, we would need terminals to come to the same conclusion.
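To give an idea of what "knows how codepoints combine" means, here is a rough Python sketch of a string-level width function. Everything about it is an assumption on my part: the names `string_width` and `_base_width` are invented, "the variation selector only applies after a codepoint of category So" is a crude heuristic standing in for the real list of emoji variation sequences, and "whatever follows a zero width joiner merges into the previous glyph" ignores whether the sequence is actually a defined one. It only shows the shape of the problem.

```python
import unicodedata

VS16 = "\ufe0f"  # emoji variation selector
ZWJ = "\u200d"   # zero width joiner, glues emoji into one glyph


def _base_width(ch: str) -> int:
    """Width of a single codepoint taken in isolation (simplified)."""
    if unicodedata.category(ch) in ("Cc", "Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def string_width(s: str) -> int:
    total = 0
    last = 0                # cells counted for the most recent base codepoint
    last_is_symbol = False  # heuristic: could VS16 apply to that base?
    after_zwj = False
    for ch in s:
        if ch == ZWJ:
            # Only join if there is something to join onto.
            after_zwj = last > 0
            continue
        if ch == VS16:
            # Widen a narrow symbol to emoji presentation; otherwise vanish.
            if last == 1 and last_is_symbol:
                total += 1
                last = 2
            continue
        w = _base_width(ch)
        if w == 0:
            continue
        if after_zwj:
            # Joined into the previous glyph: contributes no extra cells.
            after_zwj = False
            continue
        total += w
        last = w
        last_is_symbol = unicodedata.category(ch) == "So"
    return total
```

With this, motorboat alone counts as 1, motorboat plus variation selector as 2, "a" plus variation selector as 1, and a man-woman-girl family joined by ZWJs as 2. Whether your terminal agrees is a separate question.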

But then it's solved?

If you are right and the terminal is wrong, things are still broken. On my system, our good friend motorboat is rendered as follows:

  • Konsole renders it as wide, ignoring variation selectors.
  • Gnome Terminal renders what is clearly a wide glyph in one cell, so any following text overlaps it. Adding a variation selector adds color but doesn't change the width.
  • Kitty defaults to narrow and widens with a variation selector.

There is no way to support all of these.

Another complication: the width has changed before, and might change again. Since Unicode 9, many emoji are wide. This means that the terminal and the application need to agree on which Unicode version they support.

Terminal developers like to add yet another escape sequence that applications can send to negotiate the supported Unicode version. I disagree that this is helpful. In practice, it's horrible for applications to detect the terminal to figure out which bespoke escape sequence [7] to send, and so it gets punted to the user. This is something you can beat people over the head with ("why didn't you just"), not a solution.

Instead, I don't believe the width changes often enough in practice to be a big problem, so just using the newest version you can works okay. If you want, add a flag to enable the old pre-Unicode 9 legacy widths. Widecharwidth gives you that option by marking a codepoint as "widened in Unicode 9".
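A legacy flag like that could be sketched as follows in Python. The table here is a one-entry stand-in I picked by hand for illustration (U+1F600 GRINNING FACE, which became wide with Unicode 9); a real table would be generated from the Unicode data files, which is what widecharwidth does. The names `WIDENED_IN_9` and `width` are hypothetical.

```python
import unicodedata

# Illustrative stand-in only: a real table is generated from the data files.
WIDENED_IN_9 = {0x1F600}  # GRINNING FACE


def width(ch: str, legacy_widths: bool = False,
          widened_in_9=WIDENED_IN_9) -> int:
    """Width of one printable codepoint, optionally with pre-Unicode 9 widths."""
    if legacy_widths and ord(ch) in widened_in_9:
        # The terminal predates Unicode 9 and still draws this one narrow.
        return 1
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


print(width("\U0001F600"))                      # current width
print(width("\U0001F600", legacy_widths=True))  # legacy width
```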

In summary: a massive headache, avoid if possible. Also, terminals are kinda bad.

[1] What you do is ioctl the terminal's file descriptor (TIOCGWINSZ). In a shell script this might be available via the $COLUMNS variable.
[2] That specific staircase is caused by fish's syntax highlighting, actually, but that's harder to explain.
[3] This was all originally invented for something to do with Asian character sets that I'm too European to understand. Regardless, we abuse it for terminals because there simply is nothing else.
[4] In a special text format that changes far too often, sigh.
[5] Disclaimer: This means I'm involved.
[6] Please, no terminal add this. I'm begging you.
[7] There used to be a "Terminal Working Group" that was meant to standardize these things. As best as I can tell it did not take and effectively disbanded before producing anything.
