Published: Mon 07 February 2022
Recently I came across Darren Burns' "Terminal support for emoji", and I feel like I need to say a few things.
Imagine you're writing a shell, and you want to add a right prompt. That is, more prompt text, but on the right of the window.
How would you go about it?
In simple terms, what you need to do is:

1. Get the prompt text.
2. Figure out how long it is.
3. Tell the terminal to move the cursor to X cells from the right edge.
4. Write the prompt text.
This seems simple enough: you get the text, you get its length, you get the terminal's width, and you tell the terminal to move the cursor to width - length.
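As a minimal sketch of those steps in Python (the function name is mine, and it deliberately uses `len()` as the width function, which is exactly the mistake the rest of this post is about; `CSI n G` is the ANSI "Cursor Horizontal Absolute" sequence):

```python
def right_prompt_sequence(prompt: str, columns: int, width_of=len) -> str:
    # Compute the 1-based column where the prompt must start so that
    # its last cell lands on the right edge of a `columns`-wide terminal.
    start = columns - width_of(prompt) + 1
    # CSI <n> G moves the cursor to column n on the current line.
    return f"\x1b[{start}G{prompt}"

# In an 80-column terminal, a 5-cell prompt starts at column 76.
print(repr(right_prompt_sequence("12:00", 80)))
```

If `width_of` overcounts, `start` is too small and the prompt stops short of the edge; if it undercounts, the prompt runs past the edge and wraps.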
If you get this wrong, at best you overcounted the length and the prompt stops short of the right edge. At worst you undercounted and the text wraps to the next line.
This would result in a "staircase", with each prompt shifted further down and across the screen:
But if you got the length right, you would end up with the right prompt up to the right edge, and all would be good.
The only question is: How do you get the "length"?
You don't actually want the "length". You want the "width". "Length" sounds like it's the number of "characters" or bytes or something, which isn't enough.
I'm going to spare you another full explanation of the basics of Unicode. To be overly reductive: the basic unit in Unicode is the "codepoint", a numbered entry in the standard that means something. Many things you think of as "characters" map to a single codepoint. "A" is a codepoint, "a" is another, and so is " " (space). "🌶" is a codepoint named "HOT PEPPER", with the number U+1F336.
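You can poke at this directly from Python's standard library:

```python
import unicodedata

# Each of these "characters" is a single Unicode codepoint with
# a number and a name in the standard.
print(hex(ord("A")))               # the codepoint number of "A"
print(hex(ord("🌶")))              # 0x1f336
print(unicodedata.name("🌶"))      # HOT PEPPER
```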
Unicode codepoints differ in width, and not just in proportional-width fonts (you know, like in a GUI). Even in a terminal, many are _narrow_ (taking up 1 cell), some are _wide_ (taking up 2 cells), and many aren't printable at all: they might have 0 width, or even remove characters (backspace!). And these are the simple cases.
The best definition we have of this is Unicode Standard Annex #11, "East Asian Width". It is updated and released alongside each version of the Unicode standard, together with data files that contain the width information for every codepoint.
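Python's standard library exposes this property via `unicodedata.east_asian_width`, which makes the categories easy to see (the exact results depend on the Unicode database your Python was built with):

```python
import unicodedata

# UAX #11 assigns every codepoint an East Asian Width class:
# 'Na' narrow, 'W' wide, 'F' fullwidth, 'H' halfwidth,
# 'A' ambiguous, 'N' neutral.
print(unicodedata.east_asian_width("a"))   # 'Na' - narrow
print(unicodedata.east_asian_width("漢"))  # 'W'  - wide, takes 2 cells
print(unicodedata.east_asian_width("¡"))   # 'A'  - ambiguous
```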
These data files are hard to understand. One point where I believe Darren's post goes wrong is that "emoji presentation sequence" is a fairly specific thing, and as best as I can tell it doesn't apply to the 🛥 MOTORBOAT codepoint on its own: it defaults to text presentation, which means its given width counts.
So what should I do?
There is a function called wcwidth that you can use to determine the width of a codepoint. Darren's blog post points to Markus Kuhn's well-known implementation.
The major problem with that implementation is that it supports Unicode up to version 5, while the current version is 14, so it's woefully out of date. Don't use it. Even the wcwidth your libc ships should be more up-to-date than that, and it is very likely also outdated, especially on the stable distributions people like to use.
Another possible implementation is the fish project's widecharwidth.
Ambiguous-width codepoints are an annoyance because they don't have a defined width; terminals usually offer a setting to treat them as wide or narrow. Thankfully it's typically one setting for all of them rather than a separate width per codepoint, so this can be supported with a simple toggle in your application.
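To make the shape of such a toggle concrete, here is a toy wcwidth sketched on top of the stdlib's `unicodedata`. This is an illustration, not a real implementation: a complete one needs the full UAX #11 tables, proper control-character handling, and an up-to-date Unicode database.

```python
import unicodedata

def wcwidth(ch: str, ambiguous_wide: bool = False) -> int:
    # Combining marks and format characters (e.g. zero-width joiner)
    # occupy no cell of their own.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):          # wide and fullwidth: 2 cells
        return 2
    if eaw == "A":                 # ambiguous: one global toggle
        return 2 if ambiguous_wide else 1
    return 1                       # everything else: 1 cell
```

Usage: `wcwidth("¡")` gives 1, while `wcwidth("¡", ambiguous_wide=True)` gives 2, mirroring the single terminal-wide setting described above.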
So just use an up-to-date wcwidth and we're done?
Unfortunately, wcwidth is fundamentally limited. Remember how I said these are "simple cases" above?
Well, it turns out Unicode has codepoints that can combine to form glyphs, and the widths don't just add up. You can have an entire family glyph made up of separate family member codepoints, and the width of the combination will be 2. Since wcwidth only operates on a single codepoint, it can't express this.
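The family glyph makes this concrete. The per-codepoint widths below are illustrative (2 for each emoji, 0 for the joiner), but the point stands with any per-codepoint table:

```python
# "Family: man, woman, girl" is one glyph built from three emoji
# codepoints joined by U+200D ZERO WIDTH JOINER.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))  # 5 codepoints

# Summing per-codepoint widths gives 2 + 0 + 2 + 0 + 2 = 6 cells,
# but a terminal that understands the sequence renders it in 2 cells.
per_codepoint = {0x1F468: 2, 0x1F469: 2, 0x1F467: 2, 0x200D: 0}
naive_width = sum(per_codepoint[ord(c)] for c in family)
print(naive_width)  # 6
```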
Simpler combinations are _almost_ supportable. You can often get good end results by, for example, ascribing a width of 1 to U+FE0F, the "emoji variation selector". This is the codepoint that turns a sequence into an emoji presentation sequence. So U+1F6E5 MOTORBOAT has a width of 1, U+FE0F has a width of 1, the combination has a width of 2. Of course if a variation selector appears where it doesn't apply this breaks. "a" and a variation selector simply renders as "a", the variation selector poofs out of existence.
So for full Unicode support we would need a wcswidth function that knows how codepoints combine, and that's a larger task. We would also need terminals to come to the same conclusion.
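A hypothetical sketch of such a sequence-aware wcswidth, handling just the two cases discussed above (a variation selector widening the preceding codepoint, and a zero-width joiner fusing its neighbours into one glyph). The per-codepoint width function is a stand-in, and a mis-applied variation selector is still mishandled here, exactly as the post warns:

```python
import unicodedata

VS16 = "\uFE0F"  # emoji variation selector
ZWJ = "\u200D"   # zero width joiner

def per_codepoint_width(ch: str) -> int:
    # Stand-in for a real, up-to-date wcwidth.
    if unicodedata.combining(ch) or ch in (VS16, ZWJ):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def wcswidth(s: str) -> int:
    width = 0
    prev_width = 0
    skip_next = False
    for ch in s:
        if skip_next:          # this codepoint was fused by a ZWJ
            skip_next = False
            continue
        if ch == VS16:
            width += 2 - prev_width  # force the previous glyph to width 2
            prev_width = 2
        elif ch == ZWJ:
            skip_next = True   # the joined codepoint adds no extra cells
        else:
            prev_width = per_codepoint_width(ch)
            width += prev_width
    return width
```

With this, motorboat plus variation selector comes out as 2, and the whole family sequence as 2 rather than 6.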
But then it's solved?
If you are right and the terminal is wrong, things are still broken. On my system, our good friend motorboat is rendered as follows:
- Konsole renders it as wide, ignoring variation selectors.
- Gnome Terminal renders what is clearly a wide glyph in one cell, so any following text overlaps it. Adding a variation selector adds color but doesn't change the width.
- Kitty defaults to narrow and widens with a variation selector.
There is no way to support all of these.
Another complication: the widths have changed before and might change again - since Unicode 9, many emoji are wide. This means the terminal and the application need to agree on which Unicode version they support.
Terminal developers like to solve this by adding yet another escape sequence that applications can send to negotiate the supported Unicode version. I don't believe this is helpful. In practice it's horrible for applications to detect the terminal to figure out which bespoke escape sequence to send, so it gets punted to the user. This is something you can beat people over the head with ("why didn't you just"), not a solution.
Instead, I don't believe the widths change often enough in practice to be a big problem, so just using the newest version you can works okay. If you want, add a flag to enable the old, pre-Unicode-9 legacy widths. Widecharwidth gives you that option by marking a codepoint as "widened in Unicode 9".
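Such a legacy-width flag might look like this sketch. The table and function are hypothetical (widecharwidth's actual API differs); 😀 GRINNING FACE is used as an example of a codepoint that became wide with Unicode 9's emoji changes:

```python
# Hypothetical per-codepoint table marking codepoints that were
# widened in Unicode 9, as widecharwidth does.
WIDENED_IN_9 = {0x1F600}  # U+1F600 GRINNING FACE

def width(cp: int, unicode9: bool = True) -> int:
    if cp in WIDENED_IN_9:
        return 2 if unicode9 else 1
    return 1  # all other width logic elided in this sketch
```

An application would expose `unicode9` as a user-facing option and otherwise default to the newest tables it has.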
In summary: a massive headache, avoid if possible. Also
Terminals are kinda bad.