Putting regexes where they don't belong

Published: Thu 21 January 2021

In software.

This is the story of a hack.

It's not anything wrong or bad. It works quite well, but it just has that... quality. The one where you see it and you laugh in amused disgust.

This is the story of how I made python do regexes when it shouldn't do regexes.

The Motive

I've blogged about littlecheck before. It's fish's script test driver [1]. The way it works is that you write a script, and then you write the output you expect into # CHECK: comments inside the script.

Littlecheck then lets whatever interpreter you picked run the script and compares its output to all the # CHECK: lines. This is super simple and works quite well in practice. Here's an example:

echo Hello!
# CHECK: Hello!

echo Goodbye
# CHECK: Goodbye

echo No check for this
# ^^ Oh no, that one will fail.

Only... when things went wrong, littlecheck did this naive comparison where it complained about the first line that was wrong, and then let you figure out the context of all of that.

Was it a superfluous line of output? A # CHECK too many? Or actually a line that was different from what was expected?

It didn't tell you. It just said "this line doesn't look like this CHECK on line XYZ, also here's the rest of the output".
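To make that complaint concrete, here's a rough sketch of a first-mismatch comparison. This is not littlecheck's actual code: collect_checks and first_mismatch are names I made up for illustration, and the sketch compares plain strings to keep it short.

```python
from itertools import zip_longest

SCRIPT = """\
echo Hello!
# CHECK: Hello!

echo Goodbye
# CHECK: Goodbye

echo No check for this
"""


def collect_checks(script):
    """Return (line number, expected text) for every '# CHECK:' comment."""
    prefix = "# CHECK:"
    checks = []
    for lineno, line in enumerate(script.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(prefix):
            checks.append((lineno, stripped[len(prefix):].strip()))
    return checks


def first_mismatch(output_lines, checks):
    """Walk output and checks in lockstep and stop at the first disagreement."""
    for index, (line, check) in enumerate(zip_longest(output_lines, checks)):
        # A missing partner on either side counts as a mismatch too.
        if line is None or check is None or line != check[1]:
            return index, line, check
    return None


checks = collect_checks(SCRIPT)
problem = first_mismatch(["Hello!", "Goodbye", "No check for this"], checks)
print(problem)
```

All this can say is "position 2 is off". Whether that's an extra line, a missing CHECK or a changed line is left for you to work out.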

Now, if you've been around unix a few times, you might know the tool for exactly this problem: diff!

You have a bunch of lines on one side, a bunch on the other and you want to know what the diff-erence is between the two, so you run diff on them!

Well, yeah, you would. Only there's a problem: Littlecheck does regexes.

The Means

The venerable diff utility doesn't handle regexes, and we wouldn't want to launch it anyway [2].

But Littlecheck is written in python, and that has a lot of stuff in the standard library. Maybe there's a diffing tool?

Oh, there's a difflib. Cool!

And python can do regexes - that's what we use to match them in the first place. So let's just pass a comparator to the "SequenceMatcher" thing and be done with it.


Except difflib's SequenceMatcher doesn't take a comparator function, or key, or something comparable (hah!) [3].
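Here's a small sketch of what that means in practice. The constructor takes isjunk, a, b and autojunk, and isjunk only gets a single element to classify as junk, so there's nowhere to plug in "these two count as equal". The sample lines are made up for illustration.

```python
from difflib import SequenceMatcher

output = ["Hello!", "pid 1234"]
expected = ["Hello!", r"pid \d+"]  # the second entry is meant as a regex

# Elements are compared with plain == (after hashing), nothing else.
matcher = SequenceMatcher(None, output, expected, autojunk=False)
tags = [tag for tag, *_ in matcher.get_opcodes()]
print(tags)
```

The first line comes out as "equal", but the regex line is reported as "replace", because to SequenceMatcher it's just a different string.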

The Opportunity

Where we're going we don't need comparator functions.

It turns out SequenceMatcher takes its two inputs as sequences, for example lists. So how about instead of passing strings we pass the regex objects? No, that won't work because the other side is still strings, so when it compares the two it'll just always be false.
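A quick way to convince yourself of that (the pattern and the line are made-up examples):

```python
import re

pattern = re.compile(r"pid \d+")
line = "pid 1234"

# Comparing a compiled regex with a string never matches anything...
compares_equal = pattern == line
# ...even though the regex itself matches the line just fine.
regex_matches = pattern.match(line) is not None
print(compares_equal, regex_matches)
```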

So what if we overload the list's __contains__ function? Well, no, that won't work because SequenceMatcher takes the list and puts the elements into a dictionary [4] that we don't control.

So... how about we override the equality operator? Well, not quite. Since it's a dictionary it first tries hash comparisons. Luckily, that goes via the __hash__ function, so I can write the most awful python I have ever written:

def __hash__(self):
    # Chosen by fair diceroll
    # No, just kidding.
    return 0

This makes the hash comparison always collide, so whenever python checks if something is in the dictionary it'll have to take the long route and do an actual comparison. After that, we hack the __eq__ function to do a regex match:

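def __eq__(self, other):
    if other is None:
        return False
    if isinstance(other, CheckCmd):
        # A line "equals" a check when the check's regex matches its text
        return other.regex.match(self.text) is not None
    if isinstance(other, Line):
        # We only compare the text here so SequenceMatcher can reshuffle these
        return self.text == other.text
    # Let python try the other operand instead of blowing up inside ==
    return NotImplemented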
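To see the whole trick in one place, here's a minimal self-contained sketch. The Line and CheckCmd classes below are cut-down stand-ins, not littlecheck's real ones, and giving CheckCmd its own constant __hash__ and an __eq__ is my assumption about what it takes to get both sides into the same hash bucket.

```python
import re
from difflib import SequenceMatcher


class Line:
    """One line of actual output (simplified stand-in)."""

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        # Constant hash: every lookup collides, so the dict falls back to ==.
        return 0

    def __eq__(self, other):
        if isinstance(other, CheckCmd):
            return other.regex.match(self.text) is not None
        if isinstance(other, Line):
            return self.text == other.text
        return NotImplemented


class CheckCmd:
    """One expected-output regex (simplified stand-in)."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.regex = re.compile(pattern)

    def __hash__(self):
        # Must collide with Line's hash, or the lookup never reaches __eq__.
        return 0

    def __eq__(self, other):
        if isinstance(other, Line):
            return self.regex.match(other.text) is not None
        if isinstance(other, CheckCmd):
            return self.pattern == other.pattern
        return NotImplemented


lines = [Line("Hello!"), Line("surprise"), Line("pid 1234")]
checks = [CheckCmd(r"Hello!"), CheckCmd(r"pid \d+")]

# The checks go in as b, which is the side SequenceMatcher puts into its dict.
matcher = SequenceMatcher(a=lines, b=checks, autojunk=False)
opcodes = matcher.get_opcodes()
for tag, a_start, a_end, b_start, b_end in opcodes:
    print(tag, [l.text for l in lines[a_start:a_end]],
          [c.pattern for c in checks[b_start:b_end]])
```

The opcodes come out as equal, delete, equal: the "surprise" line is flagged as extra output, and the pid line still pairs up with its regex afterwards instead of everything downstream being reported as wrong.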

From there it's smooth sailing putting lipstick on this pig... boat [5], and this is what it looks like:

[Image: example output, showing that Littlecheck can identify errors somewhere in the middle without misinterpreting everything after.]

And that's the way I like it.

[1] Not to be confused with the unit test driver or the interactive test driver.
[2] Littlecheck is a single-file utility with no dependencies other than python, and we'd like to keep it that way.
[3] To be honest I don't think difflib is "great". It's mostly a collection of things that someone found useful, once, and the API is a mess. Also: I intend all my puns. Even the accidental ones I intend in principle.
[4] In what I'm pretty sure is an example of premature optimization. That or someone ran this stuff on gigabytes of text and expected an answer in milliseconds.
[5] Mixing metaphors is good fun, actually.