antirez 2 years ago

Apparently I reinvented almost the same code several years after Pike:

https://github.com/redis/redis/blob/99ebbee2b260b3466672b4fa...

  • hddqsb 2 years ago

    Just a heads up that this implementation has exponential time complexity on pathological patterns.

    Demo: https://tio.run/##tVddc6IwFH3nV0Q7WyXFIlrdUWr7uC/9B213BkIQKo...

        pattern = "a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z"
        text    = "aabbccddeeffgghhiijjkkllmmnnooppqqrrssttuuvvwwxxyyzza"
        running time = 1.6s
    
    It should be relatively easy to fix this. Insight: If the pattern contains two or more wildcards (i.e. it has the form "<A>*<B>*<C>"), and you've found a match for "<A>*<B>" but can't find a match for "*<C>", you can stop immediately without considering longer matches for the first wildcard. The reason is that a longer match for the first wildcard means you start looking for "*<C>" later in the string, and that's strictly harder.
    • ball_of_lint 2 years ago

      That might help, but I think it goes in the wrong direction. If you are worried about performance against arbitrary patterns, the proper way to implement a regex matcher is to build a finite state machine. The code isn't as simple and it requires some computer science theory, but it results in a matcher with better time complexity guarantees.

      • hddqsb 2 years ago

        [Note: This comment tree is about the Redis code, which uses glob-style patterns rather than regex. So in this comment and in my grandparent comment, "*" means "any string" rather than "zero or more of the previous pattern".]

        An NFA / DFA would be a perfectly valid way to implement an efficient matcher, but it's not the only way.

        For example, memoization (https://news.ycombinator.com/item?id=32434412#32436442) can be easily shown to have time complexity O(P * T) where P = len(pattern) and T = len(text), which is the same as a basic NFA implementation.

        In the case of the Redis code, my proposed fix requires much less code and memory than memoization or an NFA, and has time complexity O(P * T). It's slightly harder to prove this bound; here is a sketch of the proof:

        Matching against a pattern fragment that does not contain "*" takes O(len(pattern_fragment)).

        When a call to stringmatchlen() encounters a wildcard, it may recurse O(T) times (and then it returns). Each recursive call can be classified either "leaf" (if there is a match failure before the next wildcard) or "full" (if the match reaches the next wildcard).

        The insight I described leads to at most one "full" call at each level of the recursion (i.e. for each wildcard in the pattern). So for a particular pattern fragment, the number of "leaf" calls that process it is O(T). Hence the overall time complexity is O(P * T).
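
        Concretely, here is a minimal Go sketch of the fixed matcher (a hypothetical matchFrag supporting only '?' and '*', unlike the real stringmatchlen, which also handles character classes and escapes). The comments use the "leaf"/"full" terminology from the proof sketch above:

            // matchFrag reports whether text matches pattern in full
            // ('?' = any one character, '*' = any string). The second
            // result reports whether matching reached a '*' before
            // failing; if it did, the caller's own '*' loop stops early,
            // since a longer match for its wildcard would only restart
            // the failed "*<C>" search further right in the text.
            func matchFrag(pattern, text string) (ok, passedStar bool) {
                for len(pattern) > 0 {
                    if pattern[0] == '*' {
                        rest := pattern[1:]
                        for i := 0; i <= len(text); i++ {
                            ok, deeper := matchFrag(rest, text[i:])
                            if ok {
                                return true, true
                            }
                            if deeper {
                                // A "full" call failed: stop immediately.
                                return false, true
                            }
                            // A "leaf" failure: try a longer match.
                        }
                        return false, true
                    }
                    if len(text) == 0 || (pattern[0] != '?' && pattern[0] != text[0]) {
                        return false, false // leaf failure before any '*'
                    }
                    pattern, text = pattern[1:], text[1:]
                }
                return len(text) == 0, false
            }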

        • burntsushi 2 years ago

          See also: https://github.com/google/re2/blob/main/re2/bitstate.cc

          Although its space complexity is P*T, which limits its use to small regexes/haystacks. It sounds like your claim is that your technique achieves a smaller space bound?

          • hddqsb 2 years ago

            Yes. My proposed fix takes O(P * T) time and O(1) memory (but it only supports glob-style wildcards; it doesn't support repetition of specific characters / character classes / sub-patterns).

            • burntsushi 2 years ago

              Ah, gotcha. An important clarification indeed. Thanks for that!

      • ridiculous_fish 2 years ago

        Finite state machines also have their cliffs, in particular bounded repetition. Rust's FA-based regex will explode with /.{12345}/, while v8's backtracking engine handles it no problem.

        • burntsushi 2 years ago

          Yes, but please be careful here. The cliffs are categorically different. The cliffs of a finite state machine are still bounded to O(m * n), where m ~ len(regex) and n ~ len(haystack). The cliffs of a backtracker are exponential and difficult to predict.

          The repetition operator provokes the cliff because it represents short-hand syntax for easily increasing `m` to very large sizes. Indeed, by default, '.{12345}' will fail to compile because it blows the default limit. That only works because, as I said, the cliffs for finite automata based regex engines are far more predictable.

jll29 2 years ago

I like the article that this post refers to about efficient implementation (https://swtch.com/~rsc/regexp/regexp1.html). It mentions a debate between J. Friedl and comp. sci. purists about whether NFAs and regular expressions should really be called what they are in the face of backreferences and other notation. Friedl suggests another name should be used, because de facto "NFAs" are no longer equivalent to DFAs.

Since Xerox Research Centre Europe's invention of xfst, the term "extended regular expression" is already taken for a specific superclass of expressions that describe regular relations.

So I propose two alternatives, semi-regular expressions and super-regular expressions, for strings that look like regular expressions but use non-regular extensions like backreferences that push the complexity beyond what can be done with NFAs/DFAs.

  • eru 2 years ago

    What really irks me is that there are several useful operations that can be supported by proper regular expressions in linear time, but those franken-expressions never feature them.

    I am mostly talking about set intersection and complement. But https://en.wikipedia.org/wiki/Regular_language#Closure_prope... has some more.
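
    (For illustration: matching the intersection really is linear time with honest DFAs. Run both machines in lockstep, i.e. the classic product construction without ever materializing it; complement is just negating accept on a complete DFA. A toy sketch with a hypothetical DFA type:)

        // DFA over bytes; delta must be total (defined for every state/byte).
        type DFA struct {
            start  int
            accept map[int]bool
            delta  func(state int, b byte) int
        }

        // IntersectMatch accepts iff both DFAs accept: one linear pass
        // over the input, stepping the pair of states in lockstep.
        func IntersectMatch(a, b DFA, input string) bool {
            s, t := a.start, b.start
            for i := 0; i < len(input); i++ {
                s, t = a.delta(s, input[i]), b.delta(t, input[i])
            }
            return a.accept[s] && b.accept[t]
        }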

    • mananaysiempre 2 years ago

      Burntsushi (of regex crate / ripgrep fame) is on record saying[1] he thinks those are intractable in a standard DFA implementation on a large alphabet (Unicode), and if he can’t do it, I don’t think there’s much hope for the rest of us :) They are relatively straightforward in derivative-based implementations[2,3], but those are kind of meh performance-wise AFAIU, and just eagerly building the DFA using derivatives[4] yields huge automata.

      (To be fair, the regex crate can’t match canonically equivalent Unicode strings, which I also think is pretty meh, but I’ve played with Unicode normalization recently and I’m convinced it would be impossible to support canonical equivalence with the same order of magnitude in performance without going through horrible contortions.)

      [1] https://news.ycombinator.com/item?id=31693008

      [2] https://news.ycombinator.com/item?id=17652542

      [3] https://news.ycombinator.com/item?id=8199902

      [4] https://news.ycombinator.com/item?id=25604829

      • burntsushi 2 years ago

        > To be fair, the regex crate can’t match canonically equivalent Unicode strings, which I also think is pretty meh

        Yeah, not "meh" at all. :-) I don't think any regex engine can do it? Does ICU's regex engine even do it? Certainly no FSM regex engine does.

        UTS#18 doesn't even ask regex engines to do it. If you need it, it just suggests canonicalizing both the regex and the haystack first.
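
        (For concreteness, a sketch of that normalize-first suggestion in Go, using the x/text normalization package; the pattern and haystack here are my own toy examples:)

            package main

            import (
                "fmt"
                "regexp"

                "golang.org/x/text/unicode/norm"
            )

            func main() {
                // Canonicalize both sides to NFC before matching.
                pattern := norm.NFC.String("café")             // composed é
                haystack := norm.NFC.String("cafe\u0301 time") // decomposed e + combining acute
                re := regexp.MustCompile(regexp.QuoteMeta(pattern))
                fmt.Println(re.MatchString(haystack)) // true after normalization
            }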

        And yes, derivatives build full DFAs. Those are a no-go in a general purpose regex engine.

        There's also the issue of complement and intersection being somewhat difficult to reason about.

        I'll also note that I have never given any serious effort to actually trying to implement complement or intersection. I just don't see any viable implementation path to it in the first place. I don't know where I would start. Everything I can think of involves extreme compile-time costs that would just be totally unacceptable in a general purpose regex engine.

        But, I am not really an "ideas" person. I see myself more as an engineer. So my inability should not really be taken for gospel. Be skeptical of expertise. That's the foundation of science. :)

        • hayley-patton 2 years ago

          > And yes, derivatives build full DFAs. Those are a no-go in a general purpose regex engine.

          Lazily compute derivatives? From memory, Rust's regex crate computes and caches a DFA lazily from the NFA, for comparison.

          • burntsushi 2 years ago

            Feel free to show me an implementation that does it with an alphabet of bytes, not codepoints, while maintaining support for Unicode classes.

            It doesn't strike me as something amenable to lazy compilation. But maybe I'm wrong.

        • mananaysiempre 2 years ago

          No, I don’t think anybody can do it.

          Unless I can’t read, ICU4C does not even do streaming normalization, and the buffer-at-a-time one is around 100-200 MB/s or so[1], which looks ... mildly amusing next to your engine :) My own attempt at streaming normalization is currently about two times slower, because it has deeper (and smaller) tables and is just generally dumb, but then ICU4C is not particularly smart either. I expect that normalization at 2x ICU speed or more is possible, but that still won’t make normalize-then-match competitive with no normalization, while encoding all canonical equivalents into the DFA sounds downright hellish (also strictly speaking impossible, given normalization requires unbounded buffer space, but that part is probably solvable).

          UTS #18 level 2 (which I’m aware you explicitly did not attempt) kind of says some things about canonical equivalence and then goes “nah, normalization is difficult, let’s go match grapheme clusters, also maybe just prenormalize to NFD”[2]. Which just looks weird, but apparently it means they did try to impose a requirement then gave up more than a decade ago[3]. Does anything even fully implement level 2, I wonder?

          So, meh, but not really directed at you—regex is supernaturally fast as far as I’m concerned. Perhaps I’m annoyed at Unicode normalization for being (not really complicated but) so punishingly difficult to do quickly it isn’t really clear how you could wire it into a regex matcher.

          There is also the question of how normalization-invariant regexes should even work: should “<C-caron>.” match “<C-cedilla><caron>”? UTS #18 (with invariance requirement restored) seems to imply yes, but it’s hard to imagine when anyone would actually want that. (Note that both pattern and haystack are normalized—the corresponding regex for NFD input would be something like “(C<caron>[[:ccc=0:][:ccc>=230:]]|C[[:ccc>0:]&[:ccc<230:]]<caron>)”, but nothing implements that syntax.) So while UTS #18 gives a possible interpretation of how Unicode regexes should work, I’m not entirely convinced it’s the right one, or even right enough to go on the crusade of implementing a normalization-invariant regex engine.

          [1] Or so I thought, but my tests used UTF-32 while apparently UTF-16 is much faster: https://tzlaine.github.io/text/doc/html/boost_text__proposed.... Ugh.

          [2] https://www.unicode.org/reports/tr18/#Canonical_Equivalents

          [3] https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_...

          • burntsushi 2 years ago

            > Does anything even fully implement level 2, I wonder?

            Not sure about backtrackers, but I know ICgrep attempted quite a bit of level 2: http://international-characters.com/icgrep

            But I'm not sure how far they got. And most of the links on that page are dead. :-(

            And yeah, UTS#18 actually used to have a level 3 (custom tailoring), but they removed it.

            I'm content with level 1 support. The regex crate is just about there and that actually makes it have better Unicode support than the vast majority of regex engines. :-)

            Level 2 is indeed hard.

  • Groxx 2 years ago

    Alternatively: declare that "regex" has graduated into its own proper term (not a contraction) for whatever this evolving somewhat-too-capable-for-easy-classification tool is.

    It's certainly how it's used colloquially.

    • amscanne 2 years ago

      The term “regex” is already used interchangeably for both classes. The Go regular expression library (which is widely used) does not support backreferences, and is guaranteed to run in linear time.

      It seems sillier to say “that’s not a regex” (since it definitely is a regular expression engine), than to say that the others are extended regular expressions or some other, more flexible thing (for better and worse).
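
      (A quick demonstration of the flip side of that guarantee: Go's regexp rejects backreferences at compile time rather than giving up linear-time matching. A small sketch, with the expected error shown in a comment:)

          package main

          import (
              "fmt"
              "regexp"
          )

          func main() {
              _, err := regexp.Compile(`(a)\1`)
              fmt.Println(err) // error parsing regexp: invalid escape sequence: `\1`
          }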

  • kazinator 2 years ago

    I cannot agree here. The term "regular expression" relates to "regular language".

    So instead of inventing new words haphazardly, we should look at what kind of language we are able to express with these non-regular expressions, what the name of that language is and then use that name for the expressions.

    Here is a starting point: where do regular languages sit in the Chomsky hierarchy:

    https://en.wikipedia.org/wiki/Regular_language#Location_in_t...

    Here, we see that there are also less inclusive languages contained in the regular ones: finite languages and star-free languages, for which the corresponding terms "finite expression" and "star-free expression" make sense.

    In the other direction, maybe some useful language kinds can be identified between "regular" and "context-free" in relation to the capabilities of certain kinds of expressions. However, "context-free expression" isn't a bad start. ("Cfex" [si:fex]?)

  • 082349872349872 2 years ago

    Of those, I'd prefer semi-regular expressions (but personally currently lean towards irregular expressions). As they lose the nice properties, there's certainly nothing "super"-regular about them.

    • LudwigNagasena 2 years ago

      Super-x may also mean that it has some extra properties wrt x. In fact, it was the original meaning that has drifted to its more common use today. Cf. “hyper” as in “hypersonic”. Both “super” and “hyper” even mean the same thing (“over”/“above”) in their original languages (Latin and Greek respectively).

    • catlifeonmars 2 years ago

      What about pseudo-regular expressions?

  • anitil 2 years ago

    I've read that article so many times I knew what it was by seeing the link. The first few times it didn't make sense, and now it seems almost obvious - I can't remember what I was not understanding.

  • flashgordon 2 years ago

    Hah, this is a classic. I actually used it as a reference to build my own in TypeScript about a year ago (documentation has a long way to go and it's still WIP), which I am using in a couple of production-ready DSLs:

    https://github.com/panyam/tlex

compsciphd 2 years ago

So fun story (to me in retrospect).

I had an interview at Google years ago where I was basically asked to come up with this on my own. In my opinion it's a terrible interview Q (per Kernighan's blog, linked in the original post, it took Pike an hour or two alone in his office), and I bombed it. It was my interview either just before or just after lunch, and it stuck with me for the rest of the day; I couldn't give my full focus to the rest of the sessions.

Anyways, after it was done, I found Kernighan's blog post and decided I should try to implement this in Java, as it would let me get some practice with things like JUnit and more experience with object-oriented programming (I had been living in the C world at that time). So I did. But then I found this regular expressions page (https://www.regular-expressions.info/) and learned many more things about "regular expressions" that I hadn't learned in school (because they aren't regular), and wondered if I could extend Pike's simplicity to them in a relatively simple manner. So I did.

Which is how I created this: https://github.com/sjpotter/regex.

Not meant to be "performant" but meant to be educational to me and perhaps others.

Then, when I was learning Go, I rewrote it in Go. I did things that are not "normal" Go patterns, i.e. using panic and recover to make error handling cleaner (in my opinion, if an error path is simply "if err != nil { return err }" all the way up, panic/recover at the top can be cleaner, but it seems most disagree). Still, it was educational to see how to rewrite my Java code in an object-oriented style in Go, and to explore the limitations of interfaces and how one might get around them (also perhaps in ways that go against what might be considered proper Go programming):

https://github.com/sjpotter/regex-go

Though now that I look at that repo, I'm wondering if all the work was done elsewhere and I just pushed a single commit there; I might need to go back and look at it.

  • kramerger 2 years ago

    > I personally think panics/recover at the top can be cleaner, but it seems most disagree

    Count me among those. At best, panic/recover would be an uglier version of throw/catch with even worse performance.

    That said, I hope someday Go adds the "?" return-operator that Kotlin and Rust use. It won't really change anything besides adding syntactic sugar for an already recommended pattern, but development ergonomics would improve a lot.

    • mdcfrancis 2 years ago

      The authors of the encoding/json library thought the same. As a way to exit from deep down in a parse stack it works well, especially if you only have limited entry points. https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/en... uses well-typed panics.

      • nemo1618 2 years ago

        The stdlib is not always the best example of idiomatic Go; encoding/json is particularly infamous for having poor performance and awkward APIs. I'm fairly confident that if it were re-written today, it wouldn't use panic/recover.

      • compsciphd 2 years ago

        Which was why I used it. I basically had a single entry point (if I remember correctly), something like "match(regex, text)", and hence it was easier to just panic at the error location and recover within the match() call than to put "if err != nil" everywhere. Or at least that's what I thought then in terms of cleanliness.

        In all the production code I've written in Go since then, I haven't used panic/recover, because it's not considered proper. But I'm still a bit skeptical, as I do think it can make code a bit cleaner when writing a library (though I'd 100% agree that the panic should never escape the library and only a plain error should be returned to the caller, i.e. you need strong library boundaries for this).
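
        For what it's worth, the pattern I mean looks roughly like this (a hedged sketch with hypothetical names; the key point is that the panic carries a library-private type and is recovered only at the single entry point, so it can never escape the library):

            // matchError is the library-private panic payload.
            type matchError struct{ err error }

            // Match is the single entry point; it converts our own
            // panics back into ordinary errors at the boundary.
            func Match(regex, text string) (ok bool, err error) {
                defer func() {
                    if r := recover(); r != nil {
                        me, ours := r.(matchError)
                        if !ours {
                            panic(r) // not our panic: re-raise it
                        }
                        err = me.err
                    }
                }()
                return matchHere(regex, text), nil
            }

            func matchHere(regex, text string) bool {
                // ...deep recursion; on a malformed pattern, instead of
                // threading `if err != nil` up every level:
                //     panic(matchError{errors.New("dangling '*'")})
                return false
            }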

      • paulsmith 2 years ago

        It’s a viable alternative to returning error values in a recursive descent parser.

  • benhoyt 2 years ago

    I agree it's way too hard as a one-hour interview question! When I interviewed at Google a few years ago I was asked to write a simple glob matcher that handled ? and * (like the one I've shown here: https://benhoyt.com/writings/rob-pike-regex/#bonus-glob-matc...). I think that question was borderline okay for an interview, though when I conduct interviews I prefer to give less algorithm-y problems. It was a tough question for me to do in 30-40 minutes, though I think with a little bit of help I got a working solution in Python (I probably didn't handle edge cases properly).

    • layer8 2 years ago

      I’ve implemented glob matchers by regex-replacing the glob pattern, turning it into a regex, and then using that regex. I wonder how that would have gone in an interview. :)
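
      Something like this, say (a Go sketch with a hypothetical globToRegexp helper, using the standard regexp and strings packages; '*' and '?' are rewritten, everything else is escaped with regexp.QuoteMeta):

          // globToRegexp turns a glob pattern into an anchored regexp.
          func globToRegexp(glob string) (*regexp.Regexp, error) {
              var b strings.Builder
              b.WriteByte('^')
              for _, r := range glob {
                  switch r {
                  case '*':
                      b.WriteString(".*")
                  case '?':
                      b.WriteByte('.')
                  default:
                      b.WriteString(regexp.QuoteMeta(string(r)))
                  }
              }
              b.WriteByte('$')
              return regexp.Compile(b.String())
          }

      For example, globToRegexp("a*b?c") compiles "^a.*b.c$", which matches "aXXbYc" but not "aXXbYYc".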

    • IshKebab 2 years ago

      Honestly even that is way too hard. The people who originally came up with DFA and NFA engines and all that didn't do it under pressure in 40 minutes. You're basically selecting for people who have done it before.

      • naniwaduni 2 years ago

        But Dijkstra's shortest-paths algorithm is, of course, a perfect interview question, having infamously been derived in half an hour over a coffee date with his fiancée.

        • google234123 2 years ago

          This is a completely unfair comparison. Deriving is not comparable to reciting/applying something... Dijkstra's is pretty intuitive once you know it, and about 10 or 12 lines of Python code? Quantum mechanics took years and multiple geniuses to figure out, yet somehow undergrads and graduate students are able to learn it in one or two semesters.

          • IshKebab 2 years ago

            I'm not sure I follow - interview questions are supposed to be about deriving things, not reciting/applying them. In any case I'm pretty sure he was being sarcastic. You aren't hiring Dijkstra.

            • google234123 2 years ago

              You are right. However, they also test that you know the fundamentals and are able to apply and recite them.

    • compsciphd 2 years ago

      We probably got asked the same Q, but I believe Pike's solution is close to the optimal solution to the Google interview Q (in terms of being able to code it out on a board in a short period of time).

      If I remember correctly, I was asked to do a regex that handled ., ? and *.

      The problem I have with this question is that if you spend 10-15 minutes going down a very wrong path, it's very very difficult to correct yourself in a pressurized environment.

      I've tried to take the question and reformat it.

      i.e.

      I ask what language they want to program in (presuming I'm comfortable with the languages they want to use; otherwise I wouldn't be in the room, as it wouldn't be a fair evaluation).

      I give them the basic "strcmp"-like function from Pike's implementation (i.e. matching character by character), something like:

        int match(char *s, char *r) {
          if (*s == '\0' && *r == '\0') {
            return 1;
          } else if (*s == *r) {
            return match(s+1, r+1);
          } else {
            return 0;
          }
        }
            
      First Q. What does this do? (explained above)

      Second Q. How could we modify it to match a substring? (My expected answer is to just iterate over the text like Pike does, but as with everything, you need to be open to where the interview takes you; sometimes you might need to say "that works for this, but for the future Qs you might want to think of it this way" and give them some code, as again, I don't want them to code themselves into a corner that they don't have time to get out of.)

      Third Q. How could we add anchor support (a la regex, with a bit of explanation if they don't know what anchors are)? (A la Pike, the anchor determines where we are in the text: for ^ no iterating, and for $ we should be at a '\0' in the text.)

      Fourth Q. How could we add "." support? (Just a || change to the primary matching condition, while being smart enough to check for '\0', which shouldn't match, though in a pressurized environment they might not get that right away without prodding.)

      Fifth Q. How could we add "?" support? (Now we have to look ahead a bit, a la what Pike does.)

      Sixth Q (and the hardest). How could we add "*" support?

      I want to see if they can take some existing code, digest it, and then see how they think about expanding it. As there are many deliverables along the way, I'd hope to get a decent amount of "signal" even if they can't get to "?" or "*".

      I also give them the link to Kernighan's discussion of Pike's code at the end.

      If people have critiques of this approach to the Q (or further ways to improve it), I'm open.

  • renewiltord 2 years ago

    Absolutely insane as an interview question. I know lots of top engineers who've built the fundamentals and driven companies to billions of dollars multiple times who would have failed to complete this within an hour.

    I refuse to believe anyone could imagine it's possible. Either that or there's a whole bunch of hidden talent wasting away at Google - which sounds unlikely.

    • cmrdporcupine 2 years ago

      The Google interview is more interested in how you approach the problem, explain it, and analyze it etc than in a working solution. You can pass the interview without having fully working code.

      (I did interview training at Google twice)

      I still think it's a terrible interview format, though.

    • iainmerrick 2 years ago

      Or, most developers at Google do interviews but many of them aren’t very good at calibrating their questions.

mkovach 2 years ago

I love this bit of code. It saved my butt in '99 when I had to implement searches for potential "Y2K" coding issues. Somewhere, collecting dust in a CVS repository, probably saved on tape in some storage bunker, is a modified version of this that searched millions of lines of code for various potential "double digit year" issues.

And this also taught me an elegant way to implement recursion in C, that I shamelessly used in interviews to look way smarter than I actually am.

hddqsb 2 years ago

> horrible run times on craftily-constructed regexes

Here is a simple patch to make the runtime O(len(pattern) * len(text)) instead of exponential. It adds 5 lines and a new parameter. The idea is to memoize (cache) the results, specifically the failures (there is no need to cache the successes, because all the functions return immediately on success).

     // Match reports whether regexp matches anywhere in text.
     func Match(regexp, text string) bool {
    +    seen := make(map[[2]int]bool)
         if regexp != "" && regexp[0] == '^' {
    -        return matchHere(regexp[1:], text)
    +        return matchHere(regexp[1:], text, seen)
         }
         for {
    -        if matchHere(regexp, text) {
    +        if matchHere(regexp, text, seen) {
                 return true
             }
             if text == "" {
                 return false
             }
             text = text[1:]
         }
     }
     
     // matchHere reports whether regexp matches at beginning of text.
    -func matchHere(regexp, text string) bool {
    +func matchHere(regexp, text string, seen map[[2]int]bool) bool {
         switch {
         case regexp == "":
             return true
         case regexp == "$":
             return text == ""
         case len(regexp) >= 2 && regexp[1] == '*':
    -        return matchStar(regexp[0], regexp[2:], text)
    +        return matchStar(regexp[0], regexp[2:], text, seen)
         case text != "" && (regexp[0] == '.' || regexp[0] == text[0]):
    -        return matchHere(regexp[1:], text[1:])
    +        return matchHere(regexp[1:], text[1:], seen)
         }
         return false
     }
     
     // matchStar reports whether c*regexp matches at beginning of text.
    -func matchStar(c byte, regexp, text string) bool {
    +func matchStar(c byte, regexp, text string, seen map[[2]int]bool) bool {
         for {
    -        if matchHere(regexp, text) {
    +        if seen[[2]int{len(regexp), len(text)}] {
    +            return false
    +        }
    +        if matchHere(regexp, text, seen) {
                 return true
             }
             if text == "" || (text[0] != c && c != '.') {
                 return false
             }
    +        seen[[2]int{len(regexp), len(text)}] = true
             text = text[1:]
         }
     }

Demo: https://go.dev/play/p/aD9vzXwTHGE
rurban 2 years ago

I also revived my old dynamic string matchers recently, which don't need any costly compilation step: https://github.com/rurban/tiny-matcher (in-work)

And with compilation https://github.com/rurban/tiny-regex-c (done)

  • benhoyt 2 years ago

    That's really nice clear code -- thanks for sharing. This would be great for small embedded systems.

    • rurban 2 years ago

      Not yet for embedded, as my variant still allocs. When I have time I'll add the static variants. I'll need it for MQTT on my baremetal firmwares sooner or later.

returningfory2 2 years ago

> I ran benchmarks [...] matching the regex Ben.*H over 100 concatenated repeats of the King James Bible.

Is there a concern with these kinds of micro-benchmarks, where you repeatedly do the same small operation (matching the same regex), that your results will be skewed by the branch predictor, CPU caches, etc.?
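
(For concreteness, such a micro-benchmark in Go's testing framework might look like the sketch below; the `kjv` variable holding the corpus and the post's `Match` function are assumptions on my part:)

    func BenchmarkMatch(b *testing.B) {
        text := strings.Repeat(kjv, 100) // 100 concatenated repeats
        b.SetBytes(int64(len(text)))     // report throughput, not just ns/op
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            Match("Ben.*H", text)
        }
    }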

  • tymscar 2 years ago

    That might be a concern, but if you run the same benchmark, the same number of times, on the same hardware, with the algorithms being the only difference, then you still get a proper comparison between them, even if the timings are useless outside said benchmark.

commandersaki 2 years ago

Excellent post. I remember reading the exegesis years ago but completely forgot about it. Really impressed how well the Go version turned out.

Btw always love your writing Ben, it’s very engaging.

tgv 2 years ago

It's a leftmost-shortest matching backtracker. Not to be used in practice. I'm sure this implementation features in various textbooks.

  • maccard 2 years ago

    So it's doable if you know the solution and have seen it before, but otherwise it's pretty damn difficult

    • samatman 2 years ago

      The issue here is the pathological performance under crafted inputs; I don't believe GP was saying it's easy to write or uninteresting history.

djbusby 2 years ago

Ok. This was educational and fun. Thanks!

adrianmsmith 2 years ago

Good to see the term "regexp" (with a "p") in use. I always used to call it that, then at some point realized everyone was writing "regex", and wondered if I'd spelled the abbreviation wrong the whole time. I guess it's just a function of time: it used to be spelled with a "p" and now mostly isn't.

  • smcl 2 years ago

    I ended up learning it the same, possibly from the Camel Book? One thing I noticed when I wasn't coding alone in my bedroom and had to actually talk to other coders was that "regex" is more natural to say than "regexp", so that might explain why it became the preferred spelling.

  • IshKebab 2 years ago

    Probably because "regex" is much easier to say and looks much more like a real word than "regexp". Like "latex" or "codex". Nothing ends in "exp".

  • rob74 2 years ago

    Guess that time favors shorter abbreviations. I prefer "rex" myself (at least as a variable name for a regex object).

nsonha 2 years ago

I don't see anything "beautiful" here; it works and is performant, but seems pretty average in terms of readability.

  • dev_snd 2 years ago

    I never understood why incrementing a pointer while doing another operation, such as a comparison, is considered good style in C, TBQH.

    I think the go example is much more readable.

    • rswail 2 years ago

      Because C was originally for PDP/DEC equipment, and the instruction set had standard register access modes including both pre- and post-increment.

      So the (for example) strcmp-style loop "while (*s++ == *d++);" made sense as efficient code, because the pointer access and the post-increment effectively compiled down to almost a single instruction.

      https://en.wikipedia.org/wiki/PDP-11_architecture#General_re...

      • im3w1l 2 years ago

        x86 has cmps for doing comparison with postincrement / postdecrement.

        • rswail 2 years ago

          Sure, but the point was that C was designed to implement Unix, and Unix was implemented on DEC machines, and DEC machines had a particular architecture around registers that included pre/post increment, which led to the C *p++ style of iteration.

          • im3w1l 2 years ago

            I wanted to point out that such instructions are in use to this day, rather than being something from a long gone era.

        • azinman2 2 years ago

          Do compilers use that if you break it apart into something more legible? If you don't, does the Intel CPU end up doing the right thing anyway?

          • umanwizard 2 years ago

            Yes, modern compilers should treat them the same.

  • abudabi123 2 years ago

    Emacs has a DSL with the power of regexes and the potential to make the reading experience better than poking toothpicks in your eyeball; experience and familiarity with the syntax sees beauty where others fail to?

  • christophilus 2 years ago

    Of the C code I’ve seen, this is quite nice. It’s decently commented, and very simple. I’d say it’s above average, but that may say more about the codebases I’ve seen than it does about this example.

metadat 2 years ago

I wonder what the non-recursive redesign would look like, without using the simulate-recursion-via-callstack translation technique.

I slept on it and the solution still isn't clear to me.

As an aside: It'd be nice if the go unit tests logged an indication of whether or not the ./matchC exec tests ran.

unixbane 2 years ago

Okay, now post the standard academic version of an elegant, simple regex matcher in ALGOL, Lisp, ML, whatever, and see how many upvotes it gets.

kubb 2 years ago

It's a nicely written function... but it has the difficulty of a warmup interview question or a CS101 homework problem.

  • lupire 2 years ago

    Regexps are usually in an Algorithms class.

    Good code looks deceptively simple.