In my previous post I dug up messages someone wrote almost 10 years ago, and in some places challenged what they wrote. In writing the post, I mostly wanted to document the thread that created the somewhat-famous “now you have two problems” quote, and the challenges and rebuttals I included are a bit unfair of me (coming 10 years later, with little chance that the author would care to engage now in a debate about decade-old writings and technology).
So, with that in mind I'd like to refute a comment made yesterday on the Slashdot review of my book:
“Nondeterministic finite automata” is well defined in comp-sci and Friedl has it wrong. The set of languages accepted by NFA's is exactly the same as the set accepted by DFA's.
Perl's engine and its brethren use search-and-backtrack. They accept a lot of things that are not regular expressions. Such engines don't have much theory behind them, and it's hard to reason generally about what they do and how fast they do it.
In the context of what my book covers, this is unreasonable on many levels, but I won't respond to it now; I'll respond to it 10 years ago. I wrote the following in 1997, in a post on comp.lang.perl.misc:
|> Perhaps it's now too late, and we need to find a new
|> term for what we used to call regular expressions.
Perhaps it's best to differentiate between the calculus phrase “regular expression”, coined by Dr. Stephen Kleene in the 1950s, and the use of the same term in modern computer science. There are many situations in language where the meanings of words have changed. Some examples:
- When I was growing up, “How are you doing?” was not a statement.
- A Japanese futon has no relationship to what is called a “futon” in American English.
- Tennis shoes are rarely used for tennis.
- The little eraser-head pointing devices on laptops are called mice.
Ah, well, I could go on forever. But basically, the term “regular expression” has no implied meaning to most people that first come across it in Perl or whatnot, so little confusion arises. It would have been nice had some other word been used from the start (perhaps one giving more of a hint as to what the thing is all about), but for the most part we're stuck with history. I suppose you could start a crusade to use a different phrase. How about 'whachamacallit'? 🙂
Jeffrey
Whoever wrote the comment also wrote a similar comment in March, and believes that the reality of how things work is somehow dictated by the name. I tend to believe that the reality of how things work is dictated by, well, reality, often despite the name. (Names/functions of government offices, like “Internal Revenue Service”, are an excellent example of this. 🙂)
My book is concerned with the reality of how things work, and I make it clear in the book that the reality is different from what the linguistic historical baggage might imply to some small subset of people.
To continue this post's special theme of bringing up 10-year-old writings to rebut current ignorance, here's the sidebar on page 104 of the first edition of my book (the poster having said that he has the first edition), the master source having been last updated on November 4, 1996:
(A similar sidebar appears on page 180 of later editions)
The true mathematical and computational meaning of “NFA” is different from what is commonly called an “NFA regex engine.” In theory, NFA and DFA engines should match exactly the same text and have exactly the same features. In practice, the desire for richer, more expressive regular expressions has caused their semantics to diverge. We'll see several examples later in this chapter, but one right off the top is support for backreferences.
As a programmer, if you have a true (mathematically speaking) NFA regex engine, it is a relatively small task to add support for backreferences. A DFA engine's design precludes the adding of this support, but an NFA's common implementation makes it trivial. In doing so, you create a more powerful tool, but you also make it decidedly nonregular (mathematically speaking). What does this mean? At most, that you should probably stop calling it an NFA, and start using the phrase “nonregular expressions,” since that describes (mathematically speaking) the new situation. No one has actually done this, so the name “NFA” has lingered, even though the implementation is no longer (mathematically speaking) an NFA.
What does all this mean to you, as a user? Absolutely nothing. As a user, you don't care if it's regular, nonregular, unregular, irregular, or incontinent. So long as you know what you can expect from it (something this chapter will show you), you know all you need to care about.
For those wishing to learn more about the theory of regular expressions, the classic computer-science text is chapter 3 of Aho, Sethi, and Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley, 1986), commonly called “The Dragon Book” due to the cover design. More specifically, this is the “red dragon”. The “green dragon” is its predecessor, Aho and Ullman's Principles of Compiler Design.
In particular, that third paragraph really sums it up.
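To make the backreference point concrete, here's a quick sketch (added for this post; it is not from the book): the doubled-word pattern below uses \1 to demand a repeat of whatever the first group captured, something no regular expression in the strict mathematical sense can describe, yet it's trivial for a backtracking implementation.

    # Sketch: a backreference takes the pattern outside the mathematically
    # "regular" languages, but Perl handles it in one line.
    my $text = "Paris in the the spring";
    if ( $text =~ /\b(\w+)\s+\1\b/ ) {    # \1 must repeat what group 1 captured
        print "doubled word: $1\n";       # prints "doubled word: the"
    }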
(Okay, all done, I'll go back to posts about photography and Anthony...)
Interesting words from one of the Perl 6 documents:
[quote]
This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regex rather than “regular expressions” because they haven’t been regular expressions for a long time,
[/quote]
If you want to read more, see:
http://dev.perl.org/perl6/doc/design/syn/S05.html
Forum thread in which the documentation is mentioned:
http://forum.java.sun.com/thread.jspa?threadID=768745
I don’t know whether it is good news or bad news.
Man, was that a boring post. Remember that some of your readers have half of your IQ. Please go back to fun with Anthony. 🙂
Some people, confronted with a problem of language abuse, say “I know what, I’ll teach them the facts.” Then they have two problems.
Hmmmm…
http://swtch.com/~rsc/regexp/regexp1.html
NFA, DFA, pain. The name of the engine does not matter. Every engine implementation can have a different name, and compromises between strict NFA, strict DFA, strict hybrid FA, history-based FA, non-amnesia FA… Quantum FA coming in 70 years’ time. The family of FAs is growing very fast.
http://swtch.com/~rsc/regexp/regexp1.html is stupidly based on a corner case that any formal regex analyser would solve in microseconds: (a?){20}a{20} is better written as a{20}(a?){20} (when the NFA meets the DFA…). The backtracking explosion is gone. There is also a load of formal negated logic which can be applied to reduce similar linear and quadratic explosions (even without using assertions). Similarly, a lot of backtracking is prevented in engines which treat | as an xor, i.e. exclusive alternation, or which can scan backwards when necessary.
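Here is a hedged sketch of that corner case and the reordering, written out literally (a? twenty times, then a twenty times) so it does not depend on how any particular engine handles counted groups like (a?){20}. Exact timings vary by engine and version, but on a purely backtracking search the first form has to wade through an exponential number of ways to distribute the optional a's before it finally matches, while the reordered form matches on its first attempt:

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $n       = 20;
    my $subject = "a" x $n;

    my @patterns = (
        ("a?" x $n) . ("a" x $n),   # optional a's first: expensive for a pure backtracker
        ("a" x $n) . ("a?" x $n),   # mandatory a's first: matches on the first attempt
    );

    for my $re (@patterns) {
        my $start = time();
        my $ok    = $subject =~ /^$re$/ ? "matched" : "failed";
        printf "%s  %.3f seconds\n", $ok, time() - $start;
    }

On a backtracking engine the gap between the two forms grows rapidly as the count is raised.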
One could even implement a few fail transitions in an NFA and then rebrand it a more-deterministic-less-stupid NFA. That would be a good compromise. Would any user care about it?
Jeffrey, you are right against the theorists and Taliban of the FA classification frozen in the ’60s (half a century ago).
Pierre misses the point.
It’s not that the regular expression that Russ Cox used in his article cannot be transformed (it can be, and trivially at that). Rather, the point is that when using NDFAs and DFAs, the type of exponential blow up that happens in Perl’s backtracking implementation doesn’t happen. Indeed, it cannot happen.
Jeffrey is right in that the end user shouldn’t necessarily have to be aware of the underlying theory in order to use the implementation (though it couldn’t hurt). However, that doesn’t mean that it’s okay to abuse the theory by calling it something that it most definitely is not. Words have meaning, terms have definitions, and just because you choose to ignore them doesn’t make you correct. Millions of people say “ain’t” every day; that doesn’t make it proper English.
The fact is that, as much as practitioners may abuse them, these terms have formal definitions. Those definitions don’t need to, nor will they, change just because Perl screwed it up by making regular expressions not regular anymore. It’s really not the computer scientists’ fault. It’s Larry Wall’s (or maybe Henry Spencer’s) fault for not understanding the theory in the first place.
If anything, this causes greater confusion because someone who’s curious about the underlying theory may go read up on it, and then not understand why their code doesn’t work the same way. This contributes to the illusion that the theory is impenetrable and irrelevant. It may also cause a programmer to charge off, confidently assuming that his or her usage of regular expressions is correct and guaranteed to run in time linear in the size of the string being parsed with bounded memory usage, only to find that the thing actually blows up.
A much more sensible approach would be to just call a spade a spade: Perl et al. don’t use NFAs. They really don’t, despite what people keep insisting. So simply say what they do do: they use backtracking, which is subject to exponential run times and memory consumption. There’s nothing wrong with that, as long as people know that’s what is happening and plan accordingly. So just tell them, so that they understand the potential consequences and can make better use of the tools the language provides.
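As one illustration of what “plan accordingly” can look like (a sketch added here, not the commenter's): the ambiguous pattern below can, when the overall match fails, backtrack through an enormous number of equivalent ways to split a run of word characters between \w+ and the enclosing *, while the atomic group (?>...) refuses to revisit each chunk and so avoids those retries. Atomic grouping can change what a pattern matches in general, so it should only be applied where the discarded alternatives really are redundant.

    use strict;
    use warnings;

    # Kept short so even the ambiguous form finishes instantly; lengthening
    # $line makes its failing match far more expensive on a purely
    # backtracking engine, while the atomic form stays cheap.
    my $line = ("word " x 4) . "!";      # the trailing "!" forces overall failure

    my $ambiguous = qr/^(\w+\s?)*$/;     # \w+ and the outer * can split runs many ways
    my $atomic    = qr/^(?>\w+\s?)*$/;   # (?>...) will not backtrack into a chunk

    for my $re ($ambiguous, $atomic) {
        print $line =~ $re ? "match\n" : "no match\n";
    }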
The problem with all the confusion documented above is that the users of “almost-regular expressions” (AREs) do, or at least should, sometimes care about efficiency; good algorithms and good advice will give them a chance of meeting their efficiency needs. Computer scientists who have worked (and still work—this research did not stop in the 1980s) to figure out efficient algorithms for deciding whether a string matches an ARE have also given performance guarantees along with these algorithms, particularized by properties of the specific ARE instance. If the user needs a hard guarantee of not-horrible performance, they need to know that they should use a matcher that can use a non-backtracking algorithm, and need to know what properties of an ARE will ensure a non-backtracking match. Fortunately, most AREs can be matched in a non-backtracking way (AREs containing backreferences being the most prominent exception).

If the user is in a situation where it is OK for the matcher to simply run out of time or memory and fail, they can use whatever matcher they desire—and indeed, most backtracking matchers perform acceptably for most AREs and input strings. The problem is that an ARE that works perfectly fine with a backtracking matcher on most input strings may well suddenly blow up in time or space when presented with just the “wrong” one. The user who hasn’t anticipated and planned for this possibility will find themselves in a bad situation.

CS researchers now know how to build matchers that perform at least as well as backtracking matchers on all inputs, that implement the same semantics as backtracking matchers, and that run in time proportional only to the length of the input string for a wide and predictable variety of AREs. These matchers are not terribly difficult to build, and source code is available. It would behoove those choosing ARE matchers to choose that kind: there is no downside. Users of ARE-based software that care about performance need to be aware of which matching algorithm is being used, and to control what sort of ARE and/or input string is fed to the matcher so that they get the performance guarantees they need.
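To make that last paragraph a little more tangible, here is a rough sketch (added here, and hard-coded to the a?…a?a…a pattern discussed in earlier comments rather than compiled from a regex string) of how such a matcher works: instead of trying one alternative at a time and backtracking, it advances a whole set of NFA states across the input, so the work per character is bounded by the number of states and the total time grows only linearly with the input length.

    use strict;
    use warnings;

    # State $i means "the first $i pattern elements have been consumed".
    # Elements 1..$n are optional a's (a?); elements $n+1..2*$n are mandatory a's.
    my $n      = 25;
    my $input  = "a" x $n;
    my $accept = 2 * $n;                        # every pattern element consumed

    # Epsilon-closure of the start state: any leading optional a may be skipped.
    my %active = map { $_ => 1 } 0 .. $n;

    for my $ch ( split //, $input ) {
        my %next;
        for my $s ( keys %active ) {
            next if $s >= $accept;              # nothing left to consume
            $next{ $s + 1 } = 1 if $ch eq 'a';  # both a? and a consume one 'a'
        }
        # Epsilon-closure again: optional a's (states below $n) may be skipped.
        my @work = keys %next;
        while (@work) {
            my $s = pop @work;
            if ( $s < $n && !$next{ $s + 1 } ) {
                $next{ $s + 1 } = 1;
                push @work, $s + 1;
            }
        }
        %active = %next;
        last unless %active;                    # no live states left: cannot match
    }

    print exists $active{$accept} ? "match\n" : "no match\n";

Real implementations (Russ Cox's article linked above walks through one) build the state machine from an arbitrary regex instead of hard-coding it, but the per-character bookkeeping is the same, and that is why the pathological cases above simply do not arise for them (backreferences excepted, as noted).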
Efficiency is a running theme through the whole book, because, as I wrote in the quote from the book cited above, “So long as you know what you can expect from it (something this chapter will show you), you know all you need to care about.” There are plenty of cases for which an NFA that uses backtracking is O(N) in the length of the string, and cases, too, where changing one character of the regex turns it into a never-ending computational bomb. Knowing what you can expect from a regex engine encompasses all that. —Jeffrey