RegularExpression

Last edit August 17, 2014

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. --JamieZawinski

For a contrary opinion, and a pointer to a book well worth reading, http://gilesbowkett.blogspot.com/2011/10/i-had-problem-and-i-decided-to-use.html. -- RobertField

: Also known as regex or regexp (RegExp)

RegularExpressions are a fundamental building block of ComputerScience. In their basic form, the language of RegularExpressions is formally equivalent to FiniteAutomata (FiniteStateMachines), both of the deterministic and nondeterministic flavors. Regexes are also extremely commonly used in Unix, and are basically the heart of the Perl programming language.

They are also a wonderful way to write programs that are WrongButGoodEnough?. RegularExpressions are most often incorrect - to do a correct search, you usually need to consider the structure of the file. -- LexSpoon

Note that the programming language feature known as regular expressions, and the thing from computation theory (the set of grammars which can be recognized by a DFA) are two different things. RegularExpressionsArent.

I find RegularExpressions a wonderful way of coming up with programs that are Right, let alone GoodEnough. In practice, RegularExpressions are usually capable of handling the structure of the file (in other words, for simple enough structures). Granted, there is often the temptation to use overly simplified regex - but this way lies madness. If it is a relatively sane parsing problem, RegularExpressions are just the ticket. -- AnonymousDonor

RegularExpressions are surprisingly good at analysing imprecise, free-form or natural-language material. They really suck for structured-text analysis, which I think is what LexSpoon is talking about.

I guess a good exercise is to try and make a RegularExpression that matches all SGML/XML tags within text. You have to make sure they are not within a quoted environment, and in searching the end of the tag, you have to take into account comments, quotes and quoted quotes. And that doesn't yet even check the tag is correctly written. The complete RegularExpression might get many lines long. -- PanuKalliokoski

Actually, I believe complete XML parsing with RegularExpressions isn't possible, due to limitations inherent to RegularExpressions. It's been a couple of years since I read a book about 'compiler building' (the new DragonBook?, I believe it was), and I'd have to crack it open again to fill in the details. (And I believe there is some information on wiki about the three levels of grammars, but I can't find it. Anyhow, only the most restricted of those levels can be parsed with only regexps. And XML isn't in that level).

Matching all tags is not the same as parsing. XML parsing is indeed impossible with Regular Expressions, as is anything that allows arbitrarily deep nesting (the touchstone of a non-regular language), but that doesn't imply that it is impossible to at least pluck out all tags with a nice, long regular expression. I don't think it's impossible but a standard Finite Automata to do so would be as ugly as sin, because standard Finite Automata do not fare well with rules like "Anything but an A followed by a B"... nest two or three of those within each other and it passes the human ability to read or write with the multiplicity of states, but it is still, technically, an FA. A regular expression can be written because while a correct one must worry about CDATA and commenting, there are still a finite number of permutations of "modes" you can be in, because you can't be in a "CDATA nested in a comment nested in CDATA nested in CDATA".

A few pages on this wiki: [EditHint, move to bottom?]

See TextFormattingRegularExpressions for the regular expressions this Wiki uses.

See RegularExpressionMatchAssertion, AlternativesToRegularExpressions, StructuralRegularExpressions, RegularExpressionExamples.

Other sources of information on RegularExpressions:

Jeffrey E.F. Friedl's book MasteringRegularExpressions has more information than you probably want to know, including specifics of POSIX RegularExpressions.

For an online tutorial see: http://www.regular-expressions.info/about.html

Regular expressions are a broken form of PrologLanguage. Without cuts!

You can't write a serious program with regular expressions. Even toy programs end up having to reimplement cuts outside of regexp. That's because each additional subexpression explodes the space which has to be backtracked, and the run time ends up being very polynomial.

Concrete example: I have a regular expression made up of 27 distinct subexpressions, and I want to match the whole sucker to a string that contains a single matching entry. Matching 17 of those 27 subexpressions takes 0.8 seconds. 19 takes 8 seconds. I ran out of patience so I don't even know how long it takes to match 21 out of 27.

Oh, and here's the bad thing. Regular expressions are also polynomial in the number of matching entries they're working over. But if you limit it to 3 subexpressions, it's linear. Go figure.

The problem is Perl (or some other PCRE language). Use awk (lol) and/or see http://swtch.com/~rsc/regexp/regexp1.html

Restructuring my program so it CAN make use of cuts, which of course I'll have to implement, will be a pain.

RegularExpressions were introduced by S.C. Kleene to describe the McCulloch and Pitts 1943 finite automata model of neurons. ("Representation of Events in Nerve Nets", p3-40 in ClaudeShannon/JohnMcCarthy "Automata Studies", 1956)

The first application of RegularExpressions to editor search/replace (in the QED editor) was by KenThompson, who published a RegularExpression-to-NFA algorithm in 1968, "Regular Expression Search Algorithm", CACM 11:6, 419-422

KenThompson went on to reimplement this in the Unix ed editor, which BillJoy turned into the vi editor. KenThompson adapted the ed code for grep and sed. (Some years after its creation, Emacs eventually borrowed the idea of RegularExpressions, but not the code, directly from these Unix editors -- RMS, private communication)

SteveJohnson (prior to, and building towards, his Unix yacc tool) and MikeLesk (in the Unix lex) did some of the earliest applications of RegularExpressions to compiler lexical analyzers via automated DFA-building tools.

Awk is a scripting language/command line tool derived directly from this Unix Culture of RegularExpressions; it is no coincidence that the language most famous for RegularExpressions today, perl, was developed in a Unix environment, inspired by awk and other Unix RegularExpression tools.

RegularExpressions were thus widespread in Unix tools of all sorts from the beginning, years to decades before this technology was widespread elsewhere (although obviously there were exceptions), and RegularExpressions have always been an extremely important (albeit under-acknowledged) part of UnixCulture, contributing to the historical attitudes of SmugUnixWeenies that other systems JustDontGetIt.

Pipes have historically been considered to have been the KillerApp of Unix, but there's a strong argument that it was actually RegularExpressions. Never heard that, but if its a "strong" argument, I'm convinced

Resources

Netscape JavaScript http://www.evolt.org/article/Regular_Expressions_in_JavaScript/17/36435/ and http://web.archive.org/web/20030810202454/devedge.netscape.com/library/manuals/2000/javascript/1.5/reference/regexp.html

PerlLanguage usage

Syntax

http://search.cpan.org/dist/perl/pod/perlre.pod

Tutorial

http://search.cpan.org/~dapm/perl-5.10.1/pod/perlretut.pod

Operators

http://search.cpan.org/~dapm/perl-5.10.1/pod/perlop.pod

Reference

http://search.cpan.org/~dapm/perl-5.10.1/pod/perlreref.pod

VbScript usage

http://msdn.microsoft.com/library/default.asp?URL=/library/en-us/dnclinic/html/scripting051099.asp

RegExp Coach http://farm.tucows.com/blog/_archives/2004/10/27/167823.html

RegexCoach (same as above?) http://www.weitz.de/regex-coach/ - Very good! And in Lisp, for all the SmugLispWeenies out there.

Steve Ramsay's Guide to Regular Expressions http://etext.lib.virginia.edu/helpsheets/regex.html

Redet http://billposer.org/Software/redet.html - a regular expression development and execution tool, supporting more than 40 search programs and programming languages.

TclTk regexp visualizer - can highlught what each part of the regexp matches - very useful when learning regexps

http://www.doulos.com/knowhow/tcltk/examples/trev/

I've been known to refer to RegularExpressions as "a precise notation for fuzzy matches" or "a precise notation for expressing fuzzy search logic" when explaining them to people familiar only with "string matching" as their search method.

The concept of inclusion in/exclusion from sets, and a grammar for saying things like "any digit at the beginning of a line, followed by at least 1 but not more than 12 of any non-punctuation character, followed by any combination of alpha-numeric characters joined to a set of non-empty parentheses, said line ending in a semicolon, possibly followed by white space" in a compact and exact notation is not necessarily easily taught, but the look of revelation when one "gets it" is priceless.

As a pedagogical tool for helping people to "get it", that's interesting. I suppose it doesn't matter that it's technically inaccurate to call it fuzzy matching, since it depends on the pattern whether or not one specifies a literally fuzzy match in the technical ZadehLotfi FuzzyLogic sense, since these people won't be familiar with FuzzyLogic now nor in the future?

Just so. I originally hesitated before using "fuzzy" to describe it, but you're right, FuzzyLogic is well beyond the scope of where they'll go (though I usually throw in a disclaimer that I'm playing fast and loose with the term). The main hurdle for them is usually the use of "any" and "not" in conjunction with "at least" and so on. It's getting them to the point where they realize that they're not looking for precise spellings and not confined by having to examine every occurrence of some word just because they can't remember exactly what words are around it. It's a little liberty I take -- somewhat in the name of WittgensteinsLadder -- to get over the concept hump.

I think I just broke a refactoring record. I was rewriting code from a BadProgrammer and I chopped 127 lines into 7 using a RegularExpression in CsharpLanguage :) The Regex parsed an XML file stream and split out 2 elements by name, with values optionally in quotes. The other programmer manually parsed the file, split it into arrays, split those into key/value pairs, foreach-ed through those till he found the ones he wanted, stored them into separates strings, then later recombined them into the final XML. OMG what a nightmare it was, but I will never go back to a non-RegularExpression language! (I'm a former VB6 junkie)

RegularExpression was not supported well in MS environments (buggy and slow) before. Why did you not use the Xml parsers to do what you need to do?

MyBrowser, FireFox 2.0.0.11 -- allows Find in this page, but without RegularExpressions -- Maybe the OpenSource contributors can add that.

CategoryRegularExpressions CategoryLanguageFeature