Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. --
JamieZawinski
For a contrary opinion, and a pointer to a book well worth reading,
http://gilesbowkett.blogspot.com/2011/10/i-had-problem-and-i-decided-to-use.html. --
RobertField
-
- Also known as regex or regexp (RegExp)
RegularExpressions are a fundamental building block of
ComputerScience. In their basic form, the language of
RegularExpressions is formally equivalent to
FiniteAutomata (
FiniteStateMachines), both of the deterministic and nondeterministic flavors. Regexes are also extremely commonly used in Unix, and are basically the heart of the Perl programming language.
They are also a wonderful way to write programs that are
WrongButGoodEnough?.
RegularExpressions are most often incorrect - to do a correct search, you usually need to consider the structure of the file. --
LexSpoon
Note that the
programming language feature known as regular expressions, and the thing from computation theory (the set of grammars which can be recognized by a DFA) are two different things.
RegularExpressionsArent.
I find
RegularExpressions a wonderful way of coming up with programs that are Right, let alone
GoodEnough. In practice,
RegularExpressions are usually capable of handling the structure of the file (in other words, for simple enough structures). Granted, there is often the temptation to use overly simplified regex - but this way lies madness. If it is a relatively sane parsing problem,
RegularExpressions are just the ticket. --
AnonymousDonor
RegularExpressions are surprisingly good at analysing imprecise, free-form or natural-language material. They really suck for structured-text analysis, which I think is what
LexSpoon is talking about.
I guess a good exercise is to try and make a
RegularExpression that matches all SGML/XML tags within text. You have to make sure they are not within a quoted environment, and in searching the end of the tag, you have to take into account comments, quotes and quoted quotes. And that doesn't yet even check the tag is correctly written. The complete
RegularExpression might get many lines long. --
PanuKalliokoski
Actually, I believe complete XML parsing with
RegularExpressions isn't possible, due to limitations inherent to
RegularExpressions. It's been a couple of years since I read a book about 'compiler building' (the new
DragonBook?, I believe it was), and I'd have to crack it open again to fill in the details. (And I believe there is some information on wiki about the three levels of grammars, but I can't find it. Anyhow, only the most restricted of those levels can be parsed with only regexps. And XML isn't in that level).
Matching all tags is not the same as
parsing. XML parsing is indeed impossible with Regular Expressions, as is anything that allows arbitrarily deep nesting (the touchstone of a non-regular language), but that doesn't imply that it is impossible to at least pluck out all tags with a nice, long regular expression. I don't think it's impossible but a standard Finite Automata to do so would be as ugly as sin, because standard Finite Automata do not fare well with rules like "Anything but an A followed by a B"... nest two or three of those within each other and it passes the human ability to read or write with the multiplicity of states, but it is still, technically, an FA. A regular expression can be written because while a correct one must worry about CDATA and commenting, there are still a finite number of permutations of "modes" you can be in, because you can't be in a "CDATA nested in a comment nested in CDATA nested in CDATA".
A few pages on this wiki: [
EditHint, move to bottom?]
See
TextFormattingRegularExpressions for the regular expressions this Wiki uses.
See
RegularExpressionMatchAssertion,
AlternativesToRegularExpressions,
StructuralRegularExpressions,
RegularExpressionExamples.
Other sources of information on RegularExpressions:
Jeffrey E.F. Friedl's book
MasteringRegularExpressions has more information than you probably want to know, including specifics of POSIX
RegularExpressions.
For an online tutorial see:
http://www.regular-expressions.info/about.html
Regular expressions are a broken form of
PrologLanguage. Without cuts!
You can't write a serious program with regular expressions. Even toy programs end up having to reimplement cuts outside of regexp. That's because each additional subexpression explodes the space which has to be backtracked, and the run time ends up being very polynomial.
Concrete example: I have a regular expression made up of 27 distinct subexpressions, and I want to match the whole sucker to a string that contains a
single matching entry. Matching 17 of those 27 subexpressions takes 0.8 seconds. 19 takes 8 seconds. I ran out of patience so I don't even know how long it takes to match 21 out of 27.
Oh, and here's the bad thing. Regular expressions are
also polynomial in the
number of matching entries they're working over. But if you limit it to 3 subexpressions, it's linear. Go figure.
The problem is Perl (or some other PCRE language). Use awk (lol) and/or see http://swtch.com/~rsc/regexp/regexp1.html
Restructuring my program so it CAN make use of cuts, which of course I'll have to implement, will be a pain.
RegularExpressions were introduced by S.C. Kleene to describe the
McCulloch and Pitts 1943 finite automata model of neurons. ("Representation of Events in Nerve Nets", p3-40 in
ClaudeShannon/
JohnMcCarthy "Automata Studies", 1956)
The first application of
RegularExpressions to editor search/replace (in the QED editor) was by
KenThompson, who published a
RegularExpression-to-NFA algorithm in 1968, "Regular Expression Search Algorithm", CACM 11:6, 419-422
KenThompson went on to reimplement this in the Unix
ed editor, which
BillJoy turned into the
vi editor.
KenThompson adapted the
ed code for
grep and
sed. (Some years after its creation, Emacs eventually borrowed the idea of
RegularExpressions, but not the code, directly from these Unix editors -- RMS, private communication)
SteveJohnson (prior to, and building towards, his Unix
yacc tool) and
MikeLesk (in the Unix
lex) did some of the earliest applications of
RegularExpressions to compiler lexical analyzers via automated DFA-building tools.
Awk is a scripting language/command line tool derived directly from this Unix Culture of
RegularExpressions; it is no coincidence that the language most famous for
RegularExpressions today,
perl, was developed in a Unix environment, inspired by
awk and other Unix
RegularExpression tools.
RegularExpressions were thus widespread in Unix tools of all sorts from the beginning, years to decades before this technology was widespread elsewhere (although obviously there were exceptions), and
RegularExpressions have always been an extremely important (albeit under-acknowledged) part of
UnixCulture, contributing to the historical attitudes of
SmugUnixWeenies that other systems
JustDontGetIt.
Pipes have historically been considered to have been the
KillerApp of Unix, but there's a strong argument that it was actually
RegularExpressions. Never heard that, but if its a "strong" argument, I'm convinced
Resources
Netscape JavaScript http://www.evolt.org/article/Regular_Expressions_in_JavaScript/17/36435/ and
http://web.archive.org/web/20030810202454/devedge.netscape.com/library/manuals/2000/javascript/1.5/reference/regexp.html
PerlLanguage usage
Syntax
Tutorial
Operators
Reference
VbScript usage
http://msdn.microsoft.com/library/default.asp?URL=/library/en-us/dnclinic/html/scripting051099.asp
RegExp Coach http://farm.tucows.com/blog/_archives/2004/10/27/167823.html
RegexCoach (same as above?) http://www.weitz.de/regex-coach/ -
Very good! And in Lisp, for all the SmugLispWeenies out there.
Steve Ramsay's Guide to Regular Expressions http://etext.lib.virginia.edu/helpsheets/regex.html
Redet http://billposer.org/Software/redet.html - a regular expression development and execution tool, supporting more than 40 search programs and programming languages.
TclTk regexp visualizer - can highlught what each part of the regexp matches - very useful when learning regexps
I've been known to refer to
RegularExpressions as "
a precise notation for fuzzy matches" or "
a precise notation for expressing fuzzy search logic" when explaining them to people familiar only with "string matching" as their search method.
The concept of inclusion in/exclusion from sets, and a grammar for saying things like "any digit at the beginning of a line, followed by at least 1 but not more than 12 of any non-punctuation character, followed by any combination of alpha-numeric characters joined to a set of non-empty parentheses, said line ending in a semicolon, possibly followed by white space" in a compact and exact notation is not necessarily easily taught, but the look of revelation when one "gets it" is priceless.
As a pedagogical tool for helping people to "get it", that's interesting. I suppose it doesn't matter that it's technically inaccurate to call it fuzzy matching, since it depends on the pattern whether or not one specifies a literally fuzzy match in the technical ZadehLotfi FuzzyLogic sense, since these people won't be familiar with FuzzyLogic now nor in the future?
Just so. I originally hesitated before using "fuzzy" to describe it, but you're right,
FuzzyLogic is well beyond the scope of where they'll go (though I usually throw in a disclaimer that I'm playing fast and loose with the term). The main hurdle for them is usually the use of "any" and "not" in conjunction with "at least" and so on. It's getting them to the point where they realize that they're not looking for precise spellings and not confined by having to examine every occurrence of some word just because they can't remember
exactly what words are around it. It's a little liberty I take -- somewhat in the name of
WittgensteinsLadder -- to get over the concept hump.
I think I just broke a refactoring record. I was rewriting code from a
BadProgrammer and I chopped 127 lines into 7 using a
RegularExpression in
CsharpLanguage :) The Regex parsed an XML file stream and split out 2 elements by name, with values optionally in quotes. The other programmer manually parsed the file, split it into arrays, split those into key/value pairs, foreach-ed through those till he found the ones he wanted, stored them into separates strings, then later recombined them into the final XML. OMG what a nightmare it was, but I will never go back to a non-
RegularExpression language! (I'm a former VB6 junkie)
- RegularExpression was not supported well in MS environments (buggy and slow) before. Why did you not use the Xml parsers to do what you need to do?
MyBrowser,
FireFox 2.0.0.11 -- allows Find in this page, but without
RegularExpressions -- Maybe the
OpenSource contributors can add that.
See also:
AlternativesToRegularExpressions
CategoryRegularExpressions CategoryLanguageFeature