Mastering Regular Expressions
Third Edition, August 2006
By Jeffrey Friedl

Full Table of Contents

xvii     Preface
xvii        The Need for This Book
xviii       Intended Audience
xix         How to Read This Book
xix         Organization
xx                 The Introduction
xx                 The Details
xxi                Tool-Specific Information
xxi         Typographical Conventions
xxii        Exercises
xxiii       Links, Code, Errata, and Contacts
xxiii              Safari® Enabled
xxiv        Personal Comments and Acknowledgments

1        Chapter 1   Introduction to Regular Expressions
2           Solving Real Problems
4           Regular Expressions as a Language
4                  The Filename Analogy
5                  The Language Analogy
6                            The goal of this book
6           The Regular-Expression Frame of Mind
6                  If You Have Some Regular-Expression Experience
6                  Searching Text Files: Egrep
8           Egrep Metacharacters
8                  Start and End of the Line
9                  Character Classes
9                            Matching any one of several characters
10                           Negated character classes
11                 Matching Any Character with Dot
13                 Alternation
13                           Matching any one of several subexpressions
14                 Ignoring Differences in Capitalization
15                 Word Boundaries
16                 In a Nutshell
17                 Optional Items
18                 Other Quantifiers: Repetition
20                           Defined range of matches: intervals
20                 Parentheses and Backreferences
22                 The Great Escape
23          Expanding the Foundation
23                 Linguistic Diversification
23                 The Goal of a Regular Expression
23                 A Few More Examples
24                           Variable names
24                           A string within double quotes
24                           Dollar amount (with optional cents)
25                           An HTTP/HTML URL
26                           An HTML tag
26                           Time of day, such as “9:17 am” or “12:30 pm”
27                 Regular Expression Nomenclature
27                           Regex
27                           Matching
27                           Metacharacter
27                           Flavor
29                           Subexpression
29                           Character
30                 Improving on the Status Quo
32                 Summary
33          Personal Glimpses

35       Chapter 2   Extended Introductory Examples
36          About the Examples
37                 A Short Introduction to Perl
38          Matching Text with Regular Expressions
40                 Toward a More Real-World Example
40                 Side Effects of a Successful Match
43                 Intertwined Regular Expressions
44                           A short aside -- metacharacters galore
47                           Generic “whitespace” with \s
49                 Intermission
50          Modifying Text with Regular Expressions
50                 Example: Form Letter
51                 Example: Prettifying a Stock Price
53                 Automated Editing
53                 A Small Mail Utility
58                           Real-world problems, real-world solutions
59                           The “real” real world
59                 Adding Commas to a Number with Lookaround
60                           Lookaround doesn't “consume” text
61                           A few more lookahead examples
64                           Back to the comma example...
65                           Word boundaries and negative lookaround
67                           Commafication without lookbehind
67                 Text-to-HTML Conversion
68                           Cooking special characters
69                           Separating paragraphs
70                           “Linkizing” an email address
74                           “Linkizing” an HTTP URL
77                 That Doubled-Word Thing
80                           Moving bits around: operators, functions, and objects

83       Chapter 3   Overview of Regular Expression Features and Flavors
85          A Casual Stroll Across the Regex Landscape
85                 The Origins of Regular Expressions
86                           Grep's metacharacters
86                           Grep evolves
86                           Egrep evolves
87                           Other species evolve
87                           POSIX -- An attempt at standardization
88                           Henry Spencer's regex package
88                           Perl evolves
90                           A partial consolidation of flavors
91                           Versions as of this book
91                 At a Glance
93          Care and Handling of Regular Expressions
94                 Integrated Handling
95                 Procedural and Object-Oriented Handling
95                           Regex handling in Java
96                           Regex handling in VB and other .NET languages
97                           Regex handling in PHP
97                           Regex handling in Python
97                           Why do approaches differ?
98                 A Search-and-Replace Example
98                           Search and replace in Java
99                           Search and replace in VB.NET
99                           Search and replace in PHP
100                Search and Replace in Other Languages
100                          Awk
100                          Tcl
100                          GNU Emacs
101                Care and Handling: Summary
101         Strings, Character Encodings, and Modes
101                Strings as Regular Expressions
102                          Strings in Java
103                          Strings in VB.NET
103                          Strings in C#
103                          Strings in PHP
104                          Strings in Python
104                          Strings in Tcl
105                          Regex literals in Perl
105                Character-Encoding Issues
106                          Richness of encoding-related support
106                Unicode
107                          Characters versus combining-character sequences
108                          Multiple code points for the same character
109                          Unicode 3.1+ and code points beyond U+FFFF
109                          Unicode line terminator
110                Regex Modes and Match Modes
110                          Case-insensitive match mode
111                          Free-spacing and comments regex mode
111                          Dot-matches-all match mode (a.k.a., “single-line mode”)
112                          Enhanced line-anchor match mode (a.k.a., “multiline mode”)
113                          Literal-text regex mode
113         Common Metacharacters and Features
115                Character Representations
115                          Character shorthands
115                          These are machine dependent?
116                          Octal escape -- \num
117                          Hex and Unicode escapes: \xnum, \x{num}, \unum, \Unum, ...
117                          Control characters: \cchar
118                Character Classes and Class-Like Constructs
118                          Normal classes: [a-z] and [^a-z]
119                          Almost any character: dot
120                          Exactly one byte
120                          Unicode combining character sequence: \X
120                          Class shorthands: \w, \d, \s, \W, \D, \S
121                          Unicode properties, scripts, and blocks: \p{Prop}, \P{Prop}
125                          Simple class subtraction: [[a-z]-[aeiou]]
125                          Full class set operations: [[a-z]&&[^aeiou]]
127                          POSIX bracket-expression “character class”: [[:alpha:]]
128                          POSIX bracket-expression “collating sequences”: [[.span-ll.]]
128                          POSIX bracket-expression “character equivalents”: [[=n=]]
128                          Emacs syntax classes
129                Anchors and Other “Zero-Width Assertions”
129                          Start of line/string: ^, \A
129                          End of line/string: $, \Z, \z
130                          Start of match (or end of previous match): \G
133                          Word boundaries: \b, \B, \<, \>, ...
133                          Lookahead (?=...), (?!...); Lookbehind, (?<=...), (?<!...)
135                Comments and Mode Modifiers
135                          Mode modifier: (?modifier), such as (?i) or (?-i)
135                          Mode-modified span: (?modifier:...), such as (?i:...)
136                          Comments: (?#...) and #...
136                          Literal-text span: \Q...\E
137                Grouping, Capturing, Conditionals, and Control
137                          Capturing/Grouping Parentheses: (...) and \1, \2, ...
137                          Grouping-only parentheses: (?:...)
138                          Named capture: (?<Name>...)
139                          Atomic grouping: (?>...)
139                          Alternation: ...|...|...
140                          Conditional: (?if then|else)
141                          Greedy quantifiers: *, +, ?, {num,num}
141                          Lazy quantifiers: *?, +?, ??, {num,num}?
142                          Possessive quantifiers: *+, ++, ?+, {num,num}+
142         Guide to the Advanced Chapters

143      Chapter 4   The Mechanics of Expression Processing
143         Start Your Engines!
144                Two Kinds of Engines
144                New Standards
144                          The impact of standards
145                Regex Engine Types
146                From the Department of Redundancy Department
146                Testing the Engine Type
146                          Traditional NFA or not?
147                          DFA or POSIX NFA?
147         Match Basics
147                About the Examples
148                Rule 1: The Match That Begins Earliest Wins
148                          The “transmission” and the bump-along
149                Engine Pieces and Parts
150                          No “electric” parentheses, backreferences, or lazy quantifiers
151                Rule 2: The Standard Quantifiers Are Greedy
151                          A subjective example
152                          Being too greedy
153                          First come, first served
153                          Getting down to the details
153         Regex-Directed Versus Text-Directed
153                NFA Engine: Regex-Directed
155                          The control benefits of an NFA engine
155                DFA Engine: Text-Directed
156                First Thoughts: NFA and DFA in Comparison
156                          Consequences to us as users
157         Backtracking
158                A Really Crummy Analogy
158                          A crummy little example
159                Two Important Points on Backtracking
159                Saved States
160                          A match without backtracking
160                          A match after backtracking
160                          A non-match
161                          A lazy match
162                Backtracking and Greediness
162                          Star, plus, and their backtracking
162                          Revisiting a fuller example
163         More About Greediness and Backtracking
164                Problems of Greediness
165                Multi-Character “Quotes”
166                Using Lazy Quantifiers
167                Greediness and Laziness Always Favor a Match
168                The Essence of Greediness, Laziness, and Backtracking
169                Possessive Quantifiers and Atomic Grouping
170                          Atomic grouping with (?>...)
172                Possessive Quantifiers, ?+, *+, ++, and {m,n}+
173                The Backtracking of Lookaround
174                          Mimicking atomic grouping with positive lookahead
174                Is Alternation Greedy?
175                Taking Advantage of Ordered Alternation
176                          Ordered alternation pitfalls
177         NFA, DFA, and POSIX
177                “The Longest-Leftmost”
177                          Really, the longest
178                POSIX and the Longest-Leftmost Rule
179                Speed and Efficiency
179                          DFA efficiency
180                Summary: NFA and DFA in Comparison
180                          DFA versus NFA: Differences in the pre-use compile
181                          DFA versus NFA: Differences in match speed
181                          DFA versus NFA: Differences in what is matched
182                          DFA versus NFA: Differences in capabilities
183                          DFA versus NFA: Differences in ease of implementation
183         Summary

185      Chapter 5   Practical Regex Techniques
186         Regex Balancing Act
186         A Few Short Examples
186                Continuing with Continuation Lines
187                Matching an IP Address
189                          Know your context
190                Working with Filenames
190                          Removing the leading path from a filename
191                          Accessing the filename from a path
192                          Both leading path and filename
193                Matching Balanced Sets of Parentheses
194                Watching Out for Unwanted Matches
196                Matching Delimited Text
196                          Allowing escaped quotes in double-quoted strings
198                Knowing Your Data and Making Assumptions
199                Stripping Leading and Trailing Whitespace
200         HTML-Related Examples
200                Matching an HTML Tag
201                Matching an HTML Link
203                Examining an HTTP URL
203                Validating a Hostname
206                Plucking Out a URL in the Real World
208         Extended Examples
209                Keeping in Sync with Your Data
210                          Keeping the match in sync with expectations
211                          Maintaining sync after a non-match as well
212                          Maintaining sync with \G
212                          This example in perspective
213                Parsing CSV Files
215                          Distrusting the bump-along
218                          One change for the sake of efficiency
218                          Other CSV formats

221      Chapter 6   Crafting an Efficient Expression
222         A Sobering Example
223                A Simple Change -- Placing Your Best Foot Forward
223                Efficiency Versus Correctness
225                Advancing Further -- Localizing the Greediness
226                Reality Check
226                          “Exponential” matches
228         A Global View of Backtracking
229                More Work for a POSIX NFA
230                Work Required During a Non-Match
231                Being More Specific
231                Alternation Can Be Expensive
232         Benchmarking
234                Know What You're Measuring
234                Benchmarking with PHP
235                Benchmarking with Java
237                Benchmarking with VB.NET
238                Benchmarking with Ruby
238                Benchmarking with Python
239                Benchmarking with Tcl
240         Common Optimizations
240                No Free Lunch
241                Everyone's Lunch is Different
241                The Mechanics of Regex Application
242                Pre-Application Optimizations
242                          Compile caching
245                          Pre-check of required character/substring optimization
245                          Length-cognizance optimization
246                Optimizations with the Transmission
246                          Start of string/line anchor optimization
246                          Implicit-anchor optimization
246                          End of string/line anchor optimization
247                          Initial character/class/substring discrimination optimization
247                          Embedded literal string check optimization
247                          Length-cognizance transmission optimization
247                Optimizations of the Regex Itself
247                          Literal string concatenation optimization
247                          Simple quantifier optimization
248                          Needless parentheses elimination
248                          Needless character class elimination
248                          Character following lazy quantifier optimization
249                          “Excessive” backtracking detection
250                          Exponential (a.k.a., super-linear) short-circuiting
250                          State-suppression with possessive quantifiers
251                          Small quantifier equivalence
252                          Need cognizance
252         Techniques for Faster Expressions
254                Common Sense Techniques
254                          Avoid recompiling
254                          Use non-capturing parentheses
254                          Don't add superfluous parentheses
254                          Don't use superfluous character classes
255                          Use leading anchors
255                Expose Literal Text
255                          “Factor out” required components from quantifiers
255                          “Factor out” required components from the front of alternation
256                Expose Anchors
256                          Expose ^ and \G at the front of expressions
256                          Expose $ at the end of expressions
256                Lazy Versus Greedy: Be Specific
257                Split Into Multiple Regular Expressions
258                Mimic Initial-Character Discrimination
259                          Don't do this with Tcl
259                          Don't do this with PHP
259                Use Atomic Grouping and Possessive Quantifiers
260                Lead the Engine to a Match
260                          Put the most likely alternative first
261                          Distribute into the end of alternation
261         Unrolling the Loop
262                Method 1: Building a Regex From Past Experiences
263                          Constructing a general “unrolling-the-loop” pattern
264                The Real “Unrolling-the-Loop” Pattern
264                          Avoiding the neverending match
266                          General things to look out for
266                Method 2: A Top-Down View
267                Method 3: An Internet Hostname
268                Observations
268                Using Atomic Grouping and Possessive Quantifiers
269                          Making a neverending match safe with possessive quantifiers
269                          Making a neverending match safe with atomic grouping
270                Short Unrolling Examples
270                          Unrolling “multi-character” quotes
270                          Unrolling the continuation-line example
271                          Unrolling the CSV regex
272                Unrolling C Comments
272                          To unroll or to not unroll...
273                          A direct approach
274                          Making it work
275                          Unrolling the C loop
277         The Freeflowing Regex
277                A Helping Hand to Guide the Match
279                A Well-Guided Regex is a Fast Regex
281                Wrapup
281         In Summary: Think!

283      Chapter 7   Perl
285         Regular Expressions as a Language Component
286                Perl's Greatest Strength
286                Perl's Greatest Weakness
286         Perl's Regex Flavor
288                Regex Operands and Regex Literals
289                          Features supported by regex literals
291                          Picking your own regex delimiters
292                How Regex Literals Are Parsed
292                Regex Modifiers
293         Regex-Related Perlisms
294                Expression Context
294                          Contorting an expression
295                Dynamic Scope and Regex Match Effects
295                          Global and private variables
295                          Dynamically scoped values
298                          A better analogy: clear transparencies
298                          Regex side effects and dynamic scoping
299                          Dynamic scoping versus lexical scoping
299                Special Variables Modified by a Match
303                          Using $1 within a regex?
303         The qr/.../ Operator and Regex Objects
303                Building and Using Regex Objects
304                          Match modes (or lack thereof) are very sticky
305                Viewing Regex Objects
306                Using Regex Objects for Efficiency
306         The Match Operator
307                Match's Regex Operand
307                          Using a regex literal
307                          Using a regex object
308                          The default regex
308                          Special match-once ?...?
308                Specifying the Match Target Operand
308                          The default target
309                          Negating the sense of the match
309                Different Uses of the Match Operator
310                          Normal “does this match?” -- scalar context without /g
310                          Normal “pluck data from a string” -- list context, without /g
311                          “Pluck all matches” -- list context, with the /g modifier
312                Iterative Matching: Scalar Context, with /g
313                          The “current match location” and the pos() function
314                          Pre-setting a string's pos
315                          Using \G
315                          “Tag-team” matching with /gc
316                          Pos-related summary
316                The Match Operator's Environmental Relations
317                          The match operator's side effects
317                          Outside influences on the match operator
318                          Keeping your mind in context (and context in mind)
318         The Substitution Operator
319                The Replacement Operand
319                The /e Modifier
320                          Multiple uses of /e
321                Context and Return Value
321         The Split Operator
322                Basic Split
322                          Basic match operand
322                          Target string operand
323                          Basic chunk-limit operand
323                          Advanced split
324                Returning Empty Elements
324                          Trailing empty elements
324                          The chunk-limit operand's second job
324                          Special matches at the ends of the string
325                Split's Special Regex Operands
325                          Split has no side effects
326                Split's Match Operand with Capturing Parentheses
326         Fun with Perl Enhancements
328                Using a Dynamic Regex to Match Nested Pairs
331                Using the Embedded-Code Construct
331                          Using embedded code to display match-time information
332                          Using embedded code to see all matches
334                          Finding the longest match
335                          Finding the longest-leftmost match
335                Using local in an Embedded-Code Construct
338                A Warning About Embedded Code and my Variables
340                Matching Nested Constructs with Embedded Code
341                Overloading Regex Literals
341                          Adding start- and end-of-word metacharacters
343                          Adding support for possessive quantifiers
344                Problems with Regex-Literal Overloading
344                Mimicking Named Capture
347         Perl Efficiency Issues
348                “There's More Than One Way to Do It”
348                Regex Compilation, the /o Modifier, qr/.../, and Efficiency
350                          The internal mechanics of preparing a regex
350                          Perl steps to reduce regex compilation
352                          The “compile once” /o modifier
353                          Using regex objects for efficiency
354                          Using the default regex for efficiency
355                Understanding the “Pre-Match” Copy
355                          Pre-match copy supports $1, $&, $', $+, ...
355                          The pre-match copy is not always needed
356                          How expensive is the pre-match copy?
357                          Avoiding the pre-match copy
359                The Study Function
359                          When not to use study
360                          When study can help
360                Benchmarking
361                Regex Debugging Information
362                          Run-time debugging information
363                          Other ways to invoke debugging messages
363         Final Comments

365      Chapter 8   Java
366         Java's Regex Flavor
369                Java Support for \p{...} and \P{...}
369                          Unicode properties
369                          Unicode blocks
369                          Special Java character properties
370                Unicode Line Terminators
371         Using java.util.regex
372         The Pattern.compile() Factory
373                Pattern's matcher method
373         The Matcher Object
375                Applying the Regex
376                Querying Match Results
378                          Match-result example
378                Simple Search and Replace
379                          Simple search and replace examples
380                          The replacement argument
380                Advanced Search and Replace
381                          Search-and-replace examples
382                In-Place Search and Replace
383                          Using a different-sized replacement
384                The Matcher's Region
385                          Points to keep in mind
386                          Setting and inspecting region bounds
386                          Looking outside the current region
387                          Transparent bounds
388                          Anchoring bounds
389                Method Chaining
389                Methods for Building a Scanner
391                          Examples illustrating hitEnd and requireEnd
392                          The hitEnd bug and its workaround
392                Other Matcher Methods
394                          Querying a matcher's target text
394         Other Pattern Methods
395                Pattern's split Method, with One Argument
396                          Empty elements with adjacent matches
396                Pattern's split Method, with Two Arguments
396                          Split with a limit less than zero
396                          Split with a limit of zero
396                          Split with a limit greater than zero
397         Additional Examples
397                Adding Width and Height Attributes to Image Tags
399                Validating HTML with Multiple Patterns Per Matcher
401                Parsing Comma-Separated Values (CSV) Text
401         Java Version Differences
402                Differences Between 1.4.2 and 1.5.0
402                          New methods in Java 1.5.0
402                          Unicode-support differences between 1.4.2 and 1.5.0
403                Differences Between 1.5.0 and 1.6

405      Chapter 9   .NET
406         .NET's Regex Flavor
409                Additional Comments on the Flavor
409                          Named capture
409                          Conditional tests
410                          “Compiled” expressions
411                          Right-to-left matching
412                          Backslash-digit ambiguities
412                          ECMAScript mode
413         Using .NET Regular Expressions
413                Regex Quickstart
413                          Quickstart: Checking a string for match
414                          Quickstart: Matching and getting the text matched
414                          Quickstart: Matching and getting captured text
414                          Quickstart: Search and replace
415                Package Overview
415                          Importing the regex namespace
416                Core Object Overview
416                          Regex objects
417                          Match objects
418                          Group objects
418                          Capture objects
418                          All results are computed at match time
418         Core Object Details
419                Creating Regex Objects
419                          Catching exceptions
419                          Regex options
421                Using Regex Objects
427                Using Match Objects
430                Using Group Objects
431         Static “Convenience” Functions
432                Regex Caching
432         Support Functions
434         Advanced .NET
434                Regex Assemblies
436                Matching Nested Constructs
437                Capture Objects

439      Chapter 10   PHP
441         PHP's Regex Flavor
443         The Preg Function Interface
444                “Pattern” Arguments
444                          PHP single-quoted strings
445                          Delimiters
446                          Pattern modifiers
449         The Preg Functions
449                preg_match
450                          Capturing match data
450                          Trailing “non-participatory” elements stripped
451                          Named capture
452                          Getting more details on the match: PREG_OFFSET_CAPTURE
453                          The offset argument
453                preg_match_all
454                          Collecting match data
456                          preg_match_all and the PREG_OFFSET_CAPTURE flag
457                          preg_match_all with named capture
458                preg_replace
459                          Basic one-string, one-pattern, one-replacement preg_replace
460                          Multiple subjects, patterns, and replacements
463                preg_replace_callback
465                          A callback versus the e pattern modifier
465                preg_split
466                          preg_split's limit argument
468                          preg_split's flag arguments
469                preg_grep
470                preg_quote
471         “Missing” Preg Functions
472                preg_regex_to_pattern
472                          The problem
472                          The solution
474                Syntax-Checking an Unknown Pattern Argument
475                Syntax-Checking an Unknown Regex
475         Recursive Expressions
475                Matching Text with Nested Parentheses
476                          Recursive reference to a set of capturing parentheses
476                          Recursive reference via named capture
477                          More on possessive quantifiers
477                No Backtracking Into Recursion
478                Matching a Set of Nested Parentheses
478         PHP Efficiency Issues
478                The S Pattern Modifier: “Study”
479                          Standard optimizations, without the S pattern modifier
479                          Enhancing the optimization with the S pattern modifier
480                          When the S pattern modifier can't help
480                          Suggested use
480         Extended Examples
480                CSV Parsing with PHP
481                Checking Tagged Data for Proper Nesting
481                          The main body of this expression
483                          Possessive quantifiers
483                          Real-world XML
484                          HTML?

485      Index


Copyright © 2002 Jeffrey Friedl