Mastering Regular Expressions
Second Edition

Table of Contents

xv       Preface
xv          The Need for This Book
xvi                Why I've Written the Second Edition
xvi         Intended Audience
xvii        How to Read This Book
xvii        Organization
xviii              The Introduction
xviii              The Details
xix                Tool-Specific Information
xix         Typographical Conventions
xx          Exercises
xxi         Links, Code, Errata, and Contacts
xxi         Personal Comments and Acknowledgments

1        Chapter 1   Introduction to Regular Expressions
2           Solving Real Problems
4           Regular Expressions as a Language
4                  The Filename Analogy
5                  The Language Analogy
6                            The goal of this book
6           The Regular-Expression Frame of Mind
6                  If You Have Some Regular-Expression Experience
6                  Searching Text Files: Egrep
8           Egrep Metacharacters
8                  Start and End of the Line
9                  Character Classes
9                            Matching any one of several characters
10                           Negated character classes
11                 Matching Any Character with Dot
13                 Alternation
13                           Matching any one of several subexpressions
14                 Ignoring Differences in Capitalization
15                 Word Boundaries
16                 In a Nutshell
17                 Optional Items
18                 Other Quantifiers: Repetition
20                           Defined range of matches: intervals
20                 Parentheses and Backreferences
22                 The Great Escape
23          Expanding the Foundation
23                 Linguistic Diversification
23                 The Goal of a Regular Expression
23                 A Few More Examples
24                           Variable names
24                           A string within double quotes
24                           Dollar amount (with optional cents)
25                           An HTTP/HTML URL
26                           An HTML tag
26                           Time of day, such as “9:17 am” or “12:30 pm”
27                 Regular Expression Nomenclature
27                           Regex
27                           Matching
27                           Metacharacter
27                           Flavor
29                           Subexpression
29                           Character
30                 Improving on the Status Quo
32                 Summary
33          Personal Glimpses

35       Chapter 2   Extended Introductory Examples
36          About the Examples
37                 A Short Introduction to Perl
38          Matching Text with Regular Expressions
40                 Toward a More Real-World Example
40                 Side Effects of a Successful Match
43                 Intertwined Regular Expressions
44                           A short aside -- metacharacters galore
47                           Generic “whitespace” with \s
49                 Intermission
50          Modifying Text with Regular Expressions
50                 Example: Form Letter
51                 Example: Prettifying a Stock Price
53                 Automated Editing
53                 A Small Mail Utility
58                           Real-world problems, real-world solutions
59                           The “real” real world
59                 Adding Commas to a Number with Lookaround
60                           Lookaround doesn't “consume” text
61                           A few more lookahead examples
64                           Back to the comma example...
65                           Word boundaries and negative lookaround
67                           Commafication without lookbehind
67                 Text-to-HTML Conversion
68                           Cooking special characters
69                           Separating paragraphs
70                           “Linkizing” an email address
74                           “Linkizing” an HTTP URL
77                 That Doubled-Word Thing
80                           Moving bits around: operators, functions, and objects

83       Chapter 3   Overview of Regular Expression Features and Flavors
85          A Casual Stroll Across the Regex Landscape
85                 The Origins of Regular Expressions
86                           Grep's metacharacters
86                           Grep evolves
86                           Egrep evolves
87                           Other species evolve
87                           POSIX -- An attempt at standardization
88                           Henry Spencer's regex package
88                           Perl evolves
90                           A partial consolidation of flavors
91                           Versions as of this book
91                 At a Glance
93          Care and Handling of Regular Expressions
94                 Integrated Handling
95                 Procedural and Object-Oriented Handling
95                           Regex handling in Java
96                           Regex handling in VB and other .NET languages
97                           Regex handling in Python
97                           Why do approaches differ?
97                 A Search-and-Replace Example
98                           Search-and-replace in Java
99                           Search-and-replace in VB.NET
99                 Search and Replace in Other Languages
99                           Awk
100                          Tcl
100                          GNU Emacs
101                Care and Handling: Summary
101         Strings, Character Encodings, and Modes
101                Strings as Regular Expressions
102                          Strings in Java
102                          Strings in VB.NET
102                          Strings in C#
103                          Strings in PHP
103                          Strings in Python
104                          Strings in Tcl
104                          Regex literals in Perl
105                Character-Encoding Issues
105                          Richness of encoding-related support
106                          Unicode
109                Regex Modes and Match Modes
109                          Case-insensitive match mode
110                          Free-spacing and comments regex mode
110                          Dot-matches-all match mode (a.k.a., “single-line mode”)
111                          Enhanced line-anchor match mode (a.k.a., “multiline mode”)
112                          Literal-text regex mode
112         Common Metacharacters and Features
114                Character Representations
114                          Character shorthands
114                          These are machine dependent?
115                          Octal escape -- \num
116                          Hex and Unicode escapes: \xnum, \x{num}, \unum, \Unum, ...
116                          Control characters: \cchar
117                Character Classes and Class-Like Constructs
117                          Normal classes: [a-z] and [^a-z]
118                          Almost any character: dot
119                          Class shorthands: \w, \d, \s, \W, \D, \S
119                          Unicode properties, scripts, and blocks: \p{Prop}, \P{Prop}
123                          Class set operations: [[a-z]&&[^aeiou]]
125                          Unicode combining character sequence: \X
125                          POSIX bracket-expression “character class”: [[:alpha:]]
126                          POSIX bracket-expression “collating sequences”: [[.span-ll.]]
126                          POSIX bracket-expression “character equivalents”: [[=n=]]
127                          Emacs syntax classes
127                Anchors and Other “Zero-Width Assertions”
127                          Start of line/string: ^, \A
127                          End of line/string: $, \Z, \z
128                          Start of match (or end of previous match): \G
131                          Word boundaries: \b, \B, \<, \>, ...
132                          Lookahead (?=...), (?!...); Lookbehind, (?<=...), (?<!...)
133                Comments and Mode Modifiers
133                          Mode modifier: (?modifier), such as (?i) or (?-i)
134                          Mode-modified span: (?modifier:...), such as (?i:...)
134                          Comments: (?#...) and #...
134                          Literal-text span: \Q...\E
135                Grouping, Capturing, Conditionals, and Control
135                          Capturing/Grouping Parentheses: (...) and \1, \2, ...
136                          Grouping-only parentheses: (?:...)
137                          Named capture: (?<Name>...)
137                          Atomic grouping: (?>...)
138                          Alternation: ...|...|...
138                          Conditional: (?if then|else)
139                          Greedy quantifiers: *, +, ?, {num,num}
140                          Lazy quantifiers: *?, +?, ??, {num,num}?
140                          Possessive quantifiers: *+, ++, ?+, {num,num}+
141         Guide to the Advanced Chapters

143      Chapter 4   The Mechanics of Expression Processing
143         Start Your Engines!
144                Two Kinds of Engines
144                New Standards
144                          The impact of standards
145                Regex Engine Types
146                From the Department of Redundancy Department
146                Testing the Engine Type
146                          Traditional NFA or not?
147                          DFA or POSIX NFA?
147         Match Basics
147                About the Examples
148                Rule 1: The Match That Begins Earliest Wins
148                          The “transmission” and the bump-along
149                Engine Pieces and Parts
150                          No “electric” parentheses, backreferences, or lazy quantifiers
151                Rule 2: The Standard Quantifiers Are Greedy
151                          A subjective example
152                          Being too greedy
153                          First come, first served
153                          Getting down to the details
153         Regex-Directed Versus Text-Directed
153                NFA Engine: Regex-Directed
155                          The control benefits of an NFA engine
155                DFA Engine: Text-Directed
156                First Thoughts: NFA and DFA in Comparison
156                          Consequences to us as users
157         Backtracking
158                A Really Crummy Analogy
158                          A crummy little example
159                Two Important Points on Backtracking
159                Saved States
160                          A match without backtracking
160                          A match after backtracking
160                          A non-match
161                          A lazy match
162                Backtracking and Greediness
162                          Star, plus, and their backtracking
162                          Revisiting a fuller example
163         More About Greediness and Backtracking
164                Problems of Greediness
165                Multi-Character “Quotes”
166                Using Lazy Quantifiers
167                Greediness and Laziness Always Favor a Match
168                The Essence of Greediness, Laziness, and Backtracking
169                Possessive Quantifiers and Atomic Grouping
170                          Atomic grouping with (?>...)
172                Possessive Quantifiers, ?+, *+, ++, and {m,n}+
173                The Backtracking of Lookaround
174                          Mimicking atomic grouping with positive lookahead
174                Is Alternation Greedy?
175                Taking Advantage of Ordered Alternation
176                          Ordered alternation pitfalls
177         NFA, DFA, and POSIX
177                “The Longest-Leftmost”
177                          Really, the longest
178                POSIX and the Longest-Leftmost Rule
179                Speed and Efficiency
179                          DFA efficiency
180                Summary: NFA and DFA in Comparison
180                          DFA versus NFA: Differences in the pre-use compile
181                          DFA versus NFA: Differences in match speed
181                          DFA versus NFA: Differences in what is matched
182                          DFA versus NFA: Differences in capabilities
182                          DFA versus NFA: Differences in ease of implementation
183         Summary

185      Chapter 5   Practical Regex Techniques
186         Regex Balancing Act
186         A Few Short Examples
186                Continuing with Continuation Lines
187                Matching an IP Address
189                          Know your context
190                Working with Filenames
190                          Removing the leading path from a filename
191                          Accessing the filename from a path
192                          Both leading path and filename
193                Matching Balanced Sets of Parentheses
194                Watching Out for Unwanted Matches
196                Matching Delimited Text
196                          Allowing escaped quotes in double-quoted strings
198                Knowing Your Data and Making Assumptions
199                Stripping Leading and Trailing Whitespace
200         HTML-Related Examples
200                Matching an HTML Tag
201                Matching an HTML Link
203                Examining an HTTP URL
203                Validating a Hostname
205                Plucking Out a URL in the Real World
208         Extended Examples
208                Keeping in Sync with Your Data
210                          Keeping the match in sync with expectations
211                          Maintaining sync after a non-match as well
212                          Maintaining sync with \G
212                          This example in perspective
212                Parsing CSV Files
215                          Distrusting the bump-along
218                          One change for the sake of efficiency
219                          Other CSV formats

221      Chapter 6   Crafting an Efficient Expression
222         A Sobering Example
223                A Simple Change -- Placing Your Best Foot Forward
223                Efficiency Versus Correctness
225                Advancing Further -- Localizing the Greediness
226                Reality Check
226                          “Exponential” matches
228         A Global View of Backtracking
229                More Work for a POSIX NFA
230                Work Required During a Non-Match
231                Being More Specific
231                Alternation Can Be Expensive
232         Benchmarking
234                Know What You're Measuring
234                Benchmarking with Java
236                Benchmarking with VB.NET
237                Benchmarking with Python
238                Benchmarking with Ruby
239                Benchmarking with Tcl
239         Common Optimizations
240                No Free Lunch
240                Everyone's Lunch is Different
241                The Mechanics of Regex Application
242                Pre-Application Optimizations
242                          Compile caching
244                          Pre-check of required character/substring optimization
245                          Length-cognizance optimization
245                Optimizations with the Transmission
245                          Start of string/line anchor optimization
246                          Implicit-anchor optimization
246                          End of string/line anchor optimization
246                          Initial character/class/substring discrimination optimization
247                          Embedded literal string check optimization
247                          Length-cognizance transmission optimization
247                Optimizations of the Regex Itself
247                          Literal string concatenation optimization
247                          Simple quantifier optimization
248                          Needless parentheses elimination
249                          Needless character class elimination
249                          Character following lazy quantifier optimization
249                          “Excessive” backtracking detection
250                          Exponential (a.k.a., super-linear) short-circuiting
250                          State-suppression with possessive quantifiers
251                          Small quantifier equivalence
252                          Need cognizance
252         Techniques for Faster Expressions
254                Common Sense Techniques
254                          Avoid recompiling
254                          Use non-capturing parentheses
254                          Don't add superfluous parentheses
254                          Don't use superfluous character classes
255                          Use leading anchors
255                Expose Literal Text
255                          “Factor out” required components from quantifiers
255                          “Factor out” required components from the front of alternation
255                Expose Anchors
256                          Expose ^ and \G at the front of expressions
256                          Expose $ at the end of expressions
256                Lazy Versus Greedy: Be Specific
257                Split Into Multiple Regular Expressions
258                Mimic Initial-Character Discrimination
259                          Don't do this with Tcl
259                Use Atomic Grouping and Possessive Quantifiers
260                Lead the Engine to a Match
260                          Put the most likely alternative first
260                          Distribute into the end of alternation
261         Unrolling the Loop
262                Method 1: Building a Regex From Past Experiences
262                          Constructing a general “unrolling-the-loop” pattern
263                The Real “Unrolling-the-Loop” Pattern
264                          Avoiding the neverending match
265                          General things to look out for
266                Method 2: A Top-Down View
267                Method 3: An Internet Hostname
268                Observations
268                Using Atomic Grouping and Possessive Quantifiers
269                          Making a neverending match safe with possessive quantifiers
269                          Making a neverending match safe with atomic grouping
270                Short Unrolling Examples
270                          Unrolling “multi-character” quotes
270                          Unrolling the continuation-line example
271                          Unrolling the CSV regex
272                Unrolling C Comments
272                          To unroll or to not unroll...
273                          A direct approach
274                          Making it work
275                          Unrolling the C loop
277         The Freeflowing Regex
277                A Helping Hand to Guide the Match
279                A Well-Guided Regex is a Fast Regex
280                Wrapup
281         In Summary: Think!

283      Chapter 7   Perl
285         Regular Expressions as a Language Component
286                Perl's Greatest Strength
286                Perl's Greatest Weakness
286         Perl's Regex Flavor
288                Regex Operands and Regex Literals
289                          Features supported by regex literals
291                          Picking your own regex delimiters
292                How Regex Literals Are Parsed
292                Regex Modifiers
293         Regex-Related Perlisms
294                Expression Context
294                          Contorting an expression
295                Dynamic Scope and Regex Match Effects
295                          Global and private variables
295                          Dynamically scoped values
298                          A better analogy: clear transparencies
298                          Regex side effects and dynamic scoping
299                          Dynamic scoping versus lexical scoping
299                Special Variables Modified by a Match
303                          Using $1 within a regex?
303         The qr/.../ Operator and Regex Objects
303                Building and Using Regex Objects
304                          Match modes (or lack thereof) are very sticky
305                Viewing Regex Objects
306                Using Regex Objects for Efficiency
306         The Match Operator
307                Match's Regex Operand
307                          Using a regex literal
307                          Using a regex object
308                          The default regex
308                          Special match-once ?...?
308                Specifying the Match Target Operand
308                          The default target
309                          Negating the sense of the match
309                Different Uses of the Match Operator
310                          Normal “does this match?” -- scalar context without /g
310                          Normal “pluck data from a string” -- list context, without /g
311                          “Pluck all matches” -- list context, with the /g modifier
312                Iterative Matching: Scalar Context, with /g
313                          The “current match location” and the pos() function
314                          Pre-setting a string's pos
315                          Using \G
315                          “Tag-team” matching with /gc
316                          Pos-related summary
316                The Match Operator's Environmental Relations
317                          The match operator's side effects
317                          Outside influences on the match operator
318                          Keeping your mind in context (and context in mind)
318         The Substitution Operator
319                The Replacement Operand
319                The /e Modifier
320                          Multiple uses of /e
321                Context and Return Value
321         The Split Operator
322                Basic Split
322                          Basic match operand
322                          Target string operand
323                          Basic chunk-limit operand
323                          Advanced split
324                Returning Empty Elements
324                          Trailing empty elements
324                          The chunk-limit operand's second job
324                          Special matches at the ends of the string
325                Split's Special Regex Operands
325                          Split has no side effects
326                Split's Match Operand with Capturing Parentheses
326         Fun with Perl Enhancements
328                Using a Dynamic Regex to Match Nested Pairs
331                Using the Embedded-Code Construct
331                          Using embedded code to display match-time information
332                          Using embedded code to see all matches
334                          Finding the longest match
335                          Finding the longest-leftmost match
335                Using local in an Embedded-Code Construct
338                A Warning About Embedded Code and my Variables
340                Matching Nested Constructs with Embedded Code
341                Overloading Regex Literals
341                          Adding start- and end-of-word metacharacters
343                          Adding support for possessive quantifiers
344                Problems with Regex-Literal Overloading
344                Mimicking Named Capture
347         Perl Efficiency Issues
348                “There's More Than One Way to Do It”
348                Regex Compilation, the /o Modifier, qr/.../, and Efficiency
350                          The internal mechanics of preparing a regex
350                          Perl steps to reduce regex compilation
352                          The “compile once” /o modifier
353                          Using regex objects for efficiency
354                          Using the default regex for efficiency
355                Understanding the “Pre-Match” Copy
355                          Pre-match copy supports $1, $&, $', $+, ...
355                          The pre-match copy is not always needed
356                          How expensive is the pre-match copy?
357                          Avoiding the pre-match copy
359                The Study Function
359                          When not to use study
360                          When study can help
360                Benchmarking
361                Regex Debugging Information
362                          Run-time debugging information
363                          Other ways to invoke debugging messages
363         Final Comments

365      Chapter 8   Java
366         Judging a Regex Package
366                Technical Issues
367                Social and Political Issues
368         Object Models
368                A Few Abstract Object Models
369                          An “all-in-one” model
370                          A “match state” model
371                          A “match result” model
372                Growing Complexity
372         Packages, Packages, Packages
375                Why So Many “Perl5” Flavors?
375                Lies, Damn Lies, and Benchmarks
376                          Warning: Benchmark results can cause drowsiness!
377                          And the winner is...
377                Recommendations
378         Sun's Regex Package
378                Regex Flavor
381                Using java.util.regex
383                The Pattern.compile() Factory
384                          Pattern's matcher(...) method
384                The Matcher Object
384                          Applying the regex
385                          Querying the results
385                          Reusing Matcher objects for efficiency
387                          Simple search and replace
388                          Advanced search and replace
390                Other Pattern Methods
390                          Pattern's split method, with one argument
391                          Pattern's split method, with two arguments
392         A Quick Look at Jakarta-ORO
392                ORO's Perl5Util
393                A Mini Perl5Util Reference
393                          Perl5Util basics -- initiating a match
396                          Perl5Util basics -- inspecting the results of a match
397                Using ORO's Underlying Classes

399      Chapter 9   .NET
400         .NET's Regex Flavor
402                Additional Comments on the Flavor
402                          Named capture
403                          Conditional tests
404                          “Compiled” expressions
405                          Right-to-left matching
406                          Backslash-digit ambiguities
406                          ECMAScript mode
407         Using .NET Regular Expressions
407                Regex Quickstart
407                          Quickstart: Checking a string for match
408                          Quickstart: Matching and getting the text matched
408                          Quickstart: Matching and getting captured text
408                          Quickstart: Search and replace
409                Package Overview
409                          Importing the regex namespace
410                Core Object Overview
410                          Regex objects
411                          Match objects
412                          Group objects
412                          Capture objects
412                          All results are computed at match time
412         Core Object Details
413                Creating Regex Objects
413                          Catching exceptions
413                          Regex options
415                Using Regex Objects
421                Using Match Objects
424                Using Group Objects
425         Static “Convenience” Functions
426                Regex Caching
426         Support Functions
427         Advanced .NET
428                Regex Assemblies
430                Matching Nested Constructs
431                Capture Objects

433      Index


Copyright © 2002 Jeffrey Friedl