Table of Contents
xv Preface
xv The Need for This Book
xvi Why I've Written the Second Edition
xvi Intended Audience
xvii How to Read This Book
xvii Organization
xviii The Introduction
xviii The Details
xix Tool-Specific Information
xix Typographical Conventions
xx Exercises
xxi Links, Code, Errata, and Contacts
xxi Personal Comments and Acknowledgments
1 Chapter 1 Introduction to Regular Expressions
2 Solving Real Problems
4 Regular Expressions as a Language
4 The Filename Analogy
5 The Language Analogy
6 The goal of this book
6 The Regular-Expression Frame of Mind
6 If You Have Some Regular-Expression Experience
6 Searching Text Files: Egrep
8 Egrep Metacharacters
8 Start and End of the Line
9 Character Classes
9 Matching any one of several characters
10 Negated character classes
11 Matching Any Character with Dot
13 Alternation
13 Matching any one of several subexpressions
14 Ignoring Differences in Capitalization
15 Word Boundaries
16 In a Nutshell
17 Optional Items
18 Other Quantifiers: Repetition
20 Defined range of matches: intervals
20 Parentheses and Backreferences
22 The Great Escape
23 Expanding the Foundation
23 Linguistic Diversification
23 The Goal of a Regular Expression
23 A Few More Examples
24 Variable names
24 A string within double quotes
24 Dollar amount (with optional cents)
25 An HTTP/HTML URL
26 An HTML tag
26 Time of day, such as 9:17 am or 12:30 pm
27 Regular Expression Nomenclature
27 Regex
27 Matching
27 Metacharacter
27 Flavor
29 Subexpression
29 Character
30 Improving on the Status Quo
32 Summary
33 Personal Glimpses
35 Chapter 2 Extended Introductory Examples
36 About the Examples
37 A Short Introduction to Perl
38 Matching Text with Regular Expressions
40 Toward a More Real-World Example
40 Side Effects of a Successful Match
43 Intertwined Regular Expressions
44 A short aside -- metacharacters galore
47 Generic whitespace with \s
49 Intermission
50 Modifying Text with Regular Expressions
50 Example: Form Letter
51 Example: Prettifying a Stock Price
53 Automated Editing
53 A Small Mail Utility
58 Real-world problems, real-world solutions
59 The real real world
59 Adding Commas to a Number with Lookaround
60 Lookaround doesn't consume text
61 A few more lookahead examples
64 Back to the comma example...
65 Word boundaries and negative lookaround
67 Commafication without lookbehind
67 Text-to-HTML Conversion
68 Cooking special characters
69 Separating paragraphs
70 Linkizing an email address
74 Linkizing an HTTP URL
77 That Doubled-Word Thing
80 Moving bits around: operators, functions, and objects
83 Chapter 3 Overview of Regular Expression Features and Flavors
85 A Casual Stroll Across the Regex Landscape
85 The Origins of Regular Expressions
86 Grep's metacharacters
86 Grep evolves
86 Egrep evolves
87 Other species evolve
87 POSIX -- An attempt at standardization
88 Henry Spencer's regex package
88 Perl evolves
90 A partial consolidation of flavors
91 Versions as of this book
91 At a Glance
93 Care and Handling of Regular Expressions
94 Integrated Handling
95 Procedural and Object-Oriented Handling
95 Regex handling in Java
96 Regex handling in VB and other .NET languages
97 Regex handling in Python
97 Why do approaches differ?
97 A Search-and-Replace Example
98 Search-and-replace in Java
99 Search-and-replace in VB.NET
99 Search and Replace in Other Languages
99 Awk
100 Tcl
100 GNU Emacs
101 Care and Handling: Summary
101 Strings, Character Encodings, and Modes
101 Strings as Regular Expressions
102 Strings in Java
102 Strings in VB.NET
102 Strings in C#
103 Strings in PHP
103 Strings in Python
104 Strings in Tcl
104 Regex literals in Perl
105 Character-Encoding Issues
105 Richness of encoding-related support
106 Unicode
109 Regex Modes and Match Modes
109 Case-insensitive match mode
110 Free-spacing and comments regex mode
110 Dot-matches-all match mode (a.k.a., single-line mode)
111 Enhanced line-anchor match mode (a.k.a., multiline mode)
112 Literal-text regex mode
112 Common Metacharacters and Features
114 Character Representations
114 Character shorthands
114 These are machine dependent?
115 Octal escape -- \num
116 Hex and Unicode escapes: \xnum, \x{num}, \unum, \Unum, ...
116 Control characters: \cchar
117 Character Classes and Class-Like Constructs
117 Normal classes: [a-z] and [^a-z]
118 Almost any character: dot
119 Class shorthands: \w, \d, \s, \W, \D, \S
119 Unicode properties, scripts, and blocks: \p{Prop}, \P{Prop}
123 Class set operations: [[a-z]&&[^aeiou]]
125 Unicode combining character sequence: \X
125 POSIX bracket-expression character class: [[:alpha:]]
126 POSIX bracket-expression collating sequences: [[.span-ll.]]
126 POSIX bracket-expression character equivalents: [[=n=]]
127 Emacs syntax classes
127 Anchors and Other Zero-Width Assertions
127 Start of line/string: ^, \A
127 End of line/string: $, \Z, \z
128 Start of match (or end of previous match): \G
131 Word boundaries: \b, \B, \<, \>, ...
132 Lookahead (?=...), (?!...); Lookbehind (?<=...), (?<!...)
133 Comments and Mode Modifiers
133 Mode modifier: (?modifier), such as (?i) or (?-i)
134 Mode-modified span: (?modifier:...), such as (?i:...)
134 Comments: (?#...) and #...
134 Literal-text span: \Q...\E
135 Grouping, Capturing, Conditionals, and Control
135 Capturing/Grouping Parentheses: (...) and \1, \2, ...
136 Grouping-only parentheses: (?:...)
137 Named capture: (?<Name>...)
137 Atomic grouping: (?>...)
138 Alternation: ...|...|...
138 Conditional: (?if then|else)
139 Greedy quantifiers: *, +, ?, {num,num}
140 Lazy quantifiers: *?, +?, ??, {num,num}?
140 Possessive quantifiers: *+, ++, ?+, {num,num}+
141 Guide to the Advanced Chapters
143 Chapter 4 The Mechanics of Expression Processing
143 Start Your Engines!
144 Two Kinds of Engines
144 New Standards
144 The impact of standards
145 Regex Engine Types
146 From the Department of Redundancy Department
146 Testing the Engine Type
146 Traditional NFA or not?
147 DFA or POSIX NFA?
147 Match Basics
147 About the Examples
148 Rule 1: The Match That Begins Earliest Wins
148 The transmission and the bump-along
149 Engine Pieces and Parts
150 No electric parentheses, backreferences, or lazy quantifiers
151 Rule 2: The Standard Quantifiers Are Greedy
151 A subjective example
152 Being too greedy
153 First come, first served
153 Getting down to the details
153 Regex-Directed Versus Text-Directed
153 NFA Engine: Regex-Directed
155 The control benefits of an NFA engine
155 DFA Engine: Text-Directed
156 First Thoughts: NFA and DFA in Comparison
156 Consequences to us as users
157 Backtracking
158 A Really Crummy Analogy
158 A crummy little example
159 Two Important Points on Backtracking
159 Saved States
160 A match without backtracking
160 A match after backtracking
160 A non-match
161 A lazy match
162 Backtracking and Greediness
162 Star, plus, and their backtracking
162 Revisiting a fuller example
163 More About Greediness and Backtracking
164 Problems of Greediness
165 Multi-Character Quotes
166 Using Lazy Quantifiers
167 Greediness and Laziness Always Favor a Match
168 The Essence of Greediness, Laziness, and Backtracking
169 Possessive Quantifiers and Atomic Grouping
170 Atomic grouping with (?>...)
172 Possessive Quantifiers, ?+, *+, ++, and {m,n}+
173 The Backtracking of Lookaround
174 Mimicking atomic grouping with positive lookahead
174 Is Alternation Greedy?
175 Taking Advantage of Ordered Alternation
176 Ordered alternation pitfalls
177 NFA, DFA, and POSIX
177 The Longest-Leftmost
177 Really, the longest
178 POSIX and the Longest-Leftmost Rule
179 Speed and Efficiency
179 DFA efficiency
180 Summary: NFA and DFA in Comparison
180 DFA versus NFA: Differences in the pre-use compile
181 DFA versus NFA: Differences in match speed
181 DFA versus NFA: Differences in what is matched
182 DFA versus NFA: Differences in capabilities
182 DFA versus NFA: Differences in ease of implementation
183 Summary
185 Chapter 5 Practical Regex Techniques
186 Regex Balancing Act
186 A Few Short Examples
186 Continuing with Continuation Lines
187 Matching an IP Address
189 Know your context
190 Working with Filenames
190 Removing the leading path from a filename
191 Accessing the filename from a path
192 Both leading path and filename
193 Matching Balanced Sets of Parentheses
194 Watching Out for Unwanted Matches
196 Matching Delimited Text
196 Allowing escaped quotes in double-quoted strings
198 Knowing Your Data and Making Assumptions
199 Stripping Leading and Trailing Whitespace
200 HTML-Related Examples
200 Matching an HTML Tag
201 Matching an HTML Link
203 Examining an HTTP URL
203 Validating a Hostname
205 Plucking Out a URL in the Real World
208 Extended Examples
208 Keeping in Sync with Your Data
210 Keeping the match in sync with expectations
211 Maintaining sync after a non-match as well
212 Maintaining sync with \G
212 This example in perspective
212 Parsing CSV Files
215 Distrusting the bump-along
218 One change for the sake of efficiency
219 Other CSV formats
221 Chapter 6 Crafting an Efficient Expression
222 A Sobering Example
223 A Simple Change -- Placing Your Best Foot Forward
223 Efficiency Versus Correctness
225 Advancing Further -- Localizing the Greediness
226 Reality Check
226 Exponential matches
228 A Global View of Backtracking
229 More Work for a POSIX NFA
230 Work Required During a Non-Match
231 Being More Specific
231 Alternation Can Be Expensive
232 Benchmarking
234 Know What You're Measuring
234 Benchmarking with Java
236 Benchmarking with VB.NET
237 Benchmarking with Python
238 Benchmarking with Ruby
239 Benchmarking with Tcl
239 Common Optimizations
240 No Free Lunch
240 Everyone's Lunch is Different
241 The Mechanics of Regex Application
242 Pre-Application Optimizations
242 Compile caching
244 Pre-check of required character/substring optimization
245 Length-cognizance optimization
245 Optimizations with the Transmission
245 Start of string/line anchor optimization
246 Implicit-anchor optimization
246 End of string/line anchor optimization
246 Initial character/class/substring discrimination optimization
247 Embedded literal string check optimization
247 Length-cognizance transmission optimization
247 Optimizations of the Regex Itself
247 Literal string concatenation optimization
247 Simple quantifier optimization
248 Needless parentheses elimination
249 Needless character class elimination
249 Character following lazy quantifier optimization
249 Excessive backtracking detection
250 Exponential (a.k.a., super-linear) short-circuiting
250 State-suppression with possessive quantifiers
251 Small quantifier equivalence
252 Need cognizance
252 Techniques for Faster Expressions
254 Common Sense Techniques
254 Avoid recompiling
254 Use non-capturing parentheses
254 Don't add superfluous parentheses
254 Don't use superfluous character classes
255 Use leading anchors
255 Expose Literal Text
255 Factor out required components from quantifiers
255 Factor out required components from the front of alternation
255 Expose Anchors
256 Expose ^ and \G at the front of expressions
256 Expose $ at the end of expressions
256 Lazy Versus Greedy: Be Specific
257 Split Into Multiple Regular Expressions
258 Mimic Initial-Character Discrimination
259 Don't do this with Tcl
259 Use Atomic Grouping and Possessive Quantifiers
260 Lead the Engine to a Match
260 Put the most likely alternative first
260 Distribute into the end of alternation
261 Unrolling the Loop
262 Method 1: Building a Regex From Past Experiences
262 Constructing a general unrolling-the-loop pattern
263 The Real Unrolling-the-Loop Pattern
264 Avoiding the neverending match
265 General things to look out for
266 Method 2: A Top-Down View
267 Method 3: An Internet Hostname
268 Observations
268 Using Atomic Grouping and Possessive Quantifiers
269 Making a neverending match safe with possessive quantifiers
269 Making a neverending match safe with atomic grouping
270 Short Unrolling Examples
270 Unrolling multi-character quotes
270 Unrolling the continuation-line example
271 Unrolling the CSV regex
272 Unrolling C Comments
272 To unroll or to not unroll...
273 A direct approach
274 Making it work
275 Unrolling the C loop
277 The Freeflowing Regex
277 A Helping Hand to Guide the Match
279 A Well-Guided Regex is a Fast Regex
280 Wrapup
281 In Summary: Think!
283 Chapter 7 Perl
285 Regular Expressions as a Language Component
286 Perl's Greatest Strength
286 Perl's Greatest Weakness
286 Perl's Regex Flavor
288 Regex Operands and Regex Literals
289 Features supported by regex literals
291 Picking your own regex delimiters
292 How Regex Literals Are Parsed
292 Regex Modifiers
293 Regex-Related Perlisms
294 Expression Context
294 Contorting an expression
295 Dynamic Scope and Regex Match Effects
295 Global and private variables
295 Dynamically scoped values
298 A better analogy: clear transparencies
298 Regex side effects and dynamic scoping
299 Dynamic scoping versus lexical scoping
299 Special Variables Modified by a Match
303 Using $1 within a regex?
303 The qr/.../ Operator and Regex Objects
303 Building and Using Regex Objects
304 Match modes (or lack thereof) are very sticky
305 Viewing Regex Objects
306 Using Regex Objects for Efficiency
306 The Match Operator
307 Match's Regex Operand
307 Using a regex literal
307 Using a regex object
308 The default regex
308 Special match-once ?...?
308 Specifying the Match Target Operand
308 The default target
309 Negating the sense of the match
309 Different Uses of the Match Operator
310 Normal "does this match?" -- scalar context without /g
310 Normal "pluck data from a string" -- list context, without /g
311 Pluck all matches -- list context, with the /g modifier
312 Iterative Matching: Scalar Context, with /g
313 The current match location and the pos() function
314 Pre-setting a string's pos
315 Using \G
315 Tag-team matching with /gc
316 Pos-related summary
316 The Match Operator's Environmental Relations
317 The match operator's side effects
317 Outside influences on the match operator
318 Keeping your mind in context (and context in mind)
318 The Substitution Operator
319 The Replacement Operand
319 The /e Modifier
320 Multiple uses of /e
321 Context and Return Value
321 The Split Operator
322 Basic Split
322 Basic match operand
322 Target string operand
323 Basic chunk-limit operand
323 Advanced split
324 Returning Empty Elements
324 Trailing empty elements
324 The chunk-limit operand's second job
324 Special matches at the ends of the string
325 Split's Special Regex Operands
325 Split has no side effects
326 Split's Match Operand with Capturing Parentheses
326 Fun with Perl Enhancements
328 Using a Dynamic Regex to Match Nested Pairs
331 Using the Embedded-Code Construct
331 Using embedded code to display match-time information
332 Using embedded code to see all matches
334 Finding the longest match
335 Finding the longest-leftmost match
335 Using local in an Embedded-Code Construct
338 A Warning About Embedded Code and my Variables
340 Matching Nested Constructs with Embedded Code
341 Overloading Regex Literals
341 Adding start- and end-of-word metacharacters
343 Adding support for possessive quantifiers
344 Problems with Regex-Literal Overloading
344 Mimicking Named Capture
347 Perl Efficiency Issues
348 There's More Than One Way to Do It
348 Regex Compilation, the /o Modifier, qr/.../, and Efficiency
350 The internal mechanics of preparing a regex
350 Perl steps to reduce regex compilation
352 The compile once /o modifier
353 Using regex objects for efficiency
354 Using the default regex for efficiency
355 Understanding the Pre-Match Copy
355 Pre-match copy supports $1, $&, $', $+, ...
355 The pre-match copy is not always needed
356 How expensive is the pre-match copy?
357 Avoiding the pre-match copy
359 The Study Function
359 When not to use study
360 When study can help
360 Benchmarking
361 Regex Debugging Information
362 Run-time debugging information
363 Other ways to invoke debugging messages
363 Final Comments
365 Chapter 8 Java
366 Judging a Regex Package
366 Technical Issues
367 Social and Political Issues
368 Object Models
368 A Few Abstract Object Models
369 An all-in-one model
370 A match state model
371 A match result model
372 Growing Complexity
372 Packages, Packages, Packages
375 Why So Many Perl5 Flavors?
375 Lies, Damn Lies, and Benchmarks
376 Warning: Benchmark results can cause drowsiness!
377 And the winner is...
377 Recommendations
378 Sun's Regex Package
378 Regex Flavor
381 Using java.util.regex
383 The Pattern.compile() Factory
384 Pattern's matcher(...) method
384 The Matcher Object
384 Applying the regex
385 Querying the results
385 Reusing Matcher objects for efficiency
387 Simple search and replace
388 Advanced search and replace
390 Other Pattern Methods
390 Pattern's split method, with one argument
391 Pattern's split method, with two arguments
392 A Quick Look at Jakarta-ORO
392 ORO's Perl5Util
393 A Mini Perl5Util Reference
393 Perl5Util basics -- initiating a match
396 Perl5Util basics -- inspecting the results of a match
397 Using ORO's Underlying Classes
399 Chapter 9 .NET
400 .NET's Regex Flavor
402 Additional Comments on the Flavor
402 Named capture
403 Conditional tests
404 Compiled expressions
405 Right-to-left matching
406 Backslash-digit ambiguities
406 ECMAScript mode
407 Using .NET Regular Expressions
407 Regex Quickstart
407 Quickstart: Checking a string for a match
408 Quickstart: Matching and getting the text matched
408 Quickstart: Matching and getting captured text
408 Quickstart: Search and replace
409 Package Overview
409 Importing the regex namespace
410 Core Object Overview
410 Regex objects
411 Match objects
412 Group objects
412 Capture objects
412 All results are computed at match time
412 Core Object Details
413 Creating Regex Objects
413 Catching exceptions
413 Regex options
415 Using Regex Objects
421 Using Match Objects
424 Using Group Objects
425 Static Convenience Functions
426 Regex Caching
426 Support Functions
427 Advanced .NET
428 Regex Assemblies
430 Matching Nested Constructs
431 Capture Objects
433 Index