Full Table of Contents
xvii Preface
xvii The Need for This Book
xviii Intended Audience
xix How to Read This Book
xix Organization
xx The Introduction
xx The Details
xxi Tool-Specific Information
xxi Typographical Conventions
xxii Exercises
xxiii Links, Code, Errata, and Contacts
xxiii Safari® Enabled
xxiv Personal Comments and Acknowledgments
1 Chapter 1 Introduction to Regular Expressions
2 Solving Real Problems
4 Regular Expressions as a Language
4 The Filename Analogy
5 The Language Analogy
6 The goal of this book
6 The Regular-Expression Frame of Mind
6 If You Have Some Regular-Expression Experience
6 Searching Text Files: Egrep
8 Egrep Metacharacters
8 Start and End of the Line
9 Character Classes
9 Matching any one of several characters
10 Negated character classes
11 Matching Any Character with Dot
13 Alternation
13 Matching any one of several subexpressions
14 Ignoring Differences in Capitalization
15 Word Boundaries
16 In a Nutshell
17 Optional Items
18 Other Quantifiers: Repetition
20 Defined range of matches: intervals
20 Parentheses and Backreferences
22 The Great Escape
23 Expanding the Foundation
23 Linguistic Diversification
23 The Goal of a Regular Expression
23 A Few More Examples
24 Variable names
24 A string within double quotes
24 Dollar amount (with optional cents)
25 An HTTP/HTML URL
26 An HTML tag
26 Time of day, such as 9:17 am or 12:30 pm
27 Regular Expression Nomenclature
27 Regex
27 Matching
27 Metacharacter
27 Flavor
29 Subexpression
29 Character
30 Improving on the Status Quo
32 Summary
33 Personal Glimpses
35 Chapter 2 Extended Introductory Examples
36 About the Examples
37 A Short Introduction to Perl
38 Matching Text with Regular Expressions
40 Toward a More Real-World Example
40 Side Effects of a Successful Match
43 Intertwined Regular Expressions
44 A short aside -- metacharacters galore
47 Generic whitespace with \s
49 Intermission
50 Modifying Text with Regular Expressions
50 Example: Form Letter
51 Example: Prettifying a Stock Price
53 Automated Editing
53 A Small Mail Utility
58 Real-world problems, real-world solutions
59 The real real world
59 Adding Commas to a Number with Lookaround
60 Lookaround doesn't consume text
61 A few more lookahead examples
64 Back to the comma example...
65 Word boundaries and negative lookaround
67 Commafication without lookbehind
67 Text-to-HTML Conversion
68 Cooking special characters
69 Separating paragraphs
70 Linkizing an email address
74 Linkizing an HTTP URL
77 That Doubled-Word Thing
80 Moving bits around: operators, functions, and objects
83 Chapter 3 Overview of Regular Expression Features and Flavors
85 A Casual Stroll Across the Regex Landscape
85 The Origins of Regular Expressions
86 Grep's metacharacters
86 Grep evolves
86 Egrep evolves
87 Other species evolve
87 POSIX -- An attempt at standardization
88 Henry Spencer's regex package
88 Perl evolves
90 A partial consolidation of flavors
91 Versions as of this book
91 At a Glance
93 Care and Handling of Regular Expressions
94 Integrated Handling
95 Procedural and Object-Oriented Handling
95 Regex handling in Java
96 Regex handling in VB and other .NET languages
97 Regex handling in PHP
97 Regex handling in Python
97 Why do approaches differ?
98 A Search-and-Replace Example
98 Search and replace in Java
99 Search and replace in VB.NET
99 Search and replace in PHP
100 Search and Replace in Other Languages
100 Awk
100 Tcl
100 GNU Emacs
101 Care and Handling: Summary
101 Strings, Character Encodings, and Modes
101 Strings as Regular Expressions
102 Strings in Java
103 Strings in VB.NET
103 Strings in C#
103 Strings in PHP
104 Strings in Python
104 Strings in Tcl
105 Regex literals in Perl
105 Character-Encoding Issues
106 Richness of encoding-related support
106 Unicode
107 Characters versus combining-character sequences
108 Multiple code points for the same character
109 Unicode 3.1+ and code points beyond U+FFFF
109 Unicode line terminator
110 Regex Modes and Match Modes
110 Case-insensitive match mode
111 Free-spacing and comments regex mode
111 Dot-matches-all match mode (a.k.a., single-line mode)
112 Enhanced line-anchor match mode (a.k.a., multiline mode)
113 Literal-text regex mode
113 Common Metacharacters and Features
115 Character Representations
115 Character shorthands
115 These are machine dependent?
116 Octal escape -- \num
117 Hex and Unicode escapes: \xnum, \x{num}, \unum, \Unum, ...
117 Control characters: \cchar
118 Character Classes and Class-Like Constructs
118 Normal classes: [a-z] and [^a-z]
119 Almost any character: dot
120 Exactly one byte
120 Unicode combining character sequence: \X
120 Class shorthands: \w, \d, \s, \W, \D, \S
121 Unicode properties, scripts, and blocks: \p{Prop}, \P{Prop}
125 Simple class subtraction: [[a-z]-[aeiou]]
125 Full class set operations: [[a-z]&&[^aeiou]]
127 POSIX bracket-expression character class: [[:alpha:]]
128 POSIX bracket-expression collating sequences: [[.span-ll.]]
128 POSIX bracket-expression character equivalents: [[=n=]]
128 Emacs syntax classes
129 Anchors and Other Zero-Width Assertions
129 Start of line/string: ^, \A
129 End of line/string: $, \Z, \z
130 Start of match (or end of previous match): \G
133 Word boundaries: \b, \B, \<, \>, ...
133 Lookahead (?=...), (?!...); Lookbehind, (?<=...), (?<!...)
135 Comments and Mode Modifiers
135 Mode modifier: (?modifier), such as (?i) or (?-i)
135 Mode-modified span: (?modifier:...), such as (?i:...)
136 Comments: (?#...) and #...
136 Literal-text span: \Q...\E
137 Grouping, Capturing, Conditionals, and Control
137 Capturing/Grouping Parentheses: (...) and \1, \2, ...
137 Grouping-only parentheses: (?:...)
138 Named capture: (?<Name>...)
139 Atomic grouping: (?>...)
139 Alternation: ...|...|...
140 Conditional: (?if then|else)
141 Greedy quantifiers: *, +, ?, {num,num}
141 Lazy quantifiers: *?, +?, ??, {num,num}?
142 Possessive quantifiers: *+, ++, ?+, {num,num}+
142 Guide to the Advanced Chapters
143 Chapter 4 The Mechanics of Expression Processing
143 Start Your Engines!
144 Two Kinds of Engines
144 New Standards
144 The impact of standards
145 Regex Engine Types
146 From the Department of Redundancy Department
146 Testing the Engine Type
146 Traditional NFA or not?
147 DFA or POSIX NFA?
147 Match Basics
147 About the Examples
148 Rule 1: The Match That Begins Earliest Wins
148 The transmission and the bump-along
149 Engine Pieces and Parts
150 No electric parentheses, backreferences, or lazy quantifiers
151 Rule 2: The Standard Quantifiers Are Greedy
151 A subjective example
152 Being too greedy
153 First come, first served
153 Getting down to the details
153 Regex-Directed Versus Text-Directed
153 NFA Engine: Regex-Directed
155 The control benefits of an NFA engine
155 DFA Engine: Text-Directed
156 First Thoughts: NFA and DFA in Comparison
156 Consequences to us as users
157 Backtracking
158 A Really Crummy Analogy
158 A crummy little example
159 Two Important Points on Backtracking
159 Saved States
160 A match without backtracking
160 A match after backtracking
160 A non-match
161 A lazy match
162 Backtracking and Greediness
162 Star, plus, and their backtracking
162 Revisiting a fuller example
163 More About Greediness and Backtracking
164 Problems of Greediness
165 Multi-Character Quotes
166 Using Lazy Quantifiers
167 Greediness and Laziness Always Favor a Match
168 The Essence of Greediness, Laziness, and Backtracking
169 Possessive Quantifiers and Atomic Grouping
170 Atomic grouping with (?>...)
172 Possessive Quantifiers, ?+, *+, ++, and {m,n}+
173 The Backtracking of Lookaround
174 Mimicking atomic grouping with positive lookahead
174 Is Alternation Greedy?
175 Taking Advantage of Ordered Alternation
176 Ordered alternation pitfalls
177 NFA, DFA, and POSIX
177 The Longest-Leftmost
177 Really, the longest
178 POSIX and the Longest-Leftmost Rule
179 Speed and Efficiency
179 DFA efficiency
180 Summary: NFA and DFA in Comparison
180 DFA versus NFA: Differences in the pre-use compile
181 DFA versus NFA: Differences in match speed
181 DFA versus NFA: Differences in what is matched
182 DFA versus NFA: Differences in capabilities
183 DFA versus NFA: Differences in ease of implementation
183 Summary
185 Chapter 5 Practical Regex Techniques
186 Regex Balancing Act
186 A Few Short Examples
186 Continuing with Continuation Lines
187 Matching an IP Address
189 Know your context
190 Working with Filenames
190 Removing the leading path from a filename
191 Accessing the filename from a path
192 Both leading path and filename
193 Matching Balanced Sets of Parentheses
194 Watching Out for Unwanted Matches
196 Matching Delimited Text
196 Allowing escaped quotes in double-quoted strings
198 Knowing Your Data and Making Assumptions
199 Stripping Leading and Trailing Whitespace
200 HTML-Related Examples
200 Matching an HTML Tag
201 Matching an HTML Link
203 Examining an HTTP URL
203 Validating a Hostname
206 Plucking Out a URL in the Real World
208 Extended Examples
209 Keeping in Sync with Your Data
210 Keeping the match in sync with expectations
211 Maintaining sync after a non-match as well
212 Maintaining sync with \G
212 This example in perspective
213 Parsing CSV Files
215 Distrusting the bump-along
218 One change for the sake of efficiency
218 Other CSV formats
221 Chapter 6 Crafting an Efficient Expression
222 A Sobering Example
223 A Simple Change -- Placing Your Best Foot Forward
223 Efficiency Versus Correctness
225 Advancing Further -- Localizing the Greediness
226 Reality Check
226 Exponential matches
228 A Global View of Backtracking
229 More Work for a POSIX NFA
230 Work Required During a Non-Match
231 Being More Specific
231 Alternation Can Be Expensive
232 Benchmarking
234 Know What You're Measuring
234 Benchmarking with PHP
235 Benchmarking with Java
237 Benchmarking with VB.NET
238 Benchmarking with Ruby
238 Benchmarking with Python
239 Benchmarking with Tcl
240 Common Optimizations
240 No Free Lunch
241 Everyone's Lunch is Different
241 The Mechanics of Regex Application
242 Pre-Application Optimizations
242 Compile caching
245 Pre-check of required character/substring optimization
245 Length-cognizance optimization
246 Optimizations with the Transmission
246 Start of string/line anchor optimization
246 Implicit-anchor optimization
246 End of string/line anchor optimization
247 Initial character/class/substring discrimination optimization
247 Embedded literal string check optimization
247 Length-cognizance transmission optimization
247 Optimizations of the Regex Itself
247 Literal string concatenation optimization
247 Simple quantifier optimization
248 Needless parentheses elimination
248 Needless character class elimination
248 Character following lazy quantifier optimization
249 Excessive backtracking detection
250 Exponential (a.k.a., super-linear) short-circuiting
250 State-suppression with possessive quantifiers
251 Small quantifier equivalence
252 Need cognizance
252 Techniques for Faster Expressions
254 Common Sense Techniques
254 Avoid recompiling
254 Use non-capturing parentheses
254 Don't add superfluous parentheses
254 Don't use superfluous character classes
255 Use leading anchors
255 Expose Literal Text
255 Factor out required components from quantifiers
255 Factor out required components from the front of alternation
256 Expose Anchors
256 Expose ^ and \G at the front of expressions
256 Expose $ at the end of expressions
256 Lazy Versus Greedy: Be Specific
257 Split Into Multiple Regular Expressions
258 Mimic Initial-Character Discrimination
259 Don't do this with Tcl
259 Don't do this with PHP
259 Use Atomic Grouping and Possessive Quantifiers
260 Lead the Engine to a Match
260 Put the most likely alternative first
261 Distribute into the end of alternation
261 Unrolling the Loop
262 Method 1: Building a Regex From Past Experiences
263 Constructing a general unrolling-the-loop pattern
264 The Real Unrolling-the-Loop Pattern
264 Avoiding the neverending match
266 General things to look out for
266 Method 2: A Top-Down View
267 Method 3: An Internet Hostname
268 Observations
268 Using Atomic Grouping and Possessive Quantifiers
269 Making a neverending match safe with possessive quantifiers
269 Making a neverending match safe with atomic grouping
270 Short Unrolling Examples
270 Unrolling multi-character quotes
270 Unrolling the continuation-line example
271 Unrolling the CSV regex
272 Unrolling C Comments
272 To unroll or to not unroll...
273 A direct approach
274 Making it work
275 Unrolling the C loop
277 The Freeflowing Regex
277 A Helping Hand to Guide the Match
279 A Well-Guided Regex is a Fast Regex
281 Wrapup
281 In Summary: Think!
283 Chapter 7 Perl
285 Regular Expressions as a Language Component
286 Perl's Greatest Strength
286 Perl's Greatest Weakness
286 Perl's Regex Flavor
288 Regex Operands and Regex Literals
289 Features supported by regex literals
291 Picking your own regex delimiters
292 How Regex Literals Are Parsed
292 Regex Modifiers
293 Regex-Related Perlisms
294 Expression Context
294 Contorting an expression
295 Dynamic Scope and Regex Match Effects
295 Global and private variables
295 Dynamically scoped values
298 A better analogy: clear transparencies
298 Regex side effects and dynamic scoping
299 Dynamic scoping versus lexical scoping
299 Special Variables Modified by a Match
303 Using $1 within a regex?
303 The qr/.../ Operator and Regex Objects
303 Building and Using Regex Objects
304 Match modes (or lack thereof) are very sticky
305 Viewing Regex Objects
306 Using Regex Objects for Efficiency
306 The Match Operator
307 Match's Regex Operand
307 Using a regex literal
307 Using a regex object
308 The default regex
308 Special match-once ?...?
308 Specifying the Match Target Operand
308 The default target
309 Negating the sense of the match
309 Different Uses of the Match Operator
310 Normal "does this match?" -- scalar context without /g
310 Normal "pluck data from a string" -- list context, without /g
311 Pluck all matches -- list context, with the /g modifier
312 Iterative Matching: Scalar Context, with /g
313 The current match location and the pos() function
314 Pre-setting a string's pos
315 Using \G
315 Tag-team matching with /gc
316 Pos-related summary
316 The Match Operator's Environmental Relations
317 The match operator's side effects
317 Outside influences on the match operator
318 Keeping your mind in context (and context in mind)
318 The Substitution Operator
319 The Replacement Operand
319 The /e Modifier
320 Multiple uses of /e
321 Context and Return Value
321 The Split Operator
322 Basic Split
322 Basic match operand
322 Target string operand
323 Basic chunk-limit operand
323 Advanced split
324 Returning Empty Elements
324 Trailing empty elements
324 The chunk-limit operand's second job
324 Special matches at the ends of the string
325 Split's Special Regex Operands
325 Split has no side effects
326 Split's Match Operand with Capturing Parentheses
326 Fun with Perl Enhancements
328 Using a Dynamic Regex to Match Nested Pairs
331 Using the Embedded-Code Construct
331 Using embedded code to display match-time information
332 Using embedded code to see all matches
334 Finding the longest match
335 Finding the longest-leftmost match
335 Using local in an Embedded-Code Construct
338 A Warning About Embedded Code and my Variables
340 Matching Nested Constructs with Embedded Code
341 Overloading Regex Literals
341 Adding start- and end-of-word metacharacters
343 Adding support for possessive quantifiers
344 Problems with Regex-Literal Overloading
344 Mimicking Named Capture
347 Perl Efficiency Issues
348 There's More Than One Way to Do It
348 Regex Compilation, the /o Modifier, qr/.../, and Efficiency
350 The internal mechanics of preparing a regex
350 Perl steps to reduce regex compilation
352 The "compile once" /o modifier
353 Using regex objects for efficiency
354 Using the default regex for efficiency
355 Understanding the Pre-Match Copy
355 Pre-match copy supports $1, $&, $', $+, ...
355 The pre-match copy is not always needed
356 How expensive is the pre-match copy?
357 Avoiding the pre-match copy
359 The Study Function
359 When not to use study
360 When study can help
360 Benchmarking
361 Regex Debugging Information
362 Run-time debugging information
363 Other ways to invoke debugging messages
363 Final Comments
365 Chapter 8 Java
366 Java's Regex Flavor
369 Java Support for \p{...} and \P{...}
369 Unicode properties
369 Unicode blocks
369 Special Java character properties
370 Unicode Line Terminators
371 Using java.util.regex
372 The Pattern.compile() Factory
373 Pattern's matcher method
373 The Matcher Object
375 Applying the Regex
376 Querying Match Results
378 Match-result example
378 Simple Search and Replace
379 Simple search and replace examples
380 The replacement argument
380 Advanced Search and Replace
381 Search-and-replace examples
382 In-Place Search and Replace
383 Using a different-sized replacement
384 The Matcher's Region
385 Points to keep in mind
386 Setting and inspecting region bounds
386 Looking outside the current region
387 Transparent bounds
388 Anchoring bounds
389 Method Chaining
389 Methods for Building a Scanner
391 Examples illustrating hitEnd and requireEnd
392 The hitEnd bug and its workaround
392 Other Matcher Methods
394 Querying a matcher's target text
394 Other Pattern Methods
395 Pattern's split Method, with One Argument
396 Empty elements with adjacent matches
396 Pattern's split Method, with Two Arguments
396 Split with a limit less than zero
396 Split with a limit of zero
396 Split with a limit greater than zero
397 Additional Examples
397 Adding Width and Height Attributes to Image Tags
399 Validating HTML with Multiple Patterns Per Matcher
401 Parsing Comma-Separated Values (CSV) Text
401 Java Version Differences
402 Differences Between 1.4.2 and 1.5.0
402 New methods in Java 1.5.0
402 Unicode-support differences between 1.4.2 and 1.5.0
403 Differences Between 1.5.0 and 1.6
405 Chapter 9 .NET
406 .NET's Regex Flavor
409 Additional Comments on the Flavor
409 Named capture
409 Conditional tests
410 Compiled expressions
411 Right-to-left matching
412 Backslash-digit ambiguities
412 ECMAScript mode
413 Using .NET Regular Expressions
413 Regex Quickstart
413 Quickstart: Checking a string for match
414 Quickstart: Matching and getting the text matched
414 Quickstart: Matching and getting captured text
414 Quickstart: Search and replace
415 Package Overview
415 Importing the regex namespace
416 Core Object Overview
416 Regex objects
417 Match objects
418 Group objects
418 Capture objects
418 All results are computed at match time
418 Core Object Details
419 Creating Regex Objects
419 Catching exceptions
419 Regex options
421 Using Regex Objects
427 Using Match Objects
430 Using Group Objects
431 Static Convenience Functions
432 Regex Caching
432 Support Functions
434 Advanced .NET
434 Regex Assemblies
436 Matching Nested Constructs
437 Capture Objects
439 Chapter 10 PHP
441 PHP's Regex Flavor
443 The Preg Function Interface
444 Pattern Arguments
444 PHP single-quoted strings
445 Delimiters
446 Pattern modifiers
449 The Preg Functions
449 preg_match
450 Capturing match data
450 Trailing non-participatory elements stripped
451 Named capture
452 Getting more details on the match: PREG_OFFSET_CAPTURE
453 The offset argument
453 preg_match_all
454 Collecting match data
456 preg_match_all and the PREG_OFFSET_CAPTURE flag
457 preg_match_all with named capture
458 preg_replace
459 Basic one-string, one-pattern, one-replacement preg_replace
460 Multiple subjects, patterns, and replacements
463 preg_replace_callback
465 A callback versus the e pattern modifier
465 preg_split
466 preg_split's limit argument
468 preg_split's flag arguments
469 preg_grep
470 preg_quote
471 Missing Preg Functions
472 preg_regex_to_pattern
472 The problem
472 The solution
474 Syntax-Checking an Unknown Pattern Argument
475 Syntax-Checking an Unknown Regex
475 Recursive Expressions
475 Matching Text with Nested Parentheses
476 Recursive reference to a set of capturing parentheses
476 Recursive reference via named capture
477 More on possessive quantifiers
477 No Backtracking Into Recursion
478 Matching a Set of Nested Parentheses
478 PHP Efficiency Issues
478 The S Pattern Modifier: Study
479 Standard optimizations, without the S pattern modifier
479 Enhancing the optimization with the S pattern modifier
480 When the S pattern modifier can't help
480 Suggested use
480 Extended Examples
480 CSV Parsing with PHP
481 Checking Tagged Data for Proper Nesting
481 The main body of this expression
483 Possessive quantifiers
483 Real-world XML
484 HTML?
485 Index