Regular Expression

The magic of string manipulation.

Regular Expression - A Comprehensive Guide

Contents

Introduction
Live Tester
Practical Syntax
Comprehensive Reference
Conclusion

1. Introduction

Text - whether it’s log files, configuration data, user input, or documents - is everywhere in programming. From simple scripts to full-blown applications, developers frequently need to search, validate, extract, or transform pieces of text. While many IDEs and editors (like Notepad++, VS Code, or Sublime Text) offer basic find-and-replace functionality, they quickly reach their limits when you need to handle patterns, conditional logic, or repeated structures. That’s where regular expressions (regex) come in.

At its core, a regular expression is a compact, declarative way to describe a set of strings. Instead of manually scanning through text to find, say, all warning messages in a log, you can write a single pattern that finds them for you. The same approach works for timestamps, log levels, filenames, email addresses, URLs - you name it. Once you understand the fundamentals, regex can become one of the most powerful “weapons” in your developer toolbox.

Consider a typical log snippet:

2025-05-31 14:23:48 [Info] Starting application...
2025-05-31 14:23:50 [Info] Loading application data...
2025-05-31 14:23:55 [Warning] Missing font file: New Times Roman.
2025-05-31 14:24:22 [Info] The application has started, with 1 warning.
2025-05-31 14:25:01 [Error] Unhandled exception: NullReferenceException.

If you want to answer questions like:

“What warnings were emitted, and when?”
“Which lines contain error messages?”
“How many times did the application start?”

You could write a script in your favorite language that splits on spaces, checks tokens, and does string comparisons. But that quickly becomes tedious: parsing variability in timestamps, handling optional fields, or ignoring whitespace quirks can make your code fragile.

With a well-crafted regex, however, you can match on the entire log line, capture the timestamp, the log level, and the rest of the message - all in one pattern. More importantly, regex can handle many text-based formats that have a consistent structure: simple CSV files, individual HTML/XML tags, custom configuration formats, and so on.

By the end of this article, you will:

Understand the basic motivation behind regular expressions.
See practical syntax elements and examples for day-to-day tasks.
Have a handy reference that summarizes the most commonly used regex constructs.

Whether you’ve done a bit of “find and replace” in an editor, or you’ve written simple string processing code, this guide will bridge the gap and show you how to level up with regular expressions.

2. Live Tester

Regex Tester (.NET / C# syntax)

Options: Ignore Case (i), Multiline (m), Singleline / DotAll (s). Global (g) is always on, so all matches are shown.

Inputs: Pattern, Test Text, and an optional Replace With string.

3. Practical Syntax

In this section, we’ll cover the core building blocks of regex patterns. Each subsection introduces a concept, followed by real-world examples - some based on log processing, others on everyday tasks like validating email addresses or extracting numbers from text.

3.1 Literal Characters and Simple Matches

At the simplest level, a regex can be just literal characters. For example:

Pattern Error will match any occurrence of the substring “Error” in a text.
Pattern 2025-05-31 will match that exact date string.

However, real-world text rarely lines up perfectly. Log messages, for instance, include varying timestamps and different log levels. To match them in a generic way, you need metacharacters.

Key Point: Most characters in a regex match themselves literally, except for special metacharacters (like ., *, ?, \, etc.). To match a literal period, bracket, or other metacharacter, you must “escape” it with a backslash (\. to match a literal dot, \[ to match a literal left bracket, and so on).
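
As a small illustration of escaping, the Python sketch below contrasts an unescaped dot with an escaped one and uses the standard re.escape helper to escape a literal string automatically. Python is used here only for brevity, and the sample text and variable names are made up for the example.

```python
import re

text = "version 1.5 and version 1x5"

# Unescaped dot: "." matches any character, so both "1.5" and "1x5" match.
loose = re.findall(r"1.5", text)

# Escaped dot: "\." matches only a literal period.
strict = re.findall(r"1\.5", text)

# re.escape builds the escaped form of an arbitrary literal string.
escaped = re.escape("[Info]")

print(loose)    # ['1.5', '1x5']
print(strict)   # ['1.5']
print(escaped)  # \[Info\]
```

Escaping by hand is fine for short patterns; a helper like re.escape is mainly useful when the literal text comes from user input or configuration.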

3.2 Character Classes (`[]`) and Ranges

A character class is a bracketed set of characters that matches exactly one character from the set. For example:

[AEIOU] matches any single uppercase vowel.
[0-9] matches any digit from 0 through 9.
[A-Za-z0-9_] matches a typical word character (letter, digit, or underscore).

Example 1: Matching a Four-Digit Year

Suppose you want to match any year between 1900 and 2099:

^(19|20)[0-9]{2}$

^ and $ are anchors (we’ll cover them in a moment).
(19|20) matches either “19” or “20”.
[0-9]{2} matches exactly two digits.
Altogether, this matches “1900” through “2099”.
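
To see the year pattern in action, here is a minimal Python sketch. The candidate strings are invented for the example; the pattern itself is the one shown above.

```python
import re

YEAR = re.compile(r"^(19|20)[0-9]{2}$")

candidates = ["1900", "1999", "2025", "2099", "1899", "2100", "20255"]

# Keep only the strings that the pattern accepts in full.
valid = [c for c in candidates if YEAR.match(c)]

print(valid)  # ['1900', '1999', '2025', '2099']
```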

Example 2: Extracting Log Levels

Given a log line like:

2025-05-31 14:23:55 [Warning] Missing font file: New Times Roman.

You can match the log level (Info, Warning, Error, etc.) with:

\[(Info|Warning|Error|Debug)\]

\[ and \] match literal square brackets.
(Info|Warning|Error|Debug) matches one of those four words.

If you expect additional levels (e.g., Trace or Fatal), you can add them inside the parentheses:

\[(?:Info|Warning|Error|Debug|Trace|Fatal)\]

Here we’ve used (?: … ) to create a non-capturing group (we’ll talk about groups more in section 3.5).
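
The difference between the two forms is easy to observe in code. The following Python sketch (the log text is taken from the introduction; variable names are illustrative) runs both patterns through re.findall, which returns the group contents when a capturing group is present and the whole match otherwise.

```python
import re

log = """2025-05-31 14:23:48 [Info] Starting application...
2025-05-31 14:23:55 [Warning] Missing font file: New Times Roman.
2025-05-31 14:25:01 [Error] Unhandled exception: NullReferenceException."""

# Capturing group: findall returns only the text inside the parentheses.
levels = re.findall(r"\[(Info|Warning|Error|Debug)\]", log)

# Non-capturing group: findall returns the whole match, brackets included.
tags = re.findall(r"\[(?:Info|Warning|Error|Debug|Trace|Fatal)\]", log)

print(levels)  # ['Info', 'Warning', 'Error']
print(tags)    # ['[Info]', '[Warning]', '[Error]']
```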

3.3 Quantifiers: Repetition Control

Quantifiers specify how many times the preceding element should match. The most common quantifiers are:

* – Match zero or more times (greedy).
+ – Match one or more times (greedy).
? – Match zero or one time (makes the preceding token optional).
{n} – Match exactly n times.
{n,m} – Match between n and m times (inclusive).
{n,} – Match n or more times.

Example 3: Matching Timestamps

A timestamp like 2025-05-31 14:23:48 can be matched with:

\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}

\d is shorthand for [0-9].
\d{4} matches exactly four digits (the year).
- matches a literal hyphen.
\d{2} matches two digits (month and day).
\s+ matches one or more whitespace characters (space, tab).
\d{2}:\d{2}:\d{2} matches HH:MM:SS.

If you’re sure there’s exactly one space between date and time, you could use \s instead of \s+.
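
A short Python sketch of the timestamp pattern follows. It also shows a limitation worth keeping in mind: the pattern checks the shape of the text only, not whether the date is valid. The second input string is a deliberately nonsensical example.

```python
import re

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}")

line = "2025-05-31 14:23:48 [Info] Starting application..."
m = TIMESTAMP.search(line)
found = m.group(0) if m else None
print(found)  # 2025-05-31 14:23:48

# The pattern only describes digits and separators, so an impossible
# date with the right shape is accepted as well.
shape_only = TIMESTAMP.search("9999-99-99 99:99:99") is not None
print(shape_only)  # True
```

If real calendar validation matters, one common approach is to let the regex find candidates and then hand them to a date parser.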

Example 4: Optional File Extension

Suppose you have filenames like report.txt or report. To match both, you could use:

^report(?:\.txt)?$

^ and $ ensure we match the entire string.
(?:\.txt)? means “optionally match ‘.txt’”.
- \. matches a literal dot.
- txt matches “txt”.
- The ? after the group makes it optional.
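
As a quick check of the optional group, this Python sketch tries the pattern against a few filenames. The names in the list are made up for illustration.

```python
import re

REPORT = re.compile(r"^report(?:\.txt)?$")

names = ["report", "report.txt", "report.csv", "reporttxt", "my_report.txt"]

# Only the bare name and the name with the exact ".txt" suffix pass.
matched = [n for n in names if REPORT.match(n)]

print(matched)  # ['report', 'report.txt']
```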

3.4 Anchors: Positioning Matches

Anchors specify positions in the text rather than actual characters:

^ – Start of the line/string.
$ – End of the line/string.
\b – Word boundary (between \w and \W).
\B – Non-word boundary.

Example 5: Lines Starting with “Error”

If you want to catch any line that begins with the word “Error” (as a standalone word), you can write:

^Error\b.*$

^Error ensures the line starts with “Error”.
\b asserts a word boundary so that “Errors” or “ErrorCode” are not matched.
.* matches the rest of the line (any character, zero or more times).
$ anchors the end of line.
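
The sketch below applies this pattern to a small multi-line string in Python. Note the re.MULTILINE flag: without it, ^ and $ would only match at the start and end of the whole string. The sample text is invented for the example.

```python
import re

text = "Error: disk full\nErrors were logged\nErrorCode 7\nNo Error here\nError"

# re.MULTILINE makes ^ and $ match at every line, not just the whole string.
hits = re.findall(r"^Error\b.*$", text, flags=re.MULTILINE)

print(hits)  # ['Error: disk full', 'Error']
```

"Errors" and "ErrorCode" are rejected by \b, and "No Error here" is rejected by ^ because the word is not at the start of the line.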

3.5 Groups and Capturing

Parentheses ( and ) serve two main purposes:

Grouping: Treat multiple tokens as a single unit (often used with quantifiers or alternation).
Capturing: Store the text matched by that group for later retrieval (in code or in a replacement string).

Example 6: Capturing Timestamps and Levels

Given:

2025-05-31 14:23:55 [Warning] Missing font...

You might write:

^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+\[(Info|Warning|Error|Debug)\]\s+(.*)$

(\d{4}-\d{2}-\d{2}) captures the date (Group 1).
(\d{2}:\d{2}:\d{2}) captures the time (Group 2).
(Info|Warning|Error|Debug) captures the log level (Group 3).
(.*) captures the rest of the message (Group 4).
In most languages, after matching, you can extract Group 1, Group 2, etc.:
- e.g., in C#: match.Groups[1].Value gives the date.

If you want to group without capturing (for alternation or quantifier purposes), use (?: … ). For instance:

(?:Info|Warning|Error|Debug)

does not create a numbered backreference.

3.6 Common Metacharacters and Escape Sequences

Some of the most frequently used shorthand and metacharacters include:

. – Matches any character except newline (in most flavors; some allow a “dotall” mode to match newline).
\d – Digit (equivalent to [0-9]).
\D – Non-digit (equivalent to [^0-9]).
\w – Word character (letter, digit, underscore; [A-Za-z0-9_]).
\W – Non-word character (anything not in \w).
\s – Whitespace character (space, tab, newline, etc.).
\S – Non-whitespace character.
\t, \n, \r – Tab, newline, carriage return.

Tip: When writing regexes in code, you often need to escape backslashes. For instance, in C# you can use verbatim string literals (@"\d{4}-\d{2}-\d{2}") to avoid double escapes.

3.7 Practical Example: Extracting Email Addresses

Say you have a document or a configuration file and need to pull out all email addresses. A simple - but not fully RFC‑compliant - pattern is:

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

\b – Ensure we start at a word boundary.
[A-Za-z0-9._%+-]+ – One or more letters, digits, dots, underscores, percent signs, plus, or hyphens (the local part before the @).
@ – Literal “@” symbol.
[A-Za-z0-9.-]+ – One or more letters, digits, dots, or hyphens (domain).
\. – Literal dot.
[A-Za-z]{2,} – At least two letters for the TLD (e.g., “com”, “org”, “io”).
\b – End at a word boundary to prevent partial matches (so we don’t match trailing punctuation).

Testing It

Given:

Please contact alice@example.com or bob.smith@my-domain.io for more info. Alternatively, reach out to admin@localhost if testing locally.

alice@example.com matches.
bob.smith@my-domain.io matches.
admin@localhost does not match because the pattern requires a dot followed by at least two letters after the domain, and localhost has neither.
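
The same test can be reproduced with a few lines of Python. This is a minimal sketch using the pattern and sample text from above; only the variable names are new.

```python
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

text = (
    "Please contact alice@example.com or bob.smith@my-domain.io for more info. "
    "Alternatively, reach out to admin@localhost if testing locally."
)

# The pattern has no capturing groups, so findall returns whole matches.
emails = EMAIL.findall(text)

print(emails)  # ['alice@example.com', 'bob.smith@my-domain.io']
```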

3.8 Practical Example: Validating North American Phone Numbers

North American Numbering Plan (NANP) phone numbers often look like:

123-456-7890
(123) 456-7890
123.456.7890
+1 (123) 456-7890

A practical regex might be:

^(\+1\s?)?           # Optional country code +1 and a space
(?:\(\d{3}\)|\d{3})  # Either (123) or 123
[ .-]?               # Separator: space, dot, or hyphen (optional)
\d{3}                # Next three digits
[ .-]?               # Separator again (optional)
\d{4}$               # Last four digits

When compacted (and removing whitespace/comments):

^(\+1\s?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}$

^ and $ anchor the pattern to the entire string.
(\+1\s?)? – Optionally match “+1” followed by an optional space.
(?:\(\d{3}\)|\d{3}) – Either three digits inside parentheses or three digits without.
[ .-]? – Optionally match a space, a dot, or a hyphen.
\d{3} – Three digits.
[ .-]? – Separator again.
\d{4} – Four digits.

This pattern will match most common North American formats. By tweaking the groups and quantifiers, you can adapt it for other regions or stricter/looser rules.
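
To check the compact pattern against the formats listed earlier, here is a small Python sketch. The last three sample strings are additional test cases invented for the example.

```python
import re

PHONE = re.compile(r"^(\+1\s?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}$")

samples = [
    "123-456-7890",
    "(123) 456-7890",
    "123.456.7890",
    "+1 (123) 456-7890",
    "1234567890",      # separators are optional, so this passes too
    "123-45-6789",     # wrong digit grouping
    "(123 456-7890",   # unbalanced parenthesis
]

# Map each sample to True/False depending on whether the pattern accepts it.
results = {s: bool(PHONE.match(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

The first five samples are accepted and the last two are rejected. Because both separators are independently optional, mixed forms such as 123 456.7890 are accepted as well, which may or may not be what you want.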

3.9 Practical Example: Parsing Log Files

Returning to our log example, suppose you want to extract only the warning messages and their timestamps. You might apply a regex like:

^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[Warning\]\s+(.*)$

Group 1 captures the timestamp (2025-05-31 14:23:55).
Group 2 captures the message itself (Missing font file: New Times Roman.).

In code, this looks as follows:

C# (using System.Text.RegularExpressions):

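using System;
using System.Text.RegularExpressions;

// logContents is assumed to hold the full log text as a single string.
var pattern = @"^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[Warning\]\s+(.*)$";
// RegexOptions.Multiline lets ^ and $ match at every line of the log.
var matches = Regex.Matches(logContents, pattern, RegexOptions.Multiline);
foreach (Match m in matches) {
    string timestamp = m.Groups[1].Value;
    string message   = m.Groups[2].Value;
    Console.WriteLine($"[{timestamp}] Warning: {message}");
}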

Python (using the re module):

import re

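pattern = r'^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[Warning\]\s+(.*)$'
# log_contents is assumed to hold the full log text as a single string.
for line in log_contents.splitlines():
    # Each line is tested on its own, so no multiline flag is needed.
    m = re.match(pattern, line)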
    if m:
        timestamp = m.group(1)
        message   = m.group(2)
        print(f"[{timestamp}] Warning: {message}")

Even if you’ve only ever clicked “Replace” in an editor, these examples should give you a clear sense of how powerful a single regex can be. You define “match everything up until [Warning], then capture whatever comes next”, and you get exactly the data you want.

3.10 Testing and Debugging Regex

When you’re new to regex, it’s easy to overlook small mistakes - missing an escape, misplacing a quantifier, forgetting an anchor. Here are a few tips:

Use an Online Tester: Websites like regex101.com (PCRE, JavaScript, Python), RegexStorm (for .NET), or regexr.com let you type your pattern, paste test text, and see matches highlighted in real time.
Start Simple, Then Grow: If you need a complex pattern, start with a small piece. For instance, test just \d{4}-\d{2}-\d{2} on the date first, then add time, then add log level, etc.

Comment Your Patterns (When Supported): Some languages/flavors support the “extended” or “free-spacing” mode, where you can add whitespace and # comments inside your regex:

(?x)            # Enable free-spacing mode
^               # Start of line
(\d{4}-\d{2}-\d{2})   # Group 1: Date
\s+             # One or more spaces
(\d{2}:\d{2}:\d{2})   # Group 2: Time
\s+             # One or more spaces
\[Warning\]     # Literal [Warning]
\s+             # One or more spaces
(.*)            # Group 3: Message text
$

This makes maintenance and future edits easier, especially for complex regexes.

Escape Early: If you need to match special characters (., *, ?, +, [, ], (, ), |, ^, $, \), always escape them with a backslash unless you explicitly want their special behavior.
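
As an illustration of the free-spacing tip, the following Python sketch compiles the commented warning pattern with the re.VERBOSE flag, which is Python's name for this mode. The sample line comes from the log in the introduction; the variable names are illustrative.

```python
import re

# re.VERBOSE ignores unescaped whitespace in the pattern and treats
# everything from "#" to the end of the line as a comment.
WARNING_LINE = re.compile(r"""
    ^                      # Start of line
    (\d{4}-\d{2}-\d{2})    # Group 1: Date
    \s+                    # One or more whitespace characters
    (\d{2}:\d{2}:\d{2})    # Group 2: Time
    \s+
    \[Warning\]            # Literal [Warning]
    \s+
    (.*)                   # Group 3: Message text
    $
""", re.VERBOSE)

line = "2025-05-31 14:23:55 [Warning] Missing font file: New Times Roman."
m = WARNING_LINE.match(line)
parts = m.groups() if m else ()

print(parts)
# ('2025-05-31', '14:23:55', 'Missing font file: New Times Roman.')
```

In this mode a literal space in the pattern has to be written as \s, "\ ", or placed inside a character class, since bare whitespace is ignored.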

4. Comprehensive Reference

Below is a summary of the most commonly used regular expression constructs. Keep this as a quick “cheat sheet” for reference.

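Note: Regex flavors (PCRE, Python, JavaScript, .NET, Java, etc.) differ in some details, but most of the items below work universally. Where there are flavor-specific notes, they are indicated. One difference to keep in mind: in .NET and Python 3, \d, \w, and \s are Unicode-aware by default, so the ASCII equivalents shown below (such as [0-9] for \d) are approximations in those flavors.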

4.1 Basic Metacharacters

Metacharacter	Meaning
`.`	Matches any character except newline (unless “dotall”/“singleline” mode is enabled).
`^`	Matches the start of a line/string.
`$`	Matches the end of a line/string.
`\b`	Word boundary (transition between `\w` and `\W`).
`\B`	Non-word boundary (opposite of `\b`).
`\d`	Digit, equivalent to `[0-9]`.
`\D`	Non-digit, equivalent to `[^0-9]`.
`\w`	Word character, equivalent to `[A-Za-z0-9_]`.
`\W`	Non-word character, equivalent to `[^A-Za-z0-9_]`.
`\s`	Whitespace (space, tab, newline, vertical tab, form feed, etc.).
`\S`	Non-whitespace (anything not matched by `\s`).
`\t`	Tab character.
`\n`	Newline character (LF).
`\r`	Carriage return (CR).
`\f`	Form feed.
`\v` (flavor-specific)	Vertical tab (supported in some flavors).
`\\`	Literal backslash.
`\.`	Literal dot.
`\*`	Literal asterisk.
`\+`	Literal plus.
`\?`	Literal question mark.
`\{` `\}`	Literal braces.
`\[` `\]`	Literal square brackets.
`\(` `\)`	Literal parentheses.
`\|`	Literal pipe (otherwise the alternation operator).
`\^`	Literal caret (otherwise the start anchor).
`\$`	Literal dollar sign (otherwise the end anchor).

4.2 Quantifiers

Quantifier	Meaning
`*`	Match the preceding token zero or more times (greedy).
`+`	Match the preceding token one or more times (greedy).
`?`	Match the preceding token zero or one time (makes it optional; also used for lazy quantifiers).
`{n}`	Match the preceding token exactly n times.
`{n,}`	Match the preceding token n or more times (no upper limit).
`{n,m}`	Match the preceding token at least n but not more than m times.
`*?`, `+?`, `??`, `{n,m}?`	The “lazy” or “non-greedy” versions of the above quantifiers (match as few characters as possible).

Greedy vs. Lazy:

Greedy quantifiers (*, +, {n,m}) will try to match as much text as possible while still allowing the rest of the pattern to succeed.

Lazy quantifiers (*?, +?, {n,m}?) match as little as possible to allow the overall match to succeed.

Example: Greedy versus Lazy

Pattern a.*b on text axxxbxxb will match axxxbxxb (greedy: from the first a to the last b).
Pattern a.*?b will match axxxb (lazy: from a to the first b).
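
The greedy and lazy results above can be confirmed with a two-line experiment. A minimal Python sketch:

```python
import re

text = "axxxbxxb"

# Greedy: .* grabs as much as it can, then backs off to the last "b".
greedy = re.search(r"a.*b", text).group(0)

# Lazy: .*? grows one character at a time and stops at the first "b".
lazy = re.search(r"a.*?b", text).group(0)

print(greedy)  # axxxbxxb
print(lazy)    # axxxb
```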

4.3 Character Classes and Sets

Syntax	Matches
`[abc]`	Any one of the characters `a`, `b`, or `c`.
`[^abc]`	Any character except `a`, `b`, or `c`.
`[a-z]`	Any lowercase letter from `a` to `z`.
`[A-Z]`	Any uppercase letter from `A` to `Z`.
`[0-9]`	Any digit from `0` to `9`.
`[A-Za-z0-9_]`	Any “word” character (often equivalent to `\w`).
`[[:alnum:]]` (POSIX)	Letters and digits (POSIX-style). Usable in some flavors (e.g., GNU grep and PCRE), but not in .NET, JavaScript, or Python’s re module.
`[[:space:]]`, `[[:digit:]]`, etc.	POSIX character classes (flavor support varies).

Example: Matching Hexadecimal Digits

^[0-9A-Fa-f]+$

Matches any non-empty string composed entirely of hexadecimal digits.

4.4 Alternation (`|`) and Grouping

Grouping: ( … )
- Capturing group: saves the matched text for backreferences or extraction.
- Non-capturing group: (?: … ) does not save the match; useful for grouping without capturing.
Alternation: A|B matches either pattern A or pattern B.

Example: Matching Multiple File Extensions

^.*\.(?:jpg|jpeg|png|gif|bmp)$

.*\. matches any filename up to and including the dot.
(?:jpg|jpeg|png|gif|bmp) matches one of those five extensions.
$ ensures the extension is at the end.

4.5 Anchors and Boundaries

Anchor	Meaning
`^`	Start of string (or line, with multiline mode).
`$`	End of string (or line, with multiline mode).
`\A`	Start of string (ignores multiline mode).
`\z` or `\Z`	End of string (`\z` is absolute; `\Z` allows an optional trailing newline). In Python’s re module, `\Z` is the absolute end.
`\b`	Word boundary (between `\w` and `\W`).
`\B`	Non-word boundary (opposite of `\b`).

Multiline Mode (m flag)

If you enable multiline mode (usually via a flag like RegexOptions.Multiline in .NET or re.MULTILINE in Python), then ^ and $ match the start/end of each line rather than the entire string.
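
The effect of the flag is easiest to see side by side. This Python sketch runs the same pattern with and without re.MULTILINE on an invented three-line string.

```python
import re

text = "first line\nsecond line\nthird line"

# Without re.MULTILINE, ^ matches only at the very start of the string.
default_hits = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_hits = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second', 'third']
```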

4.6 Lookarounds (Zero-Width Assertions)

Lookarounds allow you to assert that a match is (or is not) preceded or followed by something else, without including that “something else” in the match.

Construct	Meaning
`(?= … )`	Positive lookahead: what follows must match `…`.
`(?! … )`	Negative lookahead: what follows must not match `…`.
`(?<= … )`	Positive lookbehind: what precedes must match `…`.
`(?<! … )`	Negative lookbehind: what precedes must not match `…`.

Example: Matching Passwords without Digits

To match a 6–12 character password that does not contain any digits:

^(?=.{6,12}$)(?!.*\d).+$

(?=.{6,12}$) – Ensure length is between 6 and 12.
(?!.*\d) – Ensure there are no digits anywhere (.*\d means “somewhere a digit”; the (?! … ) says “the rest of the string must not match .*\d”).
.+$ – Finally, match one or more characters (any), up to the end.
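
A quick way to convince yourself that both lookaheads are applied is to run the pattern over a few candidates. In this Python sketch the candidate passwords are made up for the example.

```python
import re

PASSWORD = re.compile(r"^(?=.{6,12}$)(?!.*\d).+$")

candidates = [
    "secret",          # 6 characters, no digits
    "longerpassword",  # 14 characters: too long
    "abc12def",        # contains digits
    "short",           # 5 characters: too short
    "no-digits!",      # 10 characters, no digits
]

accepted = [c for c in candidates if PASSWORD.match(c)]

print(accepted)  # ['secret', 'no-digits!']
```

Both lookaheads are evaluated at the start of the string and consume nothing, so the final .+ still sees the whole input.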

4.7 Backreferences

Once you capture a group, you can refer to it later in the pattern (to enforce that two parts of the text are identical) or in a replacement string.

\1, \2, … refer to the text matched by Group 1, Group 2, etc. (in most flavors).
In replacement strings (e.g., in .NET or JavaScript), you might use $1, $2, etc.

Example: Matching HTML/XML Opening and Closing Tags

To ensure that an opening tag <tag> is closed by </tag> with the same name:

^<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>$

<([A-Za-z][A-Za-z0-9]*)\b[^>]*>
- Captures the tag name (e.g., div, span, p) in Group 1.
- \b ensures a word boundary after the tag name.
- [^>]* matches any attributes inside the opening tag.
(.*?)
- Lazily captures any content inside the element (Group 2).
</\1>
- Matches the exact tag name captured in Group 1 (e.g., </div>).

This simple pattern will match a complete element like <p class="intro">Hello</p>, but it does not handle nested elements of the same tag. For nested structures, you need more advanced parsing (often beyond what basic regex can handle robustly).
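
The backreference can be seen at work in the following Python sketch, which tries the pattern on one well-formed element and one with mismatched tags. The second input is invented for the example.

```python
import re

ELEMENT = re.compile(r"^<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>$")

# Well-formed element: \1 must repeat the captured tag name "p".
m = ELEMENT.match('<p class="intro">Hello</p>')
tag, content = m.group(1), m.group(2)
print(tag, content)  # p Hello

# Mismatched closing tag: \1 is "div", so "</span>" cannot satisfy it.
mismatch = ELEMENT.match("<div>Hello</span>")
print(mismatch)  # None
```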

4.8 Common Shorthands and Escape Sequences

Sequence	Meaning
`\d`	Digit (same as `[0-9]`).
`\D`	Non-digit (same as `[^0-9]`).
`\w`	Word character `[A-Za-z0-9_]`.
`\W`	Non-word character `[^A-Za-z0-9_]`.
`\s`	Whitespace `[ \t\r\n\f\v]`.
`\S`	Non-whitespace.
`\t`, `\n`, `\r`, `\f`	Tab, newline, carriage return, form feed.
`\0`	Null character (U+0000).
`\xhh`	Character with hex code `hh` (two hex digits).
`\uhhhh`	Unicode character with code `hhhh` (four hex digits). (Flavor-specific: .NET, Java, JavaScript, Python, etc.; JavaScript’s `u` flag additionally enables the braced `\u{…}` form.)
`\cX`	Control character (e.g., `\cA` is U+0001).
`\Q … \E`	In some flavors (like Java, PCRE, and Perl), `\Q` starts a quoting section (treat everything up to `\E` literally). .NET does not support it; use `Regex.Escape` there instead.

4.9 Possessive Quantifiers (`+` after a quantifier) and Atomic Groups (Flavor-Specific)

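Some regex flavors (like Java and PCRE) support possessive quantifiers (*+, ++, ?+, {n,m}+), which match as much as possible and never give any of it back through backtracking:

Example: a*+b
- a*+ matches as many a characters as possible and never backtracks to release them.
- If the input is “aaaaabc”:
  - a*+ matches “aaaaa”.
  - Then the engine tries to match b. Since the next character is b, the overall match is “aaaaab”.
- The difference from a greedy a* shows when a later token needs one of those characters: a*+ab can never match “aaab”, because a*+ consumes every a and refuses to give one back, whereas a*ab matches it.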

Atomic groups (?> … ) have a similar effect: once a subpattern inside (?> … ) matches, the engine won’t backtrack into it.

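Note: Not all flavors support possessive quantifiers or atomic groups. For example, .NET supports atomic groups but not possessive quantifiers, and Python’s re module supports both only from version 3.11. Consult your language/runtime documentation if you need these advanced features.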

4.10 Flags (Modifiers) and Modes

Most regex engines let you enable or disable certain behaviors via flags. Common flags include:

Flag	Meaning	Typical Syntax
`i`	Case-insensitive matching (`RegexOptions.IgnoreCase` in .NET).	In-line: `(?i)pattern`.
`m`	Multiline mode: `^` and `$` match start/end of lines, not just string start/end.	In .NET: `RegexOptions.Multiline` or `(?m)pattern`.
`s`	Single-line or dot-all mode: `.` matches newline as well.	In .NET: `RegexOptions.Singleline` or `(?s)pattern`.
`x`	Free-spacing/comment mode: ignores whitespace and allows `# comments`.	In .NET: `RegexOptions.IgnorePatternWhitespace` or `(?x)pattern`.
`U`	Ungreedy mode (PCRE): swap greedy and lazy behaviors (rarely used).	In PCRE: `(?U)pattern`.
`u`	Unicode mode (flavor-specific).	In JavaScript: the `/u` flag. .NET patterns are Unicode-aware by default; `RegexOptions.ECMAScript` restricts the shorthand classes to ASCII.

Example: Inline Flags

(?im)^[a-z]+\d+\s+$

(?i) means “case-insensitive”.
(?m) means “multiline”.
Without flags, ^[a-z]+ would match lowercase letters only and only at the start of the entire string; with i and m, it matches lowercase or uppercase on each line.
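
Here is a small Python sketch comparing the pattern with and without the inline flags. The four-line sample text is invented for the example; Python requires global inline flags such as (?im) to appear at the very start of the pattern, as they do here.

```python
import re

text = "ABC123 \nxyz7\t\nnodigits \n42abc "

# With (?im): letters match in either case, and ^ / $ apply to every line.
with_flags = re.findall(r"(?im)^[a-z]+\d+\s+$", text)

# Without flags: ^ only matches at position 0, where "A" is not in [a-z].
without_flags = re.findall(r"^[a-z]+\d+\s+$", text)

print(with_flags)     # ['ABC123 ', 'xyz7\t']
print(without_flags)  # []
```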

4.11 Replacement Patterns

When you use regex to perform replacements, you often use backreferences in the replacement string:

In JavaScript and .NET, you use $1, $2, …
In Python’s re module, you use \1, \2, … (or \g<1>, \g<2>, …).
In sed and some other older tools, \1, \2 are used as well.

Example: Swap “Last, First” to “First Last”

Original text:

Smith, John
Doe, Jane

Regex find pattern:

^([A-Za-z]+),\s+([A-Za-z]+)$

Replacement pattern:

$2 $1

After replacement:

John Smith
Jane Doe

([A-Za-z]+) captures the last name in Group 1.
([A-Za-z]+) captures the first name in Group 2.
$2 $1 swaps them in the output.
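
The same swap can be sketched in Python with re.sub. Python's re module spells the replacement as \2 \1, and re.MULTILINE is needed so that ^ and $ apply to each line of the input.

```python
import re

names = "Smith, John\nDoe, Jane"

# r"\2 \1" emits Group 2 (first name), a space, then Group 1 (last name).
swapped = re.sub(
    r"^([A-Za-z]+),\s+([A-Za-z]+)$",
    r"\2 \1",
    names,
    flags=re.MULTILINE,
)

print(swapped)
# John Smith
# Jane Doe
```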

5. Conclusion

By now you should have a solid grasp of:

Why regular expressions are invaluable for text processing.
How to write basic patterns: literals, character classes, quantifiers, anchors.
How to use groups, lookarounds, and backreferences for more advanced matching.
Which tokens and constructs constitute the “core” of most practical regexes.
Where to look when you need to refresh your memory (the reference section above, or a site like regexstorm.net/reference).

With a few dozen minutes of practice - writing regexes against sample text - you’ll rapidly gain confidence. Remember:

Test early and often in an online tester or your IDE’s “Find in Files” - seeing the matches in real time is far more enlightening than manually eyeballing code.
Readability matters. If the pattern becomes unreadable, use the free-spacing mode ((?x)) and add comments.
Keep it simple for common tasks. You rarely need the most arcane syntax; matching log levels, dates, email addresses, or basic CSV fields can usually be done with straightforward character classes and quantifiers.

Once you’ve mastered the basics, you can explore more advanced topics like:

Atomic grouping and possessive quantifiers (for performance-critical patterns).
Recursive patterns (in PCRE or some advanced flavors) to match nested constructs.
Conditional subpatterns (in PCRE or .NET) where you test if a group has matched before deciding the next part of the pattern.

But those are topics for another day. For now, lean on the examples and reference sheet above, and try applying regex to your next text‑processing challenge. In no time, you’ll find yourself thinking in patterns - and you won’t look back.

Congratulations on starting your pattern crafting journey! Keep this guide handy, bookmark your favorite regex tester, and remember that a well‑written regular expression often turns hours of manual parsing into a single line of elegant code.

References

RegexStorm for C#