Text - whether it’s log files, configuration data, user input, or documents - is everywhere in programming. From simple scripts to full-blown applications, developers frequently need to search, validate, extract, or transform pieces of text. While many IDEs and editors (like Notepad++, VS Code, or Sublime Text) offer basic find-and-replace functionality, they quickly reach their limits when you need to handle patterns, conditional logic, or repeated structures. That’s where regular expressions (regex) come in.
At its core, a regular expression is a compact, declarative way to describe a set of strings. Instead of manually scanning through text to find, say, all warning messages in a log, you can write a single pattern that matches timestamps, log levels, filenames, email addresses, URLs - you name it. Once you understand the fundamentals, regex can become one of the most powerful “weapons” in your developer toolbox.
Consider a typical log snippet:
2025-05-31 14:23:48 [Info] Starting application...
2025-05-31 14:23:50 [Info] Loading application data...
2025-05-31 14:23:55 [Warning] Missing font file: New Times Roman.
2025-05-31 14:24:22 [Info] The application has started, with 1 warning.
2025-05-31 14:25:01 [Error] Unhandled exception: NullReferenceException.
To answer questions about a log like this, you could write a script in your favorite language that splits on spaces, checks tokens, and does string comparisons. But that quickly becomes tedious: coping with variations in timestamps, handling optional fields, or ignoring whitespace quirks can make your code fragile.
With a well-crafted regex, however, you can match on the entire log line, capture the timestamp, the log level, and the rest of the message - all in one pattern. More importantly, regex works on most text-based formats that have a consistent structure: CSV-like files, simple HTML/XML tags, custom configuration formats, and so on.
By the end of this article, you will:
Whether you’ve done a bit of “find and replace” in an editor, or you’ve written simple string processing code, this guide will bridge the gap and show you how to level up with regular expressions.
In this section, we’ll cover the core building blocks of regex patterns. Each subsection introduces a concept, followed by real-world examples - some based on log processing, others on everyday tasks like validating email addresses or extracting numbers from text.
At the simplest level, a regex can be just literal characters. For example:
- `Error` will match any occurrence of the substring “Error” in a text.
- `2025-05-31` will match that exact date string.

However, real-world text rarely lines up perfectly. Log messages, for instance, include varying timestamps and different log levels. To match them in a generic way, you need metacharacters.
Key Point: Most characters in a regex match themselves literally, except for special metacharacters (like `.`, `*`, `?`, `\`, etc.). To match a literal period, bracket, or other metacharacter, you must “escape” it with a backslash (`\.` to match a literal dot, `\[` to match a literal left bracket, and so on).
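To see the difference between a metacharacter and its escaped form, here is a minimal sketch using Python’s built-in `re` module. The sample strings are made up for the illustration.

```python
import re

# Unescaped dot: "." is a metacharacter that matches any single character.
loose = re.compile(r"3.14")
# Escaped dot: "\." matches only a literal period.
strict = re.compile(r"3\.14")

samples = ["3.14", "3x14"]  # hypothetical inputs

loose_hits = [s for s in samples if loose.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)   # both strings match the unescaped pattern
print(strict_hits)  # only the string with a real dot matches
```

The unescaped pattern accepts `3x14` because the dot stands for any character; escaping it restricts the match to a literal period.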
Character Classes (`[]`) and Ranges
A character class is a bracketed set of characters that matches exactly one character from the set. For example:
- `[AEIOU]` matches any single uppercase vowel.
- `[0-9]` matches any digit from 0 through 9.
- `[A-Za-z0-9_]` matches a typical word character (letter, digit, or underscore).

Suppose you want to match any year between 1900 and 2099:
^(19|20)[0-9]{2}$
- `^` and `$` are anchors (we’ll cover them in a moment).
- `(19|20)` matches either “19” or “20”.
- `[0-9]{2}` matches exactly two digits.

Given a log line like:
2025-05-31 14:23:55 [Warning] Missing font file: New Times Roman.
You can match the log level (`Info`, `Warning`, `Error`, etc.) with:
\[(Info|Warning|Error|Debug)\]
- `\[` and `\]` match literal square brackets.
- `(Info|Warning|Error|Debug)` matches one of those four words.

If you expect additional levels (e.g., `Trace` or `Fatal`), you can add them inside the parentheses:
\[(?:Info|Warning|Error|Debug|Trace|Fatal)\]
Here we’ve used `(?: … )` to create a non-capturing group (we’ll talk about groups more in section 5).
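The log-level pattern can be tried out with a short Python sketch. The third sample line, with a `[Verbose]` level, is invented here to show a non-match; it does not come from the log above.

```python
import re

# Non-capturing group: used only to scope the alternation.
LEVEL = re.compile(r"\[(?:Info|Warning|Error|Debug|Trace|Fatal)\]")

lines = [
    "2025-05-31 14:23:55 [Warning] Missing font file: New Times Roman.",
    "2025-05-31 14:25:01 [Error] Unhandled exception: NullReferenceException.",
    "2025-05-31 14:26:00 [Verbose] Not a level listed in the pattern.",  # made-up line
]

found = []
for line in lines:
    m = LEVEL.search(line)
    # group(0) is the whole match, including the square brackets.
    found.append(m.group(0) if m else None)

print(found)
```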
Quantifiers specify how many times the preceding element should match. The most common quantifiers are:
- `*` – Match zero or more times (greedy).
- `+` – Match one or more times (greedy).
- `?` – Match zero or one time (makes the preceding token optional).
- `{n}` – Match exactly n times.
- `{n,m}` – Match between n and m times (inclusive).
- `{n,}` – Match n or more times.

A timestamp like `2025-05-31 14:23:48` can be matched with:
\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}
- `\d` is shorthand for `[0-9]`.
- `\d{4}` matches exactly four digits (the year).
- `-` matches a literal hyphen.
- `\d{2}` matches two digits (month and day).
- `\s+` matches one or more whitespace characters (space, tab).
- `\d{2}:\d{2}:\d{2}` matches `HH:MM:SS`.

If you’re sure there’s exactly one space between date and time, you could use `\s` instead of `\s+`.
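As a quick check of the timestamp pattern, the following Python sketch searches one of the sample log lines and then a deliberately malformed date. The malformed input is an assumption added for the example.

```python
import re

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}")

line = "2025-05-31 14:23:48 [Info] Starting application..."
m = TIMESTAMP.search(line)
stamp = m.group(0) if m else None
print(stamp)

# The counted quantifiers are exact: a one-digit month does not satisfy \d{2}.
short_month = TIMESTAMP.search("2025-5-31 14:23:48")
print(short_month)
```

Note that the pattern only checks the shape of the text; it would also accept an impossible date such as `2025-99-99`.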
Suppose you have filenames like `report.txt` or `report`. To match both, you could use:
^report(?:\.txt)?$
- `^` and `$` ensure we match the entire string.
- `(?:\.txt)?` means “optionally match ‘.txt’”.
  - `\.` matches a literal dot.
  - `txt` matches “txt”.
  - `?` after the group makes it optional.

Anchors specify positions in the text rather than actual characters:
- `^` – Start of the line/string.
- `$` – End of the line/string.
- `\b` – Word boundary (between `\w` and `\W`).
- `\B` – Non-word boundary.

If you want to catch any line that begins with the word “Error” (as a standalone word), you can write:
^Error\b.*$
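- `^Error` ensures the line starts with “Error”.
- `\b` asserts a word boundary so that “Errors” or “ErrorCode” are not matched.
- `.*` matches the rest of the line (any character, zero or more times).
- `$` anchors the end of line.

Parentheses `(` and `)` serve two main purposes: grouping part of a pattern so that alternation or a quantifier applies to it as a unit, and capturing the matched text so you can extract it afterwards.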
Given:
2025-05-31 14:23:55 [Warning] Missing font...
You might write:
^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+\[(Info|Warning|Error|Debug)\]\s+(.*)$
- `(\d{4}-\d{2}-\d{2})` captures the date (Group 1).
- `(\d{2}:\d{2}:\d{2})` captures the time (Group 2).
- `(Info|Warning|Error|Debug)` captures the log level (Group 3).
- `(.*)` captures the rest of the message (Group 4).

In most languages, after matching, you can extract Group 1, Group 2, etc.:
- `match.Groups[1].Value` gives the date.

If you want to group without capturing (for alternation or quantifier purposes), use `(?: … )`. For instance, `(?:Info|Warning|Error|Debug)` does not create a numbered backreference.
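For readers working in Python rather than C#, the four-group pattern above can be used as in the sketch below. The variable names are chosen for the example only.

```python
import re

LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+"
    r"\[(Info|Warning|Error|Debug)\]\s+(.*)$"
)

line = "2025-05-31 14:23:55 [Warning] Missing font file: New Times Roman."
m = LOG_LINE.match(line)

if m:
    # groups() returns Groups 1..4 in order; group(0) would be the whole match.
    date, time, level, message = m.groups()
    print(date, time, level, message, sep=" | ")
else:
    date = time = level = message = None
```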
Some of the most frequently used shorthand and metacharacters include:
- `.` – Matches any character except newline (in most flavors; some allow a “dotall” mode to match newline).
- `\d` – Digit (equivalent to `[0-9]` for ASCII text; .NET and Python 3 also accept other Unicode decimal digits by default).
- `\D` – Non-digit (equivalent to `[^0-9]`).
- `\w` – Word character (letter, digit, underscore; `[A-Za-z0-9_]`).
- `\W` – Non-word character (anything not in `\w`).
- `\s` – Whitespace character (space, tab, newline, etc.).
- `\S` – Non-whitespace character.
- `\t`, `\n`, `\r` – Tab, newline, carriage return.

Tip: When writing regexes in code, you often need to escape backslashes. For instance, in C# you can use verbatim string literals (`@"\d{4}-\d{2}-\d{2}"`) to avoid double escapes.
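Python has a comparable convenience in raw string literals. This small sketch shows that the raw and the doubled-backslash forms describe the same pattern; the sample sentence is made up.

```python
import re

# Without a raw string, each backslash has to be doubled.
escaped = "\\d{4}-\\d{2}-\\d{2}"
# A raw string (r"...") passes backslashes through unchanged.
raw = r"\d{4}-\d{2}-\d{2}"

same = escaped == raw
found = re.search(raw, "released on 2025-05-31").group(0)

print(same, found)
```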
Say you have a document or a configuration file and need to pull out all email addresses. A simple - but not fully RFC‑compliant - pattern is:
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
- `\b` – Ensure we start at a word boundary.
- `[A-Za-z0-9._%+-]+` – One or more letters, digits, dots, underscores, percent signs, plus, or hyphens (the local part before the `@`).
- `@` – Literal “@” symbol.
- `[A-Za-z0-9.-]+` – One or more letters, digits, dots, or hyphens (domain).
- `\.` – Literal dot.
- `[A-Za-z]{2,}` – At least two letters for the TLD (e.g., “com”, “org”, “io”).
- `\b` – End at a word boundary to prevent partial matches (so we don’t match trailing punctuation).

Given:
Please contact alice@example.com or bob.smith@my-domain.io for more info. Alternatively, reach out to admin@localhost if testing locally.
- `alice@example.com` matches.
- `bob.smith@my-domain.io` matches.
- `admin@localhost` does not match because `localhost` has no “`.` + two-or-more-letters” TLD.

North American Numbering Plan (NANP) phone numbers often look like:
123-456-7890
(123) 456-7890
123.456.7890
+1 (123) 456-7890
A practical regex might be:
^(\+1\s?)? # Optional country code +1 and a space
(?:\(\d{3}\)|\d{3}) # Either (123) or 123
[ .-]? # Separator: space, dot, or hyphen (optional)
\d{3} # Next three digits
[ .-]? # Separator again (optional)
\d{4}$ # Last four digits
When compacted (and removing whitespace/comments):
^(\+1\s?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}$
- `^` and `$` anchor the pattern to the entire string.
- `(\+1\s?)?` – Optionally match “+1” followed by an optional space.
- `(?:\(\d{3}\)|\d{3})` – Either three digits inside parentheses or three digits without.
- `[ .-]?` – Optionally match a space, a dot, or a hyphen.
- `\d{3}` – Three digits.
- `[ .-]?` – Separator again.
- `\d{4}` – Four digits.

This pattern will match most common North American formats. By tweaking the groups and quantifiers, you can adapt it for other regions or stricter/looser rules.
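A short Python sketch can confirm that the compact pattern accepts the four sample formats. The last candidate is a made-up negative case added for contrast.

```python
import re

PHONE = re.compile(r"^(\+1\s?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}$")

candidates = [
    "123-456-7890",
    "(123) 456-7890",
    "123.456.7890",
    "+1 (123) 456-7890",
    "12-3456-7890",  # made-up negative case: wrong digit grouping
]

# Map each candidate to True/False depending on whether the pattern matches.
results = {c: bool(PHONE.match(c)) for c in candidates}
for number, ok in results.items():
    print(f"{number!r}: {ok}")
```

The pattern checks formatting only; it does not verify that an area code is actually assigned.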
Returning to our log example, suppose you want to extract only the warning messages and their timestamps. You might apply a regex like:
^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[Warning\]\s+(.*)$
- Group 1 captures the timestamp (`2025-05-31 14:23:55`).
- Group 2 captures the message (`Missing font file: New Times Roman.`).

In most programming languages:
C# (using `System.Text.RegularExpressions`):
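    using System;
    using System.Text.RegularExpressions;

    // logContents is assumed to hold the full text of the log file.
    var pattern = @"^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[Warning\]\s+(.*)$";
    var matches = Regex.Matches(logContents, pattern, RegexOptions.Multiline);
    foreach (Match m in matches) {
        string timestamp = m.Groups[1].Value;   // Group 1: date and time
        string message = m.Groups[2].Value;     // Group 2: text after [Warning]
        Console.WriteLine($"[{timestamp}] Warning: {message}");
    }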
Python (using the `re` module):
    import re

    # log_contents is assumed to hold the full text of the log file.
    pattern = r'^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\[Warning\]\s+(.*)$'
    for line in log_contents.splitlines():
        m = re.match(pattern, line)
        if m:
            timestamp = m.group(1)  # Group 1: date and time
            message = m.group(2)    # Group 2: text after [Warning]
            print(f"[{timestamp}] Warning: {message}")
Even if you’ve only ever clicked “Replace” in an editor, these examples should give you a clear sense of how powerful a single regex can be. You define “match everything up until `[Warning]`, then capture whatever comes next”, and you get exactly the data you want.
When you’re new to regex, it’s easy to overlook small mistakes - missing an escape, misplacing a quantifier, forgetting an anchor. Here are a few tips:
Use an Online Tester
Websites like regex101.com (PCRE, JavaScript, Python), RegexStorm (for .NET), or regexr.com let you type your pattern, paste test text, and see matches/highlights in real time.
Start Simple, Then Grow
If you need a complex pattern, start with a small piece. For instance, test just `\d{4}-\d{2}-\d{2}` on the date first, then add time, then add log level, etc.
Comment Your Patterns (When Supported)
Some languages/flavors support the “extended” or “free-spacing” mode, where you can add whitespace and `# comments` inside your regex:
(?x) # Enable free-spacing mode
^ # Start of line
(\d{4}-\d{2}-\d{2}) # Group 1: Date
\s+ # One or more spaces
(\d{2}:\d{2}:\d{2}) # Group 2: Time
\s+ # One or more spaces
\[Warning\] # Literal [Warning]
\s+ # One or more spaces
(.*) # Group 3: Message text
$
This makes maintenance and future edits easier, especially for complex regexes.
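In Python, free-spacing mode is available through the `re.VERBOSE` flag. The sketch below is one way to write the commented pattern; the constant name is chosen for the example.

```python
import re

WARNING_LINE = re.compile(
    r"""
    ^                      # Start of line
    (\d{4}-\d{2}-\d{2})    # Group 1: Date
    \s+                    # One or more whitespace characters
    (\d{2}:\d{2}:\d{2})    # Group 2: Time
    \s+
    \[Warning\]            # Literal [Warning]
    \s+
    (.*)                   # Group 3: Message text
    $                      # End of line
    """,
    re.VERBOSE,
)

m = WARNING_LINE.match(
    "2025-05-31 14:23:55 [Warning] Missing font file: New Times Roman."
)
parts = m.groups() if m else None
print(parts)
```

In this mode a literal space or `#` in the pattern must be escaped or placed inside a character class, since unescaped whitespace is ignored.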
Escape Early
If you need to match special characters (`.`, `*`, `?`, `+`, `[`, `]`, `(`, `)`, `|`, `^`, `$`, `\`), always escape them with a backslash unless you explicitly want their special behavior.
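When the text to match comes from a variable or user input, escaping by hand is error-prone. In Python, `re.escape` does it for you; the sketch below uses a made-up filename to show the effect.

```python
import re

literal = "file[1].txt (copy)?"  # hypothetical text to find verbatim
pattern = re.escape(literal)     # backslash-escapes the metacharacters

haystack = "Saved as file[1].txt (copy)? yesterday"

escaped_hit = re.search(pattern, haystack)
# Used unescaped, "[1]", ".", "(copy)" and "?" keep their special meaning,
# so the same text no longer matches itself.
unescaped_hit = re.search(literal, haystack)

print(escaped_hit.group(0) if escaped_hit else None)
print(unescaped_hit)
```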
Below is a summary of the most commonly used regular expression constructs. Keep this as a quick “cheat sheet” for reference.
Note: Regex flavors (PCRE, Python, JavaScript, .NET, Java, etc.) differ in some details, but most of the items below work universally. Where there are flavor-specific notes, they are indicated.
Metacharacter | Meaning |
---|---|
`.` | Matches any character except newline (unless “dotall”/“singleline” mode is enabled). |
`^` | Matches the start of a line/string. |
`$` | Matches the end of a line/string. |
`\b` | Word boundary (transition between `\w` and `\W`). |
`\B` | Non-word boundary (opposite of `\b`). |
`\d` | Digit, equivalent to `[0-9]`. |
`\D` | Non-digit, equivalent to `[^0-9]`. |
`\w` | Word character, equivalent to `[A-Za-z0-9_]`. |
`\W` | Non-word character, equivalent to `[^A-Za-z0-9_]`. |
`\s` | Whitespace (space, tab, newline, vertical tab, form feed, etc.). |
`\S` | Non-whitespace (anything not matched by `\s`). |
`\t` | Tab character. |
`\n` | Newline character (LF). |
`\r` | Carriage return (CR). |
`\f` | Form feed. |
`\v` (flavor-specific) | Vertical tab (supported in some flavors). |
`\\` | Literal backslash. |
`\.` | Literal dot. |
`\*` | Literal asterisk. |
`\+` | Literal plus. |
`\?` | Literal question mark. |
`\{` `\}` | Literal braces. |
`\[` `\]` | Literal square brackets. |
`\(` `\)` | Literal parentheses. |
`\|` | Literal pipe (unescaped, it is the alternation operator). |
`\^` | Literal caret (unescaped, it is the start-of-string anchor). |
`\$` | Literal dollar sign (unescaped, it is the end-of-string anchor). |
Quantifier | Meaning |
---|---|
`*` | Match the preceding token zero or more times (greedy). |
`+` | Match the preceding token one or more times (greedy). |
`?` | Match the preceding token zero or one time (makes it optional; also used for lazy quantifiers). |
`{n}` | Match the preceding token exactly n times. |
`{n,}` | Match the preceding token n or more times (no upper limit). |
`{n,m}` | Match the preceding token at least n but not more than m times. |
`*?`, `+?`, `??`, `{n,m}?` | The “lazy” or “non-greedy” versions of the above quantifiers (match as few characters as possible). |
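The quantifiers in the table can be tried out with a few lines of Python. The sketch uses the sample string `axxxbxxb` for the greedy and lazy forms, plus a made-up list of digit strings for a counted quantifier.

```python
import re

text = "axxxbxxb"

# Greedy: .* grabs as much as it can, so the match runs to the last "b".
greedy = re.search(r"a.*b", text).group(0)
# Lazy: .*? grabs as little as it can, so the match stops at the first "b".
lazy = re.search(r"a.*?b", text).group(0)

# Counted quantifier: {2,3} accepts two or three digits, nothing else.
counted = [s for s in ["7", "42", "123", "2025"] if re.fullmatch(r"\d{2,3}", s)]

print(greedy, lazy, counted)
```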
Greedy vs. Lazy:
- Greedy quantifiers (`*`, `+`, `{n,m}`) will try to match as much text as possible while still allowing the rest of the pattern to succeed.
- Lazy quantifiers (`*?`, `+?`, `{n,m}?`) match as little as possible to allow the overall match to succeed.

For example:
- `a.*b` on text `axxxbxxb` will match `axxxbxxb` (greedy: from the first `a` to the last `b`).
- `a.*?b` will match `axxxb` (lazy: from `a` to the first `b`).

Syntax | Matches |
---|---|
`[abc]` | Any one of the characters `a`, `b`, or `c`. |
`[^abc]` | Any character except `a`, `b`, or `c`. |
`[a-z]` | Any lowercase letter from `a` to `z`. |
`[A-Z]` | Any uppercase letter from `A` to `Z`. |
`[0-9]` | Any digit from `0` to `9`. |
`[A-Za-z0-9_]` | Any “word” character (often equivalent to `\w`). |
`[[:alnum:]]` (POSIX) | Letters and digits (POSIX-style). Usable in some flavors (e.g., GNU tools, PCRE). |
`[[:space:]]`, `[[:digit:]]`, etc. | POSIX character classes (flavor support varies). |
Example: a string made up only of hexadecimal digits:
^[0-9A-Fa-f]+$
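Alternation (`|`) and Grouping
- Grouping: `( … )` groups a subpattern and captures what it matches; `(?: … )` does not save the match and is useful for grouping without capturing.
- Alternation: `A|B` matches either pattern A or pattern B.

For example, to match common image file names: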
^.*\.(?:jpg|jpeg|png|gif|bmp)$
- `.*\.` matches everything up to and including the last dot of the filename.
- `(?:jpg|jpeg|png|gif|bmp)` matches one of those five extensions.
- `$` ensures the extension is at the end.

Anchor | Meaning |
---|---|
`^` | Start of string (or line, with multiline mode). |
`$` | End of string (or line, with multiline mode). |
`\A` | Start of string (ignores multiline mode). |
`\z` or `\Z` | End of string (`\z` is absolute; `\Z` allows an optional trailing newline). Flavor-specific: in Python, `\Z` is the absolute end of the string. |
`\b` | Word boundary (between `\w` and `\W`). |
`\B` | Non-word boundary (opposite of `\b`). |
Multiline Mode (`m` flag)
- If you enable multiline mode (usually via a flag like `RegexOptions.Multiline` in .NET or `re.MULTILINE` in Python), then `^` and `$` match the start/end of each line rather than the entire string.
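The effect of multiline mode is easy to observe in Python. In this sketch the three-line log text is made up, and the pattern is the “line starting with Error” example from earlier.

```python
import re

log = (
    "Error at startup\n"
    "Info: all good\n"
    "Error while saving"
)

pattern = r"^Error\b.*$"

# Default mode: ^ and $ refer to the whole string, so nothing matches here
# (the first line is not followed by the end of the string).
without_flag = re.findall(pattern, log)
# Multiline mode: ^ and $ apply to every line.
with_flag = re.findall(pattern, log, flags=re.MULTILINE)

print(without_flag)
print(with_flag)
```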
Lookarounds allow you to assert that a match is (or is not) preceded or followed by something else, without including that “something else” in the match.
Construct | Meaning |
---|---|
`(?= … )` | Positive lookahead: what follows must match `…`. |
`(?! … )` | Negative lookahead: what follows must not match `…`. |
`(?<= … )` | Positive lookbehind: what precedes must match `…`. |
`(?<! … )` | Negative lookbehind: what precedes must not match `…`. |
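To get a feel for the four constructs, here is a small Python sketch. The price string and the `foo...` words are invented for the illustration; in each case the surrounding context is checked but not included in the match.

```python
import re

text = "price: $42, discount: 15%, total: $36"

# Positive lookbehind: digits preceded by "$" (the "$" is not part of the match).
dollars = re.findall(r"(?<=\$)\d+", text)
# Positive lookahead: digits followed by "%".
percents = re.findall(r"\d+(?=%)", text)
# Negative lookbehind: whole numbers not preceded by "$".
no_dollar = re.findall(r"(?<!\$)\b\d+", text)
# Negative lookahead: words starting with "foo" not followed by "bar".
not_foobar = re.findall(r"foo(?!bar)\w*", "foobar foobaz food")

print(dollars, percents, no_dollar, not_foobar)
```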
To match a 6–12 character password that does not contain any digits:
^(?=.{6,12}$)(?!.*\d).+$
- `(?=.{6,12}$)` – Ensure length is between 6 and 12.
- `(?!.*\d)` – Ensure there are no digits anywhere (`.*\d` means “somewhere a digit”; the `(?! … )` says “the rest of the string must not match `.*\d`”).
- `.+$` – Finally, match one or more characters (any), up to the end.

Once you capture a group, you can refer to it later in the pattern (to enforce that two parts of the text are identical) or in a replacement string.
- Inside the pattern: `\1`, `\2`, … refer to the text matched by Group 1, Group 2, etc. (in most flavors).
- In a replacement string: `$1`, `$2`, etc. (the exact syntax depends on the flavor).

To ensure that an opening tag `<tag>` is closed by `</tag>` with the same name:
^<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>$
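- `<([A-Za-z][A-Za-z0-9]*)\b[^>]*>` matches the opening tag and captures the tag name (e.g., `div`, `span`, `p`) in Group 1.
  - `\b` ensures a word boundary after the tag name.
  - `[^>]*` matches any attributes inside the opening tag.
- `(.*?)` lazily captures the element’s content in Group 2.
- `</\1>` matches the closing tag, which must repeat the name captured in Group 1 (e.g., `</div>`).

This simple pattern will match a complete element like `<p class="intro">Hello</p>`, but it does not handle nested elements of the same tag. For nested structures, you need more advanced parsing (often beyond what basic regex can handle robustly).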
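The backreference can be seen at work in the following Python sketch. The mismatched `<div>…</span>` input is an assumption added to show the pattern rejecting a wrong closing tag.

```python
import re

ELEMENT = re.compile(r"^<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>$")

ok = ELEMENT.match('<p class="intro">Hello</p>')
# \1 must repeat the captured tag name, so a different closing tag fails.
mismatch = ELEMENT.match("<div>text</span>")

tag, content = (ok.group(1), ok.group(2)) if ok else (None, None)
print(tag, content, mismatch)
```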
Sequence | Meaning |
---|---|
`\d` | Digit (same as `[0-9]`). |
`\D` | Non-digit (same as `[^0-9]`). |
`\w` | Word character `[A-Za-z0-9_]`. |
`\W` | Non-word character `[^A-Za-z0-9_]`. |
`\s` | Whitespace `[ \t\r\n\f\v]`. |
`\S` | Non-whitespace. |
`\t`, `\n`, `\r`, `\f` | Tab, newline, carriage return, form feed. |
`\0` | Null character (U+0000). |
`\xhh` | Character with hex code `hh` (two hex digits). |
`\uhhhh` | Unicode character with code `hhhh` (four hex digits). (Flavor-specific: .NET, Java, JavaScript, etc.) |
`\cX` | Control character (e.g., `\cA` is U+0001). |
`\Q … \E` | In some flavors (like Java, PCRE, and Perl), `\Q` starts a quoting section (treat everything up to `\E` literally). .NET does not support it; use `Regex.Escape` there instead. |
Possessive Quantifiers (`+` after quantifier) and Atomic Groups (Flavor-Specific)
Some regex flavors (like Java, PCRE) support possessive quantifiers (`*+`, `++`, `?+`, `{n,m}+`) that never backtrack:
Example: `a*+b`
- `a*+` matches as many `a` as possible and will not backtrack to give any of them up.

If the input is “aaaaabc”:
- `a*+` matches “aaaaa”.
- The engine then tries to match `b`. Since the next character is `b`, the overall match is “aaaaab”.

Atomic groups `(?> … )` have a similar effect: once a subpattern inside `(?> … )` matches, the engine won’t backtrack into it.
Note: Not all flavors support possessive quantifiers or atomic groups. Consult your language/runtime documentation if you need these advanced features.
Most regex engines let you enable or disable certain behaviors via flags. Common flags include:
Flag | Meaning | Typical Syntax |
---|---|---|
`i` | Case-insensitive matching (`RegexOptions.IgnoreCase` in .NET). | In-line: `(?i)pattern`. |
`m` | Multiline mode: `^` and `$` match start/end of lines, not just string start/end. | In .NET: `RegexOptions.Multiline` or `(?m)pattern`. |
`s` | Single-line or dot-all mode: `.` matches newline as well. | In .NET: `RegexOptions.Singleline` or `(?s)pattern`. |
`x` | Free-spacing/comment mode: ignores whitespace and allows `# comments`. | In .NET: `RegexOptions.IgnorePatternWhitespace` or `(?x)pattern`. |
`U` | Ungreedy mode (PCRE): swap greedy and lazy behaviors (rarely used). | In PCRE: `(?U)pattern`. |
`u` | Unicode mode (flavor-specific). | In JavaScript: the `/u` flag. (.NET has no separate Unicode flag; `RegexOptions.CultureInvariant` only makes case-insensitive matching culture-independent.) |
(?im)^[a-z]+\d+\s+$
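- `(?i)` means “case-insensitive”.
- `(?m)` means “multiline”.
- Without the flags, `^[a-z]+` would match lowercase letters only and only at the start of the entire string; with `i` and `m`, it matches lowercase or uppercase letters at the start of each line.

When you use regex to perform replacements, you often use backreferences in the replacement string:
- Many flavors, including .NET, write them as `$1`, `$2`, …
- JavaScript uses `$1`, `$2` as well.
- In other tools, such as Python’s `re.sub` or sed, `\1`, `\2` might be used.

Original text: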
Smith, John
Doe, Jane
Regex find pattern:
^([A-Za-z]+),\s+([A-Za-z]+)$
Replacement pattern:
$2 $1
After replacement:
John Smith
Jane Doe
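The same find-and-replace can be sketched in Python, where the replacement string uses `\2 \1` rather than `$2 $1`. The `re.MULTILINE` flag is an addition here so that `^` and `$` apply to each line of the two-line input.

```python
import re

names = "Smith, John\nDoe, Jane"

# Group 1 is the last name, Group 2 the first name; the replacement swaps them.
swapped = re.sub(
    r"^([A-Za-z]+),\s+([A-Za-z]+)$",
    r"\2 \1",
    names,
    flags=re.MULTILINE,
)

print(swapped)
```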
- The first `([A-Za-z]+)` captures the last name in Group 1.
- The second `([A-Za-z]+)` captures the first name in Group 2.
- `$2 $1` swaps them in the output.

By now you should have a solid grasp of:
With a few dozen minutes of practice - writing regexes against sample text - you’ll rapidly gain confidence. Remember:
- For complex patterns, use free-spacing mode (`(?x)`) and add comments.

Once you’ve mastered the basics, you can explore more advanced topics like:
But those are topics for another day. For now, lean on the examples and reference sheet above, and try applying regex to your next text‑processing challenge. In no time, you’ll find yourself thinking in patterns - and you won’t look back.
Congratulations on starting your pattern crafting journey! Keep this guide handy, bookmark your favorite regex tester, and remember that a well‑written regular expression often turns hours of manual parsing into a single line of elegant code.