Table of Contents
- Introduction to Regular Expressions
- Why Learn Regex?
- Common Use Cases
- Regex Learning Roadmap
- Getting Started with Python’s re Module
- Setting Up the Environment
- First Regex Example
- Raw Strings and Escape Characters
- Basic Regex Patterns
- Literal Character Matching
- Case Sensitivity Options
- Basic Pattern Exercises
- Character Classes and Special Characters
- Built-in Character Classes
- Custom Character Classes
- Character Class Combinations
- Quantifiers
- Basic Quantifiers
- Greedy vs Non-Greedy Matching
- Quantifier Practice Examples
- Groups and Capturing
- Basic Grouping
- Named Groups
- Non-Capturing Groups
- Backreferences
- Anchors and Boundaries
- Start and End Anchors
- Word Boundaries
- Multiline Mode
- Advanced Patterns
- Lookahead and Lookbehind Assertions
- Alternation and Conditional Matching
- Recursive Patterns
- Regex Methods in Python
- Core Methods Comparison
- Compiled Patterns
- Working with Match Objects
- Interactive Examples and Debugging
- Step-by-Step Pattern Building
- Debugging Tools and Techniques
- Common Error Messages
- Real-World Applications
- Data Validation Patterns
- Text Processing and Extraction
- Log Analysis and Parsing
- Web Scraping Applications
- Performance and Optimization
- Pattern Compilation Best Practices
- Avoiding Catastrophic Backtracking
- Memory and Speed Optimization
- Common Pitfalls and Best Practices
- Frequent Mistakes
- Testing Strategies
- Code Organization
- Comprehensive Regex Cheat Sheet
- Quick Reference Patterns
- Method Comparison Table
- Flag Options
- Practice Exercises
- Beginner Exercises
- Intermediate Challenges
- Advanced Problems
1. Introduction to Regular Expressions
Regular expressions (regex) are powerful pattern-matching tools used to find, match, and manipulate text. They provide a concise way to describe complex string patterns.
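As a first taste, here is a minimal sketch using only the standard library (the date pattern is illustrative, chosen to show how one short pattern replaces a hand-written scanning loop):

```python
import re

# A pattern describing ISO-style dates such as 2024-01-15
pattern = r"\d{4}-\d{2}-\d{2}"
text = "The release shipped on 2024-01-15 and was patched on 2024-02-03."

print(re.findall(pattern, text))  # ['2024-01-15', '2024-02-03']
```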
graph LR
A[Text Input] --> B[Regex Pattern]
B --> C{Match?}
C -->|Yes| D[Extract/Replace/Validate]
C -->|No| E[No Action]
Why Learn Regex?
- Text Processing: Extract specific information from large text files
- Data Validation: Validate email addresses, phone numbers, etc.
- Data Cleaning: Remove unwanted characters or format data
- Log Analysis: Parse and analyze log files
- Web Scraping: Extract specific data from HTML
graph TB
A[Regular Expressions] --> B[Pattern Matching]
A --> C[Text Manipulation]
A --> D[Data Validation]
B --> B1[Find specific text]
B --> B2[Extract information]
B --> B3[Search & Replace]
C --> C1[Clean data]
C --> C2[Transform format]
C --> C3[Split strings]
D --> D1[Email validation]
D --> D2[Phone numbers]
D --> D3[Input sanitization]
Common Use Cases
# Example scenarios where regex excels
examples = {
"Email extraction": "Extract all emails from a text file",
"Phone formatting": "Convert (555) 123-4567 to 555-123-4567",
"Data cleaning": "Remove extra whitespace and special characters",
"Log parsing": "Extract timestamps and error codes from logs",
"URL validation": "Check if a string is a valid web address"
}
Regex Learning Roadmap
graph TD
A[Start Here] --> B[Basic Patterns]
B --> C[Character Classes]
C --> D[Quantifiers]
D --> E[Groups & Capturing]
E --> F[Anchors & Boundaries]
F --> G[Advanced Features]
G --> H[Real-world Applications]
H --> I[Performance & Optimization]
style A fill:#e1f5fe
style I fill:#c8e6c9
2. Getting Started with Python’s re Module
Python’s built-in re module provides regex functionality.
import re
# Basic pattern matching
pattern = r"hello"
text = "hello world"
match = re.search(pattern, text)
print(match.group() if match else "No match") # Output: hello
Raw Strings
Always use raw strings (r"") for regex patterns to avoid escaping issues:
# Good practice
pattern = r"\d+\.\d+" # Matches decimal numbers
# Avoid this
pattern = "\\d+\\.\\d+" # Same pattern but harder to read
flowchart TD
A[Regex Pattern] --> B{Raw String?}
B -->|Yes r""| C[Clean, Readable Pattern]
B -->|No ""| D[Escaped Characters \\\\]
C --> E[Easy to Debug]
D --> F[Hard to Read/Maintain]
3. Basic Regex Patterns
Literal Characters
Match exact characters:
import re
pattern = r"cat"
text = "The cat sat on the mat"
matches = re.findall(pattern, text)
print(matches) # Output: ['cat']
Case Sensitivity
# Case sensitive (default)
pattern = r"Cat"
text = "cat and Cat"
matches = re.findall(pattern, text)
print(matches) # Output: ['Cat']
# Case insensitive
pattern = r"Cat"
text = "cat and Cat"
matches = re.findall(pattern, text, re.IGNORECASE)
print(matches) # Output: ['cat', 'Cat']
graph TB
A["Input Text: 'cat and Cat'"] --> B["Pattern: 'Cat'"]
B --> C{Case Sensitive?}
C -->|Yes| D["Matches: ['Cat']"]
C -->|No re.IGNORECASE| E["Matches: ['cat', 'Cat']"]
4. Character Classes and Special Characters
Basic Character Classes
import re
# \d - digits (0-9)
pattern = r"\d+"
text = "I have 25 apples and 10 oranges"
matches = re.findall(pattern, text)
print(matches) # Output: ['25', '10']
# \w - word characters (letters, digits, underscore)
pattern = r"\w+"
text = "hello_world 123!"
matches = re.findall(pattern, text)
print(matches) # Output: ['hello_world', '123']
# \s - whitespace characters
pattern = r"\s+"
text = "hello world\t\n"
matches = re.findall(pattern, text)
print(matches) # Output: [' ', '\t\n']
Character Class Summary
| Pattern | Description | Example Match |
|---|---|---|
| \d | Digit (0-9) | 5, 42 |
| \D | Non-digit | a, @ |
| \w | Word character | a, Z, _, 5 |
| \W | Non-word character | @, !, (space) |
| \s | Whitespace | (space), \t, \n |
| \S | Non-whitespace | a, 1, @ |
| . | Any character (except newline) | a, 1, @ |
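Custom classes can combine ranges, literals, and repetition in one bracket expression. A small sketch (the hex-color and identifier patterns below are illustrative, not taken from the text above):

```python
import re

# Ranges and literals combined: hex color codes like #ff6600
hex_pattern = r"#[0-9a-fA-F]{6}\b"
text = "Palette: #ff6600, #00AAff and the invalid #12345z"
print(re.findall(hex_pattern, text))  # ['#ff6600', '#00AAff']

# Mixing classes: identifiers start with a letter, then any word characters
ident_pattern = r"\b[a-zA-Z]\w*\b"
print(re.findall(ident_pattern, "x1 = foo_bar + 42"))  # ['x1', 'foo_bar']
```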
graph TD
A[Character Classes] --> B["\d Digits"]
A --> C["\w Word Chars"]
A --> D["\s Whitespace"]
A --> E[". Any Char"]
A --> F[Custom Classes]
B --> B1["0-9 (Numbers)"]
C --> C1["a-z, A-Z, 0-9, _ (Alphanumeric + underscore)"]
D --> D1["Space, Tab, Newline"]
E --> E1["Everything except newline"]
F --> F1["[abc] - specific chars"]
F --> F2["[a-z] - ranges"]
F --> F3["[^abc] - negation"]
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#f3e5f5
style D fill:#e8f5e8
style E fill:#ffebee
style F fill:#fce4ec
Custom Character Classes
# [abc] - matches a, b, or c
pattern = r"[aeiou]"
text = "hello world"
matches = re.findall(pattern, text)
print(matches) # Output: ['e', 'o', 'o']
# [a-z] - matches lowercase letters
pattern = r"[a-z]+"
text = "Hello World 123"
matches = re.findall(pattern, text)
print(matches) # Output: ['ello', 'orld']
# [^abc] - matches anything except a, b, or c
pattern = r"[^aeiou]+"
text = "hello world"
matches = re.findall(pattern, text)
print(matches) # Output: ['h', 'll', ' w', 'rld']
5. Quantifiers
Quantifiers specify how many times a pattern should match.
import re
# * - zero or more
pattern = r"ab*c"
texts = ["ac", "abc", "abbc", "abbbc"]
for text in texts:
match = re.search(pattern, text)
print(f"{text}: {bool(match)}")
# Output: ac: True, abc: True, abbc: True, abbbc: True
# + - one or more
pattern = r"ab+c"
texts = ["ac", "abc", "abbc"]
for text in texts:
match = re.search(pattern, text)
print(f"{text}: {bool(match)}")
# Output: ac: False, abc: True, abbc: True
# ? - zero or one
pattern = r"colou?r"
texts = ["color", "colour"]
for text in texts:
match = re.search(pattern, text)
print(f"{text}: {bool(match)}")
# Output: color: True, colour: True
# {n} - exactly n times
pattern = r"\d{3}"
text = "Call 123-456-7890"
matches = re.findall(pattern, text)
print(matches) # Output: ['123', '456', '789']
# {n,m} - between n and m times
pattern = r"\d{2,4}"
text = "1 22 333 4444 55555"
matches = re.findall(pattern, text)
print(matches) # Output: ['22', '333', '4444', '5555']
Quantifier Summary
| Quantifier | Description | Example |
|---|---|---|
| * | 0 or more | ab* matches a, ab, abb |
| + | 1 or more | ab+ matches ab, abb (not a) |
| ? | 0 or 1 | ab? matches a, ab |
| {n} | Exactly n | a{3} matches aaa |
| {n,} | n or more | a{2,} matches aa, aaa, aaaa |
| {n,m} | Between n and m | a{2,4} matches aa, aaa, aaaa |
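One way to check the rows above is re.fullmatch, which succeeds only when the entire string fits the pattern:

```python
import re

# re.fullmatch returns a match object only if the whole string matches
print(bool(re.fullmatch(r"ab*", "a")))        # True  (* allows zero b's)
print(bool(re.fullmatch(r"ab+", "a")))        # False (+ needs at least one)
print(bool(re.fullmatch(r"a{2,}", "aaaa")))   # True  (two or more)
print(bool(re.fullmatch(r"a{2,4}", "aaaaa"))) # False (five is out of range)
```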
graph LR
A[Quantifiers] --> B["Greedy: *, +, ?"]
A --> C["Exact: {n}"]
A --> D["Range: {n,m}"]
B --> B1[Match as much as possible]
C --> C1[Match exact count]
D --> D1[Match within range]
Greedy vs Non-Greedy
import re
text = "<tag>content</tag>"
# Greedy matching (default)
pattern = r"<.*>"
match = re.search(pattern, text)
print(match.group()) # Output: <tag>content</tag>
# Non-greedy matching
pattern = r"<.*?>"
matches = re.findall(pattern, text)
print(matches) # Output: ['<tag>', '</tag>']
6. Groups and Capturing
Groups allow you to capture parts of a match and apply quantifiers to multiple characters.
import re
# Basic grouping
pattern = r"(\d{3})-(\d{3})-(\d{4})"
text = "Call me at 123-456-7890"
match = re.search(pattern, text)
if match:
print(f"Full match: {match.group(0)}") # 123-456-7890
print(f"Area code: {match.group(1)}") # 123
print(f"Exchange: {match.group(2)}") # 456
print(f"Number: {match.group(3)}") # 7890
print(f"All groups: {match.groups()}") # ('123', '456', '7890')
Named Groups
pattern = r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<number>\d{4})"
text = "Call me at 123-456-7890"
match = re.search(pattern, text)
if match:
print(f"Area: {match.group('area')}") # 123
print(f"Exchange: {match.group('exchange')}")# 456
print(f"Number: {match.group('number')}") # 7890
print(f"Dict: {match.groupdict()}") # {'area': '123', 'exchange': '456', 'number': '7890'}
Non-Capturing Groups
# (?:...) - non-capturing group
pattern = r"(?:Mr|Mrs|Ms)\. (\w+)"
text = "Hello Mr. Smith and Mrs. Johnson"
matches = re.findall(pattern, text)
print(matches) # Output: ['Smith', 'Johnson']
graph TD
A[Groups in Regex] --> B["Capturing Groups <br> (...)"]
A --> C["Named Groups <br> (?P<name>...)"]
A --> D["Non-Capturing Groups <br> (?:...)"]
B --> B1[Accessible by number]
C --> C1[Accessible by name]
D --> D1[Not captured, just grouped]
7. Anchors and Boundaries
Anchors specify where in the string a match should occur.
import re
text = "The cat in the hat"
# ^ - start of string
pattern = r"^The"
match = re.search(pattern, text)
print(bool(match)) # True
pattern = r"^cat"
match = re.search(pattern, text)
print(bool(match)) # False
# $ - end of string
pattern = r"hat$"
match = re.search(pattern, text)
print(bool(match)) # True
# \b - word boundary
pattern = r"\bcat\b"
texts = ["cat", "catch", "scattered", "the cat runs"]
for t in texts:
match = re.search(pattern, t)
print(f"'{t}': {bool(match)}")
# Output: 'cat': True, 'catch': False, 'scattered': False, 'the cat runs': True
Anchor Summary
| Anchor | Description | Example |
|---|---|---|
| ^ | Start of string | ^Hello matches “Hello world” |
| $ | End of string | world$ matches “Hello world” |
| \b | Word boundary | \bcat\b matches “cat” but not “catch” |
| \B | Non-word boundary | \Bcat\B matches the “cat” inside “scattered” |
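The Multiline Mode listed in the table of contents changes what ^ and $ anchor to. A minimal sketch:

```python
import re

text = "cat\ndog\ncat"

# Without re.MULTILINE, ^ and $ anchor to the whole string
print(re.findall(r"^cat$", text))                # []
# With re.MULTILINE, they anchor to the start and end of each line
print(re.findall(r"^cat$", text, re.MULTILINE))  # ['cat', 'cat']
```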
graph LR
A["Text: 'The cat in the hat'"] --> B[Anchors]
B --> C["^ Start"]
B --> D["$ End"]
B --> E["\b Word Boundary"]
C --> C1["^The ✓"]
D --> D1["hat$ ✓"]
E --> E1["\bcat\b ✓"]
8. Advanced Patterns
Lookahead and Lookbehind
import re
# Positive lookahead (?=...)
pattern = r"\d+(?= dollars)"
text = "I have 50 dollars and 25 cents"
matches = re.findall(pattern, text)
print(matches) # Output: ['50']
# Negative lookahead (?!...)
pattern = r"\d+(?! dollars)"
text = "I have 50 dollars and 25 cents"
matches = re.findall(pattern, text)
print(matches) # Output: ['5', '25'] (backtracking lets '5' match even though '50' as a whole is rejected)
# Positive lookbehind (?<=...)
pattern = r"(?<=\$)\d+"
text = "Price: $25.99"
matches = re.findall(pattern, text)
print(matches) # Output: ['25']
# Negative lookbehind (?<!...)
pattern = r"(?<!\$)\d+"
text = "Price: $25.99 and 10 items"
matches = re.findall(pattern, text)
print(matches) # Output: ['5', '99', '10']
Alternation
# | - OR operator
pattern = r"cat|dog|bird"
text = "I have a cat and a dog"
matches = re.findall(pattern, text)
print(matches) # Output: ['cat', 'dog']
# Grouped alternation
pattern = r"(Mr|Mrs|Ms)\. (\w+)"
text = "Mr. Smith and Ms. Johnson"
matches = re.findall(pattern, text)
print(matches) # Output: [('Mr', 'Smith'), ('Ms', 'Johnson')]
graph TD
A[Advanced Patterns] --> B["Lookahead (?=...)"]
A --> C["Lookbehind (?<=...)"]
A --> D["Alternation |"]
B --> B1["Match if followed by..."]
C --> C1["Match if preceded by..."]
D --> D1[Match this OR that]
9. Regex Methods in Python
Essential re Module Functions
import re
text = "The price is $25.99 and $15.50"
pattern = r"\$(\d+\.\d+)"
# re.search() - finds first match
match = re.search(pattern, text)
if match:
print(f"First price: {match.group(1)}") # 25.99
# re.findall() - finds all matches
matches = re.findall(pattern, text)
print(f"All prices: {matches}") # ['25.99', '15.50']
# re.finditer() - returns match objects
for match in re.finditer(pattern, text):
print(f"Price: ${match.group(1)} at position {match.start()}-{match.end()}")
# re.sub() - substitute matches
new_text = re.sub(pattern, r"$XXX.XX", text)
print(new_text) # The price is $XXX.XX and $XXX.XX
# re.split() - split by pattern
text = "apple,banana;orange:grape"
fruits = re.split(r"[,;:]", text)
print(fruits) # ['apple', 'banana', 'orange', 'grape']
Compiled Patterns
# Compile pattern for reuse (more efficient)
pattern = re.compile(r"\$(\d+\.\d+)")
text = "Price: $25.99"
match = pattern.search(text)
matches = pattern.findall(text)
new_text = pattern.sub("$XX.XX", text)
flowchart TD
A[re Module Methods] --> B[re.search]
A --> C[re.findall]
A --> D[re.finditer]
A --> E[re.sub]
A --> F[re.split]
A --> G[re.compile]
B --> B1[First match object]
C --> C1[List of all matches]
D --> D1[Iterator of match objects]
E --> E1[Replace matches]
F --> F1[Split by pattern]
G --> G1[Compiled pattern object]
10. Interactive Examples and Debugging
Step-by-Step Pattern Building
Let’s build a complex email validation pattern step by step:
import re
# Step 1: Start simple - match any characters before @
pattern1 = r".+@"
test_email = "user@example.com"
print(f"Step 1: {bool(re.search(pattern1, test_email))}") # True
# Step 2: Be more specific about allowed characters before @
pattern2 = r"[a-zA-Z0-9._%+-]+@"
print(f"Step 2: {bool(re.search(pattern2, test_email))}") # True
# Step 3: Add domain part
pattern3 = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+"
print(f"Step 3: {bool(re.search(pattern3, test_email))}") # True
# Step 4: Ensure domain has extension
pattern4 = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+"
print(f"Step 4: {bool(re.search(pattern4, test_email))}") # True
# Step 5: Add anchors for exact matching
pattern5 = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
print(f"Step 5 (final): {bool(re.match(pattern5, test_email))}") # True
graph TD
A["Start: .+@"] --> B["Refine: [a-zA-Z0-9._%+-]+@"]
B --> C["Add domain: +@[a-zA-Z0-9.-]+"]
C --> D["Add extension: +\.[a-zA-Z]+"]
D --> E["Add anchors: ^...$"]
E --> F["Final Pattern"]
style A fill:#ffebee
style F fill:#e8f5e8
Interactive Pattern Tester
def test_pattern_interactively():
"""Interactive pattern testing function"""
def test_regex(pattern, test_strings, description=""):
"""Test a regex pattern against multiple strings"""
print(f"\n{'='*50}")
print(f"Testing: {description}")
print(f"Pattern: {pattern}")
print(f"{'='*50}")
compiled_pattern = re.compile(pattern)
for test_string in test_strings:
match = compiled_pattern.search(test_string)
if match:
print(f"✓ '{test_string}' -> Match: '{match.group()}'")
if match.groups():
print(f" Groups: {match.groups()}")
else:
print(f"✗ '{test_string}' -> No match")
# Example usage
phone_pattern = r"(\(?\d{3}\)?[-.\s]?)(\d{3}[-.\s]?\d{4})"
phone_tests = [
"123-456-7890",
"(555) 123-4567",
"555.123.4567",
"1234567890",
"invalid-phone"
]
test_regex(phone_pattern, phone_tests, "Phone Number Validation")
# Run the interactive tester
test_pattern_interactively()
Debugging Tools and Techniques
def debug_regex_step_by_step(pattern, text):
"""Debug a regex pattern by showing each step"""
print(f"Debugging pattern: {pattern}")
print(f"Against text: '{text}'")
print("-" * 50)
try:
compiled_pattern = re.compile(pattern)
match = compiled_pattern.search(text)
if match:
print(f"✓ Match found!")
print(f" Full match: '{match.group()}'")
print(f" Start position: {match.start()}")
print(f" End position: {match.end()}")
print(f" Span: {match.span()}")
if match.groups():
print(" Captured groups:")
for i, group in enumerate(match.groups(), 1):
print(f" Group {i}: '{group}'")
if hasattr(match, 'groupdict') and match.groupdict():
print(" Named groups:")
for name, value in match.groupdict().items():
print(f" {name}: '{value}'")
else:
print("✗ No match found")
# Show all matches if there are multiple
all_matches = compiled_pattern.findall(text)
if len(all_matches) > 1:
print(f"\nAll matches found: {all_matches}")
except re.error as e:
print(f"❌ Regex error: {e}")
return False
return True
# Example debugging session
debug_regex_step_by_step(
r"(\w+)@(\w+)\.(\w+)",
"Contact us at john@example.com or mary@test.org"
)
Common Error Messages and Solutions
def demonstrate_common_errors():
"""Show common regex errors and their solutions"""
common_errors = [
{
"error": "Unbalanced parenthesis",
"bad_pattern": r"(\d+",
"good_pattern": r"(\d+)",
"description": "Always close parentheses"
},
{
"error": "Invalid escape sequence",
"bad_pattern": "\\d+\\s+\\w+", # Python string
"good_pattern": r"\d+\s+\w+", # Raw string
"description": "Use raw strings to avoid double escaping"
},
{
"error": "Nothing to repeat",
"bad_pattern": r"+\d+",
"good_pattern": r"\d+",
"description": "Quantifiers need something to quantify"
}
]
for error_info in common_errors:
print(f"\n{error_info['error']}:")
print(f"❌ Bad: {error_info['bad_pattern']}")
print(f"✅ Good: {error_info['good_pattern']}")
print(f"💡 Tip: {error_info['description']}")
demonstrate_common_errors()
graph TD
A[Regex Debugging] --> B[Step-by-step Testing]
A --> C[Interactive Pattern Building]
A --> D[Error Analysis]
B --> B1[Print intermediate results]
B --> B2[Test with multiple inputs]
B --> B3[Visualize matches]
C --> C1[Start simple]
C --> C2[Add complexity gradually]
C --> C3[Test each iteration]
D --> D1[Common syntax errors]
D --> D2[Logic errors]
D --> D3[Performance issues]
11. Real-World Applications
Email Validation
import re
from typing import Any, Dict
def validate_email(email: str) -> Dict[str, Any]:
"""
Comprehensive email validation with detailed feedback
Args:
email: Email address to validate
Returns:
Dictionary with validation result and details
"""
if not isinstance(email, str):
return {"valid": False, "error": "Input must be a string"}
if not email:
return {"valid": False, "error": "Email cannot be empty"}
# Basic pattern for email validation
pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
try:
is_valid = bool(re.match(pattern, email))
if is_valid:
# Extract parts for additional validation
local_part, domain = email.split('@')
domain_parts = domain.split('.')
return {
"valid": True,
"email": email,
"local_part": local_part,
"domain": domain,
"tld": domain_parts[-1],
"length": len(email)
}
else:
# Provide specific error feedback
errors = []
if '@' not in email:
errors.append("Missing @ symbol")
elif email.count('@') > 1:
errors.append("Multiple @ symbols")
elif email.startswith('@'):
errors.append("Cannot start with @")
elif email.endswith('@'):
errors.append("Cannot end with @")
elif '.' not in email.split('@')[-1]:
errors.append("Domain must contain a dot")
else:
errors.append("Invalid format")
return {"valid": False, "error": "; ".join(errors)}
except Exception as e:
return {"valid": False, "error": f"Validation error: {str(e)}"}
# Test cases with comprehensive feedback
test_emails = [
"user@example.com", # Valid
"test.email+tag@domain.org", # Valid with plus
"user.name@sub.domain.com", # Valid with subdomain
"invalid.email", # Invalid - no @
"@domain.com", # Invalid - starts with @
"user@", # Invalid - ends with @
"user@@domain.com", # Invalid - double @
"user@domain", # Invalid - no TLD
"", # Invalid - empty
123, # Invalid - not string
]
print("Email Validation Results:")
print("=" * 50)
for email in test_emails:
result = validate_email(email)
status = "✓" if result["valid"] else "✗"
if result["valid"]:
print(f"{status} {email}")
print(f" Local: {result['local_part']}, Domain: {result['domain']}")
else:
print(f"{status} {email} - {result['error']}")
Phone Number Extraction
def extract_phone_numbers(text):
# Matches various phone number formats
pattern = r"(\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})"
matches = re.finditer(pattern, text)
phones = []
for match in matches:
phone = f"({match.group(2)}) {match.group(3)}-{match.group(4)}"
phones.append(phone)
return phones
text = """
Contact us at 123-456-7890 or (555) 123-4567.
You can also reach us at +1 (800) 555-0123.
"""
phones = extract_phone_numbers(text)
for phone in phones:
print(phone)
Log File Analysis
def parse_log_entry(log_line):
# Common log format: IP - - [timestamp] "method url protocol" status size
pattern = r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\]\s+"(\w+)\s+(.*?)\s+.*?"\s+(\d+)\s+(\d+)'
match = re.search(pattern, log_line)
if match:
return {
'ip': match.group(1),
'timestamp': match.group(2),
'method': match.group(3),
'url': match.group(4),
'status': int(match.group(5)),
'size': int(match.group(6))
}
return None
log = '192.168.1.1 - - [01/Jan/2024:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234'
parsed = parse_log_entry(log)
print(parsed)
URL Extraction
def extract_urls(text):
pattern = r'https?://(?:[-\w.])+(?:[:\d]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:#(?:\w)*)?)?'
return re.findall(pattern, text)
text = """
Visit our website at https://example.com or check out
the documentation at https://docs.example.com/guide?lang=en#overview
"""
urls = extract_urls(text)
for url in urls:
print(url)
Web Scraping Applications
import re
def scrape_product_info(html_content):
"""Extract product information from e-commerce HTML"""
patterns = {
'title': r'<h1[^>]*class="[^"]*product-title[^"]*"[^>]*>(.*?)</h1>',
'price': r'<span[^>]*class="[^"]*price[^"]*"[^>]*>\$?([\d,]+\.?\d*)</span>',
'rating': r'<div[^>]*class="[^"]*rating[^"]*"[^>]*>.*?(\d+\.?\d*)\s*out of',
'availability': r'<span[^>]*class="[^"]*stock[^"]*"[^>]*>(In Stock|Out of Stock)</span>'
}
product_info = {}
for key, pattern in patterns.items():
match = re.search(pattern, html_content, re.IGNORECASE | re.DOTALL)
if match:
product_info[key] = match.group(1).strip()
else:
product_info[key] = "Not found"
return product_info
# Example usage
sample_html = """
<div class="product-container">
<h1 class="product-title">Wireless Bluetooth Headphones</h1>
<span class="price-current">$79.99</span>
<div class="rating-section">4.5 out of 5 stars</div>
<span class="stock-status">In Stock</span>
</div>
"""
product = scrape_product_info(sample_html)
print(product)
Data Analysis and Processing
import re
from collections import defaultdict
def analyze_text_data(text_data):
"""Comprehensive text analysis using regex"""
analysis = {
'word_count': len(re.findall(r'\b\w+\b', text_data)),
'sentence_count': len(re.findall(r'[.!?]+', text_data)),
'paragraph_count': len(re.findall(r'\n\s*\n', text_data)) + 1,
'emails': re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text_data),
'phone_numbers': re.findall(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text_data),
'urls': re.findall(r'https?://[^\s<>"\']+', text_data),
'mentions': re.findall(r'@\w+', text_data),
'hashtags': re.findall(r'#\w+', text_data),
'numbers': re.findall(r'\b\d+\.?\d*\b', text_data),
'dates': re.findall(r'\b\d{1,2}/\d{1,2}/\d{4}\b', text_data)
}
# Word frequency analysis
words = re.findall(r'\b\w+\b', text_data.lower())
word_freq = defaultdict(int)
for word in words:
if len(word) > 3: # Only count words longer than 3 characters
word_freq[word] += 1
analysis['top_words'] = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]
return analysis
# Example usage
sample_text = """
Contact our team at support@example.com or call us at (555) 123-4567.
Visit our website at https://example.com for more information.
Follow us @example_company and use #ExampleProduct in your posts.
Our sales increased by 25.5% on 12/15/2023. We now have 1000+ customers!
"""
analysis_result = analyze_text_data(sample_text)
for key, value in analysis_result.items():
print(f"{key}: {value}")
File Processing Applications
import re
from collections import defaultdict
class LogAnalyzer:
"""Comprehensive log file analyzer using regex patterns"""
def __init__(self):
self.patterns = {
'apache_access': r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\].*?"(\w+)\s+([^"]+).*?"\s+(\d+)\s+(\d+)',
'error_log': r'\[(.*?)\]\s*\[(\w+)\]\s*(.*)',
'application_log': r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+(.*)',
'nginx_access': r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\].*?"([^"]+)"\s+(\d+)\s+(\d+)',
}
def parse_apache_access_log(self, log_line):
"""Parse Apache access log format"""
match = re.search(self.patterns['apache_access'], log_line)
if match:
return {
'ip': match.group(1),
'timestamp': match.group(2),
'method': match.group(3),
'url': match.group(4),
'status': int(match.group(5)),
'size': int(match.group(6))
}
return None
def analyze_log_file(self, file_path, log_type='apache_access'):
"""Analyze entire log file and generate statistics"""
stats = {
'total_requests': 0,
'status_codes': defaultdict(int),
'ips': defaultdict(int),
'methods': defaultdict(int),
'urls': defaultdict(int),
'errors': []
}
try:
with open(file_path, 'r') as f:
for line_num, line in enumerate(f, 1):
parsed = self.parse_apache_access_log(line.strip())
if parsed:
stats['total_requests'] += 1
stats['status_codes'][parsed['status']] += 1
stats['ips'][parsed['ip']] += 1
stats['methods'][parsed['method']] += 1
stats['urls'][parsed['url']] += 1
if parsed['status'] >= 400:
stats['errors'].append({
'line': line_num,
'status': parsed['status'],
'url': parsed['url'],
'ip': parsed['ip']
})
except FileNotFoundError:
print(f"File {file_path} not found")
return None
# Convert to regular dicts and get top entries
for key in ['status_codes', 'ips', 'methods', 'urls']:
stats[key] = dict(sorted(stats[key].items(), key=lambda x: x[1], reverse=True)[:10])
return stats
# Text cleaning and normalization
def clean_and_normalize_text(text):
"""Clean and normalize text data using regex"""
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove extra whitespace
text = re.sub(r'\s+', ' ', text)
# Remove special characters but keep basic punctuation
text = re.sub(r'[^\w\s\.\,\!\?\-]', '', text)
# Normalize multiple punctuation marks
text = re.sub(r'[\.]{2,}', '...', text)
text = re.sub(r'[!]{2,}', '!', text)
text = re.sub(r'[?]{2,}', '?', text)
# Fix spacing around punctuation
text = re.sub(r'\s*([,.!?])\s*', r'\1 ', text)
# Remove leading/trailing whitespace
text = text.strip()
return text
# CSV data extraction from text
def extract_csv_like_data(text):
"""Extract structured data that looks like CSV from text"""
# Pattern for CSV-like data (comma-separated values)
csv_pattern = r'^[^,\n]+(?:,[^,\n]+)+$'
lines = text.split('\n')
csv_lines = []
for line in lines:
if re.match(csv_pattern, line.strip()):
csv_lines.append(line.strip())
return csv_lines
# Example usage
sample_dirty_text = """
<html><body>This is some messy text!!!
It has <b>HTML tags</b> and irregular spacing...
Contact: user@example.com, phone: 555-123-4567.
Data: John,25,Engineer
Jane,30,Designer
Bob,28,Developer
</body></html>
"""
cleaned_text = clean_and_normalize_text(sample_dirty_text)
print("Cleaned text:", cleaned_text)
csv_data = extract_csv_like_data(sample_dirty_text)
print("Extracted CSV data:", csv_data)
graph TD
A[Real-World Applications] --> B[Email Validation]
A --> C[Phone Numbers]
A --> D[Log Parsing]
A --> E[URL Extraction]
A --> F[Data Cleaning]
B --> B1[Format Verification]
C --> C1[Various Formats]
D --> D1[Structure Extraction]
E --> E1[Link Discovery]
F --> F1[Text Normalization]
12. Performance and Optimization
Compiling Patterns
import re
import time
# Inefficient - recompiles pattern each time
def slow_search(texts):
pattern = r"\b\w+@\w+\.\w+\b"
results = []
for text in texts:
matches = re.findall(pattern, text)
results.extend(matches)
return results
# Efficient - compile once, use many times
def fast_search(texts):
pattern = re.compile(r"\b\w+@\w+\.\w+\b")
results = []
for text in texts:
matches = pattern.findall(text)
results.extend(matches)
return results
# Test with large dataset
texts = ["Contact user@example.com for info"] * 10000
# Time the approaches
start = time.time()
slow_results = slow_search(texts)
slow_time = time.time() - start
start = time.time()
fast_results = fast_search(texts)
fast_time = time.time() - start
print(f"Slow approach: {slow_time:.4f} seconds")
print(f"Fast approach: {fast_time:.4f} seconds")
print(f"Speed improvement: {slow_time / fast_time:.2f}x")
Optimizing Patterns
# Inefficient - nested quantifiers cause catastrophic backtracking
bad_pattern = r"(a+)+b"
# Efficient - the simplified pattern matches the same strings without backtracking
# (Python 3.11+ also supports atomic groups such as (?>a+)b)
good_pattern = r"a+b"
# Use specific character classes instead of .
# Bad
pattern = r".*@.*\..*"
# Good
pattern = r"[^@]+@[^.]+\.[a-zA-Z]+"
# Anchor patterns when possible
# Unanchored (slower)
pattern = r"\d{3}-\d{3}-\d{4}"
# Anchored (faster)
pattern = r"^\d{3}-\d{3}-\d{4}$"
Performance Tips
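The cost of a nested quantifier can be measured directly. A rough, machine-dependent sketch (the 18-character input is deliberately short, since every extra character roughly doubles the backtracking work):

```python
import re
import time

# A near-miss input: ends in 'c', so "(a+)+b" can never match,
# forcing the engine to try every way of splitting the run of a's
subject = "a" * 18 + "c"

start = time.perf_counter()
re.search(r"(a+)+b", subject)   # backtracks through ~2^18 splits
slow = time.perf_counter() - start

start = time.perf_counter()
re.search(r"a+b", subject)      # fails after a single linear scan
fast = time.perf_counter() - start

print(f"nested: {slow:.4f}s, flat: {fast:.6f}s")
```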
graph TB
A["Regex Performance"] --> B["Compile Patterns"]
A --> C[Avoid Backtracking]
A --> D[Use Specific Classes]
A --> E[Anchor Patterns]
B --> B1[re.compile for reuse]
C --> C1[Avoid nested quantifiers]
D --> D1["[^@]+ instead of .*"]
E --> E1[^ and $ when appropriate]
13. Common Pitfalls and Best Practices
Common Mistakes
1. Forgetting Raw Strings
# Wrong - need to escape backslashes
pattern = "\\d+\\.\\d+"
# Right - use raw strings
pattern = r"\d+\.\d+"
2. Greedy vs Non-Greedy
html = "<div>Hello</div><div>World</div>"
# Wrong - matches entire string
pattern = r"<div>.*</div>"
match = re.search(pattern, html)
print(match.group()) # <div>Hello</div><div>World</div>
# Right - non-greedy matching
pattern = r"<div>.*?</div>"
matches = re.findall(pattern, html)
print(matches) # ['<div>Hello</div>', '<div>World</div>']
3. Not Escaping Special Characters
# Wrong - . matches any character
pattern = r"3.14"
text = "3X14"
print(bool(re.search(pattern, text))) # True (unexpected)
# Right - escape the literal dot
pattern = r"3\.14"
text = "3X14"
print(bool(re.search(pattern, text))) # False
Best Practices
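A related safeguard for the escaping pitfall above: when the literal text comes from a variable or user input, escaping by hand is error-prone, and re.escape quotes every metacharacter for you.

```python
import re

user_input = "3.14"              # contains a regex metacharacter
pattern = re.escape(user_input)  # '3\\.14' - the dot is now literal

print(bool(re.search(pattern, "3.14")))  # True
print(bool(re.search(pattern, "3X14")))  # False
```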
1. Use Verbose Mode for Complex Patterns
# Complex pattern - hard to read
pattern = r"^(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<number>\d{4})$"
# Same pattern with verbose mode - much clearer
pattern = re.compile(r"""
^ # Start of string
(?P<area>\d{3}) # Area code (3 digits)
- # Literal hyphen
(?P<exchange>\d{3}) # Exchange (3 digits)
- # Literal hyphen
(?P<number>\d{4}) # Number (4 digits)
$ # End of string
""", re.VERBOSE)
2. Validate Input Before Processing
def safe_regex_search(pattern, text):
if not isinstance(text, str):
return None
try:
compiled_pattern = re.compile(pattern)
return compiled_pattern.search(text)
except re.error as e:
print(f"Invalid regex pattern: {e}")
return None
3. Use Appropriate Methods
# For validation - use match()
def is_valid_email(email):
pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
return bool(re.match(pattern, email))
# For finding - use search() or findall()
def extract_emails(text):
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
return re.findall(pattern, text)
# For replacement - use sub()
def mask_emails(text):
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
return re.sub(pattern, "***@***.***", text)
Testing Regex Patterns
import re
import unittest

class TestEmailRegex(unittest.TestCase):
    def setUp(self):
        self.email_pattern = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

    def test_valid_emails(self):
        valid_emails = [
            "user@example.com",
            "test.email@domain.org",
            "user+tag@example.co.uk"
        ]
        for email in valid_emails:
            with self.subTest(email=email):
                self.assertTrue(self.email_pattern.match(email))

    def test_invalid_emails(self):
        invalid_emails = [
            "invalid.email",
            "@domain.com",
            "user@",
            "user space@domain.com"
        ]
        for email in invalid_emails:
            with self.subTest(email=email):
                self.assertFalse(self.email_pattern.match(email))

# Run tests
if __name__ == "__main__":
    unittest.main()

graph TD
A[Best Practices] --> B[Use Raw Strings]
A --> C[Test Thoroughly]
A --> D[Handle Errors]
A --> E[Document Complex Patterns]
A --> F[Choose Right Method]
B --> B1[Avoid escape issues]
C --> C1[Unit test patterns]
D --> D1[Catch re.error]
E --> E1[Use re.VERBOSE]
F --> F1[match vs search vs findall]

14. Comprehensive Regex Cheat Sheet
Quick Reference Patterns
# BASIC PATTERNS
basic_patterns = {
'literal': r'hello', # Matches "hello" exactly
'case_insensitive': r'(?i)hello', # Matches "Hello", "HELLO", etc.
'any_char': r'h.llo', # Matches "hello", "hallo", "h3llo"
'optional': r'colou?r', # Matches "color" or "colour"
}
# CHARACTER CLASSES
character_classes = {
'digit': r'\d', # [0-9]
'non_digit': r'\D', # [^0-9]
'word': r'\w', # [a-zA-Z0-9_]
'non_word': r'\W', # [^a-zA-Z0-9_]
'whitespace': r'\s', # [ \t\n\r\f\v]
'non_whitespace': r'\S', # [^ \t\n\r\f\v]
'custom_class': r'[aeiou]', # Vowels only
'range': r'[a-z]', # Lowercase letters
'negated': r'[^0-9]', # Not digits
}
# QUANTIFIERS
quantifiers = {
'zero_or_more': r'a*', # "", "a", "aa", "aaa"
'one_or_more': r'a+', # "a", "aa", "aaa"
'zero_or_one': r'a?', # "", "a"
'exactly_n': r'a{3}', # "aaa"
'n_or_more': r'a{3,}', # "aaa", "aaaa", "aaaaa"
'between_n_m': r'a{2,4}', # "aa", "aaa", "aaaa"
'non_greedy': r'a+?', # Non-greedy one or more
}
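```python
# Quick illustration of greedy vs non-greedy: '<.+>' swallows everything
# up to the last '>', while '<.+?>' stops at the first one.
import re
html = "<b>bold</b>"
print(re.findall(r"<.+>", html))   # ['<b>bold</b>']
print(re.findall(r"<.+?>", html))  # ['<b>', '</b>']
```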
# ANCHORS AND BOUNDARIES
anchors = {
'start_of_string': r'^hello', # Must start with "hello"
'end_of_string': r'world$', # Must end with "world"
'word_boundary': r'\bhello\b', # "hello" as whole word
'non_word_boundary': r'\Bhello\B', # "hello" inside word
'start_of_line': r'(?m)^hello', # Start of any line
'end_of_line': r'(?m)world$', # End of any line
}
# GROUPS AND CAPTURING
groups = {
'capturing': r'(hello)', # Captures "hello"
'non_capturing': r'(?:hello)', # Groups but doesn't capture
'named_group': r'(?P<greeting>hello)', # Named capture
'backreference': r'(\w+) \1', # Matches repeated words
'conditional': r'(a)?(?(1)b|c)', # Matches "ab" or "c" (b if group 1 matched, else c)
}
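```python
# Backreference demo: (\w+) \1 finds an immediately repeated word;
# findall returns the captured group.
import re
print(re.findall(r"\b(\w+) \1\b", "it was was a nice day"))  # ['was']
```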
# LOOKAROUNDS
lookarounds = {
'positive_lookahead': r'hello(?= world)', # "hello" followed by " world"
'negative_lookahead': r'hello(?! world)', # "hello" NOT followed by " world"
'positive_lookbehind': r'(?<=say )hello', # "hello" preceded by "say "
'negative_lookbehind': r'(?<!say )hello', # "hello" NOT preceded by "say "
}

Common Use Case Patterns
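Each group of patterns below is ready to drop into re.search or re.findall. As a quick, illustrative run using the basic email pattern from this list (with the letter class written as [A-Za-z]):

```python
import re

# Basic email pattern (same idea as the 'basic' entry below)
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
text = "Reach us at support@example.com or sales@example.org."
print(re.findall(pattern, text))  # ['support@example.com', 'sales@example.org']
```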
# EMAIL VALIDATION
email_patterns = {
'basic': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
'strict': r'^[a-zA-Z0-9.!#$%&\'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$'
}
# PHONE NUMBERS
phone_patterns = {
'us_simple': r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',
'us_with_country': r'(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',
'international': r'^\+?[1-9]\d{1,14}$'
}
# URLS
url_patterns = {
'http_https': r'https?://(?:[-\w.])+(?:[:\d]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:#(?:\w)*)?)?',
'with_subdomains': r'https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&=]*)'
}
# DATES
date_patterns = {
'mm_dd_yyyy': r'\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/\d{4}\b',
'yyyy_mm_dd': r'\b\d{4}-(0?[1-9]|1[0-2])-(0?[1-9]|[12]\d|3[01])\b',
'dd_mm_yyyy': r'\b(0?[1-9]|[12]\d|3[01])/(0?[1-9]|1[0-2])/\d{4}\b'
}
# IP ADDRESSES
ip_patterns = {
'ipv4': r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b',
'ipv4_strict': r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'
}
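```python
# The simple 'ipv4' pattern accepts out-of-range octets; the strict
# version limits each octet to 0-255.
import re
loose = r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'
strict = r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'
print(bool(re.search(loose, "999.999.999.999")))   # True
print(bool(re.search(strict, "999.999.999.999")))  # False
```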
# CREDIT CARDS
credit_card_patterns = {
'visa': r'4[0-9]{12}(?:[0-9]{3})?',
'mastercard': r'5[1-5][0-9]{14}',
'amex': r'3[47][0-9]{13}',
'any': r'(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|3[0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})'
}

Method Comparison Table
| Method | Purpose | Returns | Use When |
|---|---|---|---|
| re.match() | Match from start | Match object or None | Validating entire string |
| re.search() | Find first match | Match object or None | Finding one occurrence |
| re.findall() | Find all matches | List of strings | Getting all matches as list |
| re.finditer() | Find all matches | Iterator of Match objects | Need match details for all |
| re.sub() | Replace matches | Modified string | Text replacement |
| re.subn() | Replace matches | (string, count) tuple | Replacement + count needed |
| re.split() | Split by pattern | List of strings | Splitting text |
| re.compile() | Compile pattern | Pattern object | Reusing same pattern |
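A quick run of the main methods on a single string makes the differences in the table concrete:

```python
import re

text = "one cat, two cats"
print(re.match(r"cat", text))           # None - match() only looks at the start
print(re.search(r"cat", text).start())  # 4 - search() scans the whole string
print(re.findall(r"cats?", text))       # ['cat', 'cats']
print(re.sub(r"cats?", "dog", text))    # 'one dog, two dog'
```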
Flag Options
import re
# COMMON FLAGS
flags_demo = {
'IGNORECASE': re.IGNORECASE, # Case-insensitive matching
'MULTILINE': re.MULTILINE, # ^ and $ match line boundaries
'DOTALL': re.DOTALL, # . matches newlines too
'VERBOSE': re.VERBOSE, # Allow comments and whitespace
'ASCII': re.ASCII, # Make \w, \W, \b, \B ASCII-only
'LOCALE': re.LOCALE, # Locale-dependent matching (bytes patterns only in Python 3)
'UNICODE': re.UNICODE, # Unicode matching (already the default for str patterns)
}
# COMBINING FLAGS
combined_flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
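```python
# Combined flags in action: DOTALL lets '.' cross the newline and
# IGNORECASE matches 'First' against 'first'.
import re
text = "First line\nsecond line"
print(bool(re.search(r"first.*second", text, re.IGNORECASE | re.DOTALL)))  # True
print(bool(re.search(r"first.*second", text, re.IGNORECASE)))              # False
```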
# INLINE FLAGS
inline_flags = {
'case_insensitive': r'(?i)pattern',
'multiline': r'(?m)pattern',
'dotall': r'(?s)pattern',
'verbose': r'(?x)pattern',
'multiple': r'(?ims)pattern', # Combined flags
}

graph TB
A[Regex Cheat Sheet] --> B[Basic Patterns]
A --> C[Character Classes]
A --> D[Quantifiers]
A --> E[Common Use Cases]
B --> B1[Literal matching]
B --> B2[Case options]
B --> B3[Any character]
C --> C1[Built-in classes]
C --> C2[Custom classes]
C --> C3[Negated classes]
D --> D1[Basic quantifiers]
D --> D2[Exact counts]
D --> D3[Greedy vs lazy]
E --> E1[Email validation]
E --> E2[Phone numbers]
E --> E3[URLs & IPs]
style A fill:#e1f5fe
style E fill:#c8e6c9

15. Practice Exercises
Beginner Exercises
# Exercise 1: Basic Pattern Matching
def exercise_1():
    """Find all words that start with 'th' (case insensitive)"""
    text = "The quick brown fox thinks that this is the best thing."
    # Your pattern here:
    pattern = r"\bth\w*"
    # Solution: re.findall(pattern, text, re.IGNORECASE)

# Exercise 2: Number Extraction
def exercise_2():
    """Extract all numbers (integers and decimals) from text"""
    text = "The price is $29.99 and we have 15 items in stock. Call 555-1234."
    # Your pattern here:
    pattern = r"\d+\.?\d*"
    # Test your solution

# Exercise 3: Email Finding
def exercise_3():
    """Find all email addresses in the text"""
    text = "Contact john@example.com or mary.smith@company.org for more info."
    # Your pattern here:
    pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
    # Test your solution

Intermediate Challenges
# Challenge 1: Phone Number Standardization
def challenge_1():
    """Convert various phone formats to (XXX) XXX-XXXX"""
    phones = [
        "123-456-7890",
        "(555) 123-4567",
        "555.123.4567",
        "5551234567"
    ]
    # Create a function to standardize all formats
    def standardize_phone(phone):
        # Capture only the digit groups; keep the separators outside the parentheses
        pattern = r"\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})"
        match = re.search(pattern, phone)
        if match:
            return f"({match.group(1)}) {match.group(2)}-{match.group(3)}"
        return None

# Challenge 2: Log Parser
def challenge_2():
    """Parse web server logs to extract useful information"""
    log_entry = '192.168.1.100 - - [10/Oct/2023:13:55:36 +0000] "GET /api/users HTTP/1.1" 200 2326'
    # Extract: IP, timestamp, method, endpoint, status code, response size
    pattern = r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\].*?"(\w+)\s+(\S+)[^"]*"\s+(\d+)\s+(\d+)'
    # Complete the parser function

# Challenge 3: HTML Tag Extractor
def challenge_3():
    """Extract all HTML tags and their attributes"""
    html = '<div class="container" id="main"><p>Hello</p><a href="http://example.com">Link</a></div>'
    # Extract tag name and attributes separately
    tag_pattern = r'<(\w+)([^>]*)>'
    # Complete the extraction logic

Advanced Problems
# Advanced 1: Nested Structure Parser
def advanced_1():
    """Parse nested parentheses and extract content at each level"""
    text = "((hello (world) test) (foo bar))"
    # This requires recursive or advanced techniques
    # Hint: You might need to use a stack-based approach or recursive regex

# Advanced 2: Template Variable Extractor
def advanced_2():
    """Extract template variables like {{variable}} and {{variable|filter}}"""
    template = "Hello {{name}}, your balance is {{account.balance|currency}}."
    # Extract variable names and filters
    pattern = r'\{\{\s*([^|\}]+)(?:\|([^}]+))?\s*\}\}'
    # Complete the extraction and parsing

# Advanced 3: SQL Query Parser
def advanced_3():
    """Parse SQL SELECT statements to extract tables and columns"""
    sql = """
    SELECT users.name, orders.total, products.title
    FROM users
    JOIN orders ON users.id = orders.user_id
    JOIN products ON orders.product_id = products.id
    WHERE users.active = 1
    """
    # Extract table names, column names, and JOIN conditions
    # This is a complex parsing challenge

Exercise Solutions and Explanations
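For Advanced 1, note that a single Python regex cannot count arbitrary nesting, which is why the exercise hints at a stack-based approach. One possible sketch (the function name and return shape are illustrative, not part of the original exercise):

```python
def extract_levels(text):
    """Collect the text inside each parenthesized group, keyed by nesting depth."""
    stack = []   # positions of unmatched '('
    levels = {}  # depth -> list of captured contents
    for i, ch in enumerate(text):
        if ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            start = stack.pop()
            depth = len(stack) + 1  # depth of the group just closed
            levels.setdefault(depth, []).append(text[start + 1:i])
    return levels

print(extract_levels("((hello (world) test) (foo bar))"))
# {3: ['world'], 2: ['hello (world) test', 'foo bar'], 1: ['(hello (world) test) (foo bar)']}
```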
def show_solutions():
    """Detailed solutions with explanations"""
    solutions = {
        "Beginner 1": {
            "pattern": r"\bth\w*",
            "flags": "re.IGNORECASE",
            "explanation": "\\b anchors to a word boundary, th matches literal 'th', \\w* matches the rest of the word"
        },
        "Beginner 2": {
            "pattern": r"\d+\.?\d*",
            "explanation": "\\d+ matches digits, \\.? optional decimal point, \\d* optional decimal digits"
        },
        "Intermediate 1": {
            "approach": "Capture only the digit groups; keep the separators outside the parentheses",
            "pattern": r"\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})",
            "replacement": r"(\1) \2-\3"
        }
    }
    for exercise, solution in solutions.items():
        print(f"\n{exercise}:")
        for key, value in solution.items():
            print(f"  {key}: {value}")

# Interactive practice function
def practice_regex():
    """Interactive regex practice session"""
    exercises = [
        {
            "description": "Find all words ending in 'ing'",
            "text": "Running and jumping are fun activities.",
            "expected": ["Running", "jumping"]
        },
        {
            "description": "Extract all hashtags from social media text",
            "text": "Love this weather! #sunny #beautiful #weekend",
            "expected": ["#sunny", "#beautiful", "#weekend"]
        }
    ]
    for i, exercise in enumerate(exercises, 1):
        print(f"\nExercise {i}: {exercise['description']}")
        print(f"Text: {exercise['text']}")
        print(f"Expected: {exercise['expected']}")
        # Student would input their pattern here
        # pattern = input("Enter your regex pattern: ")
        # result = re.findall(pattern, exercise['text'])
        # print(f"Your result: {result}")

graph TB
A[Practice Exercises] --> B[Beginner Level]
A --> C[Intermediate Level]
A --> D[Advanced Level]
B --> B1[Basic matching]
B --> B2[Character classes]
B --> B3[Simple quantifiers]
C --> C1[Complex patterns]
C --> C2[Data parsing]
C --> C3[Text processing]
D --> D1[Nested structures]
D --> D2[Template parsing]
D --> D3[Language parsing]
style A fill:#e1f5fe
style B fill:#e8f5e8
style C fill:#fff3e0
style D fill:#ffebee

Conclusion
Regular expressions are a powerful tool for text processing in Python. This comprehensive guide has taken you from beginner concepts to expert-level techniques.
Key Learning Outcomes
By completing this guide, you should be able to:
- Understand Regex Fundamentals: Basic patterns, character classes, and quantifiers
- Apply Advanced Techniques: Lookarounds, groups, and complex pattern matching
- Debug Regex Patterns: Use systematic approaches to troubleshoot issues
- Optimize Performance: Write efficient patterns and avoid common pitfalls
- Solve Real-World Problems: Apply regex to practical text processing tasks
Best Practices Summary
graph TB
A[Regex Best Practices] --> B[Development]
A --> C[Testing]
A --> D[Performance]
A --> E[Maintenance]
B --> B1[Start simple, add complexity]
B --> B2[Use raw strings r'']
B --> B3[Comment complex patterns]
C --> C1[Test edge cases]
C --> C2[Validate with real data]
C --> C3[Use unit tests]
D --> D1[Compile patterns for reuse]
D --> D2[Avoid catastrophic backtracking]
D --> D3[Use specific character classes]
E --> E1[Document pattern purpose]
E --> E2[Use meaningful variable names]
E --> E3[Keep patterns readable]

Essential Quick Reference
import re
# Most commonly used patterns
essential_patterns = {
'email': r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
'phone_us': r"\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})",
'url': r"https?://(?:[-\w.])+(?:[:\d]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:#(?:\w)*)?)?",
'ipv4': r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b",
'date_mdy': r"\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/\d{4}\b",
'number': r"-?\d+\.?\d*",
'word': r"\b\w+\b"
}
# Essential methods
re.search(pattern, text) # Find first match
re.findall(pattern, text) # Find all matches
re.finditer(pattern, text) # Iterator of match objects
re.sub(pattern, repl, text) # Replace matches
re.split(pattern, text) # Split by pattern
re.compile(pattern)          # Compile for reuse

Your Next Steps
- Practice Daily: Work with regex patterns regularly
- Build a Pattern Library: Save useful patterns you create
- Join Communities: Engage with other regex learners
- Contribute: Help others and share your knowledge
- Stay Updated: Follow regex developments
When to Use Regex (and When Not To)
✅ Good for:
- Pattern matching and validation
- Text extraction and parsing
- Find and replace operations
- Data cleaning and preprocessing
❌ Consider alternatives for:
- Complex parsing (use dedicated parsers)
- Simple string operations (use str methods)
- Structured data (JSON, XML, CSV libraries)
- Performance-critical code without optimization
Happy regex coding! 🚀
*”Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” – Jamie Zawinski*
While humorous, this quote reminds us to use regex thoughtfully. This guide teaches you to be one of the developers who wields regex effectively and responsibly.