Python Regular Expressions

    Table of Contents

    1. Introduction to Regular Expressions
      • Why Learn Regex?
      • Common Use Cases
      • Regex Learning Roadmap
    2. Getting Started with Python’s re Module
      • Setting Up the Environment
      • First Regex Example
      • Raw Strings and Escape Characters
    3. Basic Regex Patterns
      • Literal Character Matching
      • Case Sensitivity Options
      • Basic Pattern Exercises
    4. Character Classes and Special Characters
      • Built-in Character Classes
      • Custom Character Classes
      • Character Class Combinations
    5. Quantifiers
      • Basic Quantifiers
      • Greedy vs Non-Greedy Matching
      • Quantifier Practice Examples
    6. Groups and Capturing
      • Basic Grouping
      • Named Groups
      • Non-Capturing Groups
      • Backreferences
    7. Anchors and Boundaries
      • Start and End Anchors
      • Word Boundaries
      • Multiline Mode
    8. Advanced Patterns
      • Lookahead and Lookbehind Assertions
      • Alternation and Conditional Matching
      • Recursive Patterns
    9. Regex Methods in Python
      • Core Methods Comparison
      • Compiled Patterns
      • Working with Match Objects
    10. Interactive Examples and Debugging
      • Step-by-Step Pattern Building
      • Debugging Tools and Techniques
      • Common Error Messages
    11. Real-World Applications
      • Data Validation Patterns
      • Text Processing and Extraction
      • Log Analysis and Parsing
      • Web Scraping Applications
    12. Performance and Optimization
      • Pattern Compilation Best Practices
      • Avoiding Catastrophic Backtracking
      • Memory and Speed Optimization
    13. Common Pitfalls and Best Practices
      • Frequent Mistakes
      • Testing Strategies
      • Code Organization
    14. Comprehensive Regex Cheat Sheet
      • Quick Reference Patterns
      • Method Comparison Table
      • Flag Options
    15. Practice Exercises
      • Beginner Exercises
      • Intermediate Challenges
      • Advanced Problems

    1. Introduction to Regular Expressions

    Regular expressions (regex) are powerful pattern-matching tools used to find, match, and manipulate text. They provide a concise way to describe complex string patterns.

    graph LR
        A[Text Input] --> B[Regex Pattern]
        B --> C{Match?}
        C -->|Yes| D[Extract/Replace/Validate]
        C -->|No| E[No Action]

    Why Learn Regex?

    • Text Processing: Extract specific information from large text files
    • Data Validation: Validate email addresses, phone numbers, etc.
    • Data Cleaning: Remove unwanted characters or format data
    • Log Analysis: Parse and analyze log files
    • Web Scraping: Extract specific data from HTML
    graph TB
        A[Regular Expressions] --> B[Pattern Matching]
        A --> C[Text Manipulation]
        A --> D[Data Validation]
    
        B --> B1[Find specific text]
        B --> B2[Extract information]
        B --> B3[Search & Replace]
    
        C --> C1[Clean data]
        C --> C2[Transform format]
        C --> C3[Split strings]
    
        D --> D1[Email validation]
        D --> D2[Phone numbers]
        D --> D3[Input sanitization]

    Common Use Cases

    # Example scenarios where regex excels
    examples = {
        "Email extraction": "Extract all emails from a text file",
        "Phone formatting": "Convert (555) 123-4567 to 555-123-4567",
        "Data cleaning": "Remove extra whitespace and special characters",
        "Log parsing": "Extract timestamps and error codes from logs",
        "URL validation": "Check if a string is a valid web address"
    }
    Python

    Regex Learning Roadmap

    graph TD
        A[Start Here] --> B[Basic Patterns]
        B --> C[Character Classes]
        C --> D[Quantifiers]
        D --> E[Groups & Capturing]
        E --> F[Anchors & Boundaries]
        F --> G[Advanced Features]
        G --> H[Real-world Applications]
        H --> I[Performance & Optimization]
    
        style A fill:#e1f5fe
        style I fill:#c8e6c9

    2. Getting Started with Python’s re Module

    Python’s built-in re module provides regex functionality.

    import re
    
    # Basic pattern matching
    pattern = r"hello"
    text = "hello world"
    match = re.search(pattern, text)
    print(match.group() if match else "No match")  # Output: hello
    Python

    Raw Strings

    Always use raw strings (r"") for regex patterns to avoid escaping issues:

    # Good practice
    pattern = r"\d+\.\d+"  # Matches decimal numbers
    
    # Avoid this
    pattern = "\\d+\\.\\d+"  # Same pattern but harder to read
    Python
    flowchart TD
        A[Regex Pattern] --> B{Raw String?}
        B -->|Yes r""| C[Clean, Readable Pattern]
        B -->|No ""| D[Escaped Characters \\\\]
        C --> E[Easy to Debug]
        D --> F[Hard to Read/Maintain]

    3. Basic Regex Patterns

    Literal Characters

    Match exact characters:

    import re
    
    pattern = r"cat"
    text = "The cat sat on the mat"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['cat']
    Python

    Case Sensitivity

    # Case sensitive (default)
    pattern = r"Cat"
    text = "cat and Cat"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['Cat']
    
    # Case insensitive
    pattern = r"Cat"
    text = "cat and Cat"
    matches = re.findall(pattern, text, re.IGNORECASE)
    print(matches)  # Output: ['cat', 'Cat']
    Python
    graph TB
        A["Input Text: 'cat and Cat'"] --> B["Pattern: 'Cat'"]
        B --> C{Case Sensitive?}
        C -->|Yes| D["Matches: ['Cat']"]
        C -->|No re.IGNORECASE| E["Matches: ['cat', 'Cat']"]

    4. Character Classes and Special Characters

    Basic Character Classes

    import re
    
    # \d - digits (0-9)
    pattern = r"\d+"
    text = "I have 25 apples and 10 oranges"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['25', '10']
    
    # \w - word characters (letters, digits, underscore)
    pattern = r"\w+"
    text = "hello_world 123!"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['hello_world', '123']
    
    # \s - whitespace characters
    pattern = r"\s+"
    text = "hello   world\t\n"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['   ', '\t\n']
    Python

    Character Class Summary

    Pattern   Description                       Example Match
    \d        Digit (0-9)                       5, 42
    \D        Non-digit                         a, @
    \w        Word character                    a, Z, _, 5
    \W        Non-word character                @, !, (space)
    \s        Whitespace                        (space), \t, \n
    \S        Non-whitespace                    a, 1, @
    .         Any character (except newline)    a, 1, @
    graph TD
        A[Character Classes] --> B["\d Digits"]
        A --> C["\w Word Chars"]
        A --> D["\s Whitespace"]
        A --> E[". Any Char"]
        A --> F[Custom Classes]
    
        B --> B1["0-9 (Numbers)"]
        C --> C1["a-z, A-Z, 0-9, _ (Alphanumeric + underscore)"]
        D --> D1["Space, Tab, Newline"]
        E --> E1["Everything except newline"]
        F --> F1["[abc] - specific chars"]
        F --> F2["[a-z] - ranges"]
        F --> F3["[^abc] - negation"]
    
        style A fill:#e3f2fd
        style B fill:#fff3e0
        style C fill:#f3e5f5
        style D fill:#e8f5e8
        style E fill:#ffebee
        style F fill:#fce4ec

    Custom Character Classes

    # [abc] - matches a, b, or c
    pattern = r"[aeiou]"
    text = "hello world"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['e', 'o', 'o']
    
    # [a-z] - matches lowercase letters
    pattern = r"[a-z]+"
    text = "Hello World 123"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['ello', 'orld']
    
    # [^abc] - matches anything except a, b, or c
    pattern = r"[^aeiou]+"
    text = "hello world"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['h', 'll', ' w', 'rld']
    Python

    5. Quantifiers

    Quantifiers specify how many times a pattern should match.

    import re
    
    # * - zero or more
    pattern = r"ab*c"
    texts = ["ac", "abc", "abbc", "abbbc"]
    for text in texts:
        match = re.search(pattern, text)
        print(f"{text}: {bool(match)}")
    # Output: ac: True, abc: True, abbc: True, abbbc: True
    
    # + - one or more
    pattern = r"ab+c"
    texts = ["ac", "abc", "abbc"]
    for text in texts:
        match = re.search(pattern, text)
        print(f"{text}: {bool(match)}")
    # Output: ac: False, abc: True, abbc: True
    
    # ? - zero or one
    pattern = r"colou?r"
    texts = ["color", "colour"]
    for text in texts:
        match = re.search(pattern, text)
        print(f"{text}: {bool(match)}")
    # Output: color: True, colour: True
    
    # {n} - exactly n times
    pattern = r"\d{3}"
    text = "Call 123-456-7890"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['123', '456', '789']
    
    # {n,m} - between n and m times
    pattern = r"\d{2,4}"
    text = "1 22 333 4444 55555"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['22', '333', '4444', '5555']
    Python

    Quantifier Summary

    Quantifier   Description       Example
    *            0 or more         ab* matches a, ab, abb
    +            1 or more         ab+ matches ab, abb (not a)
    ?            0 or 1            ab? matches a, ab
    {n}          Exactly n         a{3} matches aaa
    {n,}         n or more         a{2,} matches aa, aaa, aaaa
    {n,m}        Between n and m   a{2,4} matches aa, aaa, aaaa
    graph LR
        A[Quantifiers] --> B["Greedy: *, +, ?"]
        A --> C["Exact: {n}"]
        A --> D["Range: {n,m}"]
    
        B --> B1[Match as much as possible]
        C --> C1[Match exact count]
        D --> D1[Match within range]

    Greedy vs Non-Greedy

    import re
    
    text = "<tag>content</tag>"
    
    # Greedy matching (default)
    pattern = r"<.*>"
    match = re.search(pattern, text)
    print(match.group())  # Output: <tag>content</tag>
    
    # Non-greedy matching
    pattern = r"<.*?>"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['<tag>', '</tag>']
    Python

    6. Groups and Capturing

    Groups allow you to capture parts of a match and apply quantifiers to multiple characters.

    import re
    
    # Basic grouping
    pattern = r"(\d{3})-(\d{3})-(\d{4})"
    text = "Call me at 123-456-7890"
    match = re.search(pattern, text)
    if match:
        print(f"Full match: {match.group(0)}")  # 123-456-7890
        print(f"Area code: {match.group(1)}")   # 123
        print(f"Exchange: {match.group(2)}")    # 456
        print(f"Number: {match.group(3)}")      # 7890
        print(f"All groups: {match.groups()}")  # ('123', '456', '7890')
    Python

    Named Groups

    pattern = r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<number>\d{4})"
    text = "Call me at 123-456-7890"
    match = re.search(pattern, text)
    if match:
        print(f"Area: {match.group('area')}")        # 123
        print(f"Exchange: {match.group('exchange')}")# 456
        print(f"Number: {match.group('number')}")    # 7890
        print(f"Dict: {match.groupdict()}")          # {'area': '123', 'exchange': '456', 'number': '7890'}
    Python

    Non-Capturing Groups

    # (?:...) - non-capturing group
    pattern = r"(?:Mr|Mrs|Ms)\. (\w+)"
    text = "Hello Mr. Smith and Mrs. Johnson"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['Smith', 'Johnson']
    Python
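
    Backreferences

    A backreference matches the same text that an earlier group captured: \1 refers to group 1, and (?P=name) refers to a named group. A minimal sketch that finds and collapses doubled words:

    import re
    
    # \1 must repeat exactly what group 1 captured
    pattern = r"\b(\w+)\s+\1\b"
    text = "This is is a test test sentence"
    
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['is', 'test']
    
    # Replace each doubled word with a single copy of the captured word
    cleaned = re.sub(pattern, r"\1", text)
    print(cleaned)  # Output: This is a test sentence
    Python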
    graph TD
        A[Groups in Regex] --> B["Capturing Groups <br/> (...)"]
        A --> C["Named Groups <br/> (?P&lt;name&gt;...)"]
        A --> D["Non-Capturing Groups <br/> (?:...)"]
    
        B --> B1[Accessible by number]
        C --> C1[Accessible by name]
        D --> D1[Not captured, just grouped]

    7. Anchors and Boundaries

    Anchors specify where in the string a match should occur.

    import re
    
    text = "The cat in the hat"
    
    # ^ - start of string
    pattern = r"^The"
    match = re.search(pattern, text)
    print(bool(match))  # True
    
    pattern = r"^cat"
    match = re.search(pattern, text)
    print(bool(match))  # False
    
    # $ - end of string
    pattern = r"hat$"
    match = re.search(pattern, text)
    print(bool(match))  # True
    
    # \b - word boundary
    pattern = r"\bcat\b"
    texts = ["cat", "catch", "scattered", "the cat runs"]
    for t in texts:
        match = re.search(pattern, t)
        print(f"'{t}': {bool(match)}")
    # Output: 'cat': True, 'catch': False, 'scattered': False, 'the cat runs': True
    Python
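
    Multiline Mode

    By default, ^ and $ anchor to the start and end of the whole string. With the re.MULTILINE flag they also match at the start and end of each line; a quick illustration:

    import re
    
    text = "first line\nsecond line\nthird line"
    
    # Without re.MULTILINE, ^ only matches at the very beginning
    print(re.findall(r"^\w+", text))                # ['first']
    
    # With re.MULTILINE, ^ matches at the start of every line
    print(re.findall(r"^\w+", text, re.MULTILINE))  # ['first', 'second', 'third']
    
    # Likewise, $ matches before each newline in multiline mode
    print(re.findall(r"\w+$", text, re.MULTILINE))  # ['line', 'line', 'line']
    Python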

    Anchor Summary

    Anchor   Description         Example
    ^        Start of string     ^Hello matches “Hello world”
    $        End of string       world$ matches “Hello world”
    \b       Word boundary       \bcat\b matches “cat” but not “catch”
    \B       Non-word boundary   \Bcat\B matches “scattered”
    graph LR
        A["Text: 'The cat in the hat'"] --> B[Anchors]
        B --> C["^ Start"]
        B --> D["$ End"]
        B --> E["\b Word Boundary"]
    
        C --> C1["^The ✓"]
        D --> D1["hat$ ✓"]
        E --> E1["\bcat\b ✓"]

    8. Advanced Patterns

    Lookahead and Lookbehind

    import re
    
    # Positive lookahead (?=...)
    pattern = r"\d+(?= dollars)"
    text = "I have 50 dollars and 25 cents"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['50']
    
    # Negative lookahead (?!...)
    pattern = r"\d+(?! dollars)"
    text = "I have 50 dollars and 25 cents"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['5', '25']  (the '5' in '50' matches because '0 dollars' follows it, not ' dollars')
    
    # Positive lookbehind (?<=...)
    pattern = r"(?<=\$)\d+"
    text = "Price: $25.99"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['25']
    
    # Negative lookbehind (?<!...)
    pattern = r"(?<!\$)\d+"
    text = "Price: $25.99 and 10 items"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['5', '99', '10']
    Python

    Alternation

    # | - OR operator
    pattern = r"cat|dog|bird"
    text = "I have a cat and a dog"
    matches = re.findall(pattern, text)
    print(matches)  # Output: ['cat', 'dog']
    
    # Grouped alternation
    pattern = r"(Mr|Mrs|Ms)\. (\w+)"
    text = "Mr. Smith and Ms. Johnson"
    matches = re.findall(pattern, text)
    print(matches)  # Output: [('Mr', 'Smith'), ('Ms', 'Johnson')]
    Python
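
    Conditional Matching

    Python also supports conditional patterns of the form (?(id)yes|no): the "yes" branch is tried only if the group with that id participated in the match. A small sketch that accepts a number with or without parentheses, but never a mismatched pair:

    import re
    
    # (?(1)\)) requires a closing paren only when group 1 captured an opening one
    pattern = r"(\()?\d+(?(1)\))"
    for candidate in ["42", "(42)", "(42"]:
        match = re.search(pattern, candidate)
        print(f"{candidate!r}: {match.group() if match else 'no match'}")
    # '42': 42
    # '(42)': (42)
    # '(42': 42   (the unmatched '(' is simply skipped over)
    Python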
    graph TD
        A[Advanced Patterns] --> B["Lookahead (?=...)"]
        A --> C["Lookbehind (?<=...)"]
        A --> D["Alternation |"]
    
        B --> B1["Match if followed by..."]
        C --> C1["Match if preceded by..."]
        D --> D1[Match this OR that]

    9. Regex Methods in Python

    Essential re Module Functions

    import re
    
    text = "The price is $25.99 and $15.50"
    pattern = r"\$(\d+\.\d+)"
    
    # re.search() - finds first match
    match = re.search(pattern, text)
    if match:
        print(f"First price: {match.group(1)}")  # 25.99
    
    # re.findall() - finds all matches
    matches = re.findall(pattern, text)
    print(f"All prices: {matches}")  # ['25.99', '15.50']
    
    # re.finditer() - returns match objects
    for match in re.finditer(pattern, text):
        print(f"Price: ${match.group(1)} at position {match.start()}-{match.end()}")
    
    # re.sub() - substitute matches
    new_text = re.sub(pattern, r"$XXX.XX", text)
    print(new_text)  # The price is $XXX.XX and $XXX.XX
    
    # re.split() - split by pattern
    text = "apple,banana;orange:grape"
    fruits = re.split(r"[,;:]", text)
    print(fruits)  # ['apple', 'banana', 'orange', 'grape']
    Python
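
    One more method worth knowing: re.fullmatch() (available since Python 3.4) succeeds only if the entire string matches the pattern, which makes validation slightly stricter than re.match(), which anchors only the start:

    import re
    
    pattern = r"\d{3}-\d{4}"
    
    print(bool(re.match(pattern, "555-1234 ext. 89")))      # True  - the start matches
    print(bool(re.fullmatch(pattern, "555-1234 ext. 89")))  # False - trailing text rejected
    print(bool(re.fullmatch(pattern, "555-1234")))          # True
    Python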

    Compiled Patterns

    # Compile pattern for reuse (more efficient)
    pattern = re.compile(r"\$(\d+\.\d+)")
    text = "Price: $25.99"
    
    match = pattern.search(text)
    matches = pattern.findall(text)
    new_text = pattern.sub("$XX.XX", text)
    Python
    flowchart TD
        A[re Module Methods] --> B[re.search]
        A --> C[re.findall]
        A --> D[re.finditer]
        A --> E[re.sub]
        A --> F[re.split]
        A --> G[re.compile]
    
        B --> B1[First match object]
        C --> C1[List of all matches]
        D --> D1[Iterator of match objects]
        E --> E1[Replace matches]
        F --> F1[Split by pattern]
        G --> G1[Compiled pattern object]

    10. Interactive Examples and Debugging

    Step-by-Step Pattern Building

    Let’s build a complex email validation pattern step by step:

    import re
    
    # Step 1: Start simple - match any characters before @
    pattern1 = r".+@"
    test_email = "user@example.com"
    print(f"Step 1: {bool(re.search(pattern1, test_email))}")  # True
    
    # Step 2: Be more specific about allowed characters before @
    pattern2 = r"[a-zA-Z0-9._%+-]+@"
    print(f"Step 2: {bool(re.search(pattern2, test_email))}")  # True
    
    # Step 3: Add domain part
    pattern3 = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+"
    print(f"Step 3: {bool(re.search(pattern3, test_email))}")  # True
    
    # Step 4: Ensure domain has extension
    pattern4 = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+"
    print(f"Step 4: {bool(re.search(pattern4, test_email))}")  # True
    
    # Step 5: Add anchors for exact matching
    pattern5 = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    print(f"Step 5 (final): {bool(re.match(pattern5, test_email))}")  # True
    Python
    graph TD
        A["Start: .+@"] --> B["Refine: [a-zA-Z0-9._%+-]+@"]
        B --> C["Add domain: +@[a-zA-Z0-9.-]+"]
        C --> D["Add extension: +\.[a-zA-Z]+"]
        D --> E["Add anchors: ^...$"]
        E --> F["Final Pattern"]
    
        style A fill:#ffebee
        style F fill:#e8f5e8

    Interactive Pattern Tester

    def test_pattern_interactively():
        """Interactive pattern testing function"""
    
        def test_regex(pattern, test_strings, description=""):
            """Test a regex pattern against multiple strings"""
            print(f"\n{'='*50}")
            print(f"Testing: {description}")
            print(f"Pattern: {pattern}")
            print(f"{'='*50}")
    
            compiled_pattern = re.compile(pattern)
    
            for test_string in test_strings:
                match = compiled_pattern.search(test_string)
                if match:
                    print(f"✓ '{test_string}' -> Match: '{match.group()}'")
                    if match.groups():
                        print(f"  Groups: {match.groups()}")
                else:
                    print(f"✗ '{test_string}' -> No match")
    
        # Example usage
        phone_pattern = r"(\(?\d{3}\)?[-.\s]?)(\d{3}[-.\s]?\d{4})"
        phone_tests = [
            "123-456-7890",
            "(555) 123-4567",
            "555.123.4567",
            "1234567890",
            "invalid-phone"
        ]
    
        test_regex(phone_pattern, phone_tests, "Phone Number Validation")
    
    # Run the interactive tester
    test_pattern_interactively()
    Python

    Debugging Tools and Techniques

    def debug_regex_step_by_step(pattern, text):
        """Debug a regex pattern by showing each step"""
    
        print(f"Debugging pattern: {pattern}")
        print(f"Against text: '{text}'")
        print("-" * 50)
    
        try:
            compiled_pattern = re.compile(pattern)
            match = compiled_pattern.search(text)
    
            if match:
                print(f"✓ Match found!")
                print(f"  Full match: '{match.group()}'")
                print(f"  Start position: {match.start()}")
                print(f"  End position: {match.end()}")
                print(f"  Span: {match.span()}")
    
                if match.groups():
                    print("  Captured groups:")
                    for i, group in enumerate(match.groups(), 1):
                        print(f"    Group {i}: '{group}'")
    
                if match.groupdict():
                    print("  Named groups:")
                    for name, value in match.groupdict().items():
                        print(f"    {name}: '{value}'")
            else:
                print("✗ No match found")
    
            # Show all matches if there are multiple
            all_matches = compiled_pattern.findall(text)
            if len(all_matches) > 1:
                print(f"\nAll matches found: {all_matches}")
    
        except re.error as e:
            print(f"❌ Regex error: {e}")
            return False
    
        return True
    
    # Example debugging session
    debug_regex_step_by_step(
        r"(\w+)@(\w+)\.(\w+)",
        "Contact us at john@example.com or mary@test.org"
    )
    Python

    Common Error Messages and Solutions

    def demonstrate_common_errors():
        """Show common regex errors and their solutions"""
    
        common_errors = [
            {
                "error": "Unbalanced parenthesis",
                "bad_pattern": r"(\d+",
                "good_pattern": r"(\d+)",
                "description": "Always close parentheses"
            },
            {
                "error": "Invalid escape sequence",
                "bad_pattern": "\\d+\\s+\\w+",  # Python string
                "good_pattern": r"\d+\s+\w+",   # Raw string
                "description": "Use raw strings to avoid double escaping"
            },
            {
                "error": "Nothing to repeat",
                "bad_pattern": r"+\d+",
                "good_pattern": r"\d+",
                "description": "Quantifiers need something to quantify"
            }
        ]
    
        for error_info in common_errors:
            print(f"\n{error_info['error']}:")
            print(f"❌ Bad: {error_info['bad_pattern']}")
            print(f"✅ Good: {error_info['good_pattern']}")
            print(f"💡 Tip: {error_info['description']}")
    
    demonstrate_common_errors()
    Python
    graph TD
        A[Regex Debugging] --> B[Step-by-step Testing]
        A --> C[Interactive Pattern Building]
        A --> D[Error Analysis]
    
        B --> B1[Print intermediate results]
        B --> B2[Test with multiple inputs]
        B --> B3[Visualize matches]
    
        C --> C1[Start simple]
        C --> C2[Add complexity gradually]
        C --> C3[Test each iteration]
    
        D --> D1[Common syntax errors]
        D --> D2[Logic errors]
        D --> D3[Performance issues]

    11. Real-World Applications

    Email Validation

    import re
    from typing import Any, Dict
    
    def validate_email(email: str) -> Dict[str, Any]:
        """
        Comprehensive email validation with detailed feedback
    
        Args:
            email: Email address to validate
    
        Returns:
            Dictionary with validation result and details
        """
        if not isinstance(email, str):
            return {"valid": False, "error": "Input must be a string"}
    
        if not email:
            return {"valid": False, "error": "Email cannot be empty"}
    
        # Basic pattern for email validation
        pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    
        try:
            is_valid = bool(re.match(pattern, email))
    
            if is_valid:
                # Extract parts for additional validation
                local_part, domain = email.split('@')
                domain_parts = domain.split('.')
    
                return {
                    "valid": True,
                    "email": email,
                    "local_part": local_part,
                    "domain": domain,
                    "tld": domain_parts[-1],
                    "length": len(email)
                }
            else:
                # Provide specific error feedback
                errors = []
    
                if '@' not in email:
                    errors.append("Missing @ symbol")
                elif email.count('@') > 1:
                    errors.append("Multiple @ symbols")
                elif email.startswith('@'):
                    errors.append("Cannot start with @")
                elif email.endswith('@'):
                    errors.append("Cannot end with @")
                elif '.' not in email.split('@')[-1]:
                    errors.append("Domain must contain a dot")
                else:
                    errors.append("Invalid format")
    
                return {"valid": False, "error": "; ".join(errors)}
    
        except Exception as e:
            return {"valid": False, "error": f"Validation error: {str(e)}"}
    
    # Test cases with comprehensive feedback
    test_emails = [
        "user@example.com",           # Valid
        "test.email+tag@domain.org",  # Valid with plus
        "user.name@sub.domain.com",   # Valid with subdomain
        "invalid.email",              # Invalid - no @
        "@domain.com",               # Invalid - starts with @
        "user@",                     # Invalid - ends with @
        "user@@domain.com",          # Invalid - double @
        "user@domain",               # Invalid - no TLD
        "",                          # Invalid - empty
        123,                         # Invalid - not string
    ]
    
    print("Email Validation Results:")
    print("=" * 50)
    
    for email in test_emails:
        result = validate_email(email)
        status = "" if result["valid"] else ""
    
        if result["valid"]:
            print(f"{status} {email}")
            print(f"  Local: {result['local_part']}, Domain: {result['domain']}")
        else:
            print(f"{status} {email} - {result['error']}")
    Python

    Phone Number Extraction

    def extract_phone_numbers(text):
        # Matches various phone number formats
        pattern = r"(\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})"
        matches = re.finditer(pattern, text)
    
        phones = []
        for match in matches:
            phone = f"({match.group(2)}) {match.group(3)}-{match.group(4)}"
            phones.append(phone)
    
        return phones
    
    text = """
    Contact us at 123-456-7890 or (555) 123-4567.
    You can also reach us at +1 (800) 555-0123.
    """
    
    phones = extract_phone_numbers(text)
    for phone in phones:
        print(phone)
    Python

    Log File Analysis

    def parse_log_entry(log_line):
        # Common log format: IP - - [timestamp] "method url protocol" status size
        pattern = r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\]\s+"(\w+)\s+(.*?)\s+.*?"\s+(\d+)\s+(\d+)'
    
        match = re.search(pattern, log_line)
        if match:
            return {
                'ip': match.group(1),
                'timestamp': match.group(2),
                'method': match.group(3),
                'url': match.group(4),
                'status': int(match.group(5)),
                'size': int(match.group(6))
            }
        return None
    
    log = '192.168.1.1 - - [01/Jan/2024:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234'
    parsed = parse_log_entry(log)
    print(parsed)
    Python

    URL Extraction

    def extract_urls(text):
        pattern = r'https?://(?:[-\w.])+(?:[:\d]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:#(?:\w)*)?)?'
        return re.findall(pattern, text)
    
    text = """
    Visit our website at https://example.com or check out
    the documentation at https://docs.example.com/guide?lang=en#overview
    """
    
    urls = extract_urls(text)
    for url in urls:
        print(url)
    Python

    Web Scraping Applications

    import re
    import requests
    
    def scrape_product_info(html_content):
        """Extract product information from e-commerce HTML"""
    
        patterns = {
            'title': r'<h1[^>]*class="[^"]*product-title[^"]*"[^>]*>(.*?)</h1>',
            'price': r'<span[^>]*class="[^"]*price[^"]*"[^>]*>\$?([\d,]+\.?\d*)</span>',
            'rating': r'<div[^>]*class="[^"]*rating[^"]*"[^>]*>.*?(\d+\.?\d*)\s*out of',
            'availability': r'<span[^>]*class="[^"]*stock[^"]*"[^>]*>(In Stock|Out of Stock)</span>'
        }
    
        product_info = {}
    
        for key, pattern in patterns.items():
            match = re.search(pattern, html_content, re.IGNORECASE | re.DOTALL)
            if match:
                product_info[key] = match.group(1).strip()
            else:
                product_info[key] = "Not found"
    
        return product_info
    
    # Example usage
    sample_html = """
    <div class="product-container">
        <h1 class="product-title">Wireless Bluetooth Headphones</h1>
        <span class="price-current">$79.99</span>
        <div class="rating-section">4.5 out of 5 stars</div>
        <span class="stock-status">In Stock</span>
    </div>
    """
    
    product = scrape_product_info(sample_html)
    print(product)
    Python

    Data Analysis and Processing

    import re
    import pandas as pd
    from collections import defaultdict
    
    def analyze_text_data(text_data):
        """Comprehensive text analysis using regex"""
    
        analysis = {
            'word_count': len(re.findall(r'\b\w+\b', text_data)),
            'sentence_count': len(re.findall(r'[.!?]+', text_data)),
            'paragraph_count': len(re.findall(r'\n\s*\n', text_data)) + 1,
        'emails': re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text_data),
            'phone_numbers': re.findall(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text_data),
            'urls': re.findall(r'https?://[^\s<>"\']+', text_data),
            'mentions': re.findall(r'@\w+', text_data),
            'hashtags': re.findall(r'#\w+', text_data),
            'numbers': re.findall(r'\b\d+\.?\d*\b', text_data),
            'dates': re.findall(r'\b\d{1,2}/\d{1,2}/\d{4}\b', text_data)
        }
    
        # Word frequency analysis
        words = re.findall(r'\b\w+\b', text_data.lower())
        word_freq = defaultdict(int)
        for word in words:
            if len(word) > 3:  # Only count words longer than 3 characters
                word_freq[word] += 1
    
        analysis['top_words'] = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]
    
        return analysis
    
    # Example usage
    sample_text = """
    Contact our team at support@example.com or call us at (555) 123-4567.
    Visit our website at https://example.com for more information.
    Follow us @example_company and use #ExampleProduct in your posts.
    Our sales increased by 25.5% on 12/15/2023. We now have 1000+ customers!
    """
    
    analysis_result = analyze_text_data(sample_text)
    for key, value in analysis_result.items():
        print(f"{key}: {value}")
    Python

    File Processing Applications

    import re
    from collections import defaultdict
    
    class LogAnalyzer:
        """Comprehensive log file analyzer using regex patterns"""
    
        def __init__(self):
            self.patterns = {
                'apache_access': r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\].*?"(\w+)\s+([^"]+).*?"\s+(\d+)\s+(\d+)',
                'error_log': r'\[(.*?)\]\s*\[(\w+)\]\s*(.*)',
                'application_log': r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+(.*)',
                'nginx_access': r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\].*?"([^"]+)"\s+(\d+)\s+(\d+)',
            }
    
        def parse_apache_access_log(self, log_line):
            """Parse Apache access log format"""
            match = re.search(self.patterns['apache_access'], log_line)
            if match:
                return {
                    'ip': match.group(1),
                    'timestamp': match.group(2),
                    'method': match.group(3),
                    'url': match.group(4),
                    'status': int(match.group(5)),
                    'size': int(match.group(6))
                }
            return None
    
        def analyze_log_file(self, file_path, log_type='apache_access'):
            """Analyze entire log file and generate statistics"""
    
            stats = {
                'total_requests': 0,
                'status_codes': defaultdict(int),
                'ips': defaultdict(int),
                'methods': defaultdict(int),
                'urls': defaultdict(int),
                'errors': []
            }
    
            try:
                with open(file_path, 'r') as f:
                    for line_num, line in enumerate(f, 1):
                        parsed = self.parse_apache_access_log(line.strip())
                        if parsed:
                            stats['total_requests'] += 1
                            stats['status_codes'][parsed['status']] += 1
                            stats['ips'][parsed['ip']] += 1
                            stats['methods'][parsed['method']] += 1
                            stats['urls'][parsed['url']] += 1
    
                            if parsed['status'] >= 400:
                                stats['errors'].append({
                                    'line': line_num,
                                    'status': parsed['status'],
                                    'url': parsed['url'],
                                    'ip': parsed['ip']
                                })
    
            except FileNotFoundError:
                print(f"File {file_path} not found")
                return None
    
            # Convert to regular dicts and get top entries
            for key in ['status_codes', 'ips', 'methods', 'urls']:
                stats[key] = dict(sorted(stats[key].items(), key=lambda x: x[1], reverse=True)[:10])
    
            return stats
    
    # Text cleaning and normalization
    def clean_and_normalize_text(text):
        """Clean and normalize text data using regex"""
    
        # Remove HTML tags
        text = re.sub(r'<[^>]+>', '', text)
    
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text)
    
        # Remove special characters but keep basic punctuation
        text = re.sub(r'[^\w\s\.\,\!\?\-]', '', text)
    
        # Normalize multiple punctuation marks
        text = re.sub(r'[\.]{2,}', '...', text)
        text = re.sub(r'[!]{2,}', '!', text)
        text = re.sub(r'[?]{2,}', '?', text)
    
        # Fix spacing around punctuation
        text = re.sub(r'\s*([,.!?])\s*', r'\1 ', text)
    
        # Remove leading/trailing whitespace
        text = text.strip()
    
        return text
    
    # CSV data extraction from text
    def extract_csv_like_data(text):
        """Extract structured data that looks like CSV from text"""
    
        # Pattern for CSV-like data (comma-separated values)
        csv_pattern = r'^[^,\n]+(?:,[^,\n]+)+$'
    
        lines = text.split('\n')
        csv_lines = []
    
        for line in lines:
            if re.match(csv_pattern, line.strip()):
                csv_lines.append(line.strip())
    
        return csv_lines
    
    # Example usage
    sample_dirty_text = """
    <html><body>This is some  messy    text!!! 
    It has <b>HTML tags</b> and irregular   spacing...
    Contact: user@example.com, phone: 555-123-4567.
    Data: John,25,Engineer
          Jane,30,Designer  
          Bob,28,Developer
    </body></html>
    """
    
    cleaned_text = clean_and_normalize_text(sample_dirty_text)
    print("Cleaned text:", cleaned_text)
    
    csv_data = extract_csv_like_data(sample_dirty_text)
    print("Extracted CSV data:", csv_data)
    Python
    graph TD
        A[Real-World Applications] --> B[Email Validation]
        A --> C[Phone Numbers]
        A --> D[Log Parsing]
        A --> E[URL Extraction]
        A --> F[Data Cleaning]
    
        B --> B1[Format Verification]
        C --> C1[Various Formats]
        D --> D1[Structure Extraction]
        E --> E1[Link Discovery]
        F --> F1[Text Normalization]

    12. Performance and Optimization

    Compiling Patterns

    import re
    import time
    
    # Inefficient - recompiles pattern each time
    def slow_search(texts):
        pattern = r"\b\w+@\w+\.\w+\b"
        results = []
        for text in texts:
            matches = re.findall(pattern, text)
            results.extend(matches)
        return results
    
    # Efficient - compile once, use many times
    def fast_search(texts):
        pattern = re.compile(r"\b\w+@\w+\.\w+\b")
        results = []
        for text in texts:
            matches = pattern.findall(text)
            results.extend(matches)
        return results
    
    # Test with large dataset
    texts = ["Contact user@example.com for info"] * 10000
    
    # Time the approaches
    start = time.time()
    slow_results = slow_search(texts)
    slow_time = time.time() - start
    
    start = time.time()
    fast_results = fast_search(texts)
    fast_time = time.time() - start
    
    print(f"Slow approach: {slow_time:.4f} seconds")
    print(f"Fast approach: {fast_time:.4f} seconds")
    print(f"Speed improvement: {slow_time / fast_time:.2f}x")
    Python

    Optimizing Patterns

    # Inefficient - nested quantifiers cause catastrophic backtracking
    bad_pattern = r"(a+)+b"
    
    # Efficient - an equivalent pattern without the nested quantifier
    good_pattern = r"a+b"
    
    # Use specific character classes instead of .
    # Bad
    pattern = r".*@.*\..*"
    # Good
    pattern = r"[^@]+@[^.]+\.[a-zA-Z]+"
    
    # Anchor patterns when possible
    # Unanchored (slower)
    pattern = r"\d{3}-\d{3}-\d{4}"
    # Anchored (faster)
    pattern = r"^\d{3}-\d{3}-\d{4}$"
    Python
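
    A small timing sketch (exact numbers depend on your machine) shows why the nested quantifier is dangerous: when the match is doomed to fail, every extra character roughly doubles the work:

    import re
    import time
    
    # (a+)+b must try exponentially many ways to split the a's before giving up
    for n in (18, 20, 22):
        text = "a" * n + "X"  # no 'b', so the search is forced to fail
        start = time.time()
        re.search(r"(a+)+b", text)
        print(f"n={n}: {time.time() - start:.3f} seconds")
    
    # The flat pattern fails instantly, even on a much longer input
    start = time.time()
    re.search(r"a+b", "a" * 10000 + "X")
    print(f"a+b on 10,000 characters: {time.time() - start:.6f} seconds")
    Python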

    Performance Tips

    graph TB
        A["Regex Performance"] --> B["Compile Patterns"]
        A --> C[Avoid Backtracking]
        A --> D[Use Specific Classes]
        A --> E[Anchor Patterns]
    
        B --> B1[re.compile for reuse]
        C --> C1[Avoid nested quantifiers]
        D --> D1["[^@]+ instead of .*"]
        E --> E1[^ and $ when appropriate]

    13. Common Pitfalls and Best Practices

    Common Mistakes

    1. Forgetting Raw Strings

    # Wrong - need to escape backslashes
    pattern = "\\d+\\.\\d+"
    
    # Right - use raw strings
    pattern = r"\d+\.\d+"
    Python

    2. Greedy vs Non-Greedy

    html = "<div>Hello</div><div>World</div>"
    
    # Wrong - matches entire string
    pattern = r"<div>.*</div>"
    match = re.search(pattern, html)
    print(match.group())  # <div>Hello</div><div>World</div>
    
    # Right - non-greedy matching
    pattern = r"<div>.*?</div>"
    matches = re.findall(pattern, html)
    print(matches)  # ['<div>Hello</div>', '<div>World</div>']
    Python

    3. Not Escaping Special Characters

    # Wrong - . matches any character
    pattern = r"3.14"
    text = "3X14"
    print(bool(re.search(pattern, text)))  # True (unexpected)
    
    # Right - escape the literal dot
    pattern = r"3\.14"
    text = "3X14"
    print(bool(re.search(pattern, text)))  # False
    Python

    Best Practices

    1. Use Verbose Mode for Complex Patterns

    # Complex pattern - hard to read
    pattern = r"^(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<number>\d{4})$"
    
    # Same pattern with verbose mode - much clearer
    pattern = re.compile(r"""
        ^                   # Start of string
        (?P<area>\d{3})     # Area code (3 digits)
        -                   # Literal hyphen
        (?P<exchange>\d{3}) # Exchange (3 digits)
        -                   # Literal hyphen
        (?P<number>\d{4})   # Number (4 digits)
        $                   # End of string
    """, re.VERBOSE)
    Python

    2. Validate Input Before Processing

    def safe_regex_search(pattern, text):
        if not isinstance(text, str):
            return None
    
        try:
            compiled_pattern = re.compile(pattern)
            return compiled_pattern.search(text)
        except re.error as e:
            print(f"Invalid regex pattern: {e}")
            return None
    Python

    3. Use Appropriate Methods

    # For validation - use match()
    def is_valid_email(email):
        pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
        return bool(re.match(pattern, email))
    
    # For finding - use search() or findall()
    def extract_emails(text):
        pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
        return re.findall(pattern, text)
    
    # For replacement - use sub()
    def mask_emails(text):
        pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
        return re.sub(pattern, "***@***.***", text)
    Python

    Testing Regex Patterns

    import unittest
    
    class TestEmailRegex(unittest.TestCase):
        def setUp(self):
            self.email_pattern = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
    
        def test_valid_emails(self):
            valid_emails = [
                "user@example.com",
                "test.email@domain.org",
                "user+tag@example.co.uk"
            ]
    
            for email in valid_emails:
                with self.subTest(email=email):
                    self.assertTrue(self.email_pattern.match(email))
    
        def test_invalid_emails(self):
            invalid_emails = [
                "invalid.email",
                "@domain.com",
                "user@",
                "user space@domain.com"
            ]
    
            for email in invalid_emails:
                with self.subTest(email=email):
                    self.assertFalse(self.email_pattern.match(email))
    
    # Run tests
    if __name__ == "__main__":
        unittest.main()
    Python
    graph TD
        A[Best Practices] --> B[Use Raw Strings]
        A --> C[Test Thoroughly]
        A --> D[Handle Errors]
        A --> E[Document Complex Patterns]
        A --> F[Choose Right Method]
    
        B --> B1[Avoid escape issues]
        C --> C1[Unit test patterns]
        D --> D1[Catch re.error]
        E --> E1[Use re.VERBOSE]
        F --> F1[match vs search vs findall]

    14. Comprehensive Regex Cheat Sheet

    Quick Reference Patterns

    # BASIC PATTERNS
    basic_patterns = {
        'literal': r'hello',           # Matches "hello" exactly
        'case_insensitive': r'(?i)hello',  # Matches "Hello", "HELLO", etc.
        'any_char': r'h.llo',          # Matches "hello", "hallo", "h3llo"
        'optional': r'colou?r',        # Matches "color" or "colour"
    }
    
    # CHARACTER CLASSES
    character_classes = {
        'digit': r'\d',                # [0-9]
        'non_digit': r'\D',            # [^0-9]
        'word': r'\w',                 # [a-zA-Z0-9_]
        'non_word': r'\W',             # [^a-zA-Z0-9_]
        'whitespace': r'\s',           # [ \t\n\r\f\v]
        'non_whitespace': r'\S',       # [^ \t\n\r\f\v]
        'custom_class': r'[aeiou]',    # Vowels only
        'range': r'[a-z]',             # Lowercase letters
        'negated': r'[^0-9]',          # Not digits
    }
    
    # QUANTIFIERS
    quantifiers = {
        'zero_or_more': r'a*',         # "", "a", "aa", "aaa"
        'one_or_more': r'a+',          # "a", "aa", "aaa"
        'zero_or_one': r'a?',          # "", "a"
        'exactly_n': r'a{3}',          # "aaa"
        'n_or_more': r'a{3,}',         # "aaa", "aaaa", "aaaaa"
        'between_n_m': r'a{2,4}',      # "aa", "aaa", "aaaa"
        'non_greedy': r'a+?',          # Non-greedy one or more
    }
    
    # ANCHORS AND BOUNDARIES
    anchors = {
        'start_of_string': r'^hello',  # Must start with "hello"
        'end_of_string': r'world$',    # Must end with "world"
        'word_boundary': r'\bhello\b', # "hello" as whole word
        'non_word_boundary': r'\Bhello\B', # "hello" inside word
        'start_of_line': r'(?m)^hello',    # Start of any line
        'end_of_line': r'(?m)world$',      # End of any line
    }
    
    # GROUPS AND CAPTURING
    groups = {
        'capturing': r'(hello)',       # Captures "hello"
        'non_capturing': r'(?:hello)', # Groups but doesn't capture
        'named_group': r'(?P<greeting>hello)', # Named capture
        'backreference': r'(\w+) \1',  # Matches repeated words
        'conditional': r'(a)?(?(1)b|c)', # Complex conditional
    }
    
    # LOOKAROUNDS
    lookarounds = {
        'positive_lookahead': r'hello(?= world)',    # "hello" followed by " world"
        'negative_lookahead': r'hello(?! world)',    # "hello" NOT followed by " world"
        'positive_lookbehind': r'(?<=say )hello',    # "hello" preceded by "say "
        'negative_lookbehind': r'(?<!say )hello',    # "hello" NOT preceded by "say "
    }
    Python

    Common Use Case Patterns

    # EMAIL VALIDATION
    email_patterns = {
        'basic': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'strict': r'^[a-zA-Z0-9.!#$%&\'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$'
    }
    
    # PHONE NUMBERS
    phone_patterns = {
        'us_simple': r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',
        'us_with_country': r'(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',
        'international': r'^\+?[1-9]\d{1,14}$'
    }
    
    # URLS
    url_patterns = {
        'http_https': r'https?://(?:[-\w.])+(?:[:\d]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:#(?:\w)*)?)?',
        'with_subdomains': r'https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&=]*)'
    }
    
    # DATES
    date_patterns = {
        'mm_dd_yyyy': r'\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/\d{4}\b',
        'yyyy_mm_dd': r'\b\d{4}-(0?[1-9]|1[0-2])-(0?[1-9]|[12]\d|3[01])\b',
        'dd_mm_yyyy': r'\b(0?[1-9]|[12]\d|3[01])/(0?[1-9]|1[0-2])/\d{4}\b'
    }
    
    # IP ADDRESSES
    ip_patterns = {
        'ipv4': r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b',
        'ipv4_strict': r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'
    }
    
    # CREDIT CARDS
    credit_card_patterns = {
        'visa': r'4[0-9]{12}(?:[0-9]{3})?',
        'mastercard': r'5[1-5][0-9]{14}',
        'amex': r'3[47][0-9]{13}',
        'any': r'(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|3[0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})'
    }
    Python
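
    A quick sanity check of two of the patterns above (run in the same scope as the ip_patterns dictionary) shows why the stricter IPv4 pattern is worth its extra length:

    import re
    
    for ip in ["192.168.1.1", "999.1.1.1"]:
        simple = bool(re.search(ip_patterns['ipv4'], ip))
        strict = bool(re.search(ip_patterns['ipv4_strict'], ip))
        print(f"{ip}: simple={simple}, strict={strict}")
    # 192.168.1.1: simple=True, strict=True
    # 999.1.1.1: simple=True, strict=False
    Python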

    Method Comparison Table

    Method          Purpose            Returns                     Use When
    re.match()      Match from start   Match object or None        Validating entire string
    re.search()     Find first match   Match object or None        Finding one occurrence
    re.findall()    Find all matches   List of strings             Getting all matches as list
    re.finditer()   Find all matches   Iterator of Match objects   Need match details for all
    re.sub()        Replace matches    Modified string             Text replacement
    re.subn()       Replace matches    (string, count) tuple       Replacement + count needed
    re.split()      Split by pattern   List of strings             Splitting text
    re.compile()    Compile pattern    Pattern object              Reusing same pattern
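
    re.subn() is the one method in the table that has not appeared earlier; it behaves like re.sub() but also reports how many replacements were made:

    import re
    
    text = "Call 123-456-7890 or 987-654-3210"
    masked, count = re.subn(r"\d{3}-\d{3}-\d{4}", "[redacted]", text)
    print(masked)  # Output: Call [redacted] or [redacted]
    print(count)   # Output: 2
    Python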

    Flag Options

    import re
    
    # COMMON FLAGS
    flags_demo = {
        'IGNORECASE': re.IGNORECASE,    # Case-insensitive matching
        'MULTILINE': re.MULTILINE,      # ^ and $ match line boundaries
        'DOTALL': re.DOTALL,           # . matches newlines too
        'VERBOSE': re.VERBOSE,          # Allow comments and whitespace
        'ASCII': re.ASCII,              # Make \w, \W, \b, \B ASCII-only
        'LOCALE': re.LOCALE,            # Make \w, \W, \b, \B locale-aware
        'UNICODE': re.UNICODE,          # Make \w, \W, \b, \B Unicode-aware
    }
    
    # COMBINING FLAGS
    combined_flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
    
    # INLINE FLAGS
    inline_flags = {
        'case_insensitive': r'(?i)pattern',
        'multiline': r'(?m)pattern',
        'dotall': r'(?s)pattern',
        'verbose': r'(?x)pattern',
        'multiple': r'(?ims)pattern',  # Combined flags
    }
    Python
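
    A short demonstration of combining flags, here case-insensitive matching together with multiline anchors:

    import re
    
    log = "ERROR: disk full\nwarning: low memory\nerror: timeout"
    
    # ^error.*$ matches whole lines that start with 'error', in any case
    matches = re.findall(r"^error.*$", log, re.IGNORECASE | re.MULTILINE)
    print(matches)  # Output: ['ERROR: disk full', 'error: timeout']
    Python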
    graph TB
        A[Regex Cheat Sheet] --> B[Basic Patterns]
        A --> C[Character Classes]
        A --> D[Quantifiers]
        A --> E[Common Use Cases]
    
        B --> B1[Literal matching]
        B --> B2[Case options]
        B --> B3[Any character]
    
        C --> C1[Built-in classes]
        C --> C2[Custom classes]
        C --> C3[Negated classes]
    
        D --> D1[Basic quantifiers]
        D --> D2[Exact counts]
        D --> D3[Greedy vs lazy]
    
        E --> E1[Email validation]
        E --> E2[Phone numbers]
        E --> E3[URLs & IPs]
    
        style A fill:#e1f5fe
        style E fill:#c8e6c9

    15. Practice Exercises

    Beginner Exercises

    # Exercise 1: Basic Pattern Matching
    def exercise_1():
        """Find all words that start with 'th' (case insensitive)"""
        text = "The quick brown fox thinks that this is the best thing."
        # Your pattern here:
        pattern = r"th\w*"
        # Solution: re.findall(pattern, text, re.IGNORECASE)
    
    # Exercise 2: Number Extraction
    def exercise_2():
        """Extract all numbers (integers and decimals) from text"""
        text = "The price is $29.99 and we have 15 items in stock. Call 555-1234."
        # Your pattern here:
        pattern = r"\d+\.?\d*"
        # Test your solution
    
    # Exercise 3: Email Finding
    def exercise_3():
        """Find all email addresses in the text"""
        text = "Contact john@example.com or mary.smith@company.org for more info."
        # Your pattern here:
        pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b"
        # Test your solution
    Python

    Intermediate Challenges

    # Challenge 1: Phone Number Standardization
    def challenge_1():
        """Convert various phone formats to (XXX) XXX-XXXX"""
        phones = [
            "123-456-7890",
            "(555) 123-4567", 
            "555.123.4567",
            "5551234567"
        ]
        # Create a function to standardize all formats
    
        def standardize_phone(phone):
            pattern = r"(\(?\d{3}\)?[-.\s]?)(\d{3}[-.\s]?)(\d{4})"
            match = re.search(pattern, phone)
            if match:
                return f"({match.group(1)}) {match.group(2)}-{match.group(3)}"
            return None
    
    # Challenge 2: Log Parser
    def challenge_2():
        """Parse web server logs to extract useful information"""
        log_entry = '192.168.1.100 - - [10/Oct/2023:13:55:36 +0000] "GET /api/users HTTP/1.1" 200 2326'
    
        # Extract: IP, timestamp, method, endpoint, status code, response size
        pattern = r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\].*?"(\w+)\s+([^"]+).*?"\s+(\d+)\s+(\d+)'
        # Complete the parser function
    
    # Challenge 3: HTML Tag Extractor
    def challenge_3():
        """Extract all HTML tags and their attributes"""
        html = '<div class="container" id="main"><p>Hello</p><a href="http://example.com">Link</a></div>'
    
        # Extract tag name and attributes separately
        tag_pattern = r'<(\w+)([^>]*)>'
        # Complete the extraction logic
    Python

    Advanced Problems

    # Advanced 1: Nested Structure Parser
    def advanced_1():
        """Parse nested parentheses and extract content at each level"""
        text = "((hello (world) test) (foo bar))"
    
        # This requires recursive or advanced techniques
        # Hint: You might need to use a stack-based approach or recursive regex
    
    # Advanced 2: Template Variable Extractor
    def advanced_2():
        """Extract template variables like {{variable}} and {{variable|filter}}"""
        template = "Hello {{name}}, your balance is {{account.balance|currency}}."
    
        # Extract variable names and filters
        pattern = r'\{\{\s*([^|\}]+)(?:\|([^}]+))?\s*\}\}'
        # Complete the extraction and parsing
    
    # Advanced 3: SQL Query Parser
    def advanced_3():
        """Parse SQL SELECT statements to extract tables and columns"""
        sql = """
        SELECT users.name, orders.total, products.title 
        FROM users 
        JOIN orders ON users.id = orders.user_id 
        JOIN products ON orders.product_id = products.id 
        WHERE users.active = 1
        """
    
        # Extract table names, column names, and JOIN conditions
        # This is a complex parsing challenge
    Python
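
    For Advanced 1, note that Python's built-in re module has no recursion operator (the third-party regex package does), so a small stack-based scanner is one reasonable approach; a sketch along those lines:

    def extract_nested(text):
        """Return (depth, content) pairs for every balanced pair of parentheses."""
        stack, results = [], []
        for i, ch in enumerate(text):
            if ch == "(":
                stack.append(i)  # remember where this group opened
            elif ch == ")" and stack:
                start = stack.pop()
                results.append((len(stack) + 1, text[start + 1:i]))
        return results
    
    print(extract_nested("((hello (world) test) (foo bar))"))
    # [(3, 'world'), (2, 'hello (world) test'), (2, 'foo bar'), (1, '(hello (world) test) (foo bar)')]
    Python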

    Exercise Solutions and Explanations

    def show_solutions():
        """Detailed solutions with explanations"""
    
        solutions = {
            "Beginner 1": {
                "pattern": r"th\w*",
                "flags": "re.IGNORECASE",
                "explanation": "th matches literal 'th', \\w* matches zero or more word characters"
            },
            "Beginner 2": {
                "pattern": r"\d+\.?\d*",
                "explanation": "\\d+ matches digits, \\.? optional decimal point, \\d* optional decimal digits"
            },
            "Intermediate 1": {
                "approach": "Capture groups for area code, exchange, and number",
                "pattern": r"(\(?\d{3}\)?[-.\s]?)(\d{3}[-.\s]?)(\d{4})",
                "replacement": r"(\1) \2-\3"
            }
        }
    
        for exercise, solution in solutions.items():
            print(f"\n{exercise}:")
            for key, value in solution.items():
                print(f"  {key}: {value}")
    
    # Interactive practice function
    def practice_regex():
        """Interactive regex practice session"""
    
        exercises = [
            {
                "description": "Find all words ending in 'ing'",
                "text": "Running and jumping are fun activities.",
                "expected": ["Running", "jumping"]
            },
            {
                "description": "Extract all hashtags from social media text",
                "text": "Love this weather! #sunny #beautiful #weekend",
                "expected": ["#sunny", "#beautiful", "#weekend"]
            }
        ]
    
        for i, exercise in enumerate(exercises, 1):
            print(f"\nExercise {i}: {exercise['description']}")
            print(f"Text: {exercise['text']}")
            print(f"Expected: {exercise['expected']}")
    
            # Student would input their pattern here
            # pattern = input("Enter your regex pattern: ")
            # result = re.findall(pattern, exercise['text'])
            # print(f"Your result: {result}")
    Python
    graph TB
        A[Practice Exercises] --> B[Beginner Level]
        A --> C[Intermediate Level]
        A --> D[Advanced Level]
    
        B --> B1[Basic matching]
        B --> B2[Character classes]
        B --> B3[Simple quantifiers]
    
        C --> C1[Complex patterns]
        C --> C2[Data parsing]
        C --> C3[Text processing]
    
        D --> D1[Nested structures]
        D --> D2[Template parsing]
        D --> D3[Language parsing]
    
        style A fill:#e1f5fe
        style B fill:#e8f5e8
        style C fill:#fff3e0
        style D fill:#ffebee

    Conclusion

    Regular expressions are a powerful tool for text processing in Python. This comprehensive guide has taken you from beginner concepts to expert-level techniques.

    Key Learning Outcomes

    By completing this guide, you should be able to:

    1. Understand Regex Fundamentals: Basic patterns, character classes, and quantifiers
    2. Apply Advanced Techniques: Lookarounds, groups, and complex pattern matching
    3. Debug Regex Patterns: Use systematic approaches to troubleshoot issues
    4. Optimize Performance: Write efficient patterns and avoid common pitfalls
    5. Solve Real-World Problems: Apply regex to practical text processing tasks

    Best Practices Summary

    graph TB
        A[Regex Best Practices] --> B[Development]
        A --> C[Testing]
        A --> D[Performance]
        A --> E[Maintenance]
    
        B --> B1[Start simple, add complexity]
        B --> B2[Use raw strings r'']
        B --> B3[Comment complex patterns]
    
        C --> C1[Test edge cases]
        C --> C2[Validate with real data]
        C --> C3[Use unit tests]
    
        D --> D1[Compile patterns for reuse]
        D --> D2[Avoid catastrophic backtracking]
        D --> D3[Use specific character classes]
    
        E --> E1[Document pattern purpose]
        E --> E2[Use meaningful variable names]
        E --> E3[Keep patterns readable]

    Essential Quick Reference

    import re
    
    # Most commonly used patterns
    essential_patterns = {
        'email': r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        'phone_us': r"\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})",
        'url': r"https?://(?:[-\w.])+(?:[:\d]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:#(?:\w)*)?)?",
        'ipv4': r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b",
        'date_mdy': r"\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/\d{4}\b",
        'number': r"-?\d+\.?\d*",
        'word': r"\b\w+\b"
    }
    
    # Essential methods
    re.search(pattern, text)      # Find first match
    re.findall(pattern, text)     # Find all matches
    re.finditer(pattern, text)    # Iterator of match objects
    re.sub(pattern, repl, text)   # Replace matches
    re.split(pattern, text)       # Split by pattern
    re.compile(pattern)           # Compile for reuse
    Python

    Your Next Steps

    1. Practice Daily: Work with regex patterns regularly
    2. Build a Pattern Library: Save useful patterns you create
    3. Join Communities: Engage with other regex learners
    4. Contribute: Help others and share your knowledge
    5. Stay Updated: Follow regex developments

    When to Use Regex (and When Not To)

    ✅ Good for:

    • Pattern matching and validation
    • Text extraction and parsing
    • Find and replace operations
    • Data cleaning and preprocessing

    ❌ Consider alternatives for:

    • Complex parsing (use dedicated parsers)
    • Simple string operations (use str methods; see the sketch below)
    • Structured data (JSON, XML, CSV libraries)
    • Performance-critical code without optimization
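
    As a small illustration of the “simple string operations” point above, plain str methods are often clearer and faster than a regex, which only earns its keep once a real pattern is involved:

    filename = "report_2024.csv"
    
    print(filename.endswith(".csv"))      # True - no regex needed
    print(filename.startswith("report"))  # True
    print("2024" in filename)             # True
    
    # A regex becomes useful once the check involves an actual pattern
    import re
    print(bool(re.search(r"report_\d{4}\.csv$", filename)))  # True
    Python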

    Happy regex coding! 🚀


    *“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” – Jamie Zawinski*

    While humorous, this quote reminds us to use regex thoughtfully. This guide teaches you to be one of the developers who wields regex effectively and responsibly.

