Polars: From Beginner to Expert

    Master the fastest DataFrame library in Python with practical examples, advanced techniques, and real-world applications

    About This Book

    This comprehensive guide will take you from complete beginner to Polars expert through hands-on examples, practical exercises, and real-world case studies. Each chapter builds upon previous concepts while introducing new techniques and best practices.

    Who This Book Is For

    • Data Analysts looking for faster data processing
    • Data Scientists wanting to optimize their workflows
    • Python Developers seeking performant data manipulation tools
    • Pandas Users ready to upgrade to next-generation DataFrame operations

    Prerequisites

    • Basic Python programming knowledge
    • Familiarity with data concepts (optional but helpful)
    • No prior Polars experience required

    Table of Contents

    Part I: Foundations

    1. Why Polars? Understanding the Revolution
    2. Getting Started: Installation and Setup
    3. Core Concepts: DataFrames, Series, and Expressions
    4. Your First Polars Program

    Part II: Essential Operations

    1. Loading and Creating Data
    2. Data Inspection and Exploration
    3. Selecting and Filtering Data
    4. Data Transformation Fundamentals

    Part III: Intermediate Techniques

    1. Aggregations and Grouping
    2. Joining and Combining DataFrames
    3. Working with Strings and Dates
    4. Time Series Analysis

    Part IV: Advanced Features

    1. Lazy Evaluation and Query Optimization
    2. Custom Functions and Complex Operations
    3. Working with Nested Data (Lists and Structs)
    4. Window Functions and Advanced Analytics

    Part V: Performance and Production

    1. Performance Optimization Strategies
    2. Memory Management and Large Datasets
    3. Parallel Processing and Streaming

    Part VI: Real-World Applications

    1. Business Analytics Case Study
    2. Financial Data Analysis
    3. IoT and Sensor Data Processing
    4. Log Analysis and Monitoring

    Part VII: Migration and Best Practices

    1. Migrating from Pandas
    2. Testing and Debugging
    3. Production Deployment
    4. Best Practices and Style Guide

    Appendices


    1. Why Polars? Understanding the Revolution

    🎯 Learning Objectives

    By the end of this chapter, you will:

    • Understand what makes Polars revolutionary in data processing
    • Know when to choose Polars over other libraries
    • Grasp the architectural differences between Polars and Pandas
    • Appreciate the key advantages that drive Polars’ performance

    The Data Processing Challenge

    In today’s data-driven world, we’re dealing with increasingly large datasets. Traditional tools like Pandas, while excellent for small to medium datasets, often struggle with:

    • Performance bottlenecks on large datasets
    • Memory limitations that cause crashes
    • Single-threaded execution that doesn’t utilize modern hardware
    • Eager evaluation that processes unnecessary data

    Enter Polars: The Game Changer

    Polars is a revolutionary DataFrame library that addresses these challenges head-on. Built from the ground up in Rust with Python bindings, it represents the next generation of data processing tools.

    What Makes Polars Special?

    # A simple example showing Polars' power
    import polars as pl
    import time
    
    # Create a large dataset
    df = pl.DataFrame({
        "id": range(1_000_000),
        "value": range(1_000_000),
        "category": ["A", "B", "C"] * 333_334
    })
    
    # Lightning-fast operations
    start = time.time()
    result = (
        df
        .filter(pl.col("value") > 500_000)
        .group_by("category")
        .agg(pl.col("value").sum())
    )
    print(f"Processed 1M rows in {time.time() - start:.3f} seconds")
    Python

    Architecture Comparison: The Foundation of Speed

    graph TB
        subgraph "Traditional Approach (Pandas)"
            A[Python Code] --> B[Pandas API]
            B --> C[NumPy/C Backend]
            C --> D[Single Thread Execution]
            D --> E[Eager Evaluation]
            E --> F[Result]
        end
    
        subgraph "Modern Approach (Polars)"
            G[Python Code] --> H[Polars API]  
            H --> I[Rust Core Engine]
            I --> J[Multi-threaded Execution]
            J --> K[Lazy Evaluation]
            K --> L[Query Optimization]
            L --> M[SIMD Instructions]
            M --> N[Result]
        end
    
        style I fill:#f96
        style J fill:#9f6
        style K fill:#69f
        style L fill:#f69

    Key Advantages Explained

    1. Blazing Fast Performance

    • Rust Implementation: Native speed without Python overhead
    • SIMD Optimizations: Utilizes modern CPU instructions
    • Parallel Processing: Automatic multi-threading
    • Benchmark: Often 5-30x faster than Pandas

    2. Memory Efficiency

    • Columnar Storage: Optimized memory layout
    • Zero-Copy Operations: Minimal memory allocations
    • Streaming Support: Process datasets larger than RAM (see the sketch after this list)
    • Smart Memory Management: Rust’s ownership model prevents leaks
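
    The streaming support mentioned above can be sketched as follows. This is a minimal example that assumes a large file named "huge_file.csv" exists on disk; the exact streaming flag may differ slightly between Polars versions.

    import polars as pl
    
    # Build a lazy query; nothing is read into memory yet
    lazy = (
        pl.scan_csv("huge_file.csv")
        .group_by("category")
        .agg(pl.col("value").sum())
    )
    
    # The streaming engine processes the file in chunks, so the whole
    # dataset never has to fit in RAM at once
    result = lazy.collect(streaming=True)
    Python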

    3. Lazy Evaluation with Query Optimization

    # This creates a plan, doesn't execute immediately
    lazy_query = (
        pl.scan_csv("huge_file.csv")
        .filter(pl.col("date") > "2023-01-01")  # Pushed down to file reading
        .select(["id", "value"])                # Only read needed columns
        .group_by("id")
        .agg(pl.col("value").sum())
    )
    
    # Execute the optimized plan
    result = lazy_query.collect()
    Python

    4. Expressive and Intuitive API

    • Method Chaining: Natural, readable code
    • Rich Expression System: Powerful column operations
    • Type Safety: Catch errors at query planning time
    • Consistent Patterns: Learn once, apply everywhere (see the sketch below)
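
    As a small illustration of the points above, here is a hedged sketch of the expression style; the column names ("price", "quantity") are made up for the example.

    import polars as pl
    
    df = pl.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [3, 1, 2]})
    
    # Method chaining reads top-to-bottom like a query
    result = (
        df
        .with_columns((pl.col("price") * pl.col("quantity")).alias("revenue"))
        .filter(pl.col("revenue") > 25)
        .sort("revenue", descending=True)
    )
    print(result)
    Python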

    5. Modern Data Types and Features

    • Native Date/Time Support: Built-in temporal operations
    • Nested Data Types: Lists, structs, and more
    • String Processing: Optimized text operations
    • Categorical Data: Efficient handling of categories (see the example below)
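
    A short sketch showing several of these types in one DataFrame; the column names and values are illustrative only.

    import polars as pl
    from datetime import date
    
    df = pl.DataFrame({
        "day": [date(2023, 1, 1), date(2023, 1, 2)],    # native Date type
        "tags": [["new", "sale"], ["clearance"]],       # nested List column
        "region": ["North", "North"],
    }).with_columns(
        pl.col("region").cast(pl.Categorical),          # efficient repeated strings
        pl.col("day").dt.weekday().alias("weekday"),    # built-in temporal operations
    )
    print(df.schema)
    Python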

    When to Choose Polars

    ✅ Choose Polars When:

    • Working with datasets > 100MB
    • Performance is critical
    • You need out-of-core processing
    • Building production data pipelines
    • Dealing with time series data
    • Complex aggregations and joins

    ⚠️ Consider Alternatives When:

    • Dataset is very small (< 10MB)
    • Heavy integration with the pandas ecosystem is required (see the interop sketch after this list)
    • Team lacks Python/data processing experience
    • Quick prototyping with familiar tools
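
    Even when you stay close to the pandas ecosystem, moving data between the two libraries is straightforward. A minimal sketch (assumes the pandas and pyarrow extras are installed):

    import polars as pl
    import pandas as pd
    
    pdf = pd.DataFrame({"a": [1, 2, 3]})
    
    pl_df = pl.from_pandas(pdf)      # pandas -> Polars
    back = pl_df.to_pandas()         # Polars -> pandas
    
    print(type(pl_df), type(back))
    Python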

    Real-World Impact: A Quick Example

    Let’s see Polars in action with a practical scenario:

    import polars as pl
    import pandas as pd
    import numpy as np
    import time
    
    # Create sample sales data
    n_rows = 500_000
    data = {
        "date": pd.date_range("2020-01-01", periods=n_rows, freq="1H"),
        "product_id": np.random.randint(1, 1000, n_rows),
        "sales": np.random.normal(100, 25, n_rows),
        "region": np.random.choice(["North", "South", "East", "West"], n_rows)
    }
    
    # Pandas approach
    df_pandas = pd.DataFrame(data)
    start = time.time()
    pandas_result = (
        df_pandas
        .groupby(["region", df_pandas["date"].dt.date])
        .agg({"sales": ["sum", "mean", "count"]})
    )
    pandas_time = time.time() - start
    
    # Polars approach  
    df_polars = pl.DataFrame(data)
    start = time.time()
    polars_result = (
        df_polars
        .group_by(["region", pl.col("date").dt.date()])
        .agg([
            pl.col("sales").sum().alias("sales_sum"),
            pl.col("sales").mean().alias("sales_mean"),
            pl.col("sales").count().alias("sales_count")
        ])
    )
    polars_time = time.time() - start
    
    print(f"Pandas: {pandas_time:.2f}s")
    print(f"Polars: {polars_time:.2f}s") 
    print(f"Speedup: {pandas_time/polars_time:.1f}x faster")
    Python

    Chapter Summary

    Polars represents a paradigm shift in data processing, offering:

    • Revolutionary performance through Rust and modern algorithms
    • Memory efficiency that handles larger datasets
    • Query optimization that makes your code faster automatically
    • Modern API design that’s both powerful and intuitive

    In the next chapter, we’ll get Polars installed and write our first program to experience this performance firsthand.

    🔗 What’s Next?

    Ready to experience the Polars advantage? Let’s get it installed and running on your system!


    2. Getting Started: Installation and Setup

    🎯 Learning Objectives

    By the end of this chapter, you will:

    • Successfully install Polars in different environments
    • Configure your development environment for optimal Polars usage
    • Understand different installation options and their use cases
    • Verify your installation with a simple test program

    Installation Options

    Basic Installation

    # Most common installation
    pip install polars
    
    # For conda users
    conda install -c conda-forge polars
    Bash

    Advanced Installation Options

    # With all optional dependencies (recommended for full functionality)
    pip install polars[all]
    
    # Specific feature sets (the exact extras vary by Polars version;
    # check the installation docs for the current list)
    pip install polars[pandas]    # Pandas interoperability
    pip install polars[numpy]     # NumPy integration
    pip install polars[pyarrow]   # Apache Arrow interoperability
    pip install polars[excel]     # Excel file support
    pip install polars[database]  # Database / SQL connectivity
    Bash

    Installation Verification

    import polars as pl
    import sys
    
    print(f"✅ Polars version: {pl.__version__}")
    print(f"🐍 Python version: {sys.version}")
    print(f"🏗️  Available features: {pl.show_versions()}")
    
    # Quick functionality test
    test_df = pl.DataFrame({"test": [1, 2, 3]})
    print(f"🧪 Test DataFrame created successfully: {test_df.shape[0]} rows")
    Python

    Development Environment Setup

    Jupyter Notebook Configuration

    # Put this in your notebook's first cell
    import polars as pl
    import numpy as np
    import pandas as pd
    from datetime import datetime, date, timedelta
    import warnings
    
    # Polars display configuration
    pl.Config.set_tbl_rows(15)           # Show 15 rows in output
    pl.Config.set_tbl_cols(8)            # Show 8 columns in output  
    pl.Config.set_fmt_str_lengths(50)    # Truncate strings at 50 chars
    pl.Config.set_tbl_width_chars(120)   # Table width in characters
    
    # Enable better error messages
    pl.Config.set_verbose(True)
    
    # Suppress pandas warnings when comparing
    warnings.filterwarnings('ignore', category=FutureWarning)
    
    print("🚀 Polars environment ready!")
    Python

    IDE Configuration (VS Code, PyCharm)

    # Create a polars_config.py file for consistent setup
    import polars as pl
    
    def setup_polars_environment():
        """Configure Polars for development environment"""
        # Display settings
        pl.Config.set_tbl_rows(20)
        pl.Config.set_tbl_cols(10)
        pl.Config.set_fmt_str_lengths(100)
    
        # Performance settings
        pl.Config.set_streaming_chunk_size(10000)
    
        # Polars sizes its thread pool from the POLARS_MAX_THREADS environment
        # variable, which must be set *before* polars is imported; there is no
        # Config setter for it. Here we simply report the active pool size.
        n_threads = pl.thread_pool_size()
    
        print(f"🔧 Polars thread pool: {n_threads} threads")
        print("📊 Display: 20 rows, 10 columns")
    
        return pl
    
    # Import and use in your scripts
    # from polars_config import setup_polars_environment
    # pl = setup_polars_environment()
    Python

    Your First Polars Program

    Let’s create a simple program to verify everything is working:

    import polars as pl
    import time
    
    def first_polars_program():
        """Your first Polars program - data creation and basic operations"""
    
        print("🎉 Welcome to your first Polars program!")
    
        # Create sample data
        df = pl.DataFrame({
            "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
            "age": [25, 30, 35, 28, 32],
            "city": ["New York", "London", "Tokyo", "Paris", "Sydney"],
            "salary": [70000, 80000, 90000, 75000, 85000]
        })
    
        print("\n📊 Original Data:")
        print(df)
    
        # Basic operations
        print("\n🔍 Basic Statistics:")
        print(df.describe())
    
        # Filtering example
        high_earners = df.filter(pl.col("salary") > 80000)
        print(f"\n💰 High earners (salary > $80,000): {high_earners.height} people")
        print(high_earners)
    
        # Grouping example
        city_stats = df.group_by("city").agg([
            pl.col("salary").mean().alias("avg_salary"),
            pl.col("age").mean().alias("avg_age")
        ])
        print("\n🌍 City Statistics:")
        print(city_stats)
    
        # Performance demonstration
        print("\n⚡ Performance Test:")
        large_df = pl.DataFrame({
            "id": range(100_000),
            "value": range(100_000)
        })
    
        start_time = time.time()
        result = large_df.filter(pl.col("value") > 50_000).sum()
        elapsed = time.time() - start_time
    
        print(f"Processed 100,000 rows in {elapsed:.4f} seconds")
        print("🎯 Setup complete! You're ready to learn Polars.")
    
    # Run the program
    if __name__ == "__main__":
        first_polars_program()
    Python

    Common Installation Issues and Solutions

    Issue 1: Import Error

    # Problem: ImportError: No module named 'polars'
    # Solution: Ensure you're in the correct Python environment
    python -m pip install polars
    
    # Or check your Python path
    python -c "import sys; print(sys.path)"
    Bash

    Issue 2: Version Conflicts

    # Problem: Dependency conflicts with other packages
    # Solution: Use virtual environment
    python -m venv polars_env
    source polars_env/bin/activate  # On Windows: polars_env\Scripts\activate
    pip install polars[all]
    Bash

    Issue 3: Performance Issues

    # Problem: Slow performance on large datasets
    # Solution: make sure Polars has enough threads. The pool size is taken from
    # the POLARS_MAX_THREADS environment variable, which must be set *before*
    # polars is imported (by default Polars already uses all available cores).
    import os
    os.environ["POLARS_MAX_THREADS"] = "8"  # e.g. cap or raise the thread count
    
    import polars as pl
    print(f"Thread pool size: {pl.thread_pool_size()}")  # verify the setting
    Python

    Environment Checklist

    Before proceeding to the next chapter, ensure:

    • ✅ Polars is installed and importable
    • ✅ Version is 0.19.0 or higher (recommended; a quick check script follows below)
    • ✅ Basic operations work correctly
    • ✅ Display configuration is set up
    • ✅ Your development environment (Jupyter/IDE) is configured
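
    The short script below is one minimal way to run through this checklist programmatically; the 0.19.0 threshold simply mirrors the recommendation above, and the parsing assumes a plain major.minor.patch version string.

    # Quick environment check (standard library + polars only)
    import polars as pl
    
    required = (0, 19, 0)
    installed = tuple(int(part) for part in pl.__version__.split(".")[:3])
    assert installed >= required, f"Polars {pl.__version__} is older than 0.19.0"
    
    # Basic operations smoke test
    df = pl.DataFrame({"x": [1, 2, 3]})
    assert df.filter(pl.col("x") > 1).height == 2
    
    print(f"✅ Polars {pl.__version__} passes the environment checklist")
    Python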

    Quick Reference Card

    # Essential imports for every Polars session
    import polars as pl
    import numpy as np
    from datetime import datetime, date
    
    # Basic configuration
    pl.Config.set_tbl_rows(20)
    pl.Config.set_tbl_cols(8)
    
    # Check installation
    print(f"Polars v{pl.__version__} ready!")
    
    # Create test DataFrame
    df = pl.DataFrame({"test": [1, 2, 3]})
    print(df)  # Should display a simple table
    Python

    Chapter Summary

    You now have:

    • Polars installed with the appropriate options for your needs
    • Environment configured for optimal development experience
    • First program running demonstrating basic Polars capabilities
    • Troubleshooting knowledge for common setup issues

    🔗 What’s Next?

    With Polars installed and configured, let’s dive into the core concepts that make Polars so powerful!


    3. Core Concepts: DataFrames, Series, and Expressions

    🎯 Learning Objectives

    By the end of this chapter, you will:

    • Understand the three core data structures in Polars
    • Know how to work with different data types effectively
    • Master the expression system that makes Polars powerful
    • Distinguish between eager and lazy evaluation

    The Three Pillars of Polars

    graph TD
        A[Polars Core] --> B[DataFrame]
        A --> C[Series]  
        A --> D[Expression]
    
        B --> E[Eager Evaluation]
        B --> F[LazyFrame]
        F --> G[Query Optimization]
    
        C --> H[Column Data]
        C --> I[Data Types]
    
        D --> J[Operations]
        D --> K[Transformations]
        D --> L[Aggregations]
    
        style A fill:#f96
        style B fill:#9f6
        style C fill:#69f
        style D fill:#f69

    1. DataFrame: Your Data Container

    A DataFrame is a 2D data structure with labeled columns, similar to a spreadsheet or SQL table.

    import polars as pl
    
    # Creating a DataFrame
    df = pl.DataFrame({
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "salary": [50000, 60000, 70000],
        "active": [True, False, True]
    })
    
    print("📊 Basic DataFrame:")
    print(df)
    
    # DataFrame properties
    print(f"Shape: {df.shape}")  # (rows, columns)
    print(f"Columns: {df.columns}")
    print(f"Data types: {df.dtypes}")
    Python

    DataFrame vs LazyFrame

    flowchart LR
        A[DataFrame] --> B[Eager Execution]
        B --> C[Immediate Results]
    
        D[LazyFrame] --> E[Query Building]
        E --> F[Optimization]
        F --> G[Execution]
        G --> H[Results]
    
        style A fill:#9f6
        style D fill:#69f

    # Eager evaluation (DataFrame)
    eager_result = df.filter(pl.col("age") > 25).select(["name", "salary"])
    print("Eager result:", eager_result)
    
    # Lazy evaluation (LazyFrame)
    lazy_query = (
        df.lazy()
        .filter(pl.col("age") > 25)
        .select(["name", "salary"])
    )
    print("Lazy query plan:")
    print(lazy_query.explain())  # Show the execution plan
    
    # Execute the lazy query
    lazy_result = lazy_query.collect()
    print("Lazy result:", lazy_result)
    Python

    2. Series: Individual Columns

    A Series represents a single column of data with a specific type.

    # Creating Series
    ages = pl.Series("ages", [25, 30, 35, 28])
    names = pl.Series("names", ["Alice", "Bob", "Charlie", "Diana"])
    
    print("📈 Series examples:")
    print(f"Ages: {ages}")
    print(f"Names: {names}")
    
    # Series operations
    print(f"Mean age: {ages.mean()}")
    print(f"Unique names: {names.n_unique()}")
    print(f"Age statistics:\n{ages.describe()}")
    Python

    3. Expressions: The Heart of Polars

    Expressions are the key to Polars’ power – they describe computations that can be applied to columns.

    # Basic expressions
    print("🔧 Expression examples:")
    
    # Column references
    age_expr = pl.col("age")
    salary_expr = pl.col("salary")
    
    # Computations
    annual_bonus = pl.col("salary") * 0.1
    senior_status = pl.col("age") > 30
    
    # Complex expressions
    salary_category = (
        pl.when(pl.col("salary") > 60000)
        .then(pl.lit("High"))
        .when(pl.col("salary") > 40000)
        .then(pl.lit("Medium"))
        .otherwise(pl.lit("Low"))
    )
    
    # Apply expressions
    result = df.with_columns([
        annual_bonus.alias("bonus"),
        senior_status.alias("is_senior"),
        salary_category.alias("salary_category")
    ])
    
    print(result)
    Python

    Data Types Deep Dive

    graph TD
        A[Polars Data Types] --> B[Numeric]
        A --> C[Text]
        A --> D[Temporal]
        A --> E[Complex]
        A --> F[Special]
    
        B --> B1[Int8, Int16, Int32, Int64]
        B --> B2[UInt8, UInt16, UInt32, UInt64]
        B --> B3[Float32, Float64]
    
        C --> C1[Utf8]
        C --> C2[Categorical]
        C --> C3[Binary]
    
        D --> D1[Date]
        D --> D2[Datetime]
        D --> D3[Time]
        D --> D4[Duration]
    
        E --> E1[List]
        E --> E2[Struct]
        E --> E3[Array]
    
        F --> F1[Boolean]
        F --> F2[Null]

    from datetime import date, datetime
    
    # Data type examples with explanations (all columns must have the same length)
    examples_df = pl.DataFrame({
        # Integers - choose size based on your data range
        "small_int": pl.Series([1, 2], dtype=pl.Int8),                  # -128 to 127
        "big_int": pl.Series([1_000_000, 2_000_000], dtype=pl.Int64),   # Very large numbers
    
        # Floats
        "precision_float": pl.Series([1.1, 2.2], dtype=pl.Float32),     # Less precision, less memory
        "high_precision": pl.Series([1.123456789, 2.987654321], dtype=pl.Float64),  # More precision
    
        # Text
        "text": pl.Series(["Hello", "World"], dtype=pl.Utf8),
        "category": pl.Series(["A", "B"], dtype=pl.Categorical),        # Efficient for repeated values
    
        # Temporal
        "date": pl.Series([date(2023, 1, 1), date(2023, 1, 2)]),
        "datetime": pl.Series([datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 2, 12, 0)]),
    
        # Boolean
        "flag": pl.Series([True, False], dtype=pl.Boolean),
    
        # Complex types
        "numbers_list": pl.Series([[1, 2, 3], [4, 5]], dtype=pl.List(pl.Int32)),
        "person": pl.Series(
            [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}],
            dtype=pl.Struct([
                pl.Field("name", pl.Utf8),
                pl.Field("age", pl.Int32)
            ])
        )
    })
    
    print("🎯 Data types showcase:")
    print(examples_df.dtypes)
    Python

    The Expression System: Building Blocks

    flowchart TD
        A[Expression Building Blocks] --> B[Column References]
        A --> C[Literals]
        A --> D[Functions]
        A --> E[Conditionals]
        A --> F[Aggregations]
    
        B --> B1["pl.col('name')"]
        B --> B2["pl.all()"]
        B --> B3["pl.exclude('id')"]
    
        C --> C1["pl.lit(42)"]
        C --> C2["pl.lit('text')"]
    
        D --> D1[".str.len()"]
        D --> D2[".dt.year()"]
        D --> D3[".abs()"]
    
        E --> E1["pl.when().then()"]
        E --> E2[".otherwise()"]
    
        F --> F1[".sum()"]
        F --> F2[".mean()"]
        F --> F3[".count()"]
    # Expression building blocks demonstration
    demo_df = pl.DataFrame({
        "name": ["Alice Johnson", "Bob Smith", "Charlie Brown"],
        "birth_date": ["1990-05-15", "1985-12-20", "1992-03-10"],
        "score": [85, 92, 78],
        "active": [True, False, True]
    })
    
    # Column references
    print("🔗 Column references:")
    all_cols = demo_df.select(pl.all())  # Select all columns
    name_only = demo_df.select(pl.col("name"))  # Select specific column
    no_active = demo_df.select(pl.exclude("active"))  # All except 'active'
    
    # Literals and computations
    enriched = demo_df.with_columns([
        pl.lit("Student").alias("role"),  # Add literal column
        (pl.col("score") * 1.1).alias("boosted_score"),  # Mathematical operation
        pl.col("name").str.len().alias("name_length"),  # String function
    ])
    
    print("📊 Enriched data:")
    print(enriched)
    
    # Conditional expressions
    categorized = demo_df.with_columns([
        pl.when(pl.col("score") >= 90)
        .then(pl.lit("Excellent"))
        .when(pl.col("score") >= 80)
        .then(pl.lit("Good"))
        .when(pl.col("score") >= 70)
        .then(pl.lit("Average"))
        .otherwise(pl.lit("Needs Improvement"))
        .alias("grade")
    ])
    
    print("🎓 Grade categories:")
    print(categorized.select(["name", "score", "grade"]))
    Python

    Performance Implications

    # Understanding eager vs lazy performance
    import time
    
    # Create larger dataset for demonstration
    large_df = pl.DataFrame({
        "id": range(100_000),
        "value": range(100_000),
        "category": ["A", "B", "C", "D"] * 25_000
    })
    
    print("⚡ Performance comparison:")
    
    # Eager evaluation
    start = time.time()
    eager_result = (
        large_df
        .filter(pl.col("value") > 50_000)
        .group_by("category")
        .agg(pl.col("value").sum())
    )
    eager_time = time.time() - start
    
    # Lazy evaluation
    start = time.time()
    lazy_result = (
        large_df
        .lazy()
        .filter(pl.col("value") > 50_000)
        .group_by("category")
        .agg(pl.col("value").sum())
        .collect()
    )
    lazy_time = time.time() - start
    
    print(f"Eager time: {eager_time:.4f}s")
    print(f"Lazy time: {lazy_time:.4f}s")
    print(f"Performance difference: {eager_time/lazy_time:.2f}x")
    Python

    Chapter Summary

    Key Concepts Mastered:

    • DataFrame: Your primary data container with eager evaluation
    • LazyFrame: Query builder with optimization capabilities
    • Series: Individual columns with specific data types
    • Expressions: Powerful building blocks for data operations
    • Data Types: Choosing the right type for memory and performance

    The Expression Advantage:

    • Composable and reusable (see the sketch below)
    • Type-safe operations
    • Optimizable by the query engine
    • Readable and intuitive syntax
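
    As a brief sketch of that composability (the column names here are illustrative), the same expression object can be reused across different contexts:

    import polars as pl
    
    df = pl.DataFrame({"dept": ["a", "a", "b"], "salary": [50_000, 60_000, 70_000]})
    
    # Define the computation once...
    salary_k = (pl.col("salary") / 1_000).alias("salary_k")
    
    # ...and reuse it in an eager selection and in a grouped aggregation
    print(df.select(salary_k))
    print(df.group_by("dept").agg(salary_k.mean()))
    Python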

    🔗 What’s Next?

    Now that you understand Polars’ foundation, let’s build your first complete Polars program!


    4. Your First Polars Program

    🎯 Learning Objectives

    By the end of this chapter, you will:

    • Build a complete data analysis program using Polars
    • Apply core concepts in a practical scenario
    • Understand common patterns and workflows
    • Gain confidence in writing Polars code

    The Scenario: Employee Data Analysis

    Let’s build a program that analyzes employee data for a fictional company. This will demonstrate real-world Polars usage patterns.

    import polars as pl
    from datetime import date, datetime
    import random
    
    def create_sample_data():
        """Create realistic employee data for our analysis"""
    
        # Set seed for reproducible results
        random.seed(42)
    
        departments = ["Engineering", "Sales", "Marketing", "HR", "Finance"]
        cities = ["New York", "London", "Tokyo", "Berlin", "Sydney"]
    
        # Generate 1000 employees
        employees = []
        for i in range(1000):
            employees.append({
                "employee_id": f"EMP{i+1:04d}",
                "name": f"Employee {i+1}",
                "department": random.choice(departments),
                "salary": random.randint(40000, 150000),
                "hire_date": date(
                    random.randint(2018, 2023),
                    random.randint(1, 12),
                    random.randint(1, 28)
                ),
                "city": random.choice(cities),
                "performance_score": round(random.uniform(1.0, 5.0), 1),
                "is_remote": random.choice([True, False])
            })
    
        return pl.DataFrame(employees)
    
    def analyze_employee_data():
        """Complete employee data analysis workflow"""
    
        print("🏢 Employee Data Analysis")
        print("=" * 50)
    
        # Step 1: Create data
        print("📊 Creating sample data...")
        df = create_sample_data()
    
        print(f"✅ Created dataset with {df.height} employees")
        print(f"📋 Columns: {', '.join(df.columns)}")
    
        # Step 2: Basic exploration
        print("\n🔍 Data Overview:")
        print(df.head())
        print(f"\nData types:\n{df.dtypes}")
    
        # Step 3: Data quality check
        print("\n🔍 Data Quality Check:")
        null_counts = df.null_count()
        print(null_counts)
    
        # Step 4: Department analysis
        print("\n🏢 Department Analysis:")
        dept_analysis = (
            df
            .group_by("department")
            .agg([
                pl.col("salary").mean().round(0).alias("avg_salary"),
                pl.col("salary").min().alias("min_salary"),
                pl.col("salary").max().alias("max_salary"),
                pl.col("employee_id").count().alias("employee_count"),
                pl.col("performance_score").mean().round(2).alias("avg_performance")
            ])
            .sort("avg_salary", descending=True)
        )
        print(dept_analysis)
    
        # Step 5: Salary analysis with categories
        print("\n💰 Salary Categories:")
        salary_categorized = (
            df
            .with_columns([
                pl.when(pl.col("salary") >= 100000)
                .then(pl.lit("High"))
                .when(pl.col("salary") >= 70000)
                .then(pl.lit("Medium"))
                .otherwise(pl.lit("Low"))
                .alias("salary_category")
            ])
            .group_by("salary_category")
            .agg([
                pl.col("employee_id").count().alias("count"),
                pl.col("salary").mean().round(0).alias("avg_salary")
            ])
            .sort("avg_salary", descending=True)
        )
        print(salary_categorized)
    
        # Step 6: Geographic distribution
        print("\n🌍 Geographic Distribution:")
        city_analysis = (
            df
            .group_by(["city", "is_remote"])
            .agg([
                pl.col("employee_id").count().alias("employee_count"),
                pl.col("salary").mean().round(0).alias("avg_salary")
            ])
            .sort(["city", "is_remote"])
        )
        print(city_analysis)
    
        # Step 7: Tenure analysis
        print("\n📅 Employee Tenure Analysis:")
        tenure_analysis = (
            df
            .with_columns([
                ((pl.lit(date.today()) - pl.col("hire_date")).dt.days() / 365.25)
                .round(1)
                .alias("years_of_service")
            ])
            .with_columns([
                pl.when(pl.col("years_of_service") >= 5)
                .then(pl.lit("Veteran"))
                .when(pl.col("years_of_service") >= 2)
                .then(pl.lit("Experienced"))
                .otherwise(pl.lit("New"))
                .alias("tenure_category")
            ])
            .group_by("tenure_category")
            .agg([
                pl.col("employee_id").count().alias("count"),
                pl.col("years_of_service").mean().round(1).alias("avg_tenure"),
                pl.col("salary").mean().round(0).alias("avg_salary"),
                pl.col("performance_score").mean().round(2).alias("avg_performance")
            ])
            .sort("avg_tenure", descending=True)
        )
        print(tenure_analysis)
    
        # Step 8: High performer identification
        print("\n⭐ High Performers (Top 10%):")
        high_performers = (
            df
            .filter(pl.col("performance_score") >= df["performance_score"].quantile(0.9))
            .select([
                "employee_id", "name", "department", "salary", 
                "performance_score", "hire_date"
            ])
            .sort("performance_score", descending=True)
            .head(10)
        )
        print(high_performers)
    
        # Step 9: Summary insights
        print("\n📈 Key Insights:")
        total_employees = df.height
        avg_salary = df["salary"].mean()
        avg_performance = df["performance_score"].mean()
        remote_pct = (df["is_remote"].sum() / total_employees * 100)
    
        print(f"• Total Employees: {total_employees:,}")
        print(f"• Average Salary: ${avg_salary:,.0f}")
        print(f"• Average Performance Score: {avg_performance:.2f}/5.0")
        print(f"• Remote Workers: {remote_pct:.1f}%")
    
        return df
    
    # Bonus: Using lazy evaluation for large datasets
    def demonstrate_lazy_evaluation():
        """Show the power of lazy evaluation"""
    
        print("\n⚡ Lazy Evaluation Demonstration")
        print("=" * 40)
    
        # Create a larger dataset
        large_df = pl.DataFrame({
            "id": range(100_000),
            "value": range(100_000),
            "category": ["A", "B", "C", "D", "E"] * 20_000
        })
    
        # Build a complex lazy query
        lazy_query = (
            large_df
            .lazy()
            .filter(pl.col("value") > 50_000)
            .with_columns([
                (pl.col("value") * 2).alias("doubled_value"),
                pl.col("category").cast(pl.Categorical).alias("category")
            ])
            .group_by("category")
            .agg([
                pl.col("value").sum().alias("total_value"),
                pl.col("doubled_value").mean().alias("avg_doubled"),
                pl.col("id").count().alias("count")
            ])
            .sort("total_value", descending=True)
        )
    
        print("🔍 Query Plan:")
        print(lazy_query.explain())
    
        print("\n📊 Execution Results:")
        result = lazy_query.collect()
        print(result)
    
    if __name__ == "__main__":
        # Run the main analysis
        employee_df = analyze_employee_data()
    
        # Demonstrate lazy evaluation
        demonstrate_lazy_evaluation()
    
        print("\n🎉 Analysis Complete!")
        print("You've successfully built and run your first comprehensive Polars program!")
    Python

    Understanding the Code Structure

    flowchart TD
        A[Data Creation] --> B[Data Exploration]
        B --> C[Data Quality Check]
        C --> D[Business Analysis]
        D --> E[Advanced Analytics]
        E --> F[Insights & Reporting]
    
        D --> D1[Department Analysis]
        D --> D2[Salary Categorization]
        D --> D3[Geographic Distribution]
    
        E --> E1[Tenure Analysis]
        E --> E2[Performance Ranking]
        E --> E3[Predictive Features]
    
        style A fill:#f96
        style D fill:#9f6
        style E fill:#69f
        style F fill:#f69

    Key Patterns Demonstrated

    1. Data Creation with Realistic Constraints

    # Using seeds for reproducible results
    random.seed(42)
    
    # Creating structured data with relationships
    
    employees.append({
        "employee_id": f"EMP{i+1:04d}",  # Consistent ID format
        "department": random.choice(departments),  # Controlled categories
        "salary": random.randint(40000, 150000),  # Realistic ranges
    })
    Python

    2. Progressive Data Enhancement

    # Building on previous calculations
    df.with_columns([
        # First calculate years of service
        ((pl.lit(date.today()) - pl.col("hire_date")).dt.days() / 365.25)
        .round(1)
        .alias("years_of_service")
    ]).with_columns([
        # Then categorize based on the calculation
        pl.when(pl.col("years_of_service") >= 5)
        .then(pl.lit("Veteran"))
        # ...
    ])
    Python
    3. Business Logic Implementation

    # Complex categorization logic
    pl.when(pl.col("salary") >= 100000).then(pl.lit("High"))
    .when(pl.col("salary") >= 70000).then(pl.lit("Medium"))
    .otherwise(pl.lit("Low"))
    Python

    Performance Tips Applied

    • Efficient data types: Using pl.Categorical for repeated strings (see the memory comparison below)
    • Method chaining: Readable, efficient operations
    • Lazy evaluation: Demonstrated for large datasets
    • Selective operations: Only computing what’s needed
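
    A quick, hedged way to see the Categorical effect is to compare estimated memory before and after casting; the exact numbers will vary by machine and Polars version.

    import polars as pl
    
    df = pl.DataFrame({"dept": ["Engineering", "Sales", "Marketing"] * 100_000})
    
    as_utf8 = df.estimated_size()
    as_cat = df.with_columns(pl.col("dept").cast(pl.Categorical)).estimated_size()
    
    print(f"Utf8:        {as_utf8:,} bytes")
    print(f"Categorical: {as_cat:,} bytes")
    Python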

    Chapter Summary

    Skills Acquired:

    • End-to-end workflow: From data creation to insights
    • Business logic: Implementing real-world categorization rules
    • Performance patterns: Efficient data processing techniques
    • Code organization: Structured, readable analysis functions

    Polars Patterns Mastered:

    • Progressive data enhancement with with_columns()
    • Complex grouping and aggregation
    • Conditional logic with when().then().otherwise()
    • Date/time calculations and analysis
    • Lazy evaluation for performance

    🔗 What’s Next?

    Ready to learn about the various ways to load and create data in Polars!


    5. Loading and Creating Data

    🎯 Learning Objectives

    By the end of this chapter, you will:

    • Master multiple ways to create DataFrames from scratch
    • Load data from various file formats efficiently
    • Understand when to use eager vs lazy loading
    • Handle data type specifications and schema validation

    Data Sources Overview

    graph TD
        A[Data Sources] --> B[File Formats]
        A --> C[In-Memory Creation]
        A --> D[External Systems]
        A --> E[Other Libraries]
    
        B --> B1[CSV/TSV]
        B --> B2[Parquet]
        B --> B3[JSON/NDJSON]
        B --> B4[Excel]
        B --> B5[Arrow/IPC]
    
        C --> C1[Dictionaries]
        C --> C2[Lists]
        C --> C3[Ranges]
        C --> C4[Manual Schema]
    
        D --> D1[Databases]
        D --> D2[APIs]
        D --> D3[Cloud Storage]
    
        E --> E1[Pandas]
        E --> E2[NumPy]
        E --> E3[PyArrow]
    
        style A fill:#f96
        style B fill:#9f6
        style C fill:#69f
        style D fill:#f69

    Creating DataFrames from Scratch

    1. From Dictionaries (Most Common)

    import polars as pl
    import numpy as np
    from datetime import date, datetime
    
    # Basic dictionary creation
    df = pl.DataFrame({
        "name": ["Alice", "Bob", "Charlie", "Diana"],
        "age": [25, 30, 35, 28],
        "salary": [50000, 60000, 70000, 55000],
        "active": [True, False, True, True]
    })
    
    print("📊 Basic DataFrame:")
    print(df)
    print(f"Schema: {df.schema}")
    
    # Advanced dictionary with mixed types
    advanced_df = pl.DataFrame({
        "employee_id": [f"EMP{i:03d}" for i in range(1, 5)],
        "hire_date": [date(2020, 1, 15), date(2019, 6, 10), date(2021, 3, 22), date(2020, 11, 5)],
        "last_login": [
            datetime(2023, 12, 1, 14, 30),
            datetime(2023, 12, 2, 9, 15),
            None,  # Missing value
            datetime(2023, 12, 1, 16, 45)
        ],
        "skills": [["Python", "SQL"], ["Java", "Docker"], ["Python", "React"], ["SQL", "Tableau"]],
        "performance_scores": [[4.5, 4.2, 4.8], [3.9, 4.1], [4.7, 4.9, 4.6], [4.0]]
    })
    
    print("\n🔧 Advanced DataFrame with complex types:")
    print(advanced_df)
    Python

    2. From Lists and Sequences

    # List of lists with schema
    data_rows = [
        ["Alice", 25, "Engineering"],
        ["Bob", 30, "Sales"],
        ["Charlie", 35, "Marketing"],
        ["Diana", 28, "Engineering"]
    ]
    
    df_from_lists = pl.DataFrame(
        data_rows,
        schema=["name", "age", "department"],
        orient="row"  # Default: each inner list is a row
    )
    
    print("📋 DataFrame from lists:")
    print(df_from_lists)
    
    # Column-oriented data
    col_data = [
        ["Alice", "Bob", "Charlie"],     # Names column
        [25, 30, 35],                    # Ages column
        ["NYC", "LA", "Chicago"]         # Cities column
    ]
    
    df_from_columns = pl.DataFrame(
        col_data,
        schema=["name", "age", "city"],
        orient="col"  # Each inner list is a column
    )
    
    print("\n📊 DataFrame from column lists:")
    print(df_from_columns)
    Python

    3. From NumPy Arrays

    # Random data generation
    np.random.seed(42)
    array_data = np.random.randn(1000, 4)  # 1000 rows, 4 columns
    
    df_numpy = pl.DataFrame(
        array_data,
        schema=["feature_1", "feature_2", "feature_3", "target"]
    )
    
    print("🔢 DataFrame from NumPy (first 5 rows):")
    print(df_numpy.head())
    print(f"Shape: {df_numpy.shape}")
    
    # Structured arrays
    structured_array = np.array([
        ("Alice", 25, 50000.0),
        ("Bob", 30, 60000.0),
        ("Charlie", 35, 70000.0)
    ], dtype=[("name", "U10"), ("age", "i4"), ("salary", "f8")])
    
    df_structured = pl.DataFrame(structured_array)
    print("\n🏗️ DataFrame from structured NumPy array:")
    print(df_structured)
    Python

    4. Empty DataFrames with Schema

    # Define schema first, add data later
    schema = {
        "user_id": pl.Int64,
        "username": pl.Utf8,
        "email": pl.Utf8,
        "signup_date": pl.Date,
        "is_premium": pl.Boolean,
        "login_count": pl.Int32,
        "last_active": pl.Datetime,
        "preferences": pl.List(pl.Utf8)
    }
    
    empty_df = pl.DataFrame(schema=schema)
    print("📋 Empty DataFrame with schema:")
    print(f"Columns: {empty_df.columns}")
    print(f"Schema: {empty_df.schema}")
    
    # Add rows to empty DataFrame
    new_rows = pl.DataFrame({
        "user_id": [1, 2],
        "username": ["alice_dev", "bob_analyst"],
        "email": ["alice@company.com", "bob@company.com"],
        "signup_date": [date(2023, 1, 15), date(2023, 2, 20)],
        "is_premium": [True, False],
        "login_count": [45, 23],
        "last_active": [datetime(2023, 12, 1, 14, 30), datetime(2023, 11, 28, 9, 15)],
        "preferences": [["dark_mode", "notifications"], ["light_mode"]]
    }, schema=schema)  # reuse the schema so dtypes (e.g. Int32) match exactly
    
    filled_df = pl.concat([empty_df, new_rows])
    print("\n✅ DataFrame after adding data:")
    print(filled_df)
    Python

    File Loading Mastery

    CSV Files: The Workhorse

    # Basic CSV reading (assuming you have a CSV file)
    # df = pl.read_csv("data.csv")
    
    # Let's create a sample CSV first
    sample_csv_data = """id,name,age,department,salary,hire_date
    1,Alice Johnson,25,Engineering,75000,2021-03-15
    2,Bob Smith,30,Sales,65000,2020-07-22
    3,Charlie Brown,35,Marketing,70000,2019-11-10
    4,Diana Wilson,28,Engineering,80000,2022-01-08
    5,Eve Davis,32,Finance,72000,2020-05-18"""
    
    # Save to file for demonstration
    with open("sample_employees.csv", "w") as f:
        f.write(sample_csv_data)
    
    # Now read with various options
    df_basic = pl.read_csv("sample_employees.csv")
    print("📄 Basic CSV reading:")
    print(df_basic)
    
    # Advanced CSV reading with options
    df_advanced = pl.read_csv(
        "sample_employees.csv",
        dtypes={
            "id": pl.Int32,
            "age": pl.Int16,
            "salary": pl.Int32,
            "hire_date": pl.Date
        },
        try_parse_dates=True,  # attempt to parse date-like string columns
        null_values=["", "NULL", "N/A"],  # What to treat as null
        skip_rows=0,  # Skip leading rows before the header if needed
        n_rows=3  # Limit rows for testing
    )
    
    print("\n🔧 Advanced CSV reading:")
    print(df_advanced)
    print(f"Dtypes: {df_advanced.dtypes}")
    Python

    Lazy Loading for Large Files

    # Lazy CSV scanning (doesn't load data immediately)
    lazy_df = pl.scan_csv("sample_employees.csv")
    
    print("🔍 Lazy DataFrame operations:")
    print("Query plan:")
    
    # Build a query without executing
    query = (
        lazy_df
        .filter(pl.col("salary") > 70000)
        .select(["name", "department", "salary"])
        .sort("salary", descending=True)
    )
    
    print(query.explain())
    
    # Execute when ready
    result = query.collect()
    print("\n📊 Query results:")
    print(result)
    
    # Clean up
    import os
    os.remove("sample_employees.csv")
    Python

    JSON and Other Formats

    # JSON data example
    json_data = [
        {"id": 1, "name": "Alice", "scores": [85, 92, 78], "metadata": {"team": "A", "level": "senior"}},
        {"id": 2, "name": "Bob", "scores": [90, 88, 92], "metadata": {"team": "B", "level": "junior"}},
        {"id": 3, "name": "Charlie", "scores": [75, 85, 80], "metadata": {"team": "A", "level": "mid"}}
    ]
    
    # Save JSON for reading
    import json
    with open("sample_data.json", "w") as f:
        json.dump(json_data, f)
    
    # Read JSON
    df_json = pl.read_json("sample_data.json")
    print("📄 JSON DataFrame:")
    print(df_json)
    
    # NDJSON (newline-delimited JSON) - each line is a JSON object
    ndjson_data = """{"id": 1, "name": "Alice", "value": 10}
    {"id": 2, "name": "Bob", "value": 20}
    {"id": 3, "name": "Charlie", "value": 30}"""
    
    with open("sample_data.ndjson", "w") as f:
        f.write(ndjson_data)
    
    df_ndjson = pl.read_ndjson("sample_data.ndjson")
    print("\n📄 NDJSON DataFrame:")
    print(df_ndjson)
    
    # Clean up
    os.remove("sample_data.json")
    os.remove("sample_data.ndjson")
    Python

    Creating Series: Building Blocks

    # Basic Series creation
    numbers = pl.Series("numbers", [1, 2, 3, 4, 5])
    names = pl.Series("employee_names", ["Alice", "Bob", "Charlie"])
    
    print("📊 Basic Series:")
    print(f"Numbers: {numbers}")
    print(f"Names: {names}")
    
    # Typed Series with explicit dtypes
    typed_series = pl.Series("precise_values", [1.1, 2.2, 3.3], dtype=pl.Float32)
    categorical_series = pl.Series("categories", ["A", "B", "A", "C"], dtype=pl.Categorical)
    
    print(f"\n🎯 Typed Series:")
    print(f"Float32: {typed_series.dtype}")
    print(f"Categorical: {categorical_series.dtype}")
    
    # Series with null values
    with_nulls = pl.Series("incomplete", [1, 2, None, 4, None, 6])
    print(f"\n❓ Series with nulls: {with_nulls}")
    print(f"Null count: {with_nulls.null_count()}")
    
    # Series from ranges and generators
    range_series = pl.Series("range_data", range(1000))
    print(f"\n📈 Range Series (first 10): {range_series.head(10)}")
    
    # Date/time series
    date_series = pl.Series("dates", [
        date(2023, 1, 1),
        date(2023, 1, 2),
        date(2023, 1, 3)
    ])
    
    datetime_series = pl.Series("timestamps", [
        datetime(2023, 1, 1, 12, 0, 0),
        datetime(2023, 1, 1, 13, 30, 0),
        datetime(2023, 1, 1, 15, 45, 0)
    ])
    
    print(f"\n📅 Date Series: {date_series}")
    print(f"🕐 DateTime Series: {datetime_series}")
    Python

    Data Loading Best Practices

    flowchart TD
        A[Choose Loading Method] --> B{Data Size}
        B -->|Small < 100MB| C[Eager Loading]
        B -->|Large > 100MB| D[Lazy Loading]
    
        C --> E[read_csv, read_parquet]
        D --> F[scan_csv, scan_parquet]
    
        G[File Format Choice] --> H{Use Case}
        H -->|Fast I/O| I[Parquet]
        H -->|Human Readable| J[CSV]
        H -->|Nested Data| K[JSON]
        H -->|Streaming| L[NDJSON]
    
        M[Schema Handling] --> N[Explicit Types]
        N --> O[Better Performance]
        N --> P[Data Validation]
    
        style D fill:#f96
        style I fill:#9f6
        style N fill:#69f
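
    The flowchart above recommends Parquet when I/O speed matters. A minimal round-trip sketch (the file name is illustrative):

    import polars as pl
    
    df = pl.DataFrame({"id": range(1_000), "value": range(1_000)})
    
    # Columnar, compressed on-disk format
    df.write_parquet("example.parquet")
    
    # Eager read, or lazy scan for predicate/column pushdown
    eager = pl.read_parquet("example.parquet")
    lazy = pl.scan_parquet("example.parquet").filter(pl.col("value") > 500).collect()
    
    print(eager.shape, lazy.shape)
    Python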

    Performance Tips

    # 1. Use appropriate data types
    df_optimized = pl.DataFrame({
        "small_int": pl.Series([1, 2, 3], dtype=pl.Int8),  # vs Int64
        "category": pl.Series(["A", "B", "A"], dtype=pl.Categorical),  # vs Utf8
        "flag": pl.Series([True, False, True], dtype=pl.Boolean)  # vs string
    })
    
    print("💡 Optimized DataFrame memory usage:")
    print(f"Schema: {df_optimized.schema}")
    
    # 2. Lazy loading for large datasets
    def demonstrate_lazy_benefits():
        """Show lazy loading benefits"""
    
        # For demonstration - in real use, this would be a large file
        large_data = {
            "id": range(10_000),
            "value": range(10_000),
            "category": ["A", "B", "C", "D", "E"] * 2_000
        }
    
        # Save as parquet for demo
        temp_df = pl.DataFrame(large_data)
        temp_df.write_parquet("temp_large.parquet")
    
        # Lazy scan - no data loaded yet
        lazy_scan = pl.scan_parquet("temp_large.parquet")
    
        # Only load what we need
        result = (
            lazy_scan
            .filter(pl.col("value") > 8000)  # Predicate pushdown
            .select(["id", "category"])      # Column selection
            .group_by("category")
            .agg(pl.col("id").count())
            .collect()  # Execute query
        )
    
        print("⚡ Lazy loading result:")
        print(result)
    
        # Clean up
        os.remove("temp_large.parquet")
    
    demonstrate_lazy_benefits()
    Python

    Chapter Summary

    Data Creation Methods Mastered:

    • Dictionaries: Most flexible and intuitive method
    • Lists: Efficient for structured data
    • NumPy integration: Seamless array handling
    • Schema definition: Type-safe DataFrame creation

    File Loading Expertise:

    • CSV: Most common format with advanced options
    • JSON/NDJSON: Nested and streaming data
    • Lazy loading: Memory-efficient large file handling
    • Performance optimization: Right types, right methods

    Key Patterns:

    • Use lazy loading (scan_*) for large files
    • Specify data types explicitly for better performance
    • Choose file formats based on use case (Parquet for speed, CSV for compatibility)
    • Handle missing values appropriately during loading

    🔗 What’s Next?

    Now that you can create and load data efficiently, let’s learn how to explore and understand your datasets!


    6. Data Inspection and Exploration

    Basic Information

    # DataFrame info
    print(df.shape)          # (rows, columns)
    print(df.columns)        # Column names
    print(df.dtypes)         # Data types
    print(df.schema)         # Schema with types
    
    # Statistical summary
    df.describe()            # Statistical summary
    df.null_count()         # Null values per column
    df.is_empty()           # Check if empty
    
    # Memory usage
    df.estimated_size()     # Estimated memory usage
    Python

    Data Preview

    # View data
    df.head()               # First 5 rows
    df.head(10)             # First 10 rows
    df.tail()               # Last 5 rows
    df.sample(5)            # Random 5 rows
    df.glimpse()            # Transposed view
    
    # Unique values
    df["column"].unique()
    df["column"].n_unique()
    df["column"].value_counts()
    Python

    Data Quality Checks

    # Missing values analysis
    missing_report = df.select([
        pl.col("*").null_count().name.suffix("_nulls"),
        pl.col("*").count().name.suffix("_total")
    ])
    
    # Duplicate detection
    df.is_duplicated()      # Boolean mask for duplicates
    df.n_unique()          # Count unique combinations
    Python

    Data Flow Visualization

    flowchart TD
        A[Raw Data] --> B[Load into Polars]
        B --> C[Data Inspection]
        C --> D{Data Quality OK?}
        D -->|No| E[Data Cleaning]
        D -->|Yes| F[Data Analysis]
        E --> F
        F --> G[Results]
    
        subgraph "Inspection Methods"
            H[.head/.tail]
            I[.describe]
            J[.null_count]
            K[.dtypes]
        end
    
        C --> H
        C --> I
        C --> J
        C --> K

    7. Data Selection and Filtering

    Column Selection

    # Single column
    df.select("name")
    df["name"]              # Returns Series
    
    # Multiple columns
    df.select(["name", "age"])
    df.select(pl.col("name"), pl.col("age"))
    
    # Column patterns
    df.select(pl.col("^.*_id$"))        # Regex pattern
    df.select(pl.col("*").exclude("id")) # All except id
    
    # By data type
    df.select(pl.col(pl.Utf8))          # String columns
    df.select(pl.col(pl.NUMERIC_DTYPES)) # Numeric columns
    Python

    Row Filtering

    # Simple conditions
    df.filter(pl.col("age") > 25)
    df.filter(pl.col("name") == "Alice")
    df.filter(pl.col("city").is_in(["New York", "London"]))
    
    # Complex conditions
    df.filter(
        (pl.col("age") > 25) & 
        (pl.col("city") == "New York")
    )
    
    df.filter(
        (pl.col("age") < 20) | 
        (pl.col("age") > 60)
    )
    
    # String operations
    df.filter(pl.col("name").str.contains("^A"))
    df.filter(pl.col("name").str.starts_with("B"))
    df.filter(pl.col("name").str.ends_with("e"))
    
    # Null filtering
    df.filter(pl.col("value").is_null())
    df.filter(pl.col("value").is_not_null())
    Python

    Advanced Selection

    # Conditional selection
    df.select(
        pl.when(pl.col("age") > 30)
        .then(pl.lit("Senior"))
        .otherwise(pl.lit("Junior"))
        .alias("category")
    )
    
    # Multiple conditions
    df.select(
        pl.when(pl.col("score") >= 90).then("A")
        .when(pl.col("score") >= 80).then("B")  
        .when(pl.col("score") >= 70).then("C")
        .otherwise("F")
        .alias("grade")
    )
    
    # Dynamic column selection
    numeric_cols = [col for col, dtype in df.schema.items() 
                    if dtype in pl.NUMERIC_DTYPES]
    df.select(numeric_cols)
    Python

    Indexing and Slicing

    # Row slicing
    df[0]                   # First row as DataFrame
    df[0:5]                 # First 5 rows
    df[-5:]                 # Last 5 rows
    
    # Boolean filtering with a mask
    mask = df["age"] > 25
    df.filter(mask)         # prefer filter() over square-bracket masking
    
    # Sample and slice
    df.slice(10, 20)        # Rows 10-29
    df.sample(n=100)        # Random 100 rows
    df.sample(fraction=0.1) # Random 10% of rows
    Python

    8. Data Transformation

    Adding and Modifying Columns

    # Add new columns
    df = df.with_columns([
        pl.col("age").alias("age_copy"),
        (pl.col("age") * 12).alias("age_months"),
        pl.lit("default_value").alias("new_column")
    ])
    
    # Modify existing columns
    df = df.with_columns([
        pl.col("name").str.to_uppercase().alias("name"),
        pl.col("age").cast(pl.Float64).alias("age")
    ])
    
    # Multiple transformations
    df = df.with_columns([
        # String manipulations
        pl.col("name").str.len().alias("name_length"),
        pl.col("city").str.replace("New ", "").alias("city_clean"),
    
        # Mathematical operations
        pl.col("value").log().alias("log_value"),
        pl.col("value").pow(2).alias("value_squared"),
    
        # Date operations
        pl.col("date").dt.year().alias("year"),
        pl.col("date").dt.month().alias("month")
    ])
    Python

    String Operations

    # String methods
    df.with_columns([
        pl.col("text").str.len().alias("length"),
        pl.col("text").str.to_lowercase().alias("lower"),
        pl.col("text").str.to_uppercase().alias("upper"),
        pl.col("text").str.strip().alias("stripped"),
        pl.col("text").str.slice(0, 5).alias("first_5"),
        pl.col("text").str.split(" ").alias("words"),
        pl.col("text").str.contains("pattern").alias("has_pattern"),
        pl.col("text").str.replace("old", "new").alias("replaced")
    ])
    
    # Regular expressions
    df.with_columns([
        pl.col("email").str.extract(r"([^@]+)@([^.]+)").alias("email_parts"),
        pl.col("phone").str.replace_all(r"[^\d]", "").alias("phone_clean")
    ])
    Python

    Mathematical Operations

    # Basic math
    df.with_columns([
        (pl.col("a") + pl.col("b")).alias("sum"),
        (pl.col("a") - pl.col("b")).alias("diff"), 
        (pl.col("a") * pl.col("b")).alias("product"),
        (pl.col("a") / pl.col("b")).alias("ratio"),
        (pl.col("a") % pl.col("b")).alias("remainder"),
        pl.col("a").pow(pl.col("b")).alias("power")
    ])
    
    # Advanced math
    df.with_columns([
        pl.col("value").log().alias("log"),
        pl.col("value").log10().alias("log10"),
        pl.col("value").exp().alias("exp"),
        pl.col("value").sqrt().alias("sqrt"),
        pl.col("angle").sin().alias("sin"),
        pl.col("angle").cos().alias("cos"),
        pl.col("value").abs().alias("absolute")
    ])
    
    # Statistical functions
    df.with_columns([
        pl.col("values").rank().alias("rank"),
        pl.col("values").pct_change().alias("pct_change"),
        pl.col("values").cumsum().alias("cumsum"),
        pl.col("values").diff().alias("diff")
    ])
    Python

    Conditional Transformations

    # Simple conditions
    df.with_columns([
        pl.when(pl.col("age") >= 18)
        .then(pl.lit("Adult"))
        .otherwise(pl.lit("Minor"))
        .alias("age_group")
    ])
    
    # Complex conditions
    df.with_columns([
        pl.when(pl.col("score") >= 90).then("Excellent")
        .when(pl.col("score") >= 80).then("Good")
        .when(pl.col("score") >= 70).then("Average")
        .when(pl.col("score") >= 60).then("Below Average")
        .otherwise("Poor")
        .alias("performance")
    ])
    
    # Multiple column conditions
    df.with_columns([
        pl.when((pl.col("age") > 25) & (pl.col("salary") > 50000))
        .then(pl.lit("Senior"))
        .when((pl.col("age") > 22) & (pl.col("salary") > 30000))
        .then(pl.lit("Mid-level"))
        .otherwise(pl.lit("Junior"))
        .alias("level")
    ])
    Python

    Data Type Conversions

    # Type casting
    df.with_columns([
        pl.col("string_number").cast(pl.Int64),
        pl.col("integer").cast(pl.Utf8),
        pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
        pl.col("timestamp").cast(pl.Datetime)
    ])
    
    # Safe casting with error handling
    df.with_columns([
        pl.col("maybe_number").cast(pl.Int64, strict=False).alias("number"),
        pl.col("date_text").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
    ])
    Python

    Transformation Pipeline

    flowchart LR
        A[Raw Data] --> B[String Cleaning]
        B --> C[Type Conversion]
        C --> D[Mathematical Operations]
        D --> E[Conditional Logic]
        E --> F[Final Dataset]
    
        subgraph "Transformation Steps"
            G[with_columns]
            H[select]
            I[filter]
            J[cast]
        end

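    The diagram above maps onto a single chained query. Below is a minimal sketch on a
    made-up frame; the column names (name, amount_str, signup_date) are illustrative only.

    import polars as pl
    
    raw_df = pl.DataFrame({
        "name": ["  Alice ", "BOB"],
        "amount_str": ["120.5", "80"],
        "signup_date": ["2023-01-05", "2023-02-10"]
    })
    
    clean_df = (
        raw_df
        # String cleaning
        .with_columns(pl.col("name").str.strip_chars().str.to_lowercase())
        # Type conversion
        .with_columns([
            pl.col("amount_str").cast(pl.Float64, strict=False).alias("amount"),
            pl.col("signup_date").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
        ])
        # Mathematical operations
        .with_columns((pl.col("amount") * 1.2).alias("amount_with_tax"))
        # Conditional logic
        .with_columns(
            pl.when(pl.col("amount") > 100)
            .then(pl.lit("large"))
            .otherwise(pl.lit("small"))
            .alias("order_size")
        )
    )
    Python
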
    8. Aggregations and Grouping

    Basic Aggregations

    # Single column aggregations
    df.select([
        pl.col("value").sum().alias("total"),
        pl.col("value").mean().alias("average"),
        pl.col("value").median().alias("median"),
        pl.col("value").std().alias("std_dev"),
        pl.col("value").min().alias("minimum"),
        pl.col("value").max().alias("maximum"),
        pl.col("value").count().alias("count"),
        pl.col("value").n_unique().alias("unique_count")
    ])
    
    # Multiple columns
    df.select([
        pl.all().sum().name.suffix("_sum"),
        pl.all().mean().name.suffix("_mean")
    ])
    
    # Quantiles and percentiles
    df.select([
        pl.col("value").quantile(0.25).alias("q25"),
        pl.col("value").quantile(0.5).alias("q50"),
        pl.col("value").quantile(0.75).alias("q75"),
        pl.col("value").quantile(0.95).alias("q95")
    ])
    Python

    Group By Operations

    # Basic grouping
    result = df.group_by("category").agg([
        pl.col("value").sum().alias("total_value"),
        pl.col("value").mean().alias("avg_value"),
        pl.col("value").count().alias("count")
    ])
    
    # Multiple grouping columns
    result = df.group_by(["department", "region"]).agg([
        pl.col("sales").sum(),
        pl.col("employees").count(),
        pl.col("salary").mean()
    ])
    
    # Dynamic grouping
    result = df.group_by(
        pl.col("date").dt.year().alias("year"),
        pl.col("date").dt.month().alias("month")
    ).agg([
        pl.col("revenue").sum(),
        pl.col("customers").n_unique()
    ])
    Python

    Advanced Aggregations

    # Custom aggregations
    result = df.group_by("group").agg([
        # Statistical measures
        pl.col("values").std().alias("std_dev"),
        pl.col("values").var().alias("variance"),
        pl.col("values").skew().alias("skewness"),
    
        # List aggregations
        pl.col("items").list().alias("all_items"),
        pl.col("values").sort().first().alias("min_value"),
        pl.col("values").sort().last().alias("max_value"),
    
        # Conditional aggregations
        pl.col("value").filter(pl.col("status") == "active").sum().alias("active_sum"),
        pl.col("value").filter(pl.col("value") > 100).count().alias("high_value_count")
    ])
    
    # Window functions in aggregations
    result = df.group_by("category").agg([
        pl.col("value").sum().alias("category_total"),
        (pl.col("value") / pl.col("value").sum()).alias("percentage")
    ])
    Python

    Rolling and Expanding Aggregations

    # Rolling window aggregations
    df.with_columns([
        pl.col("value").rolling_sum(window_size=3).alias("rolling_sum_3"),
        pl.col("value").rolling_mean(window_size=7).alias("rolling_avg_7"),
        pl.col("value").rolling_std(window_size=5).alias("rolling_std_5"),
        pl.col("value").rolling_max(window_size=10).alias("rolling_max_10")
    ])
    
    # Expanding window aggregations  
    df.with_columns([
        pl.col("value").cumsum().alias("cumulative_sum"),
        pl.col("value").cumcount().alias("cumulative_count"),
        pl.col("value").cummax().alias("cumulative_max"),
        pl.col("value").cummin().alias("cumulative_min")
    ])
    
    # Time-based rolling windows
    df.sort("timestamp").with_columns([
        pl.col("value").rolling_mean_by("timestamp", window_size="7d").alias("7day_avg"),
        pl.col("value").rolling_sum_by("timestamp", window_size="1mo").alias("monthly_sum")
    ])
    Python

    Pivot and Unpivot

    # Pivot table
    pivot_df = df.pivot(
        values="sales",
        index="region", 
        columns="month",
        aggregate_function="sum"
    )
    
    # Unpivot (melt)
    melted_df = df.melt(
        id_vars=["id", "name"],
        value_vars=["jan", "feb", "mar"],
        variable_name="month",
        value_name="sales"
    )
    
    # Multiple value columns
    melted_df = df.melt(
        id_vars=["id"],
        value_vars=["sales_jan", "sales_feb", "profit_jan", "profit_feb"],
        variable_name="metric_month",
        value_name="value"
    )
    Python

    Aggregation Pipeline

    flowchart TD
        A[Original DataFrame] --> B{Grouping Needed?}
        B -->|Yes| C[group_by]
        B -->|No| D[Direct Aggregation]
        C --> E[agg functions]
        D --> E
        E --> F[Result DataFrame]
    
        subgraph "Aggregation Functions"
            G[sum, mean, count]
            H[min, max, std]
            I[quantile, median]
            J[custom expressions]
        end
    
        E --> G
        E --> H  
        E --> I
        E --> J

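    The two branches of the diagram correspond to a direct select versus a group_by
    followed by agg. A tiny self-contained sketch:

    import polars as pl
    
    df = pl.DataFrame({
        "category": ["A", "A", "B"],
        "value": [10, 20, 30]
    })
    
    # No grouping needed: aggregate the whole frame directly
    overall = df.select(pl.col("value").sum().alias("total"))
    
    # Grouping needed: group_by, then agg
    per_category = df.group_by("category").agg(pl.col("value").sum().alias("total"))
    Python
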
    9. Joins and Concatenations

    Types of Joins

    # Sample DataFrames
    customers = pl.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "name": ["Alice", "Bob", "Charlie", "Diana"],
        "city": ["NYC", "LA", "Chicago", "Miami"]
    })
    
    orders = pl.DataFrame({
        "order_id": [101, 102, 103, 104, 105],
        "customer_id": [1, 2, 2, 3, 5],
        "amount": [100, 200, 150, 300, 250]
    })
    
    # Inner join (default)
    result = customers.join(orders, on="customer_id", how="inner")
    
    # Left join
    result = customers.join(orders, on="customer_id", how="left")
    
    # Right join  
    result = customers.join(orders, on="customer_id", how="right")
    
    # Outer join (called "full" in newer Polars releases)
    result = customers.join(orders, on="customer_id", how="outer")
    
    # Cross join
    result = customers.join(orders, how="cross")
    Python

    Advanced Join Operations

    # Multiple column joins
    df1.join(df2, on=["col1", "col2"], how="inner")
    
    # Different column names
    df1.join(df2, left_on="id", right_on="customer_id", how="left")
    
    # Join with suffix for duplicate columns
    df1.join(df2, on="id", suffix="_right")
    
    # Anti join (rows in left not in right)
    df1.join(df2, on="id", how="anti")
    
    # Semi join (rows in left that have match in right)
    df1.join(df2, on="id", how="semi")
    Python

    Concatenations

    # Vertical concatenation (stack rows)
    df_combined = pl.concat([df1, df2, df3], how="vertical")
    
    # Horizontal concatenation (side by side)
    df_combined = pl.concat([df1, df2], how="horizontal")
    
    # Diagonal concatenation (union with different schemas)
    df_combined = pl.concat([df1, df2], how="diagonal")
    
    # With rechunk for better performance
    df_combined = pl.concat([df1, df2], rechunk=True)
    Python

    Complex Join Examples

    # Multiple joins in sequence
    result = (
        customers
        .join(orders, on="customer_id", how="left")
        .join(products, left_on="product_id", right_on="id", how="left")
        .join(categories, left_on="category_id", right_on="id", how="left")
    )
    
    # Join with aggregation
    customer_stats = (
        orders
        .group_by("customer_id")
        .agg([
            pl.col("amount").sum().alias("total_spent"),
            pl.col("order_id").count().alias("order_count")
        ])
    )
    
    enriched_customers = customers.join(
        customer_stats, 
        on="customer_id", 
        how="left"
    ).fill_null(0)
    
    # Self join
    df.join(
        df.select(["id", "manager_id", "name"]).rename({"name": "manager_name"}),
        left_on="manager_id",
        right_on="id",
        how="left"
    )
    Python

    Join Performance Optimization

    # Use lazy evaluation for large datasets
    lazy_result = (
        pl.scan_parquet("customers.parquet")
        .join(
            pl.scan_parquet("orders.parquet"),
            on="customer_id",
            how="inner"
        )
        .collect()
    )
    
    # Pre-sort data for better join performance
    customers_sorted = customers.sort("customer_id")
    orders_sorted = orders.sort("customer_id")
    result = customers_sorted.join(orders_sorted, on="customer_id")
    Python

    Join Strategy Visualization

    graph LR
        subgraph "Join Types"
            A[Inner Join] --> E[Only Matching Records]
            B[Left Join] --> F[All Left + Matching Right]
            C[Right Join] --> G[All Right + Matching Left]
            D[Outer Join] --> H[All Records from Both]
        end
    
        subgraph "Special Joins"
            I[Anti Join] --> J[Left Records Without Match]
            K[Semi Join] --> L[Left Records With Match]
            M[Cross Join] --> N[Cartesian Product]
        end

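    To make the special joins concrete, here is what they return for the customers and
    orders frames defined at the start of this chapter:

    # Semi join: customers that have at least one order (Alice, Bob, Charlie)
    customers_with_orders = customers.join(orders, on="customer_id", how="semi")
    
    # Anti join: customers without any order (Diana); the order for customer_id 5
    # has no matching customer and is simply ignored
    customers_without_orders = customers.join(orders, on="customer_id", how="anti")
    Python
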
    10. Time Series Operations

    Date and Time Creation

    import polars as pl
    from datetime import datetime, date, time
    
    # Creating date/time columns
    df = pl.DataFrame({
        "date": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)],
        "datetime": [
            datetime(2023, 1, 1, 10, 30),
            datetime(2023, 1, 2, 14, 45),
            datetime(2023, 1, 3, 9, 15)
        ],
        "value": [100, 150, 200]
    })
    
    # From strings
    df = pl.DataFrame({
        "date_str": ["2023-01-01", "2023-01-02", "2023-01-03"],
        "datetime_str": ["2023-01-01 10:30:00", "2023-01-02 14:45:00", "2023-01-03 09:15:00"],
        "value": [100, 150, 200]
    })
    
    df = df.with_columns([
        pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d").alias("date"),
        pl.col("datetime_str").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S").alias("datetime")
    ])
    Python

    Date/Time Extraction

    # Extract date components
    df = df.with_columns([
        pl.col("datetime").dt.year().alias("year"),
        pl.col("datetime").dt.month().alias("month"),
        pl.col("datetime").dt.day().alias("day"),
        pl.col("datetime").dt.weekday().alias("weekday"),
        pl.col("datetime").dt.hour().alias("hour"),
        pl.col("datetime").dt.minute().alias("minute"),
        pl.col("datetime").dt.second().alias("second"),
        pl.col("datetime").dt.week().alias("week_of_year"),
        pl.col("datetime").dt.quarter().alias("quarter")
    ])
    
    # Date arithmetic
    df = df.with_columns([
        (pl.col("datetime") + pl.duration(days=7)).alias("next_week"),
        (pl.col("datetime") - pl.duration(hours=2)).alias("two_hours_ago"),
        pl.col("datetime").dt.truncate("1d").alias("date_truncated"),
        pl.col("datetime").dt.round("1h").alias("hour_rounded")
    ])
    Python

    Time Series Resampling

    import numpy as np
    
    # Create time series data
    dates = pl.date_range(
        start=date(2023, 1, 1),
        end=date(2023, 12, 31),
        interval="1d",
        eager=True
    )
    
    ts_df = pl.DataFrame({
        "date": dates,
        "value": np.random.randn(len(dates)).cumsum()
    })
    
    # Resample to different frequencies
    monthly = (
        ts_df
        .group_by_dynamic("date", every="1mo")
        .agg([
            pl.col("value").mean().alias("avg_value"),
            pl.col("value").sum().alias("total_value"),
            pl.col("value").count().alias("count")
        ])
    )
    
    # Weekly resampling with custom aggregation
    weekly = (
        ts_df
        .group_by_dynamic("date", every="1w", label="left")
        .agg([
            pl.col("value").first().alias("week_start"),
            pl.col("value").last().alias("week_end"),
            pl.col("value").max().alias("week_max"),
            pl.col("value").min().alias("week_min")
        ])
    )
    Python

    Rolling Time Windows

    # Time-based rolling windows
    df = df.sort("datetime").with_columns([
        # Rolling statistics over time periods
        pl.col("value").rolling_mean_by("datetime", window_size="7d").alias("7day_avg"),
        pl.col("value").rolling_sum_by("datetime", window_size="30d").alias("30day_sum"),
        pl.col("value").rolling_std_by("datetime", window_size="14d").alias("14day_std"),
    
        # Rolling with minimum periods
        pl.col("value").rolling_mean_by(
            "datetime", 
            window_size="7d", 
            min_periods=3
        ).alias("7day_avg_min3")
    ])
    
    # Custom rolling functions
    df = df.with_columns([
        pl.col("value").rolling_quantile_by("datetime", quantile=0.9, window_size="30d").alias("30day_p90"),
        pl.col("value").rolling_median_by("datetime", window_size="14d").alias("14day_median")
    ])
    Python

    Time Zone Handling

    # Working with time zones
    df = pl.DataFrame({
        "datetime_utc": [
            datetime(2023, 1, 1, 12, 0),
            datetime(2023, 1, 2, 12, 0)
        ],
        "value": [100, 200]
    })
    
    # Convert to different time zones
    df = df.with_columns([
        pl.col("datetime_utc").dt.replace_time_zone("UTC").alias("utc"),
        pl.col("datetime_utc").dt.replace_time_zone("UTC")
            .dt.convert_time_zone("America/New_York").alias("ny_time"),
        pl.col("datetime_utc").dt.replace_time_zone("UTC")
            .dt.convert_time_zone("Europe/London").alias("london_time")
    ])
    Python

    Advanced Time Series Operations

    # Lag and lead operations
    df = df.sort("date").with_columns([
        pl.col("value").shift(1).alias("prev_value"),
        pl.col("value").shift(-1).alias("next_value"),
        pl.col("value").shift(periods=7).alias("value_7_days_ago")
    ])
    
    # Percentage change and differences
    df = df.with_columns([
        pl.col("value").pct_change().alias("pct_change"),
        pl.col("value").diff().alias("diff"),
        (pl.col("value") / pl.col("value").shift(1) - 1).alias("manual_pct_change")
    ])
    
    # Fill missing time periods
    complete_dates = pl.date_range(
        start=df["date"].min(),
        end=df["date"].max(),
        interval="1d",
        eager=True
    )
    
    complete_df = pl.DataFrame({"date": complete_dates})
    filled_df = complete_df.join(df, on="date", how="left")
    
    # Forward fill and backward fill
    filled_df = filled_df.with_columns([
        pl.col("value").forward_fill().alias("forward_filled"),
        pl.col("value").backward_fill().alias("backward_filled")
    ])
    Python

    Time Series Analysis Pipeline

    flowchart TD
        A[Raw Time Data] --> B[Parse Dates]
        B --> C[Sort by Time]
        C --> D[Handle Missing Periods]
        D --> E[Create Time Features]
        E --> F[Rolling Calculations]
        F --> G[Resampling]
        G --> H[Analysis Ready Data]
    
        subgraph "Time Operations"
            I[dt.year, dt.month]
            J[rolling_mean_by]
            K[group_by_dynamic]
            L[shift, diff]
        end
    
        E --> I
        F --> J
        G --> K
        F --> L

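    End to end, the pipeline above can be expressed as a short chain. The sketch below uses
    a tiny in-memory frame with illustrative column names (date_str, value):

    import polars as pl
    from datetime import date
    
    raw = pl.DataFrame({
        "date_str": ["2023-01-01", "2023-01-02", "2023-01-05"],
        "value": [10.0, 12.0, 9.0]
    })
    
    # Parse dates and sort by time
    parsed = raw.with_columns(
        pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d").alias("date")
    ).sort("date")
    
    # Handle missing periods: join against a complete calendar, then forward fill
    calendar = pl.DataFrame({
        "date": pl.date_range(date(2023, 1, 1), date(2023, 1, 5), "1d", eager=True)
    })
    filled = calendar.join(parsed, on="date", how="left").with_columns(
        pl.col("value").forward_fill()
    )
    
    # Time features, rolling calculations, then resampling
    analysis_ready = (
        filled
        .with_columns([
            pl.col("date").dt.weekday().alias("weekday"),
            pl.col("value").rolling_mean(window_size=3).alias("rolling_3d")
        ])
        .sort("date")
        .group_by_dynamic("date", every="1mo")
        .agg(pl.col("value").mean().alias("monthly_avg"))
    )
    Python
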
    11. Advanced Features

    Lazy Evaluation

    # Lazy DataFrames for query optimization
    lazy_df = pl.scan_csv("large_file.csv")
    
    # Build query without execution
    query = (
        lazy_df
        .filter(pl.col("value") > 100)
        .group_by("category")
        .agg([
            pl.col("value").sum(),
            pl.col("value").count()
        ])
        .sort("value_sum", descending=True)
        .limit(10)
    )
    
    # Show query plan
    print(query.explain())
    
    # Execute query
    result = query.collect()
    
    # Streaming for large datasets
    result = query.collect(streaming=True)
    Python

    Custom Functions and Expressions

    # Custom Python functions with map_elements
    def custom_function(x):
        return x ** 2 + 1
    
    df = df.with_columns([
        pl.col("value").map_elements(custom_function, return_dtype=pl.Float64).alias("custom")
    ])
    
    # Apply function to multiple columns
    def process_row(row):
        return row[0] * row[1] + row[2]
    
    df = df.with_columns([
        pl.struct(["a", "b", "c"]).map_elements(
            lambda x: process_row([x["a"], x["b"], x["c"]]),
            return_dtype=pl.Float64
        ).alias("processed")
    ])
    
    # Custom aggregations: prefer building them from native expressions
    # (much faster than Python callbacks), e.g. a weighted mean
    result = df.group_by("group").agg([
        ((pl.col("value") * pl.col("weight")).sum() / pl.col("weight").sum())
        .alias("weighted_avg")
    ])
    Python

    Working with Lists and Structs

    # List operations
    df = pl.DataFrame({
        "id": [1, 2, 3],
        "numbers": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "names": [["Alice", "Bob"], ["Charlie"], ["Diana", "Eve", "Frank"]]
    })
    
    # List manipulations
    df = df.with_columns([
        pl.col("numbers").list.len().alias("list_length"),
        pl.col("numbers").list.sum().alias("list_sum"),
        pl.col("numbers").list.mean().alias("list_mean"),
        pl.col("numbers").list.get(0).alias("first_item"),
        pl.col("numbers").list.slice(1, 2).alias("slice_1_2"),
        pl.col("names").list.join(", ").alias("names_joined")
    ])
    
    # Explode lists to rows
    exploded = df.select(["id", "numbers"]).explode("numbers")
    
    # Struct operations
    df_struct = pl.DataFrame({
        "id": [1, 2, 3],
        "person": [
            {"name": "Alice", "age": 25},
            {"name": "Bob", "age": 30},
            {"name": "Charlie", "age": 35}
        ]
    })
    
    df_struct = df_struct.with_columns([
        pl.col("person").struct.field("name").alias("name"),
        pl.col("person").struct.field("age").alias("age")
    ])
    Python

    Window Functions

    # Window functions for ranking and statistics
    df = pl.DataFrame({
        "department": ["Sales", "Sales", "Engineering", "Engineering", "Marketing"],
        "employee": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "salary": [50000, 60000, 80000, 90000, 55000]
    })
    
    # Ranking functions
    df = df.with_columns([
        pl.col("salary").rank().over("department").alias("rank_in_dept"),
        pl.col("salary").rank(method="dense").over("department").alias("dense_rank"),
        pl.col("salary").rank(descending=True).over("department").alias("rank_desc")
    ])
    
    # Statistical window functions
    df = df.with_columns([
        pl.col("salary").mean().over("department").alias("dept_avg_salary"),
        pl.col("salary").std().over("department").alias("dept_salary_std"),
        (pl.col("salary") - pl.col("salary").mean().over("department")).alias("salary_diff_from_avg")
    ])
    
    # Cumulative operations within groups
    df = df.sort(["department", "salary"]).with_columns([
        pl.col("salary").cumsum().over("department").alias("cumulative_salary"),
        pl.col("salary").cumcount().over("department").alias("employee_number")
    ])
    Python

    Parallel Processing and Performance

    # Configure thread usage by setting POLARS_MAX_THREADS in the environment
    # *before* importing polars, e.g. POLARS_MAX_THREADS=8
    print(pl.thread_pool_size())  # inspect the size of the active thread pool
    
    # Parallel operations on lazy frames
    lazy_results = []
    for file in file_list:
        lazy_df = (
            pl.scan_csv(file)
            .filter(pl.col("date").is_between(start_date, end_date))
            .group_by("category")
            .agg(pl.col("value").sum())
        )
        lazy_results.append(lazy_df)
    
    # Combine results
    final_result = pl.concat(lazy_results).collect()
    
    # Streaming for memory efficiency
    large_result = (
        pl.scan_csv("very_large_file.csv")
        .filter(pl.col("important_flag") == True)
        .select(["id", "value", "category"])
        .collect(streaming=True)
    )
    Python

    Plugin System and Extensions

    # Using Polars plugins (example with polars-xdt for extended datetime)
    import polars_xdt as xdt
    
    df = df.with_columns([
        pl.col("date").xdt.format_localized("%A, %B %d, %Y", locale="en_US"),
        pl.col("date").xdt.is_weekend(),
        pl.col("date").xdt.workday_count(pl.col("end_date"))
    ])
    
    # Custom namespace example
    @pl.api.register_expr_namespace("business")
    class BusinessOps:
        def __init__(self, expr: pl.Expr):
            self._expr = expr
    
        def workdays_between(self, end_date: pl.Expr) -> pl.Expr:
            # Custom business logic: pair both dates per row, then map over the rows
            return pl.struct([self._expr.alias("start"), end_date.alias("end")]).map_elements(
                lambda row: calculate_workdays(row["start"], row["end"]),
                return_dtype=pl.Int32
            )
    
    # Usage
    df = df.with_columns([
        pl.col("start_date").business.workdays_between(pl.col("end_date"))
    ])
    Python

    12. Performance Optimization

    Query Optimization Strategies

    # Use lazy evaluation for complex queries
    optimized_query = (
        pl.scan_parquet("large_dataset.parquet")
        .filter(pl.col("date").is_between(start_date, end_date))  # Early filtering
        .select(["id", "value", "category"])  # Column selection
        .filter(pl.col("value") > threshold)  # Additional filtering
        .group_by("category")
        .agg(pl.col("value").sum())
        .collect()
    )
    
    # Predicate pushdown example
    lazy_df = (
        pl.scan_csv("data.csv")
        .filter(pl.col("year") == 2023)  # This filter is pushed down to file reading
        .select(["month", "sales"])
        .group_by("month")
        .agg(pl.col("sales").sum())
    )
    Python

    Memory Management

    # Use appropriate data types
    df = df.with_columns([
        pl.col("small_int").cast(pl.Int8),  # Instead of Int64
        pl.col("category").cast(pl.Categorical),  # For repeated strings
        pl.col("flag").cast(pl.Boolean),  # Instead of string
        pl.col("percentage").cast(pl.Float32)  # Instead of Float64 when precision allows
    ])
    
    # Chunking for large operations
    def process_in_chunks(df, chunk_size=10000):
        results = []
        for i in range(0, len(df), chunk_size):
            chunk = df.slice(i, chunk_size)
            processed = chunk.with_columns([
                # Expensive operations here
                pl.col("value").map_elements(expensive_function)
            ])
            results.append(processed)
        return pl.concat(results)
    
    # Memory-efficient file processing
    def process_large_file(filepath):
        return (
            pl.scan_csv(filepath)
            .filter(pl.col("relevant_column").is_not_null())
            .select(["id", "important_col1", "important_col2"])
            .collect(streaming=True)  # Process without loading entire file
        )
    Python

    Parallel Processing Patterns

    # Parallel file processing
    from concurrent.futures import ProcessPoolExecutor
    import multiprocessing as mp
    
    def process_file(filepath):
        return (
            pl.read_csv(filepath)
            .filter(pl.col("status") == "active")
            .group_by("category")
            .agg(pl.col("value").sum())
        )
    
    # Process multiple files in parallel
    file_paths = ["file1.csv", "file2.csv", "file3.csv"]
    with ProcessPoolExecutor(max_workers=mp.cpu_count()) as executor:
        results = list(executor.map(process_file, file_paths))
    
    combined = pl.concat(results)
    
    # Parallel groupby operations
    large_df = pl.read_parquet("large_dataset.parquet")
    
    # This automatically uses multiple threads
    result = (
        large_df
        .group_by("large_grouping_column")
        .agg([
            pl.col("numeric_col").sum(),
            pl.col("numeric_col").mean(),
            pl.col("categorical_col").n_unique()
        ])
    )
    Python

    Efficient Data Loading

    # Optimized CSV reading
    df = pl.read_csv(
        "large_file.csv",
        use_pyarrow=True,  # Use PyArrow engine for better performance
        rechunk=True,      # Optimize memory layout
        low_memory=False,  # Use more memory for speed
        ignore_errors=True # Skip problematic rows
    )
    
    # Lazy loading with projections
    df = (
        pl.scan_csv("huge_file.csv")
        .select(["col1", "col2", "col3"])  # Only read needed columns
        .filter(pl.col("date") > cutoff_date)  # Filter early
        .collect()
    )
    
    # Parquet optimization
    # Save with optimal settings
    df.write_parquet(
        "optimized.parquet",
        compression="snappy",  # Good balance of speed and size
        statistics=True,       # Enable predicate pushdown
        row_group_size=100000  # Optimize for your use case
    )
    Python

    Benchmarking and Profiling

    import time
    from functools import wraps
    
    def timing_decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            end = time.time()
            print(f"{func.__name__} took {end - start:.4f} seconds")
            return result
        return wrapper
    
    @timing_decorator
    def slow_operation(df):
        return df.group_by("category").agg(pl.col("value").sum())
    
    @timing_decorator  
    def optimized_operation(df):
        return (
            df.lazy()
            .group_by("category")
            .agg(pl.col("value").sum())
            .collect()
        )
    
    # Memory profiling
    def profile_memory(operation, df):
        import psutil
        process = psutil.Process()
    
        memory_before = process.memory_info().rss / 1024 / 1024  # MB
        result = operation(df)
        memory_after = process.memory_info().rss / 1024 / 1024  # MB
    
        print(f"Memory used: {memory_after - memory_before:.2f} MB")
        return result
    Python

    Performance Optimization Checklist

    graph TD
        A[Performance Issue] --> B{Identify Bottleneck}
    
        B --> C[I/O Bound]
        B --> D[CPU Bound]
        B --> E[Memory Bound]
    
        C --> F[Use lazy evaluation]
        C --> G[Optimize file formats]
        C --> H[Column selection early]
    
        D --> I[Use proper data types]
        D --> J[Vectorized operations]
        D --> K[Parallel processing]
    
        E --> L[Streaming mode]
        E --> M[Chunking]
        E --> N[Reduce data size]
    
        F --> O[Benchmark Results]
        G --> O
        H --> O
        I --> O
        J --> O
        K --> O
        L --> O
        M --> O
        N --> O

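    Several of the remedies above usually get combined in one query. A hedged sketch for a
    hypothetical "events.parquet" file (the path and column names are placeholders):

    import polars as pl
    
    result = (
        pl.scan_parquet("events.parquet")                       # lazy I/O
        .select(["event_id", "category", "value"])              # column selection early
        .filter(pl.col("value") > 0)                            # predicate pushdown
        .with_columns(pl.col("category").cast(pl.Categorical))  # proper data types
        .group_by("category")                                   # multi-threaded aggregation
        .agg(pl.col("value").sum())
        .collect(streaming=True)                                # streaming mode for memory
    )
    Python
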
    13. Real-World Examples

    Example 1: Sales Data Analysis

    # Load and prepare sales data
    sales_df = pl.read_csv("sales_data.csv").with_columns([
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
        pl.col("amount").cast(pl.Float64),
        pl.col("category").cast(pl.Categorical)
    ])
    
    # Monthly sales analysis
    monthly_sales = (
        sales_df
        .group_by_dynamic("date", every="1mo", label="left")
        .agg([
            pl.col("amount").sum().alias("total_sales"),
            pl.col("amount").count().alias("transaction_count"),
            pl.col("amount").mean().alias("avg_transaction"),
            pl.col("customer_id").n_unique().alias("unique_customers")
        ])
        .with_columns([
            pl.col("total_sales").pct_change().alias("sales_growth"),
            (pl.col("total_sales") / pl.col("transaction_count")).alias("avg_order_value")
        ])
    )
    
    # Top performing categories
    category_performance = (
        sales_df
        .group_by("category")
        .agg([
            pl.col("amount").sum().alias("total_revenue"),
            pl.col("amount").count().alias("transactions"),
            pl.col("customer_id").n_unique().alias("customers")
        ])
        .with_columns([
            (pl.col("total_revenue") / pl.col("total_revenue").sum() * 100).alias("revenue_share"),
            pl.col("total_revenue").rank(method="ordinal", descending=True).alias("rank")
        ])
        .sort("total_revenue", descending=True)
    )
    
    # Customer segmentation
    customer_segments = (
        sales_df
        .group_by("customer_id")
        .agg([
            pl.col("amount").sum().alias("total_spent"),
            pl.col("amount").count().alias("purchase_frequency"),
            pl.col("date").max().alias("last_purchase"),
            pl.col("date").min().alias("first_purchase")
        ])
        .with_columns([
            (pl.col("last_purchase") - pl.col("first_purchase")).dt.days().alias("customer_lifetime_days"),
            pl.when(pl.col("total_spent") > 1000).then("High Value")
            .when(pl.col("total_spent") > 500).then("Medium Value")
            .otherwise("Low Value").alias("value_segment"),
    
            pl.when(pl.col("purchase_frequency") > 10).then("Frequent")
            .when(pl.col("purchase_frequency") > 3).then("Regular")
            .otherwise("Occasional").alias("frequency_segment")
        ])
    )
    Python

    Example 2: Log File Analysis

    # Parse web server logs
    log_pattern = r'(\S+) - - \[([^\]]+)\] "(\S+) ([^\s]+) ([^"]+)" (\d+) (\d+)'
    
    logs_df = (
        pl.read_csv("access.log", separator="\t", has_header=False, new_columns=["raw_log"])
        .with_columns([
            # Pull each capture group out of the combined log pattern
            pl.col("raw_log").str.extract(log_pattern, 1).alias("ip"),
            pl.col("raw_log").str.extract(log_pattern, 2)
                .str.strptime(pl.Datetime, "%d/%b/%Y:%H:%M:%S %z").alias("timestamp"),
            pl.col("raw_log").str.extract(log_pattern, 3).alias("method"),
            pl.col("raw_log").str.extract(log_pattern, 4).alias("url"),
            pl.col("raw_log").str.extract(log_pattern, 6).cast(pl.Int32).alias("status_code"),
            pl.col("raw_log").str.extract(log_pattern, 7).cast(pl.Int32).alias("response_size")
        ])
        .select(["ip", "timestamp", "method", "url", "status_code", "response_size"])
    )
    
    # Traffic analysis
    hourly_traffic = (
        logs_df
        .group_by_dynamic("timestamp", every="1h")
        .agg([
            pl.col("ip").count().alias("requests"),
            pl.col("ip").n_unique().alias("unique_visitors"),
            pl.col("response_size").sum().alias("total_bytes"),
            pl.col("status_code").filter(pl.col("status_code") >= 400).count().alias("error_count")
        ])
        .with_columns([
            (pl.col("error_count") / pl.col("requests") * 100).alias("error_rate"),
            (pl.col("total_bytes") / 1024 / 1024).alias("total_mb")
        ])
    )
    
    # Top endpoints
    top_endpoints = (
        logs_df
        .group_by("url")
        .agg([
            pl.col("ip").count().alias("hits"),
            pl.col("ip").n_unique().alias("unique_visitors"),
            pl.col("response_size").mean().alias("avg_response_size")
        ])
        .sort("hits", descending=True)
        .head(20)
    )
    
    # Bot detection
    potential_bots = (
        logs_df
        .group_by("ip")
        .agg([
            pl.col("url").count().alias("request_count"),
            pl.col("url").n_unique().alias("unique_urls"),
            pl.col("timestamp").max().alias("last_seen"),
            pl.col("timestamp").min().alias("first_seen")
        ])
        .with_columns([
            (pl.col("last_seen") - pl.col("first_seen")).dt.seconds().alias("session_duration"),
            (pl.col("request_count") / pl.col("unique_urls")).alias("requests_per_url")
        ])
        .filter(
            (pl.col("request_count") > 100) & 
            (pl.col("requests_per_url") > 10) &
            (pl.col("session_duration") < 3600)  # Less than 1 hour
        )
    )
    Python

    Example 3: Financial Data Processing

    # Stock price analysis
    stock_df = pl.read_csv("stock_prices.csv").with_columns([
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
        pl.col("open", "high", "low", "close", "volume").cast(pl.Float64)
    ])
    
    # Technical indicators
    stock_analysis = (
        stock_df
        .sort("date")
        .with_columns([
            # Simple moving averages
            pl.col("close").rolling_mean(window_size=20).alias("sma_20"),
            pl.col("close").rolling_mean(window_size=50).alias("sma_50"),
    
            # Exponential moving average (approximation)
            pl.col("close").ewm_mean(span=12).alias("ema_12"),
    
            # Price changes
            pl.col("close").pct_change().alias("daily_return"),
            (pl.col("close") / pl.col("close").shift(252) - 1).alias("annual_return"),
    
            # Volatility (20-day rolling standard deviation of returns)
            pl.col("close").pct_change().rolling_std(window_size=20).alias("volatility_20d"),
    
            # Support and resistance levels
            pl.col("low").rolling_min(window_size=20).alias("support_20d"),
            pl.col("high").rolling_max(window_size=20).alias("resistance_20d")
        ])
    )
    
    # Risk metrics
    risk_metrics = (
        stock_analysis
        .select([
            pl.col("daily_return").mean().alias("avg_daily_return"),
            pl.col("daily_return").std().alias("daily_volatility"),
            pl.col("annual_return").mean().alias("avg_annual_return"),
            (pl.col("daily_return").mean() / pl.col("daily_return").std()).alias("sharpe_ratio_daily"),
            pl.col("daily_return").filter(pl.col("daily_return") < 0).count().alias("down_days"),
            pl.col("daily_return").count().alias("total_days")
        ])
        .with_columns([
            (pl.col("down_days") / pl.col("total_days")).alias("probability_of_loss"),
            (pl.col("avg_daily_return") * 252).alias("annualized_return"),
            (pl.col("daily_volatility") * (252 ** 0.5)).alias("annualized_volatility")
        ])
    )
    
    # Portfolio optimization (equal weight example)
    portfolio_stocks = ["AAPL", "GOOGL", "MSFT", "AMZN"]
    portfolio_analysis = (
        stock_df
        .filter(pl.col("symbol").is_in(portfolio_stocks))
        .group_by("date")
        .agg([
            pl.col("close").mean().alias("portfolio_close"),
            pl.col("volume").sum().alias("total_volume")
        ])
        .sort("date")
        .with_columns([
            pl.col("portfolio_close").pct_change().alias("portfolio_return"),
            pl.col("portfolio_close").rolling_std(window_size=252).alias("portfolio_volatility")
        ])
    )
    Python

    Example 4: IoT Sensor Data Processing

    # Process IoT sensor data
    sensor_df = (
        pl.read_csv("sensor_data.csv")
        .with_columns([
            pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"),
            pl.col("temperature", "humidity", "pressure").cast(pl.Float64),
            pl.col("sensor_id").cast(pl.Categorical)
        ])
    )
    
    # Anomaly detection using statistical methods
    anomaly_detection = (
        sensor_df
        .sort(["sensor_id", "timestamp"])
        .with_columns([
            # Calculate rolling statistics
            pl.col("temperature").rolling_mean(window_size=100).over("sensor_id").alias("temp_mean"),
            pl.col("temperature").rolling_std(window_size=100).over("sensor_id").alias("temp_std"),
            pl.col("humidity").rolling_mean(window_size=100).over("sensor_id").alias("humidity_mean"),
            pl.col("humidity").rolling_std(window_size=100).over("sensor_id").alias("humidity_std")
        ])
        .with_columns([
            # Z-score for anomaly detection
            ((pl.col("temperature") - pl.col("temp_mean")) / pl.col("temp_std")).abs().alias("temp_zscore"),
            ((pl.col("humidity") - pl.col("humidity_mean")) / pl.col("humidity_std")).abs().alias("humidity_zscore"),
    
            # Flag anomalies (beyond 3 standard deviations)
            (pl.col("temp_zscore") > 3).alias("temp_anomaly"),
            (pl.col("humidity_zscore") > 3).alias("humidity_anomaly")
        ])
        .with_columns([
            (pl.col("temp_anomaly") | pl.col("humidity_anomaly")).alias("any_anomaly")
        ])
    )
    
    # Sensor health monitoring
    sensor_health = (
        sensor_df
        .group_by_dynamic("timestamp", every="1h", group_by="sensor_id")
        .agg([
            pl.col("temperature").count().alias("readings_count"),
            pl.col("temperature").mean().alias("avg_temp"),
            pl.col("temperature").std().alias("temp_variance"),
            pl.col("humidity").mean().alias("avg_humidity"),
            pl.col("pressure").mean().alias("avg_pressure")
        ])
        .with_columns([
            # Expected readings per hour (assuming 1 reading per minute)
            (pl.col("readings_count") / 60).alias("data_completeness"),
    
            # Flag sensors with issues
            pl.when(pl.col("readings_count") < 50).then("Low Data")
            .when(pl.col("temp_variance") > 10).then("High Variance")
            .when(pl.col("data_completeness") < 0.8).then("Missing Data")
            .otherwise("Normal").alias("health_status")
        ])
    )
    
    # Data quality report
    data_quality_report = (
        sensor_df
        .group_by("sensor_id")
        .agg([
            pl.col("timestamp").count().alias("total_readings"),
            pl.col("temperature").null_count().alias("temp_nulls"),
            pl.col("humidity").null_count().alias("humidity_nulls"),
            pl.col("pressure").null_count().alias("pressure_nulls"),
            pl.col("timestamp").min().alias("first_reading"),
            pl.col("timestamp").max().alias("last_reading")
        ])
        .with_columns([
            (pl.col("last_reading") - pl.col("first_reading")).dt.days().alias("operational_days"),
            (pl.col("temp_nulls") / pl.col("total_readings") * 100).alias("temp_null_pct"),
            (pl.col("humidity_nulls") / pl.col("total_readings") * 100).alias("humidity_null_pct"),
            (pl.col("pressure_nulls") / pl.col("total_readings") * 100).alias("pressure_null_pct")
        ])
    )
    Python

    14. Migration from Pandas

    Syntax Comparison

    # Pandas vs Polars equivalent operations
    
    # Data Loading
    # Pandas
    df_pandas = pd.read_csv("data.csv")
    
    # Polars
    df_polars = pl.read_csv("data.csv")
    
    # Column Selection
    # Pandas
    df_pandas[["col1", "col2"]]
    df_pandas.loc[:, "col1":"col3"]
    
    # Polars
    df_polars.select(["col1", "col2"])
    df_polars.select(pl.col("col1", "col2", "col3"))
    
    # Filtering
    # Pandas
    df_pandas[df_pandas["age"] > 25]
    df_pandas.query("age > 25 and city == 'NYC'")
    
    # Polars  
    df_polars.filter(pl.col("age") > 25)
    df_polars.filter((pl.col("age") > 25) & (pl.col("city") == "NYC"))
    
    # GroupBy and Aggregation
    # Pandas
    df_pandas.groupby("category").agg({"value": ["sum", "mean", "count"]})
    
    # Polars
    df_polars.group_by("category").agg([
        pl.col("value").sum(),
        pl.col("value").mean(),
        pl.col("value").count()
    ])
    
    # Adding Columns
    # Pandas
    df_pandas["new_col"] = df_pandas["col1"] * df_pandas["col2"]
    df_pandas.assign(new_col=lambda x: x["col1"] * x["col2"])
    
    # Polars
    df_polars.with_columns([
        (pl.col("col1") * pl.col("col2")).alias("new_col")
    ])
    
    # Sorting
    # Pandas
    df_pandas.sort_values(["col1", "col2"], ascending=[True, False])
    
    # Polars
    df_polars.sort(["col1", "col2"], descending=[False, True])
    
    # Joins
    # Pandas
    df1.merge(df2, on="key", how="left")
    
    # Polars
    df1.join(df2, on="key", how="left")
    Python

    Migration Strategy

    flowchart TD
        A[Pandas Code Analysis] --> B[Identify Operations]
        B --> C[Map to Polars Equivalents]
        C --> D[Test Performance]
        D --> E[Optimize with Lazy Operations]
        E --> F[Production Deployment]
    
        subgraph "Common Challenges"
            G[Index Operations]
            H[Inplace Operations]
            I[Method Chaining Differences]
            J[API Differences]
        end
    
        B --> G
        B --> H
        B --> I
        B --> J

    Common Migration Patterns

    # Pattern 1: Index-based operations
    # Pandas (avoid in Polars)
    df_pandas.reset_index()
    df_pandas.set_index("column")
    
    # Polars equivalent
    df_polars.with_row_index("index")  # Add explicit row numbers
    df_polars.sort("column")  # Use sorting instead of indexing
    
    # Pattern 2: Inplace operations
    # Pandas
    df_pandas.dropna(inplace=True)
    df_pandas.fillna(0, inplace=True)
    
    # Polars (immutable)
    df_polars = df_polars.drop_nulls()
    df_polars = df_polars.fill_null(0)
    
    # Pattern 3: Apply functions
    # Pandas
    df_pandas["new_col"] = df_pandas["col"].apply(lambda x: x ** 2)
    
    # Polars
    df_polars = df_polars.with_columns([
        pl.col("col").map_elements(lambda x: x ** 2, return_dtype=pl.Float64).alias("new_col")
    ])
    
    # Better Polars (vectorized)
    df_polars = df_polars.with_columns([
        pl.col("col").pow(2).alias("new_col")
    ])
    Python

    Performance Comparison

    import time
    import pandas as pd
    import polars as pl
    import numpy as np
    
    # Generate test data
    n_rows = 1_000_000
    data = {
        "id": range(n_rows),
        "value": np.random.randn(n_rows),
        "category": np.random.choice(["A", "B", "C", "D"], n_rows),
        "date": pd.date_range("2020-01-01", periods=n_rows, freq="1min")
    }
    
    df_pandas = pd.DataFrame(data)
    df_polars = pl.DataFrame(data)
    
    # Benchmark groupby operations
    def benchmark_groupby():
        # Pandas
        start = time.time()
        result_pandas = df_pandas.groupby("category").agg({
            "value": ["sum", "mean", "std", "count"]
        })
        pandas_time = time.time() - start
    
        # Polars
        start = time.time()
        result_polars = df_polars.group_by("category").agg([
            pl.col("value").sum(),
            pl.col("value").mean(),
            pl.col("value").std(),
            pl.col("value").count()
        ])
        polars_time = time.time() - start
    
        print(f"Pandas: {pandas_time:.4f}s")
        print(f"Polars: {polars_time:.4f}s")
        print(f"Speedup: {pandas_time / polars_time:.2f}x")
    
    benchmark_groupby()
    Python

    15. Best Practices

    Code Organization

    # 1. Use lazy evaluation for complex pipelines
    def process_sales_data(file_path: str) -> pl.DataFrame:
        return (
            pl.scan_csv(file_path)
            .filter(pl.col("date").is_between("2023-01-01", "2023-12-31"))
            .with_columns([
                pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
                pl.col("amount").cast(pl.Float64)
            ])
            .group_by(["product", "region"])
            .agg([
                pl.col("amount").sum().alias("total_sales"),
                pl.col("amount").count().alias("transactions")
            ])
            .collect()
        )
    
    # 2. Create reusable expression functions
    def create_age_groups(age_column: str) -> pl.Expr:
        return (
            pl.when(pl.col(age_column) < 18).then(pl.lit("Under 18"))
            .when(pl.col(age_column) < 65).then(pl.lit("Adult"))
            .otherwise(pl.lit("Senior"))
            .alias("age_group")
        )
    
    # 3. Use configuration for consistent formatting
    def setup_polars_config():
        pl.Config.set_tbl_rows(20)
        pl.Config.set_tbl_cols(8)
        pl.Config.set_fmt_str_lengths(50)
        # Thread count is controlled by the POLARS_MAX_THREADS env var (set before importing polars)
    Python

    Error Handling and Validation

    def safe_data_processing(df: pl.DataFrame) -> pl.DataFrame:
        try:
            # Validate input
            required_columns = ["id", "value", "date"]
            missing_cols = set(required_columns) - set(df.columns)
            if missing_cols:
                raise ValueError(f"Missing columns: {missing_cols}")
    
            # Safe type casting
            result = df.with_columns([
                pl.col("value").cast(pl.Float64, strict=False),
                pl.col("date").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
            ])
    
            # Check for processing errors
            null_dates = result.filter(pl.col("date").is_null()).height
            if null_dates > 0:
                print(f"Warning: {null_dates} rows with invalid dates")
    
            return result
    
        except Exception as e:
            print(f"Error processing data: {e}")
            raise
    
    # Schema validation
    def validate_schema(df: pl.DataFrame, expected_schema: dict) -> bool:
        """Validate DataFrame schema against expected types"""
        for col, expected_type in expected_schema.items():
            if col not in df.columns:
                raise ValueError(f"Missing column: {col}")
            if df[col].dtype != expected_type:
                raise TypeError(f"Column {col} has type {df[col].dtype}, expected {expected_type}")
        return True
    
    # Usage
    expected_schema = {
        "id": pl.Int64,
        "name": pl.Utf8,
        "value": pl.Float64
    }
    validate_schema(df, expected_schema)
    Python

    Memory and Performance Best Practices

    # 1. Use streaming for large datasets
    def process_large_dataset(file_path: str):
        return (
            pl.scan_parquet(file_path)
            .filter(pl.col("active") == True)
            .select(["id", "value", "category"])
            .group_by("category")
            .agg(pl.col("value").sum())
            .collect(streaming=True)  # Process in chunks
        )
    
    # 2. Optimize data types
    def optimize_dtypes(df: pl.DataFrame) -> pl.DataFrame:
        """Optimize data types for memory efficiency"""
        return df.with_columns([
            # Use smaller integer types when possible
            pl.col("small_numbers").cast(pl.Int16),
    
            # Use categorical for repeated strings
            pl.col("categories").cast(pl.Categorical),
    
            # Use appropriate float precision
            pl.col("percentages").cast(pl.Float32)
        ])
    
    # 3. Efficient filtering order
    def efficient_filtering(df: pl.DataFrame):
        """Apply filters in order of selectivity"""
        return (
            df
            .filter(pl.col("status") == "active")  # Most selective first
            .filter(pl.col("date") > "2023-01-01")  # Then date range
            .filter(pl.col("value") > 0)  # Least selective last
        )
    
    # 4. Batch operations
    def batch_process_files(file_paths: list, batch_size: int = 10):
        """Process files in batches to manage memory"""
        results = []
        for i in range(0, len(file_paths), batch_size):
            batch = file_paths[i:i + batch_size]
            batch_dfs = [pl.read_csv(f) for f in batch]
            combined = pl.concat(batch_dfs)
            processed = combined.group_by("category").agg(pl.col("value").sum())
            results.append(processed)
    
        return pl.concat(results)
    Python

    Testing Strategies

    import pytest
    
    def test_data_transformation():
        """Test data transformation functions"""
        # Sample data
        test_df = pl.DataFrame({
            "age": [15, 25, 70],
            "name": ["Alice", "Bob", "Charlie"]
        })
    
        # Apply transformation
        result = test_df.with_columns([create_age_groups("age")])
    
        # Assertions
        assert "age_group" in result.columns
        assert result["age_group"].to_list() == ["Under 18", "Adult", "Senior"]
    
    def test_aggregation_consistency():
        """Test that aggregations produce consistent results"""
        df = pl.DataFrame({
            "group": ["A", "A", "B", "B"],
            "value": [1, 2, 3, 4]
        })
    
        result = df.group_by("group").agg(pl.col("value").sum())
        expected = pl.DataFrame({
            "group": ["A", "B"],
            "value": [3, 7]
        })
    
        assert result.sort("group").equals(expected.sort("group"))
    
    # Property-based testing example
    from hypothesis import given, strategies as st
    import hypothesis.extra.pandas as st_pd
    
    @given(st_pd.data_frames([
        st_pd.column("value", elements=st.floats(0, 1000)),
        st_pd.column("category", elements=st.text(min_size=1, max_size=5))
    ]))
    def test_groupby_sum_is_positive(df_pandas):
        """Property test: sum of positive values should be positive"""
        df = pl.from_pandas(df_pandas)
        if df.height > 0:
            result = df.group_by("category").agg(pl.col("value").sum())
            assert all(result["value"] >= 0)
    Python

    Documentation and Type Hints

    from typing import List, Dict, Optional, Union
    import polars as pl
    
    def analyze_sales_trends(
        df: pl.DataFrame,
        date_column: str = "date",
        value_column: str = "sales",
        groupby_columns: Optional[List[str]] = None,
        rolling_window: int = 30
    ) -> Dict[str, pl.DataFrame]:
        """
        Analyze sales trends with rolling averages and growth rates.
    
        Args:
            df: Input DataFrame with sales data
            date_column: Name of the date column
            value_column: Name of the sales value column
            groupby_columns: Optional columns to group by
            rolling_window: Window size for rolling average (days)
    
        Returns:
            Dictionary containing trend analysis results:
            - 'trends': DataFrame with rolling averages and growth rates
            - 'summary': DataFrame with summary statistics
    
        Example:
            >>> df = pl.DataFrame({
            ...     "date": ["2023-01-01", "2023-01-02"],
            ...     "sales": [100, 150],
            ...     "region": ["North", "South"]
            ... })
            >>> results = analyze_sales_trends(df, groupby_columns=["region"])
        """
        # Input validation
        required_columns = [date_column, value_column]
        missing_columns = set(required_columns) - set(df.columns)
        if missing_columns:
            raise ValueError(f"Missing required columns: {missing_columns}")
    
        # Main analysis
        if groupby_columns:
            trends = (
                df
                .sort([*groupby_columns, date_column])
                .with_columns([
                    pl.col(value_column)
                    .rolling_mean(rolling_window)
                    .over(groupby_columns)
                    .alias(f"rolling_avg_{rolling_window}d"),
    
                    pl.col(value_column)
                    .pct_change()
                    .over(groupby_columns)
                    .alias("growth_rate")
                ])
            )
    
            summary = (
                df
                .group_by(groupby_columns)
                .agg([
                    pl.col(value_column).sum().alias("total_sales"),
                    pl.col(value_column).mean().alias("avg_sales"),
                    pl.col(value_column).std().alias("sales_volatility")
                ])
            )
        else:
            trends = df.sort(date_column).with_columns([
                pl.col(value_column).rolling_mean(rolling_window).alias(f"rolling_avg_{rolling_window}d"),
                pl.col(value_column).pct_change().alias("growth_rate")
            ])
    
            summary = df.select([
                pl.col(value_column).sum().alias("total_sales"),
                pl.col(value_column).mean().alias("avg_sales"),
                pl.col(value_column).std().alias("sales_volatility")
            ])
    
        return {
            "trends": trends,
            "summary": summary
        }
    Python

    Production Deployment Best Practices

    import logging
    import os
    from pathlib import Path
    from typing import Optional
    
    import polars as pl
    
    # Configure logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    class DataProcessor:
        """Production-ready data processor with error handling and logging"""
    
        def __init__(self, config: dict):
            self.config = config
            self.setup_polars()
    
        def setup_polars(self):
            """Configure Polars for production use"""
            # Thread count comes from POLARS_MAX_THREADS (must be set before polars is imported)
            pl.Config.set_streaming_chunk_size(self.config.get("chunk_size", 10000))
    
        def process_file(self, file_path: Path) -> Optional[pl.DataFrame]:
            """Process a single file with error handling"""
            try:
                logger.info(f"Processing file: {file_path}")
    
                # Validate file exists
                if not file_path.exists():
                    logger.error(f"File not found: {file_path}")
                    return None
    
                # Process with lazy evaluation
                result = (
                    pl.scan_csv(file_path)
                    .filter(pl.col("date").is_not_null())
                    .filter(pl.col("value") > 0)
                    .with_columns([
                        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
                        pl.col("value").cast(pl.Float64)
                    ])
                    .collect(streaming=True)
                )
    
                logger.info(f"Successfully processed {result.height} rows")
                return result
    
            except Exception as e:
                logger.error(f"Error processing {file_path}: {e}")
                return None
    
        def save_results(self, df: pl.DataFrame, output_path: Path):
            """Save results with appropriate format"""
            try:
                output_path.parent.mkdir(parents=True, exist_ok=True)
    
                if output_path.suffix == ".parquet":
                    df.write_parquet(output_path, compression="snappy")
                elif output_path.suffix == ".csv":
                    df.write_csv(output_path)
                else:
                    raise ValueError(f"Unsupported output format: {output_path.suffix}")
    
                logger.info(f"Results saved to: {output_path}")
    
            except Exception as e:
                logger.error(f"Error saving results: {e}")
                raise
    
    # Configuration management
    def load_config() -> dict:
        """Load configuration from environment or config file"""
        return {
            "threads": int(os.getenv("POLARS_THREADS", 4)),
            "chunk_size": int(os.getenv("POLARS_CHUNK_SIZE", 10000)),
            "input_dir": Path(os.getenv("INPUT_DIR", "./data")),
            "output_dir": Path(os.getenv("OUTPUT_DIR", "./output"))
        }
    Python

    Best Practices Summary

    mindmap
      root((Polars Best Practices))
        Performance
          Use lazy evaluation
          Optimize data types
          Stream large datasets
          Filter early and efficiently
        Code Quality
          Type hints
          Error handling
          Comprehensive testing
          Clear documentation
        Production
          Logging and monitoring
          Configuration management
          Resource management
          Graceful error handling
        Migration
          Incremental approach
          Performance benchmarking
          Feature parity validation
          Team training

    Final Recommendations

    1. Start Simple: Begin with basic operations and gradually adopt advanced features
    2. Leverage Lazy Evaluation: Use scan_* functions and .lazy() for better performance (see the short sketch after this list)
    3. Optimize Data Types: Choose appropriate types to reduce memory usage
    4. Test Thoroughly: Write comprehensive tests for data transformations
    5. Monitor Performance: Benchmark critical operations and optimize bottlenecks
    6. Handle Errors Gracefully: Implement proper error handling and validation
    7. Document Well: Provide clear documentation for complex operations
    8. Stay Updated: Keep up with Polars releases and new features

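    As a parting example of recommendation 2, here is a minimal lazy pipeline over a
    hypothetical "sales.csv" (path and column names are placeholders): build the query
    lazily so the optimizer can prune columns and push filters down, then collect once
    at the end.

    import polars as pl
    
    top_regions = (
        pl.scan_csv("sales.csv")
        .filter(pl.col("amount") > 0)
        .group_by("region")
        .agg(pl.col("amount").sum().alias("total"))
        .sort("total", descending=True)
        .limit(5)
        .collect()
    )
    Python
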
    Appendix A: Quick Reference and Cheat Sheets

    Essential Imports and Setup

    import polars as pl
    import numpy as np
    from datetime import datetime, date, timedelta
    
    # Configuration
    pl.Config.set_tbl_rows(20)
    pl.Config.set_tbl_cols(8)
    pl.Config.set_fmt_str_lengths(50)
    Python

    DataFrame Creation Cheat Sheet

    # From dictionary (most common)
    df = pl.DataFrame({
        "name": ["Alice", "Bob"],
        "age": [25, 30],
        "salary": [50000, 60000]
    })
    
    # From lists with schema
    df = pl.DataFrame([
        ["Alice", 25], ["Bob", 30]
    ], schema=["name", "age"])
    
    # Empty with schema
    df = pl.DataFrame(schema={
        "id": pl.Int64,
        "name": pl.Utf8,
        "value": pl.Float64
    })
    Python

    File I/O Quick Reference

    # Reading files
    df = pl.read_csv("file.csv")
    df = pl.read_parquet("file.parquet")
    df = pl.read_json("file.json")
    
    # Lazy reading (for large files)
    df = pl.scan_csv("large_file.csv").collect()
    
    # Writing files
    df.write_csv("output.csv")
    df.write_parquet("output.parquet")
    df.write_json("output.json")
    Python

    Selection and Filtering Cheat Sheet

    # Column selection
    df.select("name")                    # Single column
    df.select(["name", "age"])           # Multiple columns
    df.select(pl.col("^.*_id$"))         # Regex pattern
    df.select(pl.exclude("id"))          # All except id
    
    # Row filtering
    df.filter(pl.col("age") > 25)                          # Simple condition
    df.filter((pl.col("age") > 20) & (pl.col("age") < 60)) # Multiple conditions
    df.filter(pl.col("name").str.contains("Alice"))        # String operations
    df.filter(pl.col("value").is_null())                   # Null values
    Python

    Transformation Cheat Sheet

    # Adding/modifying columns
    df.with_columns([
        pl.col("age").alias("years"),                    # Rename
        (pl.col("salary") * 1.1).alias("new_salary"),   # Calculate
        pl.lit("constant").alias("category"),            # Add constant
        pl.col("name").str.len().alias("name_length")   # String operation
    ])
    
    # Conditional logic
    df.with_columns([
        pl.when(pl.col("age") > 30)
        .then(pl.lit("Senior"))
        .otherwise(pl.lit("Junior"))
        .alias("category")
    ])
    Python

    Aggregation Quick Reference

    # Basic aggregations
    df.select([
        pl.col("value").sum(),
        pl.col("value").mean(),
        pl.col("value").count(),
        pl.col("value").min(),
        pl.col("value").max(),
        pl.col("value").std()
    ])
    
    # Group by operations
    df.group_by("category").agg([
        pl.col("value").sum().alias("total"),
        pl.col("value").mean().alias("average"),
        pl.col("id").count().alias("count")
    ])
    Python

    Join Operations Quick Reference

    # Join types
    df1.join(df2, on="key", how="inner")    # Inner join
    df1.join(df2, on="key", how="left")     # Left join
    df1.join(df2, on="key", how="outer")    # Outer join
    df1.join(df2, on="key", how="cross")    # Cross join
    
    # Different column names
    df1.join(df2, left_on="id", right_on="user_id", how="left")
    Python

    Data Types Quick Reference

    # Numeric types
    pl.Int8, pl.Int16, pl.Int32, pl.Int64      # Signed integers
    pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64  # Unsigned integers
    pl.Float32, pl.Float64                      # Floating point
    
    # Text and categorical
    pl.Utf8                                     # String
    pl.Categorical                              # Categories
    pl.Binary                                   # Binary data
    
    # Temporal
    pl.Date                                     # Date only
    pl.Datetime                                 # Date and time
    pl.Time                                     # Time only
    pl.Duration                                 # Time intervals
    
    # Complex types
    pl.List(pl.Int32)                          # List of integers
    pl.Struct([pl.Field("name", pl.Utf8)])     # Structured data
    
    # Special
    pl.Boolean                                  # True/False
    pl.Null                                     # Null type
    Python

    Common Expressions

    # Column operations
    pl.col("name")                             # Column reference
    pl.all()                                   # All columns
    pl.exclude("id")                           # All except specified
    pl.lit(42)                                 # Literal value
    
    # String operations
    pl.col("text").str.len()                   # String length
    pl.col("text").str.contains("pattern")     # Contains substring
    pl.col("text").str.to_uppercase()          # Uppercase
    pl.col("text").str.replace("old", "new")   # Replace text
    
    # Date operations
    pl.col("date").dt.year()                   # Extract year
    pl.col("date").dt.month()                  # Extract month
    pl.col("date").dt.day()                    # Extract day
    pl.col("date").dt.weekday()                # Day of week
    
    # Mathematical operations
    pl.col("value").abs()                      # Absolute value
    pl.col("value").sqrt()                     # Square root
    pl.col("value").round(2)                   # Round to 2 decimals
    pl.col("value").rank()                     # Rank values
    Python

    Performance Tips Summary

    # 1. Use lazy evaluation for large datasets
    df = pl.scan_csv("large_file.csv").filter(...).collect()
    
    # 2. Use appropriate data types
    df = df.with_columns([
        pl.col("small_numbers").cast(pl.Int16),
        pl.col("categories").cast(pl.Categorical)
    ])
    
    # 3. Filter early in lazy queries
    df = (
        pl.scan_csv("data.csv")
        .filter(pl.col("date") > "2023-01-01")  # Early filtering
        .select(["id", "value"])                # Column pruning
        .collect()
    )
    
    # 4. Use streaming for very large datasets
    df = lazy_query.collect(streaming=True)
    Python

    Appendix B: Troubleshooting Common Issues

    Installation Issues

    Problem: ImportError: No module named 'polars'

    # Solution: Check Python environment
    python -m pip install polars
    # or
    conda install -c conda-forge polars
    Bash

    Problem: Version conflicts

    # Solution: Use virtual environment
    python -m venv polars_env
    source polars_env/bin/activate  # Linux/Mac
    # or
    polars_env\Scripts\activate     # Windows
    pip install polars[all]
    Bash

    Data Type Issues

    Problem: Unexpected data types after loading

    # Check data types
    print(df.dtypes)
    
    # Fix with explicit casting
    df = df.with_columns([
        pl.col("number_col").cast(pl.Int64),
        pl.col("date_col").str.strptime(pl.Date, "%Y-%m-%d")
    ])
    Python

    Problem: Memory usage too high

    # Use more efficient data types
    df = df.with_columns([
        pl.col("small_int").cast(pl.Int16),  # Instead of Int64
        pl.col("category").cast(pl.Categorical),  # Instead of Utf8
        pl.col("flag").cast(pl.Boolean)  # Instead of string
    ])
    Python

    Performance Issues

    Problem: Slow operations on large datasets

    # Solution 1: Use lazy evaluation
    result = (
        pl.scan_csv("large_file.csv")
        .filter(pl.col("important") == True)
        .select(["id", "value"])
        .collect()
    )
    
    # Solution 2: Use streaming
    result = lazy_query.collect(streaming=True)
    
    # Solution 3: Control the thread pool via POLARS_MAX_THREADS
    # (must be set in the environment before polars is imported)
    import os
    os.environ["POLARS_MAX_THREADS"] = "8"
    Python

    Problem: Out of memory errors

    # Solution: Process in chunks or use streaming
    def process_in_chunks(file_path, chunk_size=10000):
        # read_csv_batched returns a BatchedCsvReader; fetch batches explicitly
        reader = pl.read_csv_batched(file_path, batch_size=chunk_size)
        partial_results = []
    
        while True:
            batches = reader.next_batches(1)
            if not batches:
                break
            for chunk_df in batches:
                partial_results.append(
                    chunk_df.group_by("category").agg(pl.col("value").sum())
                )
    
        # Combine the per-chunk sums into final totals
        return (
            pl.concat(partial_results)
            .group_by("category")
            .agg(pl.col("value").sum())
        )
    Python

    Common Error Messages

    Error: ShapeError: Could not add column

    # Problem: Column length mismatch
    # Solution: Ensure all columns have the same length
    df = pl.DataFrame({
        "a": [1, 2, 3],
        "b": [4, 5, 6],  # Same length as 'a'
    })
    Python

    Error: ComputeError: Could not parse

    # Problem: Date/time parsing issues
    # Solution: Specify format or use strict=False
    df = df.with_columns([
        pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
    ])
    Python

    Error: ColumnNotFoundError

    # Problem: Column name doesn't exist
    # Solution: Check column names
    print(df.columns)  # List all columns
    # or use regex for flexible matching
    df.select(pl.col("^.*name.*$"))  # Columns containing "name"
    Python

    Appendix C: Performance Benchmarks

    Polars vs Pandas Comparison

    Based on common operations with 1M rows:

    Operation            Pandas Time   Polars Time   Speedup
    GroupBy Sum          145ms         12ms          12.1x
    Filter + Select      89ms          8ms           11.1x
    Join Operations      267ms         23ms          11.6x
    String Operations    156ms         19ms          8.2x
    Date Parsing         445ms         67ms          6.6x
    Aggregations         78ms          9ms           8.7x
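
    These numbers are indicative: actual results depend on hardware, data distribution, and library versions. Below is a minimal sketch of how such a comparison might be timed on your own machine, using a synthetic dataset rather than the original benchmark data (both pandas and polars are assumed to be installed):

    import time
    import numpy as np
    import pandas as pd
    import polars as pl
    
    n = 1_000_000
    rng = np.random.default_rng(0)
    data = {
        "category": rng.integers(0, 100, n),
        "value": rng.random(n),
    }
    pdf = pd.DataFrame(data)
    plf = pl.DataFrame(data)
    
    # Time the same group-by sum in both libraries
    t0 = time.perf_counter()
    pdf.groupby("category")["value"].sum()
    pandas_ms = (time.perf_counter() - t0) * 1000
    
    t0 = time.perf_counter()
    plf.group_by("category").agg(pl.col("value").sum())
    polars_ms = (time.perf_counter() - t0) * 1000
    
    print(f"pandas: {pandas_ms:.1f}ms, polars: {polars_ms:.1f}ms")
    Python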

    Memory Usage Comparison

    Data Type        Pandas Memory   Polars Memory   Reduction
    1M integers      7.6 MB          3.8 MB          50%
    1M strings       54.2 MB         38.1 MB         30%
    1M categories    54.2 MB         8.4 MB          84%
    1M dates         15.3 MB         7.6 MB          50%
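
    Memory is just as easy to check on your own data. The sketch below compares plain strings against categoricals in both libraries, using Polars' estimated_size() and pandas' memory_usage(deep=True); the values are synthetic, so the savings will not match the table exactly.

    import pandas as pd
    import polars as pl
    
    n = 1_000_000
    labels = [f"label_{i % 1000}" for i in range(n)]
    
    # pandas: object dtype vs category
    pd_str = pd.Series(labels).memory_usage(deep=True)
    pd_cat = pd.Series(labels, dtype="category").memory_usage(deep=True)
    
    # Polars: Utf8 vs Categorical
    pl_str = pl.Series(labels).estimated_size()
    pl_cat = pl.Series(labels, dtype=pl.Categorical).estimated_size()
    
    print(f"pandas: {pd_str / 1e6:.1f} MB -> {pd_cat / 1e6:.1f} MB as category")
    print(f"polars: {pl_str / 1e6:.1f} MB -> {pl_cat / 1e6:.1f} MB as Categorical")
    Python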

    Appendix D: Resources and Further Learning

    Official Resources

    • Documentation: https://docs.pola.rs
    • GitHub Repository: https://github.com/pola-rs/polars

    Community and Support

    • Discord Server: Join the Polars community for real-time help
    • Stack Overflow: Tag questions with python-polars
    • GitHub Issues: Report bugs and request features
    • Reddit: r/PolarsDataFrame community discussions

    Ecosystem Tools

    • PyArrow: Arrow format integration
    • DuckDB: SQL interface for Polars
    • Connectorx: Fast database connectors
    • Great Expectations: Data quality validation
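
    A quick taste of two of these integrations, assuming recent pyarrow and duckdb installations (the DataFrame and query below are purely illustrative):

    import duckdb
    import polars as pl
    
    df = pl.DataFrame({"category": ["A", "B", "A"], "value": [1, 2, 3]})
    
    # PyArrow: convert to and from Arrow tables (zero-copy where possible)
    arrow_table = df.to_arrow()
    roundtrip = pl.from_arrow(arrow_table)
    
    # DuckDB: run SQL directly against the Polars DataFrame in scope
    totals = duckdb.sql(
        "SELECT category, SUM(value) AS total FROM df GROUP BY category"
    ).pl()
    Python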

    Learning Path Recommendations

    Beginner Path:

    1. Master DataFrame creation and basic operations
    2. Learn filtering and selection patterns
    3. Practice with real datasets
    4. Understand lazy evaluation benefits

    Intermediate Path:

    1. Complex aggregations and window functions
    2. Join operations and data combination
    3. Time series analysis
    4. Performance optimization techniques

    Advanced Path:

    1. Custom functions and expressions
    2. Nested data manipulation
    3. Production deployment patterns
    4. Integration with other tools

    Sample Datasets for Practice

    # Sample data generators for practice
    def create_sales_data(n_rows=10000):
        import random
        random.seed(42)

        # Build the date range eagerly and cycle through it so the column
        # always has exactly n_rows values
        dates = pl.date_range(date(2020, 1, 1), date(2023, 12, 31), "1d", eager=True)

        return pl.DataFrame({
            "date": [dates[i % len(dates)] for i in range(n_rows)],
            "product": [f"Product_{i % 100}" for i in range(n_rows)],
            "sales": [random.randint(10, 1000) for _ in range(n_rows)],
            "region": [random.choice(["North", "South", "East", "West"]) for _ in range(n_rows)]
        })

    def create_sensor_data(n_rows=50000):
        import random
        random.seed(42)

        # Hourly timestamps, cycled to match n_rows
        timestamps = pl.datetime_range(
            datetime(2023, 1, 1),
            datetime(2023, 12, 31),
            "1h",
            eager=True
        )

        return pl.DataFrame({
            "timestamp": [timestamps[i % len(timestamps)] for i in range(n_rows)],
            "sensor_id": [f"SENSOR_{i % 20:03d}" for i in range(n_rows)],
            "temperature": [random.gauss(20, 5) for _ in range(n_rows)],
            "humidity": [random.gauss(50, 15) for _ in range(n_rows)],
            "pressure": [random.gauss(1013, 10) for _ in range(n_rows)]
        })
    
    # Usage
    sales_df = create_sales_data(10000)
    sensor_df = create_sensor_data(5000)
    Python

    Final Words

    This comprehensive guide covers Polars from beginner to expert level, providing practical examples, performance optimization techniques, and production-ready patterns. The combination of detailed explanations, code examples, and visual diagrams using Mermaid makes it a complete resource for mastering Python Polars.

    Key Takeaways

    1. Polars is Revolutionary: Built for modern data processing with speed and efficiency
    2. Lazy Evaluation: The key to performance optimization and memory efficiency
    3. Expression System: Powerful, composable operations that are both readable and fast
    4. Production Ready: Robust features for real-world data processing pipelines
    5. Growing Ecosystem: Active community and continuous development

    Your Journey Forward

    • Practice Regularly: Apply concepts with real datasets
    • Join the Community: Engage with other Polars users
    • Stay Updated: Follow Polars development for new features
    • Share Knowledge: Contribute to the growing Polars ecosystem

    Happy data processing with Polars! 🚀


    Discover more from Altgr Blog

    Subscribe to get the latest posts sent to your email.

    Leave a Reply

    Your email address will not be published. Required fields are marked *