From Beginner to Expert
Master the fastest DataFrame library in Python with practical examples, advanced techniques, and real-world applications
About This Book
This comprehensive guide will take you from complete beginner to Polars expert through hands-on examples, practical exercises, and real-world case studies. Each chapter builds upon previous concepts while introducing new techniques and best practices.
Who This Book Is For
- Data Analysts looking for faster data processing
- Data Scientists wanting to optimize their workflows
- Python Developers seeking performant data manipulation tools
- Pandas Users ready to upgrade to next-generation DataFrame operations
Prerequisites
- Basic Python programming knowledge
- Familiarity with data concepts (optional but helpful)
- No prior Polars experience required
Table of Contents
Part I: Foundations
- Why Polars? Understanding the Revolution
- Getting Started: Installation and Setup
- Core Concepts: DataFrames, Series, and Expressions
- Your First Polars Program
Part II: Essential Operations
- Loading and Creating Data
- Data Inspection and Exploration
- Selecting and Filtering Data
- Data Transformation Fundamentals
Part III: Intermediate Techniques
- Aggregations and Grouping
- Joining and Combining DataFrames
- Working with Strings and Dates
- Time Series Analysis
Part IV: Advanced Features
- Lazy Evaluation and Query Optimization
- Custom Functions and Complex Operations
- Working with Nested Data (Lists and Structs)
- Window Functions and Advanced Analytics
Part V: Performance and Production
- Performance Optimization Strategies
- Memory Management and Large Datasets
- Parallel Processing and Streaming
Part VI: Real-World Applications
- Business Analytics Case Study
- Financial Data Analysis
- IoT and Sensor Data Processing
- Log Analysis and Monitoring
Part VII: Migration and Best Practices
Appendices
- A: Quick Reference and Cheat Sheets
- B: Troubleshooting Common Issues
- C: Performance Benchmarks
- D: Resources and Further Learning
1. Why Polars? Understanding the Revolution
🎯 Learning Objectives
By the end of this chapter, you will:
- Understand what makes Polars revolutionary in data processing
- Know when to choose Polars over other libraries
- Grasp the architectural differences between Polars and Pandas
- Appreciate the key advantages that drive Polars’ performance
The Data Processing Challenge
In today’s data-driven world, we’re dealing with increasingly large datasets. Traditional tools like Pandas, while excellent for small to medium datasets, often struggle with:
- Performance bottlenecks on large datasets
- Memory limitations that cause crashes
- Single-threaded execution that doesn’t utilize modern hardware
- Eager evaluation that processes unnecessary data
Enter Polars: The Game Changer
Polars is a revolutionary DataFrame library that addresses these challenges head-on. Built from the ground up in Rust with Python bindings, it represents the next generation of data processing tools.
What Makes Polars Special?
# A simple example showing Polars' power
import polars as pl
import time
# Create a large dataset
df = pl.DataFrame({
"id": range(1_000_000),
"value": range(1_000_000),
"category": ["A", "B", "C"] * 333_334
})
# Lightning-fast operations
start = time.time()
result = (
df
.filter(pl.col("value") > 500_000)
.group_by("category")
.agg(pl.col("value").sum())
)
print(f"Processed 1M rows in {time.time() - start:.3f} seconds")PythonArchitecture Comparison: The Foundation of Speed
graph TB
subgraph "Traditional Approach (Pandas)"
A[Python Code] --> B[Pandas API]
B --> C[NumPy/C Backend]
C --> D[Single Thread Execution]
D --> E[Eager Evaluation]
E --> F[Result]
end
subgraph "Modern Approach (Polars)"
G[Python Code] --> H[Polars API]
H --> I[Rust Core Engine]
I --> J[Multi-threaded Execution]
J --> K[Lazy Evaluation]
K --> L[Query Optimization]
L --> M[SIMD Instructions]
M --> N[Result]
end
style I fill:#f96
style J fill:#9f6
style K fill:#69f
style L fill:#f69
Key Advantages Explained
1. Blazing Fast Performance
- Rust Implementation: Native speed without Python overhead
- SIMD Optimizations: Utilizes modern CPU instructions
- Parallel Processing: Automatic multi-threading
- Benchmark: Often 5-30x faster than Pandas
2. Memory Efficiency
- Columnar Storage: Optimized memory layout
- Zero-Copy Operations: Minimal memory allocations
- Streaming Support: Process datasets larger than RAM (see the sketch after this list)
- Smart Memory Management: Rust’s ownership model prevents leaks
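The streaming bullet above is easiest to see in code. A minimal sketch, assuming a CSV that does not fit in memory (the file name and column names here are placeholders):
import polars as pl
# Scan lazily, then collect with the streaming engine so the file is processed in chunks
summary = (
    pl.scan_csv("events_larger_than_ram.csv")
    .group_by("event_type")
    .agg(pl.col("duration_ms").mean().alias("avg_duration_ms"))
    .collect(streaming=True)
)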
3. Lazy Evaluation with Query Optimization
# This creates a plan, doesn't execute immediately
lazy_query = (
pl.scan_csv("huge_file.csv")
.filter(pl.col("date") > "2023-01-01") # Pushed down to file reading
.select(["id", "value"]) # Only read needed columns
.group_by("id")
.agg(pl.col("value").sum())
)
# Execute the optimized plan
result = lazy_query.collect()
4. Expressive and Intuitive API
- Method Chaining: Natural, readable code
- Rich Expression System: Powerful column operations
- Type Safety: Catch errors at query planning time
- Consistent Patterns: Learn once, apply everywhere
5. Modern Data Types and Features
- Native Date/Time Support: Built-in temporal operations
- Nested Data Types: Lists, structs, and more
- String Processing: Optimized text operations
- Categorical Data: Efficient handling of categories (a short sketch follows this list)
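A minimal sketch of these types working together (the column names are illustrative, not from a real dataset):
import polars as pl
from datetime import date
orders = pl.DataFrame({
    "ordered_on": [date(2024, 1, 1), date(2024, 1, 2)],  # native Date column
    "tags": [["priority", "gift"], ["bulk"]],            # nested List column
    "tier": ["gold", "silver"],
}).with_columns(pl.col("tier").cast(pl.Categorical))      # efficient repeated strings
orders = orders.with_columns(
    pl.col("ordered_on").dt.weekday().alias("weekday"),   # built-in temporal operation
    pl.col("tags").list.len().alias("n_tags"),            # optimized nested-data operation
)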
When to Choose Polars
✅ Choose Polars When:
- Working with datasets > 100MB
- Performance is critical
- You need out-of-core processing
- Building production data pipelines
- Dealing with time series data
- Complex aggregations and joins
⚠️ Consider Alternatives When:
- Dataset is very small (< 10MB)
- Heavy integration with pandas ecosystem required (see the interop sketch after this list)
- Team lacks Python/data processing experience
- Quick prototyping with familiar tools
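If pandas integration is the sticking point, note that moving between the two libraries is cheap. A minimal round-trip sketch (assumes pandas and pyarrow are installed alongside Polars):
import pandas as pd
import polars as pl
pdf = pd.DataFrame({"x": [1, 2, 3]})
pldf = pl.from_pandas(pdf)   # pandas -> Polars
back = pldf.to_pandas()      # Polars -> pandas, e.g. for plotting or scikit-learn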
Real-World Impact: A Quick Example
Let’s see Polars in action with a practical scenario:
import polars as pl
import pandas as pd
import numpy as np
import time
# Create sample sales data
n_rows = 500_000
data = {
"date": pd.date_range("2020-01-01", periods=n_rows, freq="1H"),
"product_id": np.random.randint(1, 1000, n_rows),
"sales": np.random.normal(100, 25, n_rows),
"region": np.random.choice(["North", "South", "East", "West"], n_rows)
}
# Pandas approach
df_pandas = pd.DataFrame(data)
start = time.time()
pandas_result = (
df_pandas
.groupby(["region", df_pandas["date"].dt.date])
.agg({"sales": ["sum", "mean", "count"]})
)
pandas_time = time.time() - start
# Polars approach
df_polars = pl.DataFrame(data)
start = time.time()
polars_result = (
df_polars
.group_by(["region", pl.col("date").dt.date()])
.agg([
pl.col("sales").sum().alias("sales_sum"),
pl.col("sales").mean().alias("sales_mean"),
pl.col("sales").count().alias("sales_count")
])
)
polars_time = time.time() - start
print(f"Pandas: {pandas_time:.2f}s")
print(f"Polars: {polars_time:.2f}s")
print(f"Speedup: {pandas_time/polars_time:.1f}x faster")PythonChapter Summary
Polars represents a paradigm shift in data processing, offering:
- Revolutionary performance through Rust and modern algorithms
- Memory efficiency that handles larger datasets
- Query optimization that makes your code faster automatically
- Modern API design that’s both powerful and intuitive
In the next chapter, we’ll get Polars installed and write our first program to experience this performance firsthand.
🔗 What’s Next?
Ready to experience the Polars advantage? Let’s get it installed and running on your system!
2. Getting Started: Installation and Setup
🎯 Learning Objectives
By the end of this chapter, you will:
- Successfully install Polars in different environments
- Configure your development environment for optimal Polars usage
- Understand different installation options and their use cases
- Verify your installation with a simple test program
Installation Options
Basic Installation
# Most common installation
pip install polars
# For conda users
conda install -c conda-forge polarsBashAdvanced Installation Options
# With all optional dependencies (recommended for full functionality)
pip install polars[all]
# For development and testing
pip install polars[dev]
# Specific feature sets
pip install polars[pandas] # Pandas interoperability
pip install polars[numpy] # NumPy integration
pip install polars[xlsx] # Excel file support
pip install polars[sql] # SQL query support
Installation Verification
import polars as pl
import sys
print(f"✅ Polars version: {pl.__version__}")
print(f"🐍 Python version: {sys.version}")
print(f"🏗️ Available features: {pl.show_versions()}")
# Quick functionality test
test_df = pl.DataFrame({"test": [1, 2, 3]})
print(f"🧪 Test DataFrame created successfully: {test_df.shape[0]} rows")PythonDevelopment Environment Setup
Jupyter Notebook Configuration
# Put this in your notebook's first cell
import polars as pl
import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta
import warnings
# Polars display configuration
pl.Config.set_tbl_rows(15) # Show 15 rows in output
pl.Config.set_tbl_cols(8) # Show 8 columns in output
pl.Config.set_fmt_str_lengths(50) # Truncate strings at 50 chars
pl.Config.set_tbl_width_chars(120) # Table width in characters
# Enable better error messages
pl.Config.set_verbose(True)
# Suppress pandas warnings when comparing
warnings.filterwarnings('ignore', category=FutureWarning)
print("🚀 Polars environment ready!")PythonIDE Configuration (VS Code, PyCharm)
# Create a polars_config.py file for consistent setup
import polars as pl
def setup_polars_environment():
"""Configure Polars for development environment"""
# Display settings
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(10)
pl.Config.set_fmt_str_lengths(100)
# Performance settings
pl.Config.set_streaming_chunk_size(10000)
# Thread count is read from the POLARS_MAX_THREADS environment variable
# before polars is imported, so here we simply report what is in use
n_threads = pl.thread_pool_size()
print(f"🔧 Polars is running with {n_threads} threads")
print("📊 Display: 20 rows, 10 columns")
return pl
# Import and use in your scripts
# from polars_config import setup_polars_environment
# pl = setup_polars_environment()
Your First Polars Program
Let’s create a simple program to verify everything is working:
import polars as pl
import time
def first_polars_program():
"""Your first Polars program - data creation and basic operations"""
print("🎉 Welcome to your first Polars program!")
# Create sample data
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"age": [25, 30, 35, 28, 32],
"city": ["New York", "London", "Tokyo", "Paris", "Sydney"],
"salary": [70000, 80000, 90000, 75000, 85000]
})
print("\n📊 Original Data:")
print(df)
# Basic operations
print("\n🔍 Basic Statistics:")
print(df.describe())
# Filtering example
high_earners = df.filter(pl.col("salary") > 80000)
print(f"\n💰 High earners (salary > $80,000): {high_earners.height} people")
print(high_earners)
# Grouping example
city_stats = df.group_by("city").agg([
pl.col("salary").mean().alias("avg_salary"),
pl.col("age").mean().alias("avg_age")
])
print("\n🌍 City Statistics:")
print(city_stats)
# Performance demonstration
print("\n⚡ Performance Test:")
large_df = pl.DataFrame({
"id": range(100_000),
"value": range(100_000)
})
start_time = time.time()
result = large_df.filter(pl.col("value") > 50_000).sum()
elapsed = time.time() - start_time
print(f"Processed 100,000 rows in {elapsed:.4f} seconds")
print("🎯 Setup complete! You're ready to learn Polars.")
# Run the program
if __name__ == "__main__":
first_polars_program()
Common Installation Issues and Solutions
Issue 1: Import Error
# Problem: ImportError: No module named 'polars'
# Solution: Ensure you're in the correct Python environment
python -m pip install polars
# Or check your Python path
python -c "import sys; print(sys.path)"BashIssue 2: Version Conflicts
# Problem: Dependency conflicts with other packages
# Solution: Use virtual environment
python -m venv polars_env
source polars_env/bin/activate # On Windows: polars_env\Scripts\activate
pip install polars[all]
Issue 3: Performance Issues
# Problem: Slow performance on large datasets
# Solution: control the thread pool size via the POLARS_MAX_THREADS
# environment variable (it must be set before polars is imported)
import os
os.environ["POLARS_MAX_THREADS"] = "4"  # or str(os.cpu_count()) to use all cores
import polars as pl
print(f"Polars thread pool size: {pl.thread_pool_size()}")
Environment Checklist
Before proceeding to the next chapter, ensure:
- ✅ Polars is installed and importable
- ✅ Version is 0.19.0 or higher (recommended; a quick check follows this list)
- ✅ Basic operations work correctly
- ✅ Display configuration is set up
- ✅ Your development environment (Jupyter/IDE) is configured
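A quick, hedged way to assert that minimum version in a script (simple string parsing; pre-release version strings may need extra handling):
import polars as pl
major, minor, *_ = (int(part) for part in pl.__version__.split("."))
assert (major, minor) >= (0, 19), f"Polars {pl.__version__} is older than 0.19.0"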
Quick Reference Card
# Essential imports for every Polars session
import polars as pl
import numpy as np
from datetime import datetime, date
# Basic configuration
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(8)
# Check installation
print(f"Polars v{pl.__version__} ready!")
# Create test DataFrame
df = pl.DataFrame({"test": [1, 2, 3]})
print(df) # Should display a simple table
Chapter Summary
You now have:
- ✅ Polars installed with the appropriate options for your needs
- ✅ Environment configured for optimal development experience
- ✅ First program running demonstrating basic Polars capabilities
- ✅ Troubleshooting knowledge for common setup issues
🔗 What’s Next?
With Polars installed and configured, let’s dive into the core concepts that make Polars so powerful!
3. Core Concepts: DataFrames, Series, and Expressions
🎯 Learning Objectives
By the end of this chapter, you will:
- Understand the three core data structures in Polars
- Know how to work with different data types effectively
- Master the expression system that makes Polars powerful
- Distinguish between eager and lazy evaluation
The Three Pillars of Polars
graph TD
A[Polars Core] --> B[DataFrame]
A --> C[Series]
A --> D[Expression]
B --> E[Eager Evaluation]
B --> F[LazyFrame]
F --> G[Query Optimization]
C --> H[Column Data]
C --> I[Data Types]
D --> J[Operations]
D --> K[Transformations]
D --> L[Aggregations]
style A fill:#f96
style B fill:#9f6
style C fill:#69f
style D fill:#f69
1. DataFrame: Your Data Container
A DataFrame is a 2D data structure with labeled columns, similar to a spreadsheet or SQL table.
import polars as pl
# Creating a DataFrame
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"salary": [50000, 60000, 70000],
"active": [True, False, True]
})
print("📊 Basic DataFrame:")
print(df)
# DataFrame properties
print(f"Shape: {df.shape}") # (rows, columns)
print(f"Columns: {df.columns}")
print(f"Data types: {df.dtypes}")PythonDataFrame vs LazyFrame
flowchart LR
A[DataFrame] --> B[Eager Execution]
B --> C[Immediate Results]
D[LazyFrame] --> E[Query Building]
E --> F[Optimization]
F --> G[Execution]
G --> H[Results]
style A fill:#9f6
style D fill:#69f
# Eager evaluation (DataFrame)
eager_result = df.filter(pl.col("age") > 25).select(["name", "salary"])
print("Eager result:", eager_result)
# Lazy evaluation (LazyFrame)
lazy_query = (
df.lazy()
.filter(pl.col("age") > 25)
.select(["name", "salary"])
)
print("Lazy query plan:")
print(lazy_query.explain()) # Show the execution plan
# Execute the lazy query
lazy_result = lazy_query.collect()
print("Lazy result:", lazy_result)Python2. Series: Individual Columns
A Series represents a single column of data with a specific type.
# Creating Series
ages = pl.Series("ages", [25, 30, 35, 28])
names = pl.Series("names", ["Alice", "Bob", "Charlie", "Diana"])
print("📈 Series examples:")
print(f"Ages: {ages}")
print(f"Names: {names}")
# Series operations
print(f"Mean age: {ages.mean()}")
print(f"Unique names: {names.n_unique()}")
print(f"Age statistics:\n{ages.describe()}")Python3. Expressions: The Heart of Polars
Expressions are the key to Polars’ power – they describe computations that can be applied to columns.
# Basic expressions
print("🔧 Expression examples:")
# Column references
age_expr = pl.col("age")
salary_expr = pl.col("salary")
# Computations
annual_bonus = pl.col("salary") * 0.1
senior_status = pl.col("age") > 30
# Complex expressions
salary_category = (
pl.when(pl.col("salary") > 60000)
.then(pl.lit("High"))
.when(pl.col("salary") > 40000)
.then(pl.lit("Medium"))
.otherwise(pl.lit("Low"))
)
# Apply expressions
result = df.with_columns([
annual_bonus.alias("bonus"),
senior_status.alias("is_senior"),
salary_category.alias("salary_category")
])
print(result)
Data Types Deep Dive
graph TD
A[Polars Data Types] --> B[Numeric]
A --> C[Text]
A --> D[Temporal]
A --> E[Complex]
A --> F[Special]
B --> B1[Int8, Int16, Int32, Int64]
B --> B2[UInt8, UInt16, UInt32, UInt64]
B --> B3[Float32, Float64]
C --> C1[Utf8]
C --> C2[Categorical]
C --> C3[Binary]
D --> D1[Date]
D --> D2[Datetime]
D --> D3[Time]
D --> D4[Duration]
E --> E1[List]
E --> E2[Struct]
E --> E3[Array]
F --> F1[Boolean]
F --> F2[Null]
# Data type examples with explanations (all columns must share the same length)
from datetime import date, datetime
examples_df = pl.DataFrame({
# Integers - choose size based on your data range
"small_int": pl.Series([1, 2, 3], dtype=pl.Int8), # -128 to 127
"big_int": pl.Series([1_000_000, 2_000_000, 3_000_000], dtype=pl.Int64), # Very large numbers
# Floats
"precision_float": pl.Series([1.1, 2.2, 3.3], dtype=pl.Float32), # Less precision, less memory
"high_precision": pl.Series([1.123456789, 2.123456789, 3.123456789], dtype=pl.Float64), # More precision
# Text
"text": pl.Series(["Hello", "World", "!"], dtype=pl.Utf8),
"category": pl.Series(["A", "B", "A"], dtype=pl.Categorical), # Efficient for repeated values
# Temporal
"date": pl.Series([date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)]),
"datetime": pl.Series([datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 2, 12, 0), datetime(2023, 1, 3, 12, 0)]),
# Boolean
"flag": pl.Series([True, False, True], dtype=pl.Boolean),
# Complex types
"numbers_list": pl.Series([[1, 2, 3], [4, 5], [6]], dtype=pl.List(pl.Int32)),
"person": pl.Series([
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25},
{"name": "Carol", "age": 40}
], dtype=pl.Struct([
pl.Field("name", pl.Utf8),
pl.Field("age", pl.Int32)
]))
})
print("🎯 Data types showcase:")
print(examples_df.dtypes)
The Expression System: Building Blocks
flowchart TD
A[Expression Building Blocks] --> B[Column References]
A --> C[Literals]
A --> D[Functions]
A --> E[Conditionals]
A --> F[Aggregations]
B --> B1["pl.col('name')"]
B --> B2["pl.all()"]
B --> B3["pl.exclude('id')"]
C --> C1["pl.lit(42)"]
C --> C2["pl.lit('text')"]
D --> D1[".str.len()"]
D --> D2[".dt.year()"]
D --> D3[".abs()"]
E --> E1["pl.when().then()"]
E --> E2[".otherwise()"]
F --> F1[".sum()"]
F --> F2[".mean()"]
F --> F3[".count()"]# Expression building blocks demonstration
demo_df = pl.DataFrame({
"name": ["Alice Johnson", "Bob Smith", "Charlie Brown"],
"birth_date": ["1990-05-15", "1985-12-20", "1992-03-10"],
"score": [85, 92, 78],
"active": [True, False, True]
})
# Column references
print("🔗 Column references:")
all_cols = demo_df.select(pl.all()) # Select all columns
name_only = demo_df.select(pl.col("name")) # Select specific column
no_active = demo_df.select(pl.exclude("active")) # All except 'active'
# Literals and computations
enriched = demo_df.with_columns([
pl.lit("Student").alias("role"), # Add literal column
(pl.col("score") * 1.1).alias("boosted_score"), # Mathematical operation
pl.col("name").str.len().alias("name_length"), # String function
])
print("📊 Enriched data:")
print(enriched)
# Conditional expressions
categorized = demo_df.with_columns([
pl.when(pl.col("score") >= 90)
.then(pl.lit("Excellent"))
.when(pl.col("score") >= 80)
.then(pl.lit("Good"))
.when(pl.col("score") >= 70)
.then(pl.lit("Average"))
.otherwise(pl.lit("Needs Improvement"))
.alias("grade")
])
print("🎓 Grade categories:")
print(categorized.select(["name", "score", "grade"]))
Performance Implications
# Understanding eager vs lazy performance
import time
# Create larger dataset for demonstration
large_df = pl.DataFrame({
"id": range(100_000),
"value": range(100_000),
"category": ["A", "B", "C", "D"] * 25_000
})
print("⚡ Performance comparison:")
# Eager evaluation
start = time.time()
eager_result = (
large_df
.filter(pl.col("value") > 50_000)
.group_by("category")
.agg(pl.col("value").sum())
)
eager_time = time.time() - start
# Lazy evaluation
start = time.time()
lazy_result = (
large_df
.lazy()
.filter(pl.col("value") > 50_000)
.group_by("category")
.agg(pl.col("value").sum())
.collect()
)
lazy_time = time.time() - start
print(f"Eager time: {eager_time:.4f}s")
print(f"Lazy time: {lazy_time:.4f}s")
print(f"Performance difference: {eager_time/lazy_time:.2f}x")PythonChapter Summary
Key Concepts Mastered:
- ✅ DataFrame: Your primary data container with eager evaluation
- ✅ LazyFrame: Query builder with optimization capabilities
- ✅ Series: Individual columns with specific data types
- ✅ Expressions: Powerful building blocks for data operations
- ✅ Data Types: Choosing the right type for memory and performance
The Expression Advantage:
- Composable and reusable (illustrated after this list)
- Type-safe operations
- Optimizable by the query engine
- Readable and intuitive syntax
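As a small illustration of that composability, one expression object can be reused across filtering, column creation, and aggregation (using the name/age/salary/active DataFrame from earlier in this chapter):
# Define the expression once...
high_salary = pl.col("salary") > 60000
# ...then reuse it wherever it is needed
df.filter(high_salary)
df.with_columns(high_salary.alias("is_high_earner"))
df.group_by("active").agg(high_salary.sum().alias("n_high_earners"))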
🔗 What’s Next?
Now that you understand Polars’ foundation, let’s build your first complete Polars program!
4. Your First Polars Program
🎯 Learning Objectives
By the end of this chapter, you will:
- Build a complete data analysis program using Polars
- Apply core concepts in a practical scenario
- Understand common patterns and workflows
- Gain confidence in writing Polars code
The Scenario: Employee Data Analysis
Let’s build a program that analyzes employee data for a fictional company. This will demonstrate real-world Polars usage patterns.
import polars as pl
from datetime import date, datetime
import random
def create_sample_data():
"""Create realistic employee data for our analysis"""
# Set seed for reproducible results
random.seed(42)
departments = ["Engineering", "Sales", "Marketing", "HR", "Finance"]
cities = ["New York", "London", "Tokyo", "Berlin", "Sydney"]
# Generate 1000 employees
employees = []
for i in range(1000):
employees.append({
"employee_id": f"EMP{i+1:04d}",
"name": f"Employee {i+1}",
"department": random.choice(departments),
"salary": random.randint(40000, 150000),
"hire_date": date(
random.randint(2018, 2023),
random.randint(1, 12),
random.randint(1, 28)
),
"city": random.choice(cities),
"performance_score": round(random.uniform(1.0, 5.0), 1),
"is_remote": random.choice([True, False])
})
return pl.DataFrame(employees)
def analyze_employee_data():
"""Complete employee data analysis workflow"""
print("🏢 Employee Data Analysis")
print("=" * 50)
# Step 1: Create data
print("📊 Creating sample data...")
df = create_sample_data()
print(f"✅ Created dataset with {df.height} employees")
print(f"📋 Columns: {', '.join(df.columns)}")
# Step 2: Basic exploration
print("\n🔍 Data Overview:")
print(df.head())
print(f"\nData types:\n{df.dtypes}")
# Step 3: Data quality check
print("\n🔍 Data Quality Check:")
null_counts = df.null_count()
print(null_counts)
# Step 4: Department analysis
print("\n🏢 Department Analysis:")
dept_analysis = (
df
.group_by("department")
.agg([
pl.col("salary").mean().round(0).alias("avg_salary"),
pl.col("salary").min().alias("min_salary"),
pl.col("salary").max().alias("max_salary"),
pl.col("employee_id").count().alias("employee_count"),
pl.col("performance_score").mean().round(2).alias("avg_performance")
])
.sort("avg_salary", descending=True)
)
print(dept_analysis)
# Step 5: Salary analysis with categories
print("\n💰 Salary Categories:")
salary_categorized = (
df
.with_columns([
pl.when(pl.col("salary") >= 100000)
.then(pl.lit("High"))
.when(pl.col("salary") >= 70000)
.then(pl.lit("Medium"))
.otherwise(pl.lit("Low"))
.alias("salary_category")
])
.group_by("salary_category")
.agg([
pl.col("employee_id").count().alias("count"),
pl.col("salary").mean().round(0).alias("avg_salary")
])
.sort("avg_salary", descending=True)
)
print(salary_categorized)
# Step 6: Geographic distribution
print("\n🌍 Geographic Distribution:")
city_analysis = (
df
.group_by(["city", "is_remote"])
.agg([
pl.col("employee_id").count().alias("employee_count"),
pl.col("salary").mean().round(0).alias("avg_salary")
])
.sort(["city", "is_remote"])
)
print(city_analysis)
# Step 7: Tenure analysis
print("\n📅 Employee Tenure Analysis:")
tenure_analysis = (
df
.with_columns([
((pl.lit(date.today()) - pl.col("hire_date")).dt.days() / 365.25)
.round(1)
.alias("years_of_service")
])
.with_columns([
pl.when(pl.col("years_of_service") >= 5)
.then(pl.lit("Veteran"))
.when(pl.col("years_of_service") >= 2)
.then(pl.lit("Experienced"))
.otherwise(pl.lit("New"))
.alias("tenure_category")
])
.group_by("tenure_category")
.agg([
pl.col("employee_id").count().alias("count"),
pl.col("years_of_service").mean().round(1).alias("avg_tenure"),
pl.col("salary").mean().round(0).alias("avg_salary"),
pl.col("performance_score").mean().round(2).alias("avg_performance")
])
.sort("avg_tenure", descending=True)
)
print(tenure_analysis)
# Step 8: High performer identification
print("\n⭐ High Performers (Top 10%):")
high_performers = (
df
.filter(pl.col("performance_score") >= df["performance_score"].quantile(0.9))
.select([
"employee_id", "name", "department", "salary",
"performance_score", "hire_date"
])
.sort("performance_score", descending=True)
.head(10)
)
print(high_performers)
# Step 9: Summary insights
print("\n📈 Key Insights:")
total_employees = df.height
avg_salary = df["salary"].mean()
avg_performance = df["performance_score"].mean()
remote_pct = (df["is_remote"].sum() / total_employees * 100)
print(f"• Total Employees: {total_employees:,}")
print(f"• Average Salary: ${avg_salary:,.0f}")
print(f"• Average Performance Score: {avg_performance:.2f}/5.0")
print(f"• Remote Workers: {remote_pct:.1f}%")
return df
# Bonus: Using lazy evaluation for large datasets
def demonstrate_lazy_evaluation():
"""Show the power of lazy evaluation"""
print("\n⚡ Lazy Evaluation Demonstration")
print("=" * 40)
# Create a larger dataset
large_df = pl.DataFrame({
"id": range(100_000),
"value": range(100_000),
"category": ["A", "B", "C", "D", "E"] * 20_000
})
# Build a complex lazy query
lazy_query = (
large_df
.lazy()
.filter(pl.col("value") > 50_000)
.with_columns([
(pl.col("value") * 2).alias("doubled_value"),
pl.col("category").cast(pl.Categorical).alias("category")
])
.group_by("category")
.agg([
pl.col("value").sum().alias("total_value"),
pl.col("doubled_value").mean().alias("avg_doubled"),
pl.col("id").count().alias("count")
])
.sort("total_value", descending=True)
)
print("🔍 Query Plan:")
print(lazy_query.explain())
print("\n📊 Execution Results:")
result = lazy_query.collect()
print(result)
if __name__ == "__main__":
# Run the main analysis
employee_df = analyze_employee_data()
# Demonstrate lazy evaluation
demonstrate_lazy_evaluation()
print("\n🎉 Analysis Complete!")
print("You've successfully built and run your first comprehensive Polars program!")PythonUnderstanding the Code Structure
flowchart TD
A[Data Creation] --> B[Data Exploration]
B --> C[Data Quality Check]
C --> D[Business Analysis]
D --> E[Advanced Analytics]
E --> F[Insights & Reporting]
D --> D1[Department Analysis]
D --> D2[Salary Categorization]
D --> D3[Geographic Distribution]
E --> E1[Tenure Analysis]
E --> E2[Performance Ranking]
E --> E3[Predictive Features]
style A fill:#f96
style D fill:#9f6
style E fill:#69f
style F fill:#f69
Key Patterns Demonstrated
1. Data Creation with Realistic Constraints
# Using seeds for reproducible results
random.seed(42)
# Creating structured data with relationships
employees.append({
"employee_id": f"EMP{i+1:04d}", # Consistent ID format
"department": random.choice(departments), # Controlled categories
"salary": random.randint(40000, 150000), # Realistic ranges
})
2. Progressive Data Enhancement
# Building on previous calculations
df.with_columns([
# First calculate years of service
((pl.lit(date.today()) - pl.col("hire_date")).dt.days() / 365.25)
.round(1)
.alias("years_of_service")
]).with_columns([
# Then categorize based on the calculation
pl.when(pl.col("years_of_service") >= 5)
.then(pl.lit("Veteran"))
# ...
])
3. Business Logic Implementation
# Complex categorization logic
pl.when(pl.col("salary") >= 100000).then(pl.lit("High"))
.when(pl.col("salary") >= 70000).then(pl.lit("Medium"))
.otherwise(pl.lit("Low"))PythonPerformance Tips Applied
- Efficient data types: Using pl.Categorical for repeated strings
- Method chaining: Readable, efficient operations
- Lazy evaluation: Demonstrated for large datasets
- Selective operations: Only computing what’s needed
Chapter Summary
Skills Acquired:
- ✅ End-to-end workflow: From data creation to insights
- ✅ Business logic: Implementing real-world categorization rules
- ✅ Performance patterns: Efficient data processing techniques
- ✅ Code organization: Structured, readable analysis functions
Polars Patterns Mastered:
- Progressive data enhancement with with_columns()
- Complex grouping and aggregation
- Conditional logic with when().then().otherwise()
- Date/time calculations and analysis
- Lazy evaluation for performance
🔗 What’s Next?
Ready to learn about the various ways to load and create data in Polars!
5. Loading and Creating Data
🎯 Learning Objectives
By the end of this chapter, you will:
- Master multiple ways to create DataFrames from scratch
- Load data from various file formats efficiently
- Understand when to use eager vs lazy loading
- Handle data type specifications and schema validation
Data Sources Overview
graph TD
A[Data Sources] --> B[File Formats]
A --> C[In-Memory Creation]
A --> D[External Systems]
A --> E[Other Libraries]
B --> B1[CSV/TSV]
B --> B2[Parquet]
B --> B3[JSON/NDJSON]
B --> B4[Excel]
B --> B5[Arrow/IPC]
C --> C1[Dictionaries]
C --> C2[Lists]
C --> C3[Ranges]
C --> C4[Manual Schema]
D --> D1[Databases]
D --> D2[APIs]
D --> D3[Cloud Storage]
E --> E1[Pandas]
E --> E2[NumPy]
E --> E3[PyArrow]
style A fill:#f96
style B fill:#9f6
style C fill:#69f
style D fill:#f69
Creating DataFrames from Scratch
1. From Dictionaries (Most Common)
import polars as pl
import numpy as np
from datetime import date, datetime
# Basic dictionary creation
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana"],
"age": [25, 30, 35, 28],
"salary": [50000, 60000, 70000, 55000],
"active": [True, False, True, True]
})
print("📊 Basic DataFrame:")
print(df)
print(f"Schema: {df.schema}")
# Advanced dictionary with mixed types
advanced_df = pl.DataFrame({
"employee_id": [f"EMP{i:03d}" for i in range(1, 5)],
"hire_date": [date(2020, 1, 15), date(2019, 6, 10), date(2021, 3, 22), date(2020, 11, 5)],
"last_login": [
datetime(2023, 12, 1, 14, 30),
datetime(2023, 12, 2, 9, 15),
None, # Missing value
datetime(2023, 12, 1, 16, 45)
],
"skills": [["Python", "SQL"], ["Java", "Docker"], ["Python", "React"], ["SQL", "Tableau"]],
"performance_scores": [[4.5, 4.2, 4.8], [3.9, 4.1], [4.7, 4.9, 4.6], [4.0]]
})
print("\n🔧 Advanced DataFrame with complex types:")
print(advanced_df)
2. From Lists and Sequences
# List of lists with schema
data_rows = [
["Alice", 25, "Engineering"],
["Bob", 30, "Sales"],
["Charlie", 35, "Marketing"],
["Diana", 28, "Engineering"]
]
df_from_lists = pl.DataFrame(
data_rows,
schema=["name", "age", "department"],
orient="row" # Default: each inner list is a row
)
print("📋 DataFrame from lists:")
print(df_from_lists)
# Column-oriented data
col_data = [
["Alice", "Bob", "Charlie"], # Names column
[25, 30, 35], # Ages column
["NYC", "LA", "Chicago"] # Cities column
]
df_from_columns = pl.DataFrame(
col_data,
schema=["name", "age", "city"],
orient="col" # Each inner list is a column
)
print("\n📊 DataFrame from column lists:")
print(df_from_columns)
3. From NumPy Arrays
# Random data generation
np.random.seed(42)
array_data = np.random.randn(1000, 4) # 1000 rows, 4 columns
df_numpy = pl.DataFrame(
array_data,
schema=["feature_1", "feature_2", "feature_3", "target"]
)
print("🔢 DataFrame from NumPy (first 5 rows):")
print(df_numpy.head())
print(f"Shape: {df_numpy.shape}")
# Structured arrays
structured_array = np.array([
("Alice", 25, 50000.0),
("Bob", 30, 60000.0),
("Charlie", 35, 70000.0)
], dtype=[("name", "U10"), ("age", "i4"), ("salary", "f8")])
df_structured = pl.DataFrame(structured_array)
print("\n🏗️ DataFrame from structured NumPy array:")
print(df_structured)
4. Empty DataFrames with Schema
# Define schema first, add data later
schema = {
"user_id": pl.Int64,
"username": pl.Utf8,
"email": pl.Utf8,
"signup_date": pl.Date,
"is_premium": pl.Boolean,
"login_count": pl.Int32,
"last_active": pl.Datetime,
"preferences": pl.List(pl.Utf8)
}
empty_df = pl.DataFrame(schema=schema)
print("📋 Empty DataFrame with schema:")
print(f"Columns: {empty_df.columns}")
print(f"Schema: {empty_df.schema}")
# Add rows to empty DataFrame
new_rows = pl.DataFrame({
"user_id": [1, 2],
"username": ["alice_dev", "bob_analyst"],
"email": ["alice@company.com", "bob@company.com"],
"signup_date": [date(2023, 1, 15), date(2023, 2, 20)],
"is_premium": [True, False],
"login_count": [45, 23],
"last_active": [datetime(2023, 12, 1, 14, 30), datetime(2023, 11, 28, 9, 15)],
"preferences": [["dark_mode", "notifications"], ["light_mode"]]
}, schema=schema)  # reuse the schema so dtypes match the empty DataFrame for the concat below
filled_df = pl.concat([empty_df, new_rows])
print("\n✅ DataFrame after adding data:")
print(filled_df)
File Loading Mastery
CSV Files: The Workhorse
# Basic CSV reading (assuming you have a CSV file)
# df = pl.read_csv("data.csv")
# Let's create a sample CSV first
sample_csv_data = """id,name,age,department,salary,hire_date
1,Alice Johnson,25,Engineering,75000,2021-03-15
2,Bob Smith,30,Sales,65000,2020-07-22
3,Charlie Brown,35,Marketing,70000,2019-11-10
4,Diana Wilson,28,Engineering,80000,2022-01-08
5,Eve Davis,32,Finance,72000,2020-05-18"""
# Save to file for demonstration
with open("sample_employees.csv", "w") as f:
f.write(sample_csv_data)
# Now read with various options
df_basic = pl.read_csv("sample_employees.csv")
print("📄 Basic CSV reading:")
print(df_basic)
# Advanced CSV reading with options
df_advanced = pl.read_csv(
"sample_employees.csv",
dtypes={
"id": pl.Int32,
"age": pl.Int16,
"salary": pl.Int32,
"hire_date": pl.Date
},
try_parse_dates=True,
null_values=["", "NULL", "N/A"], # What to treat as null
skip_rows=0, # Skip header rows if needed
n_rows=3 # Limit rows for testing
)
print("\n🔧 Advanced CSV reading:")
print(df_advanced)
print(f"Dtypes: {df_advanced.dtypes}")PythonLazy Loading for Large Files
# Lazy CSV scanning (doesn't load data immediately)
lazy_df = pl.scan_csv("sample_employees.csv")
print("🔍 Lazy DataFrame operations:")
print("Query plan:")
# Build a query without executing
query = (
lazy_df
.filter(pl.col("salary") > 70000)
.select(["name", "department", "salary"])
.sort("salary", descending=True)
)
print(query.explain())
# Execute when ready
result = query.collect()
print("\n📊 Query results:")
print(result)
# Clean up
import os
os.remove("sample_employees.csv")PythonJSON and Other Formats
# JSON data example
json_data = [
{"id": 1, "name": "Alice", "scores": [85, 92, 78], "metadata": {"team": "A", "level": "senior"}},
{"id": 2, "name": "Bob", "scores": [90, 88, 92], "metadata": {"team": "B", "level": "junior"}},
{"id": 3, "name": "Charlie", "scores": [75, 85, 80], "metadata": {"team": "A", "level": "mid"}}
]
# Save JSON for reading
import json
with open("sample_data.json", "w") as f:
json.dump(json_data, f)
# Read JSON
df_json = pl.read_json("sample_data.json")
print("📄 JSON DataFrame:")
print(df_json)
# NDJSON (newline-delimited JSON) - each line is a JSON object
ndjson_data = """{"id": 1, "name": "Alice", "value": 10}
{"id": 2, "name": "Bob", "value": 20}
{"id": 3, "name": "Charlie", "value": 30}"""
with open("sample_data.ndjson", "w") as f:
f.write(ndjson_data)
df_ndjson = pl.read_ndjson("sample_data.ndjson")
print("\n📄 NDJSON DataFrame:")
print(df_ndjson)
# Clean up
os.remove("sample_data.json")
os.remove("sample_data.ndjson")PythonCreating Series: Building Blocks
# Basic Series creation
numbers = pl.Series("numbers", [1, 2, 3, 4, 5])
names = pl.Series("employee_names", ["Alice", "Bob", "Charlie"])
print("📊 Basic Series:")
print(f"Numbers: {numbers}")
print(f"Names: {names}")
# Typed Series with explicit dtypes
typed_series = pl.Series("precise_values", [1.1, 2.2, 3.3], dtype=pl.Float32)
categorical_series = pl.Series("categories", ["A", "B", "A", "C"], dtype=pl.Categorical)
print(f"\n🎯 Typed Series:")
print(f"Float32: {typed_series.dtype}")
print(f"Categorical: {categorical_series.dtype}")
# Series with null values
with_nulls = pl.Series("incomplete", [1, 2, None, 4, None, 6])
print(f"\n❓ Series with nulls: {with_nulls}")
print(f"Null count: {with_nulls.null_count()}")
# Series from ranges and generators
range_series = pl.Series("range_data", range(1000))
print(f"\n📈 Range Series (first 10): {range_series.head(10)}")
# Date/time series
date_series = pl.Series("dates", [
date(2023, 1, 1),
date(2023, 1, 2),
date(2023, 1, 3)
])
datetime_series = pl.Series("timestamps", [
datetime(2023, 1, 1, 12, 0, 0),
datetime(2023, 1, 1, 13, 30, 0),
datetime(2023, 1, 1, 15, 45, 0)
])
print(f"\n📅 Date Series: {date_series}")
print(f"🕐 DateTime Series: {datetime_series}")PythonData Loading Best Practices
flowchart TD
A[Choose Loading Method] --> B{Data Size}
B -->|Small < 100MB| C[Eager Loading]
B -->|Large > 100MB| D[Lazy Loading]
C --> E[read_csv, read_parquet]
D --> F[scan_csv, scan_parquet]
G[File Format Choice] --> H{Use Case}
H -->|Fast I/O| I[Parquet]
H -->|Human Readable| J[CSV]
H -->|Nested Data| K[JSON]
H -->|Streaming| L[NDJSON]
M[Schema Handling] --> N[Explicit Types]
N --> O[Better Performance]
N --> P[Data Validation]
style D fill:#f96
style I fill:#9f6
style N fill:#69f
Performance Tips
# 1. Use appropriate data types
df_optimized = pl.DataFrame({
"small_int": pl.Series([1, 2, 3], dtype=pl.Int8), # vs Int64
"category": pl.Series(["A", "B", "A"], dtype=pl.Categorical), # vs Utf8
"flag": pl.Series([True, False, True], dtype=pl.Boolean) # vs string
})
print("💡 Optimized DataFrame memory usage:")
print(f"Schema: {df_optimized.schema}")
# 2. Lazy loading for large datasets
def demonstrate_lazy_benefits():
"""Show lazy loading benefits"""
# For demonstration - in real use, this would be a large file
large_data = {
"id": range(10_000),
"value": range(10_000),
"category": ["A", "B", "C", "D", "E"] * 2_000
}
# Save as parquet for demo
temp_df = pl.DataFrame(large_data)
temp_df.write_parquet("temp_large.parquet")
# Lazy scan - no data loaded yet
lazy_scan = pl.scan_parquet("temp_large.parquet")
# Only load what we need
result = (
lazy_scan
.filter(pl.col("value") > 8000) # Predicate pushdown
.select(["id", "category"]) # Column selection
.group_by("category")
.agg(pl.col("id").count())
.collect() # Execute query
)
print("⚡ Lazy loading result:")
print(result)
# Clean up
os.remove("temp_large.parquet")
demonstrate_lazy_benefits()
Chapter Summary
Data Creation Methods Mastered:
- ✅ Dictionaries: Most flexible and intuitive method
- ✅ Lists: Efficient for structured data
- ✅ NumPy integration: Seamless array handling
- ✅ Schema definition: Type-safe DataFrame creation
File Loading Expertise:
- ✅ CSV: Most common format with advanced options
- ✅ JSON/NDJSON: Nested and streaming data
- ✅ Lazy loading: Memory-efficient large file handling
- ✅ Performance optimization: Right types, right methods
Key Patterns:
- Use lazy loading (scan_*) for large files
- Specify data types explicitly for better performance
- Choose file formats based on use case (Parquet for speed, CSV for compatibility)
- Handle missing values appropriately during loading
🔗 What’s Next?
Now that you can create and load data efficiently, let’s learn how to explore and understand your datasets!
6. Data Inspection and Exploration
Basic Information
# DataFrame info
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data types
print(df.schema) # Schema with types
# Statistical summary
df.describe() # Statistical summary
df.null_count() # Null values per column
df.is_empty() # Check if empty
# Memory usage
df.estimated_size() # Estimated memory usage
Data Preview
# View data
df.head() # First 5 rows
df.head(10) # First 10 rows
df.tail() # Last 5 rows
df.sample(5) # Random 5 rows
df.glimpse() # Transposed view
# Unique values
df["column"].unique()
df["column"].n_unique()
df["column"].value_counts()PythonData Quality Checks
# Missing values analysis
missing_report = df.select([
pl.col("*").null_count().name.suffix("_nulls"),
pl.col("*").count().name.suffix("_total")
])
# Duplicate detection
df.is_duplicated() # Boolean mask for duplicates
df.n_unique() # Count unique combinations
Data Flow Visualization
flowchart TD
A[Raw Data] --> B[Load into Polars]
B --> C[Data Inspection]
C --> D{Data Quality OK?}
D -->|No| E[Data Cleaning]
D -->|Yes| F[Data Analysis]
E --> F
F --> G[Results]
subgraph "Inspection Methods"
H[.head/.tail]
I[.describe]
J[.null_count]
K[.dtypes]
end
C --> H
C --> I
C --> J
C --> K
7. Data Selection and Filtering
Column Selection
# Single column
df.select("name")
df["name"] # Returns Series
# Multiple columns
df.select(["name", "age"])
df.select(pl.col("name"), pl.col("age"))
# Column patterns
df.select(pl.col("^.*_id$")) # Regex pattern
df.select(pl.col("*").exclude("id")) # All except id
# By data type
df.select(pl.col(pl.Utf8)) # String columns
df.select(pl.col(pl.NUMERIC_DTYPES)) # Numeric columns
Row Filtering
# Simple conditions
df.filter(pl.col("age") > 25)
df.filter(pl.col("name") == "Alice")
df.filter(pl.col("city").is_in(["New York", "London"]))
# Complex conditions
df.filter(
(pl.col("age") > 25) &
(pl.col("city") == "New York")
)
df.filter(
(pl.col("age") < 20) |
(pl.col("age") > 60)
)
# String operations
df.filter(pl.col("name").str.contains("^A"))
df.filter(pl.col("name").str.starts_with("B"))
df.filter(pl.col("name").str.ends_with("e"))
# Null filtering
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())PythonAdvanced Selection
# Conditional selection
df.select(
pl.when(pl.col("age") > 30)
.then(pl.lit("Senior"))
.otherwise(pl.lit("Junior"))
.alias("category")
)
# Multiple conditions
df.select(
pl.when(pl.col("score") >= 90).then("A")
.when(pl.col("score") >= 80).then("B")
.when(pl.col("score") >= 70).then("C")
.otherwise("F")
.alias("grade")
)
# Dynamic column selection
numeric_cols = [col for col, dtype in df.schema.items()
if dtype in pl.NUMERIC_DTYPES]
df.select(numeric_cols)
Indexing and Slicing
# Row slicing
df[0] # First row as DataFrame
df[0:5] # First 5 rows
df[-5:] # Last 5 rows
# Boolean indexing
mask = df["age"] > 25
df[mask]
# Sample and slice
df.slice(10, 20) # Rows 10-29
df.sample(n=100) # Random 100 rows
df.sample(fraction=0.1) # Random 10% of rows
8. Data Transformation
Adding and Modifying Columns
# Add new columns
df = df.with_columns([
pl.col("age").alias("age_copy"),
(pl.col("age") * 12).alias("age_months"),
pl.lit("default_value").alias("new_column")
])
# Modify existing columns
df = df.with_columns([
pl.col("name").str.to_uppercase().alias("name"),
pl.col("age").cast(pl.Float64).alias("age")
])
# Multiple transformations
df = df.with_columns([
# String manipulations
pl.col("name").str.len().alias("name_length"),
pl.col("city").str.replace("New ", "").alias("city_clean"),
# Mathematical operations
pl.col("value").log().alias("log_value"),
pl.col("value").pow(2).alias("value_squared"),
# Date operations
pl.col("date").dt.year().alias("year"),
pl.col("date").dt.month().alias("month")
])
String Operations
# String methods
df.with_columns([
pl.col("text").str.len().alias("length"),
pl.col("text").str.to_lowercase().alias("lower"),
pl.col("text").str.to_uppercase().alias("upper"),
pl.col("text").str.strip().alias("stripped"),
pl.col("text").str.slice(0, 5).alias("first_5"),
pl.col("text").str.split(" ").alias("words"),
pl.col("text").str.contains("pattern").alias("has_pattern"),
pl.col("text").str.replace("old", "new").alias("replaced")
])
# Regular expressions
df.with_columns([
pl.col("email").str.extract(r"([^@]+)@([^.]+)").alias("email_parts"),
pl.col("phone").str.replace_all(r"[^\d]", "").alias("phone_clean")
])
Mathematical Operations
# Basic math
df.with_columns([
(pl.col("a") + pl.col("b")).alias("sum"),
(pl.col("a") - pl.col("b")).alias("diff"),
(pl.col("a") * pl.col("b")).alias("product"),
(pl.col("a") / pl.col("b")).alias("ratio"),
(pl.col("a") % pl.col("b")).alias("remainder"),
pl.col("a").pow(pl.col("b")).alias("power")
])
# Advanced math
df.with_columns([
pl.col("value").log().alias("log"),
pl.col("value").log10().alias("log10"),
pl.col("value").exp().alias("exp"),
pl.col("value").sqrt().alias("sqrt"),
pl.col("angle").sin().alias("sin"),
pl.col("angle").cos().alias("cos"),
pl.col("value").abs().alias("absolute")
])
# Statistical functions
df.with_columns([
pl.col("values").rank().alias("rank"),
pl.col("values").pct_change().alias("pct_change"),
pl.col("values").cumsum().alias("cumsum"),
pl.col("values").diff().alias("diff")
])
Conditional Transformations
# Simple conditions
df.with_columns([
pl.when(pl.col("age") >= 18)
.then(pl.lit("Adult"))
.otherwise(pl.lit("Minor"))
.alias("age_group")
])
# Complex conditions
df.with_columns([
pl.when(pl.col("score") >= 90).then("Excellent")
.when(pl.col("score") >= 80).then("Good")
.when(pl.col("score") >= 70).then("Average")
.when(pl.col("score") >= 60).then("Below Average")
.otherwise("Poor")
.alias("performance")
])
# Multiple column conditions
df.with_columns([
pl.when((pl.col("age") > 25) & (pl.col("salary") > 50000))
.then(pl.lit("Senior"))
.when((pl.col("age") > 22) & (pl.col("salary") > 30000))
.then(pl.lit("Mid-level"))
.otherwise(pl.lit("Junior"))
.alias("level")
])
Data Type Conversions
# Type casting
df.with_columns([
pl.col("string_number").cast(pl.Int64),
pl.col("integer").cast(pl.Utf8),
pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("timestamp").cast(pl.Datetime)
])
# Safe casting with error handling
df.with_columns([
pl.col("maybe_number").cast(pl.Int64, strict=False).alias("number"),
pl.col("date_text").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
])
Transformation Pipeline
flowchart LR
A[Raw Data] --> B[String Cleaning]
B --> C[Type Conversion]
C --> D[Mathematical Operations]
D --> E[Conditional Logic]
E --> F[Final Dataset]
subgraph "Transformation Steps"
G[with_columns]
H[select]
I[filter]
J[cast]
end
9. Aggregations and Grouping
Basic Aggregations
# Single column aggregations
df.select([
pl.col("value").sum().alias("total"),
pl.col("value").mean().alias("average"),
pl.col("value").median().alias("median"),
pl.col("value").std().alias("std_dev"),
pl.col("value").min().alias("minimum"),
pl.col("value").max().alias("maximum"),
pl.col("value").count().alias("count"),
pl.col("value").n_unique().alias("unique_count")
])
# Multiple columns
df.select([
pl.all().sum().name.suffix("_sum"),
pl.all().mean().name.suffix("_mean")
])
# Quantiles and percentiles
df.select([
pl.col("value").quantile(0.25).alias("q25"),
pl.col("value").quantile(0.5).alias("q50"),
pl.col("value").quantile(0.75).alias("q75"),
pl.col("value").quantile(0.95).alias("q95")
])
Group By Operations
# Basic grouping
result = df.group_by("category").agg([
pl.col("value").sum().alias("total_value"),
pl.col("value").mean().alias("avg_value"),
pl.col("value").count().alias("count")
])
# Multiple grouping columns
result = df.group_by(["department", "region"]).agg([
pl.col("sales").sum(),
pl.col("employees").count(),
pl.col("salary").mean()
])
# Dynamic grouping
result = df.group_by(
pl.col("date").dt.year().alias("year"),
pl.col("date").dt.month().alias("month")
).agg([
pl.col("revenue").sum(),
pl.col("customers").n_unique()
])
Advanced Aggregations
# Custom aggregations
result = df.group_by("group").agg([
# Statistical measures
pl.col("values").std().alias("std_dev"),
pl.col("values").var().alias("variance"),
pl.col("values").skew().alias("skewness"),
# List aggregations
pl.col("items").list().alias("all_items"),
pl.col("values").sort().first().alias("min_value"),
pl.col("values").sort().last().alias("max_value"),
# Conditional aggregations
pl.col("value").filter(pl.col("status") == "active").sum().alias("active_sum"),
pl.col("value").filter(pl.col("value") > 100).count().alias("high_value_count")
])
# Window functions in aggregations
result = df.group_by("category").agg([
pl.col("value").sum().alias("category_total"),
(pl.col("value") / pl.col("value").sum()).alias("percentage")
])
Rolling and Expanding Aggregations
# Rolling window aggregations
df.with_columns([
pl.col("value").rolling_sum(window_size=3).alias("rolling_sum_3"),
pl.col("value").rolling_mean(window_size=7).alias("rolling_avg_7"),
pl.col("value").rolling_std(window_size=5).alias("rolling_std_5"),
pl.col("value").rolling_max(window_size=10).alias("rolling_max_10")
])
# Expanding window aggregations
df.with_columns([
pl.col("value").cumsum().alias("cumulative_sum"),
pl.col("value").cumcount().alias("cumulative_count"),
pl.col("value").cummax().alias("cumulative_max"),
pl.col("value").cummin().alias("cumulative_min")
])
# Time-based rolling windows
df.sort("timestamp").with_columns([
pl.col("value").rolling_mean_by("timestamp", window_size="7d").alias("7day_avg"),
pl.col("value").rolling_sum_by("timestamp", window_size="1mo").alias("monthly_sum")
])
Pivot and Unpivot
# Pivot table
pivot_df = df.pivot(
values="sales",
index="region",
columns="month",
aggregate_function="sum"
)
# Unpivot (melt)
melted_df = df.melt(
id_vars=["id", "name"],
value_vars=["jan", "feb", "mar"],
variable_name="month",
value_name="sales"
)
# Multiple value columns
melted_df = df.melt(
id_vars=["id"],
value_vars=["sales_jan", "sales_feb", "profit_jan", "profit_feb"],
variable_name="metric_month",
value_name="value"
)
Aggregation Pipeline
flowchart TD
A[Original DataFrame] --> B{Grouping Needed?}
B -->|Yes| C[group_by]
B -->|No| D[Direct Aggregation]
C --> E[agg functions]
D --> E
E --> F[Result DataFrame]
subgraph "Aggregation Functions"
G[sum, mean, count]
H[min, max, std]
I[quantile, median]
J[custom expressions]
end
E --> G
E --> H
E --> I
E --> J
10. Joins and Concatenations
Types of Joins
# Sample DataFrames
customers = pl.DataFrame({
"customer_id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Charlie", "Diana"],
"city": ["NYC", "LA", "Chicago", "Miami"]
})
orders = pl.DataFrame({
"order_id": [101, 102, 103, 104, 105],
"customer_id": [1, 2, 2, 3, 5],
"amount": [100, 200, 150, 300, 250]
})
# Inner join (default)
result = customers.join(orders, on="customer_id", how="inner")
# Left join
result = customers.join(orders, on="customer_id", how="left")
# Right join
result = customers.join(orders, on="customer_id", how="right")
# Outer join
result = customers.join(orders, on="customer_id", how="outer")
# Cross join
result = customers.join(orders, how="cross")PythonAdvanced Join Operations
# Multiple column joins
df1.join(df2, on=["col1", "col2"], how="inner")
# Different column names
df1.join(df2, left_on="id", right_on="customer_id", how="left")
# Join with suffix for duplicate columns
df1.join(df2, on="id", suffix="_right")
# Anti join (rows in left not in right)
df1.join(df2, on="id", how="anti")
# Semi join (rows in left that have match in right)
df1.join(df2, on="id", how="semi")PythonConcatenations
# Vertical concatenation (stack rows)
df_combined = pl.concat([df1, df2, df3], how="vertical")
# Horizontal concatenation (side by side)
df_combined = pl.concat([df1, df2], how="horizontal")
# Diagonal concatenation (union with different schemas)
df_combined = pl.concat([df1, df2], how="diagonal")
# With rechunk for better performance
df_combined = pl.concat([df1, df2], rechunk=True)
Complex Join Examples
# Multiple joins in sequence
result = (
customers
.join(orders, on="customer_id", how="left")
.join(products, left_on="product_id", right_on="id", how="left")
.join(categories, left_on="category_id", right_on="id", how="left")
)
# Join with aggregation
customer_stats = (
orders
.group_by("customer_id")
.agg([
pl.col("amount").sum().alias("total_spent"),
pl.col("order_id").count().alias("order_count")
])
)
enriched_customers = customers.join(
customer_stats,
on="customer_id",
how="left"
).fill_null(0)
# Self join
df.join(
df.select(["id", "manager_id", "name"]).rename({"name": "manager_name"}),
left_on="manager_id",
right_on="id",
how="left"
)
Join Performance Optimization
# Use lazy evaluation for large datasets
lazy_result = (
pl.scan_parquet("customers.parquet")
.join(
pl.scan_parquet("orders.parquet"),
on="customer_id",
how="inner"
)
.collect()
)
# Pre-sort data for better join performance
customers_sorted = customers.sort("customer_id")
orders_sorted = orders.sort("customer_id")
result = customers_sorted.join(orders_sorted, on="customer_id")
Join Strategy Visualization
graph LR
subgraph "Join Types"
A[Inner Join] --> E[Only Matching Records]
B[Left Join] --> F[All Left + Matching Right]
C[Right Join] --> G[All Right + Matching Left]
D[Outer Join] --> H[All Records from Both]
end
subgraph "Special Joins"
I[Anti Join] --> J[Left Records Without Match]
K[Semi Join] --> L[Left Records With Match]
M[Cross Join] --> N[Cartesian Product]
end
11. Time Series Operations
Date and Time Creation
import polars as pl
from datetime import datetime, date, time
# Creating date/time columns
df = pl.DataFrame({
"date": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)],
"datetime": [
datetime(2023, 1, 1, 10, 30),
datetime(2023, 1, 2, 14, 45),
datetime(2023, 1, 3, 9, 15)
],
"value": [100, 150, 200]
})
# From strings
df = pl.DataFrame({
"date_str": ["2023-01-01", "2023-01-02", "2023-01-03"],
"datetime_str": ["2023-01-01 10:30:00", "2023-01-02 14:45:00", "2023-01-03 09:15:00"],
"value": [100, 150, 200]
})
df = df.with_columns([
pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d").alias("date"),
pl.col("datetime_str").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S").alias("datetime")
])
Date/Time Extraction
# Extract date components
df = df.with_columns([
pl.col("datetime").dt.year().alias("year"),
pl.col("datetime").dt.month().alias("month"),
pl.col("datetime").dt.day().alias("day"),
pl.col("datetime").dt.weekday().alias("weekday"),
pl.col("datetime").dt.hour().alias("hour"),
pl.col("datetime").dt.minute().alias("minute"),
pl.col("datetime").dt.second().alias("second"),
pl.col("datetime").dt.week().alias("week_of_year"),
pl.col("datetime").dt.quarter().alias("quarter")
])
# Date arithmetic
df = df.with_columns([
(pl.col("datetime") + pl.duration(days=7)).alias("next_week"),
(pl.col("datetime") - pl.duration(hours=2)).alias("two_hours_ago"),
pl.col("datetime").dt.truncate("1d").alias("date_truncated"),
pl.col("datetime").dt.round("1h").alias("hour_rounded")
])
Time Series Resampling
# Create time series data
import numpy as np  # used for the random-walk values below
dates = pl.date_range(
start=date(2023, 1, 1),
end=date(2023, 12, 31),
interval="1d",
eager=True
)
ts_df = pl.DataFrame({
"date": dates,
"value": np.random.randn(len(dates)).cumsum()
})
# Resample to different frequencies
monthly = (
ts_df
.group_by_dynamic("date", every="1mo")
.agg([
pl.col("value").mean().alias("avg_value"),
pl.col("value").sum().alias("total_value"),
pl.col("value").count().alias("count")
])
)
# Weekly resampling with custom aggregation
weekly = (
ts_df
.group_by_dynamic("date", every="1w", label="left")
.agg([
pl.col("value").first().alias("week_start"),
pl.col("value").last().alias("week_end"),
pl.col("value").max().alias("week_max"),
pl.col("value").min().alias("week_min")
])
)
Rolling Time Windows
# Time-based rolling windows
df = df.sort("datetime").with_columns([
# Rolling statistics over time periods
pl.col("value").rolling_mean_by("datetime", window_size="7d").alias("7day_avg"),
pl.col("value").rolling_sum_by("datetime", window_size="30d").alias("30day_sum"),
pl.col("value").rolling_std_by("datetime", window_size="14d").alias("14day_std"),
# Rolling with minimum periods
pl.col("value").rolling_mean_by(
"datetime",
window_size="7d",
min_periods=3
).alias("7day_avg_min3")
])
# Custom rolling functions
df = df.with_columns([
pl.col("value").rolling_quantile_by("datetime", quantile=0.9, window_size="30d").alias("30day_p90"),
pl.col("value").rolling_median_by("datetime", window_size="14d").alias("14day_median")
])
Time Zone Handling
# Working with time zones
df = pl.DataFrame({
"datetime_utc": [
datetime(2023, 1, 1, 12, 0),
datetime(2023, 1, 2, 12, 0)
],
"value": [100, 200]
})
# Convert to different time zones
df = df.with_columns([
pl.col("datetime_utc").dt.replace_time_zone("UTC").alias("utc"),
pl.col("datetime_utc").dt.replace_time_zone("UTC")
.dt.convert_time_zone("America/New_York").alias("ny_time"),
pl.col("datetime_utc").dt.replace_time_zone("UTC")
.dt.convert_time_zone("Europe/London").alias("london_time")
])
Advanced Time Series Operations
# Lag and lead operations
df = df.sort("date").with_columns([
pl.col("value").shift(1).alias("prev_value"),
pl.col("value").shift(-1).alias("next_value"),
pl.col("value").shift(periods=7).alias("value_7_days_ago")
])
# Percentage change and differences
df = df.with_columns([
pl.col("value").pct_change().alias("pct_change"),
pl.col("value").diff().alias("diff"),
(pl.col("value") / pl.col("value").shift(1) - 1).alias("manual_pct_change")
])
# Fill missing time periods
complete_dates = pl.date_range(
start=df["date"].min(),
end=df["date"].max(),
interval="1d",
eager=True
)
complete_df = pl.DataFrame({"date": complete_dates})
filled_df = complete_df.join(df, on="date", how="left")
# Forward fill and backward fill
filled_df = filled_df.with_columns([
pl.col("value").forward_fill().alias("forward_filled"),
pl.col("value").backward_fill().alias("backward_filled")
])
Time Series Analysis Pipeline
flowchart TD
A[Raw Time Data] --> B[Parse Dates]
B --> C[Sort by Time]
C --> D[Handle Missing Periods]
D --> E[Create Time Features]
E --> F[Rolling Calculations]
F --> G[Resampling]
G --> H[Analysis Ready Data]
subgraph "Time Operations"
I[dt.year, dt.month]
J[rolling_mean_by]
K[group_by_dynamic]
L[shift, diff]
end
E --> I
F --> J
G --> K
F --> L
11. Advanced Features
Lazy Evaluation
# Lazy DataFrames for query optimization
lazy_df = pl.scan_csv("large_file.csv")
# Build query without execution
query = (
lazy_df
.filter(pl.col("value") > 100)
.group_by("category")
.agg([
    # Aliases are required here so the two aggregates do not collide on the name "value"
    pl.col("value").sum().alias("value_sum"),
    pl.col("value").count().alias("value_count")
])
.sort("value_sum", descending=True)
.limit(10)
)
# Show query plan
print(query.explain())
# Execute query
result = query.collect()
# Streaming for large datasets
result = query.collect(streaming=True)
Custom Functions and Expressions
# Custom Python functions with map_elements
def custom_function(x):
return x ** 2 + 1
df = df.with_columns([
pl.col("value").map_elements(custom_function, return_dtype=pl.Float64).alias("custom")
])
# Apply function to multiple columns
def process_row(row):
return row[0] * row[1] + row[2]
df = df.with_columns([
pl.struct(["a", "b", "c"]).map_elements(
lambda x: process_row([x["a"], x["b"], x["c"]]),
return_dtype=pl.Float64
).alias("processed")
])
# Custom aggregation functions
def weighted_mean(values, weights):
return (values * weights).sum() / weights.sum()
result = df.group_by("group").agg([
pl.struct(["value", "weight"]).map_elements(
lambda x: weighted_mean(x.struct.field("value"), x.struct.field("weight")),
return_dtype=pl.Float64
).alias("weighted_avg")
])
Working with Lists and Structs
# List operations
df = pl.DataFrame({
"id": [1, 2, 3],
"numbers": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
"names": [["Alice", "Bob"], ["Charlie"], ["Diana", "Eve", "Frank"]]
})
# List manipulations
df = df.with_columns([
pl.col("numbers").list.len().alias("list_length"),
pl.col("numbers").list.sum().alias("list_sum"),
pl.col("numbers").list.mean().alias("list_mean"),
pl.col("numbers").list.get(0).alias("first_item"),
pl.col("numbers").list.slice(1, 2).alias("slice_1_2"),
pl.col("names").list.join(", ").alias("names_joined")
])
# Explode lists to rows
exploded = df.select(["id", "numbers"]).explode("numbers")
# Struct operations
df_struct = pl.DataFrame({
"id": [1, 2, 3],
"person": [
{"name": "Alice", "age": 25},
{"name": "Bob", "age": 30},
{"name": "Charlie", "age": 35}
]
})
df_struct = df_struct.with_columns([
pl.col("person").struct.field("name").alias("name"),
pl.col("person").struct.field("age").alias("age")
])
Window Functions
# Window functions for ranking and statistics
df = pl.DataFrame({
"department": ["Sales", "Sales", "Engineering", "Engineering", "Marketing"],
"employee": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"salary": [50000, 60000, 80000, 90000, 55000]
})
# Ranking functions
df = df.with_columns([
pl.col("salary").rank().over("department").alias("rank_in_dept"),
pl.col("salary").rank(method="dense").over("department").alias("dense_rank"),
pl.col("salary").rank(descending=True).over("department").alias("rank_desc")
])
# Statistical window functions
df = df.with_columns([
pl.col("salary").mean().over("department").alias("dept_avg_salary"),
pl.col("salary").std().over("department").alias("dept_salary_std"),
(pl.col("salary") - pl.col("salary").mean().over("department")).alias("salary_diff_from_avg")
])
# Cumulative operations within groups
df = df.sort(["department", "salary"]).with_columns([
pl.col("salary").cumsum().over("department").alias("cumulative_salary"),
pl.col("salary").cumcount().over("department").alias("employee_number")
])
Parallel Processing and Performance
# Thread usage is controlled by the POLARS_MAX_THREADS environment variable,
# which must be set before polars is imported (e.g. os.environ["POLARS_MAX_THREADS"] = "8")
# Parallel operations on lazy frames
lazy_results = []
for file in file_list:
lazy_df = (
pl.scan_csv(file)
.filter(pl.col("date").is_between(start_date, end_date))
.group_by("category")
.agg(pl.col("value").sum())
)
lazy_results.append(lazy_df)
# Combine results
final_result = pl.concat(lazy_results).collect()
# Streaming for memory efficiency
large_result = (
pl.scan_csv("very_large_file.csv")
.filter(pl.col("important_flag") == True)
.select(["id", "value", "category"])
.collect(streaming=True)
)
Plugin System and Extensions
# Using Polars plugins (example with polars-xdt for extended datetime)
import polars_xdt as xdt
df = df.with_columns([
pl.col("date").xdt.format_localized("%A, %B %d, %Y", locale="en_US"),
pl.col("date").xdt.is_weekend(),
pl.col("date").xdt.workday_count(pl.col("end_date"))
])
# Custom namespace example
@pl.api.register_expr_namespace("business")
class BusinessOps:
def __init__(self, expr: pl.Expr):
self._expr = expr
    def workdays_between(self, end_date: pl.Expr) -> pl.Expr:
        # Custom business logic: pair start and end dates row-wise, then count
        # business days with NumPy (assumes numpy is imported as np)
        return pl.struct([self._expr.alias("start"), end_date.alias("end")]).map_elements(
            lambda row: int(np.busday_count(row["start"], row["end"])),
            return_dtype=pl.Int32
        )
# Usage
df = df.with_columns([
pl.col("start_date").business.workdays_between(pl.col("end_date"))
])
12. Performance Optimization
Query Optimization Strategies
# Use lazy evaluation for complex queries
optimized_query = (
pl.scan_parquet("large_dataset.parquet")
.filter(pl.col("date").is_between(start_date, end_date)) # Early filtering
.select(["id", "value", "category"]) # Column selection
.filter(pl.col("value") > threshold) # Additional filtering
.group_by("category")
.agg(pl.col("value").sum())
.collect()
)
# Predicate pushdown example
lazy_df = (
pl.scan_csv("data.csv")
.filter(pl.col("year") == 2023) # This filter is pushed down to file reading
.select(["month", "sales"])
.group_by("month")
.agg(pl.col("sales").sum())
)
Memory Management
# Use appropriate data types
df = df.with_columns([
pl.col("small_int").cast(pl.Int8), # Instead of Int64
pl.col("category").cast(pl.Categorical), # For repeated strings
pl.col("flag").cast(pl.Boolean), # Instead of string
pl.col("percentage").cast(pl.Float32) # Instead of Float64 when precision allows
])
# Chunking for large operations
def process_in_chunks(df, chunk_size=10000):
results = []
for i in range(0, len(df), chunk_size):
chunk = df.slice(i, chunk_size)
processed = chunk.with_columns([
# Expensive operations here
pl.col("value").map_elements(expensive_function)
])
results.append(processed)
return pl.concat(results)
# Memory-efficient file processing
def process_large_file(filepath):
return (
pl.scan_csv(filepath)
.filter(pl.col("relevant_column").is_not_null())
.select(["id", "important_col1", "important_col2"])
.collect(streaming=True) # Process without loading entire file
)
Parallel Processing Patterns
# Parallel file processing
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp
def process_file(filepath):
return (
pl.read_csv(filepath)
.filter(pl.col("status") == "active")
.group_by("category")
.agg(pl.col("value").sum())
)
# Process multiple files in parallel
file_paths = ["file1.csv", "file2.csv", "file3.csv"]
with ProcessPoolExecutor(max_workers=mp.cpu_count()) as executor:
results = list(executor.map(process_file, file_paths))
combined = pl.concat(results)
# Parallel groupby operations
large_df = pl.read_parquet("large_dataset.parquet")
# This automatically uses multiple threads
result = (
large_df
.group_by("large_grouping_column")
.agg([
pl.col("numeric_col").sum(),
pl.col("numeric_col").mean(),
pl.col("categorical_col").n_unique()
])
)
Efficient Data Loading
# Optimized CSV reading
df = pl.read_csv(
"large_file.csv",
use_pyarrow=True, # Use PyArrow engine for better performance
rechunk=True, # Optimize memory layout
low_memory=False, # Use more memory for speed
ignore_errors=True # Skip problematic rows
)
# Lazy loading with projections
df = (
pl.scan_csv("huge_file.csv")
.select(["col1", "col2", "col3"]) # Only read needed columns
.filter(pl.col("date") > cutoff_date) # Filter early
.collect()
)
# Parquet optimization
# Save with optimal settings
df.write_parquet(
"optimized.parquet",
compression="snappy", # Good balance of speed and size
statistics=True, # Enable predicate pushdown
row_group_size=100000 # Optimize for your use case
)
Benchmarking and Profiling
import time
from functools import wraps
def timing_decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
end = time.time()
print(f"{func.__name__} took {end - start:.4f} seconds")
return result
return wrapper
@timing_decorator
def slow_operation(df):
return df.group_by("category").agg(pl.col("value").sum())
@timing_decorator
def optimized_operation(df):
return (
df.lazy()
.group_by("category")
.agg(pl.col("value").sum())
.collect()
)
# Memory profiling
def profile_memory(operation, df):
import psutil
process = psutil.Process()
memory_before = process.memory_info().rss / 1024 / 1024 # MB
result = operation(df)
memory_after = process.memory_info().rss / 1024 / 1024 # MB
print(f"Memory used: {memory_after - memory_before:.2f} MB")
return result
Performance Optimization Checklist
graph TD
A[Performance Issue] --> B{Identify Bottleneck}
B --> C[I/O Bound]
B --> D[CPU Bound]
B --> E[Memory Bound]
C --> F[Use lazy evaluation]
C --> G[Optimize file formats]
C --> H[Column selection early]
D --> I[Use proper data types]
D --> J[Vectorized operations]
D --> K[Parallel processing]
E --> L[Streaming mode]
E --> M[Chunking]
E --> N[Reduce data size]
F --> O[Benchmark Results]
G --> O
H --> O
I --> O
J --> O
K --> O
L --> O
M --> O
N --> O
13. Real-World Examples
Example 1: Sales Data Analysis
# Load and prepare sales data
sales_df = pl.read_csv("sales_data.csv").with_columns([
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("amount").cast(pl.Float64),
pl.col("category").cast(pl.Categorical)
])
# Monthly sales analysis
monthly_sales = (
sales_df
.group_by_dynamic("date", every="1mo", label="left")
.agg([
pl.col("amount").sum().alias("total_sales"),
pl.col("amount").count().alias("transaction_count"),
pl.col("amount").mean().alias("avg_transaction"),
pl.col("customer_id").n_unique().alias("unique_customers")
])
.with_columns([
pl.col("total_sales").pct_change().alias("sales_growth"),
(pl.col("total_sales") / pl.col("transaction_count")).alias("avg_order_value")
])
)
# Top performing categories
category_performance = (
sales_df
.group_by("category")
.agg([
pl.col("amount").sum().alias("total_revenue"),
pl.col("amount").count().alias("transactions"),
pl.col("customer_id").n_unique().alias("customers")
])
.with_columns([
(pl.col("total_revenue") / pl.col("total_revenue").sum() * 100).alias("revenue_share"),
pl.col("total_revenue").rank(method="ordinal", descending=True).alias("rank")
])
.sort("total_revenue", descending=True)
)
# Customer segmentation
customer_segments = (
sales_df
.group_by("customer_id")
.agg([
pl.col("amount").sum().alias("total_spent"),
pl.col("amount").count().alias("purchase_frequency"),
pl.col("date").max().alias("last_purchase"),
pl.col("date").min().alias("first_purchase")
])
.with_columns([
(pl.col("last_purchase") - pl.col("first_purchase")).dt.days().alias("customer_lifetime_days"),
pl.when(pl.col("total_spent") > 1000).then("High Value")
.when(pl.col("total_spent") > 500).then("Medium Value")
.otherwise("Low Value").alias("value_segment"),
pl.when(pl.col("purchase_frequency") > 10).then("Frequent")
.when(pl.col("purchase_frequency") > 3).then("Regular")
.otherwise("Occasional").alias("frequency_segment")
])
)
Example 2: Log File Analysis
# Parse web server logs
log_pattern = r'(\S+) - - \[([^\]]+)\] "(\S+) ([^\s]+) ([^"]+)" (\d+) (\d+)'
logs_df = (
    pl.read_csv("access.log", separator="\t", has_header=False, new_columns=["raw_log"])
    .with_columns([
        # str.extract pulls a single capture group per call; group 5 (the protocol) is skipped
        pl.col("raw_log").str.extract(log_pattern, 1).alias("ip"),
        pl.col("raw_log").str.extract(log_pattern, 2)
            .str.strptime(pl.Datetime, "%d/%b/%Y:%H:%M:%S %z").alias("timestamp"),
        pl.col("raw_log").str.extract(log_pattern, 3).alias("method"),
        pl.col("raw_log").str.extract(log_pattern, 4).alias("url"),
        pl.col("raw_log").str.extract(log_pattern, 6).cast(pl.Int32).alias("status_code"),
        pl.col("raw_log").str.extract(log_pattern, 7).cast(pl.Int32).alias("response_size")
    ])
    .select(["ip", "timestamp", "method", "url", "status_code", "response_size"])
)
# Traffic analysis
hourly_traffic = (
logs_df
.group_by_dynamic("timestamp", every="1h")
.agg([
pl.col("ip").count().alias("requests"),
pl.col("ip").n_unique().alias("unique_visitors"),
pl.col("response_size").sum().alias("total_bytes"),
pl.col("status_code").filter(pl.col("status_code") >= 400).count().alias("error_count")
])
.with_columns([
(pl.col("error_count") / pl.col("requests") * 100).alias("error_rate"),
(pl.col("total_bytes") / 1024 / 1024).alias("total_mb")
])
)
# Top endpoints
top_endpoints = (
logs_df
.group_by("url")
.agg([
pl.col("ip").count().alias("hits"),
pl.col("ip").n_unique().alias("unique_visitors"),
pl.col("response_size").mean().alias("avg_response_size")
])
.sort("hits", descending=True)
.head(20)
)
# Bot detection
potential_bots = (
logs_df
.group_by("ip")
.agg([
pl.col("url").count().alias("request_count"),
pl.col("url").n_unique().alias("unique_urls"),
pl.col("timestamp").max().alias("last_seen"),
pl.col("timestamp").min().alias("first_seen")
])
.with_columns([
(pl.col("last_seen") - pl.col("first_seen")).dt.seconds().alias("session_duration"),
(pl.col("request_count") / pl.col("unique_urls")).alias("requests_per_url")
])
.filter(
(pl.col("request_count") > 100) &
(pl.col("requests_per_url") > 10) &
(pl.col("session_duration") < 3600) # Less than 1 hour
)
)
Example 3: Financial Data Processing
# Stock price analysis
stock_df = pl.read_csv("stock_prices.csv").with_columns([
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("open", "high", "low", "close", "volume").cast(pl.Float64)
])
# Technical indicators
stock_analysis = (
stock_df
.sort("date")
.with_columns([
# Simple moving averages
pl.col("close").rolling_mean(window_size=20).alias("sma_20"),
pl.col("close").rolling_mean(window_size=50).alias("sma_50"),
# Exponential moving average (approximation)
pl.col("close").ewm_mean(span=12).alias("ema_12"),
# Price changes
pl.col("close").pct_change().alias("daily_return"),
(pl.col("close") / pl.col("close").shift(252) - 1).alias("annual_return"),
# Volatility (20-day rolling standard deviation of returns)
pl.col("close").pct_change().rolling_std(window_size=20).alias("volatility_20d"),
# Support and resistance levels
pl.col("low").rolling_min(window_size=20).alias("support_20d"),
pl.col("high").rolling_max(window_size=20).alias("resistance_20d")
])
)
# Risk metrics
risk_metrics = (
stock_analysis
.select([
pl.col("daily_return").mean().alias("avg_daily_return"),
pl.col("daily_return").std().alias("daily_volatility"),
pl.col("annual_return").mean().alias("avg_annual_return"),
(pl.col("daily_return").mean() / pl.col("daily_return").std()).alias("sharpe_ratio_daily"),
pl.col("daily_return").filter(pl.col("daily_return") < 0).count().alias("down_days"),
pl.col("daily_return").count().alias("total_days")
])
.with_columns([
(pl.col("down_days") / pl.col("total_days")).alias("probability_of_loss"),
(pl.col("avg_daily_return") * 252).alias("annualized_return"),
(pl.col("daily_volatility") * (252 ** 0.5)).alias("annualized_volatility")
])
)
# Portfolio optimization (equal weight example)
portfolio_stocks = ["AAPL", "GOOGL", "MSFT", "AMZN"]
portfolio_analysis = (
stock_df
.filter(pl.col("symbol").is_in(portfolio_stocks))
.group_by("date")
.agg([
pl.col("close").mean().alias("portfolio_close"),
pl.col("volume").sum().alias("total_volume")
])
.sort("date")
.with_columns([
pl.col("portfolio_close").pct_change().alias("portfolio_return"),
pl.col("portfolio_close").rolling_std(window_size=252).alias("portfolio_volatility")
])
)
Example 4: IoT Sensor Data Processing
# Process IoT sensor data
sensor_df = (
pl.read_csv("sensor_data.csv")
.with_columns([
pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"),
pl.col("temperature", "humidity", "pressure").cast(pl.Float64),
pl.col("sensor_id").cast(pl.Categorical)
])
)
# Anomaly detection using statistical methods
anomaly_detection = (
sensor_df
.sort(["sensor_id", "timestamp"])
.with_columns([
# Calculate rolling statistics
pl.col("temperature").rolling_mean(window_size=100).over("sensor_id").alias("temp_mean"),
pl.col("temperature").rolling_std(window_size=100).over("sensor_id").alias("temp_std"),
pl.col("humidity").rolling_mean(window_size=100).over("sensor_id").alias("humidity_mean"),
pl.col("humidity").rolling_std(window_size=100).over("sensor_id").alias("humidity_std")
])
.with_columns([
    # Z-score for anomaly detection
    ((pl.col("temperature") - pl.col("temp_mean")) / pl.col("temp_std")).abs().alias("temp_zscore"),
    ((pl.col("humidity") - pl.col("humidity_mean")) / pl.col("humidity_std")).abs().alias("humidity_zscore")
])
.with_columns([
    # Flag anomalies (beyond 3 standard deviations); the z-score columns must be
    # created in a previous step before they can be referenced here
    (pl.col("temp_zscore") > 3).alias("temp_anomaly"),
    (pl.col("humidity_zscore") > 3).alias("humidity_anomaly")
])
.with_columns([
(pl.col("temp_anomaly") | pl.col("humidity_anomaly")).alias("any_anomaly")
])
)
# Sensor health monitoring
sensor_health = (
sensor_df
.group_by_dynamic("timestamp", every="1h", group_by="sensor_id")
.agg([
pl.col("temperature").count().alias("readings_count"),
pl.col("temperature").mean().alias("avg_temp"),
pl.col("temperature").std().alias("temp_variance"),
pl.col("humidity").mean().alias("avg_humidity"),
pl.col("pressure").mean().alias("avg_pressure")
])
.with_columns([
    # Expected readings per hour (assuming 1 reading per minute)
    (pl.col("readings_count") / 60).alias("data_completeness")
])
.with_columns([
    # Flag sensors with issues (data_completeness must already exist to be referenced)
    pl.when(pl.col("readings_count") < 50).then(pl.lit("Low Data"))
    .when(pl.col("temp_variance") > 10).then(pl.lit("High Variance"))
    .when(pl.col("data_completeness") < 0.8).then(pl.lit("Missing Data"))
    .otherwise(pl.lit("Normal")).alias("health_status")
])
)
# Data quality report
data_quality_report = (
sensor_df
.group_by("sensor_id")
.agg([
pl.col("timestamp").count().alias("total_readings"),
pl.col("temperature").null_count().alias("temp_nulls"),
pl.col("humidity").null_count().alias("humidity_nulls"),
pl.col("pressure").null_count().alias("pressure_nulls"),
pl.col("timestamp").min().alias("first_reading"),
pl.col("timestamp").max().alias("last_reading")
])
.with_columns([
(pl.col("last_reading") - pl.col("first_reading")).dt.days().alias("operational_days"),
(pl.col("temp_nulls") / pl.col("total_readings") * 100).alias("temp_null_pct"),
(pl.col("humidity_nulls") / pl.col("total_readings") * 100).alias("humidity_null_pct"),
(pl.col("pressure_nulls") / pl.col("total_readings") * 100).alias("pressure_null_pct")
])
)
14. Migration from Pandas
Syntax Comparison
# Pandas vs Polars equivalent operations
# Data Loading
# Pandas
df_pandas = pd.read_csv("data.csv")
# Polars
df_polars = pl.read_csv("data.csv")
# Column Selection
# Pandas
df_pandas[["col1", "col2"]]
df_pandas.loc[:, "col1":"col3"]
# Polars
df_polars.select(["col1", "col2"])
df_polars.select(pl.col("col1", "col2", "col3"))
# Filtering continued
# Pandas
df_pandas[df_pandas["age"] > 25]
df_pandas.query("age > 25 and city == 'NYC'")
# Polars
df_polars.filter(pl.col("age") > 25)
df_polars.filter((pl.col("age") > 25) & (pl.col("city") == "NYC"))
# GroupBy and Aggregation
# Pandas
df_pandas.groupby("category").agg({"value": ["sum", "mean", "count"]})
# Polars
df_polars.group_by("category").agg([
    # Aliases keep the three aggregates from colliding on the output name "value"
    pl.col("value").sum().alias("value_sum"),
    pl.col("value").mean().alias("value_mean"),
    pl.col("value").count().alias("value_count")
])
# Adding Columns
# Pandas
df_pandas["new_col"] = df_pandas["col1"] * df_pandas["col2"]
df_pandas.assign(new_col=lambda x: x["col1"] * x["col2"])
# Polars
df_polars.with_columns([
(pl.col("col1") * pl.col("col2")).alias("new_col")
])
# Sorting
# Pandas
df_pandas.sort_values(["col1", "col2"], ascending=[True, False])
# Polars
df_polars.sort(["col1", "col2"], descending=[False, True])
# Joins
# Pandas
df1.merge(df2, on="key", how="left")
# Polars
df1.join(df2, on="key", how="left")
Migration Strategy
flowchart TD
A[Pandas Code Analysis] --> B[Identify Operations]
B --> C[Map to Polars Equivalents]
C --> D[Test Performance]
D --> E[Optimize with Lazy Operations]
E --> F[Production Deployment]
subgraph "Common Challenges"
G[Index Operations]
H[Inplace Operations]
I[Method Chaining Differences]
J[API Differences]
end
B --> G
B --> H
B --> I
B --> J
Common Migration Patterns
# Pattern 1: Index-based operations
# Pandas (avoid in Polars)
df_pandas.reset_index()
df_pandas.set_index("column")
# Polars equivalent
df_polars.with_row_count("index") # Add row numbers
df_polars.sort("column") # Use sorting instead of indexing
# Pattern 2: Inplace operations
# Pandas
df_pandas.dropna(inplace=True)
df_pandas.fillna(0, inplace=True)
# Polars (immutable)
df_polars = df_polars.drop_nulls()
df_polars = df_polars.fill_null(0)
# Pattern 3: Apply functions
# Pandas
df_pandas["new_col"] = df_pandas["col"].apply(lambda x: x ** 2)
# Polars
df_polars = df_polars.with_columns([
pl.col("col").map_elements(lambda x: x ** 2, return_dtype=pl.Float64).alias("new_col")
])
# Better Polars (vectorized)
df_polars = df_polars.with_columns([
pl.col("col").pow(2).alias("new_col")
])
Performance Comparison
import time
import pandas as pd
import polars as pl
import numpy as np
# Generate test data
n_rows = 1_000_000
data = {
"id": range(n_rows),
"value": np.random.randn(n_rows),
"category": np.random.choice(["A", "B", "C", "D"], n_rows),
"date": pd.date_range("2020-01-01", periods=n_rows, freq="1min")
}
df_pandas = pd.DataFrame(data)
df_polars = pl.DataFrame(data)
# Benchmark groupby operations
def benchmark_groupby():
# Pandas
start = time.time()
result_pandas = df_pandas.groupby("category").agg({
"value": ["sum", "mean", "std", "count"]
})
pandas_time = time.time() - start
# Polars
start = time.time()
result_polars = df_polars.group_by("category").agg([
    # Aliases keep the four aggregates from colliding on the output name "value"
    pl.col("value").sum().alias("sum"),
    pl.col("value").mean().alias("mean"),
    pl.col("value").std().alias("std"),
    pl.col("value").count().alias("count")
])
polars_time = time.time() - start
print(f"Pandas: {pandas_time:.4f}s")
print(f"Polars: {polars_time:.4f}s")
print(f"Speedup: {pandas_time / polars_time:.2f}x")
benchmark_groupby()
15. Best Practices
Code Organization
# 1. Use lazy evaluation for complex pipelines
def process_sales_data(file_path: str) -> pl.DataFrame:
return (
pl.scan_csv(file_path)
.filter(pl.col("date").is_between("2023-01-01", "2023-12-31"))
.with_columns([
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("amount").cast(pl.Float64)
])
.group_by(["product", "region"])
.agg([
pl.col("amount").sum().alias("total_sales"),
pl.col("amount").count().alias("transactions")
])
.collect()
)
# 2. Create reusable expression functions
def create_age_groups(age_column: str) -> pl.Expr:
return (
pl.when(pl.col(age_column) < 18).then(pl.lit("Under 18"))
.when(pl.col(age_column) < 65).then(pl.lit("Adult"))
.otherwise(pl.lit("Senior"))
.alias("age_group")
)
# 3. Use configuration for consistent formatting
def setup_polars_config():
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(8)
pl.Config.set_fmt_str_lengths(50)
    # The thread pool size is controlled by POLARS_MAX_THREADS before import, not by pl.Config
Error Handling and Validation
def safe_data_processing(df: pl.DataFrame) -> pl.DataFrame:
try:
# Validate input
required_columns = ["id", "value", "date"]
missing_cols = set(required_columns) - set(df.columns)
if missing_cols:
raise ValueError(f"Missing columns: {missing_cols}")
# Safe type casting
result = df.with_columns([
pl.col("value").cast(pl.Float64, strict=False),
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
])
# Check for processing errors
null_dates = result.filter(pl.col("date").is_null()).height
if null_dates > 0:
print(f"Warning: {null_dates} rows with invalid dates")
return result
except Exception as e:
print(f"Error processing data: {e}")
raise
# Schema validation
def validate_schema(df: pl.DataFrame, expected_schema: dict) -> bool:
"""Validate DataFrame schema against expected types"""
for col, expected_type in expected_schema.items():
if col not in df.columns:
raise ValueError(f"Missing column: {col}")
if df[col].dtype != expected_type:
raise TypeError(f"Column {col} has type {df[col].dtype}, expected {expected_type}")
return True
# Usage
expected_schema = {
"id": pl.Int64,
"name": pl.Utf8,
"value": pl.Float64
}
validate_schema(df, expected_schema)
Memory and Performance Best Practices
# 1. Use streaming for large datasets
def process_large_dataset(file_path: str):
return (
pl.scan_parquet(file_path)
.filter(pl.col("active") == True)
.select(["id", "value", "category"])
.group_by("category")
.agg(pl.col("value").sum())
.collect(streaming=True) # Process in chunks
)
# 2. Optimize data types
def optimize_dtypes(df: pl.DataFrame) -> pl.DataFrame:
"""Optimize data types for memory efficiency"""
return df.with_columns([
# Use smaller integer types when possible
pl.col("small_numbers").cast(pl.Int16),
# Use categorical for repeated strings
pl.col("categories").cast(pl.Categorical),
# Use appropriate float precision
pl.col("percentages").cast(pl.Float32)
])
# 3. Efficient filtering order
def efficient_filtering(df: pl.DataFrame):
"""Apply filters in order of selectivity"""
return (
df
.filter(pl.col("status") == "active") # Most selective first
.filter(pl.col("date") > "2023-01-01") # Then date range
.filter(pl.col("value") > 0) # Least selective last
)
# 4. Batch operations
def batch_process_files(file_paths: list, batch_size: int = 10):
"""Process files in batches to manage memory"""
results = []
for i in range(0, len(file_paths), batch_size):
batch = file_paths[i:i + batch_size]
batch_dfs = [pl.read_csv(f) for f in batch]
combined = pl.concat(batch_dfs)
processed = combined.group_by("category").agg(pl.col("value").sum())
results.append(processed)
return pl.concat(results)
Testing Strategies
import pytest
def test_data_transformation():
"""Test data transformation functions"""
# Sample data
test_df = pl.DataFrame({
"age": [15, 25, 70],
"name": ["Alice", "Bob", "Charlie"]
})
# Apply transformation
result = test_df.with_columns([create_age_groups("age")])
# Assertions
assert "age_group" in result.columns
assert result["age_group"].to_list() == ["Under 18", "Adult", "Senior"]
def test_aggregation_consistency():
"""Test that aggregations produce consistent results"""
df = pl.DataFrame({
"group": ["A", "A", "B", "B"],
"value": [1, 2, 3, 4]
})
result = df.group_by("group").agg(pl.col("value").sum())
expected = pl.DataFrame({
"group": ["A", "B"],
"value": [3, 7]
})
assert result.sort("group").equals(expected.sort("group"))
# Property-based testing example
from hypothesis import given, strategies as st
import hypothesis.extra.pandas as st_pd
@given(st_pd.data_frames([
st_pd.column("value", elements=st.floats(0, 1000)),
st_pd.column("category", elements=st.text(min_size=1, max_size=5))
]))
def test_groupby_sum_is_positive(df_pandas):
"""Property test: sum of positive values should be positive"""
df = pl.from_pandas(df_pandas)
if df.height > 0:
result = df.group_by("category").agg(pl.col("value").sum())
assert all(result["value"] >= 0)
Documentation and Type Hints
from typing import List, Dict, Optional, Union
import polars as pl
def analyze_sales_trends(
df: pl.DataFrame,
date_column: str = "date",
value_column: str = "sales",
groupby_columns: Optional[List[str]] = None,
rolling_window: int = 30
) -> Dict[str, pl.DataFrame]:
"""
Analyze sales trends with rolling averages and growth rates.
Args:
df: Input DataFrame with sales data
date_column: Name of the date column
value_column: Name of the sales value column
groupby_columns: Optional columns to group by
rolling_window: Window size for rolling average (days)
Returns:
Dictionary containing trend analysis results:
- 'trends': DataFrame with rolling averages and growth rates
- 'summary': DataFrame with summary statistics
Example:
>>> df = pl.DataFrame({
... "date": ["2023-01-01", "2023-01-02"],
... "sales": [100, 150],
... "region": ["North", "South"]
... })
>>> results = analyze_sales_trends(df, groupby_columns=["region"])
"""
# Input validation
required_columns = [date_column, value_column]
missing_columns = set(required_columns) - set(df.columns)
if missing_columns:
raise ValueError(f"Missing required columns: {missing_columns}")
# Main analysis
if groupby_columns:
trends = (
df
.sort([*groupby_columns, date_column])
.with_columns([
pl.col(value_column)
.rolling_mean(rolling_window)
.over(groupby_columns)
.alias(f"rolling_avg_{rolling_window}d"),
pl.col(value_column)
.pct_change()
.over(groupby_columns)
.alias("growth_rate")
])
)
summary = (
df
.group_by(groupby_columns)
.agg([
pl.col(value_column).sum().alias("total_sales"),
pl.col(value_column).mean().alias("avg_sales"),
pl.col(value_column).std().alias("sales_volatility")
])
)
else:
trends = df.sort(date_column).with_columns([
pl.col(value_column).rolling_mean(rolling_window).alias(f"rolling_avg_{rolling_window}d"),
pl.col(value_column).pct_change().alias("growth_rate")
])
summary = df.select([
pl.col(value_column).sum().alias("total_sales"),
pl.col(value_column).mean().alias("avg_sales"),
pl.col(value_column).std().alias("sales_volatility")
])
return {
"trends": trends,
"summary": summary
}
Production Deployment Best Practices
import logging
import os
from pathlib import Path
from typing import Optional
import polars as pl
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class DataProcessor:
"""Production-ready data processor with error handling and logging"""
def __init__(self, config: dict):
self.config = config
self.setup_polars()
def setup_polars(self):
"""Configure Polars for production use"""
        # POLARS_MAX_THREADS only takes effect if set before polars is imported
        os.environ.setdefault("POLARS_MAX_THREADS", str(self.config.get("threads", 4)))
pl.Config.set_streaming_chunk_size(self.config.get("chunk_size", 10000))
def process_file(self, file_path: Path) -> Optional[pl.DataFrame]:
"""Process a single file with error handling"""
try:
logger.info(f"Processing file: {file_path}")
# Validate file exists
if not file_path.exists():
logger.error(f"File not found: {file_path}")
return None
# Process with lazy evaluation
result = (
pl.scan_csv(file_path)
.filter(pl.col("date").is_not_null())
.filter(pl.col("value") > 0)
.with_columns([
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("value").cast(pl.Float64)
])
.collect(streaming=True)
)
logger.info(f"Successfully processed {result.height} rows")
return result
except Exception as e:
logger.error(f"Error processing {file_path}: {e}")
return None
def save_results(self, df: pl.DataFrame, output_path: Path):
"""Save results with appropriate format"""
try:
output_path.parent.mkdir(parents=True, exist_ok=True)
if output_path.suffix == ".parquet":
df.write_parquet(output_path, compression="snappy")
elif output_path.suffix == ".csv":
df.write_csv(output_path)
else:
raise ValueError(f"Unsupported output format: {output_path.suffix}")
logger.info(f"Results saved to: {output_path}")
except Exception as e:
logger.error(f"Error saving results: {e}")
raise
# Configuration management
def load_config() -> dict:
"""Load configuration from environment or config file"""
return {
"threads": int(os.getenv("POLARS_THREADS", 4)),
"chunk_size": int(os.getenv("POLARS_CHUNK_SIZE", 10000)),
"input_dir": Path(os.getenv("INPUT_DIR", "./data")),
"output_dir": Path(os.getenv("OUTPUT_DIR", "./output"))
}
Best Practices Summary
mindmap
root((Polars Best Practices))
Performance
Use lazy evaluation
Optimize data types
Stream large datasets
Filter early and efficiently
Code Quality
Type hints
Error handling
Comprehensive testing
Clear documentation
Production
Logging and monitoring
Configuration management
Resource management
Graceful error handling
Migration
Incremental approach
Performance benchmarking
Feature parity validation
Team training
Final Recommendations
- Start Simple: Begin with basic operations and gradually adopt advanced features
- Leverage Lazy Evaluation: Use scan_* functions and .lazy() for better performance
- Optimize Data Types: Choose appropriate types to reduce memory usage
- Test Thoroughly: Write comprehensive tests for data transformations
- Monitor Performance: Benchmark critical operations and optimize bottlenecks
- Handle Errors Gracefully: Implement proper error handling and validation
- Document Well: Provide clear documentation for complex operations
- Stay Updated: Keep up with Polars releases and new features
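To act on that last recommendation, a minimal sketch using only standard Polars utilities is enough to confirm which build you are running before adopting a new feature:
import polars as pl

# The installed Polars version as a string
print(pl.__version__)

# A fuller environment report (Polars build plus versions of optional dependencies)
pl.show_versions()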
Appendix A: Quick Reference and Cheat Sheets
Essential Imports and Setup
import polars as pl
import numpy as np
from datetime import datetime, date, timedelta
# Configuration
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(8)
pl.Config.set_fmt_str_lengths(50)
DataFrame Creation Cheat Sheet
# From dictionary (most common)
df = pl.DataFrame({
"name": ["Alice", "Bob"],
"age": [25, 30],
"salary": [50000, 60000]
})
# From lists with schema
df = pl.DataFrame([
["Alice", 25], ["Bob", 30]
], schema=["name", "age"], orient="row")
# Empty with schema
df = pl.DataFrame(schema={
"id": pl.Int64,
"name": pl.Utf8,
"value": pl.Float64
})
File I/O Quick Reference
# Reading files
df = pl.read_csv("file.csv")
df = pl.read_parquet("file.parquet")
df = pl.read_json("file.json")
# Lazy reading (for large files)
df = pl.scan_csv("large_file.csv").collect()
# Writing files
df.write_csv("output.csv")
df.write_parquet("output.parquet")
df.write_json("output.json")
Selection and Filtering Cheat Sheet
# Column selection
df.select("name") # Single column
df.select(["name", "age"]) # Multiple columns
df.select(pl.col("^.*_id$")) # Regex pattern
df.select(pl.exclude("id")) # All except id
# Row filtering
df.filter(pl.col("age") > 25) # Simple condition
df.filter((pl.col("age") > 20) & (pl.col("age") < 60)) # Multiple conditions
df.filter(pl.col("name").str.contains("Alice")) # String operations
df.filter(pl.col("value").is_null()) # Null valuesPythonTransformation Cheat Sheet
# Adding/modifying columns
df.with_columns([
pl.col("age").alias("years"), # Rename
(pl.col("salary") * 1.1).alias("new_salary"), # Calculate
pl.lit("constant").alias("category"), # Add constant
pl.col("name").str.len().alias("name_length") # String operation
])
# Conditional logic
df.with_columns([
pl.when(pl.col("age") > 30)
.then(pl.lit("Senior"))
.otherwise(pl.lit("Junior"))
.alias("category")
])
Aggregation Quick Reference
# Basic aggregations
df.select([
    # Aliases keep the aggregates from colliding on the same output name
    pl.col("value").sum().alias("sum"),
    pl.col("value").mean().alias("mean"),
    pl.col("value").count().alias("count"),
    pl.col("value").min().alias("min"),
    pl.col("value").max().alias("max"),
    pl.col("value").std().alias("std")
])
# Group by operations
df.group_by("category").agg([
pl.col("value").sum().alias("total"),
pl.col("value").mean().alias("average"),
pl.col("id").count().alias("count")
])
Join Operations Quick Reference
# Join types
df1.join(df2, on="key", how="inner") # Inner join
df1.join(df2, on="key", how="left") # Left join
df1.join(df2, on="key", how="outer") # Outer join
df1.join(df2, on="key", how="cross") # Cross join
# Different column names
df1.join(df2, left_on="id", right_on="user_id", how="left")
Data Types Quick Reference
# Numeric types
pl.Int8, pl.Int16, pl.Int32, pl.Int64 # Signed integers
pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64 # Unsigned integers
pl.Float32, pl.Float64 # Floating point
# Text and categorical
pl.Utf8 # String
pl.Categorical # Categories
pl.Binary # Binary data
# Temporal
pl.Date # Date only
pl.Datetime # Date and time
pl.Time # Time only
pl.Duration # Time intervals
# Complex types
pl.List(pl.Int32) # List of integers
pl.Struct([pl.Field("name", pl.Utf8)]) # Structured data
# Special
pl.Boolean # True/False
pl.Null # Null type
Common Expressions
# Column operations
pl.col("name") # Column reference
pl.all() # All columns
pl.exclude("id") # All except specified
pl.lit(42) # Literal value
# String operations
pl.col("text").str.len() # String length
pl.col("text").str.contains("pattern") # Contains substring
pl.col("text").str.to_uppercase() # Uppercase
pl.col("text").str.replace("old", "new") # Replace text
# Date operations
pl.col("date").dt.year() # Extract year
pl.col("date").dt.month() # Extract month
pl.col("date").dt.day() # Extract day
pl.col("date").dt.weekday() # Day of week
# Mathematical operations
pl.col("value").abs() # Absolute value
pl.col("value").sqrt() # Square root
pl.col("value").round(2) # Round to 2 decimals
pl.col("value").rank() # Rank valuesPythonPerformance Tips Summary
# 1. Use lazy evaluation for large datasets
df = pl.scan_csv("large_file.csv").filter(...).collect()
# 2. Use appropriate data types
df = df.with_columns([
pl.col("small_numbers").cast(pl.Int16),
pl.col("categories").cast(pl.Categorical)
])
# 3. Filter early in lazy queries
df = (
pl.scan_csv("data.csv")
.filter(pl.col("date") > "2023-01-01") # Early filtering
.select(["id", "value"]) # Column pruning
.collect()
)
# 4. Use streaming for very large datasets
df = lazy_query.collect(streaming=True)
Appendix B: Troubleshooting Common Issues
Installation Issues
Problem: ImportError: No module named 'polars'
# Solution: Check Python environment
python -m pip install polars
# or
conda install -c conda-forge polars
Problem: Version conflicts
# Solution: Use virtual environment
python -m venv polars_env
source polars_env/bin/activate # Linux/Mac
# or
polars_env\Scripts\activate # Windows
pip install polars[all]
Data Type Issues
Problem: Unexpected data types after loading
# Check data types
print(df.dtypes)
# Fix with explicit casting
df = df.with_columns([
pl.col("number_col").cast(pl.Int64),
pl.col("date_col").str.strptime(pl.Date, "%Y-%m-%d")
])
Problem: Memory usage too high
# Use more efficient data types
df = df.with_columns([
pl.col("small_int").cast(pl.Int16), # Instead of Int64
pl.col("category").cast(pl.Categorical), # Instead of Utf8
pl.col("flag").cast(pl.Boolean) # Instead of string
])
Performance Issues
Problem: Slow operations on large datasets
# Solution 1: Use lazy evaluation
result = (
pl.scan_csv("large_file.csv")
.filter(pl.col("important") == True)
.select(["id", "value"])
.collect()
)
# Solution 2: Use streaming
result = lazy_query.collect(streaming=True)
# Solution 3: Limit parallelism with POLARS_MAX_THREADS (set before importing polars);
# there is no pl.Config option for the thread pool size
Problem: Out of memory errors
# Solution: Process in chunks or use streaming
def process_in_chunks(file_path, chunk_size=10000):
    reader = pl.read_csv_batched(file_path, batch_size=chunk_size)
    partial_results = []
    # next_batches() returns None once the file is exhausted
    while (batches := reader.next_batches(1)) is not None:
        for chunk_df in batches:
            partial_results.append(
                chunk_df.group_by("category").agg(pl.col("value").sum())
            )
    # Combine the per-chunk partial sums into final totals
    return pl.concat(partial_results).group_by("category").agg(pl.col("value").sum())
Common Error Messages
Error: ShapeError: Could not add column
# Problem: Column length mismatch
# Solution: Ensure all columns have the same length
df = pl.DataFrame({
"a": [1, 2, 3],
"b": [4, 5, 6], # Same length as 'a'
})
Error: ComputeError: Could not parse
# Problem: Date/time parsing issues
# Solution: Specify format or use strict=False
df = df.with_columns([
pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
])
Error: ColumnNotFoundError
# Problem: Column name doesn't exist
# Solution: Check column names
print(df.columns) # List all columns
# or use regex for flexible matching
df.select(pl.col("^.*name.*$")) # Columns containing "name"
Appendix C: Performance Benchmarks
Polars vs Pandas Comparison
Based on common operations with 1M rows:
| Operation | Pandas Time | Polars Time | Speedup |
|---|---|---|---|
| GroupBy Sum | 145ms | 12ms | 12.1x |
| Filter + Select | 89ms | 8ms | 11.1x |
| Join Operations | 267ms | 23ms | 11.6x |
| String Operations | 156ms | 19ms | 8.2x |
| Date Parsing | 445ms | 67ms | 6.6x |
| Aggregations | 78ms | 9ms | 8.7x |
Memory Usage Comparison
| Data Type | Pandas Memory | Polars Memory | Reduction |
|---|---|---|---|
| 1M integers | 7.6 MB | 3.8 MB | 50% |
| 1M strings | 54.2 MB | 38.1 MB | 30% |
| 1M categories | 54.2 MB | 8.4 MB | 84% |
| 1M dates | 15.3 MB | 7.6 MB | 50% |
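Exact savings depend on your data, so it is worth measuring on your own tables. The following is a minimal sketch (the column names are made up for illustration) that uses DataFrame.estimated_size to compare memory before and after switching a repeated-string column to Categorical:
import polars as pl

df = pl.DataFrame({
    "category": ["electronics", "clothing", "groceries"] * 100_000,
    "value": list(range(300_000))
})

# Estimated in-memory size in megabytes before optimization
before_mb = df.estimated_size("mb")

# Re-encode the repeated strings as a dictionary-backed Categorical column
optimized = df.with_columns(pl.col("category").cast(pl.Categorical))
after_mb = optimized.estimated_size("mb")

print(f"Before: {before_mb:.1f} MB, after: {after_mb:.1f} MB")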
Appendix D: Resources and Further Learning
Official Resources
- Documentation: https://pola-rs.github.io/polars/
- GitHub Repository: https://github.com/pola-rs/polars
- User Guide: https://pola-rs.github.io/polars-book/
- API Reference: https://pola-rs.github.io/polars/py-polars/html/
Community and Support
- Discord Server: Join the Polars community for real-time help
- Stack Overflow: Tag questions with python-polars
- GitHub Issues: Report bugs and request features
- Reddit: r/PolarsDataFrame community discussions
Related Libraries and Tools
- PyArrow: Arrow format integration
- DuckDB: SQL interface for Polars (see the interop sketch after this list)
- Connectorx: Fast database connectors
- Great Expectations: Data quality validation
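As a quick illustration of the first two items, here is a minimal sketch, assuming duckdb and pyarrow are installed, that converts a Polars DataFrame to and from Arrow and runs a DuckDB SQL query directly against it:
import duckdb
import polars as pl

df = pl.DataFrame({"category": ["A", "B", "A"], "value": [10, 20, 30]})

# PyArrow: convert to an Arrow table and back
arrow_table = df.to_arrow()
roundtripped = pl.from_arrow(arrow_table)

# DuckDB: the engine can see the local `df` variable via replacement scans;
# .pl() converts the query result back into a Polars DataFrame
totals = duckdb.sql("SELECT category, SUM(value) AS total FROM df GROUP BY category").pl()
print(totals)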
Learning Path Recommendations
Beginner Path:
- Master DataFrame creation and basic operations
- Learn filtering and selection patterns
- Practice with real datasets
- Understand lazy evaluation benefits
Intermediate Path:
- Complex aggregations and window functions
- Join operations and data combination
- Time series analysis
- Performance optimization techniques
Advanced Path:
- Custom functions and expressions
- Nested data manipulation
- Production deployment patterns
- Integration with other tools
Sample Datasets for Practice
# Built-in sample data generators
def create_sales_data(n_rows=10000):
import random
random.seed(42)
return pl.DataFrame({
"date": pl.date_range(date(2020, 1, 1), date(2023, 12, 31), "1d")[:n_rows],
"product": [f"Product_{i%100}" for i in range(n_rows)],
"sales": [random.randint(10, 1000) for _ in range(n_rows)],
"region": [random.choice(["North", "South", "East", "West"]) for _ in range(n_rows)]
})
def create_sensor_data(n_rows=50000):
import random
random.seed(42)
return pl.DataFrame({
"timestamp": pl.datetime_range(
datetime(2023, 1, 1),
datetime(2023, 12, 31),
"1h"
)[:n_rows],
"sensor_id": [f"SENSOR_{i%20:03d}" for i in range(n_rows)],
"temperature": [random.gauss(20, 5) for _ in range(n_rows)],
"humidity": [random.gauss(50, 15) for _ in range(n_rows)],
"pressure": [random.gauss(1013, 10) for _ in range(n_rows)]
})
# Usage
sales_df = create_sales_data(10000)
sensor_df = create_sensor_data(5000)
Final Words
This comprehensive guide covers Polars from beginner to expert level, providing practical examples, performance optimization techniques, and production-ready patterns. The combination of detailed explanations, code examples, and visual diagrams using Mermaid makes it a complete resource for mastering Python Polars.
Key Takeaways
- Polars is Revolutionary: Built for modern data processing with speed and efficiency
- Lazy Evaluation: The key to performance optimization and memory efficiency
- Expression System: Powerful, composable operations that are both readable and fast
- Production Ready: Robust features for real-world data processing pipelines
- Growing Ecosystem: Active community and continuous development
Your Journey Forward
- Practice Regularly: Apply concepts with real datasets
- Join the Community: Engage with other Polars users
- Stay Updated: Follow Polars development for new features
- Share Knowledge: Contribute to the growing Polars ecosystem
Happy data processing with Polars! 🚀