From Beginner to Expert
Master the fastest DataFrame library in Python with practical examples, advanced techniques, and real-world applications
About This Book
This comprehensive guide will take you from complete beginner to Polars expert through hands-on examples, practical exercises, and real-world case studies. Each chapter builds upon previous concepts while introducing new techniques and best practices.
Who This Book Is For
- Data Analysts looking for faster data processing
- Data Scientists wanting to optimize their workflows
- Python Developers seeking performant data manipulation tools
- Pandas Users ready to upgrade to next-generation DataFrame operations
Prerequisites
- Basic Python programming knowledge
- Familiarity with data concepts (optional but helpful)
- No prior Polars experience required
Table of Contents
Part I: Foundations
- Why Polars? Understanding the Revolution
- Getting Started: Installation and Setup
- Core Concepts: DataFrames, Series, and Expressions
- Your First Polars Program
Part II: Essential Operations
- Loading and Creating Data
- Data Inspection and Exploration
- Selecting and Filtering Data
- Data Transformation Fundamentals
Part III: Intermediate Techniques
- Aggregations and Grouping
- Joining and Combining DataFrames
- Working with Strings and Dates
- Time Series Analysis
Part IV: Advanced Features
- Lazy Evaluation and Query Optimization
- Custom Functions and Complex Operations
- Working with Nested Data (Lists and Structs)
- Window Functions and Advanced Analytics
Part V: Performance and Production
- Performance Optimization Strategies
- Memory Management and Large Datasets
- Parallel Processing and Streaming
Part VI: Real-World Applications
- Business Analytics Case Study
- Financial Data Analysis
- IoT and Sensor Data Processing
- Log Analysis and Monitoring
Part VII: Migration and Best Practices
Appendices
- A: Quick Reference and Cheat Sheets
- B: Troubleshooting Common Issues
- C: Performance Benchmarks
- D: Resources and Further Learning
1. Why Polars? Understanding the Revolution
🎯 Learning Objectives
By the end of this chapter, you will:
- Understand what makes Polars revolutionary in data processing
- Know when to choose Polars over other libraries
- Grasp the architectural differences between Polars and Pandas
- Appreciate the key advantages that drive Polars’ performance
The Data Processing Challenge
In today’s data-driven world, we’re dealing with increasingly large datasets. Traditional tools like Pandas, while excellent for small to medium datasets, often struggle with:
- Performance bottlenecks on large datasets
- Memory limitations that cause crashes
- Single-threaded execution that doesn’t utilize modern hardware
- Eager evaluation that processes unnecessary data
Enter Polars: The Game Changer
Polars is a revolutionary DataFrame library that addresses these challenges head-on. Built from the ground up in Rust with Python bindings, it represents the next generation of data processing tools.
What Makes Polars Special?
# A simple example showing Polars' power
import polars as pl
import time
# Create a large dataset
df = pl.DataFrame({
"id": range(1_000_000),
"value": range(1_000_000),
"category": ["A", "B", "C"] * 333_334
})
# Lightning-fast operations
start = time.time()
result = (
df
.filter(pl.col("value") > 500_000)
.group_by("category")
.agg(pl.col("value").sum())
)
print(f"Processed 1M rows in {time.time() - start:.3f} seconds")PythonArchitecture Comparison: The Foundation of Speed
graph TB
subgraph "Traditional Approach (Pandas)"
A[Python Code] --> B[Pandas API]
B --> C[NumPy/C Backend]
C --> D[Single Thread Execution]
D --> E[Eager Evaluation]
E --> F[Result]
end
subgraph "Modern Approach (Polars)"
G[Python Code] --> H[Polars API]
H --> I[Rust Core Engine]
I --> J[Multi-threaded Execution]
J --> K[Lazy Evaluation]
K --> L[Query Optimization]
L --> M[SIMD Instructions]
M --> N[Result]
end
style I fill:#f96
style J fill:#9f6
style K fill:#69f
style L fill:#f69
Key Advantages Explained
1. Blazing Fast Performance
- Rust Implementation: Native speed without Python overhead
- SIMD Optimizations: Utilizes modern CPU instructions
- Parallel Processing: Automatic multi-threading
- Benchmark: Often 5-30x faster than Pandas
2. Memory Efficiency
- Columnar Storage: Optimized memory layout
- Zero-Copy Operations: Minimal memory allocations
- Streaming Support: Process datasets larger than RAM (see the sketch after this list)
- Smart Memory Management: Rust’s ownership model prevents leaks
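The streaming bullet above is easiest to see in code. A minimal sketch, assuming a CSV that does not fit in memory (the file name and column names here are placeholders):
import polars as pl
# Scan lazily, then collect with the streaming engine so the file is processed in chunks
summary = (
    pl.scan_csv("events_larger_than_ram.csv")
    .group_by("event_type")
    .agg(pl.col("duration_ms").mean().alias("avg_duration_ms"))
    .collect(streaming=True)
)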
3. Lazy Evaluation with Query Optimization
# This creates a plan, doesn't execute immediately
lazy_query = (
pl.scan_csv("huge_file.csv")
.filter(pl.col("date") > "2023-01-01") # Pushed down to file reading
.select(["id", "value"]) # Only read needed columns
.group_by("id")
.agg(pl.col("value").sum())
)
# Execute the optimized plan
result = lazy_query.collect()
4. Expressive and Intuitive API
- Method Chaining: Natural, readable code
- Rich Expression System: Powerful column operations
- Type Safety: Catch errors at query planning time
- Consistent Patterns: Learn once, apply everywhere
5. Modern Data Types and Features
- Native Date/Time Support: Built-in temporal operations
- Nested Data Types: Lists, structs, and more
- String Processing: Optimized text operations
- Categorical Data: Efficient handling of categories (a short sketch follows this list)
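A minimal sketch of these types working together (the column names are illustrative, not from a real dataset):
import polars as pl
from datetime import date
orders = pl.DataFrame({
    "ordered_on": [date(2024, 1, 1), date(2024, 1, 2)],  # native Date column
    "tags": [["priority", "gift"], ["bulk"]],            # nested List column
    "tier": ["gold", "silver"],
}).with_columns(pl.col("tier").cast(pl.Categorical))      # efficient repeated strings
orders = orders.with_columns(
    pl.col("ordered_on").dt.weekday().alias("weekday"),   # built-in temporal operation
    pl.col("tags").list.len().alias("n_tags"),            # optimized nested-data operation
)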
When to Choose Polars
✅ Choose Polars When:
- Working with datasets > 100MB
- Performance is critical
- You need out-of-core processing
- Building production data pipelines
- Dealing with time series data
- Complex aggregations and joins
⚠️ Consider Alternatives When:
- Dataset is very small (< 10MB)
- Heavy integration with pandas ecosystem required (see the interop sketch after this list)
- Team lacks Python/data processing experience
- Quick prototyping with familiar tools
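If pandas integration is the sticking point, note that moving between the two libraries is cheap. A minimal round-trip sketch (assumes pandas and pyarrow are installed alongside Polars):
import pandas as pd
import polars as pl
pdf = pd.DataFrame({"x": [1, 2, 3]})
pldf = pl.from_pandas(pdf)   # pandas -> Polars
back = pldf.to_pandas()      # Polars -> pandas, e.g. for plotting or scikit-learn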
Real-World Impact: A Quick Example
Let’s see Polars in action with a practical scenario:
import polars as pl
import pandas as pd
import numpy as np
import time
# Create sample sales data
n_rows = 500_000
data = {
"date": pd.date_range("2020-01-01", periods=n_rows, freq="1H"),
"product_id": np.random.randint(1, 1000, n_rows),
"sales": np.random.normal(100, 25, n_rows),
"region": np.random.choice(["North", "South", "East", "West"], n_rows)
}
# Pandas approach
df_pandas = pd.DataFrame(data)
start = time.time()
pandas_result = (
df_pandas
.groupby(["region", df_pandas["date"].dt.date])
.agg({"sales": ["sum", "mean", "count"]})
)
pandas_time = time.time() - start
# Polars approach
df_polars = pl.DataFrame(data)
start = time.time()
polars_result = (
df_polars
.group_by(["region", pl.col("date").dt.date()])
.agg([
pl.col("sales").sum().alias("sales_sum"),
pl.col("sales").mean().alias("sales_mean"),
pl.col("sales").count().alias("sales_count")
])
)
polars_time = time.time() - start
print(f"Pandas: {pandas_time:.2f}s")
print(f"Polars: {polars_time:.2f}s")
print(f"Speedup: {pandas_time/polars_time:.1f}x faster")PythonChapter Summary
Polars represents a paradigm shift in data processing, offering:
- Revolutionary performance through Rust and modern algorithms
- Memory efficiency that handles larger datasets
- Query optimization that makes your code faster automatically
- Modern API design that’s both powerful and intuitive
In the next chapter, we’ll get Polars installed and write our first program to experience this performance firsthand.
🔗 What’s Next?
Ready to experience the Polars advantage? Let’s get it installed and running on your system!
2. Getting Started: Installation and Setup
🎯 Learning Objectives
By the end of this chapter, you will:
- Successfully install Polars in different environments
- Configure your development environment for optimal Polars usage
- Understand different installation options and their use cases
- Verify your installation with a simple test program
Installation Options
Basic Installation
# Most common installation
pip install polars
# For conda users
conda install -c conda-forge polarsBashAdvanced Installation Options
# With all optional dependencies (recommended for full functionality)
pip install polars[all]
# For development and testing
pip install polars[dev]
# Specific feature sets
pip install polars[pandas] # Pandas interoperability
pip install polars[numpy] # NumPy integration
pip install polars[xlsx] # Excel file support
pip install polars[sql] # SQL query support
Installation Verification
import polars as pl
import sys
print(f"✅ Polars version: {pl.__version__}")
print(f"🐍 Python version: {sys.version}")
print(f"🏗️ Available features: {pl.show_versions()}")
# Quick functionality test
test_df = pl.DataFrame({"test": [1, 2, 3]})
print(f"🧪 Test DataFrame created successfully: {test_df.shape[0]} rows")PythonDevelopment Environment Setup
Jupyter Notebook Configuration
# Put this in your notebook's first cell
import polars as pl
import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta
import warnings
# Polars display configuration
pl.Config.set_tbl_rows(15) # Show 15 rows in output
pl.Config.set_tbl_cols(8) # Show 8 columns in output
pl.Config.set_fmt_str_lengths(50) # Truncate strings at 50 chars
pl.Config.set_tbl_width_chars(120) # Table width in characters
# Enable better error messages
pl.Config.set_verbose(True)
# Suppress pandas warnings when comparing
warnings.filterwarnings('ignore', category=FutureWarning)
print("🚀 Polars environment ready!")PythonIDE Configuration (VS Code, PyCharm)
# Create a polars_config.py file for consistent setup
import polars as pl
def setup_polars_environment():
"""Configure Polars for development environment"""
# Display settings
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(10)
pl.Config.set_fmt_str_lengths(100)
# Performance settings
pl.Config.set_streaming_chunk_size(10000)
# Thread count is read from the POLARS_MAX_THREADS environment variable
# before polars is imported, so here we simply report what is in use
n_threads = pl.thread_pool_size()
print(f"🔧 Polars is running with {n_threads} threads")
print("📊 Display: 20 rows, 10 columns")
return pl
# Import and use in your scripts
# from polars_config import setup_polars_environment
# pl = setup_polars_environment()
Your First Polars Program
Let’s create a simple program to verify everything is working:
import polars as pl
import time
def first_polars_program():
"""Your first Polars program - data creation and basic operations"""
print("🎉 Welcome to your first Polars program!")
# Create sample data
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"age": [25, 30, 35, 28, 32],
"city": ["New York", "London", "Tokyo", "Paris", "Sydney"],
"salary": [70000, 80000, 90000, 75000, 85000]
})
print("\n📊 Original Data:")
print(df)
# Basic operations
print("\n🔍 Basic Statistics:")
print(df.describe())
# Filtering example
high_earners = df.filter(pl.col("salary") > 80000)
print(f"\n💰 High earners (salary > $80,000): {high_earners.height} people")
print(high_earners)
# Grouping example
city_stats = df.group_by("city").agg([
pl.col("salary").mean().alias("avg_salary"),
pl.col("age").mean().alias("avg_age")
])
print("\n🌍 City Statistics:")
print(city_stats)
# Performance demonstration
print("\n⚡ Performance Test:")
large_df = pl.DataFrame({
"id": range(100_000),
"value": range(100_000)
})
start_time = time.time()
result = large_df.filter(pl.col("value") > 50_000).sum()
elapsed = time.time() - start_time
print(f"Processed 100,000 rows in {elapsed:.4f} seconds")
print("🎯 Setup complete! You're ready to learn Polars.")
# Run the program
if __name__ == "__main__":
first_polars_program()
Common Installation Issues and Solutions
Issue 1: Import Error
# Problem: ImportError: No module named 'polars'
# Solution: Ensure you're in the correct Python environment
python -m pip install polars
# Or check your Python path
python -c "import sys; print(sys.path)"BashIssue 2: Version Conflicts
# Problem: Dependency conflicts with other packages
# Solution: Use virtual environment
python -m venv polars_env
source polars_env/bin/activate # On Windows: polars_env\Scripts\activate
pip install polars[all]
Issue 3: Performance Issues
# Problem: Slow performance on large datasets
# Solution: control the thread pool size via the POLARS_MAX_THREADS
# environment variable (it must be set before polars is imported)
import os
os.environ["POLARS_MAX_THREADS"] = "4"  # or str(os.cpu_count()) to use all cores
import polars as pl
print(f"Polars thread pool size: {pl.thread_pool_size()}")
Environment Checklist
Before proceeding to the next chapter, ensure:
- ✅ Polars is installed and importable
- ✅ Version is 0.19.0 or higher (recommended; a quick check follows this list)
- ✅ Basic operations work correctly
- ✅ Display configuration is set up
- ✅ Your development environment (Jupyter/IDE) is configured
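A quick, hedged way to assert that minimum version in a script (simple string parsing; pre-release version strings may need extra handling):
import polars as pl
major, minor, *_ = (int(part) for part in pl.__version__.split("."))
assert (major, minor) >= (0, 19), f"Polars {pl.__version__} is older than 0.19.0"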
Quick Reference Card
# Essential imports for every Polars session
import polars as pl
import numpy as np
from datetime import datetime, date
# Basic configuration
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(8)
# Check installation
print(f"Polars v{pl.__version__} ready!")
# Create test DataFrame
df = pl.DataFrame({"test": [1, 2, 3]})
print(df) # Should display a simple table
Chapter Summary
You now have:
- ✅ Polars installed with the appropriate options for your needs
- ✅ Environment configured for optimal development experience
- ✅ First program running demonstrating basic Polars capabilities
- ✅ Troubleshooting knowledge for common setup issues
🔗 What’s Next?
With Polars installed and configured, let’s dive into the core concepts that make Polars so powerful!
3. Core Concepts: DataFrames, Series, and Expressions
🎯 Learning Objectives
By the end of this chapter, you will:
- Understand the three core data structures in Polars
- Know how to work with different data types effectively
- Master the expression system that makes Polars powerful
- Distinguish between eager and lazy evaluation
The Three Pillars of Polars
graph TD
A[Polars Core] --> B[DataFrame]
A --> C[Series]
A --> D[Expression]
B --> E[Eager Evaluation]
B --> F[LazyFrame]
F --> G[Query Optimization]
C --> H[Column Data]
C --> I[Data Types]
D --> J[Operations]
D --> K[Transformations]
D --> L[Aggregations]
style A fill:#f96
style B fill:#9f6
style C fill:#69f
style D fill:#f69
1. DataFrame: Your Data Container
A DataFrame is a 2D data structure with labeled columns, similar to a spreadsheet or SQL table.
import polars as pl
# Creating a DataFrame
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"salary": [50000, 60000, 70000],
"active": [True, False, True]
})
print("📊 Basic DataFrame:")
print(df)
# DataFrame properties
print(f"Shape: {df.shape}") # (rows, columns)
print(f"Columns: {df.columns}")
print(f"Data types: {df.dtypes}")PythonDataFrame vs LazyFrame
flowchart LR
A[DataFrame] --> B[Eager Execution]
B --> C[Immediate Results]
D[LazyFrame] --> E[Query Building]
E --> F[Optimization]
F --> G[Execution]
G --> H[Results]
style A fill:#9f6
style D fill:#69f
# Eager evaluation (DataFrame)
eager_result = df.filter(pl.col("age") > 25).select(["name", "salary"])
print("Eager result:", eager_result)
# Lazy evaluation (LazyFrame)
lazy_query = (
df.lazy()
.filter(pl.col("age") > 25)
.select(["name", "salary"])
)
print("Lazy query plan:")
print(lazy_query.explain()) # Show the execution plan
# Execute the lazy query
lazy_result = lazy_query.collect()
print("Lazy result:", lazy_result)Python2. Series: Individual Columns
A Series represents a single column of data with a specific type.
# Creating Series
ages = pl.Series("ages", [25, 30, 35, 28])
names = pl.Series("names", ["Alice", "Bob", "Charlie", "Diana"])
print("📈 Series examples:")
print(f"Ages: {ages}")
print(f"Names: {names}")
# Series operations
print(f"Mean age: {ages.mean()}")
print(f"Unique names: {names.n_unique()}")
print(f"Age statistics:\n{ages.describe()}")Python3. Expressions: The Heart of Polars
Expressions are the key to Polars’ power – they describe computations that can be applied to columns.
# Basic expressions
print("🔧 Expression examples:")
# Column references
age_expr = pl.col("age")
salary_expr = pl.col("salary")
# Computations
annual_bonus = pl.col("salary") * 0.1
senior_status = pl.col("age") > 30
# Complex expressions
salary_category = (
pl.when(pl.col("salary") > 60000)
.then(pl.lit("High"))
.when(pl.col("salary") > 40000)
.then(pl.lit("Medium"))
.otherwise(pl.lit("Low"))
)
# Apply expressions
result = df.with_columns([
annual_bonus.alias("bonus"),
senior_status.alias("is_senior"),
salary_category.alias("salary_category")
])
print(result)
Data Types Deep Dive
graph TD
A[Polars Data Types] --> B[Numeric]
A --> C[Text]
A --> D[Temporal]
A --> E[Complex]
A --> F[Special]
B --> B1[Int8, Int16, Int32, Int64]
B --> B2[UInt8, UInt16, UInt32, UInt64]
B --> B3[Float32, Float64]
C --> C1[Utf8]
C --> C2[Categorical]
C --> C3[Binary]
D --> D1[Date]
D --> D2[Datetime]
D --> D3[Time]
D --> D4[Duration]
E --> E1[List]
E --> E2[Struct]
E --> E3[Array]
F --> F1[Boolean]
F --> F2[Null]
# Data type examples with explanations (all columns must share the same length)
from datetime import date, datetime
examples_df = pl.DataFrame({
# Integers - choose size based on your data range
"small_int": pl.Series([1, 2, 3], dtype=pl.Int8), # -128 to 127
"big_int": pl.Series([1_000_000, 2_000_000, 3_000_000], dtype=pl.Int64), # Very large numbers
# Floats
"precision_float": pl.Series([1.1, 2.2, 3.3], dtype=pl.Float32), # Less precision, less memory
"high_precision": pl.Series([1.123456789, 2.123456789, 3.123456789], dtype=pl.Float64), # More precision
# Text
"text": pl.Series(["Hello", "World", "!"], dtype=pl.Utf8),
"category": pl.Series(["A", "B", "A"], dtype=pl.Categorical), # Efficient for repeated values
# Temporal
"date": pl.Series([date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)]),
"datetime": pl.Series([datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 2, 12, 0), datetime(2023, 1, 3, 12, 0)]),
# Boolean
"flag": pl.Series([True, False, True], dtype=pl.Boolean),
# Complex types
"numbers_list": pl.Series([[1, 2, 3], [4, 5], [6]], dtype=pl.List(pl.Int32)),
"person": pl.Series([
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25},
{"name": "Carol", "age": 40}
], dtype=pl.Struct([
pl.Field("name", pl.Utf8),
pl.Field("age", pl.Int32)
]))
})
print("🎯 Data types showcase:")
print(examples_df.dtypes)
The Expression System: Building Blocks
flowchart TD
A[Expression Building Blocks] --> B[Column References]
A --> C[Literals]
A --> D[Functions]
A --> E[Conditionals]
A --> F[Aggregations]
B --> B1["pl.col('name')"]
B --> B2["pl.all()"]
B --> B3["pl.exclude('id')"]
C --> C1["pl.lit(42)"]
C --> C2["pl.lit('text')"]
D --> D1[".str.len()"]
D --> D2[".dt.year()"]
D --> D3[".abs()"]
E --> E1["pl.when().then()"]
E --> E2[".otherwise()"]
F --> F1[".sum()"]
F --> F2[".mean()"]
F --> F3[".count()"]# Expression building blocks demonstration
demo_df = pl.DataFrame({
"name": ["Alice Johnson", "Bob Smith", "Charlie Brown"],
"birth_date": ["1990-05-15", "1985-12-20", "1992-03-10"],
"score": [85, 92, 78],
"active": [True, False, True]
})
# Column references
print("🔗 Column references:")
all_cols = demo_df.select(pl.all()) # Select all columns
name_only = demo_df.select(pl.col("name")) # Select specific column
no_active = demo_df.select(pl.exclude("active")) # All except 'active'
# Literals and computations
enriched = demo_df.with_columns([
pl.lit("Student").alias("role"), # Add literal column
(pl.col("score") * 1.1).alias("boosted_score"), # Mathematical operation
pl.col("name").str.len().alias("name_length"), # String function
])
print("📊 Enriched data:")
print(enriched)
# Conditional expressions
categorized = demo_df.with_columns([
pl.when(pl.col("score") >= 90)
.then(pl.lit("Excellent"))
.when(pl.col("score") >= 80)
.then(pl.lit("Good"))
.when(pl.col("score") >= 70)
.then(pl.lit("Average"))
.otherwise(pl.lit("Needs Improvement"))
.alias("grade")
])
print("🎓 Grade categories:")
print(categorized.select(["name", "score", "grade"]))
Performance Implications
# Understanding eager vs lazy performance
import time
# Create larger dataset for demonstration
large_df = pl.DataFrame({
"id": range(100_000),
"value": range(100_000),
"category": ["A", "B", "C", "D"] * 25_000
})
print("⚡ Performance comparison:")
# Eager evaluation
start = time.time()
eager_result = (
large_df
.filter(pl.col("value") > 50_000)
.group_by("category")
.agg(pl.col("value").sum())
)
eager_time = time.time() - start
# Lazy evaluation
start = time.time()
lazy_result = (
large_df
.lazy()
.filter(pl.col("value") > 50_000)
.group_by("category")
.agg(pl.col("value").sum())
.collect()
)
lazy_time = time.time() - start
print(f"Eager time: {eager_time:.4f}s")
print(f"Lazy time: {lazy_time:.4f}s")
print(f"Performance difference: {eager_time/lazy_time:.2f}x")PythonChapter Summary
Key Concepts Mastered:
- ✅ DataFrame: Your primary data container with eager evaluation
- ✅ LazyFrame: Query builder with optimization capabilities
- ✅ Series: Individual columns with specific data types
- ✅ Expressions: Powerful building blocks for data operations
- ✅ Data Types: Choosing the right type for memory and performance
The Expression Advantage:
- Composable and reusable (illustrated after this list)
- Type-safe operations
- Optimizable by the query engine
- Readable and intuitive syntax
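As a small illustration of that composability, one expression object can be reused across filtering, column creation, and aggregation (using the name/age/salary/active DataFrame from earlier in this chapter):
# Define the expression once...
high_salary = pl.col("salary") > 60000
# ...then reuse it wherever it is needed
df.filter(high_salary)
df.with_columns(high_salary.alias("is_high_earner"))
df.group_by("active").agg(high_salary.sum().alias("n_high_earners"))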
🔗 What’s Next?
Now that you understand Polars’ foundation, let’s build your first complete Polars program!
4. Your First Polars Program
🎯 Learning Objectives
By the end of this chapter, you will:
- Build a complete data analysis program using Polars
- Apply core concepts in a practical scenario
- Understand common patterns and workflows
- Gain confidence in writing Polars code
The Scenario: Employee Data Analysis
Let’s build a program that analyzes employee data for a fictional company. This will demonstrate real-world Polars usage patterns.
import polars as pl
from datetime import date, datetime
import random
def create_sample_data():
"""Create realistic employee data for our analysis"""
# Set seed for reproducible results
random.seed(42)
departments = ["Engineering", "Sales", "Marketing", "HR", "Finance"]
cities = ["New York", "London", "Tokyo", "Berlin", "Sydney"]
# Generate 1000 employees
employees = []
for i in range(1000):
employees.append({
"employee_id": f"EMP{i+1:04d}",
"name": f"Employee {i+1}",
"department": random.choice(departments),
"salary": random.randint(40000, 150000),
"hire_date": date(
random.randint(2018, 2023),
random.randint(1, 12),
random.randint(1, 28)
),
"city": random.choice(cities),
"performance_score": round(random.uniform(1.0, 5.0), 1),
"is_remote": random.choice([True, False])
})
return pl.DataFrame(employees)
def analyze_employee_data():
"""Complete employee data analysis workflow"""
print("🏢 Employee Data Analysis")
print("=" * 50)
# Step 1: Create data
print("📊 Creating sample data...")
df = create_sample_data()
print(f"✅ Created dataset with {df.height} employees")
print(f"📋 Columns: {', '.join(df.columns)}")
# Step 2: Basic exploration
print("\n🔍 Data Overview:")
print(df.head())
print(f"\nData types:\n{df.dtypes}")
# Step 3: Data quality check
print("\n🔍 Data Quality Check:")
null_counts = df.null_count()
print(null_counts)
# Step 4: Department analysis
print("\n🏢 Department Analysis:")
dept_analysis = (
df
.group_by("department")
.agg([
pl.col("salary").mean().round(0).alias("avg_salary"),
pl.col("salary").min().alias("min_salary"),
pl.col("salary").max().alias("max_salary"),
pl.col("employee_id").count().alias("employee_count"),
pl.col("performance_score").mean().round(2).alias("avg_performance")
])
.sort("avg_salary", descending=True)
)
print(dept_analysis)
# Step 5: Salary analysis with categories
print("\n💰 Salary Categories:")
salary_categorized = (
df
.with_columns([
pl.when(pl.col("salary") >= 100000)
.then(pl.lit("High"))
.when(pl.col("salary") >= 70000)
.then(pl.lit("Medium"))
.otherwise(pl.lit("Low"))
.alias("salary_category")
])
.group_by("salary_category")
.agg([
pl.col("employee_id").count().alias("count"),
pl.col("salary").mean().round(0).alias("avg_salary")
])
.sort("avg_salary", descending=True)
)
print(salary_categorized)
# Step 6: Geographic distribution
print("\n🌍 Geographic Distribution:")
city_analysis = (
df
.group_by(["city", "is_remote"])
.agg([
pl.col("employee_id").count().alias("employee_count"),
pl.col("salary").mean().round(0).alias("avg_salary")
])
.sort(["city", "is_remote"])
)
print(city_analysis)
# Step 7: Tenure analysis
print("\n📅 Employee Tenure Analysis:")
tenure_analysis = (
df
.with_columns([
((pl.lit(date.today()) - pl.col("hire_date")).dt.days() / 365.25)
.round(1)
.alias("years_of_service")
])
.with_columns([
pl.when(pl.col("years_of_service") >= 5)
.then(pl.lit("Veteran"))
.when(pl.col("years_of_service") >= 2)
.then(pl.lit("Experienced"))
.otherwise(pl.lit("New"))
.alias("tenure_category")
])
.group_by("tenure_category")
.agg([
pl.col("employee_id").count().alias("count"),
pl.col("years_of_service").mean().round(1).alias("avg_tenure"),
pl.col("salary").mean().round(0).alias("avg_salary"),
pl.col("performance_score").mean().round(2).alias("avg_performance")
])
.sort("avg_tenure", descending=True)
)
print(tenure_analysis)
# Step 8: High performer identification
print("\n⭐ High Performers (Top 10%):")
high_performers = (
df
.filter(pl.col("performance_score") >= df["performance_score"].quantile(0.9))
.select([
"employee_id", "name", "department", "salary",
"performance_score", "hire_date"
])
.sort("performance_score", descending=True)
.head(10)
)
print(high_performers)
# Step 9: Summary insights
print("\n📈 Key Insights:")
total_employees = df.height
avg_salary = df["salary"].mean()
avg_performance = df["performance_score"].mean()
remote_pct = (df["is_remote"].sum() / total_employees * 100)
print(f"• Total Employees: {total_employees:,}")
print(f"• Average Salary: ${avg_salary:,.0f}")
print(f"• Average Performance Score: {avg_performance:.2f}/5.0")
print(f"• Remote Workers: {remote_pct:.1f}%")
return df
# Bonus: Using lazy evaluation for large datasets
def demonstrate_lazy_evaluation():
"""Show the power of lazy evaluation"""
print("\n⚡ Lazy Evaluation Demonstration")
print("=" * 40)
# Create a larger dataset
large_df = pl.DataFrame({
"id": range(100_000),
"value": range(100_000),
"category": ["A", "B", "C", "D", "E"] * 20_000
})
# Build a complex lazy query
lazy_query = (
large_df
.lazy()
.filter(pl.col("value") > 50_000)
.with_columns([
(pl.col("value") * 2).alias("doubled_value"),
pl.col("category").cast(pl.Categorical).alias("category")
])
.group_by("category")
.agg([
pl.col("value").sum().alias("total_value"),
pl.col("doubled_value").mean().alias("avg_doubled"),
pl.col("id").count().alias("count")
])
.sort("total_value", descending=True)
)
print("🔍 Query Plan:")
print(lazy_query.explain())
print("\n📊 Execution Results:")
result = lazy_query.collect()
print(result)
if __name__ == "__main__":
# Run the main analysis
employee_df = analyze_employee_data()
# Demonstrate lazy evaluation
demonstrate_lazy_evaluation()
print("\n🎉 Analysis Complete!")
print("You've successfully built and run your first comprehensive Polars program!")PythonUnderstanding the Code Structure
flowchart TD
A[Data Creation] --> B[Data Exploration]
B --> C[Data Quality Check]
C --> D[Business Analysis]
D --> E[Advanced Analytics]
E --> F[Insights & Reporting]
D --> D1[Department Analysis]
D --> D2[Salary Categorization]
D --> D3[Geographic Distribution]
E --> E1[Tenure Analysis]
E --> E2[Performance Ranking]
E --> E3[Predictive Features]
style A fill:#f96
style D fill:#9f6
style E fill:#69f
style F fill:#f69
Key Patterns Demonstrated
1. Data Creation with Realistic Constraints
# Using seeds for reproducible results
random.seed(42)
# Creating structured data with relationships
employees.append({
"employee_id": f"EMP{i+1:04d}", # Consistent ID format
"department": random.choice(departments), # Controlled categories
"salary": random.randint(40000, 150000), # Realistic ranges
})
2. Progressive Data Enhancement
# Building on previous calculations
df.with_columns([
# First calculate years of service
((pl.lit(date.today()) - pl.col("hire_date")).dt.days() / 365.25)
.round(1)
.alias("years_of_service")
]).with_columns([
# Then categorize based on the calculation
pl.when(pl.col("years_of_service") >= 5)
.then(pl.lit("Veteran"))
# ...
])
3. Business Logic Implementation
# Complex categorization logic
pl.when(pl.col("salary") >= 100000).then(pl.lit("High"))
.when(pl.col("salary") >= 70000).then(pl.lit("Medium"))
.otherwise(pl.lit("Low"))PythonPerformance Tips Applied
- Efficient data types: Using pl.Categorical for repeated strings
- Method chaining: Readable, efficient operations
- Lazy evaluation: Demonstrated for large datasets
- Selective operations: Only computing what’s needed
Chapter Summary
Skills Acquired:
- ✅ End-to-end workflow: From data creation to insights
- ✅ Business logic: Implementing real-world categorization rules
- ✅ Performance patterns: Efficient data processing techniques
- ✅ Code organization: Structured, readable analysis functions
Polars Patterns Mastered:
- Progressive data enhancement with with_columns()
- Complex grouping and aggregation
- Conditional logic with when().then().otherwise()
- Date/time calculations and analysis
- Lazy evaluation for performance
🔗 What’s Next?
Ready to learn about the various ways to load and create data in Polars!
5. Loading and Creating Data
🎯 Learning Objectives
By the end of this chapter, you will:
- Master multiple ways to create DataFrames from scratch
- Load data from various file formats efficiently
- Understand when to use eager vs lazy loading
- Handle data type specifications and schema validation
Data Sources Overview
graph TD
A[Data Sources] --> B[File Formats]
A --> C[In-Memory Creation]
A --> D[External Systems]
A --> E[Other Libraries]
B --> B1[CSV/TSV]
B --> B2[Parquet]
B --> B3[JSON/NDJSON]
B --> B4[Excel]
B --> B5[Arrow/IPC]
C --> C1[Dictionaries]
C --> C2[Lists]
C --> C3[Ranges]
C --> C4[Manual Schema]
D --> D1[Databases]
D --> D2[APIs]
D --> D3[Cloud Storage]
E --> E1[Pandas]
E --> E2[NumPy]
E --> E3[PyArrow]
style A fill:#f96
style B fill:#9f6
style C fill:#69f
style D fill:#f69
Creating DataFrames from Scratch
1. From Dictionaries (Most Common)
import polars as pl
import numpy as np
from datetime import date, datetime
# Basic dictionary creation
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana"],
"age": [25, 30, 35, 28],
"salary": [50000, 60000, 70000, 55000],
"active": [True, False, True, True]
})
print("📊 Basic DataFrame:")
print(df)
print(f"Schema: {df.schema}")
# Advanced dictionary with mixed types
advanced_df = pl.DataFrame({
"employee_id": [f"EMP{i:03d}" for i in range(1, 5)],
"hire_date": [date(2020, 1, 15), date(2019, 6, 10), date(2021, 3, 22), date(2020, 11, 5)],
"last_login": [
datetime(2023, 12, 1, 14, 30),
datetime(2023, 12, 2, 9, 15),
None, # Missing value
datetime(2023, 12, 1, 16, 45)
],
"skills": [["Python", "SQL"], ["Java", "Docker"], ["Python", "React"], ["SQL", "Tableau"]],
"performance_scores": [[4.5, 4.2, 4.8], [3.9, 4.1], [4.7, 4.9, 4.6], [4.0]]
})
print("\n🔧 Advanced DataFrame with complex types:")
print(advanced_df)
2. From Lists and Sequences
# List of lists with schema
data_rows = [
["Alice", 25, "Engineering"],
["Bob", 30, "Sales"],
["Charlie", 35, "Marketing"],
["Diana", 28, "Engineering"]
]
df_from_lists = pl.DataFrame(
data_rows,
schema=["name", "age", "department"],
orient="row" # Default: each inner list is a row
)
print("📋 DataFrame from lists:")
print(df_from_lists)
# Column-oriented data
col_data = [
["Alice", "Bob", "Charlie"], # Names column
[25, 30, 35], # Ages column
["NYC", "LA", "Chicago"] # Cities column
]
df_from_columns = pl.DataFrame(
col_data,
schema=["name", "age", "city"],
orient="col" # Each inner list is a column
)
print("\n📊 DataFrame from column lists:")
print(df_from_columns)
3. From NumPy Arrays
# Random data generation
np.random.seed(42)
array_data = np.random.randn(1000, 4) # 1000 rows, 4 columns
df_numpy = pl.DataFrame(
array_data,
schema=["feature_1", "feature_2", "feature_3", "target"]
)
print("🔢 DataFrame from NumPy (first 5 rows):")
print(df_numpy.head())
print(f"Shape: {df_numpy.shape}")
# Structured arrays
structured_array = np.array([
("Alice", 25, 50000.0),
("Bob", 30, 60000.0),
("Charlie", 35, 70000.0)
], dtype=[("name", "U10"), ("age", "i4"), ("salary", "f8")])
df_structured = pl.DataFrame(structured_array)
print("\n🏗️ DataFrame from structured NumPy array:")
print(df_structured)
4. Empty DataFrames with Schema
# Define schema first, add data later
schema = {
"user_id": pl.Int64,
"username": pl.Utf8,
"email": pl.Utf8,
"signup_date": pl.Date,
"is_premium": pl.Boolean,
"login_count": pl.Int32,
"last_active": pl.Datetime,
"preferences": pl.List(pl.Utf8)
}
empty_df = pl.DataFrame(schema=schema)
print("📋 Empty DataFrame with schema:")
print(f"Columns: {empty_df.columns}")
print(f"Schema: {empty_df.schema}")
# Add rows to empty DataFrame
new_rows = pl.DataFrame({
"user_id": [1, 2],
"username": ["alice_dev", "bob_analyst"],
"email": ["alice@company.com", "bob@company.com"],
"signup_date": [date(2023, 1, 15), date(2023, 2, 20)],
"is_premium": [True, False],
"login_count": [45, 23],
"last_active": [datetime(2023, 12, 1, 14, 30), datetime(2023, 11, 28, 9, 15)],
"preferences": [["dark_mode", "notifications"], ["light_mode"]]
}, schema=schema)  # reuse the schema so dtypes match the empty DataFrame for the concat below
filled_df = pl.concat([empty_df, new_rows])
print("\n✅ DataFrame after adding data:")
print(filled_df)
File Loading Mastery
CSV Files: The Workhorse
# Basic CSV reading (assuming you have a CSV file)
# df = pl.read_csv("data.csv")
# Let's create a sample CSV first
sample_csv_data = """id,name,age,department,salary,hire_date
1,Alice Johnson,25,Engineering,75000,2021-03-15
2,Bob Smith,30,Sales,65000,2020-07-22
3,Charlie Brown,35,Marketing,70000,2019-11-10
4,Diana Wilson,28,Engineering,80000,2022-01-08
5,Eve Davis,32,Finance,72000,2020-05-18"""
# Save to file for demonstration
with open("sample_employees.csv", "w") as f:
f.write(sample_csv_data)
# Now read with various options
df_basic = pl.read_csv("sample_employees.csv")
print("📄 Basic CSV reading:")
print(df_basic)
# Advanced CSV reading with options
df_advanced = pl.read_csv(
"sample_employees.csv",
dtypes={
"id": pl.Int32,
"age": pl.Int16,
"salary": pl.Int32,
"hire_date": pl.Date
},
try_parse_dates=True,
null_values=["", "NULL", "N/A"], # What to treat as null
skip_rows=0, # Skip header rows if needed
n_rows=3 # Limit rows for testing
)
print("\n🔧 Advanced CSV reading:")
print(df_advanced)
print(f"Dtypes: {df_advanced.dtypes}")PythonLazy Loading for Large Files
# Lazy CSV scanning (doesn't load data immediately)
lazy_df = pl.scan_csv("sample_employees.csv")
print("🔍 Lazy DataFrame operations:")
print("Query plan:")
# Build a query without executing
query = (
lazy_df
.filter(pl.col("salary") > 70000)
.select(["name", "department", "salary"])
.sort("salary", descending=True)
)
print(query.explain())
# Execute when ready
result = query.collect()
print("\n📊 Query results:")
print(result)
# Clean up
import os
os.remove("sample_employees.csv")PythonJSON and Other Formats
# JSON data example
json_data = [
{"id": 1, "name": "Alice", "scores": [85, 92, 78], "metadata": {"team": "A", "level": "senior"}},
{"id": 2, "name": "Bob", "scores": [90, 88, 92], "metadata": {"team": "B", "level": "junior"}},
{"id": 3, "name": "Charlie", "scores": [75, 85, 80], "metadata": {"team": "A", "level": "mid"}}
]
# Save JSON for reading
import json
with open("sample_data.json", "w") as f:
json.dump(json_data, f)
# Read JSON
df_json = pl.read_json("sample_data.json")
print("📄 JSON DataFrame:")
print(df_json)
# NDJSON (newline-delimited JSON) - each line is a JSON object
ndjson_data = """{"id": 1, "name": "Alice", "value": 10}
{"id": 2, "name": "Bob", "value": 20}
{"id": 3, "name": "Charlie", "value": 30}"""
with open("sample_data.ndjson", "w") as f:
f.write(ndjson_data)
df_ndjson = pl.read_ndjson("sample_data.ndjson")
print("\n📄 NDJSON DataFrame:")
print(df_ndjson)
# Clean up
os.remove("sample_data.json")
os.remove("sample_data.ndjson")PythonCreating Series: Building Blocks
# Basic Series creation
numbers = pl.Series("numbers", [1, 2, 3, 4, 5])
names = pl.Series("employee_names", ["Alice", "Bob", "Charlie"])
print("📊 Basic Series:")
print(f"Numbers: {numbers}")
print(f"Names: {names}")
# Typed Series with explicit dtypes
typed_series = pl.Series("precise_values", [1.1, 2.2, 3.3], dtype=pl.Float32)
categorical_series = pl.Series("categories", ["A", "B", "A", "C"], dtype=pl.Categorical)
print(f"\n🎯 Typed Series:")
print(f"Float32: {typed_series.dtype}")
print(f"Categorical: {categorical_series.dtype}")
# Series with null values
with_nulls = pl.Series("incomplete", [1, 2, None, 4, None, 6])
print(f"\n❓ Series with nulls: {with_nulls}")
print(f"Null count: {with_nulls.null_count()}")
# Series from ranges and generators
range_series = pl.Series("range_data", range(1000))
print(f"\n📈 Range Series (first 10): {range_series.head(10)}")
# Date/time series
date_series = pl.Series("dates", [
date(2023, 1, 1),
date(2023, 1, 2),
date(2023, 1, 3)
])
datetime_series = pl.Series("timestamps", [
datetime(2023, 1, 1, 12, 0, 0),
datetime(2023, 1, 1, 13, 30, 0),
datetime(2023, 1, 1, 15, 45, 0)
])
print(f"\n📅 Date Series: {date_series}")
print(f"🕐 DateTime Series: {datetime_series}")PythonData Loading Best Practices
flowchart TD
A[Choose Loading Method] --> B{Data Size}
B -->|Small < 100MB| C[Eager Loading]
B -->|Large > 100MB| D[Lazy Loading]
C --> E[read_csv, read_parquet]
D --> F[scan_csv, scan_parquet]
G[File Format Choice] --> H{Use Case}
H -->|Fast I/O| I[Parquet]
H -->|Human Readable| J[CSV]
H -->|Nested Data| K[JSON]
H -->|Streaming| L[NDJSON]
M[Schema Handling] --> N[Explicit Types]
N --> O[Better Performance]
N --> P[Data Validation]
style D fill:#f96
style I fill:#9f6
style N fill:#69f
Performance Tips
# 1. Use appropriate data types
df_optimized = pl.DataFrame({
"small_int": pl.Series([1, 2, 3], dtype=pl.Int8), # vs Int64
"category": pl.Series(["A", "B", "A"], dtype=pl.Categorical), # vs Utf8
"flag": pl.Series([True, False, True], dtype=pl.Boolean) # vs string
})
print("💡 Optimized DataFrame memory usage:")
print(f"Schema: {df_optimized.schema}")
# 2. Lazy loading for large datasets
def demonstrate_lazy_benefits():
"""Show lazy loading benefits"""
# For demonstration - in real use, this would be a large file
large_data = {
"id": range(10_000),
"value": range(10_000),
"category": ["A", "B", "C", "D", "E"] * 2_000
}
# Save as parquet for demo
temp_df = pl.DataFrame(large_data)
temp_df.write_parquet("temp_large.parquet")
# Lazy scan - no data loaded yet
lazy_scan = pl.scan_parquet("temp_large.parquet")
# Only load what we need
result = (
lazy_scan
.filter(pl.col("value") > 8000) # Predicate pushdown
.select(["id", "category"]) # Column selection
.group_by("category")
.agg(pl.col("id").count())
.collect() # Execute query
)
print("⚡ Lazy loading result:")
print(result)
# Clean up
os.remove("temp_large.parquet")
demonstrate_lazy_benefits()
Chapter Summary
Data Creation Methods Mastered:
- ✅ Dictionaries: Most flexible and intuitive method
- ✅ Lists: Efficient for structured data
- ✅ NumPy integration: Seamless array handling
- ✅ Schema definition: Type-safe DataFrame creation
File Loading Expertise:
- ✅ CSV: Most common format with advanced options
- ✅ JSON/NDJSON: Nested and streaming data
- ✅ Lazy loading: Memory-efficient large file handling
- ✅ Performance optimization: Right types, right methods
Key Patterns:
- Use lazy loading (scan_*) for large files
- Specify data types explicitly for better performance
- Choose file formats based on use case (Parquet for speed, CSV for compatibility)
- Handle missing values appropriately during loading
🔗 What’s Next?
Now that you can create and load data efficiently, let’s learn how to explore and understand your datasets!
6. Data Inspection and Exploration
Basic Information
# DataFrame info
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data types
print(df.schema) # Schema with types
# Statistical summary
df.describe() # Statistical summary
df.null_count() # Null values per column
df.is_empty() # Check if empty
# Memory usage
df.estimated_size() # Estimated memory usage
Data Preview
# View data
df.head() # First 5 rows
df.head(10) # First 10 rows
df.tail() # Last 5 rows
df.sample(5) # Random 5 rows
df.glimpse() # Transposed view
# Unique values
df["column"].unique()
df["column"].n_unique()
df["column"].value_counts()PythonData Quality Checks
# Missing values analysis
missing_report = df.select([
pl.col("*").null_count().name.suffix("_nulls"),
pl.col("*").count().name.suffix("_total")
])
# Duplicate detection
df.is_duplicated() # Boolean mask for duplicates
df.n_unique() # Count unique combinations
Data Flow Visualization
flowchart TD
A[Raw Data] --> B[Load into Polars]
B --> C[Data Inspection]
C --> D{Data Quality OK?}
D -->|No| E[Data Cleaning]
D -->|Yes| F[Data Analysis]
E --> F
F --> G[Results]
subgraph "Inspection Methods"
H[.head/.tail]
I[.describe]
J[.null_count]
K[.dtypes]
end
C --> H
C --> I
C --> J
C --> K
7. Data Selection and Filtering
Column Selection
# Single column
df.select("name")
df["name"] # Returns Series
# Multiple columns
df.select(["name", "age"])
df.select(pl.col("name"), pl.col("age"))
# Column patterns
df.select(pl.col("^.*_id$")) # Regex pattern
df.select(pl.col("*").exclude("id")) # All except id
# By data type
df.select(pl.col(pl.Utf8)) # String columns
df.select(pl.col(pl.NUMERIC_DTYPES)) # Numeric columns
Row Filtering
# Simple conditions
df.filter(pl.col("age") > 25)
df.filter(pl.col("name") == "Alice")
df.filter(pl.col("city").is_in(["New York", "London"]))
# Complex conditions
df.filter(
(pl.col("age") > 25) &
(pl.col("city") == "New York")
)
df.filter(
(pl.col("age") < 20) |
(pl.col("age") > 60)
)
# String operations
df.filter(pl.col("name").str.contains("^A"))
df.filter(pl.col("name").str.starts_with("B"))
df.filter(pl.col("name").str.ends_with("e"))
# Null filtering
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())PythonAdvanced Selection
# Conditional selection
df.select(
pl.when(pl.col("age") > 30)
.then(pl.lit("Senior"))
.otherwise(pl.lit("Junior"))
.alias("category")
)
# Multiple conditions
df.select(
pl.when(pl.col("score") >= 90).then("A")
.when(pl.col("score") >= 80).then("B")
.when(pl.col("score") >= 70).then("C")
.otherwise("F")
.alias("grade")
)
# Dynamic column selection
numeric_cols = [col for col, dtype in df.schema.items()
if dtype in pl.NUMERIC_DTYPES]
df.select(numeric_cols)
Indexing and Slicing
# Row slicing
df[0] # First row as DataFrame
df[0:5] # First 5 rows
df[-5:] # Last 5 rows
# Boolean indexing
mask = df["age"] > 25
df[mask]
# Sample and slice
df.slice(10, 20) # Rows 10-29
df.sample(n=100) # Random 100 rows
df.sample(fraction=0.1) # Random 10% of rows
8. Data Transformation
Adding and Modifying Columns
# Add new columns
df = df.with_columns([
pl.col("age").alias("age_copy"),
(pl.col("age") * 12).alias("age_months"),
pl.lit("default_value").alias("new_column")
])
# Modify existing columns
df = df.with_columns([
pl.col("name").str.to_uppercase().alias("name"),
pl.col("age").cast(pl.Float64).alias("age")
])
# Multiple transformations
df = df.with_columns([
# String manipulations
pl.col("name").str.len().alias("name_length"),
pl.col("city").str.replace("New ", "").alias("city_clean"),
# Mathematical operations
pl.col("value").log().alias("log_value"),
pl.col("value").pow(2).alias("value_squared"),
# Date operations
pl.col("date").dt.year().alias("year"),
pl.col("date").dt.month().alias("month")
])
String Operations
# String methods
df.with_columns([
pl.col("text").str.len().alias("length"),
pl.col("text").str.to_lowercase().alias("lower"),
pl.col("text").str.to_uppercase().alias("upper"),
pl.col("text").str.strip().alias("stripped"),
pl.col("text").str.slice(0, 5).alias("first_5"),
pl.col("text").str.split(" ").alias("words"),
pl.col("text").str.contains("pattern").alias("has_pattern"),
pl.col("text").str.replace("old", "new").alias("replaced")
])
# Regular expressions
df.with_columns([
pl.col("email").str.extract(r"([^@]+)@([^.]+)").alias("email_parts"),
pl.col("phone").str.replace_all(r"[^\d]", "").alias("phone_clean")
])
Mathematical Operations
# Basic math
df.with_columns([
(pl.col("a") + pl.col("b")).alias("sum"),
(pl.col("a") - pl.col("b")).alias("diff"),
(pl.col("a") * pl.col("b")).alias("product"),
(pl.col("a") / pl.col("b")).alias("ratio"),
(pl.col("a") % pl.col("b")).alias("remainder"),
pl.col("a").pow(pl.col("b")).alias("power")
])
# Advanced math
df.with_columns([
pl.col("value").log().alias("log"),
pl.col("value").log10().alias("log10"),
pl.col("value").exp().alias("exp"),
pl.col("value").sqrt().alias("sqrt"),
pl.col("angle").sin().alias("sin"),
pl.col("angle").cos().alias("cos"),
pl.col("value").abs().alias("absolute")
])
# Statistical functions
df.with_columns([
pl.col("values").rank().alias("rank"),
pl.col("values").pct_change().alias("pct_change"),
pl.col("values").cumsum().alias("cumsum"),
pl.col("values").diff().alias("diff")
])
Conditional Transformations
# Simple conditions
df.with_columns([
pl.when(pl.col("age") >= 18)
.then(pl.lit("Adult"))
.otherwise(pl.lit("Minor"))
.alias("age_group")
])
# Complex conditions
df.with_columns([
pl.when(pl.col("score") >= 90).then("Excellent")
.when(pl.col("score") >= 80).then("Good")
.when(pl.col("score") >= 70).then("Average")
.when(pl.col("score") >= 60).then("Below Average")
.otherwise("Poor")
.alias("performance")
])
# Multiple column conditions
df.with_columns([
pl.when((pl.col("age") > 25) & (pl.col("salary") > 50000))
.then(pl.lit("Senior"))
.when((pl.col("age") > 22) & (pl.col("salary") > 30000))
.then(pl.lit("Mid-level"))
.otherwise(pl.lit("Junior"))
.alias("level")
])
Data Type Conversions
# Type casting
df.with_columns([
pl.col("string_number").cast(pl.Int64),
pl.col("integer").cast(pl.Utf8),
pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("timestamp").cast(pl.Datetime)
])
# Safe casting with error handling
df.with_columns([
pl.col("maybe_number").cast(pl.Int64, strict=False).alias("number"),
pl.col("date_text").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
])
Transformation Pipeline
flowchart LR
A[Raw Data] --> B[String Cleaning]
B --> C[Type Conversion]
C --> D[Mathematical Operations]
D --> E[Conditional Logic]
E --> F[Final Dataset]
subgraph "Transformation Steps"
G[with_columns]
H[select]
I[filter]
J[cast]
end
9. Aggregations and Grouping
Basic Aggregations
# Single column aggregations
df.select([
pl.col("value").sum().alias("total"),
pl.col("value").mean().alias("average"),
pl.col("value").median().alias("median"),
pl.col("value").std().alias("std_dev"),
pl.col("value").min().alias("minimum"),
pl.col("value").max().alias("maximum"),
pl.col("value").count().alias("count"),
pl.col("value").n_unique().alias("unique_count")
])
# Multiple columns
df.select([
pl.all().sum().name.suffix("_sum"),
pl.all().mean().name.suffix("_mean")
])
# Quantiles and percentiles
df.select([
pl.col("value").quantile(0.25).alias("q25"),
pl.col("value").quantile(0.5).alias("q50"),
pl.col("value").quantile(0.75).alias("q75"),
pl.col("value").quantile(0.95).alias("q95")
])
Group By Operations
# Basic grouping
result = df.group_by("category").agg([
pl.col("value").sum().alias("total_value"),
pl.col("value").mean().alias("avg_value"),
pl.col("value").count().alias("count")
])
# Multiple grouping columns
result = df.group_by(["department", "region"]).agg([
pl.col("sales").sum(),
pl.col("employees").count(),
pl.col("salary").mean()
])
# Dynamic grouping
result = df.group_by(
pl.col("date").dt.year().alias("year"),
pl.col("date").dt.month().alias("month")
).agg([
pl.col("revenue").sum(),
pl.col("customers").n_unique()
])
Advanced Aggregations
# Custom aggregations
result = df.group_by("group").agg([
# Statistical measures
pl.col("values").std().alias("std_dev"),
pl.col("values").var().alias("variance"),
pl.col("values").skew().alias("skewness"),
# List aggregations
pl.col("items").list().alias("all_items"),
pl.col("values").sort().first().alias("min_value"),
pl.col("values").sort().last().alias("max_value"),
# Conditional aggregations
pl.col("value").filter(pl.col("status") == "active").sum().alias("active_sum"),
pl.col("value").filter(pl.col("value") > 100).count().alias("high_value_count")
])
# Window functions in aggregations
result = df.group_by("category").agg([
pl.col("value").sum().alias("category_total"),
(pl.col("value") / pl.col("value").sum()).alias("percentage")
])
Rolling and Expanding Aggregations
# Rolling window aggregations
df.with_columns([
pl.col("value").rolling_sum(window_size=3).alias("rolling_sum_3"),
pl.col("value").rolling_mean(window_size=7).alias("rolling_avg_7"),
pl.col("value").rolling_std(window_size=5).alias("rolling_std_5"),
pl.col("value").rolling_max(window_size=10).alias("rolling_max_10")
])
# Expanding window aggregations
df.with_columns([
pl.col("value").cumsum().alias("cumulative_sum"),
pl.col("value").cumcount().alias("cumulative_count"),
pl.col("value").cummax().alias("cumulative_max"),
pl.col("value").cummin().alias("cumulative_min")
])
# Time-based rolling windows
df.sort("timestamp").with_columns([
pl.col("value").rolling_mean_by("timestamp", window_size="7d").alias("7day_avg"),
pl.col("value").rolling_sum_by("timestamp", window_size="1mo").alias("monthly_sum")
])
Pivot and Unpivot
# Pivot table
pivot_df = df.pivot(
values="sales",
index="region",
columns="month",
aggregate_function="sum"
)
# Unpivot (melt)
melted_df = df.melt(
id_vars=["id", "name"],
value_vars=["jan", "feb", "mar"],
variable_name="month",
value_name="sales"
)
# Multiple value columns
melted_df = df.melt(
id_vars=["id"],
value_vars=["sales_jan", "sales_feb", "profit_jan", "profit_feb"],
variable_name="metric_month",
value_name="value"
)
Aggregation Pipeline
flowchart TD
A[Original DataFrame] --> B{Grouping Needed?}
B -->|Yes| C[group_by]
B -->|No| D[Direct Aggregation]
C --> E[agg functions]
D --> E
E --> F[Result DataFrame]
subgraph "Aggregation Functions"
G[sum, mean, count]
H[min, max, std]
I[quantile, median]
J[custom expressions]
end
E --> G
E --> H
E --> I
E --> J
10. Joins and Concatenations
Types of Joins
# Sample DataFrames
customers = pl.DataFrame({
"customer_id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Charlie", "Diana"],
"city": ["NYC", "LA", "Chicago", "Miami"]
})
orders = pl.DataFrame({
"order_id": [101, 102, 103, 104, 105],
"customer_id": [1, 2, 2, 3, 5],
"amount": [100, 200, 150, 300, 250]
})
# Inner join (default)
result = customers.join(orders, on="customer_id", how="inner")
# Left join
result = customers.join(orders, on="customer_id", how="left")
# Right join
result = customers.join(orders, on="customer_id", how="right")
# Outer join
result = customers.join(orders, on="customer_id", how="outer")
# Cross join
result = customers.join(orders, how="cross")PythonAdvanced Join Operations
# Multiple column joins
df1.join(df2, on=["col1", "col2"], how="inner")
# Different column names
df1.join(df2, left_on="id", right_on="customer_id", how="left")
# Join with suffix for duplicate columns
df1.join(df2, on="id", suffix="_right")
# Anti join (rows in left not in right)
df1.join(df2, on="id", how="anti")
# Semi join (rows in left that have match in right)
df1.join(df2, on="id", how="semi")PythonConcatenations
# Vertical concatenation (stack rows)
df_combined = pl.concat([df1, df2, df3], how="vertical")
# Horizontal concatenation (side by side)
df_combined = pl.concat([df1, df2], how="horizontal")
# Diagonal concatenation (union with different schemas)
df_combined = pl.concat([df1, df2], how="diagonal")
# With rechunk for better performance
df_combined = pl.concat([df1, df2], rechunk=True)
Complex Join Examples
# Multiple joins in sequence
result = (
customers
.join(orders, on="customer_id", how="left")
.join(products, left_on="product_id", right_on="id", how="left")
.join(categories, left_on="category_id", right_on="id", how="left")
)
# Join with aggregation
customer_stats = (
orders
.group_by("customer_id")
.agg([
pl.col("amount").sum().alias("total_spent"),
pl.col("order_id").count().alias("order_count")
])
)
enriched_customers = customers.join(
customer_stats,
on="customer_id",
how="left"
).fill_null(0)
# Self join
df.join(
df.select(["id", "manager_id", "name"]).rename({"name": "manager_name"}),
left_on="manager_id",
right_on="id",
how="left"
)
Join Performance Optimization
# Use lazy evaluation for large datasets
lazy_result = (
pl.scan_parquet("customers.parquet")
.join(
pl.scan_parquet("orders.parquet"),
on="customer_id",
how="inner"
)
.collect()
)
# Pre-sort data for better join performance
customers_sorted = customers.sort("customer_id")
orders_sorted = orders.sort("customer_id")
result = customers_sorted.join(orders_sorted, on="customer_id")
Join Strategy Visualization
graph LR
subgraph "Join Types"
A[Inner Join] --> E[Only Matching Records]
B[Left Join] --> F[All Left + Matching Right]
C[Right Join] --> G[All Right + Matching Left]
D[Outer Join] --> H[All Records from Both]
end
subgraph "Special Joins"
I[Anti Join] --> J[Left Records Without Match]
K[Semi Join] --> L[Left Records With Match]
M[Cross Join] --> N[Cartesian Product]
end
11. Time Series Operations
Date and Time Creation
import polars as pl
from datetime import datetime, date, time
# Creating date/time columns
df = pl.DataFrame({
"date": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)],
"datetime": [
datetime(2023, 1, 1, 10, 30),
datetime(2023, 1, 2, 14, 45),
datetime(2023, 1, 3, 9, 15)
],
"value": [100, 150, 200]
})
# From strings
df = pl.DataFrame({
"date_str": ["2023-01-01", "2023-01-02", "2023-01-03"],
"datetime_str": ["2023-01-01 10:30:00", "2023-01-02 14:45:00", "2023-01-03 09:15:00"],
"value": [100, 150, 200]
})
df = df.with_columns([
pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d").alias("date"),
pl.col("datetime_str").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S").alias("datetime")
])
Date/Time Extraction
# Extract date components
df = df.with_columns([
pl.col("datetime").dt.year().alias("year"),
pl.col("datetime").dt.month().alias("month"),
pl.col("datetime").dt.day().alias("day"),
pl.col("datetime").dt.weekday().alias("weekday"),
pl.col("datetime").dt.hour().alias("hour"),
pl.col("datetime").dt.minute().alias("minute"),
pl.col("datetime").dt.second().alias("second"),
pl.col("datetime").dt.week().alias("week_of_year"),
pl.col("datetime").dt.quarter().alias("quarter")
])
# Date arithmetic
df = df.with_columns([
(pl.col("datetime") + pl.duration(days=7)).alias("next_week"),
(pl.col("datetime") - pl.duration(hours=2)).alias("two_hours_ago"),
pl.col("datetime").dt.truncate("1d").alias("date_truncated"),
pl.col("datetime").dt.round("1h").alias("hour_rounded")
])
Time Series Resampling
# Create time series data
import numpy as np  # used for the random-walk values below
dates = pl.date_range(
start=date(2023, 1, 1),
end=date(2023, 12, 31),
interval="1d",
eager=True
)
ts_df = pl.DataFrame({
"date": dates,
"value": np.random.randn(len(dates)).cumsum()
})
# Resample to different frequencies
monthly = (
ts_df
.group_by_dynamic("date", every="1mo")
.agg([
pl.col("value").mean().alias("avg_value"),
pl.col("value").sum().alias("total_value"),
pl.col("value").count().alias("count")
])
)
# Weekly resampling with custom aggregation
weekly = (
ts_df
.group_by_dynamic("date", every="1w", label="left")
.agg([
pl.col("value").first().alias("week_start"),
pl.col("value").last().alias("week_end"),
pl.col("value").max().alias("week_max"),
pl.col("value").min().alias("week_min")
])
)
Rolling Time Windows
# Time-based rolling windows
df = df.sort("datetime").with_columns([
# Rolling statistics over time periods
pl.col("value").rolling_mean_by("datetime", window_size="7d").alias("7day_avg"),
pl.col("value").rolling_sum_by("datetime", window_size="30d").alias("30day_sum"),
pl.col("value").rolling_std_by("datetime", window_size="14d").alias("14day_std"),
# Rolling with minimum periods
pl.col("value").rolling_mean_by(
"datetime",
window_size="7d",
min_periods=3
).alias("7day_avg_min3")
])
# Custom rolling functions
df = df.with_columns([
pl.col("value").rolling_quantile_by("datetime", quantile=0.9, window_size="30d").alias("30day_p90"),
pl.col("value").rolling_median_by("datetime", window_size="14d").alias("14day_median")
])
Time Zone Handling
# Working with time zones
df = pl.DataFrame({
"datetime_utc": [
datetime(2023, 1, 1, 12, 0),
datetime(2023, 1, 2, 12, 0)
],
"value": [100, 200]
})
# Convert to different time zones
df = df.with_columns([
pl.col("datetime_utc").dt.replace_time_zone("UTC").alias("utc"),
pl.col("datetime_utc").dt.replace_time_zone("UTC")
.dt.convert_time_zone("America/New_York").alias("ny_time"),
pl.col("datetime_utc").dt.replace_time_zone("UTC")
.dt.convert_time_zone("Europe/London").alias("london_time")
])
Advanced Time Series Operations
# Lag and lead operations
df = df.sort("date").with_columns([
pl.col("value").shift(1).alias("prev_value"),
pl.col("value").shift(-1).alias("next_value"),
pl.col("value").shift(periods=7).alias("value_7_days_ago")
])
# Percentage change and differences
df = df.with_columns([
pl.col("value").pct_change().alias("pct_change"),
pl.col("value").diff().alias("diff"),
(pl.col("value") / pl.col("value").shift(1) - 1).alias("manual_pct_change")
])
# Fill missing time periods
complete_dates = pl.date_range(
start=df["date"].min(),
end=df["date"].max(),
interval="1d",
eager=True
)
complete_df = pl.DataFrame({"date": complete_dates})
filled_df = complete_df.join(df, on="date", how="left")
# Forward fill and backward fill
filled_df = filled_df.with_columns([
pl.col("value").forward_fill().alias("forward_filled"),
pl.col("value").backward_fill().alias("backward_filled")
])
Time Series Analysis Pipeline
flowchart TD
A[Raw Time Data] --> B[Parse Dates]
B --> C[Sort by Time]
C --> D[Handle Missing Periods]
D --> E[Create Time Features]
E --> F[Rolling Calculations]
F --> G[Resampling]
G --> H[Analysis Ready Data]
subgraph "Time Operations"
I[dt.year, dt.month]
J[rolling_mean_by]
K[group_by_dynamic]
L[shift, diff]
end
E --> I
F --> J
G --> K
F --> L
11. Advanced Features
Lazy Evaluation
# Lazy DataFrames for query optimization
lazy_df = pl.scan_csv("large_file.csv")
# Build query without execution
query = (
lazy_df
.filter(pl.col("value") > 100)
.group_by("category")
.agg([
    # Aliases are required here so the two aggregates do not collide on the name "value"
    pl.col("value").sum().alias("value_sum"),
    pl.col("value").count().alias("value_count")
])
.sort("value_sum", descending=True)
.limit(10)
)
# Show query plan
print(query.explain())
# Execute query
result = query.collect()
# Streaming for large datasets
result = query.collect(streaming=True)
Custom Functions and Expressions
# Custom Python functions with map_elements
def custom_function(x):
return x ** 2 + 1
df = df.with_columns([
pl.col("value").map_elements(custom_function, return_dtype=pl.Float64).alias("custom")
])
# Apply function to multiple columns
def process_row(row):
return row[0] * row[1] + row[2]
df = df.with_columns([
pl.struct(["a", "b", "c"]).map_elements(
lambda x: process_row([x["a"], x["b"], x["c"]]),
return_dtype=pl.Float64
).alias("processed")
])
# Custom aggregation functions
def weighted_mean(values, weights):
return (values * weights).sum() / weights.sum()
result = df.group_by("group").agg([
pl.struct(["value", "weight"]).map_elements(
lambda x: weighted_mean(x.struct.field("value"), x.struct.field("weight")),
return_dtype=pl.Float64
).alias("weighted_avg")
])
Working with Lists and Structs
# List operations
df = pl.DataFrame({
"id": [1, 2, 3],
"numbers": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
"names": [["Alice", "Bob"], ["Charlie"], ["Diana", "Eve", "Frank"]]
})
# List manipulations
df = df.with_columns([
pl.col("numbers").list.len().alias("list_length"),
pl.col("numbers").list.sum().alias("list_sum"),
pl.col("numbers").list.mean().alias("list_mean"),
pl.col("numbers").list.get(0).alias("first_item"),
pl.col("numbers").list.slice(1, 2).alias("slice_1_2"),
pl.col("names").list.join(", ").alias("names_joined")
])
# Explode lists to rows
exploded = df.select(["id", "numbers"]).explode("numbers")
# Struct operations
df_struct = pl.DataFrame({
"id": [1, 2, 3],
"person": [
{"name": "Alice", "age": 25},
{"name": "Bob", "age": 30},
{"name": "Charlie", "age": 35}
]
})
df_struct = df_struct.with_columns([
pl.col("person").struct.field("name").alias("name"),
pl.col("person").struct.field("age").alias("age")
])
Window Functions
# Window functions for ranking and statistics
df = pl.DataFrame({
"department": ["Sales", "Sales", "Engineering", "Engineering", "Marketing"],
"employee": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"salary": [50000, 60000, 80000, 90000, 55000]
})
# Ranking functions
df = df.with_columns([
pl.col("salary").rank().over("department").alias("rank_in_dept"),
pl.col("salary").rank(method="dense").over("department").alias("dense_rank"),
pl.col("salary").rank(descending=True).over("department").alias("rank_desc")
])
# Statistical window functions
df = df.with_columns([
pl.col("salary").mean().over("department").alias("dept_avg_salary"),
pl.col("salary").std().over("department").alias("dept_salary_std"),
(pl.col("salary") - pl.col("salary").mean().over("department")).alias("salary_diff_from_avg")
])
# Cumulative operations within groups
df = df.sort(["department", "salary"]).with_columns([
pl.col("salary").cumsum().over("department").alias("cumulative_salary"),
pl.col("salary").cumcount().over("department").alias("employee_number")
])
Parallel Processing and Performance
# Thread usage is controlled by the POLARS_MAX_THREADS environment variable,
# which must be set before polars is imported (e.g. os.environ["POLARS_MAX_THREADS"] = "8")
# Parallel operations on lazy frames
lazy_results = []
for file in file_list:
lazy_df = (
pl.scan_csv(file)
.filter(pl.col("date").is_between(start_date, end_date))
.group_by("category")
.agg(pl.col("value").sum())
)
lazy_results.append(lazy_df)
# Combine results
final_result = pl.concat(lazy_results).collect()
# Streaming for memory efficiency
large_result = (
pl.scan_csv("very_large_file.csv")
.filter(pl.col("important_flag") == True)
.select(["id", "value", "category"])
.collect(streaming=True)
)
Plugin System and Extensions
# Using Polars plugins (example with polars-xdt for extended datetime)
import polars_xdt as xdt
df = df.with_columns([
pl.col("date").xdt.format_localized("%A, %B %d, %Y", locale="en_US"),
pl.col("date").xdt.is_weekend(),
pl.col("date").xdt.workday_count(pl.col("end_date"))
])
# Custom namespace example
@pl.api.register_expr_namespace("business")
class BusinessOps:
def __init__(self, expr: pl.Expr):
self._expr = expr
    def workdays_between(self, end_date: pl.Expr) -> pl.Expr:
        # Custom business logic: pair start and end dates row-wise, then count
        # business days with NumPy (assumes numpy is imported as np)
        return pl.struct([self._expr.alias("start"), end_date.alias("end")]).map_elements(
            lambda row: int(np.busday_count(row["start"], row["end"])),
            return_dtype=pl.Int32
        )
# Usage
df = df.with_columns([
pl.col("start_date").business.workdays_between(pl.col("end_date"))
])
12. Performance Optimization
Query Optimization Strategies
# Use lazy evaluation for complex queries
optimized_query = (
pl.scan_parquet("large_dataset.parquet")
.filter(pl.col("date").is_between(start_date, end_date)) # Early filtering
.select(["id", "value", "category"]) # Column selection
.filter(pl.col("value") > threshold) # Additional filtering
.group_by("category")
.agg(pl.col("value").sum())
.collect()
)
# Predicate pushdown example
lazy_df = (
pl.scan_csv("data.csv")
.filter(pl.col("year") == 2023) # This filter is pushed down to file reading
.select(["month", "sales"])
.group_by("month")
.agg(pl.col("sales").sum())
)
Memory Management
# Use appropriate data types
df = df.with_columns([
pl.col("small_int").cast(pl.Int8), # Instead of Int64
pl.col("category").cast(pl.Categorical), # For repeated strings
pl.col("flag").cast(pl.Boolean), # Instead of string
pl.col("percentage").cast(pl.Float32) # Instead of Float64 when precision allows
])
# Chunking for large operations
def process_in_chunks(df, chunk_size=10000):
results = []
for i in range(0, len(df), chunk_size):
chunk = df.slice(i, chunk_size)
processed = chunk.with_columns([
# Expensive operations here
pl.col("value").map_elements(expensive_function)
])
results.append(processed)
return pl.concat(results)
# Memory-efficient file processing
def process_large_file(filepath):
return (
pl.scan_csv(filepath)
.filter(pl.col("relevant_column").is_not_null())
.select(["id", "important_col1", "important_col2"])
.collect(streaming=True) # Process without loading entire file
)
Parallel Processing Patterns
# Parallel file processing
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp
def process_file(filepath):
return (
pl.read_csv(filepath)
.filter(pl.col("status") == "active")
.group_by("category")
.agg(pl.col("value").sum())
)
# Process multiple files in parallel
file_paths = ["file1.csv", "file2.csv", "file3.csv"]
with ProcessPoolExecutor(max_workers=mp.cpu_count()) as executor:
results = list(executor.map(process_file, file_paths))
combined = pl.concat(results)
# Parallel groupby operations
large_df = pl.read_parquet("large_dataset.parquet")
# This automatically uses multiple threads
result = (
large_df
.group_by("large_grouping_column")
.agg([
pl.col("numeric_col").sum(),
pl.col("numeric_col").mean(),
pl.col("categorical_col").n_unique()
])
)
Efficient Data Loading
# Optimized CSV reading
df = pl.read_csv(
"large_file.csv",
use_pyarrow=True, # Use PyArrow engine for better performance
rechunk=True, # Optimize memory layout
low_memory=False, # Use more memory for speed
ignore_errors=True # Skip problematic rows
)
# Lazy loading with projections
df = (
pl.scan_csv("huge_file.csv")
.select(["col1", "col2", "col3"]) # Only read needed columns
.filter(pl.col("date") > cutoff_date) # Filter early
.collect()
)
# Parquet optimization
# Save with optimal settings
df.write_parquet(
"optimized.parquet",
compression="snappy", # Good balance of speed and size
statistics=True, # Enable predicate pushdown
row_group_size=100000 # Optimize for your use case
)
Benchmarking and Profiling
import time
from functools import wraps
def timing_decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
end = time.time()
print(f"{func.__name__} took {end - start:.4f} seconds")
return result
return wrapper
@timing_decorator
def slow_operation(df):
return df.group_by("category").agg(pl.col("value").sum())
@timing_decorator
def optimized_operation(df):
return (
df.lazy()
.group_by("category")
.agg(pl.col("value").sum())
.collect()
)
# Memory profiling
def profile_memory(operation, df):
import psutil
process = psutil.Process()
memory_before = process.memory_info().rss / 1024 / 1024 # MB
result = operation(df)
memory_after = process.memory_info().rss / 1024 / 1024 # MB
print(f"Memory used: {memory_after - memory_before:.2f} MB")
return result
Performance Optimization Checklist
graph TD
A[Performance Issue] --> B{Identify Bottleneck}
B --> C[I/O Bound]
B --> D[CPU Bound]
B --> E[Memory Bound]
C --> F[Use lazy evaluation]
C --> G[Optimize file formats]
C --> H[Column selection early]
D --> I[Use proper data types]
D --> J[Vectorized operations]
D --> K[Parallel processing]
E --> L[Streaming mode]
E --> M[Chunking]
E --> N[Reduce data size]
F --> O[Benchmark Results]
G --> O
H --> O
I --> O
J --> O
K --> O
L --> O
M --> O
N --> O
13. Real-World Examples
Example 1: Sales Data Analysis
# Load and prepare sales data
sales_df = pl.read_csv("sales_data.csv").with_columns([
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("amount").cast(pl.Float64),
pl.col("category").cast(pl.Categorical)
])
# Monthly sales analysis
monthly_sales = (
sales_df
.group_by_dynamic("date", every="1mo", label="left")
.agg([
pl.col("amount").sum().alias("total_sales"),
pl.col("amount").count().alias("transaction_count"),
pl.col("amount").mean().alias("avg_transaction"),
pl.col("customer_id").n_unique().alias("unique_customers")
])
.with_columns([
pl.col("total_sales").pct_change().alias("sales_growth"),
(pl.col("total_sales") / pl.col("transaction_count")).alias("avg_order_value")
])
)
# Top performing categories
category_performance = (
sales_df
.group_by("category")
.agg([
pl.col("amount").sum().alias("total_revenue"),
pl.col("amount").count().alias("transactions"),
pl.col("customer_id").n_unique().alias("customers")
])
.with_columns([
(pl.col("total_revenue") / pl.col("total_revenue").sum() * 100).alias("revenue_share"),
pl.col("total_revenue").rank(method="ordinal", descending=True).alias("rank")
])
.sort("total_revenue", descending=True)
)
# Customer segmentation
customer_segments = (
sales_df
.group_by("customer_id")
.agg([
pl.col("amount").sum().alias("total_spent"),
pl.col("amount").count().alias("purchase_frequency"),
pl.col("date").max().alias("last_purchase"),
pl.col("date").min().alias("first_purchase")
])
.with_columns([
(pl.col("last_purchase") - pl.col("first_purchase")).dt.days().alias("customer_lifetime_days"),
pl.when(pl.col("total_spent") > 1000).then("High Value")
.when(pl.col("total_spent") > 500).then("Medium Value")
.otherwise("Low Value").alias("value_segment"),
pl.when(pl.col("purchase_frequency") > 10).then("Frequent")
.when(pl.col("purchase_frequency") > 3).then("Regular")
.otherwise("Occasional").alias("frequency_segment")
])
)
Example 2: Log File Analysis
# Parse web server logs
log_pattern = r'(\S+) - - \[([^\]]+)\] "(\S+) ([^\s]+) ([^"]+)" (\d+) (\d+)'
logs_df = (
    pl.read_csv("access.log", separator="\t", has_header=False, new_columns=["raw_log"])
    .with_columns([
        # str.extract pulls a single capture group per call; group 5 (the protocol) is skipped
        pl.col("raw_log").str.extract(log_pattern, 1).alias("ip"),
        pl.col("raw_log").str.extract(log_pattern, 2)
            .str.strptime(pl.Datetime, "%d/%b/%Y:%H:%M:%S %z").alias("timestamp"),
        pl.col("raw_log").str.extract(log_pattern, 3).alias("method"),
        pl.col("raw_log").str.extract(log_pattern, 4).alias("url"),
        pl.col("raw_log").str.extract(log_pattern, 6).cast(pl.Int32).alias("status_code"),
        pl.col("raw_log").str.extract(log_pattern, 7).cast(pl.Int32).alias("response_size")
    ])
    .select(["ip", "timestamp", "method", "url", "status_code", "response_size"])
)
# Traffic analysis
hourly_traffic = (
logs_df
.group_by_dynamic("timestamp", every="1h")
.agg([
pl.col("ip").count().alias("requests"),
pl.col("ip").n_unique().alias("unique_visitors"),
pl.col("response_size").sum().alias("total_bytes"),
pl.col("status_code").filter(pl.col("status_code") >= 400).count().alias("error_count")
])
.with_columns([
(pl.col("error_count") / pl.col("requests") * 100).alias("error_rate"),
(pl.col("total_bytes") / 1024 / 1024).alias("total_mb")
])
)
# Top endpoints
top_endpoints = (
logs_df
.group_by("url")
.agg([
pl.col("ip").count().alias("hits"),
pl.col("ip").n_unique().alias("unique_visitors"),
pl.col("response_size").mean().alias("avg_response_size")
])
.sort("hits", descending=True)
.head(20)
)
# Bot detection
potential_bots = (
logs_df
.group_by("ip")
.agg([
pl.col("url").count().alias("request_count"),
pl.col("url").n_unique().alias("unique_urls"),
pl.col("timestamp").max().alias("last_seen"),
pl.col("timestamp").min().alias("first_seen")
])
.with_columns([
(pl.col("last_seen") - pl.col("first_seen")).dt.seconds().alias("session_duration"),
(pl.col("request_count") / pl.col("unique_urls")).alias("requests_per_url")
])
.filter(
(pl.col("request_count") > 100) &
(pl.col("requests_per_url") > 10) &
(pl.col("session_duration") < 3600) # Less than 1 hour
)
)
Example 3: Financial Data Processing
# Stock price analysis
stock_df = pl.read_csv("stock_prices.csv").with_columns([
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("open", "high", "low", "close", "volume").cast(pl.Float64)
])
# Technical indicators
stock_analysis = (
stock_df
.sort("date")
.with_columns([
# Simple moving averages
pl.col("close").rolling_mean(window_size=20).alias("sma_20"),
pl.col("close").rolling_mean(window_size=50).alias("sma_50"),
# Exponential moving average (approximation)
pl.col("close").ewm_mean(span=12).alias("ema_12"),
# Price changes
pl.col("close").pct_change().alias("daily_return"),
(pl.col("close") / pl.col("close").shift(252) - 1).alias("annual_return"),
# Volatility (20-day rolling standard deviation of returns)
pl.col("close").pct_change().rolling_std(window_size=20).alias("volatility_20d"),
# Support and resistance levels
pl.col("low").rolling_min(window_size=20).alias("support_20d"),
pl.col("high").rolling_max(window_size=20).alias("resistance_20d")
])
)
# Risk metrics
risk_metrics = (
stock_analysis
.select([
pl.col("daily_return").mean().alias("avg_daily_return"),
pl.col("daily_return").std().alias("daily_volatility"),
pl.col("annual_return").mean().alias("avg_annual_return"),
(pl.col("daily_return").mean() / pl.col("daily_return").std()).alias("sharpe_ratio_daily"),
pl.col("daily_return").filter(pl.col("daily_return") < 0).count().alias("down_days"),
pl.col("daily_return").count().alias("total_days")
])
.with_columns([
(pl.col("down_days") / pl.col("total_days")).alias("probability_of_loss"),
(pl.col("avg_daily_return") * 252).alias("annualized_return"),
(pl.col("daily_volatility") * (252 ** 0.5)).alias("annualized_volatility")
])
)
# Portfolio optimization (equal weight example)
portfolio_stocks = ["AAPL", "GOOGL", "MSFT", "AMZN"]
portfolio_analysis = (
stock_df
.filter(pl.col("symbol").is_in(portfolio_stocks))
.group_by("date")
.agg([
pl.col("close").mean().alias("portfolio_close"),
pl.col("volume").sum().alias("total_volume")
])
.sort("date")
.with_columns([
pl.col("portfolio_close").pct_change().alias("portfolio_return"),
pl.col("portfolio_close").rolling_std(window_size=252).alias("portfolio_volatility")
])
)
Example 4: IoT Sensor Data Processing
# Process IoT sensor data
sensor_df = (
pl.read_csv("sensor_data.csv")
.with_columns([
pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"),
pl.col("temperature", "humidity", "pressure").cast(pl.Float64),
pl.col("sensor_id").cast(pl.Categorical)
])
)
# Anomaly detection using statistical methods
anomaly_detection = (
sensor_df
.sort(["sensor_id", "timestamp"])
.with_columns([
# Calculate rolling statistics
pl.col("temperature").rolling_mean(window_size=100).over("sensor_id").alias("temp_mean"),
pl.col("temperature").rolling_std(window_size=100).over("sensor_id").alias("temp_std"),
pl.col("humidity").rolling_mean(window_size=100).over("sensor_id").alias("humidity_mean"),
pl.col("humidity").rolling_std(window_size=100).over("sensor_id").alias("humidity_std")
])
.with_columns([
    # Z-score for anomaly detection
    ((pl.col("temperature") - pl.col("temp_mean")) / pl.col("temp_std")).abs().alias("temp_zscore"),
    ((pl.col("humidity") - pl.col("humidity_mean")) / pl.col("humidity_std")).abs().alias("humidity_zscore")
])
.with_columns([
    # Flag anomalies (beyond 3 standard deviations); the z-score columns must be
    # created in a previous step before they can be referenced here
    (pl.col("temp_zscore") > 3).alias("temp_anomaly"),
    (pl.col("humidity_zscore") > 3).alias("humidity_anomaly")
])
.with_columns([
(pl.col("temp_anomaly") | pl.col("humidity_anomaly")).alias("any_anomaly")
])
)
# Sensor health monitoring
sensor_health = (
sensor_df
.group_by_dynamic("timestamp", every="1h", group_by="sensor_id")
.agg([
pl.col("temperature").count().alias("readings_count"),
pl.col("temperature").mean().alias("avg_temp"),
pl.col("temperature").std().alias("temp_variance"),
pl.col("humidity").mean().alias("avg_humidity"),
pl.col("pressure").mean().alias("avg_pressure")
])
.with_columns([
    # Expected readings per hour (assuming 1 reading per minute)
    (pl.col("readings_count") / 60).alias("data_completeness")
])
.with_columns([
    # Flag sensors with issues (data_completeness must already exist to be referenced)
    pl.when(pl.col("readings_count") < 50).then(pl.lit("Low Data"))
    .when(pl.col("temp_variance") > 10).then(pl.lit("High Variance"))
    .when(pl.col("data_completeness") < 0.8).then(pl.lit("Missing Data"))
    .otherwise(pl.lit("Normal")).alias("health_status")
])
)
# Data quality report
data_quality_report = (
sensor_df
.group_by("sensor_id")
.agg([
pl.col("timestamp").count().alias("total_readings"),
pl.col("temperature").null_count().alias("temp_nulls"),
pl.col("humidity").null_count().alias("humidity_nulls"),
pl.col("pressure").null_count().alias("pressure_nulls"),
pl.col("timestamp").min().alias("first_reading"),
pl.col("timestamp").max().alias("last_reading")
])
.with_columns([
(pl.col("last_reading") - pl.col("first_reading")).dt.days().alias("operational_days"),
(pl.col("temp_nulls") / pl.col("total_readings") * 100).alias("temp_null_pct"),
(pl.col("humidity_nulls") / pl.col("total_readings") * 100).alias("humidity_null_pct"),
(pl.col("pressure_nulls") / pl.col("total_readings") * 100).alias("pressure_null_pct")
])
)
14. Migration from Pandas
Syntax Comparison
# Pandas vs Polars equivalent operations
# Data Loading
# Pandas
df_pandas = pd.read_csv("data.csv")
# Polars
df_polars = pl.read_csv("data.csv")
# Column Selection
# Pandas
df_pandas[["col1", "col2"]]
df_pandas.loc[:, "col1":"col3"]
# Polars
df_polars.select(["col1", "col2"])
df_polars.select(pl.col("col1", "col2", "col3"))
# Filtering continued
# Pandas
df_pandas[df_pandas["age"] > 25]
df_pandas.query("age > 25 and city == 'NYC'")
# Polars
df_polars.filter(pl.col("age") > 25)
df_polars.filter((pl.col("age") > 25) & (pl.col("city") == "NYC"))
# GroupBy and Aggregation
# Pandas
df_pandas.groupby("category").agg({"value": ["sum", "mean", "count"]})
# Polars
df_polars.group_by("category").agg([
    # Aliases keep the three aggregates from colliding on the output name "value"
    pl.col("value").sum().alias("value_sum"),
    pl.col("value").mean().alias("value_mean"),
    pl.col("value").count().alias("value_count")
])
# Adding Columns
# Pandas
df_pandas["new_col"] = df_pandas["col1"] * df_pandas["col2"]
df_pandas.assign(new_col=lambda x: x["col1"] * x["col2"])
# Polars
df_polars.with_columns([
(pl.col("col1") * pl.col("col2")).alias("new_col")
])
# Sorting
# Pandas
df_pandas.sort_values(["col1", "col2"], ascending=[True, False])
# Polars
df_polars.sort(["col1", "col2"], descending=[False, True])
# Joins
# Pandas
df1.merge(df2, on="key", how="left")
# Polars
df1.join(df2, on="key", how="left")
Migration Strategy
flowchart TD
A[Pandas Code Analysis] --> B[Identify Operations]
B --> C[Map to Polars Equivalents]
C --> D[Test Performance]
D --> E[Optimize with Lazy Operations]
E --> F[Production Deployment]
subgraph "Common Challenges"
G[Index Operations]
H[Inplace Operations]
I[Method Chaining Differences]
J[API Differences]
end
B --> G
B --> H
B --> I
B --> J
Common Migration Patterns
# Pattern 1: Index-based operations
# Pandas (avoid in Polars)
df_pandas.reset_index()
df_pandas.set_index("column")
# Polars equivalent
df_polars.with_row_count("index") # Add row numbers
df_polars.sort("column") # Use sorting instead of indexing
# Pattern 2: Inplace operations
# Pandas
df_pandas.dropna(inplace=True)
df_pandas.fillna(0, inplace=True)
# Polars (immutable)
df_polars = df_polars.drop_nulls()
df_polars = df_polars.fill_null(0)
# Pattern 3: Apply functions
# Pandas
df_pandas["new_col"] = df_pandas["col"].apply(lambda x: x ** 2)
# Polars
df_polars = df_polars.with_columns([
pl.col("col").map_elements(lambda x: x ** 2, return_dtype=pl.Float64).alias("new_col")
])
# Better Polars (vectorized)
df_polars = df_polars.with_columns([
pl.col("col").pow(2).alias("new_col")
])
Performance Comparison
import time
import pandas as pd
import polars as pl
import numpy as np
# Generate test data
n_rows = 1_000_000
data = {
"id": range(n_rows),
"value": np.random.randn(n_rows),
"category": np.random.choice(["A", "B", "C", "D"], n_rows),
"date": pd.date_range("2020-01-01", periods=n_rows, freq="1min")
}
df_pandas = pd.DataFrame(data)
df_polars = pl.DataFrame(data)
# Benchmark groupby operations
def benchmark_groupby():
# Pandas
start = time.time()
result_pandas = df_pandas.groupby("category").agg({
"value": ["sum", "mean", "std", "count"]
})
pandas_time = time.time() - start
# Polars
start = time.time()
result_polars = df_polars.group_by("category").agg([
    # Aliases keep the four aggregates from colliding on the output name "value"
    pl.col("value").sum().alias("sum"),
    pl.col("value").mean().alias("mean"),
    pl.col("value").std().alias("std"),
    pl.col("value").count().alias("count")
])
polars_time = time.time() - start
print(f"Pandas: {pandas_time:.4f}s")
print(f"Polars: {polars_time:.4f}s")
print(f"Speedup: {pandas_time / polars_time:.2f}x")
benchmark_groupby()
15. Best Practices
Code Organization
# 1. Use lazy evaluation for complex pipelines
def process_sales_data(file_path: str) -> pl.DataFrame:
return (
pl.scan_csv(file_path)
.filter(pl.col("date").is_between("2023-01-01", "2023-12-31"))
.with_columns([
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("amount").cast(pl.Float64)
])
.group_by(["product", "region"])
.agg([
pl.col("amount").sum().alias("total_sales"),
pl.col("amount").count().alias("transactions")
])
.collect()
)
# 2. Create reusable expression functions
def create_age_groups(age_column: str) -> pl.Expr:
return (
pl.when(pl.col(age_column) < 18).then(pl.lit("Under 18"))
.when(pl.col(age_column) < 65).then(pl.lit("Adult"))
.otherwise(pl.lit("Senior"))
.alias("age_group")
)
# 3. Use configuration for consistent formatting
def setup_polars_config():
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(8)
pl.Config.set_fmt_str_lengths(50)
    # The thread pool size is controlled by POLARS_MAX_THREADS before import, not by pl.Config
Error Handling and Validation
def safe_data_processing(df: pl.DataFrame) -> pl.DataFrame:
try:
# Validate input
required_columns = ["id", "value", "date"]
missing_cols = set(required_columns) - set(df.columns)
if missing_cols:
raise ValueError(f"Missing columns: {missing_cols}")
# Safe type casting
result = df.with_columns([
pl.col("value").cast(pl.Float64, strict=False),
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
])
# Check for processing errors
null_dates = result.filter(pl.col("date").is_null()).height
if null_dates > 0:
print(f"Warning: {null_dates} rows with invalid dates")
return result
except Exception as e:
print(f"Error processing data: {e}")
raise
# Schema validation
def validate_schema(df: pl.DataFrame, expected_schema: dict) -> bool:
"""Validate DataFrame schema against expected types"""
for col, expected_type in expected_schema.items():
if col not in df.columns:
raise ValueError(f"Missing column: {col}")
if df[col].dtype != expected_type:
raise TypeError(f"Column {col} has type {df[col].dtype}, expected {expected_type}")
return True
# Usage
expected_schema = {
"id": pl.Int64,
"name": pl.Utf8,
"value": pl.Float64
}
validate_schema(df, expected_schema)
Memory and Performance Best Practices
# 1. Use streaming for large datasets
def process_large_dataset(file_path: str):
return (
pl.scan_parquet(file_path)
.filter(pl.col("active") == True)
.select(["id", "value", "category"])
.group_by("category")
.agg(pl.col("value").sum())
.collect(streaming=True) # Process in chunks
)
# 2. Optimize data types
def optimize_dtypes(df: pl.DataFrame) -> pl.DataFrame:
"""Optimize data types for memory efficiency"""
return df.with_columns([
# Use smaller integer types when possible
pl.col("small_numbers").cast(pl.Int16),
# Use categorical for repeated strings
pl.col("categories").cast(pl.Categorical),
# Use appropriate float precision
pl.col("percentages").cast(pl.Float32)
])
# 3. Efficient filtering order
def efficient_filtering(df: pl.DataFrame):
"""Apply filters in order of selectivity"""
return (
df
.filter(pl.col("status") == "active") # Most selective first
.filter(pl.col("date") > "2023-01-01") # Then date range
.filter(pl.col("value") > 0) # Least selective last
)
# 4. Batch operations
def batch_process_files(file_paths: list, batch_size: int = 10):
"""Process files in batches to manage memory"""
results = []
for i in range(0, len(file_paths), batch_size):
batch = file_paths[i:i + batch_size]
batch_dfs = [pl.read_csv(f) for f in batch]
combined = pl.concat(batch_dfs)
processed = combined.group_by("category").agg(pl.col("value").sum())
results.append(processed)
return pl.concat(results)
Testing Strategies
import pytest
def test_data_transformation():
"""Test data transformation functions"""
# Sample data
test_df = pl.DataFrame({
"age": [15, 25, 70],
"name": ["Alice", "Bob", "Charlie"]
})
# Apply transformation
result = test_df.with_columns([create_age_groups("age")])
# Assertions
assert "age_group" in result.columns
assert result["age_group"].to_list() == ["Under 18", "Adult", "Senior"]
def test_aggregation_consistency():
"""Test that aggregations produce consistent results"""
df = pl.DataFrame({
"group": ["A", "A", "B", "B"],
"value": [1, 2, 3, 4]
})
result = df.group_by("group").agg(pl.col("value").sum())
expected = pl.DataFrame({
"group": ["A", "B"],
"value": [3, 7]
})
assert result.sort("group").equals(expected.sort("group"))
# Property-based testing example
from hypothesis import given, strategies as st
import hypothesis.extra.pandas as st_pd
@given(st_pd.data_frames([
st_pd.column("value", elements=st.floats(0, 1000)),
st_pd.column("category", elements=st.text(min_size=1, max_size=5))
]))
def test_groupby_sum_is_positive(df_pandas):
"""Property test: sum of positive values should be positive"""
df = pl.from_pandas(df_pandas)
if df.height > 0:
result = df.group_by("category").agg(pl.col("value").sum())
assert all(result["value"] >= 0)
Documentation and Type Hints
from typing import List, Dict, Optional, Union
import polars as pl
def analyze_sales_trends(
df: pl.DataFrame,
date_column: str = "date",
value_column: str = "sales",
groupby_columns: Optional[List[str]] = None,
rolling_window: int = 30
) -> Dict[str, pl.DataFrame]:
"""
Analyze sales trends with rolling averages and growth rates.
Args:
df: Input DataFrame with sales data
date_column: Name of the date column
value_column: Name of the sales value column
groupby_columns: Optional columns to group by
rolling_window: Window size for rolling average (days)
Returns:
Dictionary containing trend analysis results:
- 'trends': DataFrame with rolling averages and growth rates
- 'summary': DataFrame with summary statistics
Example:
>>> df = pl.DataFrame({
... "date": ["2023-01-01", "2023-01-02"],
... "sales": [100, 150],
... "region": ["North", "South"]
... })
>>> results = analyze_sales_trends(df, groupby_columns=["region"])
"""
# Input validation
required_columns = [date_column, value_column]
missing_columns = set(required_columns) - set(df.columns)
if missing_columns:
raise ValueError(f"Missing required columns: {missing_columns}")
# Main analysis
if groupby_columns:
trends = (
df
.sort([*groupby_columns, date_column])
.with_columns([
pl.col(value_column)
.rolling_mean(rolling_window)
.over(groupby_columns)
.alias(f"rolling_avg_{rolling_window}d"),
pl.col(value_column)
.pct_change()
.over(groupby_columns)
.alias("growth_rate")
])
)
summary = (
df
.group_by(groupby_columns)
.agg([
pl.col(value_column).sum().alias("total_sales"),
pl.col(value_column).mean().alias("avg_sales"),
pl.col(value_column).std().alias("sales_volatility")
])
)
else:
trends = df.sort(date_column).with_columns([
pl.col(value_column).rolling_mean(rolling_window).alias(f"rolling_avg_{rolling_window}d"),
pl.col(value_column).pct_change().alias("growth_rate")
])
summary = df.select([
pl.col(value_column).sum().alias("total_sales"),
pl.col(value_column).mean().alias("avg_sales"),
pl.col(value_column).std().alias("sales_volatility")
])
return {
"trends": trends,
"summary": summary
}
Production Deployment Best Practices
import logging
import os
from pathlib import Path
from typing import Optional
import polars as pl
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class DataProcessor:
"""Production-ready data processor with error handling and logging"""
def __init__(self, config: dict):
self.config = config
self.setup_polars()
def setup_polars(self):
"""Configure Polars for production use"""
        # POLARS_MAX_THREADS only takes effect if set before polars is imported
        os.environ.setdefault("POLARS_MAX_THREADS", str(self.config.get("threads", 4)))
pl.Config.set_streaming_chunk_size(self.config.get("chunk_size", 10000))
def process_file(self, file_path: Path) -> Optional[pl.DataFrame]:
"""Process a single file with error handling"""
try:
logger.info(f"Processing file: {file_path}")
# Validate file exists
if not file_path.exists():
logger.error(f"File not found: {file_path}")
return None
# Process with lazy evaluation
result = (
pl.scan_csv(file_path)
.filter(pl.col("date").is_not_null())
.filter(pl.col("value") > 0)
.with_columns([
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("value").cast(pl.Float64)
])
.collect(streaming=True)
)
logger.info(f"Successfully processed {result.height} rows")
return result
except Exception as e:
logger.error(f"Error processing {file_path}: {e}")
return None
def save_results(self, df: pl.DataFrame, output_path: Path):
"""Save results with appropriate format"""
try:
output_path.parent.mkdir(parents=True, exist_ok=True)
if output_path.suffix == ".parquet":
df.write_parquet(output_path, compression="snappy")
elif output_path.suffix == ".csv":
df.write_csv(output_path)
else:
raise ValueError(f"Unsupported output format: {output_path.suffix}")
logger.info(f"Results saved to: {output_path}")
except Exception as e:
logger.error(f"Error saving results: {e}")
raise
# Configuration management
def load_config() -> dict:
"""Load configuration from environment or config file"""
return {
"threads": int(os.getenv("POLARS_THREADS", 4)),
"chunk_size": int(os.getenv("POLARS_CHUNK_SIZE", 10000)),
"input_dir": Path(os.getenv("INPUT_DIR", "./data")),
"output_dir": Path(os.getenv("OUTPUT_DIR", "./output"))
}
Best Practices Summary
mindmap
root((Polars Best Practices))
Performance
Use lazy evaluation
Optimize data types
Stream large datasets
Filter early and efficiently
Code Quality
Type hints
Error handling
Comprehensive testing
Clear documentation
Production
Logging and monitoring
Configuration management
Resource management
Graceful error handling
Migration
Incremental approach
Performance benchmarking
Feature parity validation
Team training
Final Recommendations
- Start Simple: Begin with basic operations and gradually adopt advanced features
- Leverage Lazy Evaluation: Use scan_* functions and .lazy() for better performance
- Optimize Data Types: Choose appropriate types to reduce memory usage
- Test Thoroughly: Write comprehensive tests for data transformations
- Monitor Performance: Benchmark critical operations and optimize bottlenecks
- Handle Errors Gracefully: Implement proper error handling and validation
- Document Well: Provide clear documentation for complex operations
- Stay Updated: Keep up with Polars releases and new features
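To act on that last recommendation, a minimal sketch using only standard Polars utilities is enough to confirm which build you are running before adopting a new feature:
import polars as pl

# The installed Polars version as a string
print(pl.__version__)

# A fuller environment report (Polars build plus versions of optional dependencies)
pl.show_versions()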
Appendix A: Quick Reference and Cheat Sheets
Essential Imports and Setup
import polars as pl
import numpy as np
from datetime import datetime, date, timedelta
# Configuration
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(8)
pl.Config.set_fmt_str_lengths(50)
DataFrame Creation Cheat Sheet
# From dictionary (most common)
df = pl.DataFrame({
"name": ["Alice", "Bob"],
"age": [25, 30],
"salary": [50000, 60000]
})
# From lists with schema
df = pl.DataFrame([
["Alice", 25], ["Bob", 30]
], schema=["name", "age"], orient="row")
# Empty with schema
df = pl.DataFrame(schema={
"id": pl.Int64,
"name": pl.Utf8,
"value": pl.Float64
})
File I/O Quick Reference
# Reading files
df = pl.read_csv("file.csv")
df = pl.read_parquet("file.parquet")
df = pl.read_json("file.json")
# Lazy reading (for large files)
df = pl.scan_csv("large_file.csv").collect()
# Writing files
df.write_csv("output.csv")
df.write_parquet("output.parquet")
df.write_json("output.json")
Selection and Filtering Cheat Sheet
# Column selection
df.select("name") # Single column
df.select(["name", "age"]) # Multiple columns
df.select(pl.col("^.*_id$")) # Regex pattern
df.select(pl.exclude("id")) # All except id
# Row filtering
df.filter(pl.col("age") > 25) # Simple condition
df.filter((pl.col("age") > 20) & (pl.col("age") < 60)) # Multiple conditions
df.filter(pl.col("name").str.contains("Alice")) # String operations
df.filter(pl.col("value").is_null()) # Null valuesPythonTransformation Cheat Sheet
# Adding/modifying columns
df.with_columns([
pl.col("age").alias("years"), # Rename
(pl.col("salary") * 1.1).alias("new_salary"), # Calculate
pl.lit("constant").alias("category"), # Add constant
pl.col("name").str.len().alias("name_length") # String operation
])
# Conditional logic
df.with_columns([
pl.when(pl.col("age") > 30)
.then(pl.lit("Senior"))
.otherwise(pl.lit("Junior"))
.alias("category")
])
Aggregation Quick Reference
# Basic aggregations
df.select([
    # Aliases keep the aggregates from colliding on the same output name
    pl.col("value").sum().alias("sum"),
    pl.col("value").mean().alias("mean"),
    pl.col("value").count().alias("count"),
    pl.col("value").min().alias("min"),
    pl.col("value").max().alias("max"),
    pl.col("value").std().alias("std")
])
# Group by operations
df.group_by("category").agg([
pl.col("value").sum().alias("total"),
pl.col("value").mean().alias("average"),
pl.col("id").count().alias("count")
])
Join Operations Quick Reference
# Join types
df1.join(df2, on="key", how="inner") # Inner join
df1.join(df2, on="key", how="left") # Left join
df1.join(df2, on="key", how="outer") # Outer join
df1.join(df2, on="key", how="cross") # Cross join
# Different column names
df1.join(df2, left_on="id", right_on="user_id", how="left")
Data Types Quick Reference
# Numeric types
pl.Int8, pl.Int16, pl.Int32, pl.Int64 # Signed integers
pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64 # Unsigned integers
pl.Float32, pl.Float64 # Floating point
# Text and categorical
pl.Utf8 # String
pl.Categorical # Categories
pl.Binary # Binary data
# Temporal
pl.Date # Date only
pl.Datetime # Date and time
pl.Time # Time only
pl.Duration # Time intervals
# Complex types
pl.List(pl.Int32) # List of integers
pl.Struct([pl.Field("name", pl.Utf8)]) # Structured data
# Special
pl.Boolean # True/False
pl.Null # Null type
Common Expressions
# Column operations
pl.col("name") # Column reference
pl.all() # All columns
pl.exclude("id") # All except specified
pl.lit(42) # Literal value
# String operations
pl.col("text").str.len() # String length
pl.col("text").str.contains("pattern") # Contains substring
pl.col("text").str.to_uppercase() # Uppercase
pl.col("text").str.replace("old", "new") # Replace text
# Date operations
pl.col("date").dt.year() # Extract year
pl.col("date").dt.month() # Extract month
pl.col("date").dt.day() # Extract day
pl.col("date").dt.weekday() # Day of week
# Mathematical operations
pl.col("value").abs() # Absolute value
pl.col("value").sqrt() # Square root
pl.col("value").round(2) # Round to 2 decimals
pl.col("value").rank() # Rank valuesPythonPerformance Tips Summary
# 1. Use lazy evaluation for large datasets
df = pl.scan_csv("large_file.csv").filter(...).collect()
# 2. Use appropriate data types
df = df.with_columns([
pl.col("small_numbers").cast(pl.Int16),
pl.col("categories").cast(pl.Categorical)
])
# 3. Filter early in lazy queries
df = (
pl.scan_csv("data.csv")
.filter(pl.col("date") > "2023-01-01") # Early filtering
.select(["id", "value"]) # Column pruning
.collect()
)
# 4. Use streaming for very large datasets
df = lazy_query.collect(streaming=True)
Appendix B: Troubleshooting Common Issues
Installation Issues
Problem: ImportError: No module named 'polars'
# Solution: Check Python environment
python -m pip install polars
# or
conda install -c conda-forge polars
Problem: Version conflicts
# Solution: Use virtual environment
python -m venv polars_env
source polars_env/bin/activate # Linux/Mac
# or
polars_env\Scripts\activate # Windows
pip install polars[all]
Data Type Issues
Problem: Unexpected data types after loading
# Check data types
print(df.dtypes)
# Fix with explicit casting
df = df.with_columns([
pl.col("number_col").cast(pl.Int64),
pl.col("date_col").str.strptime(pl.Date, "%Y-%m-%d")
])
Problem: Memory usage too high
# Use more efficient data types
df = df.with_columns([
pl.col("small_int").cast(pl.Int16), # Instead of Int64
pl.col("category").cast(pl.Categorical), # Instead of Utf8
pl.col("flag").cast(pl.Boolean) # Instead of string
])
Performance Issues
Problem: Slow operations on large datasets
# Solution 1: Use lazy evaluation
result = (
pl.scan_csv("large_file.csv")
.filter(pl.col("important") == True)
.select(["id", "value"])
.collect()
)
# Solution 2: Use streaming
result = lazy_query.collect(streaming=True)
# Solution 3: Limit parallelism with POLARS_MAX_THREADS (set before importing polars);
# there is no pl.Config option for the thread pool size
Problem: Out of memory errors
# Solution: Process in chunks or use streaming
def process_in_chunks(file_path, chunk_size=10000):
    reader = pl.read_csv_batched(file_path, batch_size=chunk_size)
    partial_results = []
    # next_batches() returns None once the file is exhausted
    while (batches := reader.next_batches(1)) is not None:
        for chunk_df in batches:
            partial_results.append(
                chunk_df.group_by("category").agg(pl.col("value").sum())
            )
    # Combine the per-chunk partial sums into final totals
    return pl.concat(partial_results).group_by("category").agg(pl.col("value").sum())
Common Error Messages
Error: ShapeError: Could not add column
# Problem: Column length mismatch
# Solution: Ensure all columns have the same length
df = pl.DataFrame({
"a": [1, 2, 3],
"b": [4, 5, 6], # Same length as 'a'
})
Error: ComputeError: Could not parse
# Problem: Date/time parsing issues
# Solution: Specify format or use strict=False
df = df.with_columns([
pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d", strict=False)
])
Error: ColumnNotFoundError
# Problem: Column name doesn't exist
# Solution: Check column names
print(df.columns) # List all columns
# or use regex for flexible matching
df.select(pl.col("^.*name.*$")) # Columns containing "name"
Appendix C: Performance Benchmarks
Polars vs Pandas Comparison
Based on common operations with 1M rows:
| Operation | Pandas Time | Polars Time | Speedup |
|---|---|---|---|
| GroupBy Sum | 145ms | 12ms | 12.1x |
| Filter + Select | 89ms | 8ms | 11.1x |
| Join Operations | 267ms | 23ms | 11.6x |
| String Operations | 156ms | 19ms | 8.2x |
| Date Parsing | 445ms | 67ms | 6.6x |
| Aggregations | 78ms | 9ms | 8.7x |
Memory Usage Comparison
| Data Type | Pandas Memory | Polars Memory | Reduction |
|---|---|---|---|
| 1M integers | 7.6 MB | 3.8 MB | 50% |
| 1M strings | 54.2 MB | 38.1 MB | 30% |
| 1M categories | 54.2 MB | 8.4 MB | 84% |
| 1M dates | 15.3 MB | 7.6 MB | 50% |
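Exact savings depend on your data, so it is worth measuring on your own tables. The following is a minimal sketch (the column names are made up for illustration) that uses DataFrame.estimated_size to compare memory before and after switching a repeated-string column to Categorical:
import polars as pl

df = pl.DataFrame({
    "category": ["electronics", "clothing", "groceries"] * 100_000,
    "value": list(range(300_000))
})

# Estimated in-memory size in megabytes before optimization
before_mb = df.estimated_size("mb")

# Re-encode the repeated strings as a dictionary-backed Categorical column
optimized = df.with_columns(pl.col("category").cast(pl.Categorical))
after_mb = optimized.estimated_size("mb")

print(f"Before: {before_mb:.1f} MB, after: {after_mb:.1f} MB")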
Appendix D: Resources and Further Learning
Official Resources
- Documentation: https://pola-rs.github.io/polars/
- GitHub Repository: https://github.com/pola-rs/polars
- User Guide: https://pola-rs.github.io/polars-book/
- API Reference: https://pola-rs.github.io/polars/py-polars/html/
Community and Support
- Discord Server: Join the Polars community for real-time help
- Stack Overflow: Tag questions with python-polars
- GitHub Issues: Report bugs and request features
- Reddit: r/PolarsDataFrame community discussions
Related Libraries and Tools
- PyArrow: Arrow format integration
- DuckDB: SQL interface for Polars (see the interop sketch after this list)
- Connectorx: Fast database connectors
- Great Expectations: Data quality validation
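As a quick illustration of the first two items, here is a minimal sketch, assuming duckdb and pyarrow are installed, that converts a Polars DataFrame to and from Arrow and runs a DuckDB SQL query directly against it:
import duckdb
import polars as pl

df = pl.DataFrame({"category": ["A", "B", "A"], "value": [10, 20, 30]})

# PyArrow: convert to an Arrow table and back
arrow_table = df.to_arrow()
roundtripped = pl.from_arrow(arrow_table)

# DuckDB: the engine can see the local `df` variable via replacement scans;
# .pl() converts the query result back into a Polars DataFrame
totals = duckdb.sql("SELECT category, SUM(value) AS total FROM df GROUP BY category").pl()
print(totals)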
Learning Path Recommendations
Beginner Path:
- Master DataFrame creation and basic operations
- Learn filtering and selection patterns
- Practice with real datasets
- Understand lazy evaluation benefits
Intermediate Path:
- Complex aggregations and window functions
- Join operations and data combination
- Time series analysis
- Performance optimization techniques
Advanced Path:
- Custom functions and expressions
- Nested data manipulation
- Production deployment patterns
- Integration with other tools
Sample Datasets for Practice
# Built-in sample data generators
def create_sales_data(n_rows=10000):
import random
random.seed(42)
return pl.DataFrame({
"date": pl.date_range(date(2020, 1, 1), date(2023, 12, 31), "1d")[:n_rows],
"product": [f"Product_{i%100}" for i in range(n_rows)],
"sales": [random.randint(10, 1000) for _ in range(n_rows)],
"region": [random.choice(["North", "South", "East", "West"]) for _ in range(n_rows)]
})
def create_sensor_data(n_rows=50000):
import random
random.seed(42)
return pl.DataFrame({
"timestamp": pl.datetime_range(
datetime(2023, 1, 1),
datetime(2023, 12, 31),
"1h"
)[:n_rows],
"sensor_id": [f"SENSOR_{i%20:03d}" for i in range(n_rows)],
"temperature": [random.gauss(20, 5) for _ in range(n_rows)],
"humidity": [random.gauss(50, 15) for _ in range(n_rows)],
"pressure": [random.gauss(1013, 10) for _ in range(n_rows)]
})
# Usage
sales_df = create_sales_data(10000)
sensor_df = create_sensor_data(5000)
Final Words
This comprehensive guide covers Polars from beginner to expert level, providing practical examples, performance optimization techniques, and production-ready patterns. The combination of detailed explanations, code examples, and visual diagrams using Mermaid makes it a complete resource for mastering Python Polars.
Key Takeaways
- Polars is Revolutionary: Built for modern data processing with speed and efficiency
- Lazy Evaluation: The key to performance optimization and memory efficiency
- Expression System: Powerful, composable operations that are both readable and fast
- Production Ready: Robust features for real-world data processing pipelines
- Growing Ecosystem: Active community and continuous development
Your Journey Forward
- Practice Regularly: Apply concepts with real datasets
- Join the Community: Engage with other Polars users
- Stay Updated: Follow Polars development for new features
- Share Knowledge: Contribute to the growing Polars ecosystem
Happy data processing with Polars! 🚀