Table of Contents
- Introduction to Pandas
- Installation
- Data Structures
- Creating DataFrames
- Reading and Writing Data
- Data Inspection
- Data Selection and Indexing
- Data Filtering
- Data Manipulation
- Handling Missing Data
- Data Aggregation and Grouping
- Merging and Joining
- Time Series Analysis
- Data Visualization
- Performance Tips
1. Introduction to Pandas
Pandas is a powerful, open-source data analysis and manipulation library for Python. It provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
Key Features:
- Fast and efficient DataFrame object for data manipulation
- Tools for reading and writing data between in-memory data structures and different formats
- Intelligent data alignment and integrated handling of missing data (see the short example after this list)
- Flexible reshaping and pivoting of datasets
- Powerful group by functionality
- Time series functionality
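As a quick illustration of the index alignment and missing-data handling mentioned above, here is a minimal sketch with two toy Series (values chosen purely for illustration):
import pandas as pd
# Values are matched by index label; labels present in only one Series produce NaN
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print(s1 + s2)   # a: NaN, b: 12.0, c: 23.0, d: NaN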
2. Installation
# Using pip
pip install pandas
# Using conda
conda install pandas
# Install with additional dependencies
pip install pandas numpy matplotlib
Import pandas:
import pandas as pd
import numpy as np
3. Data Structures
Pandas has two primary data structures:
3.1 Series
A one-dimensional labeled array capable of holding any data type.
# Creating a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
# Series with custom index
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
# From dictionary
data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)
3.2 DataFrame
A two-dimensional labeled data structure with columns of potentially different types.
# Creating a DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': ['a', 'b', 'c', 'd'],
'C': [1.5, 2.5, 3.5, 4.5]
})
print(df)
4. Creating DataFrames
4.1 From Dictionary
# Dictionary of lists
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
# Dictionary of Series
data = {
'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
}
df = pd.DataFrame(data)
4.2 From List of Dictionaries
data = [
{'a': 1, 'b': 2, 'c': 3},
{'a': 10, 'b': 20, 'c': 30},
{'a': 100, 'b': 200, 'c': 300}
]
df = pd.DataFrame(data)
4.3 From NumPy Array
# Create from numpy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
5. Reading and Writing Data
5.1 Reading Data
# CSV files
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', sep=';', encoding='utf-8')
# Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON files
df = pd.read_json('data.json')
# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
# HTML tables
df = pd.read_html('https://example.com/table.html')[0]
# Clipboard
df = pd.read_clipboard()
# With specific parameters
df = pd.read_csv('data.csv',
header=0, # Row to use as header
index_col=0, # Column to use as index
na_values=['NA'], # Values to recognize as NA
parse_dates=['date'], # Parse date columns
nrows=1000) # Number of rows to read
5.2 Writing Data
# CSV
df.to_csv('output.csv', index=False)
# Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
# JSON
df.to_json('output.json', orient='records')
# SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)
# HTML
df.to_html('output.html')
# Clipboard
df.to_clipboard()
6. Data Inspection
# View first/last rows
df.head() # First 5 rows
df.head(10) # First 10 rows
df.tail() # Last 5 rows
# Shape and size
df.shape # (rows, columns)
df.size # Total number of elements
len(df) # Number of rows
# Column information
df.columns # Column names
df.dtypes # Data types of columns
df.info() # Detailed information
# Statistical summary
df.describe() # For numerical columns
df.describe(include='all') # For all columns
# Unique values
df['column'].unique() # Unique values
df['column'].nunique() # Count of unique values
df['column'].value_counts() # Frequency of values
# Check for missing values
df.isnull().sum()
df.isna().sum()
# Memory usage
df.memory_usage(deep=True)
7. Data Selection and Indexing
7.1 Selecting Columns
# Single column (returns Series)
df['Name']
df.Name # Dot notation (only for valid Python identifiers)
# Multiple columns (returns DataFrame)
df[['Name', 'Age']]
# Select columns by data type
df.select_dtypes(include=['int64', 'float64'])
df.select_dtypes(exclude=['object'])
7.2 Selecting Rows
# By position (iloc)
df.iloc[0] # First row
df.iloc[0:5] # First 5 rows
df.iloc[:, 0] # First column
df.iloc[0:5, 0:3] # First 5 rows, first 3 columns
# By label (loc)
df.loc[0] # Row with index 0
df.loc[0:5] # Rows 0 to 5 (inclusive)
df.loc[:, 'Name'] # All rows, 'Name' column
df.loc[0:5, ['Name', 'Age']] # Specific rows and columns
# Selecting specific cells
df.at[0, 'Name'] # Fast access to scalar
df.iat[0, 1] # Fast integer-based access
7.3 Boolean Indexing
# Single condition
df[df['Age'] > 30]
# Multiple conditions
df[(df['Age'] > 25) & (df['City'] == 'New York')]
df[(df['Age'] < 25) | (df['Age'] > 35)]
# Using isin()
df[df['City'].isin(['New York', 'Paris'])]
# String methods
df[df['Name'].str.contains('John')]
df[df['Name'].str.startswith('A')]
8. Data Filtering
# Filter by condition
filtered = df[df['Age'] > 30]
# Filter using query() method
filtered = df.query('Age > 30 and City == "New York"')
# Filter with multiple conditions
filtered = df[(df['Age'] > 25) & (df['Age'] < 35)]
# Filter rows with non-null values
filtered = df[df['column'].notna()]
# Filter based on string patterns
filtered = df[df['Name'].str.match(r'^[A-M]')]
# Remove duplicates
df_unique = df.drop_duplicates()
df_unique = df.drop_duplicates(subset=['Name'])
df_unique = df.drop_duplicates(keep='last')
# Sample data
sample = df.sample(n=100) # 100 random rows
sample = df.sample(frac=0.1) # 10% of rows
9. Data Manipulation
9.1 Adding/Removing Columns
# Add new column
df['NewColumn'] = df['Age'] * 2
df['Category'] = 'A' # Constant value
# Add computed column
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Old')
# Insert column at specific position
df.insert(1, 'NewCol', [1, 2, 3, 4])
# Remove columns
df.drop('NewColumn', axis=1, inplace=True)
df.drop(['Col1', 'Col2'], axis=1, inplace=True)
# Remove columns using del
del df['NewColumn']
9.2 Adding/Removing Rows
# Add row using loc
df.loc[len(df)] = ['Tom', 45, 'Tokyo']
# Add row using append (deprecated in newer versions)
new_row = pd.DataFrame({'Name': ['Tom'], 'Age': [45], 'City': ['Tokyo']})
df = pd.concat([df, new_row], ignore_index=True)
# Remove rows by index
df.drop(0, inplace=True)
df.drop([0, 1, 2], inplace=True)
# Remove rows by condition
df = df[df['Age'] >= 25]
9.3 Renaming
# Rename columns
df.rename(columns={'Name': 'FullName', 'Age': 'Years'}, inplace=True)
# Rename all columns
df.columns = ['col1', 'col2', 'col3']
# Rename index
df.rename(index={0: 'first', 1: 'second'}, inplace=True)
# Reset index
df.reset_index(drop=True, inplace=True)
# Set column as index
df.set_index('Name', inplace=True)
9.4 Sorting
# Sort by column
df.sort_values('Age')
df.sort_values('Age', ascending=False)
# Sort by multiple columns
df.sort_values(['City', 'Age'], ascending=[True, False])
# Sort by index
df.sort_index()
df.sort_index(ascending=False)
9.5 Apply Functions
# Apply function to column
df['Age_Squared'] = df['Age'].apply(lambda x: x**2)
# Apply function to DataFrame
df['Age_Category'] = df.apply(
lambda row: 'Young' if row['Age'] < 30 else 'Old',
axis=1
)
# Map values
mapping = {'New York': 'NY', 'Paris': 'PR', 'Berlin': 'BE'}
df['City_Code'] = df['City'].map(mapping)
# Replace values
df['City'] = df['City'].replace('New York', 'NYC')   # assign back rather than using inplace on a single column
df.replace({'New York': 'NYC', 'Paris': 'PRS'}, inplace=True)
9.6 String Operations
# Convert to lowercase/uppercase
df['Name_Lower'] = df['Name'].str.lower()
df['Name_Upper'] = df['Name'].str.upper()
# Strip whitespace
df['Name'] = df['Name'].str.strip()
# Split strings
df['First_Name'] = df['Name'].str.split(' ').str[0]
# Contains pattern
mask = df['Name'].str.contains('John', case=False)
# Extract pattern
df['Numbers'] = df['Text'].str.extract(r'(\d+)')
# Replace in strings
df['Name'] = df['Name'].str.replace('John', 'Jonathan')
10. Handling Missing Data
10.1 Detecting Missing Data
# Check for missing values
df.isnull()
df.isna()
df.isnull().sum() # Count per column
df.isnull().any() # Any missing values per column
# Check for non-missing values
df.notna()
df.notnull()
10.2 Handling Missing Data
# Drop rows with any missing values
df.dropna()
# Drop rows where all values are missing
df.dropna(how='all')
# Drop rows with missing values in specific columns
df.dropna(subset=['Age', 'City'])
# Drop columns with missing values
df.dropna(axis=1)
# Fill missing values with a constant
df.fillna(0)
df['Age'].fillna(df['Age'].mean())
# Forward fill (propagate last valid observation)
df.ffill() # preferred; fillna(method='ffill') / method='pad' is deprecated in pandas 2.x
# Backward fill
df.bfill() # preferred; fillna(method='bfill') / method='backfill' is deprecated in pandas 2.x
# Fill with different values per column
df.fillna({'Age': 0, 'City': 'Unknown'})
# Interpolate missing values
df['Age'].interpolate()
df.interpolate(method='linear')
11. Data Aggregation and Grouping
11.1 Basic Aggregations
# Sum, mean, median, etc.
df['Age'].sum()
df['Age'].mean()
df['Age'].median()
df['Age'].std()
df['Age'].min()
df['Age'].max()
df['Age'].count()
# Multiple aggregations at once
df['Age'].agg(['sum', 'mean', 'std'])
# Aggregation on multiple columns
df[['Age', 'Salary']].agg(['mean', 'min', 'max'])
# Different aggregations for different columns
df.agg({
'Age': ['mean', 'std'],
'Salary': ['sum', 'mean']
})
11.2 GroupBy Operations
# Group by single column
grouped = df.groupby('City')
grouped['Age'].mean()
# Group by multiple columns
df.groupby(['City', 'Gender'])['Age'].mean()
# Multiple aggregations on groups
df.groupby('City').agg({
'Age': ['mean', 'min', 'max'],
'Salary': 'sum'
})
# Custom aggregation functions
df.groupby('City')['Age'].agg(
Average='mean',
Total='sum',
Range=lambda x: x.max() - x.min()
)
# Iterate through groups
for name, group in df.groupby('City'):
print(f"City: {name}")
print(group)
# Transform (return same size as input)
df['Age_Normalized'] = df.groupby('City')['Age'].transform(
lambda x: (x - x.mean()) / x.std()
)
# Filter groups
df.groupby('City').filter(lambda x: len(x) > 2)
# Get specific group
df.groupby('City').get_group('New York')
11.3 Pivot Tables
# Create pivot table
pivot = df.pivot_table(
values='Salary',
index='City',
columns='Gender',
aggfunc='mean'
)
# Multiple aggregation functions
pivot = df.pivot_table(
values='Salary',
index='City',
columns='Gender',
aggfunc=['mean', 'sum', 'count']
)
# Multiple values
pivot = df.pivot_table(
values=['Salary', 'Age'],
index='City',
columns='Gender',
aggfunc='mean'
)
# With margins (totals)
pivot = df.pivot_table(
values='Salary',
index='City',
columns='Gender',
aggfunc='mean',
margins=True
)
11.4 Cross-tabulation
# Create cross-tabulation
crosstab = pd.crosstab(df['City'], df['Gender'])
# With values
crosstab = pd.crosstab(
df['City'],
df['Gender'],
values=df['Salary'],
aggfunc='mean'
)
# With margins
crosstab = pd.crosstab(df['City'], df['Gender'], margins=True)
# Normalize
crosstab = pd.crosstab(df['City'], df['Gender'], normalize='columns')
12. Merging and Joining
12.1 Concatenation
# Vertical concatenation (stack rows)
df_combined = pd.concat([df1, df2])
df_combined = pd.concat([df1, df2], ignore_index=True)
# Horizontal concatenation (stack columns)
df_combined = pd.concat([df1, df2], axis=1)
# With keys for multi-index
df_combined = pd.concat([df1, df2], keys=['first', 'second'])
12.2 Merging/Joining
# Inner join (default)
merged = pd.merge(df1, df2, on='key')
# Left join
merged = pd.merge(df1, df2, on='key', how='left')
# Right join
merged = pd.merge(df1, df2, on='key', how='right')
# Outer join
merged = pd.merge(df1, df2, on='key', how='outer')
# Merge on multiple columns
merged = pd.merge(df1, df2, on=['key1', 'key2'])
# Merge with different column names
merged = pd.merge(df1, df2, left_on='key1', right_on='key2')
# Merge on index
merged = pd.merge(df1, df2, left_index=True, right_index=True)
# Using join method
merged = df1.join(df2, how='left')
merged = df1.join(df2, lsuffix='_left', rsuffix='_right')
12.3 Append (Deprecated)
# Use concat instead of append
df_combined = pd.concat([df1, df2], ignore_index=True)
13. Time Series Analysis
13.1 Creating DateTime Objects
# Convert string to datetime
df['date'] = pd.to_datetime(df['date_string'])
df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')
# Create date range
dates = pd.date_range('2023-01-01', periods=100, freq='D')
dates = pd.date_range('2023-01-01', '2023-12-31', freq='W')
# Common frequencies:
# 'D' - calendar day
# 'B' - business day
# 'W' - weekly
# 'M' - month end
# 'MS' - month start
# 'Q' - quarter end
# 'Y' - year end
# 'H' - hourly
# 'T' or 'min' - minutely
# 'S' - secondly
13.2 DateTime Components
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['dayofyear'] = df['date'].dt.dayofyear
df['week'] = df['date'].dt.isocalendar().week
df['quarter'] = df['date'].dt.quarter
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
# Day name
df['day_name'] = df['date'].dt.day_name()
df['month_name'] = df['date'].dt.month_name()
# Boolean checks
df['is_month_start'] = df['date'].dt.is_month_start
df['is_month_end'] = df['date'].dt.is_month_end
df['is_quarter_start'] = df['date'].dt.is_quarter_start
13.3 Time Series Operations
# Set datetime as index
df.set_index('date', inplace=True)
# Select by date (partial string indexing via .loc on a DatetimeIndex)
df.loc['2023'] # All data from 2023
df.loc['2023-01'] # All data from January 2023
df.loc['2023-01-01':'2023-01-31'] # Date range
# Resampling (changing frequency)
df.resample('M').mean() # Monthly mean
df.resample('W').sum() # Weekly sum
df.resample('Q').agg(['mean', 'sum']) # Quarterly aggregation
# Rolling window
df['rolling_mean'] = df['value'].rolling(window=7).mean()
df['rolling_std'] = df['value'].rolling(window=7).std()
df['rolling_sum'] = df['value'].rolling(window=30).sum()
# Expanding window
df['expanding_mean'] = df['value'].expanding().mean()
# Shifting
df['previous_day'] = df['value'].shift(1) # Lag: previous period's value
df['next_day'] = df['value'].shift(-1) # Lead: next period's value
df['pct_change'] = df['value'].pct_change() # Percentage change
# Time differences
df['time_diff'] = df['date'].diff()
13.4 Time Zones
# Localize timezone-naive to timezone-aware
df['date'] = df['date'].dt.tz_localize('UTC')
# Convert to different timezone
df['date'] = df['date'].dt.tz_convert('US/Eastern')
# Remove timezone
df['date'] = df['date'].dt.tz_localize(None)
14. Data Visualization
Pandas integrates with matplotlib for quick visualizations.
import matplotlib.pyplot as plt
# Line plot
df['Age'].plot()
df.plot(x='Name', y='Age')
plt.show()
# Multiple lines
df[['Age', 'Salary']].plot()
# Bar plot
df['City'].value_counts().plot(kind='bar')
df['City'].value_counts().plot.bar()
# Horizontal bar plot
df['City'].value_counts().plot(kind='barh')
# Histogram
df['Age'].plot(kind='hist', bins=20)
df['Age'].hist(bins=20)
# Box plot
df['Age'].plot(kind='box')
df.boxplot(column='Age', by='City')
# Scatter plot
df.plot(kind='scatter', x='Age', y='Salary')
df.plot.scatter(x='Age', y='Salary', c='Salary', colormap='viridis') # c must be numeric or a list of colors
# Pie chart
df['City'].value_counts().plot(kind='pie', autopct='%1.1f%%')
# Area plot
df.plot(kind='area')
# Hexbin plot
df.plot(kind='hexbin', x='Age', y='Salary', gridsize=20)
# Density plot
df['Age'].plot(kind='density')
# Subplots
df.plot(subplots=True, layout=(2, 2), figsize=(10, 8))
# Customization
df['Age'].plot(
kind='hist',
title='Age Distribution',
xlabel='Age',
ylabel='Frequency',
color='blue',
alpha=0.7,
figsize=(10, 6)
)
plt.show()
15. Performance Tips
15.1 Memory Optimization
# Check memory usage
df.memory_usage(deep=True)
# Optimize data types
df['int_column'] = df['int_column'].astype('int32') # Instead of int64
df['cat_column'] = df['cat_column'].astype('category') # For categorical data
# Convert object to category for columns with limited unique values
for col in df.select_dtypes(include=['object']):
if df[col].nunique() / len(df) < 0.5: # Less than 50% unique
df[col] = df[col].astype('category')
15.2 Vectorization
# SLOW: Using iterrows
for index, row in df.iterrows():
df.at[index, 'result'] = row['A'] + row['B']
# FAST: Vectorized operation
df['result'] = df['A'] + df['B']
# SLOW: Using apply with lambda
df['result'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
# FAST: Direct column operations
df['result'] = df['A'] + df['B']
15.3 Efficient Reading
# Read only necessary columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2', 'col3'])
# Specify data types
dtypes = {'col1': 'int32', 'col2': 'float32', 'col3': 'category'}
df = pd.read_csv('large_file.csv', dtype=dtypes)
# Read in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
process(chunk)
# Use nrows for testing
df = pd.read_csv('large_file.csv', nrows=1000)
15.4 Query vs Boolean Indexing
# Boolean indexing
result = df[(df['A'] > 5) & (df['B'] < 10)]
# Query (often faster for complex conditions)
result = df.query('A > 5 and B < 10')
15.5 Use Built-in Functions
# SLOW: Custom function
df['is_adult'] = df['Age'].apply(lambda x: 1 if x >= 18 else 0)
# FAST: NumPy where
df['is_adult'] = np.where(df['Age'] >= 18, 1, 0)
# FAST: Built-in methods
df['sum'] = df[['A', 'B', 'C']].sum(axis=1) # Instead of apply
15.6 Avoid Chained Assignment
# BAD: Chained assignment (can cause SettingWithCopyWarning)
df[df['Age'] > 30]['City'] = 'Updated'
# GOOD: Use loc
df.loc[df['Age'] > 30, 'City'] = 'Updated'
Additional Resources
Official documentation: https://pandas.pydata.org/docs/
Useful Libraries to Use with Pandas
- NumPy: Numerical computing
- Matplotlib/Seaborn: Visualization
- Scikit-learn: Machine learning
- SQLAlchemy: Database connections
- Openpyxl/XlsxWriter: Excel file handling (see the short sketch after this list)
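To show how a couple of the libraries above plug into pandas, here is a minimal sketch; the database, table, and file names are placeholders reused from the earlier examples:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')               # SQLAlchemy manages the connection
df = pd.read_sql('SELECT * FROM table_name', engine)          # read_sql accepts the engine directly
df.to_excel('output.xlsx', engine='xlsxwriter', index=False)  # XlsxWriter used as the Excel writer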
Best Practices
- Always make copies when modifying DataFrames: df_copy = df.copy()
- Use inplace=True cautiously; it modifies the original DataFrame
- Chain operations for readability: df.groupby('A').filter(...).sort_values(...)
- Use method chaining (backslashes or parentheses) for multi-line operations, as in the sketch after this list
- Set pd.set_option('display.max_columns', None) to see all columns
- Use df.pipe() for custom function chains
- Profile your code to identify bottlenecks
- Consider using Dask or Vaex for very large datasets
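For instance, a minimal sketch of a chained pipeline using the Age/City columns from the earlier examples, wrapped in parentheses as a common alternative to backslash continuations (assumes numpy is imported as np, as at the start of this tutorial):
result = (
    df[df['Age'] > 25]
    .assign(age_group=lambda d: np.where(d['Age'] < 40, 'young', 'senior'))  # add a derived column
    .groupby('City', as_index=False)['Age']
    .mean()                                                                  # mean age per city
    .sort_values('Age', ascending=False)
    .pipe(lambda d: d.rename(columns={'Age': 'mean_age'}))                   # custom step via pipe()
)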
Common Pitfalls
- SettingWithCopyWarning: use .loc[] or .copy() explicitly (see the short example after this list)
- Mixed data types in columns can slow operations
- Not resetting the index after filtering can cause issues
- Forgetting inplace=True or not assigning the result back
- Not handling missing data appropriately
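And a short example of the .copy() fix for the first pitfall, with column names borrowed from the earlier examples:
# A filtered slice may be a view of df; take an explicit copy before modifying it
subset = df[df['Age'] > 30].copy()
subset['City'] = 'Updated'   # modifies the copy only; the original df is untouched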
Comprehensive Pandas Cheatsheet
1. Importing and Setup
import pandas as pd
import numpy as np
# Display options
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.max_rows', 100) # Show max 100 rows
pd.set_option('display.max_colwidth', None) # Show full column width
pd.set_option('display.precision', 2) # Decimal precision
pd.set_option('display.float_format', '{:.2f}'.format) # Float format
pd.reset_option('all') # Reset all options
2. Creating Data Structures
# Series
s = pd.Series([1, 2, 3, 4, 5])
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s = pd.Series({'a': 1, 'b': 2, 'c': 3})
# DataFrame from dict
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# DataFrame from list of dicts
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
# DataFrame from numpy array
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
# Empty DataFrame
df = pd.DataFrame()
# DataFrame with index and columns
df = pd.DataFrame(data, index=idx, columns=cols)
3. Reading and Writing Files
# CSV
df = pd.read_csv('file.csv')
df = pd.read_csv('file.csv', sep=';', encoding='utf-8', header=0,
index_col=0, usecols=['col1', 'col2'], nrows=1000,
na_values=['NA', '?'], parse_dates=['date_col'])
df.to_csv('output.csv', index=False, encoding='utf-8', sep=',')
# Excel
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
df = pd.read_excel('file.xlsx', sheet_name=0) # First sheet
df = pd.read_excel('file.xlsx', sheet_name=None) # All sheets
df.to_excel('output.xlsx', sheet_name='Data', index=False)
# JSON
df = pd.read_json('file.json')
df = pd.read_json('file.json', orient='records')
df.to_json('output.json', orient='records', indent=2)
# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table', conn)
df = pd.read_sql_query('SELECT * FROM table WHERE col > 5', conn)
df = pd.read_sql_table('table_name', conn)
df.to_sql('table_name', conn, if_exists='replace', index=False)
# HTML
df = pd.read_html('url_or_file.html')[0]
df.to_html('output.html', index=False)
# Pickle (fast, pandas-specific format)
df.to_pickle('file.pkl')
df = pd.read_pickle('file.pkl')
# Parquet (efficient columnar format)
df.to_parquet('file.parquet')
df = pd.read_parquet('file.parquet')
# Clipboard
df = pd.read_clipboard()
df.to_clipboard(index=False)
4. Viewing and Inspecting Data
# View data
df.head(n) # First n rows (default 5)
df.tail(n) # Last n rows
df.sample(n) # Random n rows
df.sample(frac=0.1) # Random 10% of rows
# Shape and size
df.shape # (rows, columns)
df.size # Total elements
len(df) # Number of rows
df.ndim # Number of dimensions
# Index and columns
df.index # Row index
df.columns # Column names
df.axes # List of [index, columns]
# Data types
df.dtypes # Data type of each column
df.dtypes.value_counts() # Count of each data type
df.info() # Detailed info
df.info(memory_usage='deep') # Deep memory usage
# Statistics
df.describe() # Summary statistics for numeric columns
df.describe(include='all') # All columns
df.describe(include=['object']) # Only object columns
df.describe(exclude=['float64']) # Exclude certain types
# Counts
df.count() # Non-null values per column
df['col'].value_counts() # Frequency counts
df['col'].value_counts(normalize=True) # Proportions
df['col'].value_counts(dropna=False) # Include NaN
df['col'].nunique() # Number of unique values
df['col'].unique() # Array of unique values
# Min/Max
df.min() # Min of each column
df.max() # Max of each column
df.idxmin() # Index of min value
df.idxmax() # Index of max value
# Sum and Mean
df.sum() # Sum of each column
df.sum(axis=1) # Sum of each row
df.mean() # Mean of each column
df.median() # Median
df.mode() # Mode
df.std() # Standard deviation
df.var() # Variance
# Correlations
df.corr() # Correlation matrix
df.corr(method='pearson') # Pearson correlation
df.corr(method='spearman') # Spearman correlation
df['A'].corr(df['B']) # Correlation between two columns
df.corrwith(df['A']) # Correlation with one column
# Memory
df.memory_usage() # Memory usage per column
df.memory_usage(deep=True) # Deep memory usage
5. Selection and Indexing
# Column selection
df['col'] # Single column (Series)
df[['col']] # Single column (DataFrame)
df[['col1', 'col2']] # Multiple columns
df.col # Column by attribute (if valid name)
# Row selection by position (iloc)
df.iloc[0] # First row
df.iloc[-1] # Last row
df.iloc[0:5] # First 5 rows
df.iloc[[0, 2, 4]] # Specific rows
df.iloc[:, 0] # First column
df.iloc[:, 0:3] # First 3 columns
df.iloc[0:5, 0:3] # Rows 0-4, columns 0-2
# Row selection by label (loc)
df.loc[0] # Row with index label 0
df.loc[0:5] # Rows 0 to 5 (inclusive!)
df.loc[:, 'A'] # Column 'A'
df.loc[:, 'A':'C'] # Columns A to C (inclusive)
df.loc[0:5, ['A', 'C']] # Specific rows and columns
df.loc[df['A'] > 5, 'B'] # Conditional selection
# Scalar access
df.at[0, 'A'] # Fast scalar access by label
df.iat[0, 1] # Fast scalar access by position
# Boolean indexing
df[df['A'] > 5] # Rows where A > 5
df[df['A'].isin([1, 2, 3])] # Rows where A is in list
df[df['A'].between(1, 5)] # Rows where A is between 1 and 5
df[~df['A'].isin([1, 2])] # NOT in list
df[(df['A'] > 5) & (df['B'] < 10)] # Multiple conditions (AND)
df[(df['A'] > 5) | (df['B'] < 10)] # Multiple conditions (OR)
# Query method
df.query('A > 5')
df.query('A > 5 and B < 10')
df.query('A > @threshold') # Use variables with @
# Select by data type
df.select_dtypes(include=['float64', 'int64'])
df.select_dtypes(include='number')
df.select_dtypes(exclude=['object'])
# Filter columns by name
df.filter(items=['A', 'B']) # Exact names
df.filter(like='_total') # Contains substring
df.filter(regex='^col') # Regex pattern
# Where
df.where(df > 0) # Keep values where condition is True
df.where(df > 0, -df) # Replace False values
df.mask(df > 0) # Opposite of where
6. Filtering and Querying
# Basic filtering
df[df['A'] > 5]
df[df['A'] == 'value']
df[df['A'] != 'value']
df[df['A'].isna()]
df[df['A'].notna()]
# Multiple conditions
df[(df['A'] > 5) & (df['B'] < 10)]
df[(df['A'] > 5) | (df['B'] < 10)]
df[~(df['A'] > 5)] # NOT
# isin() method
df[df['A'].isin([1, 2, 3])]
df[df['A'].isin(other_df['B'])]
# String methods
df[df['col'].str.contains('pattern')]
df[df['col'].str.startswith('A')]
df[df['col'].str.endswith('z')]
df[df['col'].str.match(r'^[A-Z]')]
df[df['col'].str.len() > 5]
df[df['col'].str.isdigit()]
df[df['col'].str.isalpha()]
# between() method
df[df['A'].between(1, 10)]
df[df['A'].between(1, 10, inclusive='neither')]
# query() method
df.query('A > 5')
df.query('A > 5 and B < 10')
df.query('A in [1, 2, 3]')
df.query('col.str.contains("pattern")', engine='python')
# nlargest/nsmallest
df.nlargest(10, 'A') # 10 largest values in column A
df.nsmallest(10, 'A') # 10 smallest values
df.nlargest(5, ['A', 'B']) # By multiple columns
7. Adding, Modifying, and Removing Data
# Add columns
df['new_col'] = value # Constant value
df['new_col'] = df['A'] + df['B'] # Computed column
df['new_col'] = df['A'].apply(lambda x: x*2)
df.insert(1, 'new_col', values) # Insert at position
# Assign (method chaining friendly)
df = df.assign(new_col=df['A'] * 2)
df = df.assign(col1=lambda x: x['A'] * 2,
col2=lambda x: x['col1'] + 1)
# Modify columns
df['A'] = df['A'] * 2
df.loc[:, 'A'] = df['A'] * 2
df.loc[df['A'] > 5, 'B'] = 0
# Remove columns
df.drop('col', axis=1)
df.drop(['col1', 'col2'], axis=1)
df.drop(columns=['col1', 'col2'])
del df['col']
df.pop('col') # Returns the column
# Add rows
df.loc[len(df)] = [value1, value2, value3]
df = pd.concat([df, new_row_df], ignore_index=True)
# Remove rows
df.drop(0) # Drop row with index 0
df.drop([0, 1, 2]) # Drop multiple rows
df.drop(index=[0, 1]) # Explicit index parameter
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['A', 'B'])
df.drop_duplicates(keep='first') # Keep first occurrence
df.drop_duplicates(keep='last') # Keep last occurrence
df.drop_duplicates(keep=False) # Remove all duplicates
8. Renaming and Reindexing
# Rename columns
df.rename(columns={'old': 'new'})
df.rename(columns={'A': 'col_A', 'B': 'col_B'})
df.rename(columns=str.upper) # Apply function to all
df.columns = ['A', 'B', 'C'] # Replace all column names
# Rename index
df.rename(index={0: 'first', 1: 'second'})
df.rename(index=str)
# Add prefix/suffix
df.add_prefix('col_')
df.add_suffix('_total')
# Reset index
df.reset_index() # Move index to column
df.reset_index(drop=True) # Drop the index
# Set index
df.set_index('col')
df.set_index(['col1', 'col2']) # MultiIndex
# Reindex
df.reindex([0, 2, 4]) # New index
df.reindex(columns=['A', 'C']) # Reorder columns
df.reindex_like(other_df) # Match another DataFrame
9. Sorting
# Sort by values
df.sort_values('A')
df.sort_values('A', ascending=False)
df.sort_values(['A', 'B'])
df.sort_values(['A', 'B'], ascending=[True, False])
df.sort_values('A', na_position='first') # NaN first
# Sort by index
df.sort_index()
df.sort_index(ascending=False)
df.sort_index(axis=1) # Sort columns
# Sort by multiple columns with different orders
df.sort_values(by=['col1', 'col2'],
ascending=[True, False],
na_position='last')
10. Apply, Map, and Transform
# apply() - Apply function to each column or row
df.apply(np.sum) # Sum of each column
df.apply(np.sum, axis=1) # Sum of each row
df['A'].apply(lambda x: x * 2)
df.apply(lambda x: x.max() - x.min())
# applymap() - Apply function to each element (renamed to DataFrame.map() in pandas 2.1+)
df.applymap(lambda x: x * 2)
df.applymap(str)
# map() - Map values (Series only)
df['A'].map({1: 'one', 2: 'two', 3: 'three'})
df['A'].map(lambda x: x * 2)
# replace() - Replace values
df.replace(0, np.nan)
df.replace([0, 1], [100, 200])
df.replace({0: 100, 1: 200})
df.replace({'A': {0: 100}, 'B': {1: 200}})
df['A'].replace(0, 100)
# transform() - Transform values
df.transform(lambda x: x * 2)
df.transform([np.sqrt, np.square])
df.groupby('A').transform('mean')
# pipe() - Chain custom functions
df.pipe(func1).pipe(func2).pipe(func3)
11. String Operations
# Case conversion
df['col'].str.lower()
df['col'].str.upper()
df['col'].str.title()
df['col'].str.capitalize()
df['col'].str.swapcase()
# Strip whitespace
df['col'].str.strip()
df['col'].str.lstrip()
df['col'].str.rstrip()
df['col'].str.strip('chars')
# Split and join
df['col'].str.split()
df['col'].str.split(',')
df['col'].str.split(',', expand=True) # Create new columns
df['col'].str.rsplit(',', n=1) # Split from right
df['col'].str.join('-')
# Substring operations
df['col'].str[:5] # First 5 characters
df['col'].str[5:] # From 5th character
df['col'].str[2:5] # Characters 2-4
df['col'].str.slice(0, 5)
df['col'].str.slice(start=2, stop=5, step=1)
# Replace and remove
df['col'].str.replace('old', 'new')
df['col'].str.replace(r'\d+', 'NUM', regex=True)
df['col'].str.removeprefix('prefix')
df['col'].str.removesuffix('suffix')
# Find and count
df['col'].str.contains('pattern')
df['col'].str.contains('pattern', case=False)
df['col'].str.contains(r'\d+', regex=True)
df['col'].str.startswith('A')
df['col'].str.endswith('z')
df['col'].str.find('pattern')
df['col'].str.count('pattern')
# Extract
df['col'].str.extract(r'(\d+)') # First group
df['col'].str.extract(r'(\d+)-(\w+)') # Multiple groups
df['col'].str.extractall(r'(\d+)') # All matches
df['col'].str.findall(r'\d+')
# Length and checking
df['col'].str.len()
df['col'].str.isdigit()
df['col'].str.isalpha()
df['col'].str.isalnum()
df['col'].str.isspace()
df['col'].str.islower()
df['col'].str.isupper()
df['col'].str.istitle()
# Padding
df['col'].str.pad(10, side='left', fillchar='0')
df['col'].str.zfill(10) # Zero-pad
df['col'].str.center(10, fillchar='*')
df['col'].str.ljust(10)
df['col'].str.rjust(10)
# Get specific element from split
df['col'].str.split('-').str[0]
df['col'].str.split('-').str.get(0)
12. Missing Data
# Detect missing data
df.isnull() # Boolean DataFrame
df.isna() # Same as isnull()
df.notna() # Opposite
df.isnull().sum() # Count per column
df.isnull().sum().sum() # Total count
df.isnull().any() # Any missing per column
df.isnull().all() # All missing per column
# Drop missing data
df.dropna() # Drop rows with any NaN
df.dropna(axis=1) # Drop columns with any NaN
df.dropna(how='all') # Drop if all values are NaN
df.dropna(thresh=2) # Drop if less than 2 non-NaN
df.dropna(subset=['A', 'B']) # Drop based on specific columns
# Fill missing data
df.fillna(0) # Fill with constant
df.fillna({'A': 0, 'B': 1}) # Different values per column
df.ffill() # Forward fill (fillna(method='ffill') is deprecated in pandas 2.x)
df.bfill() # Backward fill
df.fillna(df.mean()) # Fill with mean
df['A'].fillna(df['A'].mean()) # Fill column with its mean
df.ffill(limit=2) # Limit number of consecutive fills
# Interpolate
df.interpolate() # Linear interpolation
df.interpolate(method='polynomial', order=2)
df.interpolate(method='time') # For time series
df.interpolate(limit=2) # Limit consecutive fills
df.interpolate(limit_direction='forward')
# Replace NaN
df.replace(np.nan, 0)
df.where(pd.notna(df), 0)
13. Aggregation and Statistics
# Basic statistics
df.sum() # Sum
df.sum(axis=1) # Row-wise sum
df.mean() # Mean
df.median() # Median
df.mode() # Mode
df.std() # Standard deviation
df.var() # Variance
df.min() # Minimum
df.max() # Maximum
df.abs() # Absolute value
df.prod() # Product
df.cumsum() # Cumulative sum
df.cumprod() # Cumulative product
df.cummin() # Cumulative minimum
df.cummax() # Cumulative maximum
# Quantiles
df.quantile(0.25) # 25th percentile
df.quantile([0.25, 0.5, 0.75]) # Multiple quantiles
# Ranking
df.rank() # Rank values
df.rank(ascending=False)
df.rank(method='first') # first, average, min, max, dense
# Aggregation with agg()
df.agg(['sum', 'mean', 'std'])
df.agg({'A': ['sum', 'mean'], 'B': ['min', 'max']})
df.agg({'A': 'sum', 'B': 'mean'})
df['A'].agg(['sum', 'mean', lambda x: x.max() - x.min()])
# Custom aggregation
df.agg(custom_func)
df.agg({'A': custom_func, 'B': 'mean'})
14. GroupBy Operations
# Basic groupby
grouped = df.groupby('A')
grouped = df.groupby(['A', 'B']) # Multiple columns
grouped = df.groupby(df['A'] > 5) # By condition
# Aggregation after groupby
df.groupby('A').sum()
df.groupby('A').mean()
df.groupby('A')['B'].sum() # Specific column
df.groupby('A')[['B', 'C']].sum() # Multiple columns
# Multiple aggregations
df.groupby('A').agg(['sum', 'mean', 'std'])
df.groupby('A').agg({'B': 'sum', 'C': 'mean'})
df.groupby('A').agg({'B': ['sum', 'mean'], 'C': 'max'})
# Named aggregations
df.groupby('A').agg(
total=('B', 'sum'),
average=('B', 'mean'),
max_value=('C', 'max')
)
# Transform
df.groupby('A').transform('mean') # Broadcast to original shape
df['B_normalized'] = df.groupby('A')['B'].transform(
lambda x: (x - x.mean()) / x.std()
)
# Filter groups
df.groupby('A').filter(lambda x: len(x) > 2)
df.groupby('A').filter(lambda x: x['B'].sum() > 100)
# Apply custom function
df.groupby('A').apply(lambda x: x.max() - x.min())
# Iterate through groups
for name, group in df.groupby('A'):
print(f"Group: {name}")
print(group)
# Get specific group
df.groupby('A').get_group('value')
# Group by multiple with different operations
df.groupby(['A', 'B']).agg({
'C': ['sum', 'mean'],
'D': 'max'
}).reset_index()
# Size and count
df.groupby('A').size() # Size of each group
df.groupby('A').count() # Count non-null values
df.groupby('A')['B'].count()
# First and last
df.groupby('A').first()
df.groupby('A').last()
df.groupby('A').head(2) # First 2 of each group
df.groupby('A').tail(2) # Last 2 of each group
15. Pivot Tables and Reshaping
# Pivot table
pd.pivot_table(df, values='D', index='A', columns='B')
pd.pivot_table(df, values='D', index='A', columns='B', aggfunc='sum')
pd.pivot_table(df, values='D', index=['A', 'B'], columns='C')
pd.pivot_table(df, values=['D', 'E'], index='A', columns='B')
pd.pivot_table(df, index='A', columns='B', aggfunc={'D': 'sum', 'E': 'mean'})
pd.pivot_table(df, index='A', columns='B', margins=True) # Add totals
# Pivot (no aggregation)
df.pivot(index='A', columns='B', values='C')
# Melt (unpivot)
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
pd.melt(df, id_vars=['A'], var_name='variable', value_name='value')
# Stack and unstack
df.stack() # Pivot columns to rows
df.unstack() # Pivot rows to columns
df.unstack(level=0) # Unstack specific level
df.stack().unstack() # Round trip
# Wide to long
pd.wide_to_long(df, stubnames='col', i='id', j='time')
# Transpose
df.T # Transpose (swap rows and columns)
df.transpose()
# Crosstab
pd.crosstab(df['A'], df['B'])
pd.crosstab(df['A'], df['B'], normalize='columns')
pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='mean')
pd.crosstab(df['A'], df['B'], margins=True)
16. Merging, Joining, and Concatenating
# Concatenate
pd.concat([df1, df2]) # Vertical (rows)
pd.concat([df1, df2], axis=1) # Horizontal (columns)
pd.concat([df1, df2], ignore_index=True) # Reset index
pd.concat([df1, df2], keys=['first', 'second']) # Add keys
# Merge (like SQL join)
pd.merge(df1, df2, on='key') # Inner join
pd.merge(df1, df2, on='key', how='left') # Left join
pd.merge(df1, df2, on='key', how='right') # Right join
pd.merge(df1, df2, on='key', how='outer') # Outer join
pd.merge(df1, df2, left_on='key1', right_on='key2') # Different names
pd.merge(df1, df2, on=['key1', 'key2']) # Multiple keys
pd.merge(df1, df2, left_index=True, right_index=True) # On index
pd.merge(df1, df2, on='key', suffixes=('_left', '_right')) # Suffix
pd.merge(df1, df2, on='key', indicator=True) # Add merge indicator
# Join (merge on index)
df1.join(df2) # Left join on index
df1.join(df2, how='inner')
df1.join(df2, on='key')
df1.join(df2, lsuffix='_l', rsuffix='_r')
# Combine first (update missing values)
df1.combine_first(df2)
# Update
df1.update(df2)
17. Time Series
# Create datetime
pd.to_datetime('2023-01-01')
pd.to_datetime(['2023-01-01', '2023-01-02'])
pd.to_datetime(df['date_col'], format='%Y-%m-%d')
pd.to_datetime(df['date_col'], errors='coerce') # Invalid to NaT
# Date range
pd.date_range('2023-01-01', periods=10)
pd.date_range('2023-01-01', '2023-12-31')
pd.date_range('2023-01-01', periods=10, freq='D') # Daily
pd.date_range('2023-01-01', periods=10, freq='B') # Business days
pd.date_range('2023-01-01', periods=10, freq='W') # Weekly
pd.date_range('2023-01-01', periods=10, freq='M') # Month end
pd.date_range('2023-01-01', periods=10, freq='MS') # Month start
pd.date_range('2023-01-01', periods=10, freq='H') # Hourly
# Time delta
pd.Timedelta('1 day')
pd.Timedelta(days=1, hours=2, minutes=30)
pd.timedelta_range('1 day', periods=10)
# Extract datetime components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
df['second'] = df['date'].dt.second
df['dayofweek'] = df['date'].dt.dayofweek # Monday=0
df['dayofyear'] = df['date'].dt.dayofyear
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['day_name'] = df['date'].dt.day_name()
df['month_name'] = df['date'].dt.month_name()
# Boolean datetime checks
df['is_month_start'] = df['date'].dt.is_month_start
df['is_month_end'] = df['date'].dt.is_month_end
df['is_quarter_start'] = df['date'].dt.is_quarter_start
df['is_quarter_end'] = df['date'].dt.is_quarter_end
df['is_year_start'] = df['date'].dt.is_year_start
df['is_year_end'] = df['date'].dt.is_year_end
df['is_leap_year'] = df['date'].dt.is_leap_year
# Date offsets
df['date'] + pd.DateOffset(days=1)
df['date'] + pd.DateOffset(months=1)
df['date'] + pd.offsets.BusinessDay(5)
# Set datetime index
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index)
# Select by date (partial string indexing via .loc on a DatetimeIndex)
df.loc['2023'] # All of 2023
df.loc['2023-01'] # January 2023
df.loc['2023-01-15'] # Specific day
df.loc['2023-01':'2023-03'] # Date range
# Resample (change frequency)
df.resample('D').mean() # Daily mean
df.resample('W').sum() # Weekly sum
df.resample('M').last() # Monthly last value
df.resample('Q').mean() # Quarterly mean
df.resample('Y').sum() # Yearly sum
df.resample('2D').mean() # Every 2 days
df.resample('H').ffill() # Hourly with forward fill
# Rolling window
df.rolling(window=7).mean() # 7-period moving average
df.rolling(window=7).sum()
df.rolling(window=7).std()
df.rolling(window='7D').mean() # 7-day window (time-based)
df.rolling(window=7, min_periods=1).mean() # Allow fewer periods
# Expanding window
df.expanding().mean() # Cumulative mean
df.expanding(min_periods=3).sum()
# Shifting
df.shift(1) # Shift forward by 1
df.shift(-1) # Shift backward by 1
df.shift(freq='D') # Shift by time frequency
# (tshift() was removed in pandas 2.0; use shift(freq=...) as above to shift the index)
# Lagging
df['lag1'] = df['value'].shift(1)
df['lag7'] = df['value'].shift(7)
# Percentage change
df.pct_change() # Period-over-period change
df.pct_change(periods=7) # 7-period change
# Timezone
df['date'].dt.tz_localize('UTC') # Make timezone-aware
df['date'].dt.tz_convert('US/Eastern') # Convert timezone
df['date'].dt.tz_localize(None) # Remove timezone
18. Categorical Data
# Create categorical
df['category'] = pd.Categorical(df['category'])
df['category'] = df['category'].astype('category')
# Create with specific categories
cat = pd.Categorical(['a', 'b', 'c'], categories=['a', 'b', 'c', 'd'])
# Ordered categorical
cat = pd.Categorical(['low', 'medium', 'high'],
categories=['low', 'medium', 'high'],
ordered=True)
# Get categories
df['category'].cat.categories
# Add categories
df['category'].cat.add_categories(['new_cat'])
# Remove categories
df['category'].cat.remove_categories(['old_cat'])
# Rename categories
df['category'].cat.rename_categories({'old': 'new'})
# Reorder categories
df['category'].cat.reorder_categories(['c', 'b', 'a'])
# Set categories
df['category'].cat.set_categories(['a', 'b', 'c'])
# Get codes
df['category'].cat.codes
19. Multi-Index (Hierarchical Index)
# Create MultiIndex
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]])
idx = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
df = pd.DataFrame(data, index=idx)
# Set MultiIndex from columns
df.set_index(['col1', 'col2'])
# Select from MultiIndex
df.loc['a'] # First level
df.loc[('a', 1)] # Both levels
df.loc[('a', slice(None)), :] # All of first level
# Cross-section
df.xs('a', level=0) # All rows where level 0 is 'a'
df.xs(1, level=1) # All rows where level 1 is 1
# Stack and unstack with MultiIndex
df.stack()
df.unstack(level=0)
df.unstack(level='level_name')
# Swap levels
df.swaplevel(0, 1)
# Sort MultiIndex
df.sort_index(level=0)
df.sort_index(level=[0, 1])
# Reset MultiIndex
df.reset_index()
df.reset_index(level=0)
20. Window Functions
# Rolling window
df.rolling(window=3).mean()
df.rolling(window=3).sum()
df.rolling(window=3).std()
df.rolling(window=3).min()
df.rolling(window=3).max()
df.rolling(window=3).median()
df.rolling(window=3).count()
df.rolling(window=3).apply(custom_func)
# Expanding window
df.expanding().mean()
df.expanding(min_periods=3).sum()
# Exponentially weighted
df.ewm(span=20).mean() # EMA
df.ewm(alpha=0.5).mean()
df.ewm(halflife=10).mean()
21. Performance and Optimization
# Data types optimization
df['int_col'] = df['int_col'].astype('int32')
df['cat_col'] = df['cat_col'].astype('category')
df['sparse_col'] = df['sparse_col'].astype(pd.SparseDtype('int'))
# Chunking for large files
for chunk in pd.read_csv('large.csv', chunksize=10000):
process(chunk)
# Query instead of boolean indexing
df.query('A > 5 and B < 10') # Often faster
# Use vectorized operations
df['result'] = df['A'] + df['B'] # Not apply()
# Use numexpr-backed eval for complex expressions (df_A, df_B, df_C must be existing DataFrames/Series in scope)
pd.eval('df_A + df_B * df_C')
# Copy explicitly
df_copy = df.copy()
df_deep_copy = df.copy(deep=True)
# Memory usage
df.info(memory_usage='deep')
df.memory_usage(deep=True)
# Convert to sparse
df_sparse = df.astype(pd.SparseDtype('float', fill_value=0))
22. Advanced Operations
# explode() - Expand lists in cells
df.explode('list_column')
# eval() - Fast evaluation of expressions
df.eval('C = A + B')
df.eval('C = A + B', inplace=True)
# Binning
pd.cut(df['A'], bins=5) # Equal-width bins
pd.cut(df['A'], bins=[0, 10, 20, 30]) # Custom bins
pd.qcut(df['A'], q=4) # Equal-frequency bins (quartiles)
pd.qcut(df['A'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
# Dummy variables
pd.get_dummies(df['category'])
pd.get_dummies(df['category'], prefix='cat')
pd.get_dummies(df, columns=['cat1', 'cat2'])
pd.get_dummies(df, drop_first=True) # Avoid multicollinearity
# Ranking
df.rank()
df.rank(method='min')
df.rank(pct=True) # Percentile ranks
# Sampling
df.sample(n=100)
df.sample(frac=0.1)
df.sample(n=100, replace=True) # With replacement
df.sample(n=100, weights='weight_col') # Weighted sampling
# Clip values
df.clip(lower=0, upper=100)
df['A'].clip(lower=0)
# Factorize (convert to integers)
codes, unique = pd.factorize(df['category'])
23. Validation and Testing
# Check data types
assert df['col'].dtype == 'int64'
# Check for nulls
assert df['col'].notna().all()
# Check uniqueness
assert df['id'].is_unique
# Check value range
assert (df['age'] >= 0).all()
assert (df['age'] <= 120).all()
# Check membership
assert df['category'].isin(['A', 'B', 'C']).all()
# equals() - Check if DataFrames are equal
df1.equals(df2)
# compare() - Show differences
df1.compare(df2)
24. Working with Duplicates
# Find duplicates
df.duplicated() # Boolean Series
df.duplicated(subset=['A', 'B']) # Based on specific columns
df.duplicated(keep='first') # Mark duplicates except first
df.duplicated(keep='last') # Mark duplicates except last
df.duplicated(keep=False) # Mark all duplicates
# View duplicates
df[df.duplicated()]
df[df.duplicated(subset=['A'], keep=False)]
# Count duplicates
df.duplicated().sum()
# Drop duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['A'])
df.drop_duplicates(keep='last')
25. Useful Utilities
# Info about DataFrame
df.info()
df.describe()
df.dtypes
df.shape
df.columns
df.index
# Convert types
df.astype('float')
df.astype({'A': 'int32', 'B': 'float64'})
df['A'].astype('category')
# Copy
df.copy()
df.copy(deep=True)
# Iteration (avoid when possible)
for index, row in df.iterrows(): # Row by row
print(row['A'])
for row in df.itertuples(): # Faster than iterrows
print(row.A)
# Apply function to each element
df.applymap(func) # Entire DataFrame (DataFrame.map() in pandas 2.1+)
df['col'].apply(func) # Series
# Pipe for method chaining
df.pipe(func1).pipe(func2).pipe(func3)
# Style (for display)
df.style.highlight_max()
df.style.background_gradient()
df.style.format('{:.2f}')
# Options
pd.options.display.max_rows = 100
pd.options.display.max_columns = None
pd.set_option('display.max_rows', 100)
pd.get_option('display.max_rows')
pd.reset_option('display.max_rows')
pd.describe_option() # Show all options
Conclusion
This tutorial covers the essential and advanced features of pandas for data manipulation and analysis. Practice these concepts with real datasets to become proficient. Remember that pandas is constantly evolving, so check the official documentation for the latest features and best practices.
Happy data wrangling! 🐼