Table of Contents
- Introduction to Pandas
- Installation
- Data Structures
- Creating DataFrames
- Reading and Writing Data
- Data Inspection
- Data Selection and Indexing
- Data Filtering
- Data Manipulation
- Handling Missing Data
- Data Aggregation and Grouping
- Merging and Joining
- Time Series Analysis
- Data Visualization
- Performance Tips
1. Introduction to Pandas
Pandas is a powerful, open-source data analysis and manipulation library for Python. It provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
Key Features:
- Fast and efficient DataFrame object for data manipulation
- Tools for reading and writing data between in-memory data structures and different formats
- Intelligent data alignment and integrated handling of missing data (see the short example after this list)
- Flexible reshaping and pivoting of datasets
- Powerful group by functionality
- Time series functionality
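As a quick illustration of the index alignment and missing-data handling mentioned above, here is a minimal sketch with two toy Series (values chosen purely for illustration):
import pandas as pd
# Values are matched by index label; labels present in only one Series produce NaN
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print(s1 + s2)   # a: NaN, b: 12.0, c: 23.0, d: NaN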
2. Installation
# Using pip
pip install pandas
# Using conda
conda install pandas
# Install with additional dependencies
pip install pandas numpy matplotlib
Import pandas:
import pandas as pd
import numpy as np
3. Data Structures
Pandas has two primary data structures:
3.1 Series
A one-dimensional labeled array capable of holding any data type.
# Creating a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
# Series with custom index
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
# From dictionary
data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)
3.2 DataFrame
A two-dimensional labeled data structure with columns of potentially different types.
# Creating a DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': ['a', 'b', 'c', 'd'],
'C': [1.5, 2.5, 3.5, 4.5]
})
print(df)
4. Creating DataFrames
4.1 From Dictionary
# Dictionary of lists
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
# Dictionary of Series
data = {
'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
}
df = pd.DataFrame(data)
4.2 From List of Dictionaries
data = [
{'a': 1, 'b': 2, 'c': 3},
{'a': 10, 'b': 20, 'c': 30},
{'a': 100, 'b': 200, 'c': 300}
]
df = pd.DataFrame(data)
4.3 From NumPy Array
# Create from numpy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
5. Reading and Writing Data
5.1 Reading Data
# CSV files
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', sep=';', encoding='utf-8')
# Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON files
df = pd.read_json('data.json')
# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
# HTML tables
df = pd.read_html('https://example.com/table.html')[0]
# Clipboard
df = pd.read_clipboard()
# With specific parameters
df = pd.read_csv('data.csv',
header=0, # Row to use as header
index_col=0, # Column to use as index
na_values=['NA'], # Values to recognize as NA
parse_dates=['date'], # Parse date columns
nrows=1000) # Number of rows to read
5.2 Writing Data
# CSV
df.to_csv('output.csv', index=False)
# Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
# JSON
df.to_json('output.json', orient='records')
# SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)
# HTML
df.to_html('output.html')
# Clipboard
df.to_clipboard()
6. Data Inspection
# View first/last rows
df.head() # First 5 rows
df.head(10) # First 10 rows
df.tail() # Last 5 rows
# Shape and size
df.shape # (rows, columns)
df.size # Total number of elements
len(df) # Number of rows
# Column information
df.columns # Column names
df.dtypes # Data types of columns
df.info() # Detailed information
# Statistical summary
df.describe() # For numerical columns
df.describe(include='all') # For all columns
# Unique values
df['column'].unique() # Unique values
df['column'].nunique() # Count of unique values
df['column'].value_counts() # Frequency of values
# Check for missing values
df.isnull().sum()
df.isna().sum()
# Memory usage
df.memory_usage(deep=True)
7. Data Selection and Indexing
7.1 Selecting Columns
# Single column (returns Series)
df['Name']
df.Name # Dot notation (only for valid Python identifiers)
# Multiple columns (returns DataFrame)
df[['Name', 'Age']]
# Select columns by data type
df.select_dtypes(include=['int64', 'float64'])
df.select_dtypes(exclude=['object'])
7.2 Selecting Rows
# By position (iloc)
df.iloc[0] # First row
df.iloc[0:5] # First 5 rows
df.iloc[:, 0] # First column
df.iloc[0:5, 0:3] # First 5 rows, first 3 columns
# By label (loc)
df.loc[0] # Row with index 0
df.loc[0:5] # Rows 0 to 5 (inclusive)
df.loc[:, 'Name'] # All rows, 'Name' column
df.loc[0:5, ['Name', 'Age']] # Specific rows and columns
# Selecting specific cells
df.at[0, 'Name'] # Fast access to scalar
df.iat[0, 1] # Fast integer-based access
7.3 Boolean Indexing
# Single condition
df[df['Age'] > 30]
# Multiple conditions
df[(df['Age'] > 25) & (df['City'] == 'New York')]
df[(df['Age'] < 25) | (df['Age'] > 35)]
# Using isin()
df[df['City'].isin(['New York', 'Paris'])]
# String methods
df[df['Name'].str.contains('John')]
df[df['Name'].str.startswith('A')]
8. Data Filtering
# Filter by condition
filtered = df[df['Age'] > 30]
# Filter using query() method
filtered = df.query('Age > 30 and City == "New York"')
# Filter with multiple conditions
filtered = df[(df['Age'] > 25) & (df['Age'] < 35)]
# Filter rows with non-null values
filtered = df[df['column'].notna()]
# Filter based on string patterns
filtered = df[df['Name'].str.match(r'^[A-M]')]
# Remove duplicates
df_unique = df.drop_duplicates()
df_unique = df.drop_duplicates(subset=['Name'])
df_unique = df.drop_duplicates(keep='last')
# Sample data
sample = df.sample(n=100) # 100 random rows
sample = df.sample(frac=0.1) # 10% of rows
9. Data Manipulation
9.1 Adding/Removing Columns
# Add new column
df['NewColumn'] = df['Age'] * 2
df['Category'] = 'A' # Constant value
# Add computed column
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Old')
# Insert column at specific position
df.insert(1, 'NewCol', [1, 2, 3, 4])
# Remove columns
df.drop('NewColumn', axis=1, inplace=True)
df.drop(['Col1', 'Col2'], axis=1, inplace=True)
# Remove columns using del
del df['NewColumn']
9.2 Adding/Removing Rows
# Add row using loc
df.loc[len(df)] = ['Tom', 45, 'Tokyo']
# Add row using append (deprecated in newer versions)
new_row = pd.DataFrame({'Name': ['Tom'], 'Age': [45], 'City': ['Tokyo']})
df = pd.concat([df, new_row], ignore_index=True)
# Remove rows by index
df.drop(0, inplace=True)
df.drop([0, 1, 2], inplace=True)
# Remove rows by condition
df = df[df['Age'] >= 25]
9.3 Renaming
# Rename columns
df.rename(columns={'Name': 'FullName', 'Age': 'Years'}, inplace=True)
# Rename all columns
df.columns = ['col1', 'col2', 'col3']
# Rename index
df.rename(index={0: 'first', 1: 'second'}, inplace=True)
# Reset index
df.reset_index(drop=True, inplace=True)
# Set column as index
df.set_index('Name', inplace=True)
9.4 Sorting
# Sort by column
df.sort_values('Age')
df.sort_values('Age', ascending=False)
# Sort by multiple columns
df.sort_values(['City', 'Age'], ascending=[True, False])
# Sort by index
df.sort_index()
df.sort_index(ascending=False)
9.5 Apply Functions
# Apply function to column
df['Age_Squared'] = df['Age'].apply(lambda x: x**2)
# Apply function to DataFrame
df['Age_Category'] = df.apply(
lambda row: 'Young' if row['Age'] < 30 else 'Old',
axis=1
)
# Map values
mapping = {'New York': 'NY', 'Paris': 'PR', 'Berlin': 'BE'}
df['City_Code'] = df['City'].map(mapping)
# Replace values
df['City'] = df['City'].replace('New York', 'NYC')   # assign back rather than using inplace on a single column
df.replace({'New York': 'NYC', 'Paris': 'PRS'}, inplace=True)
9.6 String Operations
# Convert to lowercase/uppercase
df['Name_Lower'] = df['Name'].str.lower()
df['Name_Upper'] = df['Name'].str.upper()
# Strip whitespace
df['Name'] = df['Name'].str.strip()
# Split strings
df['First_Name'] = df['Name'].str.split(' ').str[0]
# Contains pattern
mask = df['Name'].str.contains('John', case=False)
# Extract pattern
df['Numbers'] = df['Text'].str.extract(r'(\d+)')
# Replace in strings
df['Name'] = df['Name'].str.replace('John', 'Jonathan')
10. Handling Missing Data
10.1 Detecting Missing Data
# Check for missing values
df.isnull()
df.isna()
df.isnull().sum() # Count per column
df.isnull().any() # Any missing values per column
# Check for non-missing values
df.notna()
df.notnull()
10.2 Handling Missing Data
# Drop rows with any missing values
df.dropna()
# Drop rows where all values are missing
df.dropna(how='all')
# Drop rows with missing values in specific columns
df.dropna(subset=['Age', 'City'])
# Drop columns with missing values
df.dropna(axis=1)
# Fill missing values with a constant
df.fillna(0)
df['Age'].fillna(df['Age'].mean())
# Forward fill (propagate last valid observation)
df.ffill() # preferred; fillna(method='ffill') / method='pad' is deprecated in pandas 2.x
# Backward fill
df.bfill() # preferred; fillna(method='bfill') / method='backfill' is deprecated in pandas 2.x
# Fill with different values per column
df.fillna({'Age': 0, 'City': 'Unknown'})
# Interpolate missing values
df['Age'].interpolate()
df.interpolate(method='linear')
11. Data Aggregation and Grouping
11.1 Basic Aggregations
# Sum, mean, median, etc.
df['Age'].sum()
df['Age'].mean()
df['Age'].median()
df['Age'].std()
df['Age'].min()
df['Age'].max()
df['Age'].count()
# Multiple aggregations at once
df['Age'].agg(['sum', 'mean', 'std'])
# Aggregation on multiple columns
df[['Age', 'Salary']].agg(['mean', 'min', 'max'])
# Different aggregations for different columns
df.agg({
'Age': ['mean', 'std'],
'Salary': ['sum', 'mean']
})
11.2 GroupBy Operations
# Group by single column
grouped = df.groupby('City')
grouped['Age'].mean()
# Group by multiple columns
df.groupby(['City', 'Gender'])['Age'].mean()
# Multiple aggregations on groups
df.groupby('City').agg({
'Age': ['mean', 'min', 'max'],
'Salary': 'sum'
})
# Custom aggregation functions
df.groupby('City')['Age'].agg(
Average='mean',
Total='sum',
Range=lambda x: x.max() - x.min()
)
# Iterate through groups
for name, group in df.groupby('City'):
print(f"City: {name}")
print(group)
# Transform (return same size as input)
df['Age_Normalized'] = df.groupby('City')['Age'].transform(
lambda x: (x - x.mean()) / x.std()
)
# Filter groups
df.groupby('City').filter(lambda x: len(x) > 2)
# Get specific group
df.groupby('City').get_group('New York')
11.3 Pivot Tables
# Create pivot table
pivot = df.pivot_table(
values='Salary',
index='City',
columns='Gender',
aggfunc='mean'
)
# Multiple aggregation functions
pivot = df.pivot_table(
values='Salary',
index='City',
columns='Gender',
aggfunc=['mean', 'sum', 'count']
)
# Multiple values
pivot = df.pivot_table(
values=['Salary', 'Age'],
index='City',
columns='Gender',
aggfunc='mean'
)
# With margins (totals)
pivot = df.pivot_table(
values='Salary',
index='City',
columns='Gender',
aggfunc='mean',
margins=True
)
11.4 Cross-tabulation
# Create cross-tabulation
crosstab = pd.crosstab(df['City'], df['Gender'])
# With values
crosstab = pd.crosstab(
df['City'],
df['Gender'],
values=df['Salary'],
aggfunc='mean'
)
# With margins
crosstab = pd.crosstab(df['City'], df['Gender'], margins=True)
# Normalize
crosstab = pd.crosstab(df['City'], df['Gender'], normalize='columns')
12. Merging and Joining
12.1 Concatenation
# Vertical concatenation (stack rows)
df_combined = pd.concat([df1, df2])
df_combined = pd.concat([df1, df2], ignore_index=True)
# Horizontal concatenation (stack columns)
df_combined = pd.concat([df1, df2], axis=1)
# With keys for multi-index
df_combined = pd.concat([df1, df2], keys=['first', 'second'])
12.2 Merging/Joining
# Inner join (default)
merged = pd.merge(df1, df2, on='key')
# Left join
merged = pd.merge(df1, df2, on='key', how='left')
# Right join
merged = pd.merge(df1, df2, on='key', how='right')
# Outer join
merged = pd.merge(df1, df2, on='key', how='outer')
# Merge on multiple columns
merged = pd.merge(df1, df2, on=['key1', 'key2'])
# Merge with different column names
merged = pd.merge(df1, df2, left_on='key1', right_on='key2')
# Merge on index
merged = pd.merge(df1, df2, left_index=True, right_index=True)
# Using join method
merged = df1.join(df2, how='left')
merged = df1.join(df2, lsuffix='_left', rsuffix='_right')
12.3 Append (Deprecated)
# Use concat instead of append
df_combined = pd.concat([df1, df2], ignore_index=True)
13. Time Series Analysis
13.1 Creating DateTime Objects
# Convert string to datetime
df['date'] = pd.to_datetime(df['date_string'])
df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')
# Create date range
dates = pd.date_range('2023-01-01', periods=100, freq='D')
dates = pd.date_range('2023-01-01', '2023-12-31', freq='W')
# Common frequencies:
# 'D' - calendar day
# 'B' - business day
# 'W' - weekly
# 'M' - month end
# 'MS' - month start
# 'Q' - quarter end
# 'Y' - year end
# 'H' - hourly
# 'T' or 'min' - minutely
# 'S' - secondly
13.2 DateTime Components
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['dayofyear'] = df['date'].dt.dayofyear
df['week'] = df['date'].dt.isocalendar().week
df['quarter'] = df['date'].dt.quarter
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
# Day name
df['day_name'] = df['date'].dt.day_name()
df['month_name'] = df['date'].dt.month_name()
# Boolean checks
df['is_month_start'] = df['date'].dt.is_month_start
df['is_month_end'] = df['date'].dt.is_month_end
df['is_quarter_start'] = df['date'].dt.is_quarter_start
13.3 Time Series Operations
# Set datetime as index
df.set_index('date', inplace=True)
# Select by date (partial string indexing via .loc on a DatetimeIndex)
df.loc['2023'] # All data from 2023
df.loc['2023-01'] # All data from January 2023
df.loc['2023-01-01':'2023-01-31'] # Date range
# Resampling (changing frequency)
df.resample('M').mean() # Monthly mean
df.resample('W').sum() # Weekly sum
df.resample('Q').agg(['mean', 'sum']) # Quarterly aggregation
# Rolling window
df['rolling_mean'] = df['value'].rolling(window=7).mean()
df['rolling_std'] = df['value'].rolling(window=7).std()
df['rolling_sum'] = df['value'].rolling(window=30).sum()
# Expanding window
df['expanding_mean'] = df['value'].expanding().mean()
# Shifting
df['previous_day'] = df['value'].shift(1) # Lag: previous period's value
df['next_day'] = df['value'].shift(-1) # Lead: next period's value
df['pct_change'] = df['value'].pct_change() # Percentage change
# Time differences
df['time_diff'] = df['date'].diff()
13.4 Time Zones
# Localize timezone-naive to timezone-aware
df['date'] = df['date'].dt.tz_localize('UTC')
# Convert to different timezone
df['date'] = df['date'].dt.tz_convert('US/Eastern')
# Remove timezone
df['date'] = df['date'].dt.tz_localize(None)
14. Data Visualization
Pandas integrates with matplotlib for quick visualizations.
import matplotlib.pyplot as plt
# Line plot
df['Age'].plot()
df.plot(x='Name', y='Age')
plt.show()
# Multiple lines
df[['Age', 'Salary']].plot()
# Bar plot
df['City'].value_counts().plot(kind='bar')
df['City'].value_counts().plot.bar()
# Horizontal bar plot
df['City'].value_counts().plot(kind='barh')
# Histogram
df['Age'].plot(kind='hist', bins=20)
df['Age'].hist(bins=20)
# Box plot
df['Age'].plot(kind='box')
df.boxplot(column='Age', by='City')
# Scatter plot
df.plot(kind='scatter', x='Age', y='Salary')
df.plot.scatter(x='Age', y='Salary', c='Salary', colormap='viridis') # c must be numeric or a list of colors
# Pie chart
df['City'].value_counts().plot(kind='pie', autopct='%1.1f%%')
# Area plot
df.plot(kind='area')
# Hexbin plot
df.plot(kind='hexbin', x='Age', y='Salary', gridsize=20)
# Density plot
df['Age'].plot(kind='density')
# Subplots
df.plot(subplots=True, layout=(2, 2), figsize=(10, 8))
# Customization
df['Age'].plot(
kind='hist',
title='Age Distribution',
xlabel='Age',
ylabel='Frequency',
color='blue',
alpha=0.7,
figsize=(10, 6)
)
plt.show()
15. Performance Tips
15.1 Memory Optimization
# Check memory usage
df.memory_usage(deep=True)
# Optimize data types
df['int_column'] = df['int_column'].astype('int32') # Instead of int64
df['cat_column'] = df['cat_column'].astype('category') # For categorical data
# Convert object to category for columns with limited unique values
for col in df.select_dtypes(include=['object']):
if df[col].nunique() / len(df) < 0.5: # Less than 50% unique
df[col] = df[col].astype('category')
15.2 Vectorization
# SLOW: Using iterrows
for index, row in df.iterrows():
df.at[index, 'result'] = row['A'] + row['B']
# FAST: Vectorized operation
df['result'] = df['A'] + df['B']
# SLOW: Using apply with lambda
df['result'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
# FAST: Direct column operations
df['result'] = df['A'] + df['B']
15.3 Efficient Reading
# Read only necessary columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2', 'col3'])
# Specify data types
dtypes = {'col1': 'int32', 'col2': 'float32', 'col3': 'category'}
df = pd.read_csv('large_file.csv', dtype=dtypes)
# Read in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
process(chunk)
# Use nrows for testing
df = pd.read_csv('large_file.csv', nrows=1000)
15.4 Query vs Boolean Indexing
# Boolean indexing
result = df[(df['A'] > 5) & (df['B'] < 10)]
# Query (often faster for complex conditions)
result = df.query('A > 5 and B < 10')
15.5 Use Built-in Functions
# SLOW: Custom function
df['is_adult'] = df['Age'].apply(lambda x: 1 if x >= 18 else 0)
# FAST: NumPy where
df['is_adult'] = np.where(df['Age'] >= 18, 1, 0)
# FAST: Built-in methods
df['sum'] = df[['A', 'B', 'C']].sum(axis=1) # Instead of apply
15.6 Avoid Chained Assignment
# BAD: Chained assignment (can cause SettingWithCopyWarning)
df[df['Age'] > 30]['City'] = 'Updated'
# GOOD: Use loc
df.loc[df['Age'] > 30, 'City'] = 'Updated'
Additional Resources
Official documentation: https://pandas.pydata.org/docs/
Useful Libraries to Use with Pandas
- NumPy: Numerical computing
- Matplotlib/Seaborn: Visualization
- Scikit-learn: Machine learning
- SQLAlchemy: Database connections
- Openpyxl/XlsxWriter: Excel file handling (see the short sketch after this list)
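To show how a couple of the libraries above plug into pandas, here is a minimal sketch; the database, table, and file names are placeholders reused from the earlier examples:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')               # SQLAlchemy manages the connection
df = pd.read_sql('SELECT * FROM table_name', engine)          # read_sql accepts the engine directly
df.to_excel('output.xlsx', engine='xlsxwriter', index=False)  # XlsxWriter used as the Excel writer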
Best Practices
- Always make copies when modifying DataFrames: df_copy = df.copy()
- Use inplace=True cautiously; it modifies the original DataFrame
- Chain operations for readability: df.groupby('A').filter(...).sort_values(...)
- Use method chaining (backslashes or parentheses) for multi-line operations, as in the sketch after this list
- Set pd.set_option('display.max_columns', None) to see all columns
- Use df.pipe() for custom function chains
- Profile your code to identify bottlenecks
- Consider using Dask or Vaex for very large datasets
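For instance, a minimal sketch of a chained pipeline using the Age/City columns from the earlier examples, wrapped in parentheses as a common alternative to backslash continuations (assumes numpy is imported as np, as at the start of this tutorial):
result = (
    df[df['Age'] > 25]
    .assign(age_group=lambda d: np.where(d['Age'] < 40, 'young', 'senior'))  # add a derived column
    .groupby('City', as_index=False)['Age']
    .mean()                                                                  # mean age per city
    .sort_values('Age', ascending=False)
    .pipe(lambda d: d.rename(columns={'Age': 'mean_age'}))                   # custom step via pipe()
)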
Common Pitfalls
- SettingWithCopyWarning: use .loc[] or .copy() explicitly (see the short example after this list)
- Mixed data types in columns can slow operations
- Not resetting the index after filtering can cause issues
- Forgetting inplace=True or not assigning the result back
- Not handling missing data appropriately
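And a short example of the .copy() fix for the first pitfall, with column names borrowed from the earlier examples:
# A filtered slice may be a view of df; take an explicit copy before modifying it
subset = df[df['Age'] > 30].copy()
subset['City'] = 'Updated'   # modifies the copy only; the original df is untouched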
Comprehensive Pandas Cheatsheet
1. Importing and Setup
import pandas as pd
import numpy as np
# Display options
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.max_rows', 100) # Show max 100 rows
pd.set_option('display.max_colwidth', None) # Show full column width
pd.set_option('display.precision', 2) # Decimal precision
pd.set_option('display.float_format', '{:.2f}'.format) # Float format
pd.reset_option('all') # Reset all options
2. Creating Data Structures
# Series
s = pd.Series([1, 2, 3, 4, 5])
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s = pd.Series({'a': 1, 'b': 2, 'c': 3})
# DataFrame from dict
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# DataFrame from list of dicts
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
# DataFrame from numpy array
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
# Empty DataFrame
df = pd.DataFrame()
# DataFrame with index and columns
df = pd.DataFrame(data, index=idx, columns=cols)
3. Reading and Writing Files
# CSV
df = pd.read_csv('file.csv')
df = pd.read_csv('file.csv', sep=';', encoding='utf-8', header=0,
index_col=0, usecols=['col1', 'col2'], nrows=1000,
na_values=['NA', '?'], parse_dates=['date_col'])
df.to_csv('output.csv', index=False, encoding='utf-8', sep=',')
# Excel
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
df = pd.read_excel('file.xlsx', sheet_name=0) # First sheet
df = pd.read_excel('file.xlsx', sheet_name=None) # All sheets
df.to_excel('output.xlsx', sheet_name='Data', index=False)
# JSON
df = pd.read_json('file.json')
df = pd.read_json('file.json', orient='records')
df.to_json('output.json', orient='records', indent=2)
# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table', conn)
df = pd.read_sql_query('SELECT * FROM table WHERE col > 5', conn)
df = pd.read_sql_table('table_name', conn)
df.to_sql('table_name', conn, if_exists='replace', index=False)
# HTML
df = pd.read_html('url_or_file.html')[0]
df.to_html('output.html', index=False)
# Pickle (fast, pandas-specific format)
df.to_pickle('file.pkl')
df = pd.read_pickle('file.pkl')
# Parquet (efficient columnar format)
df.to_parquet('file.parquet')
df = pd.read_parquet('file.parquet')
# Clipboard
df = pd.read_clipboard()
df.to_clipboard(index=False)
4. Viewing and Inspecting Data
# View data
df.head(n) # First n rows (default 5)
df.tail(n) # Last n rows
df.sample(n) # Random n rows
df.sample(frac=0.1) # Random 10% of rows
# Shape and size
df.shape # (rows, columns)
df.size # Total elements
len(df) # Number of rows
df.ndim # Number of dimensions
# Index and columns
df.index # Row index
df.columns # Column names
df.axes # List of [index, columns]
# Data types
df.dtypes # Data type of each column
df.dtypes.value_counts() # Count of each data type
df.info() # Detailed info
df.info(memory_usage='deep') # Deep memory usage
# Statistics
df.describe() # Summary statistics for numeric columns
df.describe(include='all') # All columns
df.describe(include=['object']) # Only object columns
df.describe(exclude=['float64']) # Exclude certain types
# Counts
df.count() # Non-null values per column
df['col'].value_counts() # Frequency counts
df['col'].value_counts(normalize=True) # Proportions
df['col'].value_counts(dropna=False) # Include NaN
df['col'].nunique() # Number of unique values
df['col'].unique() # Array of unique values
# Min/Max
df.min() # Min of each column
df.max() # Max of each column
df.idxmin() # Index of min value
df.idxmax() # Index of max value
# Sum and Mean
df.sum() # Sum of each column
df.sum(axis=1) # Sum of each row
df.mean() # Mean of each column
df.median() # Median
df.mode() # Mode
df.std() # Standard deviation
df.var() # Variance
# Correlations
df.corr() # Correlation matrix
df.corr(method='pearson') # Pearson correlation
df.corr(method='spearman') # Spearman correlation
df['A'].corr(df['B']) # Correlation between two columns
df.corrwith(df['A']) # Correlation with one column
# Memory
df.memory_usage() # Memory usage per column
df.memory_usage(deep=True) # Deep memory usage
5. Selection and Indexing
# Column selection
df['col'] # Single column (Series)
df[['col']] # Single column (DataFrame)
df[['col1', 'col2']] # Multiple columns
df.col # Column by attribute (if valid name)
# Row selection by position (iloc)
df.iloc[0] # First row
df.iloc[-1] # Last row
df.iloc[0:5] # First 5 rows
df.iloc[[0, 2, 4]] # Specific rows
df.iloc[:, 0] # First column
df.iloc[:, 0:3] # First 3 columns
df.iloc[0:5, 0:3] # Rows 0-4, columns 0-2
# Row selection by label (loc)
df.loc[0] # Row with index label 0
df.loc[0:5] # Rows 0 to 5 (inclusive!)
df.loc[:, 'A'] # Column 'A'
df.loc[:, 'A':'C'] # Columns A to C (inclusive)
df.loc[0:5, ['A', 'C']] # Specific rows and columns
df.loc[df['A'] > 5, 'B'] # Conditional selection
# Scalar access
df.at[0, 'A'] # Fast scalar access by label
df.iat[0, 1] # Fast scalar access by position
# Boolean indexing
df[df['A'] > 5] # Rows where A > 5
df[df['A'].isin([1, 2, 3])] # Rows where A is in list
df[df['A'].between(1, 5)] # Rows where A is between 1 and 5
df[~df['A'].isin([1, 2])] # NOT in list
df[(df['A'] > 5) & (df['B'] < 10)] # Multiple conditions (AND)
df[(df['A'] > 5) | (df['B'] < 10)] # Multiple conditions (OR)
# Query method
df.query('A > 5')
df.query('A > 5 and B < 10')
df.query('A > @threshold') # Use variables with @
# Select by data type
df.select_dtypes(include=['float64', 'int64'])
df.select_dtypes(include='number')
df.select_dtypes(exclude=['object'])
# Filter columns by name
df.filter(items=['A', 'B']) # Exact names
df.filter(like='_total') # Contains substring
df.filter(regex='^col') # Regex pattern
# Where
df.where(df > 0) # Keep values where condition is True
df.where(df > 0, -df) # Replace False values
df.mask(df > 0) # Opposite of where
6. Filtering and Querying
# Basic filtering
df[df['A'] > 5]
df[df['A'] == 'value']
df[df['A'] != 'value']
df[df['A'].isna()]
df[df['A'].notna()]
# Multiple conditions
df[(df['A'] > 5) & (df['B'] < 10)]
df[(df['A'] > 5) | (df['B'] < 10)]
df[~(df['A'] > 5)] # NOT
# isin() method
df[df['A'].isin([1, 2, 3])]
df[df['A'].isin(other_df['B'])]
# String methods
df[df['col'].str.contains('pattern')]
df[df['col'].str.startswith('A')]
df[df['col'].str.endswith('z')]
df[df['col'].str.match(r'^[A-Z]')]
df[df['col'].str.len() > 5]
df[df['col'].str.isdigit()]
df[df['col'].str.isalpha()]
# between() method
df[df['A'].between(1, 10)]
df[df['A'].between(1, 10, inclusive='neither')]
# query() method
df.query('A > 5')
df.query('A > 5 and B < 10')
df.query('A in [1, 2, 3]')
df.query('col.str.contains("pattern")', engine='python')
# nlargest/nsmallest
df.nlargest(10, 'A') # 10 largest values in column A
df.nsmallest(10, 'A') # 10 smallest values
df.nlargest(5, ['A', 'B']) # By multiple columns
7. Adding, Modifying, and Removing Data
# Add columns
df['new_col'] = value # Constant value
df['new_col'] = df['A'] + df['B'] # Computed column
df['new_col'] = df['A'].apply(lambda x: x*2)
df.insert(1, 'new_col', values) # Insert at position
# Assign (method chaining friendly)
df = df.assign(new_col=df['A'] * 2)
df = df.assign(col1=lambda x: x['A'] * 2,
col2=lambda x: x['col1'] + 1)
# Modify columns
df['A'] = df['A'] * 2
df.loc[:, 'A'] = df['A'] * 2
df.loc[df['A'] > 5, 'B'] = 0
# Remove columns
df.drop('col', axis=1)
df.drop(['col1', 'col2'], axis=1)
df.drop(columns=['col1', 'col2'])
del df['col']
df.pop('col') # Returns the column
# Add rows
df.loc[len(df)] = [value1, value2, value3]
df = pd.concat([df, new_row_df], ignore_index=True)
# Remove rows
df.drop(0) # Drop row with index 0
df.drop([0, 1, 2]) # Drop multiple rows
df.drop(index=[0, 1]) # Explicit index parameter
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['A', 'B'])
df.drop_duplicates(keep='first') # Keep first occurrence
df.drop_duplicates(keep='last') # Keep last occurrence
df.drop_duplicates(keep=False) # Remove all duplicates
8. Renaming and Reindexing
# Rename columns
df.rename(columns={'old': 'new'})
df.rename(columns={'A': 'col_A', 'B': 'col_B'})
df.rename(columns=str.upper) # Apply function to all
df.columns = ['A', 'B', 'C'] # Replace all column names
# Rename index
df.rename(index={0: 'first', 1: 'second'})
df.rename(index=str)
# Add prefix/suffix
df.add_prefix('col_')
df.add_suffix('_total')
# Reset index
df.reset_index() # Move index to column
df.reset_index(drop=True) # Drop the index
# Set index
df.set_index('col')
df.set_index(['col1', 'col2']) # MultiIndex
# Reindex
df.reindex([0, 2, 4]) # New index
df.reindex(columns=['A', 'C']) # Reorder columns
df.reindex_like(other_df) # Match another DataFrame
9. Sorting
# Sort by values
df.sort_values('A')
df.sort_values('A', ascending=False)
df.sort_values(['A', 'B'])
df.sort_values(['A', 'B'], ascending=[True, False])
df.sort_values('A', na_position='first') # NaN first
# Sort by index
df.sort_index()
df.sort_index(ascending=False)
df.sort_index(axis=1) # Sort columns
# Sort by multiple columns with different orders
df.sort_values(by=['col1', 'col2'],
ascending=[True, False],
na_position='last')
10. Apply, Map, and Transform
# apply() - Apply function to each column or row
df.apply(np.sum) # Sum of each column
df.apply(np.sum, axis=1) # Sum of each row
df['A'].apply(lambda x: x * 2)
df.apply(lambda x: x.max() - x.min())
# applymap() - Apply function to each element (renamed to DataFrame.map() in pandas 2.1+)
df.applymap(lambda x: x * 2)
df.applymap(str)
# map() - Map values (Series only)
df['A'].map({1: 'one', 2: 'two', 3: 'three'})
df['A'].map(lambda x: x * 2)
# replace() - Replace values
df.replace(0, np.nan)
df.replace([0, 1], [100, 200])
df.replace({0: 100, 1: 200})
df.replace({'A': {0: 100}, 'B': {1: 200}})
df['A'].replace(0, 100)
# transform() - Transform values
df.transform(lambda x: x * 2)
df.transform([np.sqrt, np.square])
df.groupby('A').transform('mean')
# pipe() - Chain custom functions
df.pipe(func1).pipe(func2).pipe(func3)
11. String Operations
# Case conversion
df['col'].str.lower()
df['col'].str.upper()
df['col'].str.title()
df['col'].str.capitalize()
df['col'].str.swapcase()
# Strip whitespace
df['col'].str.strip()
df['col'].str.lstrip()
df['col'].str.rstrip()
df['col'].str.strip('chars')
# Split and join
df['col'].str.split()
df['col'].str.split(',')
df['col'].str.split(',', expand=True) # Create new columns
df['col'].str.rsplit(',', n=1) # Split from right
df['col'].str.join('-')
# Substring operations
df['col'].str[:5] # First 5 characters
df['col'].str[5:] # From 5th character
df['col'].str[2:5] # Characters 2-4
df['col'].str.slice(0, 5)
df['col'].str.slice(start=2, stop=5, step=1)
# Replace and remove
df['col'].str.replace('old', 'new')
df['col'].str.replace(r'\d+', 'NUM', regex=True)
df['col'].str.removeprefix('prefix')
df['col'].str.removesuffix('suffix')
# Find and count
df['col'].str.contains('pattern')
df['col'].str.contains('pattern', case=False)
df['col'].str.contains(r'\d+', regex=True)
df['col'].str.startswith('A')
df['col'].str.endswith('z')
df['col'].str.find('pattern')
df['col'].str.count('pattern')
# Extract
df['col'].str.extract(r'(\d+)') # First group
df['col'].str.extract(r'(\d+)-(\w+)') # Multiple groups
df['col'].str.extractall(r'(\d+)') # All matches
df['col'].str.findall(r'\d+')
# Length and checking
df['col'].str.len()
df['col'].str.isdigit()
df['col'].str.isalpha()
df['col'].str.isalnum()
df['col'].str.isspace()
df['col'].str.islower()
df['col'].str.isupper()
df['col'].str.istitle()
# Padding
df['col'].str.pad(10, side='left', fillchar='0')
df['col'].str.zfill(10) # Zero-pad
df['col'].str.center(10, fillchar='*')
df['col'].str.ljust(10)
df['col'].str.rjust(10)
# Get specific element from split
df['col'].str.split('-').str[0]
df['col'].str.split('-').str.get(0)
12. Missing Data
# Detect missing data
df.isnull() # Boolean DataFrame
df.isna() # Same as isnull()
df.notna() # Opposite
df.isnull().sum() # Count per column
df.isnull().sum().sum() # Total count
df.isnull().any() # Any missing per column
df.isnull().all() # All missing per column
# Drop missing data
df.dropna() # Drop rows with any NaN
df.dropna(axis=1) # Drop columns with any NaN
df.dropna(how='all') # Drop if all values are NaN
df.dropna(thresh=2) # Drop if less than 2 non-NaN
df.dropna(subset=['A', 'B']) # Drop based on specific columns
# Fill missing data
df.fillna(0) # Fill with constant
df.fillna({'A': 0, 'B': 1}) # Different values per column
df.ffill() # Forward fill (fillna(method='ffill') is deprecated in pandas 2.x)
df.bfill() # Backward fill
df.fillna(df.mean()) # Fill with mean
df['A'].fillna(df['A'].mean()) # Fill column with its mean
df.ffill(limit=2) # Limit number of consecutive fills
# Interpolate
df.interpolate() # Linear interpolation
df.interpolate(method='polynomial', order=2)
df.interpolate(method='time') # For time series
df.interpolate(limit=2) # Limit consecutive fills
df.interpolate(limit_direction='forward')
# Replace NaN
df.replace(np.nan, 0)
df.where(pd.notna(df), 0)
13. Aggregation and Statistics
# Basic statistics
df.sum() # Sum
df.sum(axis=1) # Row-wise sum
df.mean() # Mean
df.median() # Median
df.mode() # Mode
df.std() # Standard deviation
df.var() # Variance
df.min() # Minimum
df.max() # Maximum
df.abs() # Absolute value
df.prod() # Product
df.cumsum() # Cumulative sum
df.cumprod() # Cumulative product
df.cummin() # Cumulative minimum
df.cummax() # Cumulative maximum
# Quantiles
df.quantile(0.25) # 25th percentile
df.quantile([0.25, 0.5, 0.75]) # Multiple quantiles
# Ranking
df.rank() # Rank values
df.rank(ascending=False)
df.rank(method='first') # first, average, min, max, dense
# Aggregation with agg()
df.agg(['sum', 'mean', 'std'])
df.agg({'A': ['sum', 'mean'], 'B': ['min', 'max']})
df.agg({'A': 'sum', 'B': 'mean'})
df['A'].agg(['sum', 'mean', lambda x: x.max() - x.min()])
# Custom aggregation
df.agg(custom_func)
df.agg({'A': custom_func, 'B': 'mean'})
14. GroupBy Operations
# Basic groupby
grouped = df.groupby('A')
grouped = df.groupby(['A', 'B']) # Multiple columns
grouped = df.groupby(df['A'] > 5) # By condition
# Aggregation after groupby
df.groupby('A').sum()
df.groupby('A').mean()
df.groupby('A')['B'].sum() # Specific column
df.groupby('A')[['B', 'C']].sum() # Multiple columns
# Multiple aggregations
df.groupby('A').agg(['sum', 'mean', 'std'])
df.groupby('A').agg({'B': 'sum', 'C': 'mean'})
df.groupby('A').agg({'B': ['sum', 'mean'], 'C': 'max'})
# Named aggregations
df.groupby('A').agg(
total=('B', 'sum'),
average=('B', 'mean'),
max_value=('C', 'max')
)
# Transform
df.groupby('A').transform('mean') # Broadcast to original shape
df['B_normalized'] = df.groupby('A')['B'].transform(
lambda x: (x - x.mean()) / x.std()
)
# Filter groups
df.groupby('A').filter(lambda x: len(x) > 2)
df.groupby('A').filter(lambda x: x['B'].sum() > 100)
# Apply custom function
df.groupby('A').apply(lambda x: x.max() - x.min())
# Iterate through groups
for name, group in df.groupby('A'):
print(f"Group: {name}")
print(group)
# Get specific group
df.groupby('A').get_group('value')
# Group by multiple with different operations
df.groupby(['A', 'B']).agg({
'C': ['sum', 'mean'],
'D': 'max'
}).reset_index()
# Size and count
df.groupby('A').size() # Size of each group
df.groupby('A').count() # Count non-null values
df.groupby('A')['B'].count()
# First and last
df.groupby('A').first()
df.groupby('A').last()
df.groupby('A').head(2) # First 2 of each group
df.groupby('A').tail(2) # Last 2 of each group
15. Pivot Tables and Reshaping
# Pivot table
pd.pivot_table(df, values='D', index='A', columns='B')
pd.pivot_table(df, values='D', index='A', columns='B', aggfunc='sum')
pd.pivot_table(df, values='D', index=['A', 'B'], columns='C')
pd.pivot_table(df, values=['D', 'E'], index='A', columns='B')
pd.pivot_table(df, index='A', columns='B', aggfunc={'D': 'sum', 'E': 'mean'})
pd.pivot_table(df, index='A', columns='B', margins=True) # Add totals
# Pivot (no aggregation)
df.pivot(index='A', columns='B', values='C')
# Melt (unpivot)
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
pd.melt(df, id_vars=['A'], var_name='variable', value_name='value')
# Stack and unstack
df.stack() # Pivot columns to rows
df.unstack() # Pivot rows to columns
df.unstack(level=0) # Unstack specific level
df.stack().unstack() # Round trip
# Wide to long
pd.wide_to_long(df, stubnames='col', i='id', j='time')
# Transpose
df.T # Transpose (swap rows and columns)
df.transpose()
# Crosstab
pd.crosstab(df['A'], df['B'])
pd.crosstab(df['A'], df['B'], normalize='columns')
pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='mean')
pd.crosstab(df['A'], df['B'], margins=True)
16. Merging, Joining, and Concatenating
# Concatenate
pd.concat([df1, df2]) # Vertical (rows)
pd.concat([df1, df2], axis=1) # Horizontal (columns)
pd.concat([df1, df2], ignore_index=True) # Reset index
pd.concat([df1, df2], keys=['first', 'second']) # Add keys
# Merge (like SQL join)
pd.merge(df1, df2, on='key') # Inner join
pd.merge(df1, df2, on='key', how='left') # Left join
pd.merge(df1, df2, on='key', how='right') # Right join
pd.merge(df1, df2, on='key', how='outer') # Outer join
pd.merge(df1, df2, left_on='key1', right_on='key2') # Different names
pd.merge(df1, df2, on=['key1', 'key2']) # Multiple keys
pd.merge(df1, df2, left_index=True, right_index=True) # On index
pd.merge(df1, df2, on='key', suffixes=('_left', '_right')) # Suffix
pd.merge(df1, df2, on='key', indicator=True) # Add merge indicator
# Join (merge on index)
df1.join(df2) # Left join on index
df1.join(df2, how='inner')
df1.join(df2, on='key')
df1.join(df2, lsuffix='_l', rsuffix='_r')
# Combine first (update missing values)
df1.combine_first(df2)
# Update
df1.update(df2)
17. Time Series
# Create datetime
pd.to_datetime('2023-01-01')
pd.to_datetime(['2023-01-01', '2023-01-02'])
pd.to_datetime(df['date_col'], format='%Y-%m-%d')
pd.to_datetime(df['date_col'], errors='coerce') # Invalid to NaT
# Date range
pd.date_range('2023-01-01', periods=10)
pd.date_range('2023-01-01', '2023-12-31')
pd.date_range('2023-01-01', periods=10, freq='D') # Daily
pd.date_range('2023-01-01', periods=10, freq='B') # Business days
pd.date_range('2023-01-01', periods=10, freq='W') # Weekly
pd.date_range('2023-01-01', periods=10, freq='M') # Month end
pd.date_range('2023-01-01', periods=10, freq='MS') # Month start
pd.date_range('2023-01-01', periods=10, freq='H') # Hourly
# Time delta
pd.Timedelta('1 day')
pd.Timedelta(days=1, hours=2, minutes=30)
pd.timedelta_range('1 day', periods=10)
# Extract datetime components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
df['second'] = df['date'].dt.second
df['dayofweek'] = df['date'].dt.dayofweek # Monday=0
df['dayofyear'] = df['date'].dt.dayofyear
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['day_name'] = df['date'].dt.day_name()
df['month_name'] = df['date'].dt.month_name()
# Boolean datetime checks
df['is_month_start'] = df['date'].dt.is_month_start
df['is_month_end'] = df['date'].dt.is_month_end
df['is_quarter_start'] = df['date'].dt.is_quarter_start
df['is_quarter_end'] = df['date'].dt.is_quarter_end
df['is_year_start'] = df['date'].dt.is_year_start
df['is_year_end'] = df['date'].dt.is_year_end
df['is_leap_year'] = df['date'].dt.is_leap_year
# Date offsets
df['date'] + pd.DateOffset(days=1)
df['date'] + pd.DateOffset(months=1)
df['date'] + pd.offsets.BusinessDay(5)
# Set datetime index
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index)
# Select by date (partial string indexing via .loc on a DatetimeIndex)
df.loc['2023'] # All of 2023
df.loc['2023-01'] # January 2023
df.loc['2023-01-15'] # Specific day
df.loc['2023-01':'2023-03'] # Date range
# Resample (change frequency)
df.resample('D').mean() # Daily mean
df.resample('W').sum() # Weekly sum
df.resample('M').last() # Monthly last value
df.resample('Q').mean() # Quarterly mean
df.resample('Y').sum() # Yearly sum
df.resample('2D').mean() # Every 2 days
df.resample('H').ffill() # Hourly with forward fill
# Rolling window
df.rolling(window=7).mean() # 7-period moving average
df.rolling(window=7).sum()
df.rolling(window=7).std()
df.rolling(window='7D').mean() # 7-day window (time-based)
df.rolling(window=7, min_periods=1).mean() # Allow fewer periods
# Expanding window
df.expanding().mean() # Cumulative mean
df.expanding(min_periods=3).sum()
# Shifting
df.shift(1) # Shift forward by 1
df.shift(-1) # Shift backward by 1
df.shift(freq='D') # Shift by time frequency
# (tshift() was removed in pandas 2.0; use shift(freq=...) as above to shift the index)
# Lagging
df['lag1'] = df['value'].shift(1)
df['lag7'] = df['value'].shift(7)
# Percentage change
df.pct_change() # Period-over-period change
df.pct_change(periods=7) # 7-period change
# Timezone
df['date'].dt.tz_localize('UTC') # Make timezone-aware
df['date'].dt.tz_convert('US/Eastern') # Convert timezone
df['date'].dt.tz_localize(None) # Remove timezone
18. Categorical Data
# Create categorical
df['category'] = pd.Categorical(df['category'])
df['category'] = df['category'].astype('category')
# Create with specific categories
cat = pd.Categorical(['a', 'b', 'c'], categories=['a', 'b', 'c', 'd'])
# Ordered categorical
cat = pd.Categorical(['low', 'medium', 'high'],
categories=['low', 'medium', 'high'],
ordered=True)
# Get categories
df['category'].cat.categories
# Add categories
df['category'].cat.add_categories(['new_cat'])
# Remove categories
df['category'].cat.remove_categories(['old_cat'])
# Rename categories
df['category'].cat.rename_categories({'old': 'new'})
# Reorder categories
df['category'].cat.reorder_categories(['c', 'b', 'a'])
# Set categories
df['category'].cat.set_categories(['a', 'b', 'c'])
# Get codes
df['category'].cat.codes
19. Multi-Index (Hierarchical Index)
# Create MultiIndex
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]])
idx = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
df = pd.DataFrame(data, index=idx)
# Set MultiIndex from columns
df.set_index(['col1', 'col2'])
# Select from MultiIndex
df.loc['a'] # First level
df.loc[('a', 1)] # Both levels
df.loc[('a', slice(None)), :] # All of first level
# Cross-section
df.xs('a', level=0) # All rows where level 0 is 'a'
df.xs(1, level=1) # All rows where level 1 is 1
# Stack and unstack with MultiIndex
df.stack()
df.unstack(level=0)
df.unstack(level='level_name')
# Swap levels
df.swaplevel(0, 1)
# Sort MultiIndex
df.sort_index(level=0)
df.sort_index(level=[0, 1])
# Reset MultiIndex
df.reset_index()
df.reset_index(level=0)
20. Window Functions
# Rolling window
df.rolling(window=3).mean()
df.rolling(window=3).sum()
df.rolling(window=3).std()
df.rolling(window=3).min()
df.rolling(window=3).max()
df.rolling(window=3).median()
df.rolling(window=3).count()
df.rolling(window=3).apply(custom_func)
# Expanding window
df.expanding().mean()
df.expanding(min_periods=3).sum()
# Exponentially weighted
df.ewm(span=20).mean() # EMA
df.ewm(alpha=0.5).mean()
df.ewm(halflife=10).mean()
21. Performance and Optimization
# Data types optimization
df['int_col'] = df['int_col'].astype('int32')
df['cat_col'] = df['cat_col'].astype('category')
df['sparse_col'] = df['sparse_col'].astype(pd.SparseDtype('int'))
# Chunking for large files
for chunk in pd.read_csv('large.csv', chunksize=10000):
process(chunk)
# Query instead of boolean indexing
df.query('A > 5 and B < 10') # Often faster
# Use vectorized operations
df['result'] = df['A'] + df['B'] # Not apply()
# Use numexpr-backed eval for complex expressions (df_A, df_B, df_C must be existing DataFrames/Series in scope)
pd.eval('df_A + df_B * df_C')
# Copy explicitly
df_copy = df.copy()
df_deep_copy = df.copy(deep=True)
# Memory usage
df.info(memory_usage='deep')
df.memory_usage(deep=True)
# Convert to sparse
df_sparse = df.astype(pd.SparseDtype('float', fill_value=0))
22. Advanced Operations
# explode() - Expand lists in cells
df.explode('list_column')
# eval() - Fast evaluation of expressions
df.eval('C = A + B')
df.eval('C = A + B', inplace=True)
# Binning
pd.cut(df['A'], bins=5) # Equal-width bins
pd.cut(df['A'], bins=[0, 10, 20, 30]) # Custom bins
pd.qcut(df['A'], q=4) # Equal-frequency bins (quartiles)
pd.qcut(df['A'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
# Dummy variables
pd.get_dummies(df['category'])
pd.get_dummies(df['category'], prefix='cat')
pd.get_dummies(df, columns=['cat1', 'cat2'])
pd.get_dummies(df, drop_first=True) # Avoid multicollinearity
# Ranking
df.rank()
df.rank(method='min')
df.rank(pct=True) # Percentile ranks
# Sampling
df.sample(n=100)
df.sample(frac=0.1)
df.sample(n=100, replace=True) # With replacement
df.sample(n=100, weights='weight_col') # Weighted sampling
# Clip values
df.clip(lower=0, upper=100)
df['A'].clip(lower=0)
# Factorize (convert to integers)
codes, unique = pd.factorize(df['category'])
23. Validation and Testing
# Check data types
assert df['col'].dtype == 'int64'
# Check for nulls
assert df['col'].notna().all()
# Check uniqueness
assert df['id'].is_unique
# Check value range
assert (df['age'] >= 0).all()
assert (df['age'] <= 120).all()
# Check membership
assert df['category'].isin(['A', 'B', 'C']).all()
# equals() - Check if DataFrames are equal
df1.equals(df2)
# compare() - Show differences
df1.compare(df2)
24. Working with Duplicates
# Find duplicates
df.duplicated() # Boolean Series
df.duplicated(subset=['A', 'B']) # Based on specific columns
df.duplicated(keep='first') # Mark duplicates except first
df.duplicated(keep='last') # Mark duplicates except last
df.duplicated(keep=False) # Mark all duplicates
# View duplicates
df[df.duplicated()]
df[df.duplicated(subset=['A'], keep=False)]
# Count duplicates
df.duplicated().sum()
# Drop duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['A'])
df.drop_duplicates(keep='last')
25. Useful Utilities
# Info about DataFrame
df.info()
df.describe()
df.dtypes
df.shape
df.columns
df.index
# Convert types
df.astype('float')
df.astype({'A': 'int32', 'B': 'float64'})
df['A'].astype('category')
# Copy
df.copy()
df.copy(deep=True)
# Iteration (avoid when possible)
for index, row in df.iterrows(): # Row by row
print(row['A'])
for row in df.itertuples(): # Faster than iterrows
print(row.A)
# Apply function to each element
df.applymap(func) # Entire DataFrame (DataFrame.map() in pandas 2.1+)
df['col'].apply(func) # Series
# Pipe for method chaining
df.pipe(func1).pipe(func2).pipe(func3)
# Style (for display)
df.style.highlight_max()
df.style.background_gradient()
df.style.format('{:.2f}')
# Options
pd.options.display.max_rows = 100
pd.options.display.max_columns = None
pd.set_option('display.max_rows', 100)
pd.get_option('display.max_rows')
pd.reset_option('display.max_rows')
pd.describe_option() # Show all options
Conclusion
This tutorial covers the essential and advanced features of pandas for data manipulation and analysis. Practice these concepts with real datasets to become proficient. Remember that pandas is constantly evolving, so check the official documentation for the latest features and best practices.
Happy data wrangling! 🐼