Table of Contents

    1. Introduction to Pandas
    2. Installation
    3. Data Structures
    4. Creating DataFrames
    5. Reading and Writing Data
    6. Data Inspection
    7. Data Selection and Indexing
    8. Data Filtering
    9. Data Manipulation
    10. Handling Missing Data
    11. Data Aggregation and Grouping
    12. Merging and Joining
    13. Time Series Analysis
    14. Data Visualization
    15. Performance Tips

    1. Introduction to Pandas

    Pandas is a powerful, open-source data analysis and manipulation library for Python. It provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

    Key Features:

    • Fast and efficient DataFrame object for data manipulation
    • Tools for reading and writing data between in-memory data structures and different formats
    • Intelligent data alignment and integrated handling of missing data (see the sketch after this list)
    • Flexible reshaping and pivoting of datasets
    • Powerful group by functionality
    • Time series functionality
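
    The data-alignment point deserves a quick illustration: arithmetic between labeled structures matches values by index label, not by position, and fills non-overlapping labels with NaN. A minimal sketch:

    import pandas as pd

    s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
    s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

    # Values pair up by label; 'a' and 'd' have no partner, so the result is NaN there
    print(s1 + s2)   # a: NaN, b: 12.0, c: 23.0, d: NaN
    Python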

    2. Installation

    # Using pip
    pip install pandas
    
    # Using conda
    conda install pandas
    
    # Install with additional dependencies
    pip install pandas numpy matplotlib
    Bash

    Import pandas:

    import pandas as pd
    import numpy as np
    Python

    3. Data Structures

    Pandas has two primary data structures:

    3.1 Series

    A one-dimensional labeled array capable of holding any data type.

    # Creating a Series
    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)
    
    # Series with custom index
    s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
    print(s)
    
    # From dictionary
    data = {'a': 10, 'b': 20, 'c': 30}
    s = pd.Series(data)
    Python

    3.2 DataFrame

    A two-dimensional labeled data structure with columns of potentially different types.

    # Creating a DataFrame
    df = pd.DataFrame({
        'A': [1, 2, 3, 4],
        'B': ['a', 'b', 'c', 'd'],
        'C': [1.5, 2.5, 3.5, 4.5]
    })
    print(df)
    Python

    4. Creating DataFrames

    4.1 From Dictionary

    # Dictionary of lists
    data = {
        'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']
    }
    df = pd.DataFrame(data)
    
    # Dictionary of Series
    data = {
        'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
        'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    }
    df = pd.DataFrame(data)
    Python

    4.2 From List of Dictionaries

    data = [
        {'a': 1, 'b': 2, 'c': 3},
        {'a': 10, 'b': 20, 'c': 30},
        {'a': 100, 'b': 200, 'c': 300}
    ]
    df = pd.DataFrame(data)
    Python

    4.3 From NumPy Array

    # Create from numpy array
    data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    df = pd.DataFrame(data, columns=['A', 'B', 'C'])
    Python

    5. Reading and Writing Data

    5.1 Reading Data

    # CSV files
    df = pd.read_csv('data.csv')
    df = pd.read_csv('data.csv', sep=';', encoding='utf-8')
    
    # Excel files
    df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
    
    # JSON files
    df = pd.read_json('data.json')
    
    # SQL databases
    import sqlite3
    conn = sqlite3.connect('database.db')
    df = pd.read_sql('SELECT * FROM table_name', conn)
    
    # HTML tables
    df = pd.read_html('https://example.com/table.html')[0]
    
    # Clipboard
    df = pd.read_clipboard()
    
    # With specific parameters
    df = pd.read_csv('data.csv', 
                     header=0,           # Row to use as header
                     index_col=0,        # Column to use as index
                     na_values=['NA'],   # Values to recognize as NA
                     parse_dates=['date'], # Parse date columns
                     nrows=1000)         # Number of rows to read
    Python

    5.2 Writing Data

    # CSV
    df.to_csv('output.csv', index=False)
    
    # Excel
    df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
    
    # JSON
    df.to_json('output.json', orient='records')
    
    # SQL
    df.to_sql('table_name', conn, if_exists='replace', index=False)
    
    # HTML
    df.to_html('output.html')
    
    # Clipboard
    df.to_clipboard()
    Python

    6. Data Inspection

    # View first/last rows
    df.head()        # First 5 rows
    df.head(10)      # First 10 rows
    df.tail()        # Last 5 rows
    
    # Shape and size
    df.shape         # (rows, columns)
    df.size          # Total number of elements
    len(df)          # Number of rows
    
    # Column information
    df.columns       # Column names
    df.dtypes        # Data types of columns
    df.info()        # Detailed information
    
    # Statistical summary
    df.describe()    # For numerical columns
    df.describe(include='all')  # For all columns
    
    # Unique values
    df['column'].unique()       # Unique values
    df['column'].nunique()      # Count of unique values
    df['column'].value_counts() # Frequency of values
    
    # Check for missing values
    df.isnull().sum()
    df.isna().sum()
    
    # Memory usage
    df.memory_usage(deep=True)
    Python

    7. Data Selection and Indexing

    7.1 Selecting Columns

    # Single column (returns Series)
    df['Name']
    df.Name  # Dot notation (only for valid Python identifiers that don't clash with DataFrame attributes)
    
    # Multiple columns (returns DataFrame)
    df[['Name', 'Age']]
    
    # Select columns by data type
    df.select_dtypes(include=['int64', 'float64'])
    df.select_dtypes(exclude=['object'])
    Python

    7.2 Selecting Rows

    # By position (iloc)
    df.iloc[0]           # First row
    df.iloc[0:5]         # First 5 rows
    df.iloc[:, 0]        # First column
    df.iloc[0:5, 0:3]    # First 5 rows, first 3 columns
    
    # By label (loc)
    df.loc[0]            # Row with index 0
    df.loc[0:5]          # Rows 0 to 5 (inclusive)
    df.loc[:, 'Name']    # All rows, 'Name' column
    df.loc[0:5, ['Name', 'Age']]  # Specific rows and columns
    
    # Selecting specific cells
    df.at[0, 'Name']     # Fast access to scalar
    df.iat[0, 1]         # Fast integer-based access
    Python
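
    A subtlety worth pinning down: iloc slices are end-exclusive like ordinary Python slicing, while loc slices include both endpoints. A quick sketch with a default integer index:

    df_demo = pd.DataFrame({'A': range(10)})

    len(df_demo.iloc[0:5])   # 5 rows: positions 0 through 4
    len(df_demo.loc[0:5])    # 6 rows: labels 0 through 5, inclusive
    Python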

    7.3 Boolean Indexing

    # Single condition
    df[df['Age'] > 30]
    
    # Multiple conditions
    df[(df['Age'] > 25) & (df['City'] == 'New York')]
    df[(df['Age'] < 25) | (df['Age'] > 35)]
    
    # Using isin()
    df[df['City'].isin(['New York', 'Paris'])]
    
    # String methods
    df[df['Name'].str.contains('John')]
    df[df['Name'].str.startswith('A')]
    Python

    8. Data Filtering

    # Filter by condition
    filtered = df[df['Age'] > 30]
    
    # Filter using query() method
    filtered = df.query('Age > 30 and City == "New York"')
    
    # Filter with multiple conditions
    filtered = df[(df['Age'] > 25) & (df['Age'] < 35)]
    
    # Filter rows with non-null values
    filtered = df[df['column'].notna()]
    
    # Filter based on string patterns
    filtered = df[df['Name'].str.match(r'^[A-M]')]
    
    # Remove duplicates
    df_unique = df.drop_duplicates()
    df_unique = df.drop_duplicates(subset=['Name'])
    df_unique = df.drop_duplicates(keep='last')
    
    # Sample data
    sample = df.sample(n=100)          # 100 random rows
    sample = df.sample(frac=0.1)       # 10% of rows
    Python

    9. Data Manipulation

    9.1 Adding/Removing Columns

    # Add new column
    df['NewColumn'] = df['Age'] * 2
    df['Category'] = 'A'  # Constant value
    
    # Add computed column
    df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Old')
    
    # Insert column at specific position
    df.insert(1, 'NewCol', [1, 2, 3, 4])
    
    # Remove columns
    df.drop('NewColumn', axis=1, inplace=True)
    df.drop(['Col1', 'Col2'], axis=1, inplace=True)
    
    # Remove columns using del
    del df['NewColumn']
    Python

    9.2 Adding/Removing Rows

    # Add row using loc
    df.loc[len(df)] = ['Tom', 45, 'Tokyo']
    
    # DataFrame.append() was deprecated and removed in pandas 2.0; use concat instead
    new_row = pd.DataFrame({'Name': ['Tom'], 'Age': [45], 'City': ['Tokyo']})
    df = pd.concat([df, new_row], ignore_index=True)
    
    # Remove rows by index
    df.drop(0, inplace=True)
    df.drop([0, 1, 2], inplace=True)
    
    # Remove rows by condition
    df = df[df['Age'] >= 25]
    Python

    9.3 Renaming

    # Rename columns
    df.rename(columns={'Name': 'FullName', 'Age': 'Years'}, inplace=True)
    
    # Rename all columns
    df.columns = ['col1', 'col2', 'col3']
    
    # Rename index
    df.rename(index={0: 'first', 1: 'second'}, inplace=True)
    
    # Reset index
    df.reset_index(drop=True, inplace=True)
    
    # Set column as index
    df.set_index('Name', inplace=True)
    Python

    9.4 Sorting

    # Sort by column
    df.sort_values('Age')
    df.sort_values('Age', ascending=False)
    
    # Sort by multiple columns
    df.sort_values(['City', 'Age'], ascending=[True, False])
    
    # Sort by index
    df.sort_index()
    df.sort_index(ascending=False)
    Python

    9.5 Apply Functions

    # Apply function to column
    df['Age_Squared'] = df['Age'].apply(lambda x: x**2)
    
    # Apply function to DataFrame
    df['Age_Category'] = df.apply(
        lambda row: 'Young' if row['Age'] < 30 else 'Old', 
        axis=1
    )
    
    # Map values
    mapping = {'New York': 'NY', 'Paris': 'PR', 'Berlin': 'BE'}
    df['City_Code'] = df['City'].map(mapping)
    
    # Replace values
    df['City'] = df['City'].replace('New York', 'NYC')  # Assign back; inplace on a selected column is unreliable
    df.replace({'New York': 'NYC', 'Paris': 'PRS'}, inplace=True)
    Python

    9.6 String Operations

    # Convert to lowercase/uppercase
    df['Name_Lower'] = df['Name'].str.lower()
    df['Name_Upper'] = df['Name'].str.upper()
    
    # Strip whitespace
    df['Name'] = df['Name'].str.strip()
    
    # Split strings
    df['First_Name'] = df['Name'].str.split(' ').str[0]
    
    # Contains pattern
    mask = df['Name'].str.contains('John', case=False)
    
    # Extract pattern
    df['Numbers'] = df['Text'].str.extract(r'(\d+)')
    
    # Replace in strings
    df['Name'] = df['Name'].str.replace('John', 'Jonathan')
    Python

    10. Handling Missing Data

    10.1 Detecting Missing Data

    # Check for missing values
    df.isnull()
    df.isna()
    df.isnull().sum()  # Count per column
    df.isnull().any()  # Any missing values per column
    
    # Check for non-missing values
    df.notna()
    df.notnull()
    Python

    10.2 Handling Missing Data

    # Drop rows with any missing values
    df.dropna()
    
    # Drop rows where all values are missing
    df.dropna(how='all')
    
    # Drop rows with missing values in specific columns
    df.dropna(subset=['Age', 'City'])
    
    # Drop columns with missing values
    df.dropna(axis=1)
    
    # Fill missing values with a constant
    df.fillna(0)
    df['Age'].fillna(df['Age'].mean())
    
    # Forward fill (propagate last valid observation)
    df.ffill()    # fillna(method='ffill'/'pad') is deprecated in pandas 2.x
    
    # Backward fill
    df.bfill()    # Replaces fillna(method='bfill'/'backfill')
    
    # Fill with different values per column
    df.fillna({'Age': 0, 'City': 'Unknown'})
    
    # Interpolate missing values
    df['Age'].interpolate()
    df.interpolate(method='linear')
    Python

    11. Data Aggregation and Grouping

    11.1 Basic Aggregations

    # Sum, mean, median, etc.
    df['Age'].sum()
    df['Age'].mean()
    df['Age'].median()
    df['Age'].std()
    df['Age'].min()
    df['Age'].max()
    df['Age'].count()
    
    # Multiple aggregations at once
    df['Age'].agg(['sum', 'mean', 'std'])
    
    # Aggregation on multiple columns
    df[['Age', 'Salary']].agg(['mean', 'min', 'max'])
    
    # Different aggregations for different columns
    df.agg({
        'Age': ['mean', 'std'],
        'Salary': ['sum', 'mean']
    })
    Python

    11.2 GroupBy Operations

    # Group by single column
    grouped = df.groupby('City')
    grouped['Age'].mean()
    
    # Group by multiple columns
    df.groupby(['City', 'Gender'])['Age'].mean()
    
    # Multiple aggregations on groups
    df.groupby('City').agg({
        'Age': ['mean', 'min', 'max'],
        'Salary': 'sum'
    })
    
    # Custom aggregation functions
    df.groupby('City')['Age'].agg(
        Average='mean',
        Total='sum',
        Range=lambda x: x.max() - x.min()
    )
    
    # Iterate through groups
    for name, group in df.groupby('City'):
        print(f"City: {name}")
        print(group)
    
    # Transform (return same size as input)
    df['Age_Normalized'] = df.groupby('City')['Age'].transform(
        lambda x: (x - x.mean()) / x.std()
    )
    
    # Filter groups
    df.groupby('City').filter(lambda x: len(x) > 2)
    
    # Get specific group
    df.groupby('City').get_group('New York')
    Python

    11.3 Pivot Tables

    # Create pivot table
    pivot = df.pivot_table(
        values='Salary',
        index='City',
        columns='Gender',
        aggfunc='mean'
    )
    
    # Multiple aggregation functions
    pivot = df.pivot_table(
        values='Salary',
        index='City',
        columns='Gender',
        aggfunc=['mean', 'sum', 'count']
    )
    
    # Multiple values
    pivot = df.pivot_table(
        values=['Salary', 'Age'],
        index='City',
        columns='Gender',
        aggfunc='mean'
    )
    
    # With margins (totals)
    pivot = df.pivot_table(
        values='Salary',
        index='City',
        columns='Gender',
        aggfunc='mean',
        margins=True
    )
    Python

    11.4 Cross-tabulation

    # Create cross-tabulation
    crosstab = pd.crosstab(df['City'], df['Gender'])
    
    # With values
    crosstab = pd.crosstab(
        df['City'], 
        df['Gender'],
        values=df['Salary'],
        aggfunc='mean'
    )
    
    # With margins
    crosstab = pd.crosstab(df['City'], df['Gender'], margins=True)
    
    # Normalize
    crosstab = pd.crosstab(df['City'], df['Gender'], normalize='columns')
    Python

    12. Merging and Joining

    12.1 Concatenation

    # Vertical concatenation (stack rows)
    df_combined = pd.concat([df1, df2])
    df_combined = pd.concat([df1, df2], ignore_index=True)
    
    # Horizontal concatenation (stack columns)
    df_combined = pd.concat([df1, df2], axis=1)
    
    # With keys for multi-index
    df_combined = pd.concat([df1, df2], keys=['first', 'second'])
    Python

    12.2 Merging/Joining

    # Inner join (default)
    merged = pd.merge(df1, df2, on='key')
    
    # Left join
    merged = pd.merge(df1, df2, on='key', how='left')
    
    # Right join
    merged = pd.merge(df1, df2, on='key', how='right')
    
    # Outer join
    merged = pd.merge(df1, df2, on='key', how='outer')
    
    # Merge on multiple columns
    merged = pd.merge(df1, df2, on=['key1', 'key2'])
    
    # Merge with different column names
    merged = pd.merge(df1, df2, left_on='key1', right_on='key2')
    
    # Merge on index
    merged = pd.merge(df1, df2, left_index=True, right_index=True)
    
    # Using join method
    merged = df1.join(df2, how='left')
    merged = df1.join(df2, lsuffix='_left', rsuffix='_right')
    Python

    12.3 Append (Deprecated)

    # Use concat instead of append
    df_combined = pd.concat([df1, df2], ignore_index=True)
    Python
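
    A related performance note: even with concat, growing a DataFrame one row at a time copies data on every call. When rows arrive incrementally, collect them in a plain list and concatenate once — a sketch:

    pieces = []
    for i in range(3):
        pieces.append(pd.DataFrame({'Name': [f'User{i}'], 'Age': [20 + i]}))

    # One concat at the end instead of one per iteration
    df_combined = pd.concat(pieces, ignore_index=True)
    Python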

    13. Time Series Analysis

    13.1 Creating DateTime Objects

    # Convert string to datetime
    df['date'] = pd.to_datetime(df['date_string'])
    df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')
    
    # Create date range
    dates = pd.date_range('2023-01-01', periods=100, freq='D')
    dates = pd.date_range('2023-01-01', '2023-12-31', freq='W')
    
    # Common frequencies:
    # 'D'  - calendar day
    # 'B'  - business day
    # 'W'  - weekly
    # 'M'  - month end ('ME' in pandas 2.2+)
    # 'MS' - month start
    # 'Q'  - quarter end ('QE' in pandas 2.2+)
    # 'Y'  - year end ('YE' in pandas 2.2+)
    # 'H'  - hourly ('h' in pandas 2.2+)
    # 'T' or 'min' - minutely
    # 'S'  - secondly ('s' in pandas 2.2+)
    Python

    13.2 DateTime Components

    # Extract components
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['dayofweek'] = df['date'].dt.dayofweek
    df['dayofyear'] = df['date'].dt.dayofyear
    df['week'] = df['date'].dt.isocalendar().week
    df['quarter'] = df['date'].dt.quarter
    df['hour'] = df['date'].dt.hour
    df['minute'] = df['date'].dt.minute
    
    # Day name
    df['day_name'] = df['date'].dt.day_name()
    df['month_name'] = df['date'].dt.month_name()
    
    # Boolean checks
    df['is_month_start'] = df['date'].dt.is_month_start
    df['is_month_end'] = df['date'].dt.is_month_end
    df['is_quarter_start'] = df['date'].dt.is_quarter_start
    Python

    13.3 Time Series Operations

    # Set datetime as index
    df.set_index('date', inplace=True)
    
    # Select by date with partial strings (use .loc; plain [] lookup was removed in pandas 2.0)
    df.loc['2023']                      # All data from 2023
    df.loc['2023-01']                   # All data from January 2023
    df.loc['2023-01-01':'2023-01-31']   # Date range
    
    # Resampling (changing frequency; pandas 2.2+ renames 'M'/'Q' to 'ME'/'QE')
    df.resample('M').mean()             # Monthly mean
    df.resample('W').sum()              # Weekly sum
    df.resample('Q').agg(['mean', 'sum'])  # Quarterly aggregation
    
    # Rolling window
    df['rolling_mean'] = df['value'].rolling(window=7).mean()
    df['rolling_std'] = df['value'].rolling(window=7).std()
    df['rolling_sum'] = df['value'].rolling(window=30).sum()
    
    # Expanding window
    df['expanding_mean'] = df['value'].expanding().mean()
    
    # Shifting
    df['previous_day'] = df['value'].shift(1)   # Value from the previous row
    df['next_day'] = df['value'].shift(-1)      # Value from the next row
    df['pct_change'] = df['value'].pct_change()  # Percentage change
    
    # Time differences ('date' is now the index, so pull it back out)
    df['time_diff'] = df.index.to_series().diff()
    Python

    13.4 Time Zones

    # Localize timezone-naive to timezone-aware
    df['date'] = df['date'].dt.tz_localize('UTC')
    
    # Convert to different timezone
    df['date'] = df['date'].dt.tz_convert('US/Eastern')
    
    # Remove timezone
    df['date'] = df['date'].dt.tz_localize(None)
    Python

    14. Data Visualization

    Pandas integrates with matplotlib for quick visualizations.

    import matplotlib.pyplot as plt
    
    # Line plot
    df['Age'].plot()
    df.plot(x='Name', y='Age')
    plt.show()
    
    # Multiple lines
    df[['Age', 'Salary']].plot()
    
    # Bar plot
    df['City'].value_counts().plot(kind='bar')
    df['City'].value_counts().plot.bar()
    
    # Horizontal bar plot
    df['City'].value_counts().plot(kind='barh')
    
    # Histogram
    df['Age'].plot(kind='hist', bins=20)
    df['Age'].hist(bins=20)
    
    # Box plot
    df['Age'].plot(kind='box')
    df.boxplot(column='Age', by='City')
    
    # Scatter plot
    df.plot(kind='scatter', x='Age', y='Salary')
    df.plot.scatter(x='Age', y='Salary', c='Age', colormap='viridis')  # c needs numeric values or colors
    
    # Pie chart
    df['City'].value_counts().plot(kind='pie', autopct='%1.1f%%')
    
    # Area plot
    df.plot(kind='area')
    
    # Hexbin plot
    df.plot(kind='hexbin', x='Age', y='Salary', gridsize=20)
    
    # Density plot
    df['Age'].plot(kind='density')
    
    # Subplots
    df.plot(subplots=True, layout=(2, 2), figsize=(10, 8))
    
    # Customization
    df['Age'].plot(
        kind='hist',
        title='Age Distribution',
        xlabel='Age',
        ylabel='Frequency',
        color='blue',
        alpha=0.7,
        figsize=(10, 6)
    )
    plt.show()
    Python

    15. Performance Tips

    15.1 Memory Optimization

    # Check memory usage
    df.memory_usage(deep=True)
    
    # Optimize data types
    df['int_column'] = df['int_column'].astype('int32')  # Instead of int64
    df['cat_column'] = df['cat_column'].astype('category')  # For categorical data
    
    # Convert object to category for columns with limited unique values
    for col in df.select_dtypes(include=['object']):
        if df[col].nunique() / len(df) < 0.5:  # Less than 50% unique
            df[col] = df[col].astype('category')
    Python
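
    Rather than hard-coding 'int32', pd.to_numeric with downcast picks the smallest numeric type that safely holds the data — a short sketch:

    # Downcast each numeric column to the smallest type that fits its values
    for col in df.select_dtypes(include=['int']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    Python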

    15.2 Vectorization

    # SLOW: Using iterrows
    for index, row in df.iterrows():
        df.at[index, 'result'] = row['A'] + row['B']
    
    # FAST: Vectorized operation
    df['result'] = df['A'] + df['B']
    
    # SLOW: Using apply with lambda
    df['result'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
    
    # FAST: Direct column operations
    df['result'] = df['A'] + df['B']
    Python

    15.3 Efficient Reading

    # Read only necessary columns
    df = pd.read_csv('large_file.csv', usecols=['col1', 'col2', 'col3'])
    
    # Specify data types
    dtypes = {'col1': 'int32', 'col2': 'float32', 'col3': 'category'}
    df = pd.read_csv('large_file.csv', dtype=dtypes)
    
    # Read in chunks
    chunk_size = 10000
    for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
        process(chunk)
    
    # Use nrows for testing
    df = pd.read_csv('large_file.csv', nrows=1000)
    Python
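
    process() above is a placeholder. A concrete version of the chunked pattern — aggregating per chunk and combining the partial results, with a hypothetical numeric column 'value':

    total = 0
    for chunk in pd.read_csv('large_file.csv', chunksize=10000):
        total += chunk['value'].sum()   # 'value' is a hypothetical column name
    print(total)
    Python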

    15.4 Query vs Boolean Indexing

    # Boolean indexing
    result = df[(df['A'] > 5) & (df['B'] < 10)]
    
    # Query (often faster for complex conditions)
    result = df.query('A > 5 and B < 10')
    Python

    15.5 Use Built-in Functions

    # SLOW: Custom function
    df['is_adult'] = df['Age'].apply(lambda x: 1 if x >= 18 else 0)
    
    # FAST: NumPy where
    df['is_adult'] = np.where(df['Age'] >= 18, 1, 0)
    
    # FAST: Built-in methods
    df['sum'] = df[['A', 'B', 'C']].sum(axis=1)  # Instead of apply
    Python

    15.6 Avoid Chained Assignment

    # BAD: Chained assignment (can cause SettingWithCopyWarning)
    df[df['Age'] > 30]['City'] = 'Updated'
    
    # GOOD: Use loc
    df.loc[df['Age'] > 30, 'City'] = 'Updated'
    Python

    Additional Resources

    Official documentation: https://pandas.pydata.org/docs/

    Useful Libraries to Use with Pandas

    • NumPy: Numerical computing
    • Matplotlib/Seaborn: Visualization
    • Scikit-learn: Machine learning
    • SQLAlchemy: Database connections (see the sketch after this list)
    • Openpyxl/XlsxWriter: Excel file handling
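
    For instance, pandas' SQL readers accept a SQLAlchemy engine, which works the same way across database backends — a minimal sketch, assuming a local SQLite file named example.db:

    from sqlalchemy import create_engine

    engine = create_engine('sqlite:///example.db')   # Hypothetical database file
    df = pd.read_sql('SELECT * FROM table_name', engine)
    df.to_sql('table_name', engine, if_exists='append', index=False)
    Python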

    Best Practices

    1. Always make copies when modifying DataFrames: df_copy = df.copy()
    2. Use inplace=True cautiously; it modifies the original DataFrame
    3. Chain operations for readability: df.groupby('A').filter(...).sort_values(...)
    4. Wrap multi-line method chains in parentheses rather than trailing backslashes (see the sketch after this list)
    5. Set pd.set_option('display.max_columns', None) to see all columns
    6. Use df.pipe() for custom function chains
    7. Profile your code to identify bottlenecks
    8. Consider using Dask or Vaex for very large datasets
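
    A small sketch of points 3, 4, and 6 together — wrapping a chain in parentheses and routing a custom step through pipe(), assuming the tutorial's df with an 'Age' column (add_age_group is a hypothetical helper):

    def add_age_group(frame, cutoff=30):
        # Label rows by an age cutoff; returns a new frame
        return frame.assign(Age_Group=np.where(frame['Age'] < cutoff, 'Young', 'Old'))

    result = (
        df[df['Age'].notna()]
        .pipe(add_age_group, cutoff=30)
        .sort_values('Age')
        .reset_index(drop=True)
    )
    Python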

    Common Pitfalls

    • SettingWithCopyWarning: Use .loc[] or .copy() explicitly
    • Mixed data types in columns can slow operations
    • Not resetting index after filtering can cause issues
    • Forgetting inplace=True or not assigning the result back (illustrated below)
    • Not handling missing data appropriately
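
    The assign-back pitfall in one tiny example — most pandas methods return a new object and leave the original untouched:

    df.drop_duplicates()        # Result is discarded; df is unchanged
    df = df.drop_duplicates()   # Assign the result back
    # ...or pass inplace=True where the method supports it
    Python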

    Comprehensive Pandas Cheatsheet

    1. Importing and Setup

    import pandas as pd
    import numpy as np
    
    # Display options
    pd.set_option('display.max_columns', None)      # Show all columns
    pd.set_option('display.max_rows', 100)          # Show max 100 rows
    pd.set_option('display.max_colwidth', None)     # Show full column width
    pd.set_option('display.precision', 2)           # Decimal precision
    pd.set_option('display.float_format', '{:.2f}'.format)  # Float format
    pd.reset_option('all')                          # Reset all options
    Python

    2. Creating Data Structures

    # Series
    s = pd.Series([1, 2, 3, 4, 5])
    s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
    s = pd.Series({'a': 1, 'b': 2, 'c': 3})
    
    # DataFrame from dict
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    
    # DataFrame from list of dicts
    df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
    
    # DataFrame from numpy array
    df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
    
    # Empty DataFrame
    df = pd.DataFrame()
    
    # DataFrame with index and columns
    df = pd.DataFrame(data, index=idx, columns=cols)
    Python

    3. Reading and Writing Files

    # CSV
    df = pd.read_csv('file.csv')
    df = pd.read_csv('file.csv', sep=';', encoding='utf-8', header=0, 
                     index_col=0, usecols=['col1', 'col2'], nrows=1000,
                     na_values=['NA', '?'], parse_dates=['date_col'])
    df.to_csv('output.csv', index=False, encoding='utf-8', sep=',')
    
    # Excel
    df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
    df = pd.read_excel('file.xlsx', sheet_name=0)  # First sheet
    df = pd.read_excel('file.xlsx', sheet_name=None)  # All sheets
    df.to_excel('output.xlsx', sheet_name='Data', index=False)
    
    # JSON
    df = pd.read_json('file.json')
    df = pd.read_json('file.json', orient='records')
    df.to_json('output.json', orient='records', indent=2)
    
    # SQL
    import sqlite3
    conn = sqlite3.connect('database.db')
    df = pd.read_sql('SELECT * FROM table', conn)
    df = pd.read_sql_query('SELECT * FROM table WHERE col > 5', conn)
    df = pd.read_sql_table('table_name', conn)
    df.to_sql('table_name', conn, if_exists='replace', index=False)
    
    # HTML
    df = pd.read_html('url_or_file.html')[0]
    df.to_html('output.html', index=False)
    
    # Pickle (fast, pandas-specific format)
    df.to_pickle('file.pkl')
    df = pd.read_pickle('file.pkl')
    
    # Parquet (efficient columnar format)
    df.to_parquet('file.parquet')
    df = pd.read_parquet('file.parquet')
    
    # Clipboard
    df = pd.read_clipboard()
    df.to_clipboard(index=False)
    Python

    4. Viewing and Inspecting Data

    # View data
    df.head(n)              # First n rows (default 5)
    df.tail(n)              # Last n rows
    df.sample(n)            # Random n rows
    df.sample(frac=0.1)     # Random 10% of rows
    
    # Shape and size
    df.shape                # (rows, columns)
    df.size                 # Total elements
    len(df)                 # Number of rows
    df.ndim                 # Number of dimensions
    
    # Index and columns
    df.index                # Row index
    df.columns              # Column names
    df.axes                 # List of [index, columns]
    
    # Data types
    df.dtypes               # Data type of each column
    df.dtypes.value_counts()  # Count of each data type
    df.info()               # Detailed info
    df.info(memory_usage='deep')  # Deep memory usage
    
    # Statistics
    df.describe()           # Summary statistics for numeric columns
    df.describe(include='all')  # All columns
    df.describe(include=['object'])  # Only object columns
    df.describe(exclude=['float64'])  # Exclude certain types
    
    # Counts
    df.count()              # Non-null values per column
    df['col'].value_counts()  # Frequency counts
    df['col'].value_counts(normalize=True)  # Proportions
    df['col'].value_counts(dropna=False)  # Include NaN
    df['col'].nunique()     # Number of unique values
    df['col'].unique()      # Array of unique values
    
    # Min/Max
    df.min()                # Min of each column
    df.max()                # Max of each column
    df.idxmin()             # Index of min value
    df.idxmax()             # Index of max value
    
    # Sum and Mean
    df.sum()                # Sum of each column
    df.sum(axis=1)          # Sum of each row
    df.mean()               # Mean of each column
    df.median()             # Median
    df.mode()               # Mode
    df.std()                # Standard deviation
    df.var()                # Variance
    
    # Correlations
    df.corr()               # Correlation matrix
    df.corr(method='pearson')  # Pearson correlation
    df.corr(method='spearman')  # Spearman correlation
    df['A'].corr(df['B'])   # Correlation between two columns
    df.corrwith(df['A'])    # Correlation with one column
    
    # Memory
    df.memory_usage()       # Memory usage per column
    df.memory_usage(deep=True)  # Deep memory usage
    Python

    5. Selection and Indexing

    # Column selection
    df['col']               # Single column (Series)
    df[['col']]             # Single column (DataFrame)
    df[['col1', 'col2']]    # Multiple columns
    df.col                  # Column by attribute (if valid name)
    
    # Row selection by position (iloc)
    df.iloc[0]              # First row
    df.iloc[-1]             # Last row
    df.iloc[0:5]            # First 5 rows
    df.iloc[[0, 2, 4]]      # Specific rows
    df.iloc[:, 0]           # First column
    df.iloc[:, 0:3]         # First 3 columns
    df.iloc[0:5, 0:3]       # Rows 0-4, columns 0-2
    
    # Row selection by label (loc)
    df.loc[0]               # Row with index label 0
    df.loc[0:5]             # Rows 0 to 5 (inclusive!)
    df.loc[:, 'A']          # Column 'A'
    df.loc[:, 'A':'C']      # Columns A to C (inclusive)
    df.loc[0:5, ['A', 'C']]  # Specific rows and columns
    df.loc[df['A'] > 5, 'B']  # Conditional selection
    
    # Scalar access
    df.at[0, 'A']           # Fast scalar access by label
    df.iat[0, 1]            # Fast scalar access by position
    
    # Boolean indexing
    df[df['A'] > 5]         # Rows where A > 5
    df[df['A'].isin([1, 2, 3])]  # Rows where A is in list
    df[df['A'].between(1, 5)]    # Rows where A is between 1 and 5
    df[~df['A'].isin([1, 2])]    # NOT in list
    df[(df['A'] > 5) & (df['B'] < 10)]  # Multiple conditions (AND)
    df[(df['A'] > 5) | (df['B'] < 10)]  # Multiple conditions (OR)
    
    # Query method
    df.query('A > 5')
    df.query('A > 5 and B < 10')
    df.query('A > @threshold')  # Use variables with @
    
    # Select by data type
    df.select_dtypes(include=['float64', 'int64'])
    df.select_dtypes(include='number')
    df.select_dtypes(exclude=['object'])
    
    # Filter columns by name
    df.filter(items=['A', 'B'])         # Exact names
    df.filter(like='_total')            # Contains substring
    df.filter(regex='^col')             # Regex pattern
    
    # Where
    df.where(df > 0)                    # Keep values where condition is True
    df.where(df > 0, -df)               # Replace False values
    df.mask(df > 0)                     # Opposite of where
    Python

    6. Filtering and Querying

    # Basic filtering
    df[df['A'] > 5]
    df[df['A'] == 'value']
    df[df['A'] != 'value']
    df[df['A'].isna()]
    df[df['A'].notna()]
    
    # Multiple conditions
    df[(df['A'] > 5) & (df['B'] < 10)]
    df[(df['A'] > 5) | (df['B'] < 10)]
    df[~(df['A'] > 5)]  # NOT
    
    # isin() method
    df[df['A'].isin([1, 2, 3])]
    df[df['A'].isin(other_df['B'])]
    
    # String methods
    df[df['col'].str.contains('pattern')]
    df[df['col'].str.startswith('A')]
    df[df['col'].str.endswith('z')]
    df[df['col'].str.match(r'^[A-Z]')]
    df[df['col'].str.len() > 5]
    df[df['col'].str.isdigit()]
    df[df['col'].str.isalpha()]
    
    # between() method
    df[df['A'].between(1, 10)]
    df[df['A'].between(1, 10, inclusive='neither')]
    
    # query() method
    df.query('A > 5')
    df.query('A > 5 and B < 10')
    df.query('A in [1, 2, 3]')
    df.query('col.str.contains("pattern")', engine='python')
    
    # nlargest/nsmallest
    df.nlargest(10, 'A')    # 10 largest values in column A
    df.nsmallest(10, 'A')   # 10 smallest values
    df.nlargest(5, ['A', 'B'])  # By multiple columns
    Python

    7. Adding, Modifying, and Removing Data

    # Add columns
    df['new_col'] = value           # Constant value
    df['new_col'] = df['A'] + df['B']  # Computed column
    df['new_col'] = df['A'].apply(lambda x: x*2)
    df.insert(1, 'new_col', values)  # Insert at position
    
    # Assign (method chaining friendly)
    df = df.assign(new_col=df['A'] * 2)
    df = df.assign(col1=lambda x: x['A'] * 2,
                   col2=lambda x: x['col1'] + 1)
    
    # Modify columns
    df['A'] = df['A'] * 2
    df.loc[:, 'A'] = df['A'] * 2
    df.loc[df['A'] > 5, 'B'] = 0
    
    # Remove columns
    df.drop('col', axis=1)
    df.drop(['col1', 'col2'], axis=1)
    df.drop(columns=['col1', 'col2'])
    del df['col']
    df.pop('col')  # Returns the column
    
    # Add rows
    df.loc[len(df)] = [value1, value2, value3]
    df = pd.concat([df, new_row_df], ignore_index=True)
    
    # Remove rows
    df.drop(0)              # Drop row with index 0
    df.drop([0, 1, 2])      # Drop multiple rows
    df.drop(index=[0, 1])   # Explicit index parameter
    
    # Remove duplicates
    df.drop_duplicates()
    df.drop_duplicates(subset=['A', 'B'])
    df.drop_duplicates(keep='first')   # Keep first occurrence
    df.drop_duplicates(keep='last')    # Keep last occurrence
    df.drop_duplicates(keep=False)     # Remove all duplicates
    Python

    8. Renaming and Reindexing

    # Rename columns
    df.rename(columns={'old': 'new'})
    df.rename(columns={'A': 'col_A', 'B': 'col_B'})
    df.rename(columns=str.upper)  # Apply function to all
    df.columns = ['A', 'B', 'C']  # Replace all column names
    
    # Rename index
    df.rename(index={0: 'first', 1: 'second'})
    df.rename(index=str)
    
    # Add prefix/suffix
    df.add_prefix('col_')
    df.add_suffix('_total')
    
    # Reset index
    df.reset_index()            # Move index to column
    df.reset_index(drop=True)   # Drop the index
    
    # Set index
    df.set_index('col')
    df.set_index(['col1', 'col2'])  # MultiIndex
    
    # Reindex
    df.reindex([0, 2, 4])       # New index
    df.reindex(columns=['A', 'C'])  # Reorder columns
    df.reindex_like(other_df)   # Match another DataFrame
    Python

    9. Sorting

    # Sort by values
    df.sort_values('A')
    df.sort_values('A', ascending=False)
    df.sort_values(['A', 'B'])
    df.sort_values(['A', 'B'], ascending=[True, False])
    df.sort_values('A', na_position='first')  # NaN first
    
    # Sort by index
    df.sort_index()
    df.sort_index(ascending=False)
    df.sort_index(axis=1)  # Sort columns
    
    # Sort by multiple columns with different orders
    df.sort_values(by=['col1', 'col2'], 
                   ascending=[True, False],
                   na_position='last')
    Python

    10. Apply, Map, and Transform

    # apply() - Apply function to each column or row
    df.apply(np.sum)            # Sum of each column
    df.apply(np.sum, axis=1)    # Sum of each row
    df['A'].apply(lambda x: x * 2)
    df.apply(lambda x: x.max() - x.min())
    
    # Element-wise apply: DataFrame.map() in pandas 2.1+ (applymap() is deprecated)
    df.map(lambda x: x * 2)
    df.map(str)
    
    # map() - Map values (Series only)
    df['A'].map({1: 'one', 2: 'two', 3: 'three'})
    df['A'].map(lambda x: x * 2)
    
    # replace() - Replace values
    df.replace(0, np.nan)
    df.replace([0, 1], [100, 200])
    df.replace({0: 100, 1: 200})
    df.replace({'A': {0: 100}, 'B': {1: 200}})
    df['A'].replace(0, 100)
    
    # transform() - Transform values
    df.transform(lambda x: x * 2)
    df.transform([np.sqrt, np.square])
    df.groupby('A').transform('mean')
    
    # pipe() - Chain custom functions
    df.pipe(func1).pipe(func2).pipe(func3)
    Python

    11. String Operations

    # Case conversion
    df['col'].str.lower()
    df['col'].str.upper()
    df['col'].str.title()
    df['col'].str.capitalize()
    df['col'].str.swapcase()
    
    # Strip whitespace
    df['col'].str.strip()
    df['col'].str.lstrip()
    df['col'].str.rstrip()
    df['col'].str.strip('chars')
    
    # Split and join
    df['col'].str.split()
    df['col'].str.split(',')
    df['col'].str.split(',', expand=True)  # Create new columns
    df['col'].str.rsplit(',', n=1)  # Split from right
    df['col'].str.join('-')
    
    # Substring operations
    df['col'].str[:5]  # First 5 characters
    df['col'].str[5:]   # From index 5 onward
    df['col'].str[2:5]  # Characters at indices 2-4
    df['col'].str.slice(0, 5)
    df['col'].str.slice(start=2, stop=5, step=1)
    
    # Replace and remove
    df['col'].str.replace('old', 'new')
    df['col'].str.replace(r'\d+', 'NUM', regex=True)
    df['col'].str.removeprefix('prefix')
    df['col'].str.removesuffix('suffix')
    
    # Find and count
    df['col'].str.contains('pattern')
    df['col'].str.contains('pattern', case=False)
    df['col'].str.contains(r'\d+', regex=True)
    df['col'].str.startswith('A')
    df['col'].str.endswith('z')
    df['col'].str.find('pattern')
    df['col'].str.count('pattern')
    
    # Extract
    df['col'].str.extract(r'(\d+)')  # First group
    df['col'].str.extract(r'(\d+)-(\w+)')  # Multiple groups
    df['col'].str.extractall(r'(\d+)')  # All matches
    df['col'].str.findall(r'\d+')
    
    # Length and checking
    df['col'].str.len()
    df['col'].str.isdigit()
    df['col'].str.isalpha()
    df['col'].str.isalnum()
    df['col'].str.isspace()
    df['col'].str.islower()
    df['col'].str.isupper()
    df['col'].str.istitle()
    
    # Padding
    df['col'].str.pad(10, side='left', fillchar='0')
    df['col'].str.zfill(10)  # Zero-pad
    df['col'].str.center(10, fillchar='*')
    df['col'].str.ljust(10)
    df['col'].str.rjust(10)
    
    # Get specific element from split
    df['col'].str.split('-').str[0]
    df['col'].str.split('-').str.get(0)
    Python

    12. Missing Data

    # Detect missing data
    df.isnull()             # Boolean DataFrame
    df.isna()               # Same as isnull()
    df.notna()              # Opposite
    df.isnull().sum()       # Count per column
    df.isnull().sum().sum()  # Total count
    df.isnull().any()       # Any missing per column
    df.isnull().all()       # All missing per column
    
    # Drop missing data
    df.dropna()             # Drop rows with any NaN
    df.dropna(axis=1)       # Drop columns with any NaN
    df.dropna(how='all')    # Drop if all values are NaN
    df.dropna(thresh=2)     # Drop if less than 2 non-NaN
    df.dropna(subset=['A', 'B'])  # Drop based on specific columns
    
    # Fill missing data
    df.fillna(0)            # Fill with constant
    df.fillna({'A': 0, 'B': 1})  # Different values per column
    df.ffill()              # Forward fill (fillna(method=...) is deprecated)
    df.bfill()              # Backward fill
    df.fillna(df.mean())    # Fill with mean
    df['A'].fillna(df['A'].mean())  # Fill column with its mean
    df.ffill(limit=2)       # Limit number of consecutive fills
    
    # Interpolate
    df.interpolate()        # Linear interpolation
    df.interpolate(method='polynomial', order=2)
    df.interpolate(method='time')  # For time series
    df.interpolate(limit=2)  # Limit consecutive fills
    df.interpolate(limit_direction='forward')
    
    # Replace NaN
    df.replace(np.nan, 0)
    df.where(pd.notna(df), 0)
    Python

    13. Aggregation and Statistics

    # Basic statistics
    df.sum()                # Sum
    df.sum(axis=1)          # Row-wise sum
    df.mean()               # Mean
    df.median()             # Median
    df.mode()               # Mode
    df.std()                # Standard deviation
    df.var()                # Variance
    df.min()                # Minimum
    df.max()                # Maximum
    df.abs()                # Absolute value
    df.prod()               # Product
    df.cumsum()             # Cumulative sum
    df.cumprod()            # Cumulative product
    df.cummin()             # Cumulative minimum
    df.cummax()             # Cumulative maximum
    
    # Quantiles
    df.quantile(0.25)       # 25th percentile
    df.quantile([0.25, 0.5, 0.75])  # Multiple quantiles
    
    # Ranking
    df.rank()               # Rank values
    df.rank(ascending=False)
    df.rank(method='first')  # first, average, min, max, dense
    
    # Aggregation with agg()
    df.agg(['sum', 'mean', 'std'])
    df.agg({'A': ['sum', 'mean'], 'B': ['min', 'max']})
    df.agg({'A': 'sum', 'B': 'mean'})
    df['A'].agg(['sum', 'mean', lambda x: x.max() - x.min()])
    
    # Custom aggregation
    df.agg(custom_func)
    df.agg({'A': custom_func, 'B': 'mean'})
    Python

    14. GroupBy Operations

    # Basic groupby
    grouped = df.groupby('A')
    grouped = df.groupby(['A', 'B'])  # Multiple columns
    grouped = df.groupby(df['A'] > 5)  # By condition
    
    # Aggregation after groupby
    df.groupby('A').sum()
    df.groupby('A').mean()
    df.groupby('A')['B'].sum()  # Specific column
    df.groupby('A')[['B', 'C']].sum()  # Multiple columns
    
    # Multiple aggregations
    df.groupby('A').agg(['sum', 'mean', 'std'])
    df.groupby('A').agg({'B': 'sum', 'C': 'mean'})
    df.groupby('A').agg({'B': ['sum', 'mean'], 'C': 'max'})
    
    # Named aggregations
    df.groupby('A').agg(
        total=('B', 'sum'),
        average=('B', 'mean'),
        max_value=('C', 'max')
    )
    
    # Transform
    df.groupby('A').transform('mean')  # Broadcast to original shape
    df['B_normalized'] = df.groupby('A')['B'].transform(
        lambda x: (x - x.mean()) / x.std()
    )
    
    # Filter groups
    df.groupby('A').filter(lambda x: len(x) > 2)
    df.groupby('A').filter(lambda x: x['B'].sum() > 100)
    
    # Apply custom function
    df.groupby('A').apply(lambda x: x.max() - x.min())
    
    # Iterate through groups
    for name, group in df.groupby('A'):
        print(f"Group: {name}")
        print(group)
    
    # Get specific group
    df.groupby('A').get_group('value')
    
    # Group by multiple with different operations
    df.groupby(['A', 'B']).agg({
        'C': ['sum', 'mean'],
        'D': 'max'
    }).reset_index()
    
    # Size and count
    df.groupby('A').size()   # Size of each group
    df.groupby('A').count()  # Count non-null values
    df.groupby('A')['B'].count()
    
    # First and last
    df.groupby('A').first()
    df.groupby('A').last()
    df.groupby('A').head(2)  # First 2 of each group
    df.groupby('A').tail(2)  # Last 2 of each group
    Python

    15. Pivot Tables and Reshaping

    # Pivot table
    pd.pivot_table(df, values='D', index='A', columns='B')
    pd.pivot_table(df, values='D', index='A', columns='B', aggfunc='sum')
    pd.pivot_table(df, values='D', index=['A', 'B'], columns='C')
    pd.pivot_table(df, values=['D', 'E'], index='A', columns='B')
    pd.pivot_table(df, index='A', columns='B', aggfunc={'D': 'sum', 'E': 'mean'})
    pd.pivot_table(df, index='A', columns='B', margins=True)  # Add totals
    
    # Pivot (no aggregation)
    df.pivot(index='A', columns='B', values='C')
    
    # Melt (unpivot)
    pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
    pd.melt(df, id_vars=['A'], var_name='variable', value_name='value')
    
    # Stack and unstack
    df.stack()              # Pivot columns to rows
    df.unstack()            # Pivot rows to columns
    df.unstack(level=0)     # Unstack specific level
    df.stack().unstack()    # Round trip
    
    # Wide to long
    pd.wide_to_long(df, stubnames='col', i='id', j='time')
    
    # Transpose
    df.T                    # Transpose (swap rows and columns)
    df.transpose()
    
    # Crosstab
    pd.crosstab(df['A'], df['B'])
    pd.crosstab(df['A'], df['B'], normalize='columns')
    pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='mean')
    pd.crosstab(df['A'], df['B'], margins=True)
    Python
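
    stack() and unstack() are easiest to see on a tiny frame: stacking moves the column labels into an inner index level (yielding a Series), and unstacking reverses it — a sketch:

    df_small = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])

    stacked = df_small.stack()   # MultiIndex Series: (r1, x), (r1, y), (r2, x), (r2, y)
    stacked.unstack()            # Back to the original 2x2 frame
    Python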

    16. Merging, Joining, and Concatenating

    # Concatenate
    pd.concat([df1, df2])   # Vertical (rows)
    pd.concat([df1, df2], axis=1)  # Horizontal (columns)
    pd.concat([df1, df2], ignore_index=True)  # Reset index
    pd.concat([df1, df2], keys=['first', 'second'])  # Add keys
    
    # Merge (like SQL join)
    pd.merge(df1, df2, on='key')  # Inner join
    pd.merge(df1, df2, on='key', how='left')  # Left join
    pd.merge(df1, df2, on='key', how='right')  # Right join
    pd.merge(df1, df2, on='key', how='outer')  # Outer join
    pd.merge(df1, df2, left_on='key1', right_on='key2')  # Different names
    pd.merge(df1, df2, on=['key1', 'key2'])  # Multiple keys
    pd.merge(df1, df2, left_index=True, right_index=True)  # On index
    pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))  # Suffix
    pd.merge(df1, df2, on='key', indicator=True)  # Add merge indicator
    
    # Join (merge on index)
    df1.join(df2)           # Left join on index
    df1.join(df2, how='inner')
    df1.join(df2, on='key')
    df1.join(df2, lsuffix='_l', rsuffix='_r')
    
    # Combine first (update missing values)
    df1.combine_first(df2)
    
    # Update
    df1.update(df2)
    Python

    17. Time Series

    # Create datetime
    pd.to_datetime('2023-01-01')
    pd.to_datetime(['2023-01-01', '2023-01-02'])
    pd.to_datetime(df['date_col'], format='%Y-%m-%d')
    pd.to_datetime(df['date_col'], errors='coerce')  # Invalid to NaT
    
    # Date range
    pd.date_range('2023-01-01', periods=10)
    pd.date_range('2023-01-01', '2023-12-31')
    pd.date_range('2023-01-01', periods=10, freq='D')  # Daily
    pd.date_range('2023-01-01', periods=10, freq='B')  # Business days
    pd.date_range('2023-01-01', periods=10, freq='W')  # Weekly
    pd.date_range('2023-01-01', periods=10, freq='M')   # Month end ('ME' in pandas 2.2+)
    pd.date_range('2023-01-01', periods=10, freq='MS')  # Month start
    pd.date_range('2023-01-01', periods=10, freq='H')   # Hourly ('h' in pandas 2.2+)
    
    # Time delta
    pd.Timedelta('1 day')
    pd.Timedelta(days=1, hours=2, minutes=30)
    pd.timedelta_range('1 day', periods=10)
    
    # Extract datetime components
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['hour'] = df['date'].dt.hour
    df['minute'] = df['date'].dt.minute
    df['second'] = df['date'].dt.second
    df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0
    df['dayofyear'] = df['date'].dt.dayofyear
    df['quarter'] = df['date'].dt.quarter
    df['week'] = df['date'].dt.isocalendar().week
    df['day_name'] = df['date'].dt.day_name()
    df['month_name'] = df['date'].dt.month_name()
    
    # Boolean datetime checks
    df['is_month_start'] = df['date'].dt.is_month_start
    df['is_month_end'] = df['date'].dt.is_month_end
    df['is_quarter_start'] = df['date'].dt.is_quarter_start
    df['is_quarter_end'] = df['date'].dt.is_quarter_end
    df['is_year_start'] = df['date'].dt.is_year_start
    df['is_year_end'] = df['date'].dt.is_year_end
    df['is_leap_year'] = df['date'].dt.is_leap_year
    
    # Date offsets
    df['date'] + pd.DateOffset(days=1)
    df['date'] + pd.DateOffset(months=1)
    df['date'] + pd.offsets.BusinessDay(5)
    
    # Set datetime index
    df.set_index('date', inplace=True)
    df.index = pd.to_datetime(df.index)
    
    # Select by date (use .loc; plain [] partial-string lookup was removed in pandas 2.0)
    df.loc['2023']               # All of 2023
    df.loc['2023-01']            # January 2023
    df.loc['2023-01-15']         # Specific day
    df.loc['2023-01':'2023-03']  # Date range
    
    # Resample (change frequency; pandas 2.2+ renames 'M'/'Q'/'Y'/'H' to 'ME'/'QE'/'YE'/'h')
    df.resample('D').mean()   # Daily mean
    df.resample('W').sum()    # Weekly sum
    df.resample('M').last()   # Monthly last value
    df.resample('Q').mean()   # Quarterly mean
    df.resample('Y').sum()    # Yearly sum
    df.resample('2D').mean()  # Every 2 days
    df.resample('H').ffill()  # Hourly with forward fill
    
    # Rolling window
    df.rolling(window=7).mean()  # 7-period moving average
    df.rolling(window=7).sum()
    df.rolling(window=7).std()
    df.rolling(window='7D').mean()  # 7-day window (time-based)
    df.rolling(window=7, min_periods=1).mean()  # Allow fewer periods
    
    # Expanding window
    df.expanding().mean()    # Cumulative mean
    df.expanding(min_periods=3).sum()
    
    # Shifting
    df.shift(1)              # Shift forward by 1
    df.shift(-1)             # Shift backward by 1
    df.shift(freq='D')       # Shift by time frequency
    df.shift(periods=1, freq='D')  # Shift the index (tshift() was removed in pandas 2.0)
    
    # Lagging
    df['lag1'] = df['value'].shift(1)
    df['lag7'] = df['value'].shift(7)
    
    # Percentage change
    df.pct_change()          # Period-over-period change
    df.pct_change(periods=7)  # 7-period change
    
    # Timezone
    df['date'].dt.tz_localize('UTC')  # Make timezone-aware
    df['date'].dt.tz_convert('US/Eastern')  # Convert timezone
    df['date'].dt.tz_localize(None)  # Remove timezone
    Python

    18. Categorical Data

    # Create categorical
    df['category'] = pd.Categorical(df['category'])
    df['category'] = df['category'].astype('category')
    
    # Create with specific categories
    cat = pd.Categorical(['a', 'b', 'c'], categories=['a', 'b', 'c', 'd'])
    
    # Ordered categorical
    cat = pd.Categorical(['low', 'medium', 'high'], 
                         categories=['low', 'medium', 'high'],
                         ordered=True)
    
    # Get categories
    df['category'].cat.categories
    
    # Add categories
    df['category'].cat.add_categories(['new_cat'])
    
    # Remove categories
    df['category'].cat.remove_categories(['old_cat'])
    
    # Rename categories
    df['category'].cat.rename_categories({'old': 'new'})
    
    # Reorder categories
    df['category'].cat.reorder_categories(['c', 'b', 'a'])
    
    # Set categories
    df['category'].cat.set_categories(['a', 'b', 'c'])
    
    # Get codes
    df['category'].cat.codes
    Python
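
    Ordered categoricals are what make comparisons and sorting follow the logical order instead of alphabetical order — a quick sketch:

    sizes = pd.Series(pd.Categorical(['medium', 'low', 'high'],
                                     categories=['low', 'medium', 'high'],
                                     ordered=True))

    sizes.sort_values()        # low, medium, high — not alphabetical
    sizes > 'low'              # Element-wise comparison against a category
    sizes.min(), sizes.max()   # 'low', 'high'
    Python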

    19. Multi-Index (Hierarchical Index)

    # Create MultiIndex
    idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
    idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]])
    idx = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
    df = pd.DataFrame(data, index=idx)
    
    # Set MultiIndex from columns
    df.set_index(['col1', 'col2'])
    
    # Select from MultiIndex
    df.loc['a']              # First level
    df.loc[('a', 1)]         # Both levels
    df.loc[('a', slice(None)), :]  # All of first level
    
    # Cross-section
    df.xs('a', level=0)      # All rows where level 0 is 'a'
    df.xs(1, level=1)        # All rows where level 1 is 1
    
    # Stack and unstack with MultiIndex
    df.stack()
    df.unstack(level=0)
    df.unstack(level='level_name')
    
    # Swap levels
    df.swaplevel(0, 1)
    
    # Sort MultiIndex
    df.sort_index(level=0)
    df.sort_index(level=[0, 1])
    
    # Reset MultiIndex
    df.reset_index()
    df.reset_index(level=0)
    Python

    20. Window Functions

    # Rolling window
    df.rolling(window=3).mean()
    df.rolling(window=3).sum()
    df.rolling(window=3).std()
    df.rolling(window=3).min()
    df.rolling(window=3).max()
    df.rolling(window=3).median()
    df.rolling(window=3).count()
    df.rolling(window=3).apply(custom_func)
    
    # Expanding window
    df.expanding().mean()
    df.expanding(min_periods=3).sum()
    
    # Exponentially weighted
    df.ewm(span=20).mean()   # EMA
    df.ewm(alpha=0.5).mean()
    df.ewm(halflife=10).mean()
    Python

    21. Performance and Optimization

    # Data types optimization
    df['int_col'] = df['int_col'].astype('int32')
    df['cat_col'] = df['cat_col'].astype('category')
    df['sparse_col'] = df['sparse_col'].astype(pd.SparseDtype('int'))
    
    # Chunking for large files
    for chunk in pd.read_csv('large.csv', chunksize=10000):
        process(chunk)
    
    # Query instead of boolean indexing
    df.query('A > 5 and B < 10')  # Often faster
    
    # Use vectorized operations
    df['result'] = df['A'] + df['B']  # Not apply()
    
    # Use numexpr-backed eval for large arithmetic expressions
    pd.eval('df.A + df.B * df.C')  # 'df' is resolved from the calling scope
    
    # Copy explicitly
    df_copy = df.copy()
    df_deep_copy = df.copy(deep=True)
    
    # Memory usage
    df.info(memory_usage='deep')
    df.memory_usage(deep=True)
    
    # Convert to sparse
    df_sparse = df.astype(pd.SparseDtype('float', fill_value=0))
    Python

    22. Advanced Operations

    # explode() - Expand lists in cells
    df.explode('list_column')
    
    # eval() - Fast evaluation of expressions
    df.eval('C = A + B')
    df.eval('C = A + B', inplace=True)
    
    # Binning
    pd.cut(df['A'], bins=5)  # Equal-width bins
    pd.cut(df['A'], bins=[0, 10, 20, 30])  # Custom bins
    pd.qcut(df['A'], q=4)    # Equal-frequency bins (quartiles)
    pd.qcut(df['A'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
    
    # Dummy variables
    pd.get_dummies(df['category'])
    pd.get_dummies(df['category'], prefix='cat')
    pd.get_dummies(df, columns=['cat1', 'cat2'])
    pd.get_dummies(df, drop_first=True)  # Avoid multicollinearity
    
    # Ranking
    df.rank()
    df.rank(method='min')
    df.rank(pct=True)  # Percentile ranks
    
    # Sampling
    df.sample(n=100)
    df.sample(frac=0.1)
    df.sample(n=100, replace=True)  # With replacement
    df.sample(n=100, weights='weight_col')  # Weighted sampling
    
    # Clip values
    df.clip(lower=0, upper=100)
    df['A'].clip(lower=0)
    
    # Factorize (convert to integers)
    codes, unique = pd.factorize(df['category'])
    Python

    23. Validation and Testing

    # Check data types
    assert df['col'].dtype == 'int64'
    
    # Check for nulls
    assert df['col'].notna().all()
    
    # Check uniqueness
    assert df['id'].is_unique
    
    # Check value range
    assert (df['age'] >= 0).all()
    assert (df['age'] <= 120).all()
    
    # Check membership
    assert df['category'].isin(['A', 'B', 'C']).all()
    
    # equals() - Check if DataFrames are equal
    df1.equals(df2)
    
    # compare() - Show differences
    df1.compare(df2)
    Python
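
    For test suites, pandas ships assertion helpers in pandas.testing that raise with a readable diff instead of returning a boolean — a sketch:

    from pandas.testing import assert_frame_equal, assert_series_equal

    assert_frame_equal(df1, df2)                     # Raises AssertionError on mismatch
    assert_frame_equal(df1, df2, check_dtype=False)  # Ignore dtype differences
    assert_series_equal(df1['A'], df2['A'])
    Python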

    24. Working with Duplicates

    # Find duplicates
    df.duplicated()          # Boolean Series
    df.duplicated(subset=['A', 'B'])  # Based on specific columns
    df.duplicated(keep='first')  # Mark duplicates except first
    df.duplicated(keep='last')   # Mark duplicates except last
    df.duplicated(keep=False)    # Mark all duplicates
    
    # View duplicates
    df[df.duplicated()]
    df[df.duplicated(subset=['A'], keep=False)]
    
    # Count duplicates
    df.duplicated().sum()
    
    # Drop duplicates
    df.drop_duplicates()
    df.drop_duplicates(subset=['A'])
    df.drop_duplicates(keep='last')
    Python

    25. Useful Utilities

    # Info about DataFrame
    df.info()
    df.describe()
    df.dtypes
    df.shape
    df.columns
    df.index
    
    # Convert types
    df.astype('float')
    df.astype({'A': 'int32', 'B': 'float64'})
    df['A'].astype('category')
    
    # Copy
    df.copy()
    df.copy(deep=True)
    
    # Iteration (avoid when possible)
    for index, row in df.iterrows():  # Row by row
        print(row['A'])
    for row in df.itertuples():  # Faster than iterrows
        print(row.A)
    
    # Apply function to each element
    df.map(func)  # Entire DataFrame (applymap() is deprecated since pandas 2.1)
    df['col'].apply(func)  # Series
    
    # Pipe for method chaining
    df.pipe(func1).pipe(func2).pipe(func3)
    
    # Style (for display)
    df.style.highlight_max()
    df.style.background_gradient()
    df.style.format('{:.2f}')
    
    # Options
    pd.options.display.max_rows = 100
    pd.options.display.max_columns = None
    pd.set_option('display.max_rows', 100)
    pd.get_option('display.max_rows')
    pd.reset_option('display.max_rows')
    pd.describe_option()  # Show all options
    Python

    Conclusion

    This tutorial covers the essential and advanced features of pandas for data manipulation and analysis. Practice these concepts with real datasets to become proficient. Remember that pandas is constantly evolving, so check the official documentation for the latest features and best practices.

    Happy data wrangling! 🐼

