PostgreSQL Replication

    From Beginner to Expert

    Table of Contents

    1. Introduction to PostgreSQL Replication
    2. Fundamentals of Database Replication
    3. PostgreSQL Architecture and WAL
    4. Setting Up Basic Streaming Replication
    5. Advanced Replication Configurations
    6. Logical Replication
    7. High Availability and Failover
    8. Monitoring and Troubleshooting
    9. Performance Optimization
    10. Security in Replication
    11. Advanced Topics and Best Practices

    1. Introduction to PostgreSQL Replication

    What is Database Replication?

    Database replication is the process of copying and maintaining database objects (tables, schemas, etc.) in multiple database instances. PostgreSQL replication ensures data availability, improves performance, and provides disaster recovery capabilities.

    Types of PostgreSQL Replication

    graph TD
        A[PostgreSQL Replication] --> B[Physical Replication]
        A --> C[Logical Replication]
    
        B --> D[Streaming Replication]
        B --> E[File-based Replication]
    
        C --> F[Publication/Subscription]
        C --> G[Logical Decoding]
    
        D --> H[Synchronous]
        D --> I[Asynchronous]

    Benefits of Replication

    • High Availability: Automatic failover capabilities
    • Load Distribution: Read queries can be distributed across replicas
    • Disaster Recovery: Geographic distribution of data
    • Backup Operations: Backups can run on replicas without affecting primary
    • Reporting: Dedicated reporting databases

    2. Fundamentals of Database Replication

    Key Concepts

    Primary and Standby Servers

    graph LR
        A[Primary Server] -->|WAL Stream| B[Standby Server 1]
        A -->|WAL Stream| C[Standby Server 2]
        A -->|WAL Stream| D[Standby Server 3]
    
        B --> E[Read-Only Queries]
        C --> F[Read-Only Queries]
        D --> G[Read-Only Queries]

    Write-Ahead Logging (WAL)

    WAL is the foundation of PostgreSQL replication. Every change to the database is first written to the WAL before being applied to the data files.

    sequenceDiagram
        participant Client
        participant Primary
        participant WAL
        participant Standby
    
        Client->>Primary: INSERT/UPDATE/DELETE
        Primary->>WAL: Write change to WAL
        WAL->>Primary: Acknowledge write
        Primary->>Client: Confirm transaction
        WAL->>Standby: Stream WAL records
        Standby->>Standby: Apply WAL records

    Replication Modes

    Asynchronous Replication

    • Primary doesn’t wait for standby confirmation
    • Higher performance, potential data loss risk
    • Default mode

    Synchronous Replication

    • Primary waits for standby confirmation
    • Lower performance, zero data loss
    • Configurable levels of synchronization
    graph TD
        A[Transaction Committed] --> B{Replication Mode}
        B -->|Asynchronous| C[Immediate Response]
        B -->|Synchronous| D[Wait for Standby ACK]
        D --> E[Response after Confirmation]

    3. PostgreSQL Architecture and WAL

    PostgreSQL Process Architecture

    graph TB
        subgraph "PostgreSQL Instance"
            A[Postmaster] --> B[Backend Processes]
            A --> C[Background Processes]
    
            C --> D[WAL Writer]
            C --> E[Checkpointer]
            C --> F[Background Writer]
            C --> G[WAL Sender]
            C --> H[WAL Receiver]
    
            B --> I[Shared Buffers]
            D --> J[WAL Files]
            E --> K[Data Files]
        end

    WAL Architecture Deep Dive

    WAL File Structure

    -- Check current WAL file and location
    SELECT pg_current_wal_lsn();
    SELECT pg_current_wal_insert_lsn();
    
    -- View WAL file information
    SELECT name, setting FROM pg_settings WHERE name LIKE 'wal%';
    SQL

    Key WAL Parameters

    -- postgresql.conf settings
    wal_level = replica                    -- or 'logical' for logical replication
    max_wal_senders = 10                  -- Maximum number of WAL sender processes
    wal_keep_size = '1GB'                 -- Amount of WAL files to keep
    archive_mode = on                     -- Enable WAL archiving
    archive_command = 'cp %p /archive/%f' -- Archive command
    SQL

    Understanding LSN (Log Sequence Number)

    graph LR
        A[LSN: 0/1A2B3C4D] --> B[Timeline ID: 0]
        A --> C[File Offset: 1A2B3C4D]
    
        D[WAL Segment] --> E[16MB File]
        E --> F[Multiple LSN Records]

    4. Setting Up Basic Streaming Replication

    Prerequisites

    System Requirements

    • PostgreSQL 12+ (recommended)
    • Network connectivity between servers
    • Sufficient disk space
    • Synchronized system clocks

    Architecture Overview

    graph TB
        subgraph "Primary Server (Master)"
            A[PostgreSQL Primary]
            B[WAL Sender Process]
            C[WAL Files]
        end
    
        subgraph "Standby Server (Replica)"
            D[PostgreSQL Standby]
            E[WAL Receiver Process]
            F[Recovery Process]
        end
    
        B -->|TCP Connection| E
        A --> B
        B --> C
        E --> F
        F --> D

    Step-by-Step Setup

    Step 1: Configure Primary Server

    -- postgresql.conf on primary
    listen_addresses = '*'
    wal_level = replica
    max_wal_senders = 10
    max_replication_slots = 10
    wal_keep_size = '1GB'
    hot_standby = on
    SQL
    -- pg_hba.conf on primary
    # TYPE  DATABASE        USER            ADDRESS                 METHOD
    host    replication     replicator      standby_ip/32          md5
    SQL

    Step 2: Create Replication User

    -- On primary server
    CREATE USER replicator REPLICATION LOGIN PASSWORD 'secure_password';
    SQL

    Step 3: Create Base Backup

    # On standby server
    pg_basebackup -h primary_ip -D /var/lib/postgresql/data -U replicator -v -P -W
    Bash

    Step 4: Configure Standby Server

    -- postgresql.conf on standby
    hot_standby = on
    primary_conninfo = 'host=primary_ip port=5432 user=replicator password=secure_password'
    SQL
    # Create standby.signal file
    touch /var/lib/postgresql/data/standby.signal
    Bash

    Step 5: Start Services

    # Start primary
    sudo systemctl start postgresql
    
    # Start standby
    sudo systemctl start postgresql
    Bash

    Verification

    -- On primary: Check replication status
    SELECT client_addr, state, sync_state FROM pg_stat_replication;
    
    -- On standby: Check recovery status
    SELECT pg_is_in_recovery();
    SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
    SQL

    5. Advanced Replication Configurations

    Replication Slots

    Replication slots ensure that WAL files are retained until all subscribers have received them.

    graph TD
        A[WAL Generation] --> B[Replication Slot]
        B --> C{Slot Active?}
        C -->|Yes| D[Keep WAL Files]
        C -->|No| E[WAL Files May Be Removed]
        D --> F[Standby Can Catch Up]
        E --> G[Potential Data Loss]

    Creating and Managing Replication Slots

    -- Create a replication slot
    SELECT pg_create_physical_replication_slot('standby_slot');
    
    -- View replication slots
    SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;
    
    -- Configure standby to use slot
    -- In postgresql.conf on standby:
    primary_slot_name = 'standby_slot'
    SQL

    Synchronous Replication Configuration

    -- postgresql.conf on primary
    synchronous_standby_names = 'standby1,standby2'  -- ANY 1 (standby1, standby2)
    synchronous_standby_names = 'FIRST 1 (standby1, standby2)'  -- Specific priority
    synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'  -- Any 2 of 3
    SQL

    Synchronous Replication Flow

    sequenceDiagram
        participant Client
        participant Primary
        participant Standby1
        participant Standby2
    
        Client->>Primary: COMMIT
        Primary->>Standby1: WAL Record
        Primary->>Standby2: WAL Record
        Standby1->>Primary: ACK
        Standby2->>Primary: ACK
        Primary->>Client: COMMIT Success

    Cascading Replication

    graph TD
        A[Primary] --> B[Standby 1]
        A --> C[Standby 2]
        B --> D[Standby 3]
        B --> E[Standby 4]
        C --> F[Standby 5]

    Configuration for Cascading

    -- On intermediate standby (Standby 1)
    -- postgresql.conf
    hot_standby = on
    max_wal_senders = 5
    
    -- On downstream standby (Standby 3)
    -- postgresql.conf
    primary_conninfo = 'host=standby1_ip port=5432 user=replicator'
    SQL

    6. Logical Replication

    Understanding Logical Replication

    Logical replication uses a publish/subscribe model and replicates data changes based on the actual data modifications rather than physical file changes.

    graph LR
        subgraph "Publisher Database"
            A[Table A] --> D[Publication]
            B[Table B] --> D
            C[Table C] --> D
        end
    
        subgraph "Subscriber Database"
            D --> E[Subscription]
            E --> F[Table A]
            E --> G[Table B]
            E --> H[Table C]
        end

    Setting Up Logical Replication

    Step 1: Configure Publisher

    -- postgresql.conf
    wal_level = logical
    max_replication_slots = 10
    max_wal_senders = 10
    
    -- Create publication
    CREATE PUBLICATION my_publication FOR ALL TABLES;
    -- OR for specific tables
    CREATE PUBLICATION specific_pub FOR TABLE users, orders;
    SQL

    Step 2: Configure Subscriber

    -- Create subscription
    CREATE SUBSCRIPTION my_subscription
    CONNECTION 'host=publisher_ip dbname=mydb user=replicator password=pass'
    PUBLICATION my_publication;
    SQL

    Logical Replication Features

    Selective Replication

    -- Replicate only specific columns
    CREATE PUBLICATION filtered_pub FOR TABLE users (id, email, created_at);
    
    -- Replicate with row filter (PostgreSQL 15+)
    CREATE PUBLICATION active_users FOR TABLE users WHERE (active = true);
    SQL

    Cross-Version Replication

    graph LR
        A[PostgreSQL 13] -->|Logical Replication| B[PostgreSQL 15]
        C[PostgreSQL 14] -->|Logical Replication| D[PostgreSQL 16]

    Monitoring Logical Replication

    -- Monitor publications
    SELECT * FROM pg_publication;
    
    -- Monitor subscriptions
    SELECT * FROM pg_subscription;
    
    -- Check replication lag
    SELECT 
        subname,
        pid,
        received_lsn,
        latest_end_lsn,
        latest_end_time
    FROM pg_stat_subscription;
    SQL

    7. High Availability and Failover

    Automatic Failover Architecture

    graph TB
        subgraph "HA Cluster"
            A[Primary] --> B[Standby 1]
            A --> C[Standby 2]
    
            D[Load Balancer] --> A
            E[Monitoring Tool] --> A
            E --> B
            E --> C
    
            F[Virtual IP] --> D
        end
    
        subgraph "Failover Process"
            G[Primary Failure Detected] --> H[Promote Standby]
            H --> I[Update Load Balancer]
            I --> J[Update DNS/VIP]
        end

    Tools for High Availability

    pg_auto_failover

    # Install pg_auto_failover
    sudo apt-get install postgresql-13-auto-failover
    
    # Initialize monitor
    pg_autoctl create monitor --hostname monitor-host --auth trust
    
    # Create primary
    pg_autoctl create postgres --hostname primary-host --auth trust \
      --monitor postgres://autoctl_node@monitor-host/pg_auto_failover
    
    # Create standby
    pg_autoctl create postgres --hostname standby-host --auth trust \
      --monitor postgres://autoctl_node@monitor-host/pg_auto_failover
    Bash

    Patroni Configuration

    # patroni.yml
    scope: postgres-cluster
    namespace: /service/
    name: postgresql0
    
    restapi:
      listen: 0.0.0.0:8008
      connect_address: node1:8008
    
    etcd:
      host: etcd-cluster:2379
    
    bootstrap:
      dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 30
        maximum_lag_on_failover: 1048576
        postgresql:
          use_pg_rewind: true
          parameters:
            wal_level: replica
            hot_standby: "on"
            max_wal_senders: 10
            max_replication_slots: 10
    
    postgresql:
      listen: 0.0.0.0:5432
      connect_address: node1:5432
      data_dir: /var/lib/postgresql/data
      authentication:
        replication:
          username: replicator
          password: secret
    YAML

    Manual Failover Procedures

    Promoting a Standby

    # Method 1: Using pg_ctl
    pg_ctl promote -D /var/lib/postgresql/data
    
    # Method 2: Using SQL
    SELECT pg_promote();
    
    # Method 3: Remove standby.signal file
    rm /var/lib/postgresql/data/standby.signal
    Bash

    Failover Verification

    -- Check if server is no longer in recovery
    SELECT pg_is_in_recovery();
    
    -- Verify write capability
    CREATE TABLE failover_test (id serial, created_at timestamp default now());
    INSERT INTO failover_test DEFAULT VALUES;
    SQL

    Split-Brain Prevention

    graph TD
        A[Fencing Mechanism] --> B[Consensus Algorithm]
        A --> C[Quorum-based Decisions]
        A --> D[Network Partitioning Detection]
    
        B --> E[Prevent Multiple Primaries]
        C --> E
        D --> E

    8. Monitoring and Troubleshooting

    Key Metrics to Monitor

    Replication Lag

    -- On primary: Check lag for all standbys
    SELECT 
        client_addr,
        application_name,
        state,
        sync_state,
        pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as send_lag,
        pg_wal_lsn_diff(sent_lsn, flush_lsn) as flush_lag,
        pg_wal_lsn_diff(flush_lsn, replay_lsn) as replay_lag
    FROM pg_stat_replication;
    
    -- On standby: Check receive and replay lag
    SELECT 
        pg_last_wal_receive_lsn(),
        pg_last_wal_replay_lsn(),
        pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) as replay_lag;
    SQL

    WAL Generation Rate

    -- Monitor WAL generation
    SELECT 
        name,
        pg_current_wal_lsn() as current_lsn,
        pg_walfile_name(pg_current_wal_lsn()) as current_wal_file
    FROM pg_stat_bgwriter;
    SQL

    Monitoring Dashboard Queries

    -- Comprehensive replication status
    WITH replication_stats AS (
        SELECT 
            client_addr,
            application_name,
            state,
            sync_state,
            pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)/1024/1024 as send_lag_mb,
            pg_wal_lsn_diff(sent_lsn, flush_lsn)/1024/1024 as flush_lag_mb,
            pg_wal_lsn_diff(flush_lsn, replay_lsn)/1024/1024 as replay_lag_mb,
            extract(epoch from (now() - backend_start)) as connection_duration_seconds
        FROM pg_stat_replication
    )
    SELECT 
        *,
        CASE 
            WHEN send_lag_mb > 100 THEN 'CRITICAL'
            WHEN send_lag_mb > 50 THEN 'WARNING'
            ELSE 'OK'
        END as lag_status
    FROM replication_stats;
    SQL

    Common Issues and Solutions

    Issue 1: Replication Lag

    graph TD
        A[High Replication Lag] --> B[Check Network]
        A --> C[Check Disk I/O]
        A --> D[Check WAL Settings]
    
        B --> E[Network Bandwidth]
        B --> F[Network Latency]
    
        C --> G[Disk Speed on Standby]
        C --> H[I/O Contention]
    
        D --> I[wal_keep_size]
        D --> J[max_wal_senders]

    Issue 2: Broken Replication

    -- Check for replication breaks
    SELECT slot_name, active, restart_lsn, confirmed_flush_lsn 
    FROM pg_replication_slots;
    
    -- Fix broken replication
    -- 1. If standby is too far behind, recreate from base backup
    pg_basebackup -h primary -D /var/lib/postgresql/data -U replicator -v -P
    
    -- 2. If replication slot is inactive
    SELECT pg_drop_replication_slot('slot_name');
    SELECT pg_create_physical_replication_slot('slot_name');
    SQL

    Alerting Thresholds

    -- Create monitoring function
    CREATE OR REPLACE FUNCTION check_replication_health()
    RETURNS TABLE (
        metric text,
        value numeric,
        status text,
        message text
    ) AS $$
    BEGIN
        -- Check replication lag
        RETURN QUERY
        SELECT 
            'replication_lag_mb' as metric,
            COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)/1024/1024), 0) as value,
            CASE 
                WHEN MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)/1024/1024) > 100 THEN 'CRITICAL'
                WHEN MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)/1024/1024) > 50 THEN 'WARNING'
                ELSE 'OK'
            END as status,
            'Replication lag in MB' as message
        FROM pg_stat_replication;
    
        -- Check number of connected standbys
        RETURN QUERY
        SELECT 
            'connected_standbys' as metric,
            COUNT(*)::numeric as value,
            CASE 
                WHEN COUNT(*) < 1 THEN 'CRITICAL'
                WHEN COUNT(*) < 2 THEN 'WARNING'
                ELSE 'OK'
            END as status,
            'Number of connected standby servers' as message
        FROM pg_stat_replication;
    END;
    $$ LANGUAGE plpgsql;
    SQL

    9. Performance Optimization

    Network Optimization

    TCP Configuration

    # /etc/sysctl.conf optimizations for replication
    net.core.rmem_max = 134217728
    net.core.wmem_max = 134217728
    net.ipv4.tcp_rmem = 4096 87380 134217728
    net.ipv4.tcp_wmem = 4096 65536 134217728
    net.ipv4.tcp_window_scaling = 1
    INI

    Compression

    -- Enable WAL compression (PostgreSQL 14+)
    wal_compression = on
    
    -- Connection compression
    primary_conninfo = 'host=primary compression=on'
    SQL

    WAL Optimization

    -- postgresql.conf optimizations
    wal_buffers = 16MB                    -- WAL buffer size
    wal_writer_delay = 200ms              -- WAL writer sleep time
    wal_writer_flush_after = 1MB          -- Force flush threshold
    checkpoint_timeout = 15min            -- Checkpoint frequency
    checkpoint_completion_target = 0.9    -- Checkpoint completion target
    max_wal_size = 4GB                    -- Maximum WAL size
    min_wal_size = 1GB                    -- Minimum WAL size
    SQL

    Standby Performance Tuning

    -- Standby-specific settings
    hot_standby_feedback = on             -- Send feedback to primary
    max_standby_streaming_delay = 30s     -- Cancel delay for streaming
    max_standby_archive_delay = 30s       -- Cancel delay for archive recovery
    wal_receiver_status_interval = 10s    -- Status update interval
    wal_retrieve_retry_interval = 5s      -- Retry interval for WAL retrieval
    SQL

    Parallel WAL Replay

    -- Enable parallel WAL replay (PostgreSQL 14+)
    recovery_prefetch = try               -- Prefetch blocks during recovery
    wal_decode_buffer_size = 512kB        -- Buffer size for WAL decoding
    SQL

    Performance Monitoring Queries

    -- WAL generation rate
    SELECT 
        name,
        pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) as total_wal,
        pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), pg_current_wal_lsn() - pg_wal_lsn_diff(pg_current_wal_lsn(), pg_stat_file('pg_wal/00000001000000000000000' || (pg_walfile_name_offset(pg_current_wal_lsn())).file_offset::text, true).modification))) as wal_rate
    FROM pg_settings WHERE name = 'cluster_name';
    
    -- Replication throughput
    SELECT 
        client_addr,
        pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as lag_size,
        extract(epoch from (now() - pg_last_wal_replay_timestamp())) as lag_seconds
    FROM pg_stat_replication;
    SQL

    10. Security in Replication

    Authentication and Authorization

    SSL/TLS Configuration

    -- postgresql.conf
    ssl = on
    ssl_cert_file = 'server.crt'
    ssl_key_file = 'server.key'
    ssl_ca_file = 'ca.crt'
    ssl_crl_file = 'ca.crl'
    SQL
    -- pg_hba.conf with SSL enforcement
    hostssl replication replicator standby_ip/32 cert
    hostssl replication replicator 0.0.0.0/0 cert clientcert=verify-full
    SQL

    Certificate-based Authentication

    # Generate certificates
    openssl genrsa -out ca.key 2048
    openssl req -new -x509 -key ca.key -out ca.crt -days 365
    openssl genrsa -out server.key 2048
    openssl req -new -key server.key -out server.csr
    openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -out server.crt -days 365
    Bash

    Network Security

    Firewall Configuration

    # Allow PostgreSQL replication traffic
    sudo ufw allow from standby_ip to any port 5432
    sudo ufw allow from primary_ip to any port 5432
    
    # Deny all other traffic to PostgreSQL port
    sudo ufw deny 5432
    Bash

    VPN and Private Networks

    graph TD
        subgraph "Primary Site"
            A[Primary DB] --> B[VPN Gateway]
        end
    
        subgraph "Secondary Site"
            C[VPN Gateway] --> D[Standby DB]
        end
    
        B -.->|Encrypted Tunnel| C
    
        subgraph "Security Features"
            E[IPSec Encryption]
            F[Certificate Auth]
            G[Network Isolation]
        end

    Data Encryption

    At-Rest Encryption

    -- Enable TDE (Transparent Data Encryption) if available
    -- Or use filesystem-level encryption
    # LUKS encryption setup
    cryptsetup luksFormat /dev/sdb
    cryptsetup luksOpen /dev/sdb encrypted_disk
    mkfs.ext4 /dev/mapper/encrypted_disk
    Bash

    In-Transit Encryption

    -- Connection string with SSL
    primary_conninfo = 'host=primary sslmode=require sslcert=client.crt sslkey=client.key sslrootcert=ca.crt'
    SQL

    Access Control

    Role-based Security

    -- Create replication roles with minimal privileges
    CREATE ROLE replication_admin;
    GRANT CONNECT ON DATABASE postgres TO replication_admin;
    GRANT USAGE ON SCHEMA pg_catalog TO replication_admin;
    
    -- Create specific replication user
    CREATE USER standby_replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
    GRANT replication_admin TO standby_replicator;
    SQL

    Audit Logging

    -- postgresql.conf
    log_connections = on
    log_disconnections = on
    log_replication_commands = on
    log_statement = 'all'
    log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
    SQL

    11. Advanced Topics and Best Practices

    Multi-Master Replication

    PostgreSQL doesn’t natively support multi-master replication, but there are several solutions:

    BDR (Bi-Directional Replication)

    graph TD
        A[Node 1] <--> B[Node 2]
        B <--> C[Node 3]
        C <--> A
    
        D[Conflict Resolution] --> A
        D --> B
        D --> C

    Postgres-XL/XC Configuration

    -- Coordinator configuration
    CREATE NODE coord1 WITH (TYPE = 'coordinator', HOST = 'coord1', PORT = 5432);
    CREATE NODE coord2 WITH (TYPE = 'coordinator', HOST = 'coord2', PORT = 5432);
    
    -- Data node configuration
    CREATE NODE datanode1 WITH (TYPE = 'datanode', HOST = 'dn1', PORT = 5432);
    CREATE NODE datanode2 WITH (TYPE = 'datanode', HOST = 'dn2', PORT = 5432);
    SQL

    Cross-Platform Replication

    Heterogeneous Replication

    graph LR
        A[PostgreSQL] -->|Logical Replication| B[PostgreSQL]
        A -->|ETL Tools| C[MySQL]
        A -->|CDC Tools| D[MongoDB]
        A -->|Streaming| E[Apache Kafka]

    Backup Integration

    Point-in-Time Recovery with Replication

    # Continuous archiving setup
    archive_mode = on
    archive_command = 'test ! -f /backup/archive/%f && cp %p /backup/archive/%f'
    
    # Base backup with WAL
    pg_basebackup -D /backup/base -Ft -z -P -U backup_user
    Bash

    Backup from Standby

    -- Configure standby for backups
    hot_standby = on
    archive_mode = always  -- Archive on standby too
    
    -- Backup command on standby
    pg_basebackup -D /backup/standby_backup -X stream -v -P
    SQL

    Performance Testing and Benchmarking

    pgbench with Replication

    # Initialize pgbench
    pgbench -i -s 1000 -h primary_host testdb
    
    # Run read-write test on primary
    pgbench -c 10 -j 2 -T 300 -h primary_host testdb
    
    # Run read-only test on standby
    pgbench -c 20 -j 4 -T 300 -S -h standby_host testdb
    Bash

    Custom Replication Benchmarks

    -- Replication lag benchmark
    CREATE OR REPLACE FUNCTION test_replication_lag()
    RETURNS TABLE (
        test_time timestamp,
        write_lsn pg_lsn,
        lag_ms bigint
    ) AS $$
    DECLARE
        start_time timestamp;
        start_lsn pg_lsn;
        current_lsn pg_lsn;
    BEGIN
        start_time := clock_timestamp();
        start_lsn := pg_current_wal_insert_lsn();
    
        -- Generate some WAL activity
        INSERT INTO lag_test (data) VALUES (random()::text);
    
        -- Wait for replication
        PERFORM pg_sleep(0.001);
    
        current_lsn := pg_current_wal_insert_lsn();
    
        RETURN QUERY SELECT 
            start_time,
            start_lsn,
            extract(milliseconds from (clock_timestamp() - start_time))::bigint;
    END;
    $$ LANGUAGE plpgsql;
    SQL

    Best Practices Summary

    Configuration Best Practices

    1. Always use replication slots for important standbys
    2. Monitor replication lag continuously
    3. Set appropriate WAL retention based on network reliability
    4. Use synchronous replication for critical data
    5. Implement proper authentication and encryption

    Operational Best Practices

    1. Regular failover testing in staging environments
    2. Automate monitoring and alerting
    3. Document runbooks for common scenarios
    4. Keep standbys at same version or compatible versions
    5. Plan for disaster recovery scenarios

    Troubleshooting Checklist

    -- Quick replication health check
    SELECT 
        'Primary Status' as check_type,
        CASE WHEN pg_is_in_recovery() THEN 'STANDBY' ELSE 'PRIMARY' END as status
    UNION ALL
    SELECT 
        'Replication Connections',
        COUNT(*)::text
    FROM pg_stat_replication
    UNION ALL
    SELECT 
        'Max Replication Lag (MB)',
        COALESCE(pg_size_pretty(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn))), 'N/A')
    FROM pg_stat_replication;
    SQL

    Conclusion

    PostgreSQL replication is a powerful feature that enables high availability, disaster recovery, and performance scaling. This guide has covered:

    • Physical vs Logical Replication: Understanding when to use each approach
    • Setup and Configuration: Step-by-step implementation guides
    • Monitoring and Troubleshooting: Essential skills for production environments
    • Performance Optimization: Tuning for optimal replication performance
    • Security: Protecting replication streams and data
    • Advanced Topics: Multi-master setups and complex architectures

    Next Steps

    1. Practice in Lab Environment: Set up test clusters to gain hands-on experience
    2. Implement Monitoring: Deploy comprehensive monitoring solutions
    3. Plan for Production: Design replication architecture for your specific needs
    4. Stay Updated: Follow PostgreSQL releases for new replication features
    5. Community Engagement: Participate in PostgreSQL community forums and events

    Additional Resources


    This guide represents current best practices as of PostgreSQL 15+. Always consult the official documentation for your specific PostgreSQL version.


    Discover more from Altgr Blog

    Subscribe to get the latest posts sent to your email.

    Leave a Reply

    Your email address will not be published. Required fields are marked *