Table of Contents
- Introduction to Web Scraping
- Legal and Ethical Considerations
- Understanding HTML Structure
- Setting Up Your Environment
- Basic Web Scraping with Requests and BeautifulSoup
- Advanced Techniques
- Handling Dynamic Content with Selenium
- Best Practices
- Common Challenges and Solutions
- Real-World Projects
1. Introduction to Web Scraping
Web scraping is the automated process of extracting data from websites. It’s a powerful technique used for:
- Data Analysis: Collecting data for research and analytics
- Price Monitoring: Tracking competitor prices
- Content Aggregation: Building news or product aggregators
- Lead Generation: Gathering business contact information
- Market Research: Analyzing trends and patterns
When to Use Web Scraping
✅ Good Use Cases:
- Public data that’s not available via API
- Monitoring website changes
- Research and academic purposes
- Personal projects
❌ When NOT to Use:
- If an official API exists, use it instead
- When explicitly prohibited by terms of service
- For scraping personal or private data
2. Legal and Ethical Considerations
Legal Framework
Before scraping any website, consider:
- Terms of Service (ToS): Always review the website’s ToS
- robots.txt: Check the robots.txt file (e.g., https://example.com/robots.txt)
- Copyright: Respect intellectual property rights
- Rate Limiting: Don’t overload servers with requests
Ethical Guidelines
# Example: Checking robots.txt
from urllib.robotparser import RobotFileParser
def can_scrape(url, user_agent='*'):
"""Check if scraping is allowed"""
rp = RobotFileParser()
rp.set_url(url + '/robots.txt')
rp.read()
return rp.can_fetch(user_agent, url)
# Usage
website = "https://example.com"
if can_scrape(website):
print("Scraping is allowed!")
else:
print("Scraping is not allowed.")PythonBest Practices:
- Add delays between requests
- Identify your bot with a proper User-Agent
- Respect rate limits
- Use APIs when available
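A minimal sketch combining these practices; the URLs, contact address, and one-second delay below are illustrative placeholders rather than values from this article:

import time
import requests

# Identify your bot honestly and give the site owner a way to contact you (placeholder address).
HEADERS = {'User-Agent': 'MyResearchBot/1.0 (+mailto:you@example.com)'}

urls = [
    'https://example.com/page/1',  # placeholder pages
    'https://example.com/page/2',
]

for url in urls:
    response = requests.get(url, headers=HEADERS, timeout=10)
    if response.status_code == 200:
        print(f"Fetched {url} ({len(response.text)} bytes)")
    time.sleep(1)  # pause between requests so the server is not overloaded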
3. Understanding HTML Structure
HTML Basics
HTML (HyperText Markup Language) structures web pages using tags:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>Main Heading</h1>
<div class="content">
<p>This is a paragraph.</p>
<ul id="items">
<li class="item">Item 1</li>
<li class="item">Item 2</li>
</ul>
</div>
<a href="https://example.com">Link</a>
</body>
</html>
Key HTML Elements for Scraping
- Tags: <div>, <span>, <p>, <h1>, etc.
- Attributes: class, id, href, src
- Text: Content between opening and closing tags (see the short example below)
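To see how tags, attributes, and text map to extraction, here is a small illustrative sketch that parses a fragment of the sample page above with BeautifulSoup:

from bs4 import BeautifulSoup

# A fragment of the sample page above, stored as a string for illustration.
html = '''
<div class="content">
  <p>This is a paragraph.</p>
</div>
<a href="https://example.com">Link</a>
'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')             # find by tag name
print(link.name)                  # 'a' - the tag
print(link['href'])               # 'https://example.com' - an attribute
print(link.text)                  # 'Link' - the text between the tags
print(soup.find('div')['class'])  # ['content'] - class values come back as a list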
CSS Selectors
CSS selectors help locate elements; the table below summarizes the most common patterns, and a short select() example follows it:
| Selector | Example | Description |
|---|---|---|
| Tag | p | Selects all <p> elements |
| Class | .content | Selects elements with class="content" |
| ID | #items | Selects element with id="items" |
| Descendant | div p | Selects <p> inside <div> |
| Child | div > p | Direct child only |
| Attribute | [href] | Elements with href attribute |
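Here is a short sketch showing these selector patterns with BeautifulSoup's select() and select_one(); the HTML fragment reuses the sample page from above:

from bs4 import BeautifulSoup

html = '''
<div class="content">
  <p>This is a paragraph.</p>
  <ul id="items">
    <li class="item">Item 1</li>
    <li class="item">Item 2</li>
  </ul>
</div>
<a href="https://example.com">Link</a>
'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.select('p'))           # tag selector
print(soup.select('.item'))       # class selector
print(soup.select_one('#items'))  # id selector
print(soup.select('div p'))       # descendant selector
print(soup.select('ul > li'))     # direct child selector
print(soup.select('[href]'))      # attribute selector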
XPath (Alternative to CSS)
# XPath examples
/html/body/div[1] # First div in body
//div[@class='content'] # Any div with class='content'
//a[@href] # All links with href
//li[text()='Item 1'] # List item with specific text
4. Setting Up Your Environment
Installing Required Libraries
# Core libraries
pip install requests beautifulsoup4 lxml
# For dynamic content
pip install selenium
# Additional useful libraries
pip install pandas # Data manipulation
pip install scrapy # Advanced scraping framework
Creating a Virtual Environment
# Windows
python -m venv scraping_env
scraping_env\Scripts\activate
# Linux/Mac
python3 -m venv scraping_env
source scraping_env/bin/activate
Verify Installation
import requests
from bs4 import BeautifulSoup
import pandas as pd
print("All libraries installed successfully!")Python5. Basic Web Scraping with Requests and BeautifulSoup
5.1 Making HTTP Requests
import requests
# Basic GET request
url = "https://example.com"
response = requests.get(url)
# Check status code
if response.status_code == 200:
print("Success!")
html_content = response.text
else:
print(f"Error: {response.status_code}")
# Adding headers
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
# Handling timeouts
try:
response = requests.get(url, timeout=5)
except requests.Timeout:
print("Request timed out")
except requests.RequestException as e:
print(f"Error: {e}")Python5.2 Parsing HTML with BeautifulSoup
from bs4 import BeautifulSoup
import requests
# Fetch and parse
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Finding elements
title = soup.find('h1') # First h1
all_paragraphs = soup.find_all('p') # All paragraphs
by_class = soup.find('div', class_='content')
by_id = soup.find(id='main-content')
# CSS Selectors
items = soup.select('.item') # All elements with class 'item'
first_item = soup.select_one('#first') # Element with id 'first'
# Extracting text
print(title.text) # Get text content
print(title.get_text(strip=True)) # Remove whitespace
# Extracting attributes
link = soup.find('a')
href = link.get('href') # Get href attribute
href = link['href'] # Alternative syntax
5.3 Complete Example: Scraping Quotes
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
def scrape_quotes():
"""Scrape quotes from quotes.toscrape.com"""
base_url = "http://quotes.toscrape.com/page/"
all_quotes = []
for page in range(1, 6): # Scrape first 5 pages
print(f"Scraping page {page}...")
url = f"{base_url}{page}/"
# Add headers
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
if response.status_code != 200:
print(f"Failed to fetch page {page}")
continue
soup = BeautifulSoup(response.content, 'lxml')
# Find all quote containers
quotes = soup.find_all('div', class_='quote')
for quote in quotes:
# Extract quote text
text = quote.find('span', class_='text').text
# Extract author
author = quote.find('small', class_='author').text
# Extract tags
tags = [tag.text for tag in quote.find_all('a', class_='tag')]
all_quotes.append({
'quote': text,
'author': author,
'tags': ', '.join(tags)
})
# Be polite - add delay between requests
time.sleep(1)
# Save to CSV
df = pd.DataFrame(all_quotes)
df.to_csv('quotes.csv', index=False)
print(f"Scraped {len(all_quotes)} quotes!")
return df
# Run the scraper
if __name__ == "__main__":
quotes_df = scrape_quotes()
print(quotes_df.head())
5.4 Navigating the DOM Tree
from bs4 import BeautifulSoup
html = """
<div class="container">
<h1>Title</h1>
<div class="content">
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Parent navigation
p = soup.find('p')
parent = p.parent # Get parent element
parents = p.parents # All parents (generator)
# Sibling navigation
first_p = soup.find('p')
next_sibling = first_p.find_next_sibling()
prev_sibling = first_p.find_previous_sibling()
# Child navigation
container = soup.find('div', class_='container')
children = container.children # Direct children (generator)
descendants = container.descendants # All descendants (generator)
# Finding next/previous elements
h1 = soup.find('h1')
next_elem = h1.find_next('p') # Next <p> tag
6. Advanced Techniques
6.1 Handling Pagination
def scrape_paginated_site(base_url, max_pages=10):
"""Scrape multiple pages"""
all_data = []
for page in range(1, max_pages + 1):
url = f"{base_url}?page={page}"
response = requests.get(url)
if response.status_code != 200:
break
soup = BeautifulSoup(response.content, 'lxml')
# Check if "Next" button exists
next_button = soup.find('a', class_='next')
if not next_button:
print(f"Reached last page at page {page}")
break
# Extract data from current page
items = soup.find_all('div', class_='item')
for item in items:
data = {
'title': item.find('h2').text,
'price': item.find('span', class_='price').text
}
all_data.append(data)
time.sleep(1) # Rate limiting
return all_data
6.2 Handling Forms and POST Requests
import requests
# POST request with form data
login_url = "https://example.com/login"
login_data = {
'username': 'your_username',
'password': 'your_password'
}
# Create a session to persist cookies
session = requests.Session()
# Send POST request
response = session.post(login_url, data=login_data)
# Access protected pages using the same session
protected_url = "https://example.com/dashboard"
response = session.get(protected_url)
soup = BeautifulSoup(response.content, 'lxml')
6.3 Working with JSON APIs
import requests
import json
# Fetch JSON data
api_url = "https://api.example.com/data"
response = requests.get(api_url)
# Parse JSON
data = response.json()
# Alternative
data = json.loads(response.text)
# Extract specific fields
for item in data['results']:
print(item['name'], item['price'])
# Save to file
with open('data.json', 'w') as f:
json.dump(data, f, indent=4)
6.4 Handling Different Encodings
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
# Check encoding
print(response.encoding)
# Set encoding manually if needed
response.encoding = 'utf-8'
# Or let BeautifulSoup handle it
soup = BeautifulSoup(response.content, 'lxml', from_encoding='utf-8')
6.5 Working with Tables
import pandas as pd
from bs4 import BeautifulSoup
import requests
def scrape_table(url):
"""Scrape HTML tables"""
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Method 1: Using pandas (easiest for simple tables)
tables = pd.read_html(response.text)
df = tables[0] # First table
# Method 2: Manual parsing (more control)
table = soup.find('table')
# Extract headers
headers = []
for th in table.find_all('th'):
headers.append(th.text.strip())
# Extract rows
rows = []
for tr in table.find_all('tr')[1:]: # Skip header row
row = []
for td in tr.find_all('td'):
row.append(td.text.strip())
rows.append(row)
# Create DataFrame
df = pd.DataFrame(rows, columns=headers)
return df
# Usage
df = scrape_table("https://example.com/table")
print(df)
6.6 Downloading Files
import requests
import os
import time
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def download_file(url, filename=None):
"""Download file from URL"""
response = requests.get(url, stream=True)
# Get filename from URL if not provided
if filename is None:
filename = url.split('/')[-1]
# Download in chunks for large files
with open(filename, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
print(f"Downloaded: {filename}")
# Download multiple files
def download_images(soup, base_url, folder='images'):
"""Download all images from page"""
os.makedirs(folder, exist_ok=True)
images = soup.find_all('img')
for idx, img in enumerate(images):
img_url = img.get('src')
# Handle relative URLs
if not img_url.startswith('http'):
img_url = urljoin(base_url, img_url)
filename = f"{folder}/image_{idx}.jpg"
download_file(img_url, filename)
time.sleep(0.5)
# Usage
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
download_images(soup, url)
7. Handling Dynamic Content with Selenium
Many modern websites use JavaScript to load content dynamically. For these sites, Selenium is essential.
7.1 Setting Up Selenium
# Install Selenium
pip install selenium
# Install WebDriver Manager (automatic driver management)
pip install webdriver-manager
7.2 Basic Selenium Usage
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
# Setup Chrome driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
# Navigate to page
driver.get("https://example.com")
# Wait for page to load
wait = WebDriverWait(driver, 10)
# Find elements
title = driver.find_element(By.TAG_NAME, "h1")
print(title.text)
# Find multiple elements
items = driver.find_elements(By.CLASS_NAME, "item")
# CSS Selectors
element = driver.find_element(By.CSS_SELECTOR, ".content > p")
# XPath
element = driver.find_element(By.XPATH, "//div[@class='content']")
# Close browser
driver.quit()
7.3 Interacting with Elements
from selenium.webdriver.common.keys import Keys
# Click button
button = driver.find_element(By.ID, "submit-btn")
button.click()
# Fill form
search_box = driver.find_element(By.NAME, "search")
search_box.send_keys("web scraping")
search_box.send_keys(Keys.RETURN) # Press Enter
# Scroll to element
element = driver.find_element(By.ID, "footer")
driver.execute_script("arguments[0].scrollIntoView();", element)
# Execute JavaScript
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Select dropdown
from selenium.webdriver.support.select import Select
dropdown = Select(driver.find_element(By.ID, "dropdown"))
dropdown.select_by_visible_text("Option 1")
dropdown.select_by_value("option1")
dropdown.select_by_index(0)
7.4 Waiting for Elements
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# Explicit wait
wait = WebDriverWait(driver, 10)
# Wait for element to be present
element = wait.until(
EC.presence_of_element_located((By.ID, "dynamic-content"))
)
# Wait for element to be clickable
button = wait.until(
EC.element_to_be_clickable((By.ID, "submit-btn"))
)
# Wait for element to be visible
element = wait.until(
EC.visibility_of_element_located((By.CLASS_NAME, "modal"))
)
# Wait for title to contain text
wait.until(EC.title_contains("Expected Title"))
# Implicit wait (applies to all find operations)
driver.implicitly_wait(10) # seconds
7.5 Handling Multiple Windows/Tabs
# Get current window handle
main_window = driver.current_window_handle
# Click link that opens new tab
link = driver.find_element(By.LINK_TEXT, "Open in new tab")
link.click()
# Switch to new tab
all_windows = driver.window_handles
for window in all_windows:
if window != main_window:
driver.switch_to.window(window)
break
# Do something in new tab
print(driver.title)
# Switch back to main window
driver.switch_to.window(main_window)
# Close current tab
driver.close()
7.6 Headless Browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Configure headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--window-size=1920,1080")
# Create driver with options
driver = webdriver.Chrome(options=chrome_options)
# Use normally
driver.get("https://example.com")
print(driver.title)
driver.quit()
7.7 Complete Selenium Example
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
import pandas as pd
import time
def scrape_dynamic_site():
"""Scrape a site with dynamic content"""
# Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)
try:
# Navigate to site
driver.get("https://example.com")
# Wait for dynamic content to load
wait = WebDriverWait(driver, 10)
wait.until(
EC.presence_of_element_located((By.CLASS_NAME, "item"))
)
# Scroll to load more content
for _ in range(3):
driver.execute_script(
"window.scrollTo(0, document.body.scrollHeight);"
)
time.sleep(2)
# Extract data
items = driver.find_elements(By.CLASS_NAME, "item")
data = []
for item in items:
title = item.find_element(By.CLASS_NAME, "title").text
price = item.find_element(By.CLASS_NAME, "price").text
data.append({
'title': title,
'price': price
})
# Save to DataFrame
df = pd.DataFrame(data)
df.to_csv('scraped_data.csv', index=False)
print(f"Scraped {len(data)} items")
return df
finally:
driver.quit()
# Run scraper
if __name__ == "__main__":
df = scrape_dynamic_site()
print(df.head())
8. Best Practices
8.1 Error Handling
import requests
from bs4 import BeautifulSoup
import logging
# Setup logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
def safe_scrape(url):
"""Scrape with comprehensive error handling"""
try:
# Make request
response = requests.get(url, timeout=10)
response.raise_for_status() # Raise exception for bad status codes
# Parse HTML
soup = BeautifulSoup(response.content, 'lxml')
# Extract data with safety checks
title = soup.find('h1')
if title:
logging.info(f"Found title: {title.text}")
else:
logging.warning("No title found")
return soup
except requests.Timeout:
logging.error(f"Timeout error for {url}")
except requests.HTTPError as e:
logging.error(f"HTTP error: {e}")
except requests.RequestException as e:
logging.error(f"Request failed: {e}")
except Exception as e:
logging.error(f"Unexpected error: {e}")
return None
# Usage
soup = safe_scrape("https://example.com")
if soup:
# Process data
pass
8.2 Rate Limiting and Delays
import time
from datetime import datetime
import random
class RateLimiter:
"""Simple rate limiter"""
def __init__(self, requests_per_second=1):
self.delay = 1.0 / requests_per_second
self.last_request = 0
def wait(self):
"""Wait if necessary to maintain rate limit"""
elapsed = time.time() - self.last_request
if elapsed < self.delay:
time.sleep(self.delay - elapsed)
self.last_request = time.time()
# Usage
limiter = RateLimiter(requests_per_second=2)
for url in urls:
limiter.wait()
response = requests.get(url)
# Process response
# Alternative: Random delay
def random_delay(min_seconds=1, max_seconds=3):
"""Add random delay between requests"""
time.sleep(random.uniform(min_seconds, max_seconds))
# Usage
for url in urls:
response = requests.get(url)
random_delay(1, 3)
8.3 Using Proxies
import requests
# Single proxy
proxies = {
'http': 'http://proxy.example.com:8080',
'https': 'https://proxy.example.com:8080',
}
response = requests.get(url, proxies=proxies)
# Rotating proxies
class ProxyRotator:
"""Rotate through multiple proxies"""
def __init__(self, proxy_list):
self.proxies = proxy_list
self.current = 0
def get_proxy(self):
"""Get next proxy in rotation"""
proxy = self.proxies[self.current]
self.current = (self.current + 1) % len(self.proxies)
return {'http': proxy, 'https': proxy}
# Usage
proxy_list = [
'http://proxy1.com:8080',
'http://proxy2.com:8080',
'http://proxy3.com:8080',
]
rotator = ProxyRotator(proxy_list)
for url in urls:
proxies = rotator.get_proxy()
try:
response = requests.get(url, proxies=proxies, timeout=5)
except requests.RequestException:
continue
8.4 User Agents
import random
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101',
]
def get_random_user_agent():
"""Return random user agent"""
return random.choice(USER_AGENTS)
# Usage
headers = {
'User-Agent': get_random_user_agent()
}
response = requests.get(url, headers=headers)
8.5 Caching
import requests
from functools import lru_cache
import pickle
import os
from datetime import datetime, timedelta
# Simple in-memory caching with lru_cache
@lru_cache(maxsize=128)
def fetch_page(url):
"""Cached page fetching"""
response = requests.get(url)
return response.text
# File-based caching
class FileCache:
"""Simple file-based cache"""
def __init__(self, cache_dir='cache', expiry_hours=24):
self.cache_dir = cache_dir
self.expiry = timedelta(hours=expiry_hours)
os.makedirs(cache_dir, exist_ok=True)
def get_cache_path(self, url):
"""Generate cache file path"""
filename = url.replace('/', '_').replace(':', '_')
return os.path.join(self.cache_dir, f"{filename}.pkl")
def get(self, url):
"""Get cached content"""
cache_path = self.get_cache_path(url)
if not os.path.exists(cache_path):
return None
# Check if cache is expired
file_time = datetime.fromtimestamp(os.path.getmtime(cache_path))
if datetime.now() - file_time > self.expiry:
return None
with open(cache_path, 'rb') as f:
return pickle.load(f)
def set(self, url, content):
"""Save content to cache"""
cache_path = self.get_cache_path(url)
with open(cache_path, 'wb') as f:
pickle.dump(content, f)
# Usage
cache = FileCache()
def fetch_with_cache(url):
# Try cache first
content = cache.get(url)
if content:
print("Using cached version")
return content
# Fetch if not cached
response = requests.get(url)
content = response.text
cache.set(url, content)
return content
8.6 Data Validation and Cleaning
import re
import pandas as pd
def clean_text(text):
"""Clean extracted text"""
if not text:
return ""
# Remove extra whitespace
text = ' '.join(text.split())
# Remove special characters if needed
text = re.sub(r'[^\w\s.,!?-]', '', text)
return text.strip()
def clean_price(price_string):
"""Extract numeric price from string"""
# Remove currency symbols and extract number
match = re.search(r'[\d,]+\.?\d*', price_string)
if match:
number = match.group().replace(',', '')
return float(number)
return None
def validate_data(df):
"""Validate scraped data"""
# Remove duplicates
df = df.drop_duplicates()
# Remove rows with missing values
df = df.dropna(subset=['title', 'price'])
# Validate price range
df = df[df['price'] > 0]
df = df[df['price'] < 10000]
# Clean text fields
df['title'] = df['title'].apply(clean_text)
return df
# Usage
data = []
for item in items:
title = clean_text(item.find('h2').text)
price = clean_price(item.find('span', class_='price').text)
if title and price:
data.append({'title': title, 'price': price})
df = pd.DataFrame(data)
df = validate_data(df)
9. Common Challenges and Solutions
9.1 Handling AJAX/Dynamic Content
Problem: Content loads after page load via JavaScript.
Solution:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get(url)
# Wait for AJAX content
wait = WebDriverWait(driver, 10)
element = wait.until(
EC.presence_of_element_located((By.CLASS_NAME, "ajax-content"))
)
# Or inspect network requests and call API directly
import requests
api_url = "https://example.com/api/data"
response = requests.get(api_url)
data = response.json()
9.2 Bypassing CAPTCHAs
Problem: Sites use CAPTCHAs to prevent bots.
Solutions:
- Use official APIs instead
- Contact site owner for permission
- Use CAPTCHA solving services (2captcha, Anti-Captcha)
- Implement human-like behavior patterns
# Example: More human-like scraping
import time
import random
from selenium.webdriver.common.action_chains import ActionChains
def human_like_delay():
time.sleep(random.uniform(1, 3))
def human_like_scroll(driver):
# Scroll gradually like a human
for i in range(5):
driver.execute_script(f"window.scrollTo(0, {i * 200});")
time.sleep(random.uniform(0.1, 0.3))
def human_like_mouse_movement(driver, element):
actions = ActionChains(driver)
actions.move_to_element(element)
actions.pause(random.uniform(0.1, 0.5))
actions.click()
actions.perform()
9.3 Handling Session and Cookies
Problem: Need to maintain session across requests.
Solution:
import requests
# Use session object
session = requests.Session()
# Login
login_data = {'username': 'user', 'password': 'pass'}
session.post('https://example.com/login', data=login_data)
# Session maintains cookies automatically
response = session.get('https://example.com/protected-page')
# Manually set cookies
cookies = {'session_id': 'abc123'}
response = requests.get(url, cookies=cookies)
# Save and load cookies with Selenium
from selenium import webdriver
import pickle
driver = webdriver.Chrome()
driver.get('https://example.com')
# Save cookies
cookies = driver.get_cookies()
pickle.dump(cookies, open('cookies.pkl', 'wb'))
# Load cookies
driver.get('https://example.com')
cookies = pickle.load(open('cookies.pkl', 'rb'))
for cookie in cookies:
driver.add_cookie(cookie)
driver.refresh()
9.4 Handling Infinite Scroll
Problem: Content loads continuously as you scroll.
Solution:
from selenium import webdriver
import time
def scrape_infinite_scroll(url, scroll_pause=2, max_scrolls=10):
driver = webdriver.Chrome()
driver.get(url)
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
scrolls = 0
while scrolls < max_scrolls:
# Scroll down
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait for content to load
time.sleep(scroll_pause)
# Calculate new scroll height
new_height = driver.execute_script("return document.body.scrollHeight")
# Break if reached bottom
if new_height == last_height:
break
last_height = new_height
scrolls += 1
# Extract data
html = driver.page_source
driver.quit()
return html
9.5 Handling Rate Limiting (429 Errors)
Problem: Too many requests cause blocks.
Solution:
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def requests_with_retry():
"""Create session with automatic retries"""
session = requests.Session()
retry = Retry(
total=5,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504],
method_whitelist=["HEAD", "GET", "OPTIONS"]
)
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
# Usage
session = requests_with_retry()
def scrape_with_backoff(url, max_retries=3):
"""Exponential backoff on rate limiting"""
for attempt in range(max_retries):
response = session.get(url)
if response.status_code == 429:
wait_time = 2 ** attempt # Exponential backoff
print(f"Rate limited. Waiting {wait_time} seconds...")
time.sleep(wait_time)
continue
if response.status_code == 200:
return response
return None
10. Real-World Projects
Project 1: E-commerce Price Tracker
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import smtplib
from email.mime.text import MIMEText
class PriceTracker:
"""Track product prices on e-commerce sites"""
def __init__(self, products_csv='products.csv'):
self.products_csv = products_csv
self.products = self.load_products()
def load_products(self):
"""Load products to track"""
try:
return pd.read_csv(self.products_csv)
except FileNotFoundError:
return pd.DataFrame(columns=['name', 'url', 'target_price'])
def get_price(self, url):
"""Extract price from product page"""
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# This selector varies by site - adjust accordingly
price_elem = soup.find('span', class_='price')
if price_elem:
price_text = price_elem.text
# Extract numeric value
import re
match = re.search(r'[\d,]+\.?\d*', price_text)
if match:
return float(match.group().replace(',', ''))
return None
def check_prices(self):
"""Check all product prices"""
results = []
for _, product in self.products.iterrows():
current_price = self.get_price(product['url'])
if current_price:
result = {
'name': product['name'],
'current_price': current_price,
'target_price': product['target_price'],
'timestamp': datetime.now(),
'alert': current_price <= product['target_price']
}
results.append(result)
print(f"{product['name']}: ${current_price}")
return pd.DataFrame(results)
def send_alert(self, product_name, current_price, target_price):
"""Send email alert for price drop"""
msg = MIMEText(
f"Price alert for {product_name}!\n"
f"Current price: ${current_price}\n"
f"Target price: ${target_price}"
)
msg['Subject'] = f'Price Alert: {product_name}'
msg['From'] = 'your_email@gmail.com'
msg['To'] = 'recipient@gmail.com'
# Send email (configure SMTP settings)
# smtp = smtplib.SMTP('smtp.gmail.com', 587)
# smtp.starttls()
# smtp.login('your_email@gmail.com', 'password')
# smtp.send_message(msg)
# smtp.quit()
def run(self):
"""Run price tracking"""
results = self.check_prices()
# Check for alerts
alerts = results[results['alert']]
for _, alert in alerts.iterrows():
print(f"ALERT: {alert['name']} is now ${alert['current_price']}!")
# self.send_alert(...)
# Save history
history_file = 'price_history.csv'
try:
history = pd.read_csv(history_file)
history = pd.concat([history, results], ignore_index=True)
except FileNotFoundError:
history = results
history.to_csv(history_file, index=False)
return results
# Usage
if __name__ == "__main__":
tracker = PriceTracker()
results = tracker.run()
print(results)
Project 2: News Aggregator
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import feedparser  # third-party: pip install feedparser
class NewsAggregator:
"""Aggregate news from multiple sources"""
def __init__(self):
self.sources = {
'TechCrunch': 'https://techcrunch.com',
'The Verge': 'https://www.theverge.com',
}
self.articles = []
def scrape_rss(self, rss_url):
"""Scrape articles from RSS feed"""
feed = feedparser.parse(rss_url)
articles = []
for entry in feed.entries[:10]: # Top 10 articles
article = {
'title': entry.title,
'link': entry.link,
'published': entry.published,
'summary': entry.summary if hasattr(entry, 'summary') else ''
}
articles.append(article)
return articles
def scrape_website(self, url, source_name):
"""Scrape articles from website"""
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
articles = []
# This varies by site - customize selectors
article_elements = soup.find_all('article', limit=10)
for article_elem in article_elements:
title_elem = article_elem.find('h2')
link_elem = article_elem.find('a')
if title_elem and link_elem:
article = {
'source': source_name,
'title': title_elem.text.strip(),
'link': link_elem.get('href'),
'timestamp': datetime.now()
}
articles.append(article)
return articles
def aggregate(self):
"""Aggregate news from all sources"""
all_articles = []
for source_name, url in self.sources.items():
print(f"Scraping {source_name}...")
articles = self.scrape_website(url, source_name)
all_articles.extend(articles)
df = pd.DataFrame(all_articles)
df = df.drop_duplicates(subset=['title'])
df.to_csv('news_aggregated.csv', index=False)
return df
def filter_by_keywords(self, df, keywords):
"""Filter articles by keywords"""
mask = df['title'].str.contains('|'.join(keywords), case=False)
return df[mask]
# Usage
if __name__ == "__main__":
aggregator = NewsAggregator()
articles = aggregator.aggregate()
# Filter by topics
ai_articles = aggregator.filter_by_keywords(
articles,
['AI', 'artificial intelligence', 'machine learning']
)
print(f"Found {len(ai_articles)} AI-related articles")
print(ai_articles[['source', 'title']])
Project 3: Job Listing Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re
class JobScraper:
"""Scrape job listings from multiple sites"""
def __init__(self, keywords, location=''):
self.keywords = keywords
self.location = location
self.jobs = []
def scrape_indeed(self):
"""Scrape jobs from Indeed"""
base_url = "https://www.indeed.com/jobs"
params = {
'q': ' '.join(self.keywords),
'l': self.location
}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(base_url, params=params, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# Find job cards (selector may need updating)
job_cards = soup.find_all('div', class_='job_seen_beacon')
jobs = []
for card in job_cards:
try:
title_elem = card.find('h2', class_='jobTitle')
company_elem = card.find('span', class_='companyName')
location_elem = card.find('div', class_='companyLocation')
job = {
'title': title_elem.text.strip() if title_elem else 'N/A',
'company': company_elem.text.strip() if company_elem else 'N/A',
'location': location_elem.text.strip() if location_elem else 'N/A',
'source': 'Indeed',
'scraped_date': datetime.now()
}
jobs.append(job)
except Exception as e:
continue
return jobs
def scrape_linkedin(self):
"""Scrape jobs from LinkedIn (requires authentication)"""
# LinkedIn requires login - use Selenium or API
pass
def clean_salary(self, salary_text):
"""Extract salary range from text"""
if not salary_text:
return None, None
# Extract numbers
numbers = re.findall(r'\d+,?\d*', salary_text)
if len(numbers) >= 2:
min_salary = int(numbers[0].replace(',', ''))
max_salary = int(numbers[1].replace(',', ''))
return min_salary, max_salary
return None, None
def aggregate_jobs(self):
"""Aggregate jobs from all sources"""
all_jobs = []
print("Scraping Indeed...")
indeed_jobs = self.scrape_indeed()
all_jobs.extend(indeed_jobs)
# Add more sources
# all_jobs.extend(self.scrape_linkedin())
df = pd.DataFrame(all_jobs)
df = df.drop_duplicates(subset=['title', 'company'])
return df
def save_results(self, df, filename='jobs.csv'):
"""Save results to CSV"""
df.to_csv(filename, index=False)
print(f"Saved {len(df)} jobs to {filename}")
def filter_by_salary(self, df, min_salary):
"""Filter jobs by minimum salary"""
# Implement salary filtering logic
pass
# Usage
if __name__ == "__main__":
scraper = JobScraper(
keywords=['Python', 'Developer'],
location='New York'
)
jobs_df = scraper.aggregate_jobs()
scraper.save_results(jobs_df)
print(f"Found {len(jobs_df)} jobs")
print(jobs_df.head())
Conclusion
Web scraping is a powerful skill that opens up endless possibilities for data collection and analysis. Remember to:
- Be Ethical: Respect robots.txt, terms of service, and rate limits
- Start Simple: Begin with basic scraping before moving to complex scenarios
- Handle Errors: Always implement proper error handling
- Stay Updated: Websites change frequently – maintain your scrapers
- Use APIs: When available, an official API is more reliable and maintainable than scraping
Next Steps
- Practice on scraping-friendly sites like quotes.toscrape.com (used in section 5.3) and books.toscrape.com
- Explore frameworks:
- Scrapy: Full-featured scraping framework (a short spider sketch follows this list)
- Playwright: Modern browser automation
- Puppeteer: Node.js browser automation
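To give a taste of Scrapy, here is a minimal spider that mirrors the quotes example from section 5.3. It follows the pattern of the official Scrapy tutorial; the CSS selectors assume quotes.toscrape.com's markup and may need adjusting for other sites:

import scrapy

class QuotesSpider(scrapy.Spider):
    """Run with: scrapy runspider quotes_spider.py -o quotes.json"""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]
    custom_settings = {"DOWNLOAD_DELAY": 1}  # one request per second, to stay polite

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("a.tag::text").getall(),
            }
        # Follow pagination until there is no "Next" link left
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)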
Additional Resources
- Documentation: the official Requests, BeautifulSoup, and Selenium docs
- Books:
- “Web Scraping with Python” by Ryan Mitchell
- “Automate the Boring Stuff with Python” by Al Sweigart
- Practice Sites: quotes.toscrape.com and books.toscrape.com
Python Web Scraping Cheatsheet
Quick Reference Guide
1. Installation & Setup
# Essential libraries
pip install requests beautifulsoup4 lxml selenium pandas
# Optional but useful
pip install scrapy playwright webdriver-manager fake-useragent
2. Requests Library
Basic GET Request
import requests
response = requests.get('https://example.com')
html = response.text
content = response.content # bytes
status = response.status_code
Headers & User-Agent
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
'Accept': 'text/html,application/xhtml+xml',
'Accept-Language': 'en-US,en;q=0.9'
}
response = requests.get(url, headers=headers)
Parameters & POST Requests
# GET with parameters
params = {'search': 'python', 'page': 1}
response = requests.get(url, params=params)
# POST with data
data = {'username': 'user', 'password': 'pass'}
response = requests.post(url, data=data)
# POST with JSON
json_data = {'key': 'value'}
response = requests.post(url, json=json_data)
Sessions & Cookies
session = requests.Session()
session.get('https://example.com/login')
session.post('https://example.com/auth', data=login_data)
response = session.get('https://example.com/profile')
Timeouts & Retries
# Timeout
response = requests.get(url, timeout=5)
# Proxies
proxies = {'http': 'http://proxy:8080', 'https': 'https://proxy:8080'}
response = requests.get(url, proxies=proxies)
# Verify SSL (disable for testing only)
response = requests.get(url, verify=False)
3. BeautifulSoup Parsing
Initialize Soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml') # Fast parser
soup = BeautifulSoup(html, 'html.parser') # Built-in parser
Finding Elements
# Find first matching element
soup.find('div')
soup.find('div', class_='content')
soup.find('div', id='main')
soup.find('a', href=True)
soup.find('div', attrs={'data-id': '123'})
# Find all matching elements
soup.find_all('p')
soup.find_all('div', class_='item')
soup.find_all(['h1', 'h2', 'h3'])
soup.find_all('a', limit=5)
CSS Selectors
soup.select('div.content') # Class selector
soup.select('#main') # ID selector
soup.select('div > p') # Direct child
soup.select('div p') # Descendant
soup.select('a[href]') # Has attribute
soup.select('a[href^="http"]') # Starts with
soup.select('a[href$=".pdf"]') # Ends with
soup.select('div:nth-of-type(2)') # Nth element
soup.select_one('div.content') # First match only
Extracting Data
# Get text
element.text
element.get_text()
element.get_text(strip=True)
element.string # Direct text only
# Get attributes
element.get('href')
element['href']
element.attrs # All attributes dict
# Check existence
if element:
print("Element exists")PythonNavigation
# Parents
element.parent
element.parents
# Siblings
element.next_sibling
element.previous_sibling
element.next_siblings
element.previous_siblings
# Children
element.children # Direct children
element.descendants # All descendants
# Finding relatives
element.find_next('p')
element.find_previous('div')
element.find_next_sibling('span')
4. Selenium WebDriver
Setup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
# Basic setup
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
# Headless mode
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(service=service, options=options)
Finding Elements
# Single element
driver.find_element(By.ID, 'element-id')
driver.find_element(By.NAME, 'element-name')
driver.find_element(By.CLASS_NAME, 'class-name')
driver.find_element(By.TAG_NAME, 'div')
driver.find_element(By.CSS_SELECTOR, 'div.content')
driver.find_element(By.XPATH, '//div[@class="content"]')
driver.find_element(By.LINK_TEXT, 'Click Here')
driver.find_element(By.PARTIAL_LINK_TEXT, 'Click')
# Multiple elements
driver.find_elements(By.CLASS_NAME, 'item')
Interactions
# Click
element.click()
# Type text
element.send_keys('text to type')
element.send_keys(Keys.RETURN)
element.send_keys(Keys.CONTROL, 'a')
# Clear input
element.clear()
# Get values
element.text
element.get_attribute('href')
element.get_attribute('value')
# Check states
element.is_displayed()
element.is_enabled()
element.is_selected()
Waits
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Implicit wait (global)
driver.implicitly_wait(10)
# Explicit wait
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, 'myElement')))
element = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'content')))
element = wait.until(EC.element_to_be_clickable((By.ID, 'button')))
# Wait conditions
EC.title_contains('Expected')
EC.title_is('Exact Title')
EC.presence_of_element_located()
EC.visibility_of_element_located()
EC.element_to_be_clickable()
EC.staleness_of()
EC.frame_to_be_available_and_switch_to_it()
Navigation & Actions
# Navigation
driver.get('https://example.com')
driver.back()
driver.forward()
driver.refresh()
# JavaScript execution
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
driver.execute_script('arguments[0].click();', element)
# Screenshots
driver.save_screenshot('screenshot.png')
# Window management
driver.maximize_window()
driver.set_window_size(1920, 1080)
# Cookies
driver.get_cookies()
driver.add_cookie({'name': 'key', 'value': 'value'})
driver.delete_all_cookies()
# Close
driver.close() # Current window
driver.quit() # All windows
Switching Contexts
# Windows/Tabs
main_window = driver.current_window_handle
all_windows = driver.window_handles
driver.switch_to.window(all_windows[1])
# Frames
driver.switch_to.frame('frame-name')
driver.switch_to.frame(0) # By index
driver.switch_to.default_content() # Back to main
# Alerts
alert = driver.switch_to.alert
alert.accept()
alert.dismiss()
alert.send_keys('text')
alert_text = alert.text
5. Common Patterns
Safe Element Finding
# BeautifulSoup
element = soup.find('h1')
text = element.text if element else 'N/A'
# Alternative
text = soup.find('h1').text if soup.find('h1') else 'N/A'
# With get_text
text = element.get_text(strip=True) if element else ''
Pagination Loop
page = 1
while True:
url = f'{base_url}?page={page}'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
items = soup.find_all('div', class_='item')
if not items:
break
# Process items
for item in items:
# Extract data
pass
page += 1
time.sleep(1)
Infinite Scroll
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get(url)
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(2)
new_height = driver.execute_script('return document.body.scrollHeight')
if new_height == last_height:
break
last_height = new_height
Rate Limiting
import time
from datetime import datetime
def rate_limit(calls_per_second=1):
min_interval = 1.0 / calls_per_second
last_called = [0.0]
def decorator(func):
def wrapper(*args, **kwargs):
elapsed = time.time() - last_called[0]
if elapsed < min_interval:
time.sleep(min_interval - elapsed)
result = func(*args, **kwargs)
last_called[0] = time.time()
return result
return wrapper
return decorator
@rate_limit(calls_per_second=2)
def scrape_page(url):
return requests.get(url)
6. Data Extraction & Cleaning
Text Cleaning
import re
# Remove whitespace
text = ' '.join(text.split())
# Remove special characters
text = re.sub(r'[^\w\s]', '', text)
# Extract numbers
numbers = re.findall(r'\d+', text)
# Extract price
price = re.search(r'\$?([\d,]+\.?\d*)', text)
if price:
price = float(price.group(1).replace(',', ''))
List Comprehension
# Extract text from elements
texts = [elem.text for elem in elements]
# Extract with condition
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
# Extract and clean
prices = [float(p.text.strip('$')) for p in soup.find_all('span', class_='price')]
Pandas DataFrame
import pandas as pd
# From list of dicts
data = [{'name': 'Item 1', 'price': 10}, {'name': 'Item 2', 'price': 20}]
df = pd.DataFrame(data)
# Save to CSV
df.to_csv('output.csv', index=False)
# Save to Excel
df.to_excel('output.xlsx', index=False)
# Save to JSON
df.to_json('output.json', orient='records', indent=4)
7. Error Handling
Try-Except Blocks
import requests
from requests.exceptions import RequestException
try:
response = requests.get(url, timeout=10)
response.raise_for_status()
except requests.Timeout:
print('Request timed out')
except requests.HTTPError as e:
print(f'HTTP error: {e}')
except RequestException as e:
print(f'Request failed: {e}')
except Exception as e:
print(f'Unexpected error: {e}')
Selenium Exceptions
from selenium.common.exceptions import (
NoSuchElementException,
TimeoutException,
StaleElementReferenceException
)
try:
element = driver.find_element(By.ID, 'element-id')
except NoSuchElementException:
print('Element not found')
except TimeoutException:
print('Timed out waiting for element')
except StaleElementReferenceException:
print('Element is no longer attached to DOM')
8. Useful Code Snippets
Check robots.txt
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
can_fetch = rp.can_fetch('*', 'https://example.com/page')
Random User Agent
import random
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
headers = {'User-Agent': random.choice(USER_AGENTS)}
Download File
import requests
response = requests.get(file_url, stream=True)
with open('file.pdf', 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
Parse JSON from API
response = requests.get(api_url)
data = response.json()
# Or
import json
data = json.loads(response.text)
Save to Database (SQLite)
import sqlite3
conn = sqlite3.connect('scraped_data.db')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS items (
id INTEGER PRIMARY KEY,
title TEXT,
price REAL
)
''')
cursor.execute('INSERT INTO items (title, price) VALUES (?, ?)', ('Item', 19.99))
conn.commit()
conn.close()
9. XPath Quick Reference
# Basic
'//div' # All divs
'//div[@class="content"]' # Div with class
'//div[@id="main"]' # Div with id
'//a[@href]' # Links with href
# Text
'//h1/text()' # Text content
'//div[contains(text(), "Hello")]' # Contains text
'//div[starts-with(text(), "Hello")]'# Starts with
# Attributes
'//img/@src' # Get src attribute
'//a[contains(@href, ".pdf")]' # Href contains .pdf
# Position
'//div[1]' # First div
'//div[last()]' # Last div
'//div[position() < 3]' # First two divs
# Axes
'//div/parent::*' # Parent
'//div/following-sibling::*' # Next siblings
'//div/preceding-sibling::*' # Previous siblings
'//div/descendant::p' # All p descendants
10. Performance Tips
# Use lxml parser (fastest)
soup = BeautifulSoup(html, 'lxml')
# Limit results
soup.find_all('div', limit=10)
# Use CSS selectors (faster than find_all)
soup.select('div.item')
# Compile regex patterns
import re
pattern = re.compile(r'\d+')
numbers = pattern.findall(text)
# Multiprocessing for multiple URLs
from multiprocessing import Pool
def scrape(url):
# Scraping logic
return data
with Pool(5) as p:
results = p.map(scrape, urls)
# Async requests
import asyncio
import aiohttp
async def fetch(session, url):
async with session.get(url) as response:
return await response.text()
async def main():
async with aiohttp.ClientSession() as session:
tasks = [fetch(session, url) for url in urls]
results = await asyncio.gather(*tasks)
asyncio.run(main())
11. Debugging Tips
# Print response
print(response.status_code)
print(response.headers)
print(response.text[:500]) # First 500 chars
# Pretty print soup
print(soup.prettify()[:1000])
# Check if element exists
element = soup.find('div', class_='content')
print(f"Element found: {element is not None}")
# Save HTML for inspection
with open('page.html', 'w', encoding='utf-8') as f:
f.write(response.text)
# Selenium debugging
print(driver.current_url)
print(driver.title)
print(driver.page_source[:500])
# Wait and see (Selenium)
import time
time.sleep(5) # Pause to see what's happening
12. Common Selectors Cheatsheet
| Element Type | BeautifulSoup | CSS Selector | XPath |
|---|---|---|---|
| By tag | find('div') | select('div') | //div |
| By class | find('div', class_='item') | select('.item') | //div[@class="item"] |
| By id | find(id='main') | select('#main') | //*[@id="main"] |
| By attribute | find('a', href=True) | select('a[href]') | //a[@href] |
| Direct child | N/A | select('div > p') | //div/p |
| Descendant | N/A | select('div p') | //div//p |
| Multiple | find_all(['h1','h2']) | select('h1, h2') | //h1 \| //h2 |
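As a quick sanity check that the three approaches in the table reach the same data: BeautifulSoup itself does not evaluate XPath, so the XPath column is shown here through lxml; the HTML fragment is made up purely for illustration:

from bs4 import BeautifulSoup
from lxml import html as lxml_html

page = '<div class="item" id="main"><a href="/doc.pdf">Doc</a></div>'

soup = BeautifulSoup(page, 'lxml')
tree = lxml_html.fromstring(page)

print(soup.find('div', class_='item')['id'])             # BeautifulSoup find
print(soup.select_one('#main a[href$=".pdf"]')['href'])  # CSS selector
print(tree.xpath('//div[@class="item"]/a/@href')[0])     # XPath via lxml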
Quick Troubleshooting
| Problem | Solution |
|---|---|
| Empty results | Check selectors, wait for JS loading, verify page structure |
| 403/429 errors | Add User-Agent, reduce request rate, use proxies |
| Timeout errors | Increase timeout, check internet connection |
| Element not found | Wait for dynamic content, verify selector |
| Encoding issues | Use response.content with proper encoding |
| Cookie/Session | Use requests.Session() or Selenium |
| CAPTCHA | Use APIs, contact site owner, reduce frequency |
| Dynamic content | Use Selenium or check for JSON/API calls |
Happy Scraping! 🚀