Table of Contents

    1. Introduction to Web Scraping
    2. Legal and Ethical Considerations
    3. Understanding HTML Structure
    4. Setting Up Your Environment
    5. Basic Web Scraping with Requests and BeautifulSoup
    6. Advanced Techniques
    7. Handling Dynamic Content with Selenium
    8. Best Practices
    9. Common Challenges and Solutions
    10. Real-World Projects

    1. Introduction to Web Scraping

    Web scraping is the automated process of extracting data from websites. It’s a powerful technique used for:

    • Data Analysis: Collecting data for research and analytics
    • Price Monitoring: Tracking competitor prices
    • Content Aggregation: Building news or product aggregators
    • Lead Generation: Gathering business contact information
    • Market Research: Analyzing trends and patterns

    When to Use Web Scraping

    Good Use Cases:

    • Public data that’s not available via API
    • Monitoring website changes
    • Research and academic purposes
    • Personal projects

    When NOT to Use:

    • If an official API exists, use it instead
    • When explicitly prohibited by terms of service
    • For scraping personal or private data

    2. Legal and Ethical Considerations

    Before scraping any website, consider:

    1. Terms of Service (ToS): Always review the website’s ToS
    2. robots.txt: Check the robots.txt file (e.g., https://example.com/robots.txt)
    3. Copyright: Respect intellectual property rights
    4. Rate Limiting: Don’t overload servers with requests

    Ethical Guidelines

    # Example: Checking robots.txt
    from urllib.robotparser import RobotFileParser
    
    def can_scrape(url, user_agent='*'):
        """Check if scraping is allowed"""
        rp = RobotFileParser()
        rp.set_url(url + '/robots.txt')
        rp.read()
        return rp.can_fetch(user_agent, url)
    
    # Usage
    website = "https://example.com"
    if can_scrape(website):
        print("Scraping is allowed!")
    else:
        print("Scraping is not allowed.")
    Python

    Best Practices:

    • Add delays between requests
    • Identify your bot with a proper User-Agent
    • Respect rate limits
    • Use APIs when available
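
    A minimal sketch that puts these habits together, assuming a short list of placeholder URLs, a made-up bot User-Agent string, and a fixed one-second delay:

    import time
    import requests
    
    # Placeholder URLs - replace with pages you are actually allowed to scrape
    urls = [
        "https://example.com/page/1",
        "https://example.com/page/2",
    ]
    
    # Identify the bot honestly instead of impersonating a browser
    headers = {"User-Agent": "MyScraperBot/1.0 (contact: you@example.com)"}
    
    for url in urls:
        response = requests.get(url, headers=headers, timeout=10)
        print(url, response.status_code)
        time.sleep(1)  # fixed delay between requests to respect rate limits
    Python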

    3. Understanding HTML Structure

    HTML Basics

    HTML (HyperText Markup Language) structures web pages using tags:

    <!DOCTYPE html>
    <html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>Main Heading</h1>
        <div class="content">
            <p>This is a paragraph.</p>
            <ul id="items">
                <li class="item">Item 1</li>
                <li class="item">Item 2</li>
            </ul>
        </div>
        <a href="https://example.com">Link</a>
    </body>
    </html>
    HTML

    Key HTML Elements for Scraping

    • Tags: <div>, <span>, <p>, <h1>, etc.
    • Attributes: class, id, href, src
    • Text: Content between opening and closing tags
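
    As a quick preview of the parsing covered in Section 5, here is a minimal sketch that pulls out each of these pieces from a fragment of the sample page above:

    from bs4 import BeautifulSoup
    
    html = '<div class="content"><a href="https://example.com">Link</a></div>'
    soup = BeautifulSoup(html, 'html.parser')
    
    link = soup.find('a')        # the <a> tag
    print(link.name)             # tag name: 'a'
    print(link['href'])          # attribute: 'https://example.com'
    print(link.text)             # text content: 'Link'
    Python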

    CSS Selectors

    CSS selectors help locate elements:

    Selector   | Example  | Description
    -----------|----------|------------------------------------------
    Tag        | p        | Selects all <p> elements
    Class      | .content | Selects elements with class="content"
    ID         | #items   | Selects element with id="items"
    Descendant | div p    | Selects <p> inside <div>
    Child      | div > p  | Direct child only
    Attribute  | [href]   | Elements with href attribute
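
    The same selectors can be tried with BeautifulSoup's select() method (introduced in Section 5); a minimal sketch against the sample HTML above:

    from bs4 import BeautifulSoup
    
    html = '''
    <div class="content">
        <ul id="items">
            <li class="item">Item 1</li>
            <li class="item">Item 2</li>
        </ul>
    </div>
    '''
    soup = BeautifulSoup(html, 'html.parser')
    
    print(soup.select('.item'))       # class selector -> both <li> items
    print(soup.select_one('#items'))  # id selector -> the <ul>
    print(soup.select('div > ul'))    # direct child of <div>
    print(soup.select('li[class]'))   # elements with a class attribute
    Python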

    XPath (Alternative to CSS)

    # XPath examples
    /html/body/div[1]           # First div in body
    //div[@class='content']     # Any div with class='content'
    //a[@href]                  # All links with href
    //li[text()='Item 1']       # List item with specific text
    Python
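
    BeautifulSoup does not evaluate XPath itself; the lxml library (installed in the next section as a parser) does. A minimal sketch, assuming a small HTML fragment:

    from lxml import html
    
    tree = html.fromstring('<div class="content"><p>Hello</p></div>')
    
    # xpath() returns a list of matching elements (or strings for text/attribute queries)
    paragraphs = tree.xpath("//div[@class='content']/p")
    print(paragraphs[0].text)         # 'Hello'
    
    links = tree.xpath('//a[@href]')  # empty list here - no links in the fragment
    print(links)
    Python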

    4. Setting Up Your Environment

    Installing Required Libraries

    # Core libraries
    pip install requests beautifulsoup4 lxml
    
    # For dynamic content
    pip install selenium
    
    # Additional useful libraries
    pip install pandas  # Data manipulation
    pip install scrapy  # Advanced scraping framework
    Bash

    Creating a Virtual Environment

    # Windows
    python -m venv scraping_env
    scraping_env\Scripts\activate
    
    # Linux/Mac
    python3 -m venv scraping_env
    source scraping_env/bin/activate
    Bash

    Verify Installation

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    print("All libraries installed successfully!")
    Python

    5. Basic Web Scraping with Requests and BeautifulSoup

    5.1 Making HTTP Requests

    import requests
    
    # Basic GET request
    url = "https://example.com"
    response = requests.get(url)
    
    # Check status code
    if response.status_code == 200:
        print("Success!")
        html_content = response.text
    else:
        print(f"Error: {response.status_code}")
    
    # Adding headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    
    # Handling timeouts
    try:
        response = requests.get(url, timeout=5)
    except requests.Timeout:
        print("Request timed out")
    except requests.RequestException as e:
        print(f"Error: {e}")
    Python

    5.2 Parsing HTML with BeautifulSoup

    from bs4 import BeautifulSoup
    import requests
    
    # Fetch and parse
    url = "https://example.com"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    
    # Finding elements
    title = soup.find('h1')                    # First h1
    all_paragraphs = soup.find_all('p')        # All paragraphs
    by_class = soup.find('div', class_='content')
    by_id = soup.find(id='main-content')
    
    # CSS Selectors
    items = soup.select('.item')               # All elements with class 'item'
    first_item = soup.select_one('#first')     # Element with id 'first'
    
    # Extracting text
    print(title.text)                          # Get text content
    print(title.get_text(strip=True))          # Remove whitespace
    
    # Extracting attributes
    link = soup.find('a')
    href = link.get('href')                    # Get href attribute
    href = link['href']                        # Alternative syntax
    Python

    5.3 Complete Example: Scraping Quotes

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import time
    
    def scrape_quotes():
        """Scrape quotes from quotes.toscrape.com"""
        base_url = "http://quotes.toscrape.com/page/"
        all_quotes = []
    
        for page in range(1, 6):  # Scrape first 5 pages
            print(f"Scraping page {page}...")
            url = f"{base_url}{page}/"
    
            # Add headers
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
            }
    
            response = requests.get(url, headers=headers)
    
            if response.status_code != 200:
                print(f"Failed to fetch page {page}")
                continue
    
            soup = BeautifulSoup(response.content, 'lxml')
    
            # Find all quote containers
            quotes = soup.find_all('div', class_='quote')
    
            for quote in quotes:
                # Extract quote text
                text = quote.find('span', class_='text').text
    
                # Extract author
                author = quote.find('small', class_='author').text
    
                # Extract tags
                tags = [tag.text for tag in quote.find_all('a', class_='tag')]
    
                all_quotes.append({
                    'quote': text,
                    'author': author,
                    'tags': ', '.join(tags)
                })
    
            # Be polite - add delay between requests
            time.sleep(1)
    
        # Save to CSV
        df = pd.DataFrame(all_quotes)
        df.to_csv('quotes.csv', index=False)
        print(f"Scraped {len(all_quotes)} quotes!")
    
        return df
    
    # Run the scraper
    if __name__ == "__main__":
        quotes_df = scrape_quotes()
        print(quotes_df.head())
    Python

    5.4 Navigating the DOM Tree

    from bs4 import BeautifulSoup
    
    html = """
    <div class="container">
        <h1>Title</h1>
        <div class="content">
            <p>First paragraph</p>
            <p>Second paragraph</p>
        </div>
    </div>
    """
    
    soup = BeautifulSoup(html, 'lxml')
    
    # Parent navigation
    p = soup.find('p')
    parent = p.parent                    # Get parent element
    parents = p.parents                  # All parents (generator)
    
    # Sibling navigation
    first_p = soup.find('p')
    next_sibling = first_p.find_next_sibling()
    prev_sibling = first_p.find_previous_sibling()
    
    # Child navigation
    container = soup.find('div', class_='container')
    children = container.children        # Direct children (generator)
    descendants = container.descendants  # All descendants (generator)
    
    # Finding next/previous elements
    h1 = soup.find('h1')
    next_elem = h1.find_next('p')       # Next <p> tag
    Python

    6. Advanced Techniques

    6.1 Handling Pagination

    import time
    import requests
    from bs4 import BeautifulSoup
    
    def scrape_paginated_site(base_url, max_pages=10):
        """Scrape multiple pages"""
        all_data = []
    
        for page in range(1, max_pages + 1):
            url = f"{base_url}?page={page}"
            response = requests.get(url)
    
            if response.status_code != 200:
                break
    
            soup = BeautifulSoup(response.content, 'lxml')
    
            # Extract data from the current page first
            items = soup.find_all('div', class_='item')
            for item in items:
                data = {
                    'title': item.find('h2').text,
                    'price': item.find('span', class_='price').text
                }
                all_data.append(data)
    
            # Stop after this page if there is no "Next" button
            next_button = soup.find('a', class_='next')
            if not next_button:
                print(f"Reached last page at page {page}")
                break
    
            time.sleep(1)  # Rate limiting
    
        return all_data
    Python

    6.2 Handling Forms and POST Requests

    import requests
    
    # POST request with form data
    login_url = "https://example.com/login"
    login_data = {
        'username': 'your_username',
        'password': 'your_password'
    }
    
    # Create a session to persist cookies
    session = requests.Session()
    
    # Send POST request
    response = session.post(login_url, data=login_data)
    
    # Access protected pages using the same session
    protected_url = "https://example.com/dashboard"
    response = session.get(protected_url)
    
    soup = BeautifulSoup(response.content, 'lxml')
    Python

    6.3 Working with JSON APIs

    import requests
    import json
    
    # Fetch JSON data
    api_url = "https://api.example.com/data"
    response = requests.get(api_url)
    
    # Parse JSON
    data = response.json()
    
    # Alternative
    data = json.loads(response.text)
    
    # Extract specific fields
    for item in data['results']:
        print(item['name'], item['price'])
    
    # Save to file
    with open('data.json', 'w') as f:
        json.dump(data, f, indent=4)
    Python

    6.4 Handling Different Encodings

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://example.com"
    response = requests.get(url)
    
    # Check encoding
    print(response.encoding)
    
    # Set encoding manually if needed
    response.encoding = 'utf-8'
    
    # Or let BeautifulSoup handle it
    soup = BeautifulSoup(response.content, 'lxml', from_encoding='utf-8')
    Python

    6.5 Working with Tables

    import pandas as pd
    from bs4 import BeautifulSoup
    import requests
    
    def scrape_table(url):
        """Scrape HTML tables"""
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
    
        # Method 1: Using pandas (easiest for simple tables)
        tables = pd.read_html(response.text)
        df = tables[0]  # First table
    
        # Method 2: Manual parsing (more control)
        table = soup.find('table')
    
        # Extract headers
        headers = []
        for th in table.find_all('th'):
            headers.append(th.text.strip())
    
        # Extract rows
        rows = []
        for tr in table.find_all('tr')[1:]:  # Skip header row
            row = []
            for td in tr.find_all('td'):
                row.append(td.text.strip())
            rows.append(row)
    
        # Create DataFrame
        df = pd.DataFrame(rows, columns=headers)
    
        return df
    
    # Usage
    df = scrape_table("https://example.com/table")
    print(df)
    Python

    6.6 Downloading Files

    import os
    import time
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    
    def download_file(url, filename=None):
        """Download file from URL"""
        response = requests.get(url, stream=True)
    
        # Get filename from URL if not provided
        if filename is None:
            filename = url.split('/')[-1]
    
        # Download in chunks for large files
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
    
        print(f"Downloaded: {filename}")
    
    # Download multiple files
    def download_images(soup, base_url, folder='images'):
        """Download all images from a page"""
        os.makedirs(folder, exist_ok=True)
    
        images = soup.find_all('img')
    
        for idx, img in enumerate(images):
            img_url = img.get('src')
            if not img_url:
                continue  # Skip images without a src attribute
    
            # Resolve relative URLs against the page URL
            img_url = urljoin(base_url, img_url)
    
            filename = f"{folder}/image_{idx}.jpg"
            download_file(img_url, filename)
            time.sleep(0.5)  # Be polite between downloads
    
    # Usage
    url = "https://example.com"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    download_images(soup, url)
    Python

    7. Handling Dynamic Content with Selenium

    Many modern websites use JavaScript to load content dynamically. For these sites, Selenium is essential.
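
    One way to tell whether a page actually needs Selenium is to fetch it with requests first and check whether the elements you want appear in the static HTML; a rough sketch (the URL and the '.item' selector are placeholders):

    import requests
    from bs4 import BeautifulSoup
    
    def needs_javascript(url, css_selector):
        """Return True if the target elements are missing from the static HTML."""
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'lxml')
        return len(soup.select(css_selector)) == 0
    
    # Placeholder URL and selector
    if needs_javascript("https://example.com", ".item"):
        print("Content is likely rendered by JavaScript - consider Selenium.")
    else:
        print("Static HTML is enough - requests + BeautifulSoup will do.")
    Python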

    7.1 Setting Up Selenium

    # Install Selenium
    pip install selenium
    
    # Install WebDriver Manager (automatic driver management)
    pip install webdriver-manager
    Bash

    7.2 Basic Selenium Usage

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.service import Service
    
    # Setup Chrome driver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
    
    # Navigate to page
    driver.get("https://example.com")
    
    # Wait for page to load
    wait = WebDriverWait(driver, 10)
    
    # Find elements
    title = driver.find_element(By.TAG_NAME, "h1")
    print(title.text)
    
    # Find multiple elements
    items = driver.find_elements(By.CLASS_NAME, "item")
    
    # CSS Selectors
    element = driver.find_element(By.CSS_SELECTOR, ".content > p")
    
    # XPath
    element = driver.find_element(By.XPATH, "//div[@class='content']")
    
    # Close browser
    driver.quit()
    Python

    7.3 Interacting with Elements

    from selenium.webdriver.common.keys import Keys
    
    # Click button
    button = driver.find_element(By.ID, "submit-btn")
    button.click()
    
    # Fill form
    search_box = driver.find_element(By.NAME, "search")
    search_box.send_keys("web scraping")
    search_box.send_keys(Keys.RETURN)  # Press Enter
    
    # Scroll to element
    element = driver.find_element(By.ID, "footer")
    driver.execute_script("arguments[0].scrollIntoView();", element)
    
    # Execute JavaScript
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Select dropdown
    from selenium.webdriver.support.select import Select
    
    dropdown = Select(driver.find_element(By.ID, "dropdown"))
    dropdown.select_by_visible_text("Option 1")
    dropdown.select_by_value("option1")
    dropdown.select_by_index(0)
    Python

    7.4 Waiting for Elements

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    
    # Explicit wait
    wait = WebDriverWait(driver, 10)
    
    # Wait for element to be present
    element = wait.until(
        EC.presence_of_element_located((By.ID, "dynamic-content"))
    )
    
    # Wait for element to be clickable
    button = wait.until(
        EC.element_to_be_clickable((By.ID, "submit-btn"))
    )
    
    # Wait for element to be visible
    element = wait.until(
        EC.visibility_of_element_located((By.CLASS_NAME, "modal"))
    )
    
    # Wait for title to contain text
    wait.until(EC.title_contains("Expected Title"))
    
    # Implicit wait (applies to all find operations)
    driver.implicitly_wait(10)  # seconds
    Python

    7.5 Handling Multiple Windows/Tabs

    # Get current window handle
    main_window = driver.current_window_handle
    
    # Click link that opens new tab
    link = driver.find_element(By.LINK_TEXT, "Open in new tab")
    link.click()
    
    # Switch to new tab
    all_windows = driver.window_handles
    for window in all_windows:
        if window != main_window:
            driver.switch_to.window(window)
            break
    
    # Do something in new tab
    print(driver.title)
    
    # Switch back to main window
    driver.switch_to.window(main_window)
    
    # Close current tab
    driver.close()
    Python

    7.6 Headless Browser

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    # Configure headless mode
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1920,1080")
    
    # Create driver with options
    driver = webdriver.Chrome(options=chrome_options)
    
    # Use normally
    driver.get("https://example.com")
    print(driver.title)
    driver.quit()
    Python

    7.7 Complete Selenium Example

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.chrome.options import Options
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.service import Service
    import pandas as pd
    import time
    
    def scrape_dynamic_site():
        """Scrape a site with dynamic content"""
    
        # Setup
        chrome_options = Options()
        chrome_options.add_argument("--headless")
    
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=chrome_options)
    
        try:
            # Navigate to site
            driver.get("https://example.com")
    
            # Wait for dynamic content to load
            wait = WebDriverWait(driver, 10)
            wait.until(
                EC.presence_of_element_located((By.CLASS_NAME, "item"))
            )
    
            # Scroll to load more content
            for _ in range(3):
                driver.execute_script(
                    "window.scrollTo(0, document.body.scrollHeight);"
                )
                time.sleep(2)
    
            # Extract data
            items = driver.find_elements(By.CLASS_NAME, "item")
    
            data = []
            for item in items:
                title = item.find_element(By.CLASS_NAME, "title").text
                price = item.find_element(By.CLASS_NAME, "price").text
    
                data.append({
                    'title': title,
                    'price': price
                })
    
            # Save to DataFrame
            df = pd.DataFrame(data)
            df.to_csv('scraped_data.csv', index=False)
    
            print(f"Scraped {len(data)} items")
            return df
    
        finally:
            driver.quit()
    
    # Run scraper
    if __name__ == "__main__":
        df = scrape_dynamic_site()
        print(df.head())
    Python

    8. Best Practices

    8.1 Error Handling

    import requests
    from bs4 import BeautifulSoup
    import logging
    
    # Setup logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )
    
    def safe_scrape(url):
        """Scrape with comprehensive error handling"""
        try:
            # Make request
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise exception for bad status codes
    
            # Parse HTML
            soup = BeautifulSoup(response.content, 'lxml')
    
            # Extract data with safety checks
            title = soup.find('h1')
            if title:
                logging.info(f"Found title: {title.text}")
            else:
                logging.warning("No title found")
    
            return soup
    
        except requests.Timeout:
            logging.error(f"Timeout error for {url}")
        except requests.HTTPError as e:
            logging.error(f"HTTP error: {e}")
        except requests.RequestException as e:
            logging.error(f"Request failed: {e}")
        except Exception as e:
            logging.error(f"Unexpected error: {e}")
    
        return None
    
    # Usage
    soup = safe_scrape("https://example.com")
    if soup:
        # Process data
        pass
    Python

    8.2 Rate Limiting and Delays

    import time
    from datetime import datetime
    import random
    
    class RateLimiter:
        """Simple rate limiter"""
        def __init__(self, requests_per_second=1):
            self.delay = 1.0 / requests_per_second
            self.last_request = 0
    
        def wait(self):
            """Wait if necessary to maintain rate limit"""
            elapsed = time.time() - self.last_request
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
            self.last_request = time.time()
    
    # Usage
    limiter = RateLimiter(requests_per_second=2)
    
    for url in urls:
        limiter.wait()
        response = requests.get(url)
        # Process response
    
    # Alternative: Random delay
    def random_delay(min_seconds=1, max_seconds=3):
        """Add random delay between requests"""
        time.sleep(random.uniform(min_seconds, max_seconds))
    
    # Usage
    for url in urls:
        response = requests.get(url)
        random_delay(1, 3)
    Python

    8.3 Using Proxies

    import requests
    
    # Single proxy
    proxies = {
        'http': 'http://proxy.example.com:8080',
        'https': 'https://proxy.example.com:8080',
    }
    
    response = requests.get(url, proxies=proxies)
    
    # Rotating proxies
    class ProxyRotator:
        """Rotate through multiple proxies"""
        def __init__(self, proxy_list):
            self.proxies = proxy_list
            self.current = 0
    
        def get_proxy(self):
            """Get next proxy in rotation"""
            proxy = self.proxies[self.current]
            self.current = (self.current + 1) % len(self.proxies)
            return {'http': proxy, 'https': proxy}
    
    # Usage
    proxy_list = [
        'http://proxy1.com:8080',
        'http://proxy2.com:8080',
        'http://proxy3.com:8080',
    ]
    
    rotator = ProxyRotator(proxy_list)
    
    for url in urls:
        proxies = rotator.get_proxy()
        try:
            response = requests.get(url, proxies=proxies, timeout=5)
        except requests.RequestException:
            continue  # Skip this URL if the proxy or request fails
    Python

    8.4 User Agents

    import random
    
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101',
    ]
    
    def get_random_user_agent():
        """Return random user agent"""
        return random.choice(USER_AGENTS)
    
    # Usage
    headers = {
        'User-Agent': get_random_user_agent()
    }
    
    response = requests.get(url, headers=headers)
    Python

    8.5 Caching

    import requests
    from functools import lru_cache
    import pickle
    import os
    from datetime import datetime, timedelta
    
    # Simple in-memory caching with lru_cache
    @lru_cache(maxsize=128)
    def fetch_page(url):
        """Cached page fetching"""
        response = requests.get(url)
        return response.text
    
    # File-based caching
    class FileCache:
        """Simple file-based cache"""
        def __init__(self, cache_dir='cache', expiry_hours=24):
            self.cache_dir = cache_dir
            self.expiry = timedelta(hours=expiry_hours)
            os.makedirs(cache_dir, exist_ok=True)
    
        def get_cache_path(self, url):
            """Generate cache file path"""
            filename = url.replace('/', '_').replace(':', '_')
            return os.path.join(self.cache_dir, f"{filename}.pkl")
    
        def get(self, url):
            """Get cached content"""
            cache_path = self.get_cache_path(url)
    
            if not os.path.exists(cache_path):
                return None
    
            # Check if cache is expired
            file_time = datetime.fromtimestamp(os.path.getmtime(cache_path))
            if datetime.now() - file_time > self.expiry:
                return None
    
            with open(cache_path, 'rb') as f:
                return pickle.load(f)
    
        def set(self, url, content):
            """Save content to cache"""
            cache_path = self.get_cache_path(url)
            with open(cache_path, 'wb') as f:
                pickle.dump(content, f)
    
    # Usage
    cache = FileCache()
    
    def fetch_with_cache(url):
        # Try cache first
        content = cache.get(url)
        if content:
            print("Using cached version")
            return content
    
        # Fetch if not cached
        response = requests.get(url)
        content = response.text
        cache.set(url, content)
        return content
    Python

    8.6 Data Validation and Cleaning

    import re
    import pandas as pd
    
    def clean_text(text):
        """Clean extracted text"""
        if not text:
            return ""
    
        # Remove extra whitespace
        text = ' '.join(text.split())
    
        # Remove special characters if needed
        text = re.sub(r'[^\w\s.,!?-]', '', text)
    
        return text.strip()
    
    def clean_price(price_string):
        """Extract numeric price from string"""
        # Remove currency symbols and extract number
        match = re.search(r'[\d,]+\.?\d*', price_string)
        if match:
            number = match.group().replace(',', '')
            return float(number)
        return None
    
    def validate_data(df):
        """Validate scraped data"""
        # Remove duplicates
        df = df.drop_duplicates()
    
        # Remove rows with missing values
        df = df.dropna(subset=['title', 'price'])
    
        # Validate price range
        df = df[df['price'] > 0]
        df = df[df['price'] < 10000]
    
        # Clean text fields
        df['title'] = df['title'].apply(clean_text)
    
        return df
    
    # Usage (assuming `items` comes from an earlier soup.find_all call)
    data = []
    for item in items:
        title = clean_text(item.find('h2').text)
        price = clean_price(item.find('span', class_='price').text)
    
        if title and price:
            data.append({'title': title, 'price': price})
    
    df = pd.DataFrame(data)
    df = validate_data(df)
    Python

    9. Common Challenges and Solutions

    9.1 Handling AJAX/Dynamic Content

    Problem: Content loads after page load via JavaScript.

    Solution:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    
    driver = webdriver.Chrome()
    driver.get(url)
    
    # Wait for AJAX content
    wait = WebDriverWait(driver, 10)
    element = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, "ajax-content"))
    )
    
    # Or inspect network requests and call API directly
    import requests
    api_url = "https://example.com/api/data"
    response = requests.get(api_url)
    data = response.json()
    Python

    9.2 Bypassing CAPTCHAs

    Problem: Sites use CAPTCHAs to prevent bots.

    Solutions:

    1. Use official APIs instead
    2. Contact site owner for permission
    3. Use CAPTCHA solving services (2captcha, Anti-Captcha)
    4. Implement human-like behavior patterns

    # Example: More human-like scraping
    import time
    import random
    from selenium.webdriver.common.action_chains import ActionChains
    
    def human_like_delay():
        time.sleep(random.uniform(1, 3))
    
    def human_like_scroll(driver):
        # Scroll gradually like a human
        for i in range(5):
            driver.execute_script(f"window.scrollTo(0, {i * 200});")
            time.sleep(random.uniform(0.1, 0.3))
    
    def human_like_mouse_movement(driver, element):
        actions = ActionChains(driver)
        actions.move_to_element(element)
        actions.pause(random.uniform(0.1, 0.5))
        actions.click()
        actions.perform()
    Python

    9.3 Handling Session and Cookies

    Problem: Need to maintain session across requests.

    Solution:

    import requests
    
    # Use session object
    session = requests.Session()
    
    # Login
    login_data = {'username': 'user', 'password': 'pass'}
    session.post('https://example.com/login', data=login_data)
    
    # Session maintains cookies automatically
    response = session.get('https://example.com/protected-page')
    
    # Manually set cookies
    cookies = {'session_id': 'abc123'}
    response = requests.get(url, cookies=cookies)
    
    # Save and load cookies with Selenium
    from selenium import webdriver
    import pickle
    
    driver = webdriver.Chrome()
    driver.get('https://example.com')
    
    # Save cookies
    cookies = driver.get_cookies()
    pickle.dump(cookies, open('cookies.pkl', 'wb'))
    
    # Load cookies
    driver.get('https://example.com')
    cookies = pickle.load(open('cookies.pkl', 'rb'))
    for cookie in cookies:
        driver.add_cookie(cookie)
    driver.refresh()
    Python

    9.4 Handling Infinite Scroll

    Problem: Content loads continuously as you scroll.

    Solution:

    from selenium import webdriver
    import time
    
    def scrape_infinite_scroll(url, scroll_pause=2, max_scrolls=10):
        driver = webdriver.Chrome()
        driver.get(url)
    
        # Get scroll height
        last_height = driver.execute_script("return document.body.scrollHeight")
    
        scrolls = 0
        while scrolls < max_scrolls:
            # Scroll down
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
            # Wait for content to load
            time.sleep(scroll_pause)
    
            # Calculate new scroll height
            new_height = driver.execute_script("return document.body.scrollHeight")
    
            # Break if reached bottom
            if new_height == last_height:
                break
    
            last_height = new_height
            scrolls += 1
    
        # Extract data
        html = driver.page_source
        driver.quit()
    
        return html
    Python

    9.5 Handling Rate Limiting (429 Errors)

    Problem: Too many requests cause blocks.

    Solution:

    import time
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry
    
    def requests_with_retry():
        """Create session with automatic retries"""
        session = requests.Session()
    
        retry = Retry(
            total=5,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"]  # was method_whitelist in older urllib3
        )
    
        adapter = HTTPAdapter(max_retries=retry)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
    
        return session
    
    # Usage
    session = requests_with_retry()
    
    def scrape_with_backoff(url, max_retries=3):
        """Exponential backoff on rate limiting"""
        for attempt in range(max_retries):
            response = session.get(url)
    
            if response.status_code == 429:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited. Waiting {wait_time} seconds...")
                time.sleep(wait_time)
                continue
    
            if response.status_code == 200:
                return response
    
        return None
    Python

    10. Real-World Projects

    Project 1: E-commerce Price Tracker

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    from datetime import datetime
    import smtplib
    from email.mime.text import MIMEText
    
    class PriceTracker:
        """Track product prices on e-commerce sites"""
    
        def __init__(self, products_csv='products.csv'):
            self.products_csv = products_csv
            self.products = self.load_products()
    
        def load_products(self):
            """Load products to track"""
            try:
                return pd.read_csv(self.products_csv)
            except FileNotFoundError:
                return pd.DataFrame(columns=['name', 'url', 'target_price'])
    
        def get_price(self, url):
            """Extract price from product page"""
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
            }
    
            response = requests.get(url, headers=headers)
            soup = BeautifulSoup(response.content, 'lxml')
    
            # This selector varies by site - adjust accordingly
            price_elem = soup.find('span', class_='price')
            if price_elem:
                price_text = price_elem.text
                # Extract numeric value
                import re
                match = re.search(r'[\d,]+\.?\d*', price_text)
                if match:
                    return float(match.group().replace(',', ''))
    
            return None
    
        def check_prices(self):
            """Check all product prices"""
            results = []
    
            for _, product in self.products.iterrows():
                current_price = self.get_price(product['url'])
    
                if current_price:
                    result = {
                        'name': product['name'],
                        'current_price': current_price,
                        'target_price': product['target_price'],
                        'timestamp': datetime.now(),
                        'alert': current_price <= product['target_price']
                    }
                    results.append(result)
    
                    print(f"{product['name']}: ${current_price}")
    
            return pd.DataFrame(results)
    
        def send_alert(self, product_name, current_price, target_price):
            """Send email alert for price drop"""
            msg = MIMEText(
                f"Price alert for {product_name}!\n"
                f"Current price: ${current_price}\n"
                f"Target price: ${target_price}"
            )
    
            msg['Subject'] = f'Price Alert: {product_name}'
            msg['From'] = 'your_email@gmail.com'
            msg['To'] = 'recipient@gmail.com'
    
            # Send email (configure SMTP settings)
            # smtp = smtplib.SMTP('smtp.gmail.com', 587)
            # smtp.starttls()
            # smtp.login('your_email@gmail.com', 'password')
            # smtp.send_message(msg)
            # smtp.quit()
    
        def run(self):
            """Run price tracking"""
            results = self.check_prices()
    
            # Check for alerts
            alerts = results[results['alert']]
            for _, alert in alerts.iterrows():
                print(f"ALERT: {alert['name']} is now ${alert['current_price']}!")
                # self.send_alert(...)
    
            # Save history
            history_file = 'price_history.csv'
            try:
                history = pd.read_csv(history_file)
                history = pd.concat([history, results], ignore_index=True)
            except FileNotFoundError:
                history = results
    
            history.to_csv(history_file, index=False)
    
            return results
    
    # Usage
    if __name__ == "__main__":
        tracker = PriceTracker()
        results = tracker.run()
        print(results)
    Python

    Project 2: News Aggregator

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    from datetime import datetime
    import feedparser
    
    class NewsAggregator:
        """Aggregate news from multiple sources"""
    
        def __init__(self):
            self.sources = {
                'TechCrunch': 'https://techcrunch.com',
                'The Verge': 'https://www.theverge.com',
            }
            self.articles = []
    
        def scrape_rss(self, rss_url):
            """Scrape articles from RSS feed"""
            feed = feedparser.parse(rss_url)
    
            articles = []
            for entry in feed.entries[:10]:  # Top 10 articles
                article = {
                    'title': entry.title,
                    'link': entry.link,
                    'published': entry.published,
                    'summary': entry.summary if hasattr(entry, 'summary') else ''
                }
                articles.append(article)
    
            return articles
    
        def scrape_website(self, url, source_name):
            """Scrape articles from website"""
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
            }
    
            response = requests.get(url, headers=headers)
            soup = BeautifulSoup(response.content, 'lxml')
    
            articles = []
    
            # This varies by site - customize selectors
            article_elements = soup.find_all('article', limit=10)
    
            for article_elem in article_elements:
                title_elem = article_elem.find('h2')
                link_elem = article_elem.find('a')
    
                if title_elem and link_elem:
                    article = {
                        'source': source_name,
                        'title': title_elem.text.strip(),
                        'link': link_elem.get('href'),
                        'timestamp': datetime.now()
                    }
                    articles.append(article)
    
            return articles
    
        def aggregate(self):
            """Aggregate news from all sources"""
            all_articles = []
    
            for source_name, url in self.sources.items():
                print(f"Scraping {source_name}...")
                articles = self.scrape_website(url, source_name)
                all_articles.extend(articles)
    
            df = pd.DataFrame(all_articles)
            df = df.drop_duplicates(subset=['title'])
            df.to_csv('news_aggregated.csv', index=False)
    
            return df
    
        def filter_by_keywords(self, df, keywords):
            """Filter articles by keywords"""
            mask = df['title'].str.contains('|'.join(keywords), case=False)
            return df[mask]
    
    # Usage
    if __name__ == "__main__":
        aggregator = NewsAggregator()
        articles = aggregator.aggregate()
    
        # Filter by topics
        ai_articles = aggregator.filter_by_keywords(
            articles, 
            ['AI', 'artificial intelligence', 'machine learning']
        )
    
        print(f"Found {len(ai_articles)} AI-related articles")
        print(ai_articles[['source', 'title']])
    Python

    Project 3: Job Listing Scraper

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    from datetime import datetime
    import re
    
    class JobScraper:
        """Scrape job listings from multiple sites"""
    
        def __init__(self, keywords, location=''):
            self.keywords = keywords
            self.location = location
            self.jobs = []
    
        def scrape_indeed(self):
            """Scrape jobs from Indeed"""
            base_url = "https://www.indeed.com/jobs"
            params = {
                'q': ' '.join(self.keywords),
                'l': self.location
            }
    
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
            }
    
            response = requests.get(base_url, params=params, headers=headers)
            soup = BeautifulSoup(response.content, 'lxml')
    
            # Find job cards (selector may need updating)
            job_cards = soup.find_all('div', class_='job_seen_beacon')
    
            jobs = []
            for card in job_cards:
                try:
                    title_elem = card.find('h2', class_='jobTitle')
                    company_elem = card.find('span', class_='companyName')
                    location_elem = card.find('div', class_='companyLocation')
    
                    job = {
                        'title': title_elem.text.strip() if title_elem else 'N/A',
                        'company': company_elem.text.strip() if company_elem else 'N/A',
                        'location': location_elem.text.strip() if location_elem else 'N/A',
                        'source': 'Indeed',
                        'scraped_date': datetime.now()
                    }
                    jobs.append(job)
                except Exception as e:
                    continue
    
            return jobs
    
        def scrape_linkedin(self):
            """Scrape jobs from LinkedIn (requires authentication)"""
            # LinkedIn requires login - use Selenium or API
            pass
    
        def clean_salary(self, salary_text):
            """Extract salary range from text"""
            if not salary_text:
                return None, None
    
            # Extract numbers
            numbers = re.findall(r'\d+,?\d*', salary_text)
            if len(numbers) >= 2:
                min_salary = int(numbers[0].replace(',', ''))
                max_salary = int(numbers[1].replace(',', ''))
                return min_salary, max_salary
    
            return None, None
    
        def aggregate_jobs(self):
            """Aggregate jobs from all sources"""
            all_jobs = []
    
            print("Scraping Indeed...")
            indeed_jobs = self.scrape_indeed()
            all_jobs.extend(indeed_jobs)
    
            # Add more sources
            # all_jobs.extend(self.scrape_linkedin())
    
            df = pd.DataFrame(all_jobs)
            df = df.drop_duplicates(subset=['title', 'company'])
    
            return df
    
        def save_results(self, df, filename='jobs.csv'):
            """Save results to CSV"""
            df.to_csv(filename, index=False)
            print(f"Saved {len(df)} jobs to {filename}")
    
        def filter_by_salary(self, df, min_salary):
            """Filter jobs by minimum salary"""
            # Implement salary filtering logic
            pass
    
    # Usage
    if __name__ == "__main__":
        scraper = JobScraper(
            keywords=['Python', 'Developer'],
            location='New York'
        )
    
        jobs_df = scraper.aggregate_jobs()
        scraper.save_results(jobs_df)
    
        print(f"Found {len(jobs_df)} jobs")
        print(jobs_df.head())
    Python

    Conclusion

    Web scraping is a powerful skill that opens up endless possibilities for data collection and analysis. Remember to:

    1. Be Ethical: Respect robots.txt, terms of service, and rate limits
    2. Start Simple: Begin with basic scraping before moving to complex scenarios
    3. Handle Errors: Always implement proper error handling
    4. Stay Updated: Websites change frequently – maintain your scrapers
    5. Use APIs: When available, APIs are always better than scraping


    Python Web Scraping Cheatsheet

    Quick Reference Guide

    1. Installation & Setup

    # Essential libraries
    pip install requests beautifulsoup4 lxml selenium pandas
    
    # Optional but useful
    pip install scrapy playwright webdriver-manager fake-useragent
    Bash

    2. Requests Library

    Basic GET Request

    import requests
    
    response = requests.get('https://example.com')
    html = response.text
    content = response.content  # bytes
    status = response.status_code
    Python

    Headers & User-Agent

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept': 'text/html,application/xhtml+xml',
        'Accept-Language': 'en-US,en;q=0.9'
    }
    response = requests.get(url, headers=headers)
    Python

    Parameters & POST Requests

    # GET with parameters
    params = {'search': 'python', 'page': 1}
    response = requests.get(url, params=params)
    
    # POST with data
    data = {'username': 'user', 'password': 'pass'}
    response = requests.post(url, data=data)
    
    # POST with JSON
    json_data = {'key': 'value'}
    response = requests.post(url, json=json_data)
    Python

    Sessions & Cookies

    session = requests.Session()
    session.get('https://example.com/login')
    session.post('https://example.com/auth', data=login_data)
    response = session.get('https://example.com/profile')
    Python

    Timeouts & Retries

    # Timeout
    response = requests.get(url, timeout=5)
    
    # Proxies
    proxies = {'http': 'http://proxy:8080', 'https': 'https://proxy:8080'}
    response = requests.get(url, proxies=proxies)
    
    # Verify SSL (disable for testing only)
    response = requests.get(url, verify=False)
    Python

    3. BeautifulSoup Parsing

    Initialize Soup

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')           # Fast parser
    soup = BeautifulSoup(html, 'html.parser')    # Built-in parser
    Python

    Finding Elements

    # Find first matching element
    soup.find('div')
    soup.find('div', class_='content')
    soup.find('div', id='main')
    soup.find('a', href=True)
    soup.find('div', attrs={'data-id': '123'})
    
    # Find all matching elements
    soup.find_all('p')
    soup.find_all('div', class_='item')
    soup.find_all(['h1', 'h2', 'h3'])
    soup.find_all('a', limit=5)
    Python

    CSS Selectors

    soup.select('div.content')           # Class selector
    soup.select('#main')                 # ID selector
    soup.select('div > p')               # Direct child
    soup.select('div p')                 # Descendant
    soup.select('a[href]')               # Has attribute
    soup.select('a[href^="http"]')       # Starts with
    soup.select('a[href$=".pdf"]')       # Ends with
    soup.select('div:nth-of-type(2)')    # Nth element
    soup.select_one('div.content')       # First match only
    Python

    Extracting Data

    # Get text
    element.text
    element.get_text()
    element.get_text(strip=True)
    element.string                       # Direct text only
    
    # Get attributes
    element.get('href')
    element['href']
    element.attrs                        # All attributes dict
    
    # Check existence
    if element:
        print("Element exists")
    Python

    Navigation

    # Parents
    element.parent
    element.parents
    
    # Siblings
    element.next_sibling
    element.previous_sibling
    element.next_siblings
    element.previous_siblings
    
    # Children
    element.children                     # Direct children
    element.descendants                  # All descendants
    
    # Finding relatives
    element.find_next('p')
    element.find_previous('div')
    element.find_next_sibling('span')
    Python

    4. Selenium WebDriver

    Setup

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.chrome.options import Options
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.service import Service
    
    # Basic setup
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
    
    # Headless mode
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(service=service, options=options)
    Python

    Finding Elements

    # Single element
    driver.find_element(By.ID, 'element-id')
    driver.find_element(By.NAME, 'element-name')
    driver.find_element(By.CLASS_NAME, 'class-name')
    driver.find_element(By.TAG_NAME, 'div')
    driver.find_element(By.CSS_SELECTOR, 'div.content')
    driver.find_element(By.XPATH, '//div[@class="content"]')
    driver.find_element(By.LINK_TEXT, 'Click Here')
    driver.find_element(By.PARTIAL_LINK_TEXT, 'Click')
    
    # Multiple elements
    driver.find_elements(By.CLASS_NAME, 'item')
    Python

    Interactions

    # Click
    element.click()
    
    # Type text
    element.send_keys('text to type')
    element.send_keys(Keys.RETURN)
    element.send_keys(Keys.CONTROL, 'a')
    
    # Clear input
    element.clear()
    
    # Get values
    element.text
    element.get_attribute('href')
    element.get_attribute('value')
    
    # Check states
    element.is_displayed()
    element.is_enabled()
    element.is_selected()
    Python

    Waits

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    # Implicit wait (global)
    driver.implicitly_wait(10)
    
    # Explicit wait
    wait = WebDriverWait(driver, 10)
    element = wait.until(EC.presence_of_element_located((By.ID, 'myElement')))
    element = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'content')))
    element = wait.until(EC.element_to_be_clickable((By.ID, 'button')))
    
    # Wait conditions
    EC.title_contains('Expected')
    EC.title_is('Exact Title')
    EC.presence_of_element_located()
    EC.visibility_of_element_located()
    EC.element_to_be_clickable()
    EC.staleness_of()
    EC.frame_to_be_available_and_switch_to_it()
    Python

    Navigation & Actions

    # Navigation
    driver.get('https://example.com')
    driver.back()
    driver.forward()
    driver.refresh()
    
    # JavaScript execution
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    driver.execute_script('arguments[0].click();', element)
    
    # Screenshots
    driver.save_screenshot('screenshot.png')
    
    # Window management
    driver.maximize_window()
    driver.set_window_size(1920, 1080)
    
    # Cookies
    driver.get_cookies()
    driver.add_cookie({'name': 'key', 'value': 'value'})
    driver.delete_all_cookies()
    
    # Close
    driver.close()    # Current window
    driver.quit()     # All windows
    Python

    Switching Contexts

    # Windows/Tabs
    main_window = driver.current_window_handle
    all_windows = driver.window_handles
    driver.switch_to.window(all_windows[1])
    
    # Frames
    driver.switch_to.frame('frame-name')
    driver.switch_to.frame(0)  # By index
    driver.switch_to.default_content()  # Back to main
    
    # Alerts
    alert = driver.switch_to.alert
    alert.accept()
    alert.dismiss()
    alert.send_keys('text')
    alert_text = alert.text
    Python

    5. Common Patterns

    Safe Element Finding

    # BeautifulSoup
    element = soup.find('h1')
    text = element.text if element else 'N/A'
    
    # Alternative
    text = soup.find('h1').text if soup.find('h1') else 'N/A'
    
    # With get_text
    text = element.get_text(strip=True) if element else ''
    Python

    Pagination Loop

    page = 1
    while True:
        url = f'{base_url}?page={page}'
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
    
        items = soup.find_all('div', class_='item')
        if not items:
            break
    
        # Process items
        for item in items:
            # Extract data
            pass
    
        page += 1
        time.sleep(1)
    Python

    Infinite Scroll

    from selenium import webdriver
    import time
    
    driver = webdriver.Chrome()
    driver.get(url)
    
    last_height = driver.execute_script('return document.body.scrollHeight')
    
    while True:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)
    
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height
    Python

    Rate Limiting

    import time
    from datetime import datetime
    
    def rate_limit(calls_per_second=1):
        min_interval = 1.0 / calls_per_second
        last_called = [0.0]
    
        def decorator(func):
            def wrapper(*args, **kwargs):
                elapsed = time.time() - last_called[0]
                if elapsed < min_interval:
                    time.sleep(min_interval - elapsed)
                result = func(*args, **kwargs)
                last_called[0] = time.time()
                return result
            return wrapper
        return decorator
    
    @rate_limit(calls_per_second=2)
    def scrape_page(url):
        return requests.get(url)
    Python

    6. Data Extraction & Cleaning

    Text Cleaning

    import re
    
    # Remove whitespace
    text = ' '.join(text.split())
    
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    
    # Extract numbers
    numbers = re.findall(r'\d+', text)
    
    # Extract price
    price = re.search(r'\$?([\d,]+\.?\d*)', text)
    if price:
        price = float(price.group(1).replace(',', ''))
    Python

    List Comprehension

    # Extract text from elements
    texts = [elem.text for elem in elements]
    
    # Extract with condition
    links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
    
    # Extract and clean
    prices = [float(p.text.strip('$')) for p in soup.find_all('span', class_='price')]
    Python

    Pandas DataFrame

    import pandas as pd
    
    # From list of dicts
    data = [{'name': 'Item 1', 'price': 10}, {'name': 'Item 2', 'price': 20}]
    df = pd.DataFrame(data)
    
    # Save to CSV
    df.to_csv('output.csv', index=False)
    
    # Save to Excel
    df.to_excel('output.xlsx', index=False)
    
    # Save to JSON
    df.to_json('output.json', orient='records', indent=4)
    Python

    7. Error Handling

    Try-Except Blocks

    import requests
    from requests.exceptions import RequestException
    
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.Timeout:
        print('Request timed out')
    except requests.HTTPError as e:
        print(f'HTTP error: {e}')
    except RequestException as e:
        print(f'Request failed: {e}')
    except Exception as e:
        print(f'Unexpected error: {e}')
    Python

    Selenium Exceptions

    from selenium.common.exceptions import (
        NoSuchElementException,
        TimeoutException,
        StaleElementReferenceException
    )
    
    try:
        element = driver.find_element(By.ID, 'element-id')
    except NoSuchElementException:
        print('Element not found')
    except TimeoutException:
        print('Timed out waiting for element')
    except StaleElementReferenceException:
        print('Element is no longer attached to DOM')
    Python

    8. Useful Code Snippets

    Check robots.txt

    from urllib.robotparser import RobotFileParser
    
    rp = RobotFileParser()
    rp.set_url('https://example.com/robots.txt')
    rp.read()
    can_fetch = rp.can_fetch('*', 'https://example.com/page')
    Python

    Random User Agent

    import random
    
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    ]
    
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    Python

    Download File

    import requests
    
    response = requests.get(file_url, stream=True)
    with open('file.pdf', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    Python

    Parse JSON from API

    response = requests.get(api_url)
    data = response.json()
    
    # Or
    import json
    data = json.loads(response.text)
    Python

    Save to Database (SQLite)

    import sqlite3
    
    conn = sqlite3.connect('scraped_data.db')
    cursor = conn.cursor()
    
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            title TEXT,
            price REAL
        )
    ''')
    
    cursor.execute('INSERT INTO items (title, price) VALUES (?, ?)', ('Item', 19.99))
    conn.commit()
    conn.close()
    Python

    9. XPath Quick Reference

    # Basic
    '//div'                              # All divs
    '//div[@class="content"]'            # Div with class
    '//div[@id="main"]'                  # Div with id
    '//a[@href]'                         # Links with href
    
    # Text
    '//h1/text()'                        # Text content
    '//div[contains(text(), "Hello")]'   # Contains text
    '//div[starts-with(text(), "Hello")]'  # Starts with
    
    # Attributes
    '//img/@src'                         # Get src attribute
    '//a[contains(@href, ".pdf")]'       # Href contains .pdf
    
    # Position
    '//div[1]'                           # First div
    '//div[last()]'                      # Last div
    '//div[position() < 3]'              # First two divs
    
    # Axes
    '//div/parent::*'                    # Parent
    '//div/following-sibling::*'         # Next siblings
    '//div/preceding-sibling::*'         # Previous siblings
    '//div/descendant::p'                # All p descendants
    Python

    10. Performance Tips

    # Use lxml parser (fastest)
    soup = BeautifulSoup(html, 'lxml')
    
    # Limit results
    soup.find_all('div', limit=10)
    
    # Use CSS selectors (faster than find_all)
    soup.select('div.item')
    
    # Compile regex patterns
    import re
    pattern = re.compile(r'\d+')
    numbers = pattern.findall(text)
    
    # Multiprocessing for multiple URLs
    from multiprocessing import Pool
    
    def scrape(url):
        # Scraping logic
        return data
    
    with Pool(5) as p:
        results = p.map(scrape, urls)
    
    # Async requests
    import asyncio
    import aiohttp
    
    async def fetch(session, url):
        async with session.get(url) as response:
            return await response.text()
    
    async def main():
        async with aiohttp.ClientSession() as session:
            tasks = [fetch(session, url) for url in urls]
            results = await asyncio.gather(*tasks)
    
    asyncio.run(main())
    Python

    11. Debugging Tips

    # Print response
    print(response.status_code)
    print(response.headers)
    print(response.text[:500])  # First 500 chars
    
    # Pretty print soup
    print(soup.prettify()[:1000])
    
    # Check if element exists
    element = soup.find('div', class_='content')
    print(f"Element found: {element is not None}")
    
    # Save HTML for inspection
    with open('page.html', 'w', encoding='utf-8') as f:
        f.write(response.text)
    
    # Selenium debugging
    print(driver.current_url)
    print(driver.title)
    print(driver.page_source[:500])
    
    # Wait and see (Selenium)
    import time
    time.sleep(5)  # Pause to see what's happening
    Python

    12. Common Selectors Cheatsheet

    Element Type | BeautifulSoup              | CSS Selector      | XPath
    -------------|----------------------------|-------------------|---------------------
    By tag       | find('div')                | select('div')     | //div
    By class     | find('div', class_='item') | select('.item')   | //div[@class="item"]
    By id        | find(id='main')            | select('#main')   | //*[@id="main"]
    By attribute | find('a', href=True)       | select('a[href]') | //a[@href]
    Direct child | N/A                        | select('div > p') | //div/p
    Descendant   | N/A                        | select('div p')   | //div//p
    Multiple     | find_all(['h1', 'h2'])     | select('h1, h2')  | //h1 | //h2

    Quick Troubleshooting

    Problem           | Solution
    ------------------|-------------------------------------------------------------
    Empty results     | Check selectors, wait for JS loading, verify page structure
    403/429 errors    | Add User-Agent, reduce request rate, use proxies
    Timeout errors    | Increase timeout, check internet connection
    Element not found | Wait for dynamic content, verify selector
    Encoding issues   | Use response.content with proper encoding
    Cookie/Session    | Use requests.Session() or Selenium
    CAPTCHA           | Use APIs, contact site owner, reduce frequency
    Dynamic content   | Use Selenium or check for JSON/API calls

    Happy Scraping! 🚀

