Table of Contents
- Introduction to Web Scraping
- Legal and Ethical Considerations
- Understanding HTML Structure
- Setting Up Your Environment
- Basic Web Scraping with Requests and BeautifulSoup
- Advanced Techniques
- Handling Dynamic Content with Selenium
- Best Practices
- Common Challenges and Solutions
- Real-World Projects
1. Introduction to Web Scraping
Web scraping is the automated process of extracting data from websites. It’s a powerful technique used for:
- Data Analysis: Collecting data for research and analytics
- Price Monitoring: Tracking competitor prices
- Content Aggregation: Building news or product aggregators
- Lead Generation: Gathering business contact information
- Market Research: Analyzing trends and patterns
When to Use Web Scraping
✅ Good Use Cases:
- Public data that’s not available via API
- Monitoring website changes
- Research and academic purposes
- Personal projects
❌ When NOT to Use:
- If an official API exists, use it instead
- When explicitly prohibited by terms of service
- For scraping personal or private data
2. Legal and Ethical Considerations
Legal Framework
Before scraping any website, consider:
- Terms of Service (ToS): Always review the website’s ToS
- robots.txt: Check the robots.txt file (e.g., https://example.com/robots.txt)
- Copyright: Respect intellectual property rights
- Rate Limiting: Don’t overload servers with requests
Ethical Guidelines
# Example: Checking robots.txt
from urllib.robotparser import RobotFileParser
def can_scrape(url, user_agent='*'):
"""Check if scraping is allowed"""
rp = RobotFileParser()
rp.set_url(url + '/robots.txt')
rp.read()
return rp.can_fetch(user_agent, url)
# Usage
website = "https://example.com"
if can_scrape(website):
print("Scraping is allowed!")
else:
print("Scraping is not allowed.")PythonBest Practices:
- Add delays between requests
- Identify your bot with a proper User-Agent
- Respect rate limits
- Use APIs when available
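A minimal sketch combining these practices; the URLs, contact address, and one-second delay below are illustrative placeholders rather than values from this article:

import time
import requests

# Identify your bot honestly and give the site owner a way to contact you (placeholder address).
HEADERS = {'User-Agent': 'MyResearchBot/1.0 (+mailto:you@example.com)'}

urls = [
    'https://example.com/page/1',  # placeholder pages
    'https://example.com/page/2',
]

for url in urls:
    response = requests.get(url, headers=HEADERS, timeout=10)
    if response.status_code == 200:
        print(f"Fetched {url} ({len(response.text)} bytes)")
    time.sleep(1)  # pause between requests so the server is not overloaded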
3. Understanding HTML Structure
HTML Basics
HTML (HyperText Markup Language) structures web pages using tags:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>Main Heading</h1>
<div class="content">
<p>This is a paragraph.</p>
<ul id="items">
<li class="item">Item 1</li>
<li class="item">Item 2</li>
</ul>
</div>
<a href="https://example.com">Link</a>
</body>
</html>
Key HTML Elements for Scraping
- Tags: <div>, <span>, <p>, <h1>, etc.
- Attributes: class, id, href, src
- Text: Content between opening and closing tags (see the short example below)
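To see how tags, attributes, and text map to extraction, here is a small illustrative sketch that parses a fragment of the sample page above with BeautifulSoup:

from bs4 import BeautifulSoup

# A fragment of the sample page above, stored as a string for illustration.
html = '''
<div class="content">
  <p>This is a paragraph.</p>
</div>
<a href="https://example.com">Link</a>
'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')             # find by tag name
print(link.name)                  # 'a' - the tag
print(link['href'])               # 'https://example.com' - an attribute
print(link.text)                  # 'Link' - the text between the tags
print(soup.find('div')['class'])  # ['content'] - class values come back as a list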
CSS Selectors
CSS selectors help locate elements; the table below summarizes the most common patterns, and a short select() example follows it:
| Selector | Example | Description |
|---|---|---|
| Tag | p | Selects all <p> elements |
| Class | .content | Selects elements with class="content" |
| ID | #items | Selects element with id="items" |
| Descendant | div p | Selects <p> inside <div> |
| Child | div > p | Direct child only |
| Attribute | [href] | Elements with href attribute |
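Here is a short sketch showing these selector patterns with BeautifulSoup's select() and select_one(); the HTML fragment reuses the sample page from above:

from bs4 import BeautifulSoup

html = '''
<div class="content">
  <p>This is a paragraph.</p>
  <ul id="items">
    <li class="item">Item 1</li>
    <li class="item">Item 2</li>
  </ul>
</div>
<a href="https://example.com">Link</a>
'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.select('p'))           # tag selector
print(soup.select('.item'))       # class selector
print(soup.select_one('#items'))  # id selector
print(soup.select('div p'))       # descendant selector
print(soup.select('ul > li'))     # direct child selector
print(soup.select('[href]'))      # attribute selector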
XPath (Alternative to CSS)
# XPath examples
/html/body/div[1] # First div in body
//div[@class='content'] # Any div with class='content'
//a[@href] # All links with href
//li[text()='Item 1'] # List item with specific text
4. Setting Up Your Environment
Installing Required Libraries
# Core libraries
pip install requests beautifulsoup4 lxml
# For dynamic content
pip install selenium
# Additional useful libraries
pip install pandas # Data manipulation
pip install scrapy # Advanced scraping framework
Creating a Virtual Environment
# Windows
python -m venv scraping_env
scraping_env\Scripts\activate
# Linux/Mac
python3 -m venv scraping_env
source scraping_env/bin/activate
Verify Installation
import requests
from bs4 import BeautifulSoup
import pandas as pd
print("All libraries installed successfully!")Python5. Basic Web Scraping with Requests and BeautifulSoup
5.1 Making HTTP Requests
import requests
# Basic GET request
url = "https://example.com"
response = requests.get(url)
# Check status code
if response.status_code == 200:
print("Success!")
html_content = response.text
else:
print(f"Error: {response.status_code}")
# Adding headers
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
# Handling timeouts
try:
response = requests.get(url, timeout=5)
except requests.Timeout:
print("Request timed out")
except requests.RequestException as e:
print(f"Error: {e}")Python5.2 Parsing HTML with BeautifulSoup
from bs4 import BeautifulSoup
import requests
# Fetch and parse
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Finding elements
title = soup.find('h1') # First h1
all_paragraphs = soup.find_all('p') # All paragraphs
by_class = soup.find('div', class_='content')
by_id = soup.find(id='main-content')
# CSS Selectors
items = soup.select('.item') # All elements with class 'item'
first_item = soup.select_one('#first') # Element with id 'first'
# Extracting text
print(title.text) # Get text content
print(title.get_text(strip=True)) # Remove whitespace
# Extracting attributes
link = soup.find('a')
href = link.get('href') # Get href attribute
href = link['href'] # Alternative syntax
5.3 Complete Example: Scraping Quotes
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
def scrape_quotes():
"""Scrape quotes from quotes.toscrape.com"""
base_url = "http://quotes.toscrape.com/page/"
all_quotes = []
for page in range(1, 6): # Scrape first 5 pages
print(f"Scraping page {page}...")
url = f"{base_url}{page}/"
# Add headers
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
if response.status_code != 200:
print(f"Failed to fetch page {page}")
continue
soup = BeautifulSoup(response.content, 'lxml')
# Find all quote containers
quotes = soup.find_all('div', class_='quote')
for quote in quotes:
# Extract quote text
text = quote.find('span', class_='text').text
# Extract author
author = quote.find('small', class_='author').text
# Extract tags
tags = [tag.text for tag in quote.find_all('a', class_='tag')]
all_quotes.append({
'quote': text,
'author': author,
'tags': ', '.join(tags)
})
# Be polite - add delay between requests
time.sleep(1)
# Save to CSV
df = pd.DataFrame(all_quotes)
df.to_csv('quotes.csv', index=False)
print(f"Scraped {len(all_quotes)} quotes!")
return df
# Run the scraper
if __name__ == "__main__":
quotes_df = scrape_quotes()
print(quotes_df.head())
5.4 Navigating the DOM Tree
from bs4 import BeautifulSoup
html = """
<div class="container">
<h1>Title</h1>
<div class="content">
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Parent navigation
p = soup.find('p')
parent = p.parent # Get parent element
parents = p.parents # All parents (generator)
# Sibling navigation
first_p = soup.find('p')
next_sibling = first_p.find_next_sibling()
prev_sibling = first_p.find_previous_sibling()
# Child navigation
container = soup.find('div', class_='container')
children = container.children # Direct children (generator)
descendants = container.descendants # All descendants (generator)
# Finding next/previous elements
h1 = soup.find('h1')
next_elem = h1.find_next('p') # Next <p> tag
6. Advanced Techniques
6.1 Handling Pagination
def scrape_paginated_site(base_url, max_pages=10):
"""Scrape multiple pages"""
all_data = []
for page in range(1, max_pages + 1):
url = f"{base_url}?page={page}"
response = requests.get(url)
if response.status_code != 200:
break
soup = BeautifulSoup(response.content, 'lxml')
# Check if "Next" button exists
next_button = soup.find('a', class_='next')
if not next_button:
print(f"Reached last page at page {page}")
break
# Extract data from current page
items = soup.find_all('div', class_='item')
for item in items:
data = {
'title': item.find('h2').text,
'price': item.find('span', class_='price').text
}
all_data.append(data)
time.sleep(1) # Rate limiting
return all_data
6.2 Handling Forms and POST Requests
import requests
# POST request with form data
login_url = "https://example.com/login"
login_data = {
'username': 'your_username',
'password': 'your_password'
}
# Create a session to persist cookies
session = requests.Session()
# Send POST request
response = session.post(login_url, data=login_data)
# Access protected pages using the same session
protected_url = "https://example.com/dashboard"
response = session.get(protected_url)
soup = BeautifulSoup(response.content, 'lxml')
6.3 Working with JSON APIs
import requests
import json
# Fetch JSON data
api_url = "https://api.example.com/data"
response = requests.get(api_url)
# Parse JSON
data = response.json()
# Alternative
data = json.loads(response.text)
# Extract specific fields
for item in data['results']:
print(item['name'], item['price'])
# Save to file
with open('data.json', 'w') as f:
json.dump(data, f, indent=4)
6.4 Handling Different Encodings
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
# Check encoding
print(response.encoding)
# Set encoding manually if needed
response.encoding = 'utf-8'
# Or let BeautifulSoup handle it
soup = BeautifulSoup(response.content, 'lxml', from_encoding='utf-8')
6.5 Working with Tables
import pandas as pd
from bs4 import BeautifulSoup
import requests
def scrape_table(url):
"""Scrape HTML tables"""
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Method 1: Using pandas (easiest for simple tables)
tables = pd.read_html(response.text)
df = tables[0] # First table
# Method 2: Manual parsing (more control)
table = soup.find('table')
# Extract headers
headers = []
for th in table.find_all('th'):
headers.append(th.text.strip())
# Extract rows
rows = []
for tr in table.find_all('tr')[1:]: # Skip header row
row = []
for td in tr.find_all('td'):
row.append(td.text.strip())
rows.append(row)
# Create DataFrame
df = pd.DataFrame(rows, columns=headers)
return df
# Usage
df = scrape_table("https://example.com/table")
print(df)
6.6 Downloading Files
import requests
import os
import time
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def download_file(url, filename=None):
"""Download file from URL"""
response = requests.get(url, stream=True)
# Get filename from URL if not provided
if filename is None:
filename = url.split('/')[-1]
# Download in chunks for large files
with open(filename, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
print(f"Downloaded: {filename}")
# Download multiple files
def download_images(soup, base_url, folder='images'):
"""Download all images from page"""
os.makedirs(folder, exist_ok=True)
images = soup.find_all('img')
for idx, img in enumerate(images):
img_url = img.get('src')
# Handle relative URLs
if not img_url.startswith('http'):
img_url = urljoin(base_url, img_url)
filename = f"{folder}/image_{idx}.jpg"
download_file(img_url, filename)
time.sleep(0.5)
# Usage
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
download_images(soup, url)
7. Handling Dynamic Content with Selenium
Many modern websites use JavaScript to load content dynamically. For these sites, Selenium is essential.
7.1 Setting Up Selenium
# Install Selenium
pip install selenium
# Install WebDriver Manager (automatic driver management)
pip install webdriver-manager
7.2 Basic Selenium Usage
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
# Setup Chrome driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
# Navigate to page
driver.get("https://example.com")
# Wait for page to load
wait = WebDriverWait(driver, 10)
# Find elements
title = driver.find_element(By.TAG_NAME, "h1")
print(title.text)
# Find multiple elements
items = driver.find_elements(By.CLASS_NAME, "item")
# CSS Selectors
element = driver.find_element(By.CSS_SELECTOR, ".content > p")
# XPath
element = driver.find_element(By.XPATH, "//div[@class='content']")
# Close browser
driver.quit()
7.3 Interacting with Elements
from selenium.webdriver.common.keys import Keys
# Click button
button = driver.find_element(By.ID, "submit-btn")
button.click()
# Fill form
search_box = driver.find_element(By.NAME, "search")
search_box.send_keys("web scraping")
search_box.send_keys(Keys.RETURN) # Press Enter
# Scroll to element
element = driver.find_element(By.ID, "footer")
driver.execute_script("arguments[0].scrollIntoView();", element)
# Execute JavaScript
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Select dropdown
from selenium.webdriver.support.select import Select
dropdown = Select(driver.find_element(By.ID, "dropdown"))
dropdown.select_by_visible_text("Option 1")
dropdown.select_by_value("option1")
dropdown.select_by_index(0)
7.4 Waiting for Elements
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# Explicit wait
wait = WebDriverWait(driver, 10)
# Wait for element to be present
element = wait.until(
EC.presence_of_element_located((By.ID, "dynamic-content"))
)
# Wait for element to be clickable
button = wait.until(
EC.element_to_be_clickable((By.ID, "submit-btn"))
)
# Wait for element to be visible
element = wait.until(
EC.visibility_of_element_located((By.CLASS_NAME, "modal"))
)
# Wait for title to contain text
wait.until(EC.title_contains("Expected Title"))
# Implicit wait (applies to all find operations)
driver.implicitly_wait(10) # seconds
7.5 Handling Multiple Windows/Tabs
# Get current window handle
main_window = driver.current_window_handle
# Click link that opens new tab
link = driver.find_element(By.LINK_TEXT, "Open in new tab")
link.click()
# Switch to new tab
all_windows = driver.window_handles
for window in all_windows:
if window != main_window:
driver.switch_to.window(window)
break
# Do something in new tab
print(driver.title)
# Switch back to main window
driver.switch_to.window(main_window)
# Close current tab
driver.close()
7.6 Headless Browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Configure headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--window-size=1920,1080")
# Create driver with options
driver = webdriver.Chrome(options=chrome_options)
# Use normally
driver.get("https://example.com")
print(driver.title)
driver.quit()
7.7 Complete Selenium Example
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
import pandas as pd
import time
def scrape_dynamic_site():
"""Scrape a site with dynamic content"""
# Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)
try:
# Navigate to site
driver.get("https://example.com")
# Wait for dynamic content to load
wait = WebDriverWait(driver, 10)
wait.until(
EC.presence_of_element_located((By.CLASS_NAME, "item"))
)
# Scroll to load more content
for _ in range(3):
driver.execute_script(
"window.scrollTo(0, document.body.scrollHeight);"
)
time.sleep(2)
# Extract data
items = driver.find_elements(By.CLASS_NAME, "item")
data = []
for item in items:
title = item.find_element(By.CLASS_NAME, "title").text
price = item.find_element(By.CLASS_NAME, "price").text
data.append({
'title': title,
'price': price
})
# Save to DataFrame
df = pd.DataFrame(data)
df.to_csv('scraped_data.csv', index=False)
print(f"Scraped {len(data)} items")
return df
finally:
driver.quit()
# Run scraper
if __name__ == "__main__":
df = scrape_dynamic_site()
print(df.head())
8. Best Practices
8.1 Error Handling
import requests
from bs4 import BeautifulSoup
import logging
# Setup logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
def safe_scrape(url):
"""Scrape with comprehensive error handling"""
try:
# Make request
response = requests.get(url, timeout=10)
response.raise_for_status() # Raise exception for bad status codes
# Parse HTML
soup = BeautifulSoup(response.content, 'lxml')
# Extract data with safety checks
title = soup.find('h1')
if title:
logging.info(f"Found title: {title.text}")
else:
logging.warning("No title found")
return soup
except requests.Timeout:
logging.error(f"Timeout error for {url}")
except requests.HTTPError as e:
logging.error(f"HTTP error: {e}")
except requests.RequestException as e:
logging.error(f"Request failed: {e}")
except Exception as e:
logging.error(f"Unexpected error: {e}")
return None
# Usage
soup = safe_scrape("https://example.com")
if soup:
# Process data
pass
8.2 Rate Limiting and Delays
import time
from datetime import datetime
import random
class RateLimiter:
"""Simple rate limiter"""
def __init__(self, requests_per_second=1):
self.delay = 1.0 / requests_per_second
self.last_request = 0
def wait(self):
"""Wait if necessary to maintain rate limit"""
elapsed = time.time() - self.last_request
if elapsed < self.delay:
time.sleep(self.delay - elapsed)
self.last_request = time.time()
# Usage
limiter = RateLimiter(requests_per_second=2)
for url in urls:
limiter.wait()
response = requests.get(url)
# Process response
# Alternative: Random delay
def random_delay(min_seconds=1, max_seconds=3):
"""Add random delay between requests"""
time.sleep(random.uniform(min_seconds, max_seconds))
# Usage
for url in urls:
response = requests.get(url)
random_delay(1, 3)
8.3 Using Proxies
import requests
# Single proxy
proxies = {
'http': 'http://proxy.example.com:8080',
'https': 'https://proxy.example.com:8080',
}
response = requests.get(url, proxies=proxies)
# Rotating proxies
class ProxyRotator:
"""Rotate through multiple proxies"""
def __init__(self, proxy_list):
self.proxies = proxy_list
self.current = 0
def get_proxy(self):
"""Get next proxy in rotation"""
proxy = self.proxies[self.current]
self.current = (self.current + 1) % len(self.proxies)
return {'http': proxy, 'https': proxy}
# Usage
proxy_list = [
'http://proxy1.com:8080',
'http://proxy2.com:8080',
'http://proxy3.com:8080',
]
rotator = ProxyRotator(proxy_list)
for url in urls:
proxies = rotator.get_proxy()
try:
response = requests.get(url, proxies=proxies, timeout=5)
except requests.RequestException:
continue
8.4 User Agents
import random
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101',
]
def get_random_user_agent():
"""Return random user agent"""
return random.choice(USER_AGENTS)
# Usage
headers = {
'User-Agent': get_random_user_agent()
}
response = requests.get(url, headers=headers)
8.5 Caching
import requests
from functools import lru_cache
import pickle
import os
from datetime import datetime, timedelta
# Simple in-memory caching with lru_cache
@lru_cache(maxsize=128)
def fetch_page(url):
"""Cached page fetching"""
response = requests.get(url)
return response.text
# File-based caching
class FileCache:
"""Simple file-based cache"""
def __init__(self, cache_dir='cache', expiry_hours=24):
self.cache_dir = cache_dir
self.expiry = timedelta(hours=expiry_hours)
os.makedirs(cache_dir, exist_ok=True)
def get_cache_path(self, url):
"""Generate cache file path"""
filename = url.replace('/', '_').replace(':', '_')
return os.path.join(self.cache_dir, f"{filename}.pkl")
def get(self, url):
"""Get cached content"""
cache_path = self.get_cache_path(url)
if not os.path.exists(cache_path):
return None
# Check if cache is expired
file_time = datetime.fromtimestamp(os.path.getmtime(cache_path))
if datetime.now() - file_time > self.expiry:
return None
with open(cache_path, 'rb') as f:
return pickle.load(f)
def set(self, url, content):
"""Save content to cache"""
cache_path = self.get_cache_path(url)
with open(cache_path, 'wb') as f:
pickle.dump(content, f)
# Usage
cache = FileCache()
def fetch_with_cache(url):
# Try cache first
content = cache.get(url)
if content:
print("Using cached version")
return content
# Fetch if not cached
response = requests.get(url)
content = response.text
cache.set(url, content)
return content
8.6 Data Validation and Cleaning
import re
import pandas as pd
def clean_text(text):
"""Clean extracted text"""
if not text:
return ""
# Remove extra whitespace
text = ' '.join(text.split())
# Remove special characters if needed
text = re.sub(r'[^\w\s.,!?-]', '', text)
return text.strip()
def clean_price(price_string):
"""Extract numeric price from string"""
# Remove currency symbols and extract number
match = re.search(r'[\d,]+\.?\d*', price_string)
if match:
number = match.group().replace(',', '')
return float(number)
return None
def validate_data(df):
"""Validate scraped data"""
# Remove duplicates
df = df.drop_duplicates()
# Remove rows with missing values
df = df.dropna(subset=['title', 'price'])
# Validate price range
df = df[df['price'] > 0]
df = df[df['price'] < 10000]
# Clean text fields
df['title'] = df['title'].apply(clean_text)
return df
# Usage
data = []
for item in items:
title = clean_text(item.find('h2').text)
price = clean_price(item.find('span', class_='price').text)
if title and price:
data.append({'title': title, 'price': price})
df = pd.DataFrame(data)
df = validate_data(df)
9. Common Challenges and Solutions
9.1 Handling AJAX/Dynamic Content
Problem: Content loads after page load via JavaScript.
Solution:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get(url)
# Wait for AJAX content
wait = WebDriverWait(driver, 10)
element = wait.until(
EC.presence_of_element_located((By.CLASS_NAME, "ajax-content"))
)
# Or inspect network requests and call API directly
import requests
api_url = "https://example.com/api/data"
response = requests.get(api_url)
data = response.json()
9.2 Bypassing CAPTCHAs
Problem: Sites use CAPTCHAs to prevent bots.
Solutions:
- Use official APIs instead
- Contact site owner for permission
- Use CAPTCHA solving services (2captcha, Anti-Captcha)
- Implement human-like behavior patterns
# Example: More human-like scraping
import time
import random
from selenium.webdriver.common.action_chains import ActionChains
def human_like_delay():
time.sleep(random.uniform(1, 3))
def human_like_scroll(driver):
# Scroll gradually like a human
for i in range(5):
driver.execute_script(f"window.scrollTo(0, {i * 200});")
time.sleep(random.uniform(0.1, 0.3))
def human_like_mouse_movement(driver, element):
actions = ActionChains(driver)
actions.move_to_element(element)
actions.pause(random.uniform(0.1, 0.5))
actions.click()
actions.perform()
9.3 Handling Session and Cookies
Problem: Need to maintain session across requests.
Solution:
import requests
# Use session object
session = requests.Session()
# Login
login_data = {'username': 'user', 'password': 'pass'}
session.post('https://example.com/login', data=login_data)
# Session maintains cookies automatically
response = session.get('https://example.com/protected-page')
# Manually set cookies
cookies = {'session_id': 'abc123'}
response = requests.get(url, cookies=cookies)
# Save and load cookies with Selenium
from selenium import webdriver
import pickle
driver = webdriver.Chrome()
driver.get('https://example.com')
# Save cookies
cookies = driver.get_cookies()
pickle.dump(cookies, open('cookies.pkl', 'wb'))
# Load cookies
driver.get('https://example.com')
cookies = pickle.load(open('cookies.pkl', 'rb'))
for cookie in cookies:
driver.add_cookie(cookie)
driver.refresh()
9.4 Handling Infinite Scroll
Problem: Content loads continuously as you scroll.
Solution:
from selenium import webdriver
import time
def scrape_infinite_scroll(url, scroll_pause=2, max_scrolls=10):
driver = webdriver.Chrome()
driver.get(url)
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
scrolls = 0
while scrolls < max_scrolls:
# Scroll down
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait for content to load
time.sleep(scroll_pause)
# Calculate new scroll height
new_height = driver.execute_script("return document.body.scrollHeight")
# Break if reached bottom
if new_height == last_height:
break
last_height = new_height
scrolls += 1
# Extract data
html = driver.page_source
driver.quit()
return html
9.5 Handling Rate Limiting (429 Errors)
Problem: Too many requests cause blocks.
Solution:
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def requests_with_retry():
"""Create session with automatic retries"""
session = requests.Session()
retry = Retry(
total=5,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504],
method_whitelist=["HEAD", "GET", "OPTIONS"]
)
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
# Usage
session = requests_with_retry()
def scrape_with_backoff(url, max_retries=3):
"""Exponential backoff on rate limiting"""
for attempt in range(max_retries):
response = session.get(url)
if response.status_code == 429:
wait_time = 2 ** attempt # Exponential backoff
print(f"Rate limited. Waiting {wait_time} seconds...")
time.sleep(wait_time)
continue
if response.status_code == 200:
return response
return None
10. Real-World Projects
Project 1: E-commerce Price Tracker
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import smtplib
from email.mime.text import MIMEText
class PriceTracker:
"""Track product prices on e-commerce sites"""
def __init__(self, products_csv='products.csv'):
self.products_csv = products_csv
self.products = self.load_products()
def load_products(self):
"""Load products to track"""
try:
return pd.read_csv(self.products_csv)
except FileNotFoundError:
return pd.DataFrame(columns=['name', 'url', 'target_price'])
def get_price(self, url):
"""Extract price from product page"""
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# This selector varies by site - adjust accordingly
price_elem = soup.find('span', class_='price')
if price_elem:
price_text = price_elem.text
# Extract numeric value
import re
match = re.search(r'[\d,]+\.?\d*', price_text)
if match:
return float(match.group().replace(',', ''))
return None
def check_prices(self):
"""Check all product prices"""
results = []
for _, product in self.products.iterrows():
current_price = self.get_price(product['url'])
if current_price:
result = {
'name': product['name'],
'current_price': current_price,
'target_price': product['target_price'],
'timestamp': datetime.now(),
'alert': current_price <= product['target_price']
}
results.append(result)
print(f"{product['name']}: ${current_price}")
return pd.DataFrame(results)
def send_alert(self, product_name, current_price, target_price):
"""Send email alert for price drop"""
msg = MIMEText(
f"Price alert for {product_name}!\n"
f"Current price: ${current_price}\n"
f"Target price: ${target_price}"
)
msg['Subject'] = f'Price Alert: {product_name}'
msg['From'] = 'your_email@gmail.com'
msg['To'] = 'recipient@gmail.com'
# Send email (configure SMTP settings)
# smtp = smtplib.SMTP('smtp.gmail.com', 587)
# smtp.starttls()
# smtp.login('your_email@gmail.com', 'password')
# smtp.send_message(msg)
# smtp.quit()
def run(self):
"""Run price tracking"""
results = self.check_prices()
# Check for alerts
alerts = results[results['alert']]
for _, alert in alerts.iterrows():
print(f"ALERT: {alert['name']} is now ${alert['current_price']}!")
# self.send_alert(...)
# Save history
history_file = 'price_history.csv'
try:
history = pd.read_csv(history_file)
history = pd.concat([history, results], ignore_index=True)
except FileNotFoundError:
history = results
history.to_csv(history_file, index=False)
return results
# Usage
if __name__ == "__main__":
tracker = PriceTracker()
results = tracker.run()
print(results)
Project 2: News Aggregator
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import feedparser  # third-party: pip install feedparser
class NewsAggregator:
"""Aggregate news from multiple sources"""
def __init__(self):
self.sources = {
'TechCrunch': 'https://techcrunch.com',
'The Verge': 'https://www.theverge.com',
}
self.articles = []
def scrape_rss(self, rss_url):
"""Scrape articles from RSS feed"""
feed = feedparser.parse(rss_url)
articles = []
for entry in feed.entries[:10]: # Top 10 articles
article = {
'title': entry.title,
'link': entry.link,
'published': entry.published,
'summary': entry.summary if hasattr(entry, 'summary') else ''
}
articles.append(article)
return articles
def scrape_website(self, url, source_name):
"""Scrape articles from website"""
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
articles = []
# This varies by site - customize selectors
article_elements = soup.find_all('article', limit=10)
for article_elem in article_elements:
title_elem = article_elem.find('h2')
link_elem = article_elem.find('a')
if title_elem and link_elem:
article = {
'source': source_name,
'title': title_elem.text.strip(),
'link': link_elem.get('href'),
'timestamp': datetime.now()
}
articles.append(article)
return articles
def aggregate(self):
"""Aggregate news from all sources"""
all_articles = []
for source_name, url in self.sources.items():
print(f"Scraping {source_name}...")
articles = self.scrape_website(url, source_name)
all_articles.extend(articles)
df = pd.DataFrame(all_articles)
df = df.drop_duplicates(subset=['title'])
df.to_csv('news_aggregated.csv', index=False)
return df
def filter_by_keywords(self, df, keywords):
"""Filter articles by keywords"""
mask = df['title'].str.contains('|'.join(keywords), case=False)
return df[mask]
# Usage
if __name__ == "__main__":
aggregator = NewsAggregator()
articles = aggregator.aggregate()
# Filter by topics
ai_articles = aggregator.filter_by_keywords(
articles,
['AI', 'artificial intelligence', 'machine learning']
)
print(f"Found {len(ai_articles)} AI-related articles")
print(ai_articles[['source', 'title']])
Project 3: Job Listing Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re
class JobScraper:
"""Scrape job listings from multiple sites"""
def __init__(self, keywords, location=''):
self.keywords = keywords
self.location = location
self.jobs = []
def scrape_indeed(self):
"""Scrape jobs from Indeed"""
base_url = "https://www.indeed.com/jobs"
params = {
'q': ' '.join(self.keywords),
'l': self.location
}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(base_url, params=params, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# Find job cards (selector may need updating)
job_cards = soup.find_all('div', class_='job_seen_beacon')
jobs = []
for card in job_cards:
try:
title_elem = card.find('h2', class_='jobTitle')
company_elem = card.find('span', class_='companyName')
location_elem = card.find('div', class_='companyLocation')
job = {
'title': title_elem.text.strip() if title_elem else 'N/A',
'company': company_elem.text.strip() if company_elem else 'N/A',
'location': location_elem.text.strip() if location_elem else 'N/A',
'source': 'Indeed',
'scraped_date': datetime.now()
}
jobs.append(job)
except Exception as e:
continue
return jobs
def scrape_linkedin(self):
"""Scrape jobs from LinkedIn (requires authentication)"""
# LinkedIn requires login - use Selenium or API
pass
def clean_salary(self, salary_text):
"""Extract salary range from text"""
if not salary_text:
return None, None
# Extract numbers
numbers = re.findall(r'\d+,?\d*', salary_text)
if len(numbers) >= 2:
min_salary = int(numbers[0].replace(',', ''))
max_salary = int(numbers[1].replace(',', ''))
return min_salary, max_salary
return None, None
def aggregate_jobs(self):
"""Aggregate jobs from all sources"""
all_jobs = []
print("Scraping Indeed...")
indeed_jobs = self.scrape_indeed()
all_jobs.extend(indeed_jobs)
# Add more sources
# all_jobs.extend(self.scrape_linkedin())
df = pd.DataFrame(all_jobs)
df = df.drop_duplicates(subset=['title', 'company'])
return df
def save_results(self, df, filename='jobs.csv'):
"""Save results to CSV"""
df.to_csv(filename, index=False)
print(f"Saved {len(df)} jobs to {filename}")
def filter_by_salary(self, df, min_salary):
"""Filter jobs by minimum salary"""
# Implement salary filtering logic
pass
# Usage
if __name__ == "__main__":
scraper = JobScraper(
keywords=['Python', 'Developer'],
location='New York'
)
jobs_df = scraper.aggregate_jobs()
scraper.save_results(jobs_df)
print(f"Found {len(jobs_df)} jobs")
print(jobs_df.head())
Conclusion
Web scraping is a powerful skill that opens up endless possibilities for data collection and analysis. Remember to:
- Be Ethical: Respect robots.txt, terms of service, and rate limits
- Start Simple: Begin with basic scraping before moving to complex scenarios
- Handle Errors: Always implement proper error handling
- Stay Updated: Websites change frequently – maintain your scrapers
- Use APIs: When available, an official API is more reliable and maintainable than scraping
Next Steps
- Practice on scraping-friendly sites like quotes.toscrape.com (used in section 5.3) and books.toscrape.com
- Explore frameworks:
- Scrapy: Full-featured scraping framework (a short spider sketch follows this list)
- Playwright: Modern browser automation
- Puppeteer: Node.js browser automation
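To give a taste of Scrapy, here is a minimal spider that mirrors the quotes example from section 5.3. It follows the pattern of the official Scrapy tutorial; the CSS selectors assume quotes.toscrape.com's markup and may need adjusting for other sites:

import scrapy

class QuotesSpider(scrapy.Spider):
    """Run with: scrapy runspider quotes_spider.py -o quotes.json"""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]
    custom_settings = {"DOWNLOAD_DELAY": 1}  # one request per second, to stay polite

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("a.tag::text").getall(),
            }
        # Follow pagination until there is no "Next" link left
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)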
Additional Resources
- Documentation: the official Requests, BeautifulSoup, and Selenium docs
- Books:
- “Web Scraping with Python” by Ryan Mitchell
- “Automate the Boring Stuff with Python” by Al Sweigart
- Practice Sites: quotes.toscrape.com and books.toscrape.com
Python Web Scraping Cheatsheet
Quick Reference Guide
1. Installation & Setup
# Essential libraries
pip install requests beautifulsoup4 lxml selenium pandas
# Optional but useful
pip install scrapy playwright webdriver-manager fake-useragent
2. Requests Library
Basic GET Request
import requests
response = requests.get('https://example.com')
html = response.text
content = response.content # bytes
status = response.status_code
Headers & User-Agent
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
'Accept': 'text/html,application/xhtml+xml',
'Accept-Language': 'en-US,en;q=0.9'
}
response = requests.get(url, headers=headers)
Parameters & POST Requests
# GET with parameters
params = {'search': 'python', 'page': 1}
response = requests.get(url, params=params)
# POST with data
data = {'username': 'user', 'password': 'pass'}
response = requests.post(url, data=data)
# POST with JSON
json_data = {'key': 'value'}
response = requests.post(url, json=json_data)
Sessions & Cookies
session = requests.Session()
session.get('https://example.com/login')
session.post('https://example.com/auth', data=login_data)
response = session.get('https://example.com/profile')
Timeouts & Retries
# Timeout
response = requests.get(url, timeout=5)
# Proxies
proxies = {'http': 'http://proxy:8080', 'https': 'https://proxy:8080'}
response = requests.get(url, proxies=proxies)
# Verify SSL (disable for testing only)
response = requests.get(url, verify=False)
3. BeautifulSoup Parsing
Initialize Soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml') # Fast parser
soup = BeautifulSoup(html, 'html.parser') # Built-in parser
Finding Elements
# Find first matching element
soup.find('div')
soup.find('div', class_='content')
soup.find('div', id='main')
soup.find('a', href=True)
soup.find('div', attrs={'data-id': '123'})
# Find all matching elements
soup.find_all('p')
soup.find_all('div', class_='item')
soup.find_all(['h1', 'h2', 'h3'])
soup.find_all('a', limit=5)
CSS Selectors
soup.select('div.content') # Class selector
soup.select('#main') # ID selector
soup.select('div > p') # Direct child
soup.select('div p') # Descendant
soup.select('a[href]') # Has attribute
soup.select('a[href^="http"]') # Starts with
soup.select('a[href$=".pdf"]') # Ends with
soup.select('div:nth-of-type(2)') # Nth element
soup.select_one('div.content') # First match only
Extracting Data
# Get text
element.text
element.get_text()
element.get_text(strip=True)
element.string # Direct text only
# Get attributes
element.get('href')
element['href']
element.attrs # All attributes dict
# Check existence
if element:
print("Element exists")PythonNavigation
# Parents
element.parent
element.parents
# Siblings
element.next_sibling
element.previous_sibling
element.next_siblings
element.previous_siblings
# Children
element.children # Direct children
element.descendants # All descendants
# Finding relatives
element.find_next('p')
element.find_previous('div')
element.find_next_sibling('span')
4. Selenium WebDriver
Setup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
# Basic setup
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
# Headless mode
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(service=service, options=options)
Finding Elements
# Single element
driver.find_element(By.ID, 'element-id')
driver.find_element(By.NAME, 'element-name')
driver.find_element(By.CLASS_NAME, 'class-name')
driver.find_element(By.TAG_NAME, 'div')
driver.find_element(By.CSS_SELECTOR, 'div.content')
driver.find_element(By.XPATH, '//div[@class="content"]')
driver.find_element(By.LINK_TEXT, 'Click Here')
driver.find_element(By.PARTIAL_LINK_TEXT, 'Click')
# Multiple elements
driver.find_elements(By.CLASS_NAME, 'item')
Interactions
# Click
element.click()
# Type text
element.send_keys('text to type')
element.send_keys(Keys.RETURN)
element.send_keys(Keys.CONTROL, 'a')
# Clear input
element.clear()
# Get values
element.text
element.get_attribute('href')
element.get_attribute('value')
# Check states
element.is_displayed()
element.is_enabled()
element.is_selected()
Waits
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Implicit wait (global)
driver.implicitly_wait(10)
# Explicit wait
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, 'myElement')))
element = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'content')))
element = wait.until(EC.element_to_be_clickable((By.ID, 'button')))
# Wait conditions
EC.title_contains('Expected')
EC.title_is('Exact Title')
EC.presence_of_element_located()
EC.visibility_of_element_located()
EC.element_to_be_clickable()
EC.staleness_of()
EC.frame_to_be_available_and_switch_to_it()
Navigation & Actions
# Navigation
driver.get('https://example.com')
driver.back()
driver.forward()
driver.refresh()
# JavaScript execution
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
driver.execute_script('arguments[0].click();', element)
# Screenshots
driver.save_screenshot('screenshot.png')
# Window management
driver.maximize_window()
driver.set_window_size(1920, 1080)
# Cookies
driver.get_cookies()
driver.add_cookie({'name': 'key', 'value': 'value'})
driver.delete_all_cookies()
# Close
driver.close() # Current window
driver.quit() # All windows
Switching Contexts
# Windows/Tabs
main_window = driver.current_window_handle
all_windows = driver.window_handles
driver.switch_to.window(all_windows[1])
# Frames
driver.switch_to.frame('frame-name')
driver.switch_to.frame(0) # By index
driver.switch_to.default_content() # Back to main
# Alerts
alert = driver.switch_to.alert
alert.accept()
alert.dismiss()
alert.send_keys('text')
alert_text = alert.text
5. Common Patterns
Safe Element Finding
# BeautifulSoup
element = soup.find('h1')
text = element.text if element else 'N/A'
# Alternative
text = soup.find('h1').text if soup.find('h1') else 'N/A'
# With get_text
text = element.get_text(strip=True) if element else ''
Pagination Loop
page = 1
while True:
url = f'{base_url}?page={page}'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
items = soup.find_all('div', class_='item')
if not items:
break
# Process items
for item in items:
# Extract data
pass
page += 1
time.sleep(1)
Infinite Scroll
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get(url)
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(2)
new_height = driver.execute_script('return document.body.scrollHeight')
if new_height == last_height:
break
last_height = new_height
Rate Limiting
import time
from datetime import datetime
def rate_limit(calls_per_second=1):
min_interval = 1.0 / calls_per_second
last_called = [0.0]
def decorator(func):
def wrapper(*args, **kwargs):
elapsed = time.time() - last_called[0]
if elapsed < min_interval:
time.sleep(min_interval - elapsed)
result = func(*args, **kwargs)
last_called[0] = time.time()
return result
return wrapper
return decorator
@rate_limit(calls_per_second=2)
def scrape_page(url):
return requests.get(url)
6. Data Extraction & Cleaning
Text Cleaning
import re
# Remove whitespace
text = ' '.join(text.split())
# Remove special characters
text = re.sub(r'[^\w\s]', '', text)
# Extract numbers
numbers = re.findall(r'\d+', text)
# Extract price
price = re.search(r'\$?([\d,]+\.?\d*)', text)
if price:
price = float(price.group(1).replace(',', ''))
List Comprehension
# Extract text from elements
texts = [elem.text for elem in elements]
# Extract with condition
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
# Extract and clean
prices = [float(p.text.strip('$')) for p in soup.find_all('span', class_='price')]
Pandas DataFrame
import pandas as pd
# From list of dicts
data = [{'name': 'Item 1', 'price': 10}, {'name': 'Item 2', 'price': 20}]
df = pd.DataFrame(data)
# Save to CSV
df.to_csv('output.csv', index=False)
# Save to Excel
df.to_excel('output.xlsx', index=False)
# Save to JSON
df.to_json('output.json', orient='records', indent=4)
7. Error Handling
Try-Except Blocks
import requests
from requests.exceptions import RequestException
try:
response = requests.get(url, timeout=10)
response.raise_for_status()
except requests.Timeout:
print('Request timed out')
except requests.HTTPError as e:
print(f'HTTP error: {e}')
except RequestException as e:
print(f'Request failed: {e}')
except Exception as e:
print(f'Unexpected error: {e}')
Selenium Exceptions
from selenium.common.exceptions import (
NoSuchElementException,
TimeoutException,
StaleElementReferenceException
)
try:
element = driver.find_element(By.ID, 'element-id')
except NoSuchElementException:
print('Element not found')
except TimeoutException:
print('Timed out waiting for element')
except StaleElementReferenceException:
print('Element is no longer attached to DOM')
8. Useful Code Snippets
Check robots.txt
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
can_fetch = rp.can_fetch('*', 'https://example.com/page')
Random User Agent
import random
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
headers = {'User-Agent': random.choice(USER_AGENTS)}
Download File
import requests
response = requests.get(file_url, stream=True)
with open('file.pdf', 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
Parse JSON from API
response = requests.get(api_url)
data = response.json()
# Or
import json
data = json.loads(response.text)
Save to Database (SQLite)
import sqlite3
conn = sqlite3.connect('scraped_data.db')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS items (
id INTEGER PRIMARY KEY,
title TEXT,
price REAL
)
''')
cursor.execute('INSERT INTO items (title, price) VALUES (?, ?)', ('Item', 19.99))
conn.commit()
conn.close()
9. XPath Quick Reference
# Basic
'//div' # All divs
'//div[@class="content"]' # Div with class
'//div[@id="main"]' # Div with id
'//a[@href]' # Links with href
# Text
'//h1/text()' # Text content
'//div[contains(text(), "Hello")]' # Contains text
'//div[starts-with(text(), "Hello")]'# Starts with
# Attributes
'//img/@src' # Get src attribute
'//a[contains(@href, ".pdf")]' # Href contains .pdf
# Position
'//div[1]' # First div
'//div[last()]' # Last div
'//div[position() < 3]' # First two divs
# Axes
'//div/parent::*' # Parent
'//div/following-sibling::*' # Next siblings
'//div/preceding-sibling::*' # Previous siblings
'//div/descendant::p' # All p descendants
10. Performance Tips
# Use lxml parser (fastest)
soup = BeautifulSoup(html, 'lxml')
# Limit results
soup.find_all('div', limit=10)
# Use CSS selectors (faster than find_all)
soup.select('div.item')
# Compile regex patterns
import re
pattern = re.compile(r'\d+')
numbers = pattern.findall(text)
# Multiprocessing for multiple URLs
from multiprocessing import Pool
def scrape(url):
# Scraping logic
return data
with Pool(5) as p:
results = p.map(scrape, urls)
# Async requests
import asyncio
import aiohttp
async def fetch(session, url):
async with session.get(url) as response:
return await response.text()
async def main():
async with aiohttp.ClientSession() as session:
tasks = [fetch(session, url) for url in urls]
results = await asyncio.gather(*tasks)
asyncio.run(main())
11. Debugging Tips
# Print response
print(response.status_code)
print(response.headers)
print(response.text[:500]) # First 500 chars
# Pretty print soup
print(soup.prettify()[:1000])
# Check if element exists
element = soup.find('div', class_='content')
print(f"Element found: {element is not None}")
# Save HTML for inspection
with open('page.html', 'w', encoding='utf-8') as f:
f.write(response.text)
# Selenium debugging
print(driver.current_url)
print(driver.title)
print(driver.page_source[:500])
# Wait and see (Selenium)
import time
time.sleep(5) # Pause to see what's happening
12. Common Selectors Cheatsheet
| Element Type | BeautifulSoup | CSS Selector | XPath |
|---|---|---|---|
| By tag | find('div') | select('div') | //div |
| By class | find('div', class_='item') | select('.item') | //div[@class="item"] |
| By id | find(id='main') | select('#main') | //*[@id="main"] |
| By attribute | find('a', href=True) | select('a[href]') | //a[@href] |
| Direct child | N/A | select('div > p') | //div/p |
| Descendant | N/A | select('div p') | //div//p |
| Multiple | find_all(['h1','h2']) | select('h1, h2') | //h1 \| //h2 |
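As a quick sanity check that the three approaches in the table reach the same data: BeautifulSoup itself does not evaluate XPath, so the XPath column is shown here through lxml; the HTML fragment is made up purely for illustration:

from bs4 import BeautifulSoup
from lxml import html as lxml_html

page = '<div class="item" id="main"><a href="/doc.pdf">Doc</a></div>'

soup = BeautifulSoup(page, 'lxml')
tree = lxml_html.fromstring(page)

print(soup.find('div', class_='item')['id'])             # BeautifulSoup find
print(soup.select_one('#main a[href$=".pdf"]')['href'])  # CSS selector
print(tree.xpath('//div[@class="item"]/a/@href')[0])     # XPath via lxml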
Quick Troubleshooting
| Problem | Solution |
|---|---|
| Empty results | Check selectors, wait for JS loading, verify page structure |
| 403/429 errors | Add User-Agent, reduce request rate, use proxies |
| Timeout errors | Increase timeout, check internet connection |
| Element not found | Wait for dynamic content, verify selector |
| Encoding issues | Use response.content with proper encoding |
| Cookie/Session | Use requests.Session() or Selenium |
| CAPTCHA | Use APIs, contact site owner, reduce frequency |
| Dynamic content | Use Selenium or check for JSON/API calls |
Happy Scraping! 🚀