8  HTML Scraping & Parsing

This chapter demonstrates how to safely download, cache, and parse HTML content using requests and BeautifulSoup.
Following the course guidelines, we practice ethical scraping: check robots.txt, send an identifying User-Agent header, throttle requests, and cache raw HTML to avoid repeated downloads.

8.1 Example: Download, Cache, and Parse a Web Page

import pathlib
import requests
from bs4 import BeautifulSoup

# URL to scrape (kept simple for demonstration)
URL = "https://example.com"

# Location to store cached HTML
raw_path = pathlib.Path("data/raw/page.html")
raw_path.parent.mkdir(parents=True, exist_ok=True)

# Cache-first scraping strategy
if raw_path.exists():
    html = raw_path.read_text(encoding="utf-8")
else:
    # (Ethics) Identify your scraper and avoid overloading servers
    headers = {"User-Agent": "STAT4160-student/1.0"}
    
    # (Ethics) In real scraping: check robots.txt, rate-limit, record request time
    response = requests.get(URL, headers=headers, timeout=10)
    response.raise_for_status()

    html = response.text
    raw_path.write_text(html, encoding="utf-8")

# Parse HTML with BeautifulSoup ("lxml" requires the lxml package;
# "html.parser" is a slower stdlib fallback)
soup = BeautifulSoup(html, "lxml")

# Extract title (guard against pages that have no <title> tag)
title = soup.title.get_text(strip=True) if soup.title else None

# Extract first 5 hyperlinks
links = [
    (a.get_text(strip=True), a.get("href"))
    for a in soup.select("a[href]")
][:5]

title, links
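The robots.txt check mentioned in the ethics comments can be sketched with the standard library's urllib.robotparser. The helper below is illustrative, not part of the chapter's code; the optional robots_txt parameter is an assumption added so the rules can be checked offline.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url, user_agent="STAT4160-student/1.0", robots_txt=None):
    """Return True if robots.txt permits this agent to fetch `url`.

    Pass `robots_txt` (the file's text) to skip the network download,
    e.g. when testing offline.
    """
    rp = RobotFileParser()
    if robots_txt is not None:
        rp.parse(robots_txt.splitlines())
    else:
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp.set_url(urljoin(root, "/robots.txt"))
        rp.read()  # downloads and parses the site's robots.txt
    return rp.can_fetch(user_agent, url)
```

In a real scraping loop you would call this before each request and also pause between requests (e.g. time.sleep(1)) so the server is not overloaded.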

8.2 Explanation

This example highlights several best practices in ethical web scraping:

  • Caching HTML reduces server load and speeds up local development.
  • A User-Agent header identifies your scraper responsibly.
  • A request timeout prevents indefinitely hanging network calls.
  • BeautifulSoup with the lxml parser extracts relevant information quickly (lxml must be installed separately).
  • CSS selectors such as a[href] make link extraction straightforward.
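The selector-based extraction can be exercised offline on a small inline HTML snippet (the snippet and its contents are made up for illustration; html.parser is used here to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

SNIPPET = """
<html><head><title>Demo</title></head>
<body>
  <a href="/a">First</a>
  <a href="/b">Second</a>
  <p>No link here</p>
</body></html>
"""

soup = BeautifulSoup(SNIPPET, "html.parser")

# a[href] matches only anchors that actually carry an href attribute
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("a[href]")]
# links == [("First", "/a"), ("Second", "/b")]
```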

The extracted title and link preview demonstrate how basic scraping workflows support further text parsing, data extraction, or downstream analysis.
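The cache-first pattern above can be wrapped in a reusable helper. The function name, cache directory, and URL-hash filename scheme below are illustrative choices, not part of the chapter's code:

```python
import hashlib
import pathlib

def fetch_cached(url, cache_dir="data/raw", headers=None, timeout=10):
    """Return page HTML, downloading only if no cached copy exists.

    The cache filename is a hash of the URL so distinct pages
    never collide on disk.
    """
    cache = pathlib.Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    path = cache / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    import requests  # imported lazily: cache hits need no network stack
    resp = requests.get(url, headers=headers or {}, timeout=timeout)
    resp.raise_for_status()
    path.write_text(resp.text, encoding="utf-8")
    return resp.text
```

With this helper, the scraping example reduces to fetch_cached(URL, headers={"User-Agent": "STAT4160-student/1.0"}) followed by the BeautifulSoup parsing steps.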