8 HTML Scraping & Parsing

This chapter demonstrates how to safely download, cache, and parse HTML content using requests and BeautifulSoup. Following the course guidelines, we practice ethical scraping by checking robots.txt, sending a descriptive User-Agent, throttling requests, and caching raw HTML to avoid repeated downloads.

8.1 Example: Download, Cache, and Parse a Web Page

import pathlib

import requests
from bs4 import BeautifulSoup

# URL to scrape (kept simple for demonstration)
URL = "https://example.com"

# Location to store cached HTML
raw_path = pathlib.Path("data/raw/page.html")
raw_path.parent.mkdir(parents=True, exist_ok=True)

# Cache-first strategy: reuse the saved copy instead of re-downloading
if raw_path.exists():
    html = raw_path.read_text(encoding="utf-8")
else:
    # (Ethics) Identify your scraper and avoid overloading servers
    headers = {"User-Agent": "STAT4160-student/1.0"}
    # (Ethics) In real scraping: check robots.txt, rate-limit, record request time
    response = requests.get(URL, headers=headers, timeout=10)
    response.raise_for_status()
    html = response.text
    raw_path.write_text(html, encoding="utf-8")

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html, "lxml")

# Extract the title (guard against pages without a <title> tag)
title = soup.title.get_text(strip=True) if soup.title else ""

# Extract the first 5 hyperlinks as (text, href) pairs
links = [
    (a.get_text(strip=True), a.get("href"))
    for a in soup.select("a[href]")
][:5]

title, links
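The robots.txt check mentioned above can be done with the standard library's urllib.robotparser. Here is a minimal offline sketch — the robots.txt content is invented for illustration; in real scraping you would load it from the site's /robots.txt URL:

```python
import urllib.robotparser

# A sample robots.txt, inlined so the sketch runs offline; in real
# scraping you would fetch https://<site>/robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 1
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

agent = "STAT4160-student/1.0"
allowed = rp.can_fetch(agent, "https://example.com/index.html")      # True
blocked = rp.can_fetch(agent, "https://example.com/private/x.html")  # False

# Respect the site's requested delay between requests (default to 1s);
# in a real scraper you would time.sleep(delay) before each request
delay = rp.crawl_delay(agent) or 1.0
print(allowed, blocked, delay)
```

Checking can_fetch before every request, and sleeping for the crawl delay between requests, keeps a scraper within the site's stated rules.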
8.2 Explanation
This example highlights several best practices in ethical web scraping:
- Caching HTML reduces server load and speeds up local development.
- User-Agent headers identify your scraper responsibly.
- A request timeout prevents hanging network calls.
- BeautifulSoup (with the lxml parser) extracts relevant information quickly.
- CSS selectors (a[href]) allow easy link parsing.
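To see why the a[href] selector matters, here is a self-contained sketch on an invented HTML snippet (using the stdlib html.parser backend here so it runs without lxml installed):

```python
from bs4 import BeautifulSoup

# Invented snippet so the selector examples run offline
HTML = """
<html><head><title>Demo</title></head>
<body>
  <a href="/a.html">First</a>
  <a href="/b.html" class="nav">Second</a>
  <a name="anchor-only">No href</a>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# a[href] matches only anchors that actually carry an href attribute,
# so the anchor-only <a> tag is skipped
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("a[href]")]
print(links)  # [('First', '/a.html'), ('Second', '/b.html')]

# Selecting by class works the same way
nav = [a["href"] for a in soup.select("a.nav")]
print(nav)    # ['/b.html']
```

Selecting a[href] rather than plain a avoids a KeyError (or None values) when a page contains anchor tags without links.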
The extracted title and link preview demonstrate how basic scraping workflows support further text parsing, data extraction, or downstream analysis.
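A common next step before downstream analysis is normalizing the scraped (text, href) pairs, for example resolving relative links against the page they came from. A minimal stdlib sketch — the base URL and sample pairs are invented for illustration:

```python
from urllib.parse import urljoin

BASE = "https://example.com/blog/index.html"

# Sample (text, href) pairs like those produced by the scraping example
raw_links = [
    ("Home", "/"),
    ("Post", "post-1.html"),
    ("Docs", "https://docs.example.com/"),
    ("Anchor", "#top"),
]

# Resolve each href relative to the page it was scraped from;
# absolute URLs pass through unchanged
resolved = [(text, urljoin(BASE, href)) for text, href in raw_links]
for text, url in resolved:
    print(f"{text}: {url}")
```

Storing absolute URLs makes the extracted links usable outside the context of the original page, e.g. for building a crawl frontier or deduplicating across pages.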