9  pytest & Data Validation

This chapter demonstrates how to write lightweight, high-value tests using pytest.
These tests validate the integrity of processed data and ensure that key assumptions—such as the absence of missing values or accidental lookahead bias—are satisfied.
Automated tests help maintain reproducibility and catch errors early in the workflow.


9.1 Creating Data Validation Tests

Below is an example test module that verifies two important conditions:

  1. The processed CSV contains a volume column and has no missing values in adj_close.
  2. The joined parquet dataset does not contain lookahead errors—dates should be non-decreasing within each ticker.

Save the following code to:

tests/test_data_health.py
import pandas as pd

def test_prices_has_volume_and_no_nans():
    df = pd.read_csv("data/processed/prices_with_vol.csv")
    
    # Column must exist
    assert "volume" in df.columns
    
    # No missing closing prices
    assert df["adj_close"].isna().sum() == 0


def test_no_lookahead_in_join():
    df = pd.read_parquet("data/processed/prices.parquet")
    
    # Day-to-day date differences per ticker, in the dataset's stored row
    # order. (Sorting by date first would make the check vacuous: sorted
    # dates are non-decreasing by construction.)
    g = df.groupby("ticker")["date"].diff().dt.days
    
    # All differences must be >= 0 (no time going backwards). The first row
    # of each ticker has no predecessor, so its NaN diff is treated as valid
    # via fillna(1).
    assert (g.fillna(1) >= 0).all()
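The core of the lookahead rule is the per-ticker date diff. A tiny synthetic frame (column names mirror the real dataset, but the values here are made up for illustration) shows how an out-of-order date surfaces as a negative difference:

```python
import pandas as pd

# Synthetic frame with one deliberately out-of-order date for ticker "B".
df = pd.DataFrame({
    "ticker": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2024-01-02", "2024-01-03",
                            "2024-01-05", "2024-01-04"]),
})

# Diff dates within each ticker group, in stored row order.
diffs = df.groupby("ticker")["date"].diff().dt.days

# Ticker "B" yields a difference of -1 day, so the check fails.
print((diffs.fillna(1) >= 0).all())  # False
```

Each ticker's first row produces a NaN diff (it has no predecessor), which `fillna(1)` converts into a passing value so only genuine backwards jumps trip the assertion.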

9.2 Running the Tests

From the project root directory, run:

pytest -q

If both tests pass, the output looks like:

2 passed in 0.12s

9.3 Explanation

These tests illustrate good practices for data validation:

  • Schema checks (ensuring expected columns exist).
  • NA checks to prevent downstream model failures.
  • No-lookahead rules, ensuring that dates move forward within each ticker group.
  • Simple logic + high value → fast, reliable safeguards for your pipeline.

pytest allows you to keep tests small, readable, and easy to extend as your project grows.
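One natural extension is to parametrize the schema check so that each required column becomes its own test case. The column list and the `load_prices` helper below are hypothetical placeholders — in the real project the helper would call `pd.read_csv("data/processed/prices_with_vol.csv")`; here it builds a tiny frame so the sketch is self-contained:

```python
import pandas as pd
import pytest

# Hypothetical required schema; adjust to your own dataset's columns.
REQUIRED_COLUMNS = ["ticker", "date", "adj_close", "volume"]

def load_prices():
    # Placeholder for reading the processed CSV; a small in-memory frame
    # keeps this example runnable on its own.
    return pd.DataFrame({
        "ticker": ["AAPL", "AAPL"],
        "date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
        "adj_close": [185.0, 186.2],
        "volume": [1_000_000, 950_000],
    })

@pytest.mark.parametrize("column", REQUIRED_COLUMNS)
def test_required_column_present(column):
    # pytest runs this once per entry in REQUIRED_COLUMNS.
    assert column in load_prices().columns
```

Running pytest on this module reports one pass or fail per column, so a missing column is immediately visible in the output rather than buried inside a single combined assertion.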