4 Shell Essentials
This chapter highlights powerful shell commands used to filter datasets, perform inline data transformations, and discover files safely across a directory structure. These tools are fundamental to building efficient, scriptable pipelines for data engineering and automation tasks.
4.1 Filtering, Transforming, and Locating Files
# create working directories and a small sample dataset
mkdir -p ../data/raw ../data/processed
printf "ticker,date,adj_close\nAAPL,2023-01-02,125.0\nAAPL,2023-01-03,126.5\n" > ../data/raw/prices.csv

# inspect the file: line count and first rows
wc -l ../data/raw/prices.csv
head -n 3 ../data/raw/prices.csv

# filter: keep only rows for the AAPL ticker
grep -E '^AAPL,' ../data/raw/prices.csv

# transform: append a synthetic volume column to every record
awk -F, 'BEGIN{OFS=","} NR==1{print $0,"volume";next} {print $0,100000+NR*10}' \
    ../data/raw/prices.csv > ../data/processed/prices_with_vol.csv

# locate: list all CSV files, safely handling unusual filenames
find ../data -type f -name "*.csv" -print0 | xargs -0 ls -lh
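The transformation step is worth verifying before the output file feeds a downstream stage. The sketch below recreates the sample dataset in a temporary directory (so it runs independently of the paths above) and checks that the awk command both appends the volume header and preserves the original rows:

```shell
# Self-contained check of the awk transformation, using a temp dir
# rather than ../data so it can run anywhere.
tmp=$(mktemp -d)
printf "ticker,date,adj_close\nAAPL,2023-01-02,125.0\nAAPL,2023-01-03,126.5\n" > "$tmp/prices.csv"

# Same transformation as above: append a synthetic volume field.
awk -F, 'BEGIN{OFS=","} NR==1{print $0,"volume";next} {print $0,100000+NR*10}' \
    "$tmp/prices.csv" > "$tmp/prices_with_vol.csv"

head -n 1 "$tmp/prices_with_vol.csv"              # ticker,date,adj_close,volume
grep -c '^AAPL,' "$tmp/prices_with_vol.csv"       # 2 (both data rows survive)
rm -rf "$tmp"
```

Because `NR` counts the header as line 1, the first data row receives volume 100020, the second 100030; a quick `head` on the output makes such off-by-one surprises visible early.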
4.2 Explanation
This example highlights key Unix command-line tools used to manipulate and search data:

- grep -E extracts only the lines that match an extended regular expression, enabling fast row-level filtering.
- awk performs a structured transformation by generating and appending a volume field to each record.
- find … -print0 | xargs -0 performs robust file discovery while safely handling unusual filenames.
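The value of the NUL-delimited form can be seen with a filename containing a space (the temp directory and the name "q1 report.csv" below are illustrative). A plain `find | xargs` pipeline word-splits such a name into two bogus arguments, while `-print0 | xargs -0` passes it through intact:

```shell
# Sketch: why -print0 / -0 matters for unusual filenames.
tmp=$(mktemp -d)
touch "$tmp/q1 report.csv"

# Unsafe: xargs splits on whitespace, so ls receives "q1" and
# "report.csv" as two separate (nonexistent) paths and fails.
find "$tmp" -type f -name "*.csv" | xargs ls 2>/dev/null || echo "unsafe form failed"

# Safe: NUL delimiters preserve the full name end to end.
find "$tmp" -type f -name "*.csv" -print0 | xargs -0 ls

rm -rf "$tmp"
```

The same NUL-delimited convention also protects against newlines embedded in filenames, which even quoted word-splitting cannot handle.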
Combined, these utilities illustrate how the shell can efficiently filter, reshape, and locate data components within a reproducible pipeline.