4 Shell Essentials
This chapter highlights powerful shell commands used to filter datasets, perform inline data transformations, and discover files safely across a directory structure. These tools are fundamental to building efficient, scriptable pipelines for data engineering and automation tasks.
4.1 Filtering, Transforming, and Locating Files
# create working directories and a small sample dataset
mkdir -p ../data/raw ../data/processed
printf "ticker,date,adj_close\nAAPL,2023-01-02,125.0\nAAPL,2023-01-03,126.5\n" > ../data/raw/prices.csv

# inspect the file: line count and first rows
wc -l ../data/raw/prices.csv
head -n 3 ../data/raw/prices.csv

# filter: keep only rows for the AAPL ticker
grep -E '^AAPL,' ../data/raw/prices.csv

# transform: append a synthetic volume column to every record
awk -F, 'BEGIN{OFS=","} NR==1{print $0,"volume";next} {print $0,100000+NR*10}' \
    ../data/raw/prices.csv > ../data/processed/prices_with_vol.csv

# locate: list all CSV files, safely handling unusual filenames
find ../data -type f -name "*.csv" -print0 | xargs -0 ls -lh
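The transformation step is worth verifying before the output file feeds a downstream stage. The sketch below recreates the sample dataset in a temporary directory (so it runs independently of the paths above) and checks that the awk command both appends the volume header and preserves the original rows:

```shell
# Self-contained check of the awk transformation, using a temp dir
# rather than ../data so it can run anywhere.
tmp=$(mktemp -d)
printf "ticker,date,adj_close\nAAPL,2023-01-02,125.0\nAAPL,2023-01-03,126.5\n" > "$tmp/prices.csv"

# Same transformation as above: append a synthetic volume field.
awk -F, 'BEGIN{OFS=","} NR==1{print $0,"volume";next} {print $0,100000+NR*10}' \
    "$tmp/prices.csv" > "$tmp/prices_with_vol.csv"

head -n 1 "$tmp/prices_with_vol.csv"              # ticker,date,adj_close,volume
grep -c '^AAPL,' "$tmp/prices_with_vol.csv"       # 2 (both data rows survive)
rm -rf "$tmp"
```

Because `NR` counts the header as line 1, the first data row receives volume 100020, the second 100030; a quick `head` on the output makes such off-by-one surprises visible early.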
4.2 Explanation
This example highlights key Unix command-line tools used to manipulate and search data:

- grep -E extracts only the lines that match an extended regular expression, enabling fast row-level filtering.
- awk performs a structured transformation by generating and appending a volume field to each record.
- find … -print0 | xargs -0 performs robust file discovery while safely handling unusual filenames.
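The value of the NUL-delimited form can be seen with a filename containing a space (the temp directory and the name "q1 report.csv" below are illustrative). A plain `find | xargs` pipeline word-splits such a name into two bogus arguments, while `-print0 | xargs -0` passes it through intact:

```shell
# Sketch: why -print0 / -0 matters for unusual filenames.
tmp=$(mktemp -d)
touch "$tmp/q1 report.csv"

# Unsafe: xargs splits on whitespace, so ls receives "q1" and
# "report.csv" as two separate (nonexistent) paths and fails.
find "$tmp" -type f -name "*.csv" | xargs ls 2>/dev/null || echo "unsafe form failed"

# Safe: NUL delimiters preserve the full name end to end.
find "$tmp" -type f -name "*.csv" -print0 | xargs -0 ls

rm -rf "$tmp"
```

The same NUL-delimited convention also protects against newlines embedded in filenames, which even quoted word-splitting cannot handle.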
Combined, these utilities illustrate how the shell can efficiently filter, reshape, and locate data components within a reproducible pipeline.