Introduction to awk in Data Science
While awk might not be the first tool that comes to mind for data science, it can be incredibly powerful for quick data manipulation, exploration, and preprocessing. Its ability to work with structured text data makes it particularly useful for handling CSV files, log files, and other common data formats.
1. Data Cleaning and Preprocessing
Removing duplicate lines:
awk '!seen[$0]++' data.csv
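The deduplication idiom is easy to verify on inline data; the rows below are made up for illustration:

```shell
# !seen[$0]++ is true only the first time a line appears:
# the array entry starts at 0 (false), then is incremented.
printf 'a,1\nb,2\na,1\nc,3\n' | awk '!seen[$0]++'
# a,1
# b,2
# c,3
```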
Filtering rows based on conditions:
awk -F',' '$3 > 1000 {print $0}' data.csv
Handling missing values:
awk -F',' '{for(i=1;i<=NF;i++) if($i=="") $i="NA"} 1' OFS=',' data.csv
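A quick check of the missing-value filler on two illustrative rows (assigning to any field makes awk rebuild the record with OFS, which is why OFS=',' matters here):

```shell
# The empty second field in the first row becomes "NA".
printf 'x,,z\n1,2,3\n' |
awk -F',' '{for(i=1;i<=NF;i++) if($i=="") $i="NA"} 1' OFS=','
# x,NA,z
# 1,2,3
```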
2. Data Transformation
Scaling numerical values:
awk -F',' '{$3 = $3 / 100; print}' OFS=',' data.csv
Encoding categorical variables:
awk -F',' '
BEGIN {split("low,medium,high", levels, ","); for(i in levels) category_map[levels[i]] = i}
{$4 = category_map[$4]; print}
' OFS=',' data.csv
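Run against two made-up rows, the mapping replaces the fourth column with its index in the BEGIN list (a category absent from that list would map to an empty string):

```shell
# low -> 1, medium -> 2, high -> 3, per the split() order in BEGIN.
printf 'a,b,c,low\nd,e,f,high\n' |
awk -F',' '
BEGIN {split("low,medium,high", levels, ","); for(i in levels) category_map[levels[i]] = i}
{$4 = category_map[$4]; print}
' OFS=','
# a,b,c,1
# d,e,f,3
```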
3. Descriptive Statistics
Calculating mean and (population) standard deviation:
awk -F',' '
{sum+=$3; sumsq+=$3*$3}
END {
mean=sum/NR;
stddev=sqrt(sumsq/NR - mean*mean);
print "Mean:", mean, "StdDev:", stddev
}
' data.csv
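The textbook sample 2, 4, 4, 4, 5, 5, 7, 9 (mean 5, population standard deviation 2) makes a convenient check; the first two columns here are filler:

```shell
# Third column holds the values; sum = 40, sum of squares = 232,
# so mean = 5 and stddev = sqrt(232/8 - 25) = 2.
printf 'h,h,2\nh,h,4\nh,h,4\nh,h,4\nh,h,5\nh,h,5\nh,h,7\nh,h,9\n' |
awk -F',' '
{sum+=$3; sumsq+=$3*$3}
END {
  mean=sum/NR;
  stddev=sqrt(sumsq/NR - mean*mean);
  print "Mean:", mean, "StdDev:", stddev
}'
# Mean: 5 StdDev: 2
```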
Finding min, max, and median (asort is a gawk extension, so this needs gawk):
awk -F',' '
{arr[NR]=$3}
END {
asort(arr);
min=arr[1];
max=arr[NR];
if (NR%2) median=arr[(NR+1)/2];
else median=(arr[NR/2]+arr[NR/2+1])/2;
print "Min:", min, "Max:", max, "Median:", median
}
' data.csv
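On systems with only POSIX awk (no asort), one portable variant pre-sorts the column with sort(1) so awk just picks values by position. A sketch on made-up rows:

```shell
# sort -t',' -k3,3n orders rows numerically by the third field,
# so awk receives the values already sorted.
printf 'a,1,7\nb,2,3\nc,3,5\nd,4,9\ne,5,1\n' |
sort -t',' -k3,3n |
awk -F',' '
{arr[NR]=$3}
END {
  min=arr[1]; max=arr[NR];
  if (NR%2) median=arr[(NR+1)/2];
  else median=(arr[NR/2]+arr[NR/2+1])/2;
  print "Min:", min, "Max:", max, "Median:", median
}'
# Min: 1 Max: 9 Median: 5
```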
4. Data Aggregation
Group by and count:
awk -F',' '{count[$2]++} END {for (group in count) print group, count[group]}' data.csv
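Because for (group in count) visits keys in an unspecified order, piping through sort makes the output deterministic; a check on three illustrative rows:

```shell
# Column 2 is the grouping key: A appears twice, B once.
printf 'r1,A,10\nr2,B,20\nr3,A,30\n' |
awk -F',' '{count[$2]++} END {for (group in count) print group, count[group]}' |
sort
# A 2
# B 1
```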
Group by and calculate mean:
awk -F',' '
{sum[$2]+=$3; count[$2]++}
END {
for (group in sum)
print group, sum[group]/count[group]
}
' data.csv
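The same caveat about key order applies here, so the demo again pipes through sort; on three made-up rows:

```shell
# Group A: (10 + 20) / 2 = 15; group B: 30 / 1 = 30.
printf 'x,A,10\nx,A,20\nx,B,30\n' |
awk -F',' '
{sum[$2]+=$3; count[$2]++}
END {
  for (group in sum)
    print group, sum[group]/count[group]
}' | sort
# A 15
# B 30
```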
5. Time Series Analysis
Calculating moving average:
awk -F',' '
{
    sum += $3
    count++
    if (count > 5) {
        sum -= values[count % 5]   # window full: drop the value leaving the window
    }
    values[count % 5] = $3         # remember this value so it can be dropped later
    if (count >= 5) {
        print $1, sum / 5          # 5-point moving average, labeled by column 1
    }
}
' data.csv
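Feeding the script the values 1 through 7 shows the window warming up: the first average appears only at the fifth row, and each later row averages the most recent five values:

```shell
# Averages: (1..5)/5 = 3, (2..6)/5 = 4, (3..7)/5 = 5.
printf 'd1,x,1\nd2,x,2\nd3,x,3\nd4,x,4\nd5,x,5\nd6,x,6\nd7,x,7\n' |
awk -F',' '
{
    sum += $3
    count++
    if (count > 5) {
        sum -= values[count % 5]
    }
    values[count % 5] = $3
    if (count >= 5) {
        print $1, sum / 5
    }
}'
# d5 3
# d6 4
# d7 5
```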
Tip:
While awk is powerful for quick data manipulations, for more complex analyses or larger datasets, consider using it in combination with other tools like R or Python.
Exercise:
Given a CSV file with columns: Date, Product, Sales, Region
Write an awk script to:
- Filter out rows where Sales is less than 100
- Calculate the total sales for each Region
- Find the top-selling product in each Region
Hint: You'll need to use arrays and multiple passes through the data.