Introduction to awk in Data Science
While awk might not be the first tool that comes to mind for data science, it can be incredibly powerful for quick data manipulation, exploration, and preprocessing. Its ability to work with structured text data makes it particularly useful for handling CSV files, log files, and other common data formats.
1. Data Cleaning and Preprocessing
Removing duplicate lines:
awk '!seen[$0]++' data.csv
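The deduplication idiom is easy to verify on inline data; the rows below are made up for illustration:

```shell
# !seen[$0]++ is true only the first time a line appears:
# the array entry starts at 0 (false), then is incremented.
printf 'a,1\nb,2\na,1\nc,3\n' | awk '!seen[$0]++'
# a,1
# b,2
# c,3
```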
Filtering rows based on conditions:
awk -F',' '$3 > 1000 {print $0}' data.csv
Handling missing values:
awk -F',' '{for(i=1;i<=NF;i++) if($i=="") $i="NA"} 1' OFS=',' data.csv
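A quick check of the missing-value filler on two illustrative rows (assigning to any field makes awk rebuild the record with OFS, which is why OFS=',' matters here):

```shell
# The empty second field in the first row becomes "NA".
printf 'x,,z\n1,2,3\n' |
awk -F',' '{for(i=1;i<=NF;i++) if($i=="") $i="NA"} 1' OFS=','
# x,NA,z
# 1,2,3
```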
2. Data Transformation
Scaling numerical values:
awk -F',' '{$3 = $3 / 100; print}' OFS=',' data.csv
Encoding categorical variables:
awk -F',' '
BEGIN {split("low,medium,high", levels, ","); for(i in levels) category_map[levels[i]] = i}
{$4 = category_map[$4]; print}
' OFS=',' data.csv
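Run against two made-up rows, the mapping replaces the fourth column with its index in the BEGIN list (a category absent from that list would map to an empty string):

```shell
# low -> 1, medium -> 2, high -> 3, per the split() order in BEGIN.
printf 'a,b,c,low\nd,e,f,high\n' |
awk -F',' '
BEGIN {split("low,medium,high", levels, ","); for(i in levels) category_map[levels[i]] = i}
{$4 = category_map[$4]; print}
' OFS=','
# a,b,c,1
# d,e,f,3
```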
3. Descriptive Statistics
Calculating mean and (population) standard deviation:
awk -F',' '
{sum+=$3; sumsq+=$3*$3}
END {
mean=sum/NR;
stddev=sqrt(sumsq/NR - mean*mean);
print "Mean:", mean, "StdDev:", stddev
}
' data.csv
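The textbook sample 2, 4, 4, 4, 5, 5, 7, 9 (mean 5, population standard deviation 2) makes a convenient check; the first two columns here are filler:

```shell
# Third column holds the values; sum = 40, sum of squares = 232,
# so mean = 5 and stddev = sqrt(232/8 - 25) = 2.
printf 'h,h,2\nh,h,4\nh,h,4\nh,h,4\nh,h,5\nh,h,5\nh,h,7\nh,h,9\n' |
awk -F',' '
{sum+=$3; sumsq+=$3*$3}
END {
  mean=sum/NR;
  stddev=sqrt(sumsq/NR - mean*mean);
  print "Mean:", mean, "StdDev:", stddev
}'
# Mean: 5 StdDev: 2
```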
Finding min, max, and median (asort is a gawk extension, so this needs gawk):
awk -F',' '
{arr[NR]=$3}
END {
asort(arr);
min=arr[1];
max=arr[NR];
if (NR%2) median=arr[(NR+1)/2];
else median=(arr[NR/2]+arr[NR/2+1])/2;
print "Min:", min, "Max:", max, "Median:", median
}
' data.csv
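On systems with only POSIX awk (no asort), one portable variant pre-sorts the column with sort(1) so awk just picks values by position. A sketch on made-up rows:

```shell
# sort -t',' -k3,3n orders rows numerically by the third field,
# so awk receives the values already sorted.
printf 'a,1,7\nb,2,3\nc,3,5\nd,4,9\ne,5,1\n' |
sort -t',' -k3,3n |
awk -F',' '
{arr[NR]=$3}
END {
  min=arr[1]; max=arr[NR];
  if (NR%2) median=arr[(NR+1)/2];
  else median=(arr[NR/2]+arr[NR/2+1])/2;
  print "Min:", min, "Max:", max, "Median:", median
}'
# Min: 1 Max: 9 Median: 5
```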
4. Data Aggregation
Group by and count:
awk -F',' '{count[$2]++} END {for (group in count) print group, count[group]}' data.csv
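Because for (group in count) visits keys in an unspecified order, piping through sort makes the output deterministic; a check on three illustrative rows:

```shell
# Column 2 is the grouping key: A appears twice, B once.
printf 'r1,A,10\nr2,B,20\nr3,A,30\n' |
awk -F',' '{count[$2]++} END {for (group in count) print group, count[group]}' |
sort
# A 2
# B 1
```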
Group by and calculate mean:
awk -F',' '
{sum[$2]+=$3; count[$2]++}
END {
for (group in sum)
print group, sum[group]/count[group]
}
' data.csv
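The same caveat about key order applies here, so the demo again pipes through sort; on three made-up rows:

```shell
# Group A: (10 + 20) / 2 = 15; group B: 30 / 1 = 30.
printf 'x,A,10\nx,A,20\nx,B,30\n' |
awk -F',' '
{sum[$2]+=$3; count[$2]++}
END {
  for (group in sum)
    print group, sum[group]/count[group]
}' | sort
# A 15
# B 30
```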
5. Time Series Analysis
Calculating moving average:
awk -F',' '
{
    sum += $3
    count++
    if (count > 5) {
        sum -= values[count % 5]   # window full: drop the value leaving the window
    }
    values[count % 5] = $3         # remember this value so it can be dropped later
    if (count >= 5) {
        print $1, sum / 5          # 5-point moving average, labeled by column 1
    }
}
' data.csv
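Feeding the script the values 1 through 7 shows the window warming up: the first average appears only at the fifth row, and each later row averages the most recent five values:

```shell
# Averages: (1..5)/5 = 3, (2..6)/5 = 4, (3..7)/5 = 5.
printf 'd1,x,1\nd2,x,2\nd3,x,3\nd4,x,4\nd5,x,5\nd6,x,6\nd7,x,7\n' |
awk -F',' '
{
    sum += $3
    count++
    if (count > 5) {
        sum -= values[count % 5]
    }
    values[count % 5] = $3
    if (count >= 5) {
        print $1, sum / 5
    }
}'
# d5 3
# d6 4
# d7 5
```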
Tip:
While awk is powerful for quick data manipulations, for more complex analyses or larger datasets, consider using it in combination with other tools like R or Python.
Exercise:
Given a CSV file with columns: Date, Product, Sales, Region
Write an awk script to:
- Filter out rows where Sales is less than 100
- Calculate the total sales for each Region
- Find the top-selling product in each Region
Hint: You'll need to use arrays and multiple passes through the data.