Advanced awk Exercises

Exercise 1: Custom CSV Parser

Create an awk script that can parse a CSV file, correctly handling fields that contain commas within quoted strings.

Sample Input:

Name,Age,"Address, City",Country
John Doe,30,"123 Main St, Anytown",USA
Jane Smith,28,"456 Elm St, Somewhere",Canada

Solution:

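awk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
{
    for (i = 1; i <= NF; i++) {
        gsub(/^"|"$/, "", $i)    # Remove surrounding quotes
        printf "%s%s", $i, (i == NF ? "\n" : "|")
    }
}' input.csv

This script uses FPAT, a gawk extension, to describe what a field looks like rather than what separates fields: either a run of non-comma characters or a double-quoted string. Quoted fields that contain commas are therefore kept whole. The script then strips the surrounding quotes and prints the fields separated by '|'. Because the pattern requires at least one character per field, empty fields (as in a,,b) are skipped rather than preserved.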
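
FPAT is not available in every awk implementation. As a rough sketch of how the same splitting could be done without it, the helper below walks each line one character at a time and tracks whether it is inside quotes. The function name csv_to_pipe is made up for this illustration, and the sketch assumes no doubled quotes ("") and no line breaks inside fields.

```shell
# Hypothetical helper: reads CSV on stdin, writes the fields separated by "|".
csv_to_pipe() {
    awk '
    {
        out = ""; field = ""; inq = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\"") {
                inq = !inq              # toggle quoted state and drop the quote
            } else if (c == "," && !inq) {
                out = out field "|"     # an unquoted comma ends the field
                field = ""
            } else {
                field = field c         # any other character belongs to the field
            }
        }
        print out field                 # emit the last field of the line
    }'
}

# Usage with one line of the sample input:
printf '%s\n' 'John Doe,30,"123 Main St, Anytown",USA' | csv_to_pipe
```

Unlike the FPAT pattern, this approach keeps empty fields, at the cost of a slower per-character loop.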

Exercise 2: Advanced Log Analysis

Analyze a web server log to generate a report of unique visitors per hour, sorted so that the hour with the most visitors comes first.

Sample Input (simplified log format):

2023-05-01 08:30:45 192.168.1.1
2023-05-01 08:45:30 192.168.1.2
2023-05-01 09:15:20 192.168.1.1
2023-05-01 09:30:10 192.168.1.3

Solution:

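awk '{
    split($2, time, ":")
    hour = time[1]
    ip = $3
    visits[hour][ip]++
}
END {
    for (h in visits) {
        unique = length(visits[h])
        print h, unique
    }
}' log_file.txt | sort -k2 -nr

This script extracts the hour and IP address from each log entry and records each IP under its hour in an array of arrays, which requires gawk 4.0 or later. In the END block, length(visits[h]) gives the number of distinct IPs seen during hour h. The sort command then orders the output by visitor count, highest first. Entries are grouped by hour of day only, so the same hour on different dates is counted together.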
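
For awk implementations without arrays of arrays, one possible alternative is a single array indexed by the (hour, IP) pair plus a per-hour counter. The sketch below assumes the same three-column log layout as the sample; the function name unique_per_hour and the extra 08:50:00 line are invented for the illustration.

```shell
# Hypothetical helper: reads "date time ip" lines on stdin,
# prints "hour unique_count", busiest hour first.
unique_per_hour() {
    awk '
    {
        hour = substr($2, 1, 2)
        if (!((hour, $3) in seen)) {    # first time this IP shows up in this hour
            seen[hour, $3] = 1
            unique[hour]++
        }
    }
    END {
        for (h in unique) print h, unique[h]
    }' | sort -k2,2nr -k1,1
}

# Usage: the sample log plus one repeated visit (added here for illustration).
printf '%s\n' \
    '2023-05-01 08:30:45 192.168.1.1' \
    '2023-05-01 08:45:30 192.168.1.2' \
    '2023-05-01 08:50:00 192.168.1.1' \
    '2023-05-01 09:15:20 192.168.1.1' \
    '2023-05-01 09:30:10 192.168.1.3' | unique_per_hour
```

The second sort key (-k1,1) only breaks ties between hours with equal counts, so the output order is stable from run to run.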

Exercise 3: Data Normalization

Normalize a dataset by calculating the z-score for each value in a column.

Sample Input:

Name Score
Alice 85
Bob 92
Charlie 78
David 88

Solution:

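awk 'NR == 1 { print $0, "Z-Score"; next }
{
    sum += $2
    sqsum += $2 ^ 2
    data[NR] = $0
}
END {
    n = NR - 1
    mean = sum / n
    stddev = sqrt((sqsum - n * (mean ^ 2)) / (n - 1))    # sample standard deviation
    for (i = 2; i <= NR; i++) {
        split(data[i], fields)
        zscore = (fields[2] - mean) / stddev
        printf "%s %.2f\n", data[i], zscore
    }
}' input.txt

This script accumulates the sum and the sum of squares of the scores in a single pass, storing each line for later. In the END block it computes the mean and the sample standard deviation (dividing by n - 1), then appends the z-score, (score - mean) / stddev, to each entry. It needs at least two data rows with differing scores; otherwise the standard deviation is zero or undefined and the division fails.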
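
When the input is too large to hold in an array, a two-pass variant is one option: name the file twice on the command line and use the NR == FNR idiom to tell the passes apart. This is only a sketch under the same layout assumptions as the sample (one header line, then name and score columns); the function name zscores is hypothetical, and it requires a regular file rather than a pipe.

```shell
# Hypothetical helper: zscores FILE
zscores() {
    awk '
    FNR == 1 { if (NR != FNR) print $0, "Z-Score"; next }   # header: skipped in pass 1, printed in pass 2
    NR == FNR { sum += $2; sqsum += $2 * $2; n++; next }    # pass 1: accumulate totals
    !ready {                                                # first data row of pass 2
        mean = sum / n
        stddev = sqrt((sqsum - n * mean * mean) / (n - 1))  # sample standard deviation
        ready = 1
    }
    { printf "%s %.2f\n", $0, ($2 - mean) / stddev }        # pass 2: append the z-score
    ' "$1" "$1"
}

# Usage with the sample input written to a temporary file:
tmp=$(mktemp)
printf '%s\n' 'Name Score' 'Alice 85' 'Bob 92' 'Charlie 78' 'David 88' > "$tmp"
zscores "$tmp"
rm -f "$tmp"
```

For the sample data the mean is 85.75 and the sample standard deviation is about 5.91, giving z-scores of -0.13, 1.06, -1.31 and 0.38.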

Exercise 4: Advanced Text Processing

Create an awk script that can find and highlight the longest common substring between two lines of text.

Sample Input:

The quick brown fox jumps over the lazy dog
A quick brown dog jumps over the lazy fox

Solution:

awk '
# Return the longest common substring of X and Y.
# Names after the extra spaces in the parameter list are local variables.
function lcs(X, Y,    m, n, L, i, j, best, endpos) {
    m = length(X)
    n = length(Y)
    best = 0
    endpos = 0
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++)
            if (substr(X, i, 1) == substr(Y, j, 1)) {
                L[i,j] = L[i-1,j-1] + 1    # extend the match ending at i-1, j-1
                if (L[i,j] > best) {
                    best = L[i,j]
                    endpos = i
                }
            }
    return substr(X, endpos - best + 1, best)
}
NR == 1 { line1 = $0 }
NR == 2 {
    line2 = $0
    common = lcs(line1, line2)
    print line1
    print line2
    print "Longest common substring: " common
}' input.txt

This script implements the longest common substring algorithm using dynamic programming: L[i,j] holds the length of the common run of characters ending at position i of the first line and position j of the second, and the longest such run is returned. Unlike the longest common subsequence, the matching characters must be contiguous. For the sample input the result is " jumps over the lazy " (including the surrounding spaces).
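
The exercise also asks for the common substring to be highlighted. One simple way to do that, sketched below, is to wrap its first occurrence in square brackets in each line. The names highlight_common, lcsubstr and mark are invented for this illustration, and the bracket style is an arbitrary choice.

```shell
# Hypothetical helper: reads two lines on stdin and prints each one
# with the longest common substring wrapped in [ ].
highlight_common() {
    awk '
    # Longest run of characters shared by X and Y (contiguous match).
    function lcsubstr(X, Y,    m, n, i, j, L, best, endpos) {
        m = length(X); n = length(Y); best = 0; endpos = 0
        for (i = 1; i <= m; i++)
            for (j = 1; j <= n; j++)
                if (substr(X, i, 1) == substr(Y, j, 1)) {
                    L[i, j] = L[i-1, j-1] + 1
                    if (L[i, j] > best) { best = L[i, j]; endpos = i }
                }
        return substr(X, endpos - best + 1, best)
    }
    # Wrap the first occurrence of t inside s in brackets.
    function mark(s, t,    p) {
        if (t == "") return s
        p = index(s, t)
        if (p == 0) return s
        return substr(s, 1, p - 1) "[" t "]" substr(s, p + length(t))
    }
    NR == 1 { first = $0 }
    NR == 2 {
        common = lcsubstr(first, $0)
        print mark(first, common)
        print mark($0, common)
    }'
}

# Usage with the sample input:
printf '%s\n' \
    'The quick brown fox jumps over the lazy dog' \
    'A quick brown dog jumps over the lazy fox' | highlight_common
```

The table L uses memory proportional to the product of the two line lengths, which is fine for single lines of text but would need rethinking for very long inputs.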

Further Learning

To continue improving your awk skills, consider exploring:

awk for Data Science
