Statistics on the command line for beginning data scientists

Introduction

If you’re just starting your data science journey, you may think you need tools like Python, R, or other software to run statistical analysis on data. However, the command line is already a powerful statistical toolkit.

Command line tools can often process large datasets faster than loading them into memory-heavy applications. They are easy to script and automate, and they work on any Unix system without installing anything.

In this article, you will learn how to perform essential statistical operations directly from your terminal using only built-in Unix tools.

🔗 The full Bash script is available on GitHub. Coding along is highly recommended to fully understand the concepts.

To follow this tutorial, you will need:

  • A Unix-like environment (Linux, macOS, or Windows with WSL)
  • Only standard Unix tools, all of which come preinstalled

To get started, open your terminal.
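
Before diving in, you can optionally confirm that every tool used in this tutorial is available. This is a minimal sketch: `command -v` exits non-zero for any tool that cannot be found on your PATH.

```shell
# Optional sanity check: command -v fails for any tool that is missing.
for tool in awk cut sort uniq wc head tail; do
  command -v "$tool" >/dev/null || { echo "missing: $tool"; exit 1; }
done
echo "all tools available"
```

On any standard Unix system this should print the confirmation line and nothing else.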

Setting Up Sample Data

Before we can analyze data, we need a dataset. Create a simple CSV file representing daily website traffic by running the following command in your terminal:

cat > traffic.csv << EOF
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
2024-01-05,980,3400,51.2
2024-01-06,1100,3900,48.5
2024-01-07,1680,6100,40.1
2024-01-08,1550,5600,41.9
2024-01-09,1420,5100,44.2
2024-01-10,1290,4700,46.3
EOF

This creates a new file called traffic.csv with a header row followed by ten rows of sample data.

Exploring Your Data

Counting Rows in Your Dataset

One of the first things to identify in a dataset is the number of records it contains. The wc (word count) command with the -l flag counts the number of lines in the file:

wc -l traffic.csv

The output is: 11 traffic.csv (11 lines total, minus 1 header = 10 data rows).
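
If you want the data-row count directly, skip the header before counting. Here is a minimal sketch on an inline two-row sample; the same pipeline works on traffic.csv.

```shell
# Skip the header with tail -n +2, then count the remaining lines.
# Inline sample so the snippet runs on its own; use traffic.csv in practice.
printf 'date,visitors\n2024-01-01,1250\n2024-01-02,1180\n' | tail -n +2 | wc -l
# prints 2 (wc may left-pad the number with spaces)
```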

Viewing Your Data

Before proceeding with the calculations, it is helpful to verify the data structure. The head command displays the first few lines of the file:

head -n 5 traffic.csv

This shows the first 5 lines (the header and four data rows), so you can preview the data:

date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
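
A related quick check is the number of columns. Splitting the header on commas and counting the resulting lines gives the column count; this is a small sketch shown on an inline header, and `head -n 1 traffic.csv | tr ',' '\n' | wc -l` does the same on the real file.

```shell
# Turn each comma-separated header field into its own line, then count them.
printf 'date,visitors,page_views,bounce_rate\n' | tr ',' '\n' | wc -l
# prints 4 (wc may left-pad the number with spaces)
```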

Extracting a Single Column

To work with specific columns in a CSV file, use the cut command with a delimiter and field number. The following command extracts the visitors column:

cut -d',' -f2 traffic.csv | tail -n +2

Field 2 (the visitors column) is extracted with cut, and tail -n +2 skips the header row.
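
cut can also pull several fields at once by listing them after -f. A minimal sketch on one inline row; on the real file, `cut -d',' -f2,4 traffic.csv` keeps visitors and bounce_rate together.

```shell
# -f2,4 keeps only the second and fourth comma-separated fields.
printf 'date,visitors,page_views,bounce_rate\n2024-01-01,1250,4500,45.2\n' | cut -d',' -f2,4
# prints:
# visitors,bounce_rate
# 1250,45.2
```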

Calculating Measures of Central Tendency

Finding the Mean (Average)

The mean is the sum of all values divided by the number of values. We can calculate it by extracting the target column, then using awk to accumulate the values:

cut -d',' -f2 traffic.csv | tail -n +2 | awk '{sum+=$1; count++} END {print "Mean:", sum/count}'

The awk command accumulates the sum and count as it processes each row, then divides them in the END block.
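
The mean pipeline can be parameterized over the column number with awk -v, so one command serves every numeric column. This is a sketch on an inline sample; on the real file, set COL to 2, 3, or 4.

```shell
# c holds the field number to average; NR>1 skips the header row.
COL=2
printf 'id,score\na,10\nb,30\n' | awk -F',' -v c=$COL 'NR>1 {sum+=$c; n++} END {print "Mean:", sum/n}'
# prints Mean: 20
```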

Next, we calculate the median and mode.

Finding the Median

The median is the middle value when the dataset is sorted. For an even number of values, it is the average of the two middle values. First, sort the data, then find the middle:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '{arr[NR]=$1; count=NR} END {if(count%2==1) print "Median:", arr[(count+1)/2]; else print "Median:", (arr[count/2]+arr[count/2+1])/2}'

This sorts the data numerically with sort -n, stores the values in an array, then finds the middle value (or the average of the two middle values if the count is even).

Finding the Mode

The mode is the most frequently occurring value. We find it by sorting, counting duplicates, and identifying which value appears most often:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | sort -rn | head -n 1 | awk '{print "Mode:", $2, "(appears", $1, "times)"}'

This sorts the values, counts duplicates with uniq -c, sorts in reverse order by frequency, and selects the top result.
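
Dropping the `head -n 1` shows the full frequency table, which is often more informative than the mode alone. Here is a sketch on an inline sample with a clear repeat:

```shell
# Each output line is "count value", sorted with the most frequent first.
printf '%s\n' 3 1 3 2 3 1 | sort -n | uniq -c | sort -rn
```

In this sample, 3 appears three times, 1 twice, and 2 once, so the first line reports the mode.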

Calculating Measures of Dispersion (Spread)

Finding the Maximum Value

To find the largest value in your dataset, we compare each value and track the maximum:

awk -F',' 'NR>1 {if($2>max) max=$2} END {print "Maximum:", max}' traffic.csv

This skips the header with NR>1, compares each value to the current maximum, and updates it if a larger value is found.

Finding the Minimum Value

Similarly, to find the smallest value, initialize the minimum from the first data row and update it as smaller values are found:

awk -F',' 'NR==2 {min=$2} NR>2 {if($2<min) min=$2} END {print "Minimum:", min}' traffic.csv

This sets min from the first data row (NR==2), then updates it whenever a smaller value appears.

Finding Both Minimum and Maximum

Instead of running two separate commands, we can find both the minimum and maximum in a single pass:

awk -F',' 'NR==2 {min=$2; max=$2} NR>2 {if($2<min) min=$2; if($2>max) max=$2} END {print "Min:", min, "Max:", max}' traffic.csv

This single-pass approach initializes both variables from the first row, then updates each independently.
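
The same single pass can also report the range (max minus min), a crude but quick spread measure. A sketch on an inline sample; swap in traffic.csv, -F',' and $2 for the real data.

```shell
# Track min and max as before, then print their difference in END.
printf 'h\n10\n40\n25\n' | awk 'NR==2 {min=$1; max=$1} NR>2 {if($1<min) min=$1; if($1>max) max=$1} END {print "Range:", max-min}'
# prints Range: 30
```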

Calculating the (Population) Standard Deviation

Standard deviation measures how much the values spread around the mean. For the entire population, use this formula:

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Std Dev:", sqrt((sumsq/count)-(mean*mean))}' traffic.csv

This collects the sum and sum of squares, then applies the formula \( \sqrt{\frac{\sum x^2}{N} - \mu^2} \). For the visitors column, the result is approximately 207.36.

Calculating the Sample Standard Deviation

When working with a sample rather than the entire population, use Bessel’s correction (dividing by \( n-1 \) for an unbiased sample estimate):

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Sample Std Dev:", sqrt((sumsq-(sum*sum/count))/(count-1))}' traffic.csv

For the visitors column, this gives approximately 218.58, slightly larger than the population value because of the \( n-1 \) denominator.

Calculating the Variance

Variance is the square of the standard deviation. This is another measure of dispersion useful in many statistical calculations:

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var}' traffic.csv

This calculation mirrors the standard deviation but omits the square root.

Calculating Percentiles

Calculating Quartiles

Quartiles divide sorted data into four equal parts. They are particularly useful for understanding data distribution:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
  q1_pos = (count+1)/4
  q2_pos = (count+1)/2
  q3_pos = 3*(count+1)/4
  print "Q1 (25th percentile):", arr[int(q1_pos)]
  print "Q2 (Median):", (count%2==1 ? arr[int(q2_pos)] : (arr[count/2]+arr[count/2+1])/2)
  print "Q3 (75th percentile):", arr[int(q3_pos)]
}'

This script stores the sorted values in an array, calculates the quartile positions using the formula \( (n+1)/4 \), and extracts the values at those positions. The output:

Q1 (25th percentile): 1100
Q2 (Median): 1355
Q3 (75th percentile): 1520
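
From Q1 and Q3 you can derive the interquartile range and Tukey’s outlier fences; values outside the fences are flagged as potential outliers. The Q1 and Q3 values below are the ones printed above:

```shell
# IQR = Q3 - Q1; the classic fences sit 1.5 * IQR beyond each quartile.
Q1=1100; Q3=1520
awk -v q1=$Q1 -v q3=$Q3 'BEGIN {
  iqr = q3 - q1
  print "IQR:", iqr
  print "Lower fence:", q1 - 1.5*iqr
  print "Upper fence:", q3 + 1.5*iqr
}'
# prints IQR: 420, lower fence 470, upper fence 2150
```

Every visitor count in the sample falls inside these fences, so no day is flagged as an outlier.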

Calculating Any Percentile

You can calculate any percentile by adjusting the position calculation. The following flexible approach uses linear interpolation:

PERCENTILE=90
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk -v p=$PERCENTILE '
{arr[NR]=$1; count=NR}
END {
  pos = (count+1) * p/100
  idx = int(pos)
  frac = pos - idx
  if(idx >= count) print p "th percentile:", arr[count]
  else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
}'

This calculates the position as \( (n+1) \times p/100 \), then uses linear interpolation between array indices for fractional positions.
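
Wrapping the pipeline in a shell loop prints several percentiles at once. This sketch runs on an already-sorted inline 1..10 sample generated with seq; substitute the cut/tail/sort pipeline for real data.

```shell
# Reuse the interpolation logic for each requested percentile.
for p in 25 50 75 90; do
  seq 1 10 | awk -v p=$p '
  {arr[NR]=$1; count=NR}
  END {
    pos = (count+1) * p/100; idx = int(pos); frac = pos - idx
    if (idx >= count) print p "th percentile:", arr[count]
    else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
  }'
done
```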

Working with Multiple Columns

Often, you will want to calculate statistics for multiple columns at once. Here’s how to calculate the average of visitors, page views, and bounce rate together:

awk -F',' '
NR>1 {
  v_sum += $2
  pv_sum += $3
  br_sum += $4
  count++
}
END {
  print "Average visitors:", v_sum/count
  print "Average page views:", pv_sum/count
  print "Average bounce rate:", br_sum/count
}' traffic.csv

This keeps a separate accumulator for each column and applies the same calculation to all three, giving the following output:

Average visitors: 1340
Average page views: 4850
Average bounce rate: 45.06

Calculating Correlation

Correlation measures the relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation):

awk -F', *' '
NR>1 {
  x[NR-1] = $2
  y[NR-1] = $3

  sum_x += $2
  sum_y += $3

  count++
}
END {
  if (count < 2) exit

  mean_x = sum_x / count
  mean_y = sum_y / count

  for (i = 1; i <= count; i++) {
    dx = x[i] - mean_x
    dy = y[i] - mean_y

    cov   += dx * dy
    var_x += dx * dx
    var_y += dy * dy
  }

  sd_x = sqrt(var_x / count)
  sd_y = sqrt(var_y / count)

  correlation = (cov / count) / (sd_x * sd_y)

  print "Correlation:", correlation
}' traffic.csv

It calculates the Pearson correlation by dividing the covariance by the product of the standard deviations.
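
To sanity-check the formula, here is the same computation on a tiny inline sample where y falls in lockstep as x rises, so the coefficient comes out at exactly -1. Swapping $3 for $4 in the full command above would instead correlate visitors with bounce_rate.

```shell
# x rises 1,2,3 while y falls 6,4,2: a perfect negative linear relationship.
printf 'h,x,y\nr,1,6\nr,2,4\nr,3,2\n' | awk -F',' '
NR>1 {x[NR-1]=$2; y[NR-1]=$3; sx+=$2; sy+=$3; n++}
END {
  mx = sx/n; my = sy/n
  for (i=1; i<=n; i++) {
    dx = x[i]-mx; dy = y[i]-my
    cov += dx*dy; vx += dx*dx; vy += dy*dy
  }
  print "Correlation:", (cov/n) / (sqrt(vx/n)*sqrt(vy/n))
}'
# prints Correlation: -1
```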

Conclusion

The command line is a powerful tool for statistical analysis. You can process large amounts of data, calculate complex statistics, and automate reports, all without installing anything beyond what’s already on your system.

These skills complement, rather than replace, your Python and R knowledge. Use command-line tools for quick exploration and data validation, then move on to specialized tools for complex modeling and visualization when needed.

The best part is that these tools are available on almost every system you will use in your data science career. Open your terminal and start exploring your data.

Bala Priya C is a developer and technical writer from India. She likes to work at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She loves reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
