Accuracy is the most misunderstood metric in machine learning. A model with 95% accuracy can still destroy business value, while a model with 70% accuracy might generate millions in revenue. The difference lies in choosing evaluation metrics that align with actual business objectives.
Accuracy measures the percentage of correct predictions across all cases. This sounds reasonable until you consider real-world constraints and costs.
Example: Fraud Detection
A credit card company processes 1 million transactions daily, with 0.1% being fraudulent (1,000 fraud cases, 999,000 legitimate).
A naive model that predicts “not fraud” for every transaction achieves 99.9% accuracy. Yet it catches zero fraudulent transactions, potentially costing millions in losses.
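To make this concrete, here is a minimal sketch using NumPy and scikit-learn, with a synthetic dataset assumed to match the 0.1% fraud rate above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Assumed setup: 1,000,000 transactions, ~0.1% fraudulent (label 1 = fraud).
rng = np.random.default_rng(42)
y_true = (rng.random(1_000_000) < 0.001).astype(int)

# Naive baseline: predict "not fraud" for every transaction.
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")  # ~0.999, looks excellent
print(f"Recall:   {recall_score(y_true, y_pred):.4f}")    # 0.0, catches no fraud at all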
The fundamental issue: accuracy treats all errors equally, but business impact varies dramatically across different types of mistakes.
Before exploring advanced metrics, we need to understand how model predictions break down: every prediction is a true positive (TP), a true negative (TN), a false positive (FP), or a false negative (FN).
Different business scenarios care about different types of errors.
Formula: Precision = TP / (TP + FP)
Question: “Of all positive predictions, how many were actually positive?”
When to prioritise precision: when false positives are costly, i.e. when acting on a wrong positive prediction wastes money, time, or customer goodwill.
Formula: Recall = TP / (TP + FN)
Question: “Of all actual positive cases, how many did we catch?”
When to prioritise recall: when false negatives are costly, i.e. when missing a genuine positive case (a fraud, an equipment failure, an at-risk customer) is far more expensive than a false alarm.
These metrics are typically inversely related. Increasing the prediction threshold improves precision but reduces recall, and vice versa.
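A short sketch of that trade-off, assuming predicted probabilities are available (synthetic scores here, generated purely for illustration): raising the threshold trades recall for precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic ground truth and predicted probabilities (assumed, for illustration).
y_true = rng.integers(0, 2, size=5_000)
y_score = y_true * 0.35 + rng.random(5_000) * 0.65

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

At the low threshold nearly every positive is caught but false alarms pile up; at the high threshold the predictions are almost all correct but many positives are missed.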
Example: Customer Churn Prediction
High precision means retention offers go mostly to customers who really would have churned; high recall means few at-risk customers are missed. The optimal balance depends on the cost of retention campaigns versus the lifetime value of customers.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 score provides a single metric balancing precision and recall, assuming equal importance of both.
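A minimal sketch, computing F1 from the formula and via scikit-learn on a handful of illustrative labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)   # 3/4 = 0.75
recall = recall_score(y_true, y_pred)         # 3/4 = 0.75
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))    # both 0.75
```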
Formula: Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall), where β > 1 weights recall more heavily and β < 1 favours precision.
For fraud detection, you might use F2 score to emphasise catching fraudulent transactions over minimising false alarms.
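A short sketch with scikit-learn's fbeta_score on illustrative fraud labels, where the model misses two fraud cases and raises one false alarm:

```python
from sklearn.metrics import fbeta_score, f1_score

# Illustrative labels: 1 = fraud. Two frauds are missed, one false alarm is raised.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

print("F1:", round(f1_score(y_true, y_pred), 3))
print("F2:", round(fbeta_score(y_true, y_pred, beta=2), 3))
```

Because recall is the weak spot here, the F2 score drops further than F1, which is exactly the behaviour you want when missed fraud is the expensive error.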
ROC AUC measures performance across all classification thresholds, summarising the trade-off between the true positive rate and the false positive rate.
ROC AUC works well for balanced datasets but can mislead with highly imbalanced data.
For imbalanced datasets, precision-recall curves (summarised by PR AUC) often provide better insight than ROC curves.
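A sketch contrasting the two on a heavily imbalanced synthetic dataset (roughly 1% positives and partially informative scores, both assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(7)

# ~1% positive class, with scores that only partially separate the classes.
y_true = (rng.random(50_000) < 0.01).astype(int)
y_score = np.where(y_true == 1,
                   rng.normal(1.5, 1.0, y_true.shape),
                   rng.normal(0.0, 1.0, y_true.shape))

print("ROC AUC:", round(roc_auc_score(y_true, y_score), 3))
# Average precision summarises the PR curve; with rare positives it sits
# far below the ROC AUC above, exposing the weak precision on the rare class.
print("PR AUC (average precision):",
      round(average_precision_score(y_true, y_score), 3))
```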
Example: Predictive Maintenance
Consider equipment failure prediction with 99.5% uptime: failures are so rare that ROC AUC can look impressive even while the precision-recall curve is poor. A low PR AUC reveals that the model struggles to identify actual equipment failures, which is the primary business concern.
Cost-sensitive evaluation assigns different costs to different types of errors based on their business impact.
Example: Insurance Fraud Detection
The optimal model minimises total business cost, not classification error.
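A minimal sketch of cost-weighted evaluation; the per-error costs below (a missed fraudulent claim at £5,000, a false alarm at £50) are purely illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative costs per error type (assumptions, not real figures).
COST_FALSE_NEGATIVE = 5_000   # missed fraudulent claim (full payout)
COST_FALSE_POSITIVE = 50      # investigating a legitimate claim

def business_cost(y_true, y_pred):
    """Total cost of a set of predictions under the assumed cost matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE

y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
model_a = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # fewer false alarms, misses one fraud
model_b = np.array([0, 1, 1, 0, 0, 0, 0, 1, 1, 1])  # noisier, but catches every fraud

print("Model A cost:", business_cost(y_true, model_a))  # 1 FN -> 5,000
print("Model B cost:", business_cost(y_true, model_b))  # 2 FP -> 100
```

Model A has the higher plain accuracy, yet Model B is dramatically cheaper once the asymmetric costs are applied.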
Lift: how much better the model performs than random selection, typically measured on a top-scored segment such as the first decile.
Example: Marketing Campaigns
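A sketch of top-decile lift for campaign-style data; the 5% base response rate and the score quality are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical campaign data: ~5% of customers would respond.
responded = (rng.random(n) < 0.05).astype(int)
# Model scores that are somewhat informative about who responds.
score = responded * 0.4 + rng.random(n) * 0.6

# Lift in the top decile: response rate among the top 10% of scored customers
# divided by the overall response rate.
top_decile = np.argsort(score)[::-1][: n // 10]
lift = responded[top_decile].mean() / responded.mean()
print(f"Top-decile lift: {lift:.1f}x better than random targeting")
```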
For recommendation systems, traditional accuracy is irrelevant; what matters is ranking quality, measured with metrics such as precision@k.
Example: Product Recommendations
Business cares about whether customers find relevant products, not whether the #1 recommendation is perfect.
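A minimal precision@k sketch with hypothetical product IDs: the metric simply asks how many of the top-k recommendations the customer actually found relevant.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the customer actually found relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical session: the model's ranked list vs. products the customer engaged with.
recommended = ["p17", "p03", "p42", "p08", "p11"]
relevant = {"p03", "p11", "p29"}

print(precision_at_k(recommended, relevant, k=5))  # 2 of 5 recommendations relevant -> 0.4
```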
Example: Recommendation Systems
Model A achieves 3.2% click-through rate generating £2.1M revenue. Model B achieves 3.4% click-through rate but only £2.0M revenue.
Model B has higher engagement but lower revenue, revealing it recommends cheaper products. The business decision depends on whether the goal is engagement or revenue optimisation.
Macro Average: Calculate metric for each class separately, then average. Treats all classes equally regardless of size.
Micro Average: Aggregate contributions of all classes, then calculate metric. Influenced by larger classes.
Business decision: Use macro averaging when all categories matter equally (safety classifications). Use micro averaging when larger categories are more important (revenue by product category).
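A short sketch of how the two averages diverge on an imbalanced multi-class problem (illustrative safety-style labels, where a rare "hazard" class is never predicted):

```python
from sklearn.metrics import precision_score

# Three classes; "hazard" is rare and the model never predicts it.
y_true = ["ok"] * 8 + ["minor"] * 8 + ["hazard"] * 2
y_pred = ["ok"] * 8 + ["minor"] * 7 + ["ok"] + ["minor", "ok"]

# Macro averaging punishes the missed rare class; micro barely notices it.
print("Macro precision:", round(precision_score(y_true, y_pred, average="macro", zero_division=0), 2))
print("Micro precision:", round(precision_score(y_true, y_pred, average="micro", zero_division=0), 2))
```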
For continuous prediction problems, similar business thinking applies to regression metrics:
MAE (mean absolute error): the average absolute difference between predictions and actual values. Use it when all errors have similar impact (inventory forecasting).
MSE / RMSE (mean squared error and its root): penalise larger errors more heavily. Use them when large errors are disproportionately costly (financial risk models).
MAPE (mean absolute percentage error): the average percentage error. Use it when relative error matters more than absolute error (revenue forecasting).
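A minimal sketch computing all three with NumPy on illustrative forecast values:

```python
import numpy as np

# Illustrative actual vs. forecast values (e.g., weekly demand).
actual   = np.array([100.0, 250.0, 80.0, 400.0, 150.0])
forecast = np.array([110.0, 230.0, 95.0, 340.0, 155.0])

errors = forecast - actual
mae  = np.mean(np.abs(errors))                  # every unit of error counts the same
rmse = np.sqrt(np.mean(errors ** 2))            # the single 60-unit miss dominates
mape = np.mean(np.abs(errors / actual)) * 100   # relative error, in percent

print(f"MAE:  {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
print(f"MAPE: {mape:.1f}%")
```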
Business-Adjusted Example: revenue forecasting where a 10% error on a £1M sale matters far more than a 10% error on a £1k sale. Weight errors by business impact rather than treating them equally.
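One way to express that weighting, as a sketch: weight each percentage error by the sale's value. Treating monetary value as the measure of business impact is itself an assumption.

```python
import numpy as np

# Illustrative sales: one large deal forecast 10% off, three small deals forecast 2% off.
actual   = np.array([1_000_000.0, 1_000.0, 1_000.0, 1_000.0])
forecast = np.array([  900_000.0,   980.0,   980.0,   980.0])

pct_error = np.abs(forecast - actual) / actual

plain_mape = pct_error.mean() * 100                            # treats all sales equally
value_weighted_mape = np.average(pct_error, weights=actual) * 100  # dominated by the £1M sale

print(f"Plain MAPE:          {plain_mape:.1f}%")
print(f"Value-weighted MAPE: {value_weighted_mape:.1f}%")
```

The plain average looks comfortable, while the value-weighted version surfaces the error that actually moves revenue.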
Data leakage: using information from the future or from the target variable in features. Example: predicting fraud using post-transaction account balance.
Distribution mismatch: the test set doesn't match the production distribution. Example: training on historical data but deploying during seasonal peaks.
Metric gaming: pursuing metric improvements that don't translate to business value. Example: optimising F1 score when the business cares about revenue.
Threshold neglect: using arbitrary decision thresholds (like 0.5) instead of optimising them for business outcomes.
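A sketch of choosing the threshold by business cost rather than defaulting to 0.5, on synthetic validation data and with illustrative per-error costs:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic validation data: rare positives with partially informative scores.
y_true = (rng.random(20_000) < 0.02).astype(int)
y_prob = y_true * 0.3 + rng.random(20_000) * 0.7

# Illustrative business costs (assumptions).
COST_FN, COST_FP = 2_000, 10

def total_cost(threshold):
    """Business cost of classifying at a given probability threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return fn * COST_FN + fp * COST_FP

thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=total_cost)
print(f"Default 0.5 threshold cost: £{int(total_cost(0.5)):,}")
print(f"Best threshold {best:.2f} cost: £{int(total_cost(best)):,}")
```

Sweeping the threshold on a validation set and picking the cheapest operating point is often worth far more than squeezing out another fraction of a point of AUC.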
Effective model evaluation requires understanding your business context as much as your technical metrics. The best model isn’t the one with the highest accuracy or F1 score—it’s the one that delivers the most business value within acceptable constraints.
Start with clear business objectives, choose metrics that reflect real costs and benefits, and remember that model evaluation continues through deployment and production monitoring. The goal isn’t perfect predictions but profitable decisions.