Bay Information Systems

Metrics for Business Goals

Accuracy is the most misunderstood metric in machine learning. A model with 95% accuracy can still destroy business value, while a model with 70% accuracy might generate millions in revenue. The difference lies in choosing evaluation metrics that align with actual business objectives.

Why Accuracy Misleads

Accuracy measures the percentage of correct predictions across all cases. This sounds reasonable until you consider real-world constraints and costs.

Example: Fraud Detection. A credit card company processes 1 million transactions daily, with 0.1% being fraudulent (1,000 fraud cases, 999,000 legitimate).

A naive model that predicts “not fraud” for every transaction achieves 99.9% accuracy. Yet it catches zero fraudulent transactions, potentially costing millions in losses.
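
A minimal sketch of that baseline, with synthetic labels standing in for the real transaction stream:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic stand-in for one day of transactions: roughly 1 in 1,000 is fraud.
rng = np.random.default_rng(0)
y_true = (rng.random(1_000_000) < 0.001).astype(int)

# Naive baseline: predict "not fraud" for every transaction.
y_naive = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_naive):.4f}")            # ~0.999
print(f"Fraud caught (recall): {recall_score(y_true, y_naive):.4f}") # 0.0
```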

The fundamental issue: accuracy treats all errors equally, but business impact varies dramatically across different types of mistakes.

Understanding the Confusion Matrix

Before exploring advanced metrics, we need to understand how model predictions break down. Every prediction from a binary classifier lands in one of four cells of the confusion matrix: true positives (TP, positives correctly flagged), false positives (FP, negatives incorrectly flagged as positive), false negatives (FN, positives the model missed), and true negatives (TN, negatives correctly left alone).

Different business scenarios care about different types of errors.
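
As a quick sketch, scikit-learn's confusion_matrix returns all four counts directly (the labels below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for a small batch of cases.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0])

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```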

Precision vs Recall: The Core Trade-off

Precision

Formula: TP / (TP + FP)
Question: “Of all positive predictions, how many were actually positive?”

When to prioritise precision: when acting on a false positive is expensive or damaging, for example blocking a legitimate customer’s card or spending a retention offer on someone who was never going to leave.

Recall (Sensitivity)

Formula: TP / (TP + FN)
Question: “Of all actual positive cases, how many did we catch?”

When to prioritise recall: when missing a positive case is the expensive outcome, for example letting fraud go undetected or failing to flag equipment that is about to fail.

The Trade-off

These metrics are typically inversely related. Increasing the prediction threshold improves precision but reduces recall, and vice versa.

Example: Customer Churn Prediction. A low threshold flags many customers and catches most churners but wastes retention offers on people who would have stayed; a high threshold spends the budget carefully but lets genuine churners slip away unnoticed.

The optimal balance depends on the cost of retention campaigns versus the lifetime value of customers.
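
The sketch below illustrates the trade-off on synthetic churn scores (the 20% churn rate and the score distribution are assumptions for demonstration): raising the threshold pushes precision up and recall down.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic churn scores: churners (label 1) tend to receive higher scores.
rng = np.random.default_rng(42)
y_true = rng.binomial(1, 0.2, size=5_000)                    # assumed ~20% churn rate
scores = np.clip(rng.normal(0.3 + 0.4 * y_true, 0.2), 0, 1)

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```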

F1 Score and Variants

F1 Score

Formula: 2 × (Precision × Recall) / (Precision + Recall)

F1 score provides a single metric balancing precision and recall, assuming equal importance of both.

F-Beta Score

Formula: (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

The β parameter sets the balance: β > 1 weights recall more heavily, β < 1 favours precision. For fraud detection, you might use the F2 score to emphasise catching fraudulent transactions over minimising false alarms.
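
A quick illustration with scikit-learn’s f1_score and fbeta_score on a small set of made-up fraud labels:

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical fraud predictions: 1 = fraud, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # catches 2 of 4 frauds, raises 1 false alarm

print(f"F1:   {f1_score(y_true, y_pred):.2f}")                # equal weight on precision and recall
print(f"F2:   {fbeta_score(y_true, y_pred, beta=2):.2f}")     # weights recall more heavily
print(f"F0.5: {fbeta_score(y_true, y_pred, beta=0.5):.2f}")   # weights precision more heavily
```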

ROC and Precision-Recall Analysis

ROC AUC (Area Under Curve)

Measures performance across all classification thresholds: a score of 0.5 means the model ranks positives no better than chance, while 1.0 means it ranks every positive case above every negative one.

ROC AUC works well for balanced datasets but can mislead with highly imbalanced data.

Precision-Recall Curves

For imbalanced datasets, precision-recall curves often provide better insight than ROC curves.

Example: Predictive Maintenance. Equipment failure prediction with 99.5% uptime means genuine failures are vanishingly rare, so a model can report a reassuringly high ROC AUC while its PR AUC stays low.

The PR AUC reveals the model struggles to identify actual equipment failures, which is the primary business concern.
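
A minimal sketch of the contrast on synthetic, heavily imbalanced data (here a 0.5% failure rate and a deliberately mediocre scorer); average_precision_score is used as the PR AUC summary:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic, heavily imbalanced data: ~0.5% of records are failures.
rng = np.random.default_rng(7)
y_true = (rng.random(200_000) < 0.005).astype(int)

# A mediocre scorer: failures score only slightly higher than healthy records.
scores = rng.normal(0.8 * y_true, 1.0)

print(f"ROC AUC: {roc_auc_score(y_true, scores):.3f}")            # can look respectable
print(f"PR AUC:  {average_precision_score(y_true, scores):.3f}")  # exposes the struggle
```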

Business-Specific Metrics

Cost-Sensitive Evaluation

Assign different costs to different types of errors based on business impact.

Example: Insurance Fraud Detection. A missed fraudulent claim (false negative) is paid out in full, while a wrongly flagged legitimate claim (false positive) costs only investigation time, so the two errors deserve very different weights.

The optimal model minimises total business cost, not classification error.
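
A sketch of cost-sensitive comparison; the cost figures are placeholder assumptions rather than real claim economics, and the two “models” are hard-coded predictions for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder business costs (assumptions for illustration only).
COST_FN = 8_000   # paying out a fraudulent claim that slipped through
COST_FP = 150     # investigator time spent on a legitimate claim
COST_TP = 150     # investigation cost of a correctly flagged claim
COST_TN = 0       # legitimate claim paid as normal

def total_business_cost(y_true, y_pred):
    """Sum the cost of every outcome rather than counting errors equally."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tn * COST_TN + fp * COST_FP + fn * COST_FN + tp * COST_TP

# Two hypothetical models evaluated on the same claims (1 = fraud).
y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
model_a = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # cautious: few flags, misses fraud
model_b = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # aggressive: catches fraud, more reviews

print(f"Model A cost: £{total_business_cost(y_true, model_a):,}")
print(f"Model B cost: £{total_business_cost(y_true, model_b):,}")
```

Both models make the same number of classification errors on this sample, yet their business costs differ sharply.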

Lift and Gain

Lift: How much better the model performs compared to random selection.

Gain: The cumulative share of all positive cases captured by targeting the top-ranked portion of the population.

Example: Marketing Campaigns. If the 10% of customers the model scores highest respond at three times the overall response rate, the model has a lift of 3 in that decile, and the same campaign budget reaches far more likely buyers.
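
A minimal lift calculation on synthetic campaign data (the 5% response rate and the scoring behaviour are assumptions):

```python
import numpy as np

def lift_at_k(y_true, scores, top_fraction=0.1):
    """Response rate in the top-scored fraction divided by the overall response rate."""
    cutoff = int(len(scores) * top_fraction)
    top_idx = np.argsort(scores)[::-1][:cutoff]
    return y_true[top_idx].mean() / y_true.mean()

# Synthetic campaign: ~5% of customers respond, responders score higher on average.
rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.05, size=50_000)
scores = rng.normal(0.5 * y_true, 1.0)

print(f"Lift in top decile: {lift_at_k(y_true, scores, 0.1):.1f}x")
```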

Top-K Accuracy

For recommendation systems, traditional accuracy is irrelevant. Top-K accuracy instead asks whether a relevant item appears anywhere in the model’s top K suggestions.

Example: Product Recommendations. The business cares about whether customers find relevant products somewhere in the list they are shown, not whether the #1 recommendation is perfect.
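
A small sketch of a top-K hit rate; hit_rate_at_k is a hypothetical helper, and the recommendation lists and purchases are made up:

```python
def hit_rate_at_k(recommended, purchased, k=5):
    """Fraction of customers whose purchased item appears in their top-k recommendations."""
    hits = sum(1 for recs, item in zip(recommended, purchased) if item in recs[:k])
    return hits / len(purchased)

# Hypothetical data: each customer's ranked recommendations and the product they bought.
recommended = [
    ["boots", "socks", "tent", "stove", "map"],
    ["phone", "case", "charger", "cable", "stand"],
    ["novel", "lamp", "mug", "desk", "chair"],
]
purchased = ["tent", "headphones", "mug"]

print(f"Hit rate @5: {hit_rate_at_k(recommended, purchased, k=5):.2f}")  # 2 of 3 customers
```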

Revenue vs Engagement Trade-offs

Example: Recommendation Systems. Model A achieves a 3.2% click-through rate generating £2.1M revenue. Model B achieves a 3.4% click-through rate but only £2.0M revenue.

Model B has higher engagement but lower revenue, revealing it recommends cheaper products. The business decision depends on whether the goal is engagement or revenue optimisation.

Multi-Class Classification

Macro vs Micro Averaging

Macro Average: Calculate metric for each class separately, then average. Treats all classes equally regardless of size.

Micro Average: Aggregate contributions of all classes, then calculate metric. Influenced by larger classes.

Business decision: Use macro averaging when all categories matter equally (safety classifications). Use micro averaging when larger categories are more important (revenue by product category).
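
The effect is easy to see with scikit-learn’s f1_score averaging options on a made-up three-class problem where one class is rare:

```python
from sklearn.metrics import f1_score

# Hypothetical three-class problem where class "c" is rare and always missed.
y_true = ["a"] * 8 + ["b"] * 8 + ["c"] * 2
y_pred = ["a"] * 8 + ["b"] * 8 + ["a"] * 2

print(f"Micro F1: {f1_score(y_true, y_pred, average='micro'):.2f}")  # dominated by the large classes
print(f"Macro F1: {f1_score(y_true, y_pred, average='macro', zero_division=0):.2f}")  # dragged down by the missed class
```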

Regression Metrics for Business

For continuous prediction problems, similar business thinking applies to regression metrics:

Mean Absolute Error (MAE)

Average absolute difference between predictions and actual values. Use when all errors have similar impact (inventory forecasting).

Root Mean Squared Error (RMSE)

Penalises larger errors more heavily. Use when large errors are disproportionately costly (financial risk models).

Mean Absolute Percentage Error (MAPE)

Average percentage error. Use when relative error matters more than absolute error (revenue forecasting).

Business-Adjusted Example: Revenue forecasting where 10% error on £1M sale matters more than 10% error on £1k sale. Weight errors by business impact rather than treating them equally.
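
A sketch of all three metrics plus a revenue-weighted variant of MAE; the sale values and forecasts are invented for illustration:

```python
import numpy as np

actual    = np.array([1_000_000, 50_000, 1_000, 250_000])  # hypothetical sale values (£)
predicted = np.array([  900_000, 55_000, 1_200, 260_000])

errors = predicted - actual
mae  = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
mape = np.mean(np.abs(errors) / actual) * 100

# Business-adjusted: weight each error by the revenue at stake, so a 10% miss
# on a £1M sale counts far more than a 10% miss on a £1k sale.
weighted_error = np.sum(np.abs(errors) * actual) / np.sum(actual)

print(f"MAE: £{mae:,.0f}  RMSE: £{rmse:,.0f}  MAPE: {mape:.1f}%")
print(f"Revenue-weighted error: £{weighted_error:,.0f}")
```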

Common Evaluation Pitfalls

Data Leakage in Evaluation

Using information from the future, or information derived from the target variable, as features. Example: predicting fraud using the post-transaction account balance.

Evaluation on Non-Representative Data

The test set doesn’t match the production distribution. Example: evaluating on year-round historical data while deploying during seasonal peaks.

Optimising Metrics vs Business Goals

Pursuing metric improvements that don’t translate to business value. Example: optimising F1 score when business cares about revenue.

Threshold Selection Without Business Context

Using arbitrary thresholds (like 0.5) instead of optimising for business outcomes.
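
A minimal sketch of choosing a threshold by expected profit rather than defaulting to 0.5; the value and cost figures, the 10% positive rate, and the score distribution are all assumptions:

```python
import numpy as np

def expected_profit(y_true, scores, threshold, value_tp, cost_fp):
    """Net value of acting on every prediction above the threshold."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    return tp * value_tp - fp * cost_fp

# Synthetic scores; the per-case £ values below are placeholder assumptions.
rng = np.random.default_rng(3)
y_true = rng.binomial(1, 0.1, size=20_000)
scores = np.clip(rng.normal(0.3 + 0.4 * y_true, 0.2), 0, 1)

thresholds = np.linspace(0.05, 0.95, 19)
profits = [expected_profit(y_true, scores, t, value_tp=120, cost_fp=15) for t in thresholds]
best = thresholds[int(np.argmax(profits))]
print(f"Profit-maximising threshold: {best:.2f} (not necessarily 0.5)")
```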

Conclusion

Effective model evaluation requires understanding your business context as much as your technical metrics. The best model isn’t the one with the highest accuracy or F1 score—it’s the one that delivers the most business value within acceptable constraints.

Start with clear business objectives, choose metrics that reflect real costs and benefits, and remember that model evaluation continues through deployment and production monitoring. The goal isn’t perfect predictions but profitable decisions.