International College of Digital Innovation, CMU
September 18, 2025
Model performance can be evaluated using several metrics, such as MSE, RMSE, MAE, MAPE, and \(R^2\).
Each of these metrics has a different meaning and use case, as explained below:
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i- \hat{y}_i)^2 \]
Measures the average squared difference between the actual values (\(y_i\)) and the predicted values (\(\hat{y}_i\)).
A lower MSE indicates that the model’s predictions are closer to the actual values.
A drawback is that the unit of MSE is the square of the original unit of the data, which makes interpretation more difficult.
\[ RMSE = \sqrt{\text{MSE}} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i- \hat{y}_i)^2} \]
The square root of MSE, which gives it the same unit as the variable being predicted.
Provides a more interpretable measure of the prediction error.
A lower RMSE indicates that the model has lower prediction error.
\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i- \hat{y}_i| \]
Calculates the average of the absolute differences between actual and predicted values.
Less sensitive to outliers compared to MSE or RMSE.
Easy to interpret because it is in the same unit as the target variable.
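As a quick check of the three formulas above, the following is a minimal NumPy sketch; the `y_true` and `y_pred` arrays are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical actual and predicted values (illustration only)
y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0, 11.5])

errors = y_true - y_pred
mse = np.mean(errors ** 2)        # squared units of y
rmse = np.sqrt(mse)               # same unit as y
mae = np.mean(np.abs(errors))     # same unit as y, less sensitive to outliers

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```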
\[ MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i- \hat{y}_i}{y_i} \right| \]
Measures prediction error in percentage terms,
which allows comparison across datasets with different units.
A drawback is that if the actual value (\(y_i\)) is zero,
MAPE can become extremely large or undefined.
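One common way to guard against the zero-actual problem is to mask out those observations before averaging. The helper below is a hypothetical sketch, not a standard library function.

```python
import numpy as np

def mape(y_true, y_pred, eps=1e-8):
    """MAPE in %, excluding observations where y_true is (near) zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = np.abs(y_true) > eps   # MAPE is undefined where y_true == 0
    return 100.0 * np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero]))

print(mape([100, 200, 0, 400], [110, 190, 5, 380]))  # ~6.67; the zero actual is excluded
```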
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i- \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i- \bar{y})^2} \]
A metric that indicates the proportion of variance in the dependent variable (\(y\))
that can be explained by the model.
\(R^2\) typically ranges from 0 to 1.
\(R^2 = 1\) means the model explains all the variability in the data.
\(R^2 = 0\) means the model explains none of the variability.
A negative \(R^2\) indicates that the model performs worse than simply predicting the mean.
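A direct implementation of the formula also shows how \(R^2\) can become negative: whenever the residual sum of squares exceeds the total sum of squares, the model predicts worse than the mean. The function name and data below are hypothetical.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.5, 10.0, 12.0]
print(r2_score_manual(y_true, [2.5, 5.5, 7.0, 11.0, 11.5]))  # ~0.96: close to 1, good fit
print(r2_score_manual(y_true, [12.0, 3.0, 10.0, 5.0, 7.5]))  # negative: worse than predicting the mean
```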
| Metric | Unit | Lower Value Means | Weakness |
|---|---|---|---|
| MSE | Squared units | Lower prediction error | Hard to interpret because the unit is squared |
| RMSE | Same as data | Lower prediction error | Sensitive to outliers |
| MAE | Same as data | Lower prediction error | Does not indicate whether large errors come from outliers |
| MAPE | Percentage (%) | Lower prediction error | Cannot be used when actual values are zero |
| \(R^2\) | Unitless | Better model fit | Cannot be directly compared across very different models |
To evaluate overall error → use MSE or RMSE
To make interpretation easier in the same unit as the data → use MAE or RMSE
To evaluate error in percentage terms → use MAPE
To measure how well the model explains the variance in the data → use \(R^2\)
MSE and RMSE are suitable when you want outliers to have a stronger impact on the model’s learning.
MAE is suitable when you want to reduce the influence of outliers.
MAPE is suitable for comparing models across datasets with different units.
\(R^2\) is suitable for assessing the overall quality of the model.
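To make the outlier point concrete, here is a small hypothetical comparison: a single large error inflates RMSE far more than MAE, because the error is squared before averaging.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_good = np.array([10.5, 11.5, 11.0, 13.5, 12.5])   # small errors everywhere
y_outl = np.array([10.0, 12.0, 11.0, 13.0, 22.0])   # one large error (an outlier)

for name, y_pred in [("small errors", y_good), ("one outlier", y_outl)]:
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    print(f"{name:>12}: RMSE = {rmse:.2f}, MAE = {mae:.2f}")
```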
A Confusion Matrix is a table used to evaluate the results of models applied to classification problems.
It compares the actual classes (Actual) with the predicted classes (Predicted).
From this matrix, four standard evaluation metrics can be calculated:
Accuracy
Precision
Recall
F1-score
A Confusion Matrix is a \(2 \times 2\) table (for Binary Classification problems) with the following components:
| Actual / Predicted | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive (P) | True Positive (TP) | False Negative (FN) |
| Actual Negative (N) | False Positive (FP) | True Negative (TN) |
True Positive (TP) : Predicted as “Positive” and actually Positive.
True Negative (TN) : Predicted as “Negative” and actually Negative.
False Positive (FP) : Predicted as “Positive” but actually “Negative” (also known as False Alarm or Type I Error).
False Negative (FN) : Predicted as “Negative” but actually “Positive” (also known as Missed Detection or Type II Error).
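The four cells can be counted directly from paired label vectors. Below is a minimal NumPy sketch with hypothetical binary labels (1 = Positive, 0 = Negative).

```python
import numpy as np

# Hypothetical labels for ten cases
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted Positive, actually Positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted Negative, actually Negative
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false alarm (Type I error)
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed detection (Type II error)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=4, FP=1, FN=2
```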
\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]
Measures the proportion of all correct predictions.
Works well when the dataset is balanced.
Not suitable when the dataset is highly imbalanced.
Example: If Accuracy = 90%, it means the model correctly predicts 90% of all cases.
\[ Precision = \frac{TP}{TP + FP} \]
Indicates the proportion of predicted “Positive” cases that are actually Positive.
Important in scenarios where False Positives are costly, such as fraud detection, where a false alarm triggers an unnecessary investigation.
Example: If Precision = 80%, it means that among all cases predicted as Positive, 80% are correct.
\[ Recall = \frac{TP}{TP + FN} \]
Indicates the proportion of all actual “Positive” cases that the model correctly identifies.
Important in scenarios where False Negatives are costly, such as cancer detection or terrorist identification.
Example: If Recall = 75%, it means the model correctly detects 75% of all actual Positive cases.
\[ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \]
The harmonic mean of Precision and Recall.
Useful when a balance between Precision and Recall is desired.
Example: If F1-score = 85%, it means the model achieves a good balance between Precision and Recall at 85%.
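All four metrics follow directly from the confusion-matrix counts. The helper below is a hypothetical sketch, reusing the counts from the NumPy example above (TP=3, TN=4, FP=1, FN=2).

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, Recall, and F1 computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=3, tn=4, fp=1, fn=2)
print(f"Accuracy={acc:.2f}, Precision={prec:.2f}, Recall={rec:.2f}, F1={f1:.2f}")
# Accuracy=0.70, Precision=0.75, Recall=0.60, F1=0.67
```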
With the given counts (TP = , FN = , FP = , TN = ), substitute into the formulas:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
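In practice these values are rarely computed by hand. Assuming scikit-learn is available, the same hypothetical labels as above give matching results:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```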
| Metric | Best suited for |
|---|---|
| Accuracy | When the dataset is balanced |
| Precision | When reducing False Positives is critical, e.g., fraud detection |
| Recall | When reducing False Negatives is critical, e.g., serious disease diagnosis |
| F1-Score | When a balance between Precision and Recall is needed |
| Metric | Meaning | Formula | Good Value |
|---|---|---|---|
| Accuracy | Overall correctness | \(\frac{TP + TN}{TP + TN + FP + FN}\) | > 90% is very good (depends on the problem) |
| Precision | Accuracy of the Positive Class | \(\frac{TP}{TP + FP}\) | The closer to 1, the better |
| Recall | Ability to detect the Positive Class | \(\frac{TP}{TP + FN}\) | The closer to 1, the better |
| F1-Score | Harmonic mean of Precision & Recall | \(2 \times \frac{Precision \times Recall}{Precision + Recall}\) | The closer to 1, the better |
If the dataset is imbalanced (e.g., 99% Negative and 1% Positive),
Accuracy alone can be misleading. In such cases, Precision, Recall, or F1-Score should also be used.
If the goal is high precision (minimizing False Positives) → use Precision
If the goal is to capture all actual Positives (minimizing False Negatives) → use Recall
If the goal is to balance Precision and Recall → use F1-Score
If the goal is to evaluate overall performance → use Accuracy (only when the dataset is balanced)
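The accuracy paradox on imbalanced data is easy to demonstrate. Assuming scikit-learn, the hypothetical "model" below always predicts Negative on a 99%/1% dataset: Accuracy looks excellent while Recall and F1 are zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical, highly imbalanced labels: 99 Negatives, 1 Positive
y_true = np.array([0] * 99 + [1])
y_pred = np.zeros(100, dtype=int)   # a trivial model that always predicts Negative

print("Accuracy :", accuracy_score(y_true, y_pred))                    # 0.99, looks great
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0 (no Positive predictions)
print("Recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0, misses the only Positive
print("F1-score :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```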
Underfitting and overfitting describe the two ways a model's complexity can be mismatched to the data:
Too simple (underfitting): High Bias, Low Variance → Poor performance.
Too complex (overfitting): Low Bias, High Variance → Poor generalization.
Just right: Balanced bias and variance → Minimum test error and best generalization to new data.
The Bias–Variance Tradeoff explains why we worry about underfitting and overfitting, and techniques such as Train/Test Split and K-Fold Cross-Validation are practical ways to detect and control these problems.
Train & Test Split
Purpose: To evaluate how well a model generalizes to unseen data.
How it works:
Split the dataset into two parts: a training set (commonly 70-80% of the data) used to fit the model, and a test set (the remainder) used only to evaluate it on unseen data.
Benefit: Provides a quick estimate of accuracy but may depend heavily on which data ends up in training vs. test.
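Assuming scikit-learn, a minimal train/test split sketch on synthetic (hypothetical) data looks like this; the 80/20 split ratio and the linear model are illustrative choices, not requirements.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear relationship (for illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 2, size=200)

# Hold out 20% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE: {rmse_test:.3f}")
```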
K-Fold Cross-Validation
Purpose: To get a more reliable and stable performance estimate.
How it works: split the data into K folds of roughly equal size (e.g., K = 5 or 10), train the model K times, each time holding out a different fold as the test set and training on the remaining K-1 folds, then average the K scores.
Benefit: every observation is used once for testing and K-1 times for training, so the performance estimate is more stable and less dependent on a single random split.
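A corresponding K-Fold sketch, again assuming scikit-learn and the same kind of synthetic data; with K = 5, every observation is held out exactly once and the five scores are averaged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data in the same style as the train/test example
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 2, size=200)

# 5-fold cross-validation: each fold serves once as the held-out test set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", np.round(scores, 3))
print("Mean R^2    :", round(float(scores.mean()), 3))
```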