Model Evaluation

Somsak Chanaim

International College of Digital Innovation, CMU

September 18, 2025

Evaluation of Prediction Models

Model performance can be evaluated using several metrics, such as:

  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • Mean Absolute Error (MAE)
  • Mean Absolute Percentage Error (MAPE)
  • \(R^2\) (Coefficient of Determination)

Each of these metrics has a different meaning and use case, as explained below:

1. Mean Squared Error (MSE)

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i- \hat{y}_i)^2 \]

  • Measures the squared difference between the actual values (\(y_i\)) and the predicted values (\(\hat{y}_i\)).

  • A lower MSE indicates that the model’s predictions are closer to the actual values.

  • A drawback is that the unit of MSE is the square of the original unit of the data, which makes interpretation more difficult.

2. Root Mean Squared Error (RMSE)

\[ RMSE = \sqrt{\text{MSE}} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i- \hat{y}_i)^2} \]

  • The square root of MSE, which brings the error back to the same unit as the variable being predicted.

  • Provides a more interpretable measure of the prediction error.

  • A lower RMSE indicates that the model has lower prediction error.

3. Mean Absolute Error (MAE)

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i- \hat{y}_i| \]

  • Calculates the average of the absolute differences between actual and predicted values.

  • Less sensitive to outliers compared to MSE or RMSE.

  • Easy to interpret because it is in the same unit as the target variable.

4. Mean Absolute Percentage Error (MAPE)

\[ MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i- \hat{y}_i}{y_i} \right| \]

  • Measures prediction error in percentage terms,
    which allows comparison across datasets with different units.

  • A drawback is that if the actual value (\(y_i\)) is zero,
    MAPE can become extremely large or undefined.

5. Coefficient of Determination (\(R^2\))

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i- \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i- \bar{y})^2} \]

  • A metric that indicates the proportion of variance in the dependent variable (\(y\))
    that can be explained by the model.

  • \(R^2\) typically ranges from 0 to 1, although it can be negative for a very poor model.

  • \(R^2 = 1\) means the model explains all the variability in the data.

  • \(R^2 = 0\) means the model explains none of the variability.

  • A negative \(R^2\) indicates that the model performs worse than simply predicting the mean \(\bar{y}\).
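
As a quick check of these definitions, here is a minimal Python/NumPy sketch that computes all five metrics; the `y_true` and `y_pred` arrays are made-up illustrative values, not data from this lecture.

```python
# Minimal sketch of the five regression metrics defined above.
# y_true and y_pred are hypothetical values used only for illustration.
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])   # actual values y_i
y_pred = np.array([2.5, 5.5, 7.0, 11.0, 11.5])   # model predictions y_hat_i

errors = y_true - y_pred

mse  = np.mean(errors ** 2)                       # Mean Squared Error
rmse = np.sqrt(mse)                               # Root Mean Squared Error
mae  = np.mean(np.abs(errors))                    # Mean Absolute Error
mape = np.mean(np.abs(errors / y_true)) * 100     # MAPE (%); undefined if any y_i == 0
r2   = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.2f}%  R2={r2:.3f}")
```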

Summary

| Metric | Unit | Interpretation | Weakness |
|---|---|---|---|
| MSE | Squared units | Lower value = lower prediction error | Hard to interpret because the unit is squared |
| RMSE | Same as data | Lower value = lower prediction error | Sensitive to outliers |
| MAE | Same as data | Lower value = lower prediction error | Does not indicate whether large errors come from outliers |
| MAPE | Percentage (%) | Lower value = lower prediction error | Cannot be used when actual values are zero |
| \(R^2\) | Unitless | Higher value = better model fit | Cannot be directly compared across very different models |

Choosing the Right Metric

  • To evaluate overall error → use MSE or RMSE

  • To make interpretation easier in the same unit as the data → use MAE or RMSE

  • To evaluate error in percentage terms → use MAPE

  • To measure how well the model explains the variance in the data → use \(R^2\)

  • MSE and RMSE are suitable when you want large errors (for example, those caused by outliers) to be penalized more heavily.

  • MAE is suitable when you want to reduce the influence of outliers.

  • MAPE is suitable for comparing models across datasets with different units.

  • \(R^2\) is suitable for assessing the overall quality of the model.

Evaluating Classification Model Performance with a Confusion Matrix

A Confusion Matrix is a table used to evaluate the results of classification models.
It compares the actual classes (Actual) with the predicted classes (Predicted).

From this matrix, four key standard evaluation metrics can be calculated:

  • Accuracy

  • Precision

  • Recall

  • F1-score

1. What is a Confusion Matrix?

A Confusion Matrix is a \(2 \times 2\) table (for Binary Classification problems) with the following components:

| Actual \ Predicted | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive (P) | True Positive (TP) | False Negative (FN) |
| Actual Negative (N) | False Positive (FP) | True Negative (TN) |

  • True Positive (TP) : Predicted as “Positive” and actually Positive.

  • True Negative (TN) : Predicted as “Negative” and actually Negative.

  • False Positive (FP) : Predicted as “Positive” but actually “Negative” (also known as False Alarm or Type I Error).

  • False Negative (FN) : Predicted as “Negative” but actually “Positive” (also known as Missed Detection or Type II Error).
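
To make the four cells concrete, here is a minimal Python sketch that tallies TP, TN, FP, and FN from two label vectors; the `actual` and `predicted` lists are made-up binary labels (1 = Positive, 0 = Negative) used only for illustration.

```python
# Tally the four confusion-matrix cells from hypothetical label vectors.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]   # true classes
predicted = [1, 0, 1, 0, 1, 0, 1, 0]   # model outputs

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")   # TP=3  TN=3  FP=1  FN=1
```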

2. Four Standard Evaluation Metrics

(1) Accuracy

\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]

  • Measures the proportion of all correct predictions.

  • Works well when the dataset is balanced.

  • Not suitable when the dataset is highly imbalanced.

Example: If Accuracy = 90%, it means the model correctly predicts 90% of all cases.

(2) Precision (Positive Predictive Value)

\[ Precision = \frac{TP}{TP + FP} \]

  • Indicates the proportion of predicted “Positive” cases that are actually Positive.

  • Important in scenarios where False Positives have a high cost, such as fraud detection, where every false alarm triggers a costly investigation.

Example: If Precision = 80%, it means that among all cases predicted as Positive, 80% are correct.

(3) Recall (Sensitivity or True Positive Rate)

\[ Recall = \frac{TP}{TP + FN} \]

  • Indicates the proportion of all actual “Positive” cases that the model correctly identifies.

  • Important in scenarios where False Negatives are costly, such as cancer detection or terrorist identification.

Example: If Recall = 75%, it means the model correctly detects 75% of all actual Positive cases.

(4) F1-score

\[ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \]

  • The harmonic mean of Precision and Recall.

  • Useful when a balance between Precision and Recall is desired.

Example: If F1-score = 85%, it means the model achieves a good balance between Precision and Recall at 85%.

3. Examples

Given the four confusion-matrix counts (TP, FN, FP, TN), Accuracy, Precision, Recall, and the F1-score follow directly from the formulas above; the sketch below works through one hypothetical set of counts.
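
A minimal Python sketch with hypothetical counts (TP = 50, FN = 10, FP = 5, TN = 35, chosen only for illustration):

```python
# Compute the four classification metrics from hypothetical confusion-matrix counts.
tp, fn, fp, tn = 50, 10, 5, 35

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # (50 + 35) / 100 = 0.85
precision = tp / (tp + fp)                                  # 50 / 55 ≈ 0.909
recall    = tp / (tp + fn)                                  # 50 / 60 ≈ 0.833
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.870

print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
      f"Recall={recall:.3f}  F1={f1:.3f}")
```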

4. Choosing the Right Metric

| Metric | Best suited for |
|---|---|
| Accuracy | When the dataset is balanced |
| Precision | When reducing False Positives is critical, e.g., fraud detection |
| Recall | When reducing False Negatives is critical, e.g., serious disease diagnosis |
| F1-Score | When a balance between Precision and Recall is needed |

5. Summary

| Metric | Meaning | Formula | Good Value |
|---|---|---|---|
| Accuracy | Overall correctness | \(\frac{TP + TN}{TP + TN + FP + FN}\) | > 90% is very good (depends on the problem) |
| Precision | Accuracy of the Positive class | \(\frac{TP}{TP + FP}\) | The closer to 1, the better |
| Recall | Ability to detect the Positive class | \(\frac{TP}{TP + FN}\) | The closer to 1, the better |
| F1-Score | Harmonic mean of Precision & Recall | \(2 \times \frac{Precision \times Recall}{Precision + Recall}\) | The closer to 1, the better |

If the dataset is imbalanced (e.g., 99% Negative and 1% Positive),
Accuracy alone can be misleading. In such cases, Precision, Recall, or F1-Score should also be used.
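
The following Python sketch illustrates this pitfall with made-up numbers: a trivial "model" that always predicts Negative on a dataset of 990 Negatives and 10 Positives reaches 99% Accuracy while detecting no Positive cases at all.

```python
# Illustration of why Accuracy misleads on imbalanced data (hypothetical counts).
tp, fn = 0, 10       # all 10 actual Positives are missed
tn, fp = 990, 0      # all 990 actual Negatives are "correct" by default

accuracy  = (tp + tn) / (tp + tn + fp + fn)        # 990 / 1000 = 0.99
recall    = tp / (tp + fn) if (tp + fn) else 0.0   # 0 / 10 = 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0   # no Positive predictions, reported as 0.0

print(f"Accuracy={accuracy:.2f}  Recall={recall:.2f}  Precision={precision:.2f}")
# High Accuracy, yet the model never detects a single Positive case.
```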

Guidelines for Choosing the Right Metric

  • If the goal is high precision (minimizing False Positives) → use Precision

  • If the goal is to capture all actual Positives (minimizing False Negatives) → use Recall

  • If the goal is to balance Precision and Recall → use F1-Score

  • If the goal is to evaluate overall performance → use Accuracy (only when the dataset is balanced)

Model Evaluation and Scoring in Orange Data Mining

Example Calculation with Orange Data Mining

Titanic dataset

Underfitting vs. Overfitting in Machine Learning

Underfitting

  • Definition: When a model is too simple to capture the underlying patterns in the data.

  • Cause:

    • Using a model that lacks complexity (e.g., linear regression on nonlinear data).
    • Too few features or ignoring important relationships.
  • Symptoms:

    • High error on training data.
    • High error on test data.

Overfitting

  • Definition: When a model is too complex and learns not only the patterns but also the noise in the training data.

  • Cause:

    • Too many parameters/features.
    • Model trained for too long without regularization.
  • Symptoms:

    • Very low error on training data.
    • Very high error on test data (poor generalization).

The “Optimum” (Good Fit)

  • Definition: A balanced model that captures underlying patterns while ignoring random noise.

  • Goal: Minimize both bias and variance.

  • Symptoms:

    • Low error on training data.
    • Low error on test data (good generalization).

Bias–Variance Tradeoff

Note

Too simple (underfitting): High Bias, Low Variance → Poor performance.

Too complex (overfitting): Low Bias, High Variance → Poor generalization.

Just right: Balanced bias and variance → Minimum test error and best generalization to new data.

The Bias–Variance Tradeoff explains why we worry about underfitting and overfitting, and techniques such as Train/Test Split and K-Fold Cross-Validation are practical ways to detect and control these problems.
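
A minimal NumPy sketch of this tradeoff: polynomials of increasing degree are fitted to synthetic noisy data (the sine-plus-noise data and the chosen degrees are assumptions made only for illustration), and the train/test errors show the underfitting and overfitting patterns described above.

```python
# Underfitting vs. overfitting illustrated with polynomial fits on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)  # nonlinear signal + noise

# Simple train/test split: every other point goes to the test set.
x_train, y_train = x[::2], y[::2]
x_test,  y_test  = x[1::2], y[1::2]

for degree in (1, 4, 15):                 # too simple / about right / too complex
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test  = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")

# Typical pattern: degree 1 has high error on both sets (underfitting),
# degree 15 has very low training error but higher test error (overfitting).
```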

Train/Test Split and K-Fold Cross-Validation

Train & Test Split

  • Purpose: To evaluate how well a model generalizes to unseen data.

  • How it works:

    • Split the dataset into two parts:

      • Training set → used to fit (train) the model.
      • Test set → used to check performance on unseen data.
  • Benefit: Provides a quick estimate of accuracy but may depend heavily on which data ends up in training vs. test.
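
A minimal scripted counterpart to the split described above, using scikit-learn (the iris dataset, the 70/30 split ratio, and logistic regression are illustrative choices, not part of the lecture's Orange workflow):

```python
# Train/test split sketch with scikit-learn on an illustrative dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows as an unseen test set (stratified by class).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy: ", model.score(X_test, y_test))
```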

K-Fold Cross-Validation

  • Purpose: To get a more reliable and stable performance estimate.

  • How it works:

    • Split the dataset into K equal parts (folds).
    • Train the model on K–1 folds and test on the remaining fold.
    • Repeat this process K times, each time using a different fold as the test set.
    • Average the results to get the final performance score.
  • Benefit:

    • Reduces the risk of bias from one particular train/test split.
    • Makes better use of all available data for both training and testing.
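
A minimal K-fold cross-validation sketch with scikit-learn (k = 5, the iris dataset, and logistic regression are again illustrative assumptions):

```python
# K-fold cross-validation sketch: average accuracy over 5 folds.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train on 4 folds, evaluate on the remaining fold, repeated 5 times.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```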

Cross-Validation in Orange Data Mining
