Model Evaluation

Somsak Chanaim

International College of Digital Innovation, CMU

September 18, 2025

Evaluation of Prediction Models

Model performance can be evaluated using several metrics, such as:

  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • Mean Absolute Error (MAE)
  • Mean Absolute Percentage Error (MAPE)
  • \(R^2\) (Coefficient of Determination)

Each of these metrics has a different meaning and use case, as explained below:

1. Mean Squared Error (MSE)

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i- \hat{y}_i)^2 \]

  • Measures the squared difference between the actual values (\(y_i\)) and the predicted values (\(\hat{y}_i\)).

  • A lower MSE indicates that the model’s predictions are closer to the actual values.

  • A drawback is that the unit of MSE is the square of the original unit of the data, which makes interpretation more difficult.

2. Root Mean Squared Error (RMSE)

\[ RMSE = \sqrt{\text{MSE}} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i- \hat{y}_i)^2} \]

  • The square root of MSE, which brings the error back to the same unit as the variable being predicted.

  • Provides a more interpretable measure of the prediction error.

  • A lower RMSE indicates that the model has lower prediction error.

3. Mean Absolute Error (MAE)

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i- \hat{y}_i| \]

  • Calculates the average of the absolute differences between actual and predicted values.

  • Less sensitive to outliers compared to MSE or RMSE.

  • Easy to interpret because it is in the same unit as the target variable.

4. Mean Absolute Percentage Error (MAPE)

\[ MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i- \hat{y}_i}{y_i} \right| \]

  • Measures prediction error in percentage terms,
    which allows comparison across datasets with different units.

  • A drawback is that if the actual value (\(y_i\)) is zero,
    MAPE can become extremely large or undefined.

5. Coefficient of Determination (\(R^2\))

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i- \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i- \bar{y})^2} \]

  • A metric that indicates the proportion of variance in the dependent variable (\(y\))
    that can be explained by the model.

  • \(R^2\) typically ranges from 0 to 1, although it can be negative for a very poor model.

  • \(R^2 = 1\) means the model explains all the variability in the data.

  • \(R^2 = 0\) means the model explains none of the variability.

  • A negative \(R^2\) indicates that the model performs worse than simply predicting the mean \(\bar{y}\).
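
As a quick check of these definitions, here is a minimal Python/NumPy sketch that computes all five metrics; the `y_true` and `y_pred` arrays are made-up illustrative values, not data from this lecture.

```python
# Minimal sketch of the five regression metrics defined above.
# y_true and y_pred are hypothetical values used only for illustration.
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])   # actual values y_i
y_pred = np.array([2.5, 5.5, 7.0, 11.0, 11.5])   # model predictions y_hat_i

errors = y_true - y_pred

mse  = np.mean(errors ** 2)                       # Mean Squared Error
rmse = np.sqrt(mse)                               # Root Mean Squared Error
mae  = np.mean(np.abs(errors))                    # Mean Absolute Error
mape = np.mean(np.abs(errors / y_true)) * 100     # MAPE (%); undefined if any y_i == 0
r2   = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.2f}%  R2={r2:.3f}")
```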

Summary

| Metric | Unit | Interpretation | Weakness |
|---|---|---|---|
| MSE | Squared units | Lower value = lower prediction error | Hard to interpret because the unit is squared |
| RMSE | Same as data | Lower value = lower prediction error | Sensitive to outliers |
| MAE | Same as data | Lower value = lower prediction error | Does not indicate whether large errors come from outliers |
| MAPE | Percentage (%) | Lower value = lower prediction error | Cannot be used when actual values are zero |
| \(R^2\) | Unitless | Higher value = better model fit | Cannot be directly compared across very different models |

Choosing the Right Metric

  • To evaluate overall error → use MSE or RMSE

  • To make interpretation easier in the same unit as the data → use MAE or RMSE

  • To evaluate error in percentage terms → use MAPE

  • To measure how well the model explains the variance in the data → use \(R^2\)

  • MSE and RMSE are suitable when you want large errors (for example, those caused by outliers) to be penalized more heavily.

  • MAE is suitable when you want to reduce the influence of outliers.

  • MAPE is suitable for comparing models across datasets with different units.

  • \(R^2\) is suitable for assessing the overall quality of the model.

Evaluating Classification Model Performance with a Confusion Matrix

A Confusion Matrix is a table used to evaluate the results of classification models.
It compares the actual classes (Actual) with the predicted classes (Predicted).

From this matrix, four key standard evaluation metrics can be calculated:

  • Accuracy

  • Precision

  • Recall

  • F1-score

1. What is a Confusion Matrix?

A Confusion Matrix is a \(2 \times 2\) table (for Binary Classification problems) with the following components:

| Actual \ Predicted | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive (P) | True Positive (TP) | False Negative (FN) |
| Actual Negative (N) | False Positive (FP) | True Negative (TN) |

  • True Positive (TP) : Predicted as “Positive” and actually Positive.

  • True Negative (TN) : Predicted as “Negative” and actually Negative.

  • False Positive (FP) : Predicted as “Positive” but actually “Negative” (also known as False Alarm or Type I Error).

  • False Negative (FN) : Predicted as “Negative” but actually “Positive” (also known as Missed Detection or Type II Error).
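
To make the four cells concrete, here is a minimal Python sketch that tallies TP, TN, FP, and FN from two label vectors; the `actual` and `predicted` lists are made-up binary labels (1 = Positive, 0 = Negative) used only for illustration.

```python
# Tally the four confusion-matrix cells from hypothetical label vectors.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]   # true classes
predicted = [1, 0, 1, 0, 1, 0, 1, 0]   # model outputs

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")   # TP=3  TN=3  FP=1  FN=1
```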

2. Four Standard Evaluation Metrics

(1) Accuracy

\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]

  • Measures the proportion of all correct predictions.

  • Works well when the dataset is balanced.

  • Not suitable when the dataset is highly imbalanced.

Example: If Accuracy = 90%, it means the model correctly predicts 90% of all cases.

(2) Precision (Positive Predictive Value)

\[ Precision = \frac{TP}{TP + FP} \]

  • Indicates the proportion of predicted “Positive” cases that are actually Positive.

  • Important in scenarios where False Positives have a high cost, such as fraud detection, where every false alarm triggers a costly investigation.

Example: If Precision = 80%, it means that among all cases predicted as Positive, 80% are correct.

(3) Recall (Sensitivity or True Positive Rate)

\[ Recall = \frac{TP}{TP + FN} \]

  • Indicates the proportion of all actual “Positive” cases that the model correctly identifies.

  • Important in scenarios where False Negatives are costly, such as cancer detection or terrorist identification.

Example: If Recall = 75%, it means the model correctly detects 75% of all actual Positive cases.

(4) F1-score

\[ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \]

  • The harmonic mean of Precision and Recall.

  • Useful when a balance between Precision and Recall is desired.

Example: If F1-score = 85%, it means the model achieves a good balance between Precision and Recall at 85%.

3. Examples

Given the four confusion-matrix counts (TP, FN, FP, TN), Accuracy, Precision, Recall, and the F1-score follow directly from the formulas above; the sketch below works through one hypothetical set of counts.
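
A minimal Python sketch with hypothetical counts (TP = 50, FN = 10, FP = 5, TN = 35, chosen only for illustration):

```python
# Compute the four classification metrics from hypothetical confusion-matrix counts.
tp, fn, fp, tn = 50, 10, 5, 35

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # (50 + 35) / 100 = 0.85
precision = tp / (tp + fp)                                  # 50 / 55 ≈ 0.909
recall    = tp / (tp + fn)                                  # 50 / 60 ≈ 0.833
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.870

print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
      f"Recall={recall:.3f}  F1={f1:.3f}")
```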

4. Choosing the Right Metric

| Metric | Best suited for |
|---|---|
| Accuracy | When the dataset is balanced |
| Precision | When reducing False Positives is critical, e.g., fraud detection |
| Recall | When reducing False Negatives is critical, e.g., serious disease diagnosis |
| F1-Score | When a balance between Precision and Recall is needed |

5. Summary

| Metric | Meaning | Formula | Good Value |
|---|---|---|---|
| Accuracy | Overall correctness | \(\frac{TP + TN}{TP + TN + FP + FN}\) | > 90% is very good (depends on the problem) |
| Precision | Accuracy of the Positive class | \(\frac{TP}{TP + FP}\) | The closer to 1, the better |
| Recall | Ability to detect the Positive class | \(\frac{TP}{TP + FN}\) | The closer to 1, the better |
| F1-Score | Harmonic mean of Precision & Recall | \(2 \times \frac{Precision \times Recall}{Precision + Recall}\) | The closer to 1, the better |

If the dataset is imbalanced (e.g., 99% Negative and 1% Positive),
Accuracy alone can be misleading. In such cases, Precision, Recall, or F1-Score should also be used.
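
The following Python sketch illustrates this pitfall with made-up numbers: a trivial "model" that always predicts Negative on a dataset of 990 Negatives and 10 Positives reaches 99% Accuracy while detecting no Positive cases at all.

```python
# Illustration of why Accuracy misleads on imbalanced data (hypothetical counts).
tp, fn = 0, 10       # all 10 actual Positives are missed
tn, fp = 990, 0      # all 990 actual Negatives are "correct" by default

accuracy  = (tp + tn) / (tp + tn + fp + fn)        # 990 / 1000 = 0.99
recall    = tp / (tp + fn) if (tp + fn) else 0.0   # 0 / 10 = 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0   # no Positive predictions, reported as 0.0

print(f"Accuracy={accuracy:.2f}  Recall={recall:.2f}  Precision={precision:.2f}")
# High Accuracy, yet the model never detects a single Positive case.
```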

Guidelines for Choosing the Right Metric

  • If the goal is high precision (minimizing False Positives) → use Precision

  • If the goal is to capture all actual Positives (minimizing False Negatives) → use Recall

  • If the goal is to balance Precision and Recall → use F1-Score

  • If the goal is to evaluate overall performance → use Accuracy (only when the dataset is balanced)

Model Evaluation and Scoring in Orange Data Mining

Example Calculation with Orange Data Mining

Titanic dataset

Underfitting vs. Overfitting in Machine Learning

Underfitting

  • Definition: When a model is too simple to capture the underlying patterns in the data.

  • Cause:

    • Using a model that lacks complexity (e.g., linear regression on nonlinear data).
    • Too few features or ignoring important relationships.
  • Symptoms:

    • High error on training data.
    • High error on test data.

Overfitting

  • Definition: When a model is too complex and learns not only the patterns but also the noise in the training data.

  • Cause:

    • Too many parameters/features.
    • Model trained for too long without regularization.
  • Symptoms:

    • Very low error on training data.
    • Very high error on test data (poor generalization).

The “Optimum” (Good Fit)

  • Definition: A balanced model that captures underlying patterns while ignoring random noise.

  • Goal: Minimize both bias and variance.

  • Symptoms:

    • Low error on training data.
    • Low error on test data (good generalization).

Bias–Variance Tradeoff

Note

Too simple (underfitting): High Bias, Low Variance → Poor performance.

Too complex (overfitting): Low Bias, High Variance → Poor generalization.

Just right: Balanced bias and variance → Minimum test error and best generalization to new data.

The Bias–Variance Tradeoff explains why we worry about underfitting and overfitting, and techniques such as Train/Test Split and K-Fold Cross-Validation are practical ways to detect and control these problems.
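
A minimal NumPy sketch of this tradeoff: polynomials of increasing degree are fitted to synthetic noisy data (the sine-plus-noise data and the chosen degrees are assumptions made only for illustration), and the train/test errors show the underfitting and overfitting patterns described above.

```python
# Underfitting vs. overfitting illustrated with polynomial fits on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)  # nonlinear signal + noise

# Simple train/test split: every other point goes to the test set.
x_train, y_train = x[::2], y[::2]
x_test,  y_test  = x[1::2], y[1::2]

for degree in (1, 4, 15):                 # too simple / about right / too complex
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test  = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")

# Typical pattern: degree 1 has high error on both sets (underfitting),
# degree 15 has very low training error but higher test error (overfitting).
```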

Train/Test Split and K-Fold Cross-Validation

Train & Test Split

  • Purpose: To evaluate how well a model generalizes to unseen data.

  • How it works:

    • Split the dataset into two parts:

      • Training set → used to fit (train) the model.
      • Test set → used to check performance on unseen data.
  • Benefit: Provides a quick estimate of accuracy but may depend heavily on which data ends up in training vs. test.
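
A minimal scripted counterpart to the split described above, using scikit-learn (the iris dataset, the 70/30 split ratio, and logistic regression are illustrative choices, not part of the lecture's Orange workflow):

```python
# Train/test split sketch with scikit-learn on an illustrative dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows as an unseen test set (stratified by class).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy: ", model.score(X_test, y_test))
```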

K-Fold Cross-Validation

  • Purpose: To get a more reliable and stable performance estimate.

  • How it works:

    • Split the dataset into K equal parts (folds).
    • Train the model on K–1 folds and test on the remaining fold.
    • Repeat this process K times, each time using a different fold as the test set.
    • Average the results to get the final performance score.
  • Benefit:

    • Reduces the risk of bias from one particular train/test split.
    • Makes better use of all available data for both training and testing.
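
A minimal K-fold cross-validation sketch with scikit-learn (k = 5, the iris dataset, and logistic regression are again illustrative assumptions):

```python
# K-fold cross-validation sketch: average accuracy over 5 folds.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train on 4 folds, evaluate on the remaining fold, repeated 5 times.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```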

Cross-Validation in Orange Data Mining
