Regression Model

Somsak Chanaim

International College of Digital Innovation, CMU

September 19, 2025

Linear Regression

Linear Regression is a statistical and Machine Learning technique used to model the relationship between:

  • one or more independent variables (predictors), and

  • a dependent variable (response),

assuming that the relationship is linear. Regression is a powerful tool for business data analysis.

Pearson Correlation

Pearson Correlation is a statistical measure used to evaluate the linear relationship between two variables.

It indicates both the direction of the relationship and the strength of that relationship.

Formula for Calculating Pearson Correlation

\[r = \dfrac{\displaystyle\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\displaystyle\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}\]

Where:

  • \(x_i, y_i\) are the sample data points of variables \(x\) and \(y\)

  • \(\bar{x}, \bar{y}\) are the means of \(x\) and \(y\)

  • \(n\) is the number of observations
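
To make the formula concrete, here is a minimal Python sketch (the small data set is invented for illustration) that applies the formula term by term and cross-checks the result against NumPy's built-in np.corrcoef:

```python
import numpy as np

# Hypothetical paired observations of two variables x and y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Apply the formula directly:
# r = sum((x_i - x̄)(y_i - ȳ)) / sqrt(sum((x_i - x̄)^2) * sum((y_i - ȳ)^2))
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check against NumPy's built-in correlation matrix
r_builtin = np.corrcoef(x, y)[0, 1]

print(f"manual r = {r_manual:.6f}")
print(f"numpy  r = {r_builtin:.6f}")  # the two values agree
```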

Values of Pearson Correlation

The value of \(r\) ranges from \(-1\) to \(+1\):

  • \(r = +1\): Perfect positive linear relationship. Variables \(x\) and \(y\) increase together in a straight-line pattern.

  • \(r = -1\): Perfect negative linear relationship. As \(x\) increases, \(y\) decreases in a straight-line pattern.

  • \(r = 0\): No linear relationship between \(x\) and \(y\).

Interpretation of \(|r|\)

  Value of \(|r|\)       Level of Relationship
  -------------------    ------------------------------------------
  \(0.9\) to \(1.0\)     Very strong linear relationship
  \(0.7\) to \(0.9\)     Strong linear relationship
  \(0.5\) to \(0.7\)     Moderate linear relationship
  \(0.3\) to \(0.5\)     Weak linear relationship
  \(0.0\) to \(0.3\)     Very weak (almost no) linear relationship

Limitations of Pearson Correlation

  1. Measures only linear relationships: Pearson correlation is applicable only when the relationship between the variables is linear. If the relationship is nonlinear, the value of \(r\) may misleadingly suggest no relationship.

    • For example, if data points are distributed in a circular pattern, \(r\) may equal 0 even though the variables are still related.

  2. Sensitive to outliers: the value of \(r\) can change drastically if outliers are present in the data.

  3. Requires quantitative data: Pearson correlation can only be applied to numerical (quantitative) data.

  4. Assumes normality: both variables should approximately follow a normal distribution, and their variances should be similar.

Example of Pearson Correlation Calculation

Example 1: Positive correlation

Example 2: No correlation

Correlation Analysis in Orange Data Mining

Practical Applications

  • Finance: Examining the relationship between two stock prices (CAPM model)

  • Science: Checking the relationship between variables such as temperature and humidity

  • Education: Exploring the relationship between study hours and exam performance

1. Nonlinear Relationship (\(y\) is a parabolic function of \(x\))

Pearson correlation is 0.0222165.

2. Circular or Elliptical Data

Pearson correlation is -0.0064386.

3. Outliers affect the relationship

Pearson correlation is 0.9865504.

4. Stepwise relationship

Pearson correlation is 0.8250965.

5. Multiple-group relationship (Clusters of Data)

Pearson correlation is 0.8939282.
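
The patterns above are easy to reproduce. The following sketch generates data similar to cases 1 to 3; the random seed and sample sizes are my own choices, so the correlations will differ slightly from the values quoted above:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# 1. Parabola: y depends strongly on x, yet r is near 0
x = np.linspace(-3, 3, 200)
y_parab = x ** 2 + rng.normal(0, 0.3, x.size)
print("parabola r =", round(np.corrcoef(x, y_parab)[0, 1], 4))

# 2. Circle: x and y are related, but there is no linear trend
t = rng.uniform(0, 2 * np.pi, 200)
print("circle   r =", round(np.corrcoef(np.cos(t), np.sin(t))[0, 1], 4))

# 3. Outlier: one extreme point inflates r on otherwise uncorrelated data
x3, y3 = rng.normal(0, 1, 50), rng.normal(0, 1, 50)
x3[-1], y3[-1] = 10.0, 10.0  # a single far-away point
print("outlier  r =", round(np.corrcoef(x3, y3)[0, 1], 4))
```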

Principle of Linear Regression

Linear Regression attempts to find the best-fitting linear equation of the form:

\[y = f(x_1,x_2,\cdots,x_n)+\varepsilon =\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon\]

  • \(y\): Dependent variable (response)

  • \(x_1, x_2, \dots, x_n\): Independent variables (predictors)

  • \(\beta_0\): Constant term (intercept) or the point where the line crosses the \(y\)-axis

  • \(\beta_1, \beta_2, \dots, \beta_n\): Coefficients of the independent variables

  • \(\varepsilon\): Error term or residual

The goal of Linear Regression is to find the coefficients \(\beta_0, \beta_1, \dots, \beta_n\) that make the straight-line equation describe the data as accurately as possible by minimizing the error between the predicted values (\(\hat{y}\)) and the actual values (\(y\)).

The most common method used to fit the model is Ordinary Least Squares (OLS).

Ordinary Least Squares (OLS)

The principle of Ordinary Least Squares (OLS), in the context of the Linear Regression equation above, is to find the coefficients (\(\beta_0, \beta_1, \dots, \beta_n\)) that minimize the sum of squared errors between the actual values (\(y_i\)) and the predicted values (\(\hat{y}_i\)).

Which Line Fits Best? OLS vs User-Defined Line

  [Interactive demo: an OLS-fitted line and a user-defined line are overlaid on the data, each shown with its equation and its sum of squared errors (SSE); the OLS line always attains the smaller SSE.]

Ordinary Least Squares (OLS)

OLS (Ordinary Least Squares) tries to find the best line (or plane, if more than one \(x\)) that fits the data by choosing \(\beta_0, \beta_1, \dots, \beta_n\) so that the prediction of \(y\) is as close as possible to the real values.

Principle of OLS

OLS aims to find the coefficients (\(\beta\)) that minimize the
Residual Sum of Squares (RSS):

\[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Where:

  • \(y_i\): the actual value of the dependent variable for the \(i\)-th observation

  • \(\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in}\):
    the predicted value from the model

  • \(n\): the number of observations

Mathematical Procedure

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_n x_{in} + \varepsilon_i, \quad i = 1, 2, \dots, m\]

  1. Write \(Y\), \(X\), \(\beta\), and \(\varepsilon\) in matrix form:

\[\begin{aligned}Y &= \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}, X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1n} \\ 1 & x_{21} & x_{22} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \dots & x_{mn} \end{bmatrix},\\ \beta &= \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{bmatrix},\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_m \end{bmatrix}\end{aligned}\]

  2. Predicted values (\(\hat{Y}\)) in matrix form:
    \[ \hat{Y} = X \beta \]

  3. Residuals:
    \[ \varepsilon = Y - \hat{Y} = Y - X \beta \]

  4. Residual Sum of Squares (RSS):
    \[ RSS = \varepsilon^T \varepsilon = (Y - X \beta)^T (Y - X \beta) \]

  5. Find \(\beta\) that minimizes \(RSS\):
    \[ \hat{\beta} = (X^T X)^{-1} X^T Y \]

    • \(X^T\): transpose of \(X\)
    • \((X^T X)^{-1}\): inverse of \(X^T X\)
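
The closed-form solution translates directly into code. Below is a minimal NumPy sketch with an invented data set; solving the normal equations with np.linalg.solve is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Hypothetical data: m = 6 observations, n = 2 predictors
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 6.0],
                  [6.0, 5.0]])
y = np.array([7.1, 6.9, 13.2, 12.8, 19.1, 18.7])

# Prepend the column of ones so beta_0 (the intercept) is estimated too
X = np.column_stack([np.ones(len(y)), X_raw])

# Closed form: beta_hat = (X^T X)^{-1} X^T Y, computed via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
rss = residuals @ residuals
print("beta_hat =", beta_hat)
print("RSS      =", rss)
```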

Types of Linear Regression

Simple Linear Regression

  • Uses only one independent variable (\(x\))

  • Equation:
    \[ y = \beta_0 + \beta_1 x + \varepsilon \]

  • Example: Using height (\(x\)) to predict weight (\(y\))

Multiple Linear Regression

  • Uses more than one independent variable (\(x_1, x_2, \dots, x_n\))

  • Equation:
    \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon \]

  • Example: Using age (\(x_1\)) and income level (\(x_2\)) to predict savings amount (\(y\))
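
Both types can be fitted with any standard library. As a sketch, here is the multiple-regression example above (age and income predicting savings) fitted with scikit-learn; all numbers are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: age (x1) and income level (x2) predicting savings (y)
X = np.array([[25, 30_000],
              [32, 45_000],
              [40, 52_000],
              [48, 61_000],
              [55, 75_000]], dtype=float)
y = np.array([2_000, 5_500, 8_000, 11_000, 16_000], dtype=float)

model = LinearRegression().fit(X, y)
print("intercept beta_0:", model.intercept_)
print("coefficients beta_1, beta_2:", model.coef_)
print("prediction for age 35, income 50,000:", model.predict([[35, 50_000]])[0])
```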

Assumptions of Linear Regression

  1. Linearity: The relationship between independent and dependent variables must be linear.

  2. Independence of errors: The error terms (\(\varepsilon\)) should not be correlated with each other.

  3. Constant variance (Homoscedasticity): The errors should have the same variance across all levels of the independent variables.

  4. Normality: The errors should be approximately normally distributed.

  5. No Multicollinearity: Independent variables should not be too highly correlated with each other.
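
Several of these assumptions can be spot-checked from the residuals. The sketch below uses simulated data and a few common diagnostics from statsmodels and SciPy (Durbin-Watson for independence, Breusch-Pagan for homoscedasticity, Shapiro-Wilk for normality); the choice of tests is a judgment call rather than a fixed recipe, and multicollinearity is usually checked separately with variance inflation factors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

# Simulated data satisfying the assumptions by construction
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))      # intercept + two predictors
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(size=100)

fit = sm.OLS(y, X).fit()
resid = fit.resid

# Assumption 2 (independence): Durbin-Watson near 2 suggests no autocorrelation
print("Durbin-Watson:", durbin_watson(resid))

# Assumption 3 (homoscedasticity): a large Breusch-Pagan p-value is
# consistent with constant error variance
_, bp_pvalue, _, _ = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", bp_pvalue)

# Assumption 4 (normality): a large Shapiro-Wilk p-value is consistent
# with normally distributed errors
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)
```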

Examples of regression applications in business include:

1. Sales Forecasting

Business Problem: The company wants to forecast sales for the next month.

  • Independent Variables:
    • Advertising Spend
    • Product Price
    • Promotion
  • Dependent Variable:
    • Sales

Regression Equation

\[ \begin{aligned} \text{sales}=&\beta_0+\beta_1\text{advertising\_spend}\\ &+\beta_2\text{product\_price} + \beta_3\text{promotion}+ \varepsilon \end{aligned} \]
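
A model of this form can be estimated with the statsmodels formula API, whose summary output closely mirrors the coefficient table interpreted below. Since the original data set is not included here, the sketch simulates data (the generating coefficients are chosen to loosely echo the results that follow):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the company's data; real work would load a file instead
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    "advertising_spend": rng.uniform(1_000, 10_000, n),
    "product_price":     rng.uniform(50, 150, n),
    "promotion":         rng.integers(0, 2, n),   # 0 = no promotion, 1 = promotion
})
df["sales"] = (3700 + 0.5 * df["advertising_spend"] - 4 * df["product_price"]
               + 1500 * df["promotion"] + rng.normal(0, 1800, n))

# Fit the equation above; summary() prints the coefficient table with p-values,
# R-squared, and the F-statistic, i.e. the quantities interpreted below
model = smf.ols("sales ~ advertising_spend + product_price + promotion", data=df).fit()
print(model.summary())
```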

Results

Interpreting the Coefficients

  • (Intercept) = 3716.95
    If advertising_spend, product_price, and promotion are all 0, the expected sales (sales) will be 3716.95 units (this serves as the baseline average sales).

  • advertising_spend = 0.48895
    For every 1-unit increase in advertising spend (advertising_spend), average sales increase by 0.49 units (statistically significant).

  • product_price = -4.11545
    For every 1-unit increase in product price (product_price), average sales decrease by 4.12 units.
    However, the p-value for this variable is 0.253, which is greater than 0.05.
    This means the effect is not statistically significant (we cannot confirm that price truly impacts sales).

  • promotion = 1525.80
    If promotion (promotion) increases by 1 unit, average sales increase by 1525.8 units (statistically significant).

Model Performance

  • Residual standard error = 1844
    On average, the predicted values (fitted values) deviate from the actual sales by about 1844 units.

  • Multiple R-squared = 0.61
    The model explains approximately 61% of the variation in sales, which is considered a moderate level of explanatory power.

  • Adjusted R-squared = 0.5978
    This value adjusts for the number of predictors in the model.
    Since it is slightly lower than the regular R-squared, it suggests that adding product_price may not have substantially improved the model.

  • F-statistic = 50.04, p-value < 2.2e-16
    The model as a whole is statistically significant, because the p-value is less than 0.05.

Hands-on Practice with Orange Data Mining

2. Price Optimization

Business Problem: Determine the optimal product price to maximize sales.

  • Independent Variables:
    • Product Price
    • Competitor Price
  • Dependent Variable:
    • Demand

Example: Use Nonlinear Regression or Polynomial Regression to capture the non-linear relationship between price and customer demand.

Regression Equation

\[ \begin{aligned} \text{demand}=\beta_0+\beta_1\text{price}+\beta_2\text{price}^2+\varepsilon \end{aligned} \]
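
A quadratic fit like this can be done with np.polyfit, and for a concave fit (\(\beta_2 < 0\)) the demand-maximizing price is the vertex of the parabola, \(-\beta_1 / (2\beta_2)\). The observations below are invented; in a real price optimization you would typically maximize revenue (price times demand) rather than demand itself:

```python
import numpy as np

# Hypothetical price-demand observations with a peak at a mid-range price
price  = np.array([10, 12, 14, 16, 18, 20, 22, 24], dtype=float)
demand = np.array([180, 210, 225, 230, 222, 205, 180, 150], dtype=float)

# Fit demand = b0 + b1*price + b2*price^2
# (np.polyfit returns coefficients from the highest degree down)
b2, b1, b0 = np.polyfit(price, demand, deg=2)
print(f"demand ≈ {b0:.2f} + {b1:.2f}·price + {b2:.4f}·price²")

# For a concave parabola (b2 < 0), demand peaks at the vertex -b1/(2*b2)
price_at_peak = -b1 / (2 * b2)
print("price at peak demand:", round(price_at_peak, 2))
```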

Hands-on Practice with Orange Data Mining

3. Advertising Return Analysis (Return on Advertising Spend - ROAS)

Business Problem: Analyze how investments in different advertising campaigns affect sales.

Independent Variables:

  • Advertising spend by channel (e.g., Facebook, Google Ads)

  • Duration of campaign

Dependent Variable:

  • Sales lift (increase in sales from the campaign)

Example: Use Multiple Regression to identify which marketing channels generate the highest ROAS.

Regression Equation

\[ \begin{aligned} \text{sales\_lift}=\beta_0+\beta_1\text{facebook\_ads} + \beta_2\text{google\_ads}+\varepsilon \end{aligned} \]
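
In a fitted model of this form, each channel's coefficient estimates the incremental sales lift per additional unit of spend in that channel, which is a regression-based view of its ROAS. A sketch with simulated campaign data (the true per-channel effects of 2.0 and 3.5 are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical campaign data: spend per channel and the observed sales lift
rng = np.random.default_rng(7)
n = 60
facebook_ads = rng.uniform(0, 5_000, n)
google_ads   = rng.uniform(0, 5_000, n)
sales_lift = 2.0 * facebook_ads + 3.5 * google_ads + rng.normal(0, 2_000, n)

X = np.column_stack([facebook_ads, google_ads])
model = LinearRegression().fit(X, sales_lift)

# Each coefficient estimates the sales lift per additional unit of spend
# in that channel, i.e. a regression-based read on the channel's ROAS
for name, coef in zip(["facebook_ads", "google_ads"], model.coef_):
    print(f"{name}: {coef:.2f} lift per unit spend")
```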

Hands-on Practice with Orange Data Mining

4. Inventory Demand Forecasting

Business Problem: A manufacturing company wants to forecast future raw material demand to better manage inventory.

Independent Variables:

  • Seasonality

  • Historical Sales

Dependent Variable:

  • Demand Quantity

Example: Use Time Series Regression, or combine Regression with time-series models (e.g., ARIMA + Regression).

Regression Equation

\[ \begin{aligned} \text{demand}=\beta_0+ \beta_1\text{sin\_term} + \beta_2\text{cos\_term}+\varepsilon \end{aligned} \]
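
Here sin_term and cos_term are Fourier features built from the time index, e.g. \(\sin(2\pi t/12)\) and \(\cos(2\pi t/12)\) for monthly data with a yearly cycle. A sketch with simulated monthly demand (the baseline of 500 and amplitude of 80 are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly demand with a yearly cycle; t indexes months
t = np.arange(48)
rng = np.random.default_rng(3)
demand = 500 + 80 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 15, t.size)

# Encode seasonality as the sin/cos terms in the equation above (period = 12)
sin_term = np.sin(2 * np.pi * t / 12)
cos_term = np.cos(2 * np.pi * t / 12)
X = np.column_stack([sin_term, cos_term])

model = LinearRegression().fit(X, demand)
print("beta_0 (baseline demand):", round(model.intercept_, 1))
print("beta_1 (sin), beta_2 (cos):", np.round(model.coef_, 1))

# Forecast the next 12 months by extending t and rebuilding the same features
t_next = np.arange(48, 60)
X_next = np.column_stack([np.sin(2 * np.pi * t_next / 12),
                          np.cos(2 * np.pi * t_next / 12)])
print("next-year forecast:", np.round(model.predict(X_next), 0))
```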

Hands-on Practice with Orange Data Mining

5. Customer Satisfaction Analysis

Business Problem: A hotel business wants to analyze the factors that influence customer satisfaction.

Independent Variables:

  • Service Quality

  • Cleanliness

  • Room Price

Dependent Variable:

  • Satisfaction Score

Example: Use Linear Regression to build a model that identifies which factors have the greatest impact on satisfaction scores.

Regression Equation

\[ \begin{aligned} \text{satisfaction} =&\beta_0+\beta_1\text{service\_quality} \\&+ \beta_2\text{cleanliness} + \beta_3\text{room\_price}+\varepsilon \end{aligned} \]
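
To judge which factor has the greatest impact, one common approach is to standardize the predictors first, so the coefficients are on a common scale and their magnitudes can be compared directly. A sketch with simulated survey data (all numbers invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical survey data: three factors and a satisfaction score
rng = np.random.default_rng(5)
n = 200
service_quality = rng.uniform(1, 5, n)
cleanliness     = rng.uniform(1, 5, n)
room_price      = rng.uniform(1_000, 4_000, n)
satisfaction = (2 + 0.9 * service_quality + 0.6 * cleanliness
                - 0.0004 * room_price + rng.normal(0, 0.4, n))

# Standardizing puts all predictors on a common scale, so the fitted
# coefficients can be compared to see which factor matters most
X = StandardScaler().fit_transform(
    np.column_stack([service_quality, cleanliness, room_price]))
model = LinearRegression().fit(X, satisfaction)

for name, coef in zip(["service_quality", "cleanliness", "room_price"], model.coef_):
    print(f"{name}: standardized beta = {coef:+.3f}")
```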

Hands-on Practice with Orange Data Mining

References

  1. Demšar, J., Zupan, B., Leban, G., & Curk, T. (2013). Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research, 14, 2349–2353. Retrieved from https://www.jmlr.org/papers/v14/demsar13a.html

  2. Toplak, M., Németh, S., & Demšar, J. (2022). Data mining with visual programming: A case study of Orange. Communications of the ACM, 65(7), 77–85. https://doi.org/10.1145/3507286

  3. Zupan, B., & Demšar, J. (2004). Orange: From experimental machine learning to interactive data mining. White Paper, Faculty of Computer and Information Science, University of Ljubljana. Retrieved from https://orange.biolab.si

  4. Draper, N. R., & Smith, H. (1998). Applied Regression Analysis (3rd ed.). Wiley. https://doi.org/10.1002/9781118625590

  5. Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models (5th ed.). McGraw-Hill Education.