Understanding Sum of Squares: SST, SSR, and SSE


Introduction

In regression analysis, understanding how well our model fits the data is crucial. The sum of squares helps us quantify this fit by breaking down the total variability in our data into explained and unexplained components.

Components of Sum of Squares

Total Sum of Squares (SST)

SST measures the total variation in the dependent variable (y) around its mean. It represents the total amount of variability in the data:

Formula:

\[ SST = \sum(y_i - \bar{y})^2 \]

Where:

  • \( y_i \) = each observed value
  • \( \bar{y} \) = the mean of all observed values

This value is always non-negative because it sums the squared differences between the observed values and their mean.
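As a quick illustration, here is a minimal Python sketch that computes SST for a small set of made-up test scores (the numbers are purely illustrative, not taken from the interactive example later in this article):

```python
# Hypothetical observed test scores (illustrative values only)
scores = [65, 70, 75, 80, 90]

mean_score = sum(scores) / len(scores)            # ȳ, the mean of the observed values
sst = sum((y - mean_score) ** 2 for y in scores)  # Σ(y_i - ȳ)²

print(f"mean = {mean_score:.2f}, SST = {sst:.2f}")  # mean = 76.00, SST = 370.00
```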

Regression Sum of Squares (SSR)

SSR quantifies the variation explained by the regression model. It measures how much of the total variation (SST) is due to the regression model:

Formula:

\[ SSR = \sum(\hat{y}_i - \bar{y})^2 \]

Where:

  • \( \hat{y}_i \) = each predicted value
  • \( \bar{y} \) = the mean of the observed values

A higher SSR indicates that the regression model explains a larger proportion of the variability in the data.
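Continuing the same made-up scores, the sketch below assumes the predictions come from a least squares line fitted to those scores (ŷ = 58 + 6x for x = 1, …, 5 hours studied); both the data and the fit are illustrative assumptions:

```python
# Hypothetical observed scores and their least squares predictions (illustrative only)
observed = [65, 70, 75, 80, 90]
predicted = [64, 70, 76, 82, 88]               # ŷ_i = 58 + 6·x_i for x_i = 1..5

mean_observed = sum(observed) / len(observed)  # ȳ = 76, mean of the observed values
ssr = sum((y_hat - mean_observed) ** 2 for y_hat in predicted)  # Σ(ŷ_i - ȳ)²

print(f"SSR = {ssr:.2f}")                      # SSR = 360.00
```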

Error Sum of Squares (SSE)

SSE measures the variation in the dependent variable that is not explained by the regression model. It represents the errors or residuals:

Formula:

\[ SSE = \sum(y_i - \hat{y}_i)^2 \]

Where:

  • \( y_i \) = each observed value
  • \( \hat{y}_i \) = each predicted value

A lower SSE indicates a better model fit.
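SSE uses the same ingredients but compares each observed value with its own prediction. A minimal sketch with the same illustrative values:

```python
# Hypothetical observed scores and predictions (illustrative values only)
observed = [65, 70, 75, 80, 90]
predicted = [64, 70, 76, 82, 88]

# Σ(y_i - ŷ_i)²: the squared residual for each point, summed
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

print(f"SSE = {sse:.2f}")                      # SSE = 10.00
```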

The Fundamental Relationship

The relationship between SST, SSR, and SSE can be expressed as:

\[ SST = SSR + SSE \]

This formula indicates that the total variation in the data is divided into the variation explained by the model (SSR) and the variation left unexplained (SSE). The identity holds exactly for ordinary least squares regression with an intercept term.
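The decomposition is easy to check numerically. The sketch below fits an ordinary least squares line with NumPy to the same illustrative hours/scores data used above and confirms that SST equals SSR plus SSE:

```python
import numpy as np

# Hypothetical data: hours studied (x) and test scores (y) -- illustrative only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([65.0, 70.0, 75.0, 80.0, 90.0])

slope, intercept = np.polyfit(x, y, 1)   # ordinary least squares fit (degree-1 polynomial)
y_hat = intercept + slope * x            # predicted values ŷ_i

sst = np.sum((y - y.mean()) ** 2)        # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the model
sse = np.sum((y - y_hat) ** 2)           # unexplained (residual) variation

print(f"SST = {sst:.2f}, SSR + SSE = {ssr + sse:.2f}")  # SST = 370.00, SSR + SSE = 370.00
assert np.isclose(sst, ssr + sse)        # the identity holds for OLS with an intercept
```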

Coefficient of Determination (R²)

\( R^2 \) is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variable(s):

Formula:

\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]

Interpretation:

  • \( R^2 = 1 \): The model perfectly explains the variability in the data.
  • \( R^2 = 0 \): The model does not explain any of the variability.
  • A higher \( R^2 \) value indicates a better fit.
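Using the illustrative totals from the sketches above (SST = 370, SSR = 360, SSE = 10), both forms of the formula give the same value:

```python
# Illustrative totals carried over from the earlier sketches
sst, ssr, sse = 370.0, 360.0, 10.0

r2_explained = ssr / sst        # proportion of variation explained
r2_residual = 1 - sse / sst     # one minus the unexplained proportion

print(f"R² = {r2_explained:.4f} = {r2_residual:.4f}")  # R² = 0.9730 = 0.9730
```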

Interactive Example

Let’s explore these concepts with real data. The table below shows study hours (x) and test scores (y). You can modify the values to see how they affect the different sum of squares components.

[Interactive table: Hours Studied (x), Test Score (y), and Predicted Score (ŷ) for each data point, with live totals for SST, SSR, and SSE.]

Step-by-Step Calculation Process

This section provides a detailed walkthrough of how to calculate Sum of Squares components using the data from our interactive example above. We’ll first find the line of best fit for our study hours and test scores data, then show how to calculate the total SST, SSR, and SSE values that help us understand how well our regression line fits the data points.

Step 0: Finding the Line of Best Fit

Before calculating the sum of squares components, we need to find the regression line using the least squares method:

1. Calculate means:

Compute the mean of the hours studied and the mean of the test scores: \( \bar{x} = \frac{\sum x_i}{n} \) and \( \bar{y} = \frac{\sum y_i}{n} \).

2. Calculate slope (β₁):

We use the formula: \[ \beta_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \]

Let’s break this down:

[Breakdown table: one row per data point with columns x, y, (x - x̄), (y - ȳ), (x - x̄)(y - ȳ), and (x - x̄)², followed by a row of column sums.]

3. Calculate intercept (β₀):

Using: \[ \beta_0 = \bar{y} - \beta_1\bar{x} \]

4. Final Equation:

Substitute the slope and intercept into the regression equation: \[ \hat{y} = \beta_0 + \beta_1 x \]
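The slope and intercept formulas above translate directly into code. Here is a minimal sketch using the same assumed hours studied / test score values as the earlier sketches (the numbers on the interactive page may differ):

```python
# Assumed example data: hours studied (x) and test scores (y) -- illustrative only
x = [1, 2, 3, 4, 5]
y = [65, 70, 75, 80, 90]

x_bar = sum(x) / len(x)   # x̄
y_bar = sum(y) / len(y)   # ȳ

# β₁ = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)²
numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
denominator = sum((xi - x_bar) ** 2 for xi in x)
beta_1 = numerator / denominator

# β₀ = ȳ - β₁·x̄
beta_0 = y_bar - beta_1 * x_bar

print(f"ŷ = {beta_0:.2f} + {beta_1:.2f}·x")   # ŷ = 58.00 + 6.00·x
```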

Step 1: Calculate Mean and Predicted Values

First, we calculate the mean of the observed values, \( \bar{y} \), and the predicted value \( \hat{y}_i = \beta_0 + \beta_1 x_i \) for each data point.

Step 2: Calculate Total SST

Calculate SST by summing the squared deviations of the observed values from their mean across all points:

\[ SST = \sum(y_i - \bar{y})^2 \]

Step 3: Calculate Total SSR

Calculate SSR by summing the squared deviations of the predicted values from the mean of the observed values:

\[ SSR = \sum(\hat{y}_i - \bar{y})^2 \]

Step 4: Calculate Total SSE

Calculate SSE by summing the squared differences between the observed and predicted values:

\[ SSE = \sum(y_i - \hat{y}_i)^2 \]

Step 5: Verify Total Relationship

Finally, verify that SST = SSR + SSE holds for the totals (allowing for small rounding error):
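The five steps above can be collected into one short script. This sketch reuses the same assumed data as the earlier examples and checks the identity at the end:

```python
# Assumed example data: hours studied (x) and test scores (y) -- illustrative only
x = [1, 2, 3, 4, 5]
y = [65, 70, 75, 80, 90]

# Step 0: line of best fit (least squares)
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
beta_1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
beta_0 = y_bar - beta_1 * x_bar

# Step 1: predicted value for every data point
y_hat = [beta_0 + beta_1 * xi for xi in x]

# Steps 2-4: the three sums of squares
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Step 5: verify SST = SSR + SSE (allowing for floating-point rounding)
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
assert abs(sst - (ssr + sse)) < 1e-9
```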

Use Cases of Sum of Squares in Regression Analysis

The sum of squares plays a crucial role in regression analysis and statistical modeling. Here are some expanded use cases demonstrating its importance:

  • Model Evaluation:

The sum of squares helps evaluate the performance of different regression models by comparing their SSR, SSE, and \( R^2 \) values. This comparison is critical for identifying the model with the best predictive accuracy.

  • R-squared Calculation:

    \( R^2 \), derived from SSR and SST, measures the proportion of variability in the dependent variable explained by the independent variables. A higher \( R^2 \) value indicates a better-fitting model.

  • ANOVA (Analysis of Variance):

    In experimental design, the sum of squares is used to partition the total variability into components due to different factors and their interactions, aiding in hypothesis testing.

  • Feature Selection:

    By analyzing the contribution of each predictor to SSR, the sum of squares helps identify the most important features in a regression model, facilitating effective feature selection.

  • Goodness-of-Fit Testing:

    Comparing SSE and SSR provides insights into the goodness-of-fit of the regression model, indicating how well it explains the observed data.

  • Residual Analysis:

    Analyzing SSE and residuals helps detect patterns or inconsistencies in the model, providing a foundation for improving its accuracy and robustness.

  • Forecasting and Prediction:

    Accurate calculation of SSR and SSE ensures reliable forecasts and predictions, especially in time series and econometric models.

  • Comparative Model Diagnostics:

Sum of squares metrics feed into model-selection criteria such as AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), which compare multiple models and favor the most parsimonious one that still fits the data well.

Conclusion

Understanding the components of Sum of Squares (SST, SSR, and SSE) is fundamental to grasping how well a regression model fits data. Through our interactive example of study hours and test scores, we’ve seen how:

  • Total Sum of Squares (SST = \(\sum(y_i - \bar{y})^2\)) measures the overall variability in our test scores
  • Regression Sum of Squares (SSR = \(\sum(\hat{y}_i - \bar{y})^2\)) quantifies how much of that variability is explained by study hours
  • Error Sum of Squares (SSE = \(\sum(y_i - \hat{y}_i)^2\)) shows the remaining unexplained variability

The relationship SST = SSR + SSE holds true when summing across all points, demonstrating how total variability breaks down into explained and unexplained components. This breakdown, along with the R² value (\(\frac{SSR}{SST}\)), provides critical insights into model effectiveness. In our example, we can determine what percentage of test score variation is explained by study time, helping educators and students understand the relationship between study hours and performance.

If you’re interested in diving deeper into linear regression or exploring our regression tools, check out the Further Reading section.

Have fun and happy researching!

Further Reading

Expand your knowledge with these additional resources. Whether you’re looking for interactive tools or in-depth guides, these links will help you dive deeper into the concepts covered in this guide.

Attribution and Citation

If you found this guide and tools helpful, feel free to link back to this page or cite it in your work!
