Chapter 4. Multiple Regression
So far we have worked with models for explaining outcomes when the outcome is continuous and there is only one continuous predictor. Now we will turn to multiple regression analysis, where we will be examining the roles of several predictors. There are many similarities between simple regression and multiple regression.
Unlike a simple regression, which looks at the relationship between one independent variable (IV) and one dependent variable (DV), a multiple regression lets us assess all IVs at one time and see the separate associations that the DV has with each of the multiple IVs. It helps us isolate the effect of each IV by controlling for the other variables.
I. Multiple Regression Models
In multiple regression, we are modeling variation in $Y$ via a model where
Outcome = Continuous Predictors + Error
or more specifically, the multiple regression population model for scores is

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$

for case or person $i$ with outcome $Y_i$ (a continuous variable) and predictors $X_1$ through $X_k$. The $\varepsilon_i$ are still assumed to be independent and normally distributed with mean 0 and variance $\sigma^2$, as was true in simple regression. The error term ($\varepsilon_i$) is defined as the difference between the actual value $Y_i$ and the predicted value $\hat{Y}_i$ (in the sample this difference is generally called the "residual"). $\beta_0$ is the intercept (the value of $Y$ when all $X$ equal 0) in the population, and $\beta_j$ is the slope for predictor $X_j$ in the population, where the slope is the predicted change in $Y$ for a one-unit increase in $X_j$, holding all other $X$ constant.
The regression estimated, fitted, or sample model is

$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + e_i$

and the fitted line is

$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$

where $Y_i$ is the score for case or person $i$ on outcome $Y$, $\hat{Y}_i$ is the predicted score, $b_0$ is the intercept (the predicted value of $Y$ when all $X$ equal 0) in the sample, $b_j$ is the estimated slope for predictor $X_j$, where the slope is the predicted change in $Y$ for a one-unit increase in $X_j$, holding all other $X$ constant, and $e_i$ is the residual for person $i$. If we write the model in terms of $\hat{Y}_i$ there is no $e_i$ term.
The key phrase is "holding all other $X$ constant." This is how we specify that we are statistically controlling for the influence of the $X$s other than the one whose slope we are interpreting. By including additional $X$s in the model, it is as if we were equating the cases whose scores we are analyzing on the added $X$s.
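As a concrete illustration, here is a minimal Python sketch (not the SPSS workflow used in these notes) of fitting the sample model $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$; the data frame and column names (y, x1, x2) are hypothetical, and the statsmodels library is assumed to be available.

```python
# A minimal sketch: fit Y_hat = b0 + b1*X1 + b2*X2 with statsmodels,
# using a small made-up data set.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y":  [11, 25, -10, 18, 7, 30],
    "x1": [1, 3, -2, 2, 0, 4],
    "x2": [2, 5, -3, 4, 1, 6],
})

model = smf.ols("y ~ x1 + x2", data=df).fit()

# b1 is the predicted change in y for a one-unit increase in x1,
# holding x2 constant; b2 is interpreted the same way for x2.
print(model.params)   # b0 (Intercept), b1, b2
print(model.resid)    # e_i = y_i - y_hat_i for each case
```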
II. Adding Independent Variables
Suppose we are examining the role of CEO salary as a predictor of CEO performance.
In the accompanying graph (not reproduced here), the errors are large relative to the values predicted by the estimated regression line. Therefore, by putting another $X$, say organizational ownership, into the regression model, we aim to decrease the sizes of the errors; it is also as if we had equated organizations on their ownership. The slope of the original $X$ then shows how much CEO salary affects CEO performance for organizations with the same type of ownership (i.e., holding ownership constant).
When we have $X$s that are quite independent of each other, adding an $X$ may not change the slope of the original $X$ at all. However, to the extent the $X$s overlap (are related to each other), we may see either bigger or smaller slopes for the original $X$ when a new $X$ is added. In either case, we hope that the added $X$s explain more variation in $Y$ than just one $X$ alone. If they do, it will reduce the sizes of the residuals (and thus reduce MSE), and therefore give us more powerful tests of the slopes in our model. Adding $X$s should also increase $R^2$, which is one of our indices of model quality.
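The following sketch illustrates this point with simulated data (not the CEO data, and using a continuous stand-in for ownership): adding a second predictor that explains additional variation reduces the residual variance (MSE) and raises $R^2$ relative to the one-predictor model. The column names and the statsmodels library are assumptions for illustration only.

```python
# Simulated illustration: compare a one-predictor and a two-predictor model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 150
df = pd.DataFrame({"salary": rng.normal(size=n), "ownership": rng.normal(size=n)})
df["performance"] = 0.5 * df.salary + 0.7 * df.ownership + rng.normal(size=n)

one = smf.ols("performance ~ salary", data=df).fit()
two = smf.ols("performance ~ salary + ownership", data=df).fit()

print(one.mse_resid, two.mse_resid)   # MSE drops when the second X is added
print(one.rsquared, two.rsquared)     # R^2 increases
```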
Meanwhile, a control variable is defined as a variable held constant to assess or clarify the relationship between other variables. It is a variable that influences the DV but, unlike the IV (i.e., CEO salary), is not of immediate interest to the researcher. For instance, say the researcher's main interest is whether CEO salary (IV) is positively associated with CEO performance (DV). By controlling for organizational ownership (the control variable), we seek to understand CEO performance (DV) as predicted by CEO salary (IV) among organizations with the same type of ownership. Here, organizational ownership includes for-profit organizations, government organizations, and nonprofit organizations. Does CEO performance differ due to CEO salary, once ownership is held constant?
III. Assumptions in Multiple Regression
Aside from assuming independence, normality, and equal variances of the residuals as noted above, another assumption that we are again making is that our model is properly specified. That is, we need to assume that (1) we have all the important $X$s in the regression model, and (2) we have no irrelevant $X$s in the model. These two assumptions, together with the assumption of linearity of the $X$–$Y$ relationships, represent our assumptions about "model specification."
3-1. Model Specification
With multiple regression, we will have a way to check these two assumptions. We will therefore learn how to assess "model specification." When we say we have a properly specified model we are saying that we have the "right" or "correct" model — all the variables in the model (and no others) have linear relationships to $Y$ in the population. Having a properly specified model is important because it has implications for our estimates.
One new issue is that multiple regression models assume that the predictors are independent of each other, but sometimes they are not. In the ANOVA case (where the predictors are categorical) we can be sure predictors are independent (not “confounded”) by making sure the group sizes are equal (or proportional). This is most strictly controlled in experimental designs where subjects are often randomly assigned to groups in equal numbers.
However, we do not typically assign people to have particular values of the $X$s in regression analyses. Therefore, we will need methods for assessing whether our $X$s are interrelated or "collinear." The problem of having interrelated $X$s is called multicollinearity. In the next section, we will learn to assess whether multicollinearity is a problem for our models.
3-2. Multicollinearity
Multiple regression analysis uses several independent variables, and if these independent variables are strongly correlated, the results of the regression analysis become much less reliable. For example, suppose that two independent variables, $X_1$ and $X_2$, are selected, and the observed data ($X_1$, $X_2$, $Y$) are
(1, 2, 11), (3, 6, 25), (-2, -4, -10), (2, 4, 18)
According to this data, $X_2$ is always twice $X_1$; that is, the variables have the relationship $X_2 = 2X_1$. Suppose that from the above data we estimated, for example, the regression equation $\hat{Y} = 4 + 7X_1$. Then, because of the relationship $X_2 = 2X_1$, an equation such as $\hat{Y} = 4 + 3X_1 + 2X_2$ (or $\hat{Y} = 4 + 3.5X_2$) can also be said to be an appropriate regression equation. Therefore, it is not known which of these equations is the appropriate one, because it is impossible to distinguish whether the change in the dependent variable is due to the independent variable $X_1$ or $X_2$. In this case it becomes impossible to estimate the regression equation uniquely.
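A small numerical check of this point, using only numpy and the four observations above, is sketched below: with $X_2 = 2X_1$ the design matrix is rank deficient, so ordinary least squares cannot pin down unique slopes, and several different coefficient vectors reproduce $Y$ equally well.

```python
# Perfect multicollinearity: X2 is exactly 2 * X1.
import numpy as np

x1 = np.array([1, 3, -2, 2])
x2 = np.array([2, 6, -4, 4])          # exactly 2 * x1
y  = np.array([11, 25, -10, 18])

X = np.column_stack([np.ones_like(x1), x1, x2])
print(np.linalg.matrix_rank(X))        # 2, not 3: the columns are linearly dependent

# Both fitted equations reproduce y exactly, so the data cannot
# distinguish between them:
print(X @ np.array([4, 7, 0]))         # Y_hat = 4 + 7*X1
print(X @ np.array([4, 3, 2]))         # Y_hat = 4 + 3*X1 + 2*X2
```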
When multicollinearity is a problem, one of the most effective countermeasures is to combine the collinear variables into one. For instance, suppose $X_1$ is the number of men and $X_2$ is the number of women. If the number of men and the number of women change approximately proportionally, strong multicollinearity occurs between the two variables.
In this case, the problem can be eliminated by introducing a single variable, the total population, rather than introducing the two variables separately. However, if it is theoretically desirable to include both variables, removing one of them can significantly reduce the explanatory power of the model.
In most multiple regression analyses, some multicollinearity exists. In particular, when there are many independent variables, the chance of multicollinearity is high. Even with high multicollinearity, multiple regression analysis is not meaningless.
When multicollinearity is a problem in a regression analysis, care is needed in interpreting the results. The coefficients of the individual variables are poorly determined (i.e., the significance of individual variables is unclear), but the predicted values of the dependent variable produced by combining all of these variables can still be reasonably accurate (i.e., the overall model combining the individual variables can be significant).
Of course, this holds only if the values of the independent variables used for prediction resemble data found in actual reality, so that they reflect the same correlations among the independent variables that appear in the observed data. With such values, meaningful prediction of the dependent variable is still possible.
However, if you create artificial combinations of predictor values that do not show the correlations among the independent variables (i.e., if you estimate the dependent variable for values of the independent variables that break those correlations), the estimates will be difficult to trust and may be meaningless.
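One common diagnostic for multicollinearity is the variance inflation factor (VIF). The sketch below computes VIFs with statsmodels on simulated data; the function, column names, and the rule-of-thumb cutoff of 10 are assumptions for illustration, not part of these notes.

```python
# Simulated illustration of variance inflation factors.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 * 2 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                        # unrelated predictor

X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})
for name in ["x1", "x2", "x3"]:
    vif = variance_inflation_factor(X.values, X.columns.get_loc(name))
    print(name, round(vif, 1))   # large VIFs (often > 10) flag multicollinearity
```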
IV. Goodness of Model
Many of the analyses for multiple regression use tests and procedures similar to those we have learned already. For instance, we will see MSE and SEE, we will use $F$ tests for gauging the significance of the whole model, and we will assess whether the assumptions about our residuals are appropriate using plots of the residuals. We will use the "variance explained" measure $R^2$ to assess whether the set of $X$s we have chosen is helping us to understand variation in the outcome. We will also have tests of the individual predictors, which will be $t$ tests.
How do we begin using multiple regression? First, of course, we need some idea of what predictors might reasonably be used to explain why scores vary. A starting point is what theory or prior research suggests might be good variables to use as $X_1$ through $X_k$.
Also we will learn how to interpret the values of the slopes, $b_1$ through $b_k$, as well as how to decide which predictors are not useful, which predictor (of the useful ones) is most important, and how much we have explained about $Y$ with the set of $X$ variables we have chosen.
Meanwhile, we will see that it is possible to examine categorical predictors using regression, but when there are multi-category variables it is very tedious and the ANOVA framework is more sensible and easier to use.
As in simple regression, in multiple regression the total variation of the dependent variable (SST) is divided into two parts as follows:

$SST = SSR + SSE$

The left-hand term is the Sum of Squares Total (SST), which is the total variation around the average of $Y$. The right-hand terms are SSR and SSE, the variation explained by the regression line and the variation not explained by the regression line, respectively. Also, the coefficient of determination, $R^2$, is represented as

$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
$R^2$ is the ratio of the explained variation to the total variation. This effect size is widely used to assess the adequacy of a regression analysis. However, in multiple regression analysis, the coefficient of determination increases whenever an independent variable is added. Even when an independent variable that is completely unrelated to the dependent variable is added, there may be a slight correlation with the observed data by chance, and in this case the value of $R^2$ still increases when the new variable is added.
Therefore, inexperienced analysts are likely to conclude, mistakenly, that a model is meaningful after inflating the coefficient of determination by adding many meaningless independent variables. Adjusted $R^2$ was introduced to correct this error.
When a variable is added, the residuals continue to decrease, so the SSE value decreases, the ratio of the residual sum of squares to the total sum of squares (i.e., $SSE/SST$) decreases, and $R^2$ increases. To prevent the error of inflating $R^2$ by introducing meaningless independent variables at random, adjusted $R^2$ is used; it is defined as follows:

$\text{Adjusted } R^2 = 1 - \dfrac{SSE/(n-k-1)}{SST/(n-1)}$

where $n$ is the number of cases and $k$ is the number of independent variables.
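The short numpy sketch below works through the SST/SSR/SSE decomposition and the adjusted $R^2$ formula for a tiny made-up set of observed and fitted values (both the numbers and the assumption of two predictors are hypothetical).

```python
# SST = SSR + SSE, R^2, and adjusted R^2 from observed and fitted values.
import numpy as np

y     = np.array([11.0, 25.0, -10.0, 18.0, 7.0, 30.0])
y_hat = np.array([12.0, 24.0,  -9.0, 17.0, 8.0, 29.0])  # hypothetical fitted values
n, k = len(y), 2                     # k = number of independent variables

sst = np.sum((y - y.mean()) ** 2)    # total variation around the mean of Y
sse = np.sum((y - y_hat) ** 2)       # variation not explained (residual)
ssr = sst - sse                      # variation explained by the regression

r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(round(r2, 3), round(adj_r2, 3))
```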
V. Dummy Variable
Even when we have multi-category variables (e.g., race group: Black, White, Hispanic, and Asian), we can only use dichotomies (dummy variables) to represent them in multiple regression. Thus, we must learn to use dummy variables to represent multi-category factors.
First, a dummy variable is a dichotomy coded as a 0/1 variable. If 'gender' is the variable, we may code a gender dummy variable $X$ where
$X = 0$ if the subject is female (0 = reference group),
$X = 1$ if the subject is male (1 = focal group).
Dummy variables can be used in multiple regression. When we use a dichotomy like $X$ in a regression, we get the typical estimated equation

$\hat{Y} = b_0 + b_1 X$
$b_1$ represents the predicted change in $Y$ associated with a 1-unit increase in $X$: when $X$ changes by one point on its scale, $\hat{Y}$ changes by the amount $b_1$.
But when $X$ is a dummy variable and takes on only two values (0 = female, 1 = male), a difference of one point can only occur when $X$ goes from 0 to 1. When $X = 0$ the subject is female, and when $X = 1$ the subject is male.
So the slope for our bivariate regression where $X$ is a dummy variable equals the mean difference between the two represented groups; for this $X$ the slope is the difference between the means of males and females. Also we can see that
$\hat{Y}$ will equal $b_0$ when $X = 0$.
$b_1$: the mean difference between the male and female groups.
$b_0$: the intercept represents the mean of $Y$ for the females (i.e., the mean for all cases with $X = 0$).
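As a quick check outside SPSS, here is a minimal Python sketch (made-up numbers, statsmodels assumed, not the tchcomm data used below) showing that the fitted intercept equals the reference-group mean and the slope equals the mean difference.

```python
# Regressing Y on a 0/1 dummy reproduces the group means:
# b0 = mean of the 0 group, b1 = mean difference.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y": [4.0, 5.5, 5.0, 4.4, -1.0, -0.5, -1.5, -0.9],
    "x": [0,   0,   0,   0,    1,    1,    1,    1],   # 0 = female, 1 = male
})

fit = smf.ols("y ~ x", data=df).fit()
b0, b1 = fit.params["Intercept"], fit.params["x"]

print(b0, df.loc[df.x == 0, "y"].mean())   # b0 equals the reference-group mean
print(b1, df.loc[df.x == 1, "y"].mean() - df.loc[df.x == 0, "y"].mean())  # b1 = mean diff
```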
It is easier to see this empirically by running descriptive statistics and a t test on two groups, and then running a regression using a dummy variable that represents the 2 groups as our $X$.
We will see t test output for tchcomm with a variable that represents public vs private schools (“public”) as our predictor as well as a regression using that dummy variable.
The variable “public” has two levels: 1 for public schools, and 0 for all other schools (e.g., private)
The Coefficients table shows that the slope equals the mean difference computed from the descriptive statistics and t-test output: $-5.918 = (-0.9668) - (4.95)$ (see Group Statistics and Coefficients).
The circled values show that the t test for the slope equals the two-sample t test. Also, the intercept equals the mean of the reference group (= private schools): $b_0 = 4.95$.
The above results only hold exactly for a dummy variable coded 0 and 1.
If the dichotomy is coded with numbers that are not one unit apart (e.g., 0 and 2), the slope will not equal the mean difference; and if it is coded with numbers other than 0 and 1 (e.g., 1 and 2), the intercept will no longer equal the reference-group mean.
You can check this by running a regression using a dummy variable that codes the 2 groups with numbers other than 0 and 1. You will find it convenient to assign 0 to the reference group and 1 to the focal group. What about a categorical variable that has three values?
Suppose now we have 3 groups (labels or levels), say, urban, suburban, and rural schools. We have a variable g10urban with
X = 1 for urban schools
X = 2 for suburban schools, and
X = 3 for rural schools
If we include g10urban in a regression, multiple regression will treat it as if it were a number, not a label. Is a suburban school ($X = 2$) "twice" as good as an urban one ($X = 1$)? Using g10urban in multiple regression as it is would therefore be a mistake.
This is not something we want to do. Note that SPSS does not stop you from using a categorical variable in regression as if it were a continuous numeric variable. Output using g10urban to predict tchcomm in this way would be wrong.
We use dummy (0/1) variables to differentiate these 3 groups (labels), because SPSS would otherwise treat the variable as numeric. So we need to represent the 3 groups in some other way: by converting the multi-category variable into dummy variables.
We will use (k-1) dummy variables to represent k groups.
Let $X_1$ represent 'Is the school urban?' (1 = Yes) and $X_2$ represent 'Is the school suburban?' (1 = Yes).
If we have one school from each group, their values on the original factor (g10urban) and the two dummy variables $X_1$ and $X_2$ would be:

School type (g10urban)    $X_1$ (urban?)    $X_2$ (suburban?)
Urban (1)                 1                 0
Suburban (2)              0                 1
Rural (3)                 0                 0
We do not need a third variable ($X_3$) representing 'Is the school rural?' (1 = Yes).
The pair of values ($X_1$, $X_2$) is different for each group of subjects, so we can tell the 3 groups apart using 2 dummy variables.
Also note, we interpret $b_1$ as the difference between urban schools and all others, holding $X_2$ constant.
For all the urban schools, $X_1 = 1$ and $X_2 = 0$.
For the others, $X_1 = 0$:
if they are rural, $X_2 = 0$ (and $X_1 = 0$), so ($X_1$, $X_2$) = (0, 0);
if they are suburban, $X_2 = 1$ (and $X_1 = 0$), so ($X_1$, $X_2$) = (0, 1).
Because of the way we created $X_1$ and $X_2$, we will never have a case where both $X_1 = 1$ and $X_2 = 1$.
Suppose now that we use the variables $X_1$ and $X_2$ in a multiple regression. Our estimated regression equation is

$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$

or

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$

The value of $b_0$ is the mean (or predicted score) for the rural schools, because it is the predicted value of $Y$ when the values of all the $X$s are zero.
In a case such as this we might actually be interested in testing whether $\beta_0 = 0$, because $\beta_0$ represents the population mean of the rural schools.
With the intercept and two slopes we can compute all of the group means.
Our estimated regression equation will be

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$

The slope $b_1$ represents the predicted change in $Y$ for a 1-unit increase in $X_1$, holding $X_2$ constant.
It is also the difference between the urban group mean and the mean for the reference group of rural schools.
The slope $b_2$ represents the predicted change in $Y$ for a 1-unit increase in $X_2$, holding $X_1$ constant, and it is the mean difference between the suburban schools and the reference group of rural schools.
Specifically we can see that
$\hat{Y} = b_0$ for rural schools ($X_1 = 0$, $X_2 = 0$)
$\hat{Y} = b_0 + b_1$ for urban schools ($X_1 = 1$, $X_2 = 0$)
$\hat{Y} = b_0 + b_2$ for suburban schools ($X_1 = 0$, $X_2 = 1$)
None of our cases ever has $X_1 = 1$ and $X_2 = 1$
because no school can be in two locations.
The school means on tchcomm are $-0.298$ (rural), $0.257$ (urban), and $0.124$ (suburban).
The estimated regression is $\hat{Y} = -0.298 + 0.555 X_1 + 0.422 X_2$.
So $b_0 = -0.298$, the rural mean;
$b_0 + b_1 = -0.298 + 0.555 = 0.257$, the urban mean; and
$b_0 + b_2 = -0.298 + 0.422 = 0.124$, the suburban mean.
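The sketch below verifies these relationships in Python with made-up data (not the tchcomm data): with rural as the reference group, $b_0$ equals the rural mean, $b_0 + b_1$ the urban mean, and $b_0 + b_2$ the suburban mean. The column names and statsmodels are assumptions for illustration.

```python
# Group means recovered from a regression with two dummy variables.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "group":   ["rural"] * 4 + ["urban"] * 4 + ["suburban"] * 4,
    "tchcomm": [-0.4, -0.2, -0.3, -0.3, 0.2, 0.3, 0.25, 0.28, 0.1, 0.15, 0.12, 0.13],
})
df["X1"] = (df.group == "urban").astype(int)      # urban dummy
df["X2"] = (df.group == "suburban").astype(int)   # suburban dummy

fit = smf.ols("tchcomm ~ X1 + X2", data=df).fit()
b0, b1, b2 = fit.params["Intercept"], fit.params["X1"], fit.params["X2"]

print(b0,      df.loc[df.group == "rural", "tchcomm"].mean())     # rural mean
print(b0 + b1, df.loc[df.group == "urban", "tchcomm"].mean())     # urban mean
print(b0 + b2, df.loc[df.group == "suburban", "tchcomm"].mean())  # suburban mean
```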
VI. Recode a categorical variable into dummy variables in SPSS
In the SPSS “variable view” we can find out that g10urban is coded with
- 1 = urban
- 2 = suburban
- 3 = rural.
In addition, there are several schools with missing values on this variable. We have learned that we need 2 dummy variables to represent these 3 groups. Let’s name these 2 dummy variables as “urban” and “suburban,” so that:
- Urban: if school type is urban, then coded as 1, otherwise, coded as 0
- Suburban: if school type is suburban, then coded as 1, otherwise, coded as 0
The following steps guide you on how to create "urban." You need to create "suburban" yourself with a similar procedure.
Step 1. To create dummy variable called “Urban,” go to SPSS > Transform > Recode > Into Different Variables…
Step 2. Select your target variable "g10urban," give your dummy variable the name "urban," click "Change," then click "Old and New Values." A new window will pop up asking you to define the new and old values.
Step 3. In the pop-up window you define the old and new values. First, select "system- or user-missing" on the left and "copy old values" on the right, then click "add." We do this because there are several missing values on this variable.
Step 4. Next, select “value” and enter 1 on the left, select “value” and enter 1 on the right, then click “add.” Doing so is to recode g10urban=1 (remember g10urban=1 is for urban school) as “urban=1.”
Step 5. Next, select “value” and enter 2 on the left, select “value” and enter 0 on the right, then click “add.” Doing so is to recode g10urban=2 (remember g10urban=2 is for suburban school) as “urban=0.”
Step 6. Next, select “value” and enter 3 on the left, select “value” and enter 0 on the right, then click “add.” Doing so is to recode g10urban=3 (remember g10urban=3 is for rural school) as “urban=0.” This is what the window looks like after you define values, then click “continue,” and “OK” in the main window.
Step 7. Finally, go to the Data View window; you should find a new variable called "urban." As you will see, g10urban=1 was recoded as urban=1, g10urban=2 or 3 were recoded as urban=0, and the original missing values are still coded as missing.
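For readers working outside SPSS, a rough pandas equivalent of this recode is sketched below (the data values are made up; only the column names g10urban, urban, and suburban follow the notes). Missing values stay missing, mirroring Step 3.

```python
# Recode g10urban into 0/1 dummies, preserving missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"g10urban": [1, 2, 3, np.nan, 1, 3]})

df["urban"] = np.where(df["g10urban"].isna(), np.nan,
                       (df["g10urban"] == 1).astype(float))
# The "suburban" dummy is built the same way, testing for g10urban == 2.
df["suburban"] = np.where(df["g10urban"].isna(), np.nan,
                          (df["g10urban"] == 2).astype(float))
print(df)
```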
[Exercise 7]
Now you are asked to run a simple regression model with the dummy variables converted from "g10urban." In this model there are no predictors other than the categorical one you dummy-coded. Using the pull-down menu SPSS > Regression > Linear > Linear Regression…, move tchcomm to "Dependent." What variables are you going to move to "Independent(s)"?
VII. Detecting Interactions
Now that we have seen that we can use dummy variables in multiple regression to represent groups, we will tackle another idea.
Suppose we are examining a regression model with a dummy variable and we expect that $X$ relates to $Y$ differently in our 2 groups. Say we are studying the prediction of the outcome tchcomm for public and private schools, and we believe tchcomm relates differently to the predictor prinlead (principal leadership) in the two school types.
We can say that there is an interaction (school type × principal leadership) with respect to the teacher community outcome.
How can we tell if we need to be concerned about an interaction like this? One way is to let SPSS help us find it by using the scatter plot. Another is to examine the correlations or slopes.
Pull down the Graph menu and make a scatterplot where the $Y$ variable is tchcomm and the $X$ variable is prinlead. Before you click "OK," move the variable "public" into the "Set Markers by" box.
This forces SPSS to use different symbols for public and private schools and enables us to edit (double-click) the plot then use “Chart Options” to plot separate regressions for the school types.
In the plot without markers, the relationship looks linear. When we add the markers by public and plot two lines, the lines are not very different; they look parallel.
Here is a plot for tchhappy as the predictor.
When we add the markers by public and plot two lines they cross. These lines appear to have somewhat different slopes, so there may be an interaction.
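The matplotlib sketch below (simulated data, not the tchhappy/tchcomm data; variable names are placeholders) shows the same diagnostic: scatter the outcome against the predictor, mark the two school types, and overlay a separately fitted line for each group.

```python
# Diagnostic plot: separate fitted lines by group to look for an interaction.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
tchhappy = rng.normal(size=120)
public = np.repeat([0, 1], 60)
# Built so the slope differs by group (an interaction), for illustration only.
tchcomm = (0.2 + 0.8 * tchhappy - 0.5 * public
           - 0.6 * public * tchhappy + rng.normal(scale=0.4, size=120))

fig, ax = plt.subplots()
for grp, marker in [(0, "o"), (1, "x")]:
    m = public == grp
    ax.scatter(tchhappy[m], tchcomm[m], marker=marker, label=f"public={grp}")
    b1, b0 = np.polyfit(tchhappy[m], tchcomm[m], 1)   # per-group slope, intercept
    xs = np.linspace(tchhappy.min(), tchhappy.max(), 50)
    ax.plot(xs, b0 + b1 * xs)
ax.set_xlabel("tchhappy"); ax.set_ylabel("tchcomm"); ax.legend()
plt.show()
```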
In order to see an interaction, are you going to run two separate regression analyses for the groups involved (e.g., public and private schools)? No, don’t do it. There are drawbacks to this (i.e., Do not split the file and run a model with only the predictor variable in it, omitting the dummy variable).
- First, the sample size for the regressions will be much smaller, because we will do two analyses, each on one part of the split data.
- Second, we will end up with separate slopes and will need to do computations by hand to test whether the slopes differ by school type. A more parsimonious solution is to model the two slopes via an interaction term in a single regression model. This approach also gives us a test of the interaction (i.e., of the difference between the slopes).
To do this we need to compute an interaction variable. Suppose $X_1$ is the "public" dummy variable and $X_2$ is tchhappy. Then we compute the product $X_3 = X_1 X_2$. This new variable takes on values as follows:
$X_3 = 0$ if the school is private ($X_1 = 0$, the reference group): plugging 0 in for $X_1$ makes $X_3$ equal 0;
$X_3 = X_2$ if the school is public ($X_1 = 1$, the focal group): plugging 1 in for $X_1$ makes $X_3$ equal $X_2$.
For the more elegant solution we run a regression that includes $X_1$ (the dummy variable), $X_2$ (the continuous predictor), and $X_3$ (the interaction), so our model is

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$

- $b_1$ represents the public–private mean difference (or, here, intercept difference), controlling for $X_2$ and $X_3$,
- $b_2$ represents the slope of the predictor tchhappy, controlling for $X_1$ (school type differences) and $X_3$ (the interaction), and
- $b_3$ is the interaction, or the school-type difference in the tchhappy slopes, controlling for $X_1$ and $X_2$.
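A minimal Python sketch of this single-model approach follows: build the product term $X_3 = X_1 X_2$ and fit one regression containing the dummy, the continuous predictor, and the interaction. The column names follow the notes (public, tchhappy, tchcomm), but the data are simulated and statsmodels is an assumed tool, not the SPSS procedure described here.

```python
# Fit one model with a dummy, a continuous predictor, and their product.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "public": np.repeat([0, 1], n // 2),
    "tchhappy": rng.normal(size=n),
})
df["tchcomm"] = (0.1 + 0.7 * df.tchhappy - 0.4 * df.public
                 - 0.3 * df.public * df.tchhappy + rng.normal(scale=0.5, size=n))

df["inter"] = df.public * df.tchhappy          # X3 = X1 * X2
fit = smf.ols("tchcomm ~ public + tchhappy + inter", data=df).fit()
print(fit.summary())                           # the t test on "inter" tests the interaction

b = fit.params
print("private line:", b["Intercept"], b["tchhappy"])                             # b0, b2
print("public line:", b["Intercept"] + b["public"], b["tchhappy"] + b["inter"])   # b0+b1, b2+b3
```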
Because $X_1$ is a dummy variable and because $X_3$ takes on the value 0 for private schools, the two school types have different models that we can write down even without running SPSS to estimate the model ($X_1$ = school type dummy, $X_2$ = covariate, $X_3 = X_1 X_2$).
For private schools, the value 0 is plugged in for $X_1$ and $X_3$, so those variables drop out of the model. The private school model is

$\hat{Y} = b_0 + b_2 X_2$

For the public schools, $X_1 = 1$ and $X_3 = X_2$. Thus the public school model is

$\hat{Y} = (b_0 + b_1) + (b_2 + b_3) X_2$
From the SPSS output (not reproduced here) we can read the private school estimated regression line directly: it uses the estimated $b_0$ and $b_2$. We can also compute the other model: the public school intercept is $b_0 + b_1$ and the slope is $b_2 + b_3$, which gives the public school line.
SPSS also then tells us whether the interaction is significant.
In this output we see that $X_3$ does not have a significant slope. So even though the lines look a bit different, they are not different enough, and we do not need to keep the interaction variable in the model.
Sources: Modified from the class notes of Salih Binici (2012) and Russell G. Almond (2012).