Logistic Regression Using SPSS

See www.stattutorials.com/SPSSDATA for files mentioned in this tutorial. © TexaSoft, 2008

These SPSS statistics tutorials briefly explain the use and interpretation of standard statistical analysis techniques for medical, pharmaceutical, clinical trials, marketing, or scientific research. The examples include how-to instructions for SPSS software.
Logistic Regression in SPSS

This example is adapted from information in the Statistical Analysis Quick Reference Guidebook (Elliott and Woodward, 2007).

A sales director for a chain of appliance stores wants to find out what circumstances encourage customers to purchase extended warranties after a major appliance purchase. The response variable is an indicator of whether or not a warranty is purchased. The predictor variables to be considered are Gender, Gift (whether a promotional gift was offered), Age, Price, and Race.
There are several strategies you can take to develop the “best” model for the data. It is recommended that you examine several models before determining which one is best for your analysis. (In this example we allow the computer to help specify important variables, but it is inadvisable to accept a computer-designated model without examining alternatives.) Begin by examining the significance of each variable in a fully populated model.
1. Open the data set named WARRANTY.SAV (downloadable from the data section) and choose Analyze/Regression/Binary Logistic.
2. Select Bought as the dependent variable and Gender, Gift, Age, Price, and Race as the covariates (i.e., the independent or predictor variables).
3. Click on Categorical (a checkbox in older versions; a button in SPSS version 16) and specify Race as a categorical variable. Click Continue and then OK. This produces the following SPSS output table.
Variables in the Equation

                        B      S.E.     Wald   df   Sig.    Exp(B)
Step 1   Gender     -3.772     2.568    2.158    1   .142      .023
         Price        .001      .000    3.363    1   .067     1.001
         Age          .091      .056    2.638    1   .104     1.096
         Gift        2.715     1.567    3.003    1   .083    15.112
         Race                           2.827    3   .419
         Race(1)     3.773    13.863     .074    1   .785    43.518
         Race(2)     1.163    13.739     .007    1   .933     3.199
         Race(3)     6.347    14.070     .203    1   .652   570.898
         Constant  -12.018    14.921     .649    1   .421      .000
The “Variables in the Equation” table shows the output resulting from including all of the candidate predictor variables in the equation. Notice that the Race variable, which was originally coded as 1=White, 2=African American, 3=Hispanic, and 4=Other, has been changed (by the SPSS logistic procedure) into three (4 − 1) indicator variables called Race(1), Race(2), and Race(3). These three variables each enter the equation with their own coefficient and p-value, and there is an overall p-value given for Race.
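This (4 − 1) indicator coding can be sketched in a few lines of Python. This is a hypothetical illustration of what the SPSS logistic procedure does internally, not SPSS code; with SPSS's default indicator contrast, the last category serves as the reference.

```python
# Sketch of SPSS-style indicator (dummy) coding for a categorical
# predictor -- a hypothetical illustration, not SPSS itself.
def indicator_code(values, levels):
    """Code a k-level factor as k-1 indicator columns; the last level
    in `levels` is the reference category (SPSS's default)."""
    return [[1 if v == lvl else 0 for lvl in levels[:-1]] for v in values]

# Race coded 1=White, 2=African American, 3=Hispanic, 4=Other
race = [1, 2, 3, 4]
for original, row in zip(race, indicator_code(race, levels=[1, 2, 3, 4])):
    print(original, "->", row)   # 4 ("Other") maps to [0, 0, 0]
```

Each row gives the values of Race(1), Race(2), and Race(3) for one case; the reference category (“Other”) is represented by all three indicators being zero.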
The significance of each variable is measured using a Wald statistic. Using p=0.10 as a cutoff criterion for not including variables in the equation, it can be seen that Gender (p=0.142) and Race (p=0.419) do not appear to be important predictor variables. Age is marginal (p=0.104), but we’ll leave it in for the time being. Rerunning the analysis after taking out Gender and Race as predictor variables yields the following output:
Variables in the Equation

                       B      S.E.    Wald   df   Sig.   Exp(B)
Step 1   Price       .000    .000   6.165    1   .013    1.000
         Age         .064    .032   4.132    1   .042    1.066
         Gift       2.339   1.131   4.273    1   .039   10.368
         Constant  -6.096   2.142   8.096    1   .004     .002
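The Wald and Sig. columns in these tables can be verified by hand: the Wald statistic for a single coefficient is (B/S.E.)², referred to a chi-square distribution with 1 df, and for 1 df the p-value equals a two-sided normal tail probability. A minimal Python check (an illustration, not SPSS code), using the Gender row from the full model:

```python
import math

# Wald statistic and p-value for one logistic coefficient:
# Wald = (B / S.E.)^2, compared against chi-square with 1 df.
def wald_p(b, se):
    z = abs(b / se)                      # |B| / S.E.
    wald = z * z                         # chi-square statistic, 1 df
    p = math.erfc(z / math.sqrt(2.0))    # two-sided normal tail = chi-square sf, df=1
    return wald, p

# Gender row of the full model: |B| = 3.772, S.E. = 2.568
wald, p = wald_p(3.772, 2.568)
print(round(wald, 3), round(p, 3))   # 2.158 and 0.142, matching the table
```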
This reduced model indicates that there is significant predictive power for the variables Gift (p=0.039), Age (p=0.042), and Price (p=0.013). Although the p-value for Price is small, notice that the OR = 1 and the coefficient for Price is zero to three decimal places. These seemingly contradictory bits of information (i.e., a small p-value but OR = 1.0) suggest that the scale of the Price values is hiding the actual odds ratio (OR) relationship. If the same model is run with the variable Price100, which is Price divided by 100, the odds ratio for Price100 is 1.041 and the estimated coefficient for Price100 is 0.040, as shown below.
Variables in the Equation

                       B      S.E.    Wald   df   Sig.   Exp(B)
Step 1   Age         .064    .032   4.132    1   .042    1.066
         Gift       2.339   1.131   4.273    1   .039   10.368
         Price100    .040    .016   6.165    1   .013    1.041
         Constant  -6.096   2.142   8.096    1   .004     .002
All of the other
values in the table remain the same. All we have done is to recode Price
into a more usable number. Another tactic often used is to standardize values
such as Price by subtracting the mean and dividing by the standard
deviation. Using standardized scores eliminates the problem observed with the
Price variable, and also simplifies the comparison of odds ratios for
different variables.
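The arithmetic behind this rescaling is simple: dividing a predictor by c multiplies its coefficient by c, so the odds ratio per c-unit step is exp(c·b). A short Python illustration (hypothetical, using the coefficients reported above):

```python
import math

# Rescaling a predictor: Price100 = Price / 100 turns the per-dollar
# coefficient into a per-$100 coefficient (100x larger), making the
# odds ratio visible at three decimal places.
b_per_100 = 0.040               # Price100 coefficient from the table
b_per_dollar = b_per_100 / 100  # implied per-dollar coefficient

print(round(math.exp(b_per_dollar), 3))  # 1.0   -- displays as "no effect"
print(round(math.exp(b_per_100), 3))     # 1.041 -- the reported odds ratio
```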
The result is that we can now see that the odds that a customer who is offered a gift will purchase a warranty are about 10 times (see Exp(B) for Gift) the corresponding odds for a customer not offered a gift. We also observe that for each additional $100 in Price, the odds that a customer will purchase a warranty increase by about 4%. This tells us that people tend to be more likely to purchase warranties for more expensive appliances. Finally, the OR for Age, 1.066, tells us that older buyers are more likely to purchase a warranty.
One way to assess the model is by the Hosmer-Lemeshow criterion. To produce this information:

4. Rerun the analysis, click on the Options button, and select the Hosmer-Lemeshow goodness-of-fit option. Click Continue and OK.
Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1         1.792      8   .987
This test divides the data into several groups based on predicted probability values, then computes a chi-square from the observed and expected frequencies of subjects falling in the two categories of the binary response variable within these groups. Large chi-square values (and correspondingly small p-values) indicate a lack of fit for the model. In the table above we see that the Hosmer-Lemeshow chi-square test for the final warranty model yields a p-value of 0.987, thus suggesting that the model fits the data well. Note that the Hosmer and Lemeshow chi-square test is not a test of the importance of specific model parameters (which may also appear in your computer printout). It is a separate post-hoc test performed to evaluate a specific model.
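The reported Sig. value can be reproduced from the chi-square statistic and its df. For an even number of degrees of freedom the chi-square survival function has a closed form, so a short Python check (an illustration, not SPSS code) is possible:

```python
import math

# Chi-square upper-tail probability for even df:
#   P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
def chi2_sf_even_df(x, df):
    half = x / 2.0
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series

# Hosmer-Lemeshow result above: chi-square = 1.792 on 8 df
print(round(chi2_sf_even_df(1.792, 8), 3))   # 0.987, matching the table
```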
Interpretation of the multiple logistic regression model

Once we are satisfied with the model, it can be used for prediction just as in the simple logistic example above. For this model, the prediction equation is

    logit(p) = -6.096 + 0.064(Age) + 2.339(Gift) + 0.040(Price100)

where p, the probability that a warranty is purchased, is recovered as p = 1 / (1 + e^(-logit(p))). (For more details on prediction see the Statistical Analysis Quick Reference Guidebook, Elliott and Woodward, 2007.)
Using this equation it would be reasonable to predict that a person with the characteristics Age = 54, Price = $3,850, and Gift = 1 would purchase a warranty, because the predicted probability is about 0.78, while the same person with no gift offered would not be predicted to purchase a warranty, because the predicted probability is about 0.25. The typical cutoff for the decision would be 0.5 (or 50%). Thus, using this cutoff, anyone whose score is higher than 0.5 would be predicted to buy the warranty and anyone with a lower score would be predicted not to buy the warranty. However, there may be times when you want to adjust this cutoff value. Neter et al. (1996) suggest three ways to select a cutoff value for predicting:
- Use the standard 0.5 cutoff value.
- Determine a cutoff value that will give you the best predictive fit for your sample data. This is usually determined through trial and error.
- Select a cutoff value that will separate your sample data into a specific proportion of your two states, based on a prior known proportion split in your population.
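The prediction rule described above can be sketched in Python (a hypothetical illustration built from the Price100 coefficients; the constant enters negatively, consistent with its Exp(B) of .002):

```python
import math

# Predicted probability of buying a warranty from the reduced model:
# logit(p) = -6.096 + 0.064*Age + 2.339*Gift + 0.040*(Price/100)
def predict_warranty(age, price, gift, cutoff=0.5):
    logit = -6.096 + 0.064 * age + 2.339 * gift + 0.040 * (price / 100.0)
    p = 1.0 / (1.0 + math.exp(-logit))   # inverse logit
    return p, p >= cutoff                # probability and buy/no-buy decision

p1, buys1 = predict_warranty(age=54, price=3850, gift=1)
p0, buys0 = predict_warranty(age=54, price=3850, gift=0)
print(round(p1, 2), buys1)   # about 0.78, True  -> predicted to buy
print(round(p0, 2), buys0)   # about 0.25, False -> predicted not to buy
```

Changing the cutoff argument implements the second and third strategies above; for example, predict_warranty(54, 3850, 0, cutoff=0.2) would classify the no-gift customer as a buyer.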
For example, to use the second option for deciding on a cutoff value, examine the model classification table that is part of the SPSS logistic output:
Classification Table (a)

                                   Predicted
                               Bought          Percentage
Observed                     No     Yes         Correct
Step 1   Bought     No       12       2           85.7
                    Yes       1      35           97.2
         Overall Percentage                       94.0

a. The cut value is .500
This table indicates that the final model correctly classifies 94% of the cases. The model used the default 0.5 cutoff value to classify each subject’s outcome. (Notice the footnote on the table: “The cut value is .500.”) You can rerun the analysis with a series of cutoff values such as 0.4, 0.45, 0.55, and 0.65 to see if the cutoff value could be adjusted for a better fit. For this particular model, these alternate cutoff values do not lead to better predictions, so the default 0.5 cutoff value is deemed sufficient. (For more information about classification see the Statistical Analysis Quick Reference Guidebook, Elliott and Woodward, 2007.)
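Trying alternative cutoffs amounts to re-tabulating hits and misses at each cutoff value. A minimal Python sketch (with made-up probabilities and outcomes, not the WARRANTY.SAV data):

```python
# Percent correctly classified at a given cutoff -- the "Overall
# Percentage" of an SPSS classification table, on hypothetical data.
def classification_rate(probs, observed, cutoff=0.5):
    predicted = [1 if p >= cutoff else 0 for p in probs]
    hits = sum(1 for pr, ob in zip(predicted, observed) if pr == ob)
    return 100.0 * hits / len(observed)

# Made-up predicted probabilities and observed outcomes (1 = bought)
probs    = [0.91, 0.15, 0.72, 0.48, 0.66, 0.09]
observed = [1,    0,    1,    1,    1,    0]

# Scan a few cutoffs to see which classifies this sample best
for cutoff in (0.40, 0.45, 0.50, 0.55, 0.65):
    print(cutoff, round(classification_rate(probs, observed, cutoff), 1))
```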
References

Cohen, J., Cohen, P., West, S.G., and Aiken, L.S. (2002). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, Third Edition. Lawrence Erlbaum Associates, Publishers.

Elliott, A., and Woodward, W. (2007). Statistical Analysis Quick Reference Guidebook. Thousand Oaks: Sage.

Hosmer, D.W., and Lemeshow, S. (2000). Applied Logistic Regression, 2nd Edition. New York: John Wiley and Sons, Inc.

Neter, J., Wasserman, W., Nachtsheim, C.J., and Kutner, M.H. (1996). Applied Linear Regression Models (3rd Ed.). Chicago: Irwin.