The relationship between countries’ credit ratings and the volatility of the countries’ stock markets was examined in the Journal of Portfolio Management (Spring 1996). Our interest lies in whether volatility (standard deviation of stock returns) and credit rating (in percent) can be used to classify a country into one of the two market types (developed or emerging).
The data is read into a data.frame with the response as a factor. It is customary and preferable to store a classification variable as a factor and let R handle the coding. When a factor is created from character data, as in this example, the default order of its levels is alphabetical, but the order can be changed. In glm(), the first level of the response factor is treated as the reference (the non-event) and the probability of the second level is modeled as the event. We therefore reorder the levels so that glm() models the probability of a country being developed.
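As a quick illustration (with a made-up vector, not the study data), the level order determines which category glm() would treat as the event:

# Hypothetical example: the default levels are alphabetical, so "develop"
# comes first and "emerge" would be modeled as the event
f <- factor(c("emerge", "develop", "emerge"))
levels(f)
# Reordering the levels makes "develop" the second level, i.e. the event
f <- factor(f, levels = c("emerge", "develop"))
levels(f)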
volatile <- read.table("datasets/volatile.dat")
names(volatile) <- c('country/region', 'volatile', 'credit', 'market')
volatile$market <- factor(volatile$market, levels = c('emerge', 'develop'))
kable(volatile)
country/region | volatile | credit | market |
---|---|---|---|
Afghanistan | 55.7 | 8.3 | emerge |
Australia | 23.9 | 71.2 | develop |
China | 27.2 | 57.0 | emerge |
Cuba | 55.0 | 8.7 | emerge |
Germany | 20.3 | 90.9 | develop |
France | 20.6 | 89.1 | develop |
India | 30.3 | 46.1 | emerge |
Belgium | 22.3 | 79.2 | develop |
Canada | 22.1 | 80.3 | develop |
Ethiopia | 47.9 | 14.1 | emerge |
Haiti | 54.9 | 8.8 | emerge |
Japan | 20.2 | 91.6 | develop |
Libya | 36.7 | 30.0 | emerge |
Malaysia | 24.3 | 69.1 | emerge |
Mexico | 31.8 | 41.8 | emerge |
NewZealand | 24.3 | 69.4 | develop |
Nigeria | 46.2 | 15.8 | emerge |
Oman | 28.6 | 51.8 | develop |
Panama | 38.6 | 26.4 | emerge |
Spain | 23.4 | 73.7 | develop |
Sudan | 60.5 | 6.0 | emerge |
Taiwan | 22.2 | 79.9 | develop |
Norway | 21.4 | 84.6 | develop |
Sweden | 23.3 | 74.1 | develop |
Togo | 45.1 | 17.0 | emerge |
Ukraine | 46.3 | 15.7 | emerge |
UnitedKingdom | 20.8 | 87.8 | develop |
UnitedStates | 20.3 | 90.7 | develop |
Vietnam | 36.9 | 29.5 | emerge |
Zimbabwe | 36.2 | 31.0 | emerge |
In R, a logistic regression model is fitted with glm() using the binomial family. The default link function for the binomial family is the logit link.
fit <- glm(market ~ volatile + credit, family = "binomial", data = volatile)
fit
##
## Call: glm(formula = market ~ volatile + credit, family = "binomial",
## data = volatile)
##
## Coefficients:
## (Intercept) volatile credit
## -11.56273 0.04501 0.17385
##
## Degrees of Freedom: 29 Total (i.e. Null); 27 Residual
## Null Deviance: 41.46
## Residual Deviance: 9.254 AIC: 15.25
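The link can also be written out explicitly; the following quick check (not part of the original output) confirms that it gives an identical fit.

# Spelling out the logit link explicitly gives the same coefficients
fit_logit <- glm(market ~ volatile + credit,
                 family = binomial(link = "logit"), data = volatile)
all.equal(coef(fit_logit), coef(fit))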
# response profile
summary(volatile$market)
## emerge develop
## 16 14
In order to compute the model fit statistics for an intercept-only model, we can fit that model explicitly and use the generic functions.
fit2 <- glm(market ~ 1, family = "binomial", data = volatile)
AIC(fit2, fit)
## df AIC
## fit2 1 43.45540
## fit 3 15.25401
BIC(fit2, fit)
## df BIC
## fit2 1 44.85660
## fit 3 19.45761
-2 * logLik(fit2)
## 'log Lik.' 41.4554 (df=1)
-2 * logLik(fit)
## 'log Lik.' 9.254014 (df=3)
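These quantities tie together: AIC \(= -2\log L + 2k\) and BIC \(= -2\log L + k\log n\), where \(k\) is the number of estimated parameters. A quick check against the values above:

# Recomputing AIC and BIC from the log-likelihood of the full model
k <- attr(logLik(fit), "df")
-2 * as.numeric(logLik(fit)) + 2 * k               # matches AIC(fit)
-2 * as.numeric(logLik(fit)) + k * log(nobs(fit))  # matches BIC(fit)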
In the SAS output, the same model is fitted using PROC GENMOD. SAS reports some additional statistics for assessing goodness of fit; here we compute some of them.
# Deviance
fit$deviance
## [1] 9.254014
fit$df.residual
## [1] 27
fit$deviance / fit$df.residual
## [1] 0.3427413
# Pearson's Chi-Square
sum(residuals(fit, "pearson")^2)
## [1] 9.81283
sum(residuals(fit, "pearson")^2) / fit$df.residual
## [1] 0.3634381
Both the deviance/df and Pearson chi-square/df ratios are well below 1, which suggests under-dispersion.
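Another way to see the same thing (a sketch, not part of the original analysis) is to refit with the quasibinomial family, whose dispersion estimate is the Pearson chi-square divided by its degrees of freedom:

# Dispersion estimate from a quasibinomial fit (about 0.36, i.e. under-dispersed)
fit_quasi <- glm(market ~ volatile + credit, family = "quasibinomial",
                 data = volatile)
summary(fit_quasi)$dispersion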
Since the likelihood ratio test statistic is just \(-2\log\Lambda\), we can use the quantities above to calculate it.
-2 * as.numeric(logLik(fit2) - logLik(fit))
## [1] 32.20138
1 - pchisq(-2 * as.numeric(logLik(fit2) - logLik(fit)), df = 2) # p-value
## [1] 1.017556e-07
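The same likelihood ratio test can also be obtained directly with anova():

# Likelihood ratio test comparing the intercept-only and full models
anova(fit2, fit, test = "Chisq")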
For a Wald test of \(H_0: L\theta = 0\), we need to specify \(L\) to compute the test statistic. For example, to test \(H_0: \beta_1 = \beta_2 = 0\), we can set \(L\) in the following way:
L <- rbind(c(0, 1, 0), c(0, 0, 1))
L
## [,1] [,2] [,3]
## [1,] 0 1 0
## [2,] 0 0 1
There is a function wald.test() in the aod package. Note that a Wald test requires the variance of the parameter estimates, \(a(\phi)(\hat{F}\hat{V}\hat{F})^{-1}\), which can be obtained with the vcov() function. R computes it via a QR decomposition, so the result differs slightly from SAS.
library(aod)
wald.test(vcov(fit), b = coef(fit), L = L)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 6.0, df = 2, P(> X2) = 0.049
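The same statistic can be computed by hand from the quadratic form \((L\hat{\theta})'(L\hat{V}L')^{-1}(L\hat{\theta})\); the object names below are new here:

# Wald chi-square computed directly from the quadratic form
b <- coef(fit)
V <- vcov(fit)
W <- drop(t(L %*% b) %*% solve(L %*% V %*% t(L)) %*% (L %*% b))
c(X2 = W, p.value = pchisq(W, df = nrow(L), lower.tail = FALSE))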
In SAS, the analysis of maximum likelihood estimates uses a Wald test with a chi-square distribution on 1 degree of freedom. This is equivalent to a Z-test with a standard normal distribution: the Z statistic is the square root of the Wald chi-square statistic, and the p-values are identical.
The percent concordant and percent discordant both require the fitted values. Note that we are modeling the probability of develop because it is the second level of the factor market.
i <- volatile$market == 'develop'
j <- volatile$market == 'emerge'
pairs <- sum(i) * sum(j)
# expand pairs into grid
grid <- with(volatile,
cbind(expand.grid(yi = market[i], yj = market[j]),
expand.grid(pi = fitted(fit)[i], pj = fitted(fit)[j])))
# Percent Concordant
sum(grid$pi > grid$pj) / pairs
## [1] 0.9910714
# Percent Discordant
sum(grid$pi < grid$pj) / pairs
## [1] 0.008928571
# Somers' D
nc <- sum(grid$pi > grid$pj)
(nc - (pairs - nc)) / pairs
## [1] 0.9821429
# Tau-a
2 * (nc - (pairs - nc)) / (nobs(fit) * (nobs(fit) - 1))
## [1] 0.5057471
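If the Hmisc package is available, somers2() offers a convenient cross-check: it reports the concordance index C (percent concordant plus half the ties) and Somers' Dxy.

# Cross-check with Hmisc (assumes the package is installed)
library(Hmisc)
somers2(fitted(fit), as.numeric(volatile$market == "develop"))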
summary(fit)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.56273217 40.9868601 -0.2821083 0.7778605
## volatile 0.04500712 0.9463449 0.0475589 0.9620678
## credit 0.17384805 0.2632646 0.6603547 0.5090262
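Squaring the z values above reproduces the Wald chi-square statistics reported by SAS, with identical p-values:

# SAS-style Wald chi-square statistics from the z values
z <- summary(fit)$coefficients[, "z value"]
cbind(`Wald Chi-Square` = z^2,
      `Pr > ChiSq` = pchisq(z^2, df = 1, lower.tail = FALSE))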
The odds ratio estimates are obtained by applying the exponential function to the parameter estimates and their confidence limits.
# profile likelihood confidence interval
confint(fit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -57.2306037 NA
## volatile NA 0.8767002
## credit -0.8975829 0.5591083
exp(cbind(OR = coef(fit)[-1], confint(fit)[-1, ])) # odds ratio
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## volatile 1.046035 NA 2.402957
## credit 1.189875 0.4075536 1.749112
# Wald confidence intervals
confint.default(fit)
## 2.5 % 97.5 %
## (Intercept) -91.8955018 68.7700374
## volatile -1.8097948 1.8998090
## credit -0.3421412 0.6898373
exp(coef(fit)[-1]) # odds ratio
## volatile credit
## 1.046035 1.189875
exp(confint.default(fit)[-1, ])
## 2.5 % 97.5 %
## volatile 0.1636877 6.684618
## credit 0.7102479 1.993391
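Because both predictors span wide ranges, the odds ratio for a one-unit change is close to 1. As an illustration (not part of the SAS output), rescaling to a 10-point change can be easier to interpret:

# Odds ratios for a 10-unit increase in each predictor
exp(10 * coef(fit)[-1])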
The SAS output has a nice classification table that shows the classification results under different cutoffs. Thanks to this answer, R can produce a similar table as well, though there are small differences due to precision (the maximum absolute difference between the two sets of predicted probabilities is 1.890711e-08).
getMisclass <- function(cutoff, p, event) {
  # classify as the event (second level) when the fitted probability exceeds
  # the cutoff; levels = 0:1 keeps both classes even if one is never predicted
  pred <- factor(1 * (p > cutoff), levels = 0:1, labels = levels(event))
  t <- table(pred, event)
  cat("cutoff ", cutoff, ":\n")
  print(t)
  # overall correct classification rate, then column-wise proportions
  cat("correct :", round(sum(t[c(1, 4)]) / sum(t), 3), "\n")
  cat("Specificity:", round(t[1] / sum(t[, 1]), 3), "\n")
  cat("Sensitivity:", round(t[4] / sum(t[, 2]), 3), "\n\n")
  invisible(t)
}
cutoffs <- seq(0.1, .9, by=.1)
getMisclass(.5, p = fitted(fit), event = volatile$market)
## cutoff 0.5 :
## event
## pred emerge develop
## emerge 15 1
## develop 1 13
## correct : 0.933
## Specificity: 0.938
## Sensitivity: 0.929
Applying this function over the vector of cutoffs gives a more thorough output, similar to the SAS classification table.
sapply(cutoffs, getMisclass, p=fitted(fit), event=volatile$market)
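As a compact alternative (a base-R sketch; the object names are new here), the correct-classification rate, sensitivity, and specificity can be collected into one matrix:

# Summary of classification performance across the same cutoffs
summ <- t(sapply(cutoffs, function(co) {
  pred <- fitted(fit) > co
  obs  <- volatile$market == "develop"
  c(cutoff      = co,
    correct     = mean(pred == obs),
    sensitivity = sum(pred & obs) / sum(obs),
    specificity = sum(!pred & !obs) / sum(!obs))
}))
round(summ, 3)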