The relationship between countries’ credit ratings and the volatility of the countries’ stock markets was examined in the Journal of Portfolio Management (Spring 1996). Our interest lies in whether volatility (standard deviation of stock returns) and credit rating (in percent) can be used to classify a country into one of the two market types (developed or emerging).
The data is read into a data.frame with the response as a factor. It is customary and preferable to store a classification variable as a factor and let R handle the coding. When a factor is created from character data, as in this example, the default order of its levels is alphabetical, but the order can be changed. In glm(), the first level of the response factor is treated as the reference (the non-event) and the probability of the second level is modeled as the event. We therefore reorder the levels so that glm() models the probability of a country being developed.
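As a quick illustration (with a made-up vector, not the study data), the level order determines which category glm() would treat as the event:

# Hypothetical example: the default levels are alphabetical, so "develop"
# comes first and "emerge" would be modeled as the event
f <- factor(c("emerge", "develop", "emerge"))
levels(f)
# Reordering the levels makes "develop" the second level, i.e. the event
f <- factor(f, levels = c("emerge", "develop"))
levels(f)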
volatile <- read.table("datasets/volatile.dat")
names(volatile) <- c('country/region', 'volatile', 'credit', 'market')
volatile$market <- factor(volatile$market, levels = c('emerge', 'develop'))
kable(volatile)
country/region | volatile | credit | market |
---|---|---|---|
Afghanistan | 55.7 | 8.3 | emerge |
Australia | 23.9 | 71.2 | develop |
China | 27.2 | 57.0 | emerge |
Cuba | 55.0 | 8.7 | emerge |
Germany | 20.3 | 90.9 | develop |
France | 20.6 | 89.1 | develop |
India | 30.3 | 46.1 | emerge |
Belgium | 22.3 | 79.2 | develop |
Canada | 22.1 | 80.3 | develop |
Ethiopia | 47.9 | 14.1 | emerge |
Haiti | 54.9 | 8.8 | emerge |
Japan | 20.2 | 91.6 | develop |
Libya | 36.7 | 30.0 | emerge |
Malaysia | 24.3 | 69.1 | emerge |
Mexico | 31.8 | 41.8 | emerge |
NewZealand | 24.3 | 69.4 | develop |
Nigeria | 46.2 | 15.8 | emerge |
Oman | 28.6 | 51.8 | develop |
Panama | 38.6 | 26.4 | emerge |
Spain | 23.4 | 73.7 | develop |
Sudan | 60.5 | 6.0 | emerge |
Taiwan | 22.2 | 79.9 | develop |
Norway | 21.4 | 84.6 | develop |
Sweden | 23.3 | 74.1 | develop |
Togo | 45.1 | 17.0 | emerge |
Ukraine | 46.3 | 15.7 | emerge |
UnitedKingdom | 20.8 | 87.8 | develop |
UnitedStates | 20.3 | 90.7 | develop |
Vietnam | 36.9 | 29.5 | emerge |
Zimbabwe | 36.2 | 31.0 | emerge |
In R, a logistic regression model is fitted with glm() using the binomial family. The default link function for the binomial family is the logit link.
fit <- glm(market ~ volatile + credit, family = "binomial", data = volatile)
fit
##
## Call: glm(formula = market ~ volatile + credit, family = "binomial",
## data = volatile)
##
## Coefficients:
## (Intercept) volatile credit
## -11.56273 0.04501 0.17385
##
## Degrees of Freedom: 29 Total (i.e. Null); 27 Residual
## Null Deviance: 41.46
## Residual Deviance: 9.254 AIC: 15.25
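The link can also be written out explicitly; the following quick check (not part of the original output) confirms that it gives an identical fit.

# Spelling out the logit link explicitly gives the same coefficients
fit_logit <- glm(market ~ volatile + credit,
                 family = binomial(link = "logit"), data = volatile)
all.equal(coef(fit_logit), coef(fit))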
# response profile
summary(volatile$market)
## emerge develop
## 16 14
In order to compute the model fit statistics for an intercept-only model, we can fit that model explicitly and use the generic functions.
fit2 <- glm(market ~ 1, family = "binomial", data = volatile)
AIC(fit2, fit)
## df AIC
## fit2 1 43.45540
## fit 3 15.25401
BIC(fit2, fit)
## df BIC
## fit2 1 44.85660
## fit 3 19.45761
-2 * logLik(fit2)
## 'log Lik.' 41.4554 (df=1)
-2 * logLik(fit)
## 'log Lik.' 9.254014 (df=3)
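These quantities tie together: AIC \(= -2\log L + 2k\) and BIC \(= -2\log L + k\log n\), where \(k\) is the number of estimated parameters. A quick check against the values above:

# Recomputing AIC and BIC from the log-likelihood of the full model
k <- attr(logLik(fit), "df")
-2 * as.numeric(logLik(fit)) + 2 * k               # matches AIC(fit)
-2 * as.numeric(logLik(fit)) + k * log(nobs(fit))  # matches BIC(fit)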
In the SAS output, the same model is fitted using PROC GENMOD. SAS reports some additional statistics for assessing goodness of fit; here we compute some of them.
# Deviance
fit$deviance
## [1] 9.254014
fit$df.residual
## [1] 27
fit$deviance / fit$df.residual
## [1] 0.3427413
# Pearson's Chi-Square
sum(residuals(fit, "pearson")^2)
## [1] 9.81283
sum(residuals(fit, "pearson")^2) / fit$df.residual
## [1] 0.3634381
Both the deviance/df and Pearson chi-square/df ratios are well below 1, which suggests under-dispersion.
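Another way to see the same thing (a sketch, not part of the original analysis) is to refit with the quasibinomial family, whose dispersion estimate is the Pearson chi-square divided by its degrees of freedom:

# Dispersion estimate from a quasibinomial fit (about 0.36, i.e. under-dispersed)
fit_quasi <- glm(market ~ volatile + credit, family = "quasibinomial",
                 data = volatile)
summary(fit_quasi)$dispersion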
Since the likelihood ratio test statistic is just \(-2\log\Lambda\), we can use the quantities above to calculate it.
-2 * as.numeric(logLik(fit2) - logLik(fit))
## [1] 32.20138
1 - pchisq(-2 * as.numeric(logLik(fit2) - logLik(fit)), df = 2) # p-value
## [1] 1.017556e-07
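The same likelihood ratio test can also be obtained directly with anova():

# Likelihood ratio test comparing the intercept-only and full models
anova(fit2, fit, test = "Chisq")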
For a Wald test of \(H_0: L\theta = 0\), we need to specify \(L\) to compute the test statistic. For example, to test \(H_0: \beta_1 = \beta_2 = 0\), we can set \(L\) in the following way:
L <- rbind(c(0, 1, 0), c(0, 0, 1))
L
## [,1] [,2] [,3]
## [1,] 0 1 0
## [2,] 0 0 1
There is a function wald.test() in the aod package. Note that a Wald test requires the variance of the parameter estimates, \(a(\phi)(\hat{F}\hat{V}\hat{F})^{-1}\), which can be obtained with the vcov() function. R computes it via a QR decomposition, so the result differs slightly from SAS.
library(aod)
wald.test(vcov(fit), b = coef(fit), L = L)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 6.0, df = 2, P(> X2) = 0.049
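The same statistic can be computed by hand from the quadratic form \((L\hat{\theta})'(L\hat{V}L')^{-1}(L\hat{\theta})\); the object names below are new here:

# Wald chi-square computed directly from the quadratic form
b <- coef(fit)
V <- vcov(fit)
W <- drop(t(L %*% b) %*% solve(L %*% V %*% t(L)) %*% (L %*% b))
c(X2 = W, p.value = pchisq(W, df = nrow(L), lower.tail = FALSE))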
In SAS, the analysis of maximum likelihood estimates uses a Wald test with a chi-square distribution on 1 degree of freedom. This is equivalent to a Z-test with a standard normal distribution: the Z statistic is the square root of the Wald chi-square statistic, and the p-values are identical.
The percent concordant and percent discordant both require the fitted values. Note that we are modeling the probability of develop because it is the second level of the factor market.
i <- volatile$market == 'develop'
j <- volatile$market == 'emerge'
pairs <- sum(i) * sum(j)
# expand pairs into grid
grid <- with(volatile,
cbind(expand.grid(yi = market[i], yj = market[j]),
expand.grid(pi = fitted(fit)[i], pj = fitted(fit)[j])))
# Percent Concordant
sum(grid$pi > grid$pj) / pairs
## [1] 0.9910714
# Percent Discordant
sum(grid$pi < grid$pj) / pairs
## [1] 0.008928571
# Somers' D
nc <- sum(grid$pi > grid$pj)
(nc - (pairs - nc)) / pairs
## [1] 0.9821429
# Tau-a
2 * (nc - (pairs - nc)) / (nobs(fit) * (nobs(fit) - 1))
## [1] 0.5057471
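If the Hmisc package is available, somers2() offers a convenient cross-check: it reports the concordance index C (percent concordant plus half the ties) and Somers' Dxy.

# Cross-check with Hmisc (assumes the package is installed)
library(Hmisc)
somers2(fitted(fit), as.numeric(volatile$market == "develop"))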
summary(fit)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.56273217 40.9868601 -0.2821083 0.7778605
## volatile 0.04500712 0.9463449 0.0475589 0.9620678
## credit 0.17384805 0.2632646 0.6603547 0.5090262
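Squaring the z values above reproduces the Wald chi-square statistics reported by SAS, with identical p-values:

# SAS-style Wald chi-square statistics from the z values
z <- summary(fit)$coefficients[, "z value"]
cbind(`Wald Chi-Square` = z^2,
      `Pr > ChiSq` = pchisq(z^2, df = 1, lower.tail = FALSE))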
The odds ratio estimates are obtained by applying the exponential function to the parameter estimates and their confidence limits.
# profile likelihood confidence interval
confint(fit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -57.2306037 NA
## volatile NA 0.8767002
## credit -0.8975829 0.5591083
exp(cbind(OR = coef(fit)[-1], confint(fit)[-1, ])) # odds ratio
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## volatile 1.046035 NA 2.402957
## credit 1.189875 0.4075536 1.749112
# Wald confidence intervals
confint.default(fit)
## 2.5 % 97.5 %
## (Intercept) -91.8955018 68.7700374
## volatile -1.8097948 1.8998090
## credit -0.3421412 0.6898373
exp(coef(fit)[-1]) # odds ratio
## volatile credit
## 1.046035 1.189875
exp(confint.default(fit)[-1, ])
## 2.5 % 97.5 %
## volatile 0.1636877 6.684618
## credit 0.7102479 1.993391
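Because both predictors span wide ranges, the odds ratio for a one-unit change is close to 1. As an illustration (not part of the SAS output), rescaling to a 10-point change can be easier to interpret:

# Odds ratios for a 10-unit increase in each predictor
exp(10 * coef(fit)[-1])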
The SAS output has a nice classification table that shows the classification results under different cutoffs. Thanks to this answer, R can produce a similar table as well, though there are small differences due to precision (the maximum absolute difference between the two sets of predicted probabilities is 1.890711e-08).
getMisclass <- function(cutoff, p, event) {
  # classify as the event (second level) when the fitted probability exceeds
  # the cutoff; levels = 0:1 keeps both classes even if one is never predicted
  pred <- factor(1 * (p > cutoff), levels = 0:1, labels = levels(event))
  t <- table(pred, event)
  cat("cutoff ", cutoff, ":\n")
  print(t)
  # overall correct classification rate, then column-wise proportions
  cat("correct :", round(sum(t[c(1, 4)]) / sum(t), 3), "\n")
  cat("Specificity:", round(t[1] / sum(t[, 1]), 3), "\n")
  cat("Sensitivity:", round(t[4] / sum(t[, 2]), 3), "\n\n")
  invisible(t)
}
cutoffs <- seq(0.1, .9, by=.1)
getMisclass(.5, p = fitted(fit), event = volatile$market)
## cutoff 0.5 :
## event
## pred emerge develop
## emerge 15 1
## develop 1 13
## correct : 0.933
## Specificity: 0.938
## Sensitivity: 0.929
Applying this function over the vector of cutoffs gives a more thorough output, similar to the SAS classification table.
sapply(cutoffs, getMisclass, p=fitted(fit), event=volatile$market)
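As a compact alternative (a base-R sketch; the object names are new here), the correct-classification rate, sensitivity, and specificity can be collected into one matrix:

# Summary of classification performance across the same cutoffs
summ <- t(sapply(cutoffs, function(co) {
  pred <- fitted(fit) > co
  obs  <- volatile$market == "develop"
  c(cutoff      = co,
    correct     = mean(pred == obs),
    sensitivity = sum(pred & obs) / sum(obs),
    specificity = sum(!pred & !obs) / sum(!obs))
}))
round(summ, 3)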