Predicting readmission probability for diabetes inpatients
Saurav Bose, Nicole Berkman, Jason Liebmann,
1 Executive Summary
1.1 Background
Diabetes is a chronic medical condition affecting millions of Americans, but if managed well, with good diet, exercise and medication, patients can lead relatively normal lives. However, if improperly managed, diabetes can lead to patients being continuously admitted and readmitted to hospitals. Readmissions are especially serious - they represent a failure of the health system to provide adequate support to the patient and are extremely costly to the system. As a result, the Centers for Medicare and Medicaid Services announced in 2012 that they would no longer reimburse hospitals for services rendered if a patient was readmitted with complications within 30 days of discharge. Given these policy changes, being able to identify and predict those patients most at risk for costly readmissions has become a pressing priority for hospital administrators. In this project, we shall want to help better manage diabetes patients who have been admitted to a hospital. Our goals are two-fold. First, we identify the factors predicting whether or not the patient will be readmitted within 30 days and secondly, we propose a classification rule to predict if a patient will be readmitted within 30 days. This is done with an intention to avoid patients being readmitted within 30 days of discharge, which inturn reduces costs for the hospital and improves outcomes for patients.
1.2 Summary about the data
The original data is from the Center for Clinical and Translational Research at Virginia Commonwealth University. It covers data on diabetes patients across 130 U.S. hospitals from 1999 to 2008. There are over 100,000 unique hospital admissions in this dataset, from ~70,000 unique patients. The data includes demographic elements, such as age, gender, and race, as well as clinical attributes such as patient number, tests conducted,time in hospital, admission type, diabetic medications, medical specialty of admitting physician, number of lab test performed, emergency/inpatient visits, etc.
All observations have five things in common:
- They are all hospital admissions
- Each patient had some form of diabetes
- The patient stayed for between 1 and 14 days.
- The patient had laboratory tests performed on him/her.
- The patient was given some form of medication during the visit.
In the cleaned dataset, Payer code
, weight
and Medical Specialty
are not included since they have a large number of missing values. Moreover, variables such as acetohexamide
, glimepiride.pioglitazone
, metformin.rosiglitazone
, metformin.pioglitazone
have little variability, and are as such excluded. This also includes the following variables: chlorpropamide
, acetohexamide
, tolbutamide
, acarbose
, miglitor
, troglitazone
, tolazamide
, examide
, citoglipton
, glyburide.metformin
, glipizide.metformin
, and glimepiride.pioglitazone
.
Some categorical variables like age and the ICD9 codes have been regrouped into bins. This has also been done for the response variable readmitted
which indicates whether a patient was readmitted after a particular admission. Originally there were 3 levels for this variable: “NO” = no readmission, “< 30” = readmission within 30 days and “> 30” = readmission after more than 30 days. The 30 day distinction is of practical importance to hospitals because federal regulations penalize hospitals for an excessive proportion of such readmissions. This is why in the cleaned dataset this variable is modified to be binary wherein “YES” indicates readmission within 30 days and “NO” indicates readmission after more than 30 days or no readmission.
A detailed description of the variables in the dataset is given in the Appendix.
1.3 Predictive Modeling
The cleaned dataset contains 100343 observations and 30 features excluding the response variable which is then split into a training and a test set. The analysis follows an 80 - 20 training- test split. All the models were developed on the training set. Model development includes feature selection and hyperparameter tuning. Each developed model was then tested on the test set to figure out its performance. The performance criterion used to select the best model was AUC.
In total 4 models were developed:
- Backward Selection: A Logistic Regression model which uses the features that were selected by the Backward selection routine, i.e. a recursive procedure that kicks out the feature with the highest p-value till we are left with all features that are significant at the 0.05 level.
- LASSO: A regularised Logistic regression model that chooses the important features based on \(L_1\) regularization
- Elastic Net: A regularized Logistic regresion model that is a combination of LASSO (\(L_1\) regularization) and Ridge Regression (\(L_2\) regularization)
- Random Forest: A bunch of decision trees that use a random set of features and also a random set of records for each tree.
The ROC curves for the models is shown below:
It was estimated that it costs twice as much to mislabel a readmission than it does to mislabel a non-readmission. Based on this risk ratio, the classifier for the models predicted a positive label when the probability of the label being 1 was greater than 0.33.
Finally, it was obserrved that the AUC for LASSO was 0.656 which was the highest of the 4 models and hence LASSO was chosen as the final model. The final model is
\[readmission: timeinhospital + nummedications + numberemergency + numberinpatient + numberdiagnoses + insulin + diabetesMed + dischdispmodified + agemod + diag1mod + diag3mod\]
The final LASSO model when tested on the held out test set gave a weighted misclassification error of 0.2277.
1.4 Conclusion and Limitations of the study
The study concluded that for the given data, LASSO was the best model in terms of AUC. We therefore used LASSO to predict whether a patient will be a readmit within 30 days with a weighted misclassification error of 0.2277. Moreover time_in_hospital, num_medications, number_emergency, number_inpatient, number_diagnoses, insulin, diabetesMed, disch_disp_modified, age_mod, diag1_mod and diag3_mod were the features that were deemed important in determining the readmission status of the patients. Despite the successful predictive modeling, there were a few factors that if improved could have led to a more accurate model. First and foremost the data only includes information about “patients” or in other words people who have been associated with a hospital. It is possible that a vast fraction of diabetes patients dont really have active hospital records. MOreover the data only contains records from the past 10 years and from 130 hospitals. Hence the data may not be truely representative of the entire population. Gathering more data would be a solution to this problem. Secondly, the dataset contained many missing values. These could have originated due to poor hospital policy or just loss of data. Missing values are generally tricky to handle as removing them could potentially skew the decision for or against a variable.
2 Detailed Analysis
The original data is from the Center for Clinical and Translational Research at Virginia Commonwealth University. It covers data on diabetes patients across 130 U.S. hospitals during a ten-year period from 1999 to 2008. There are over 100,000 unique hospital admissions in this dataset, from ~70,000 unique patients. The data includes demographic elements, such as age, gender, and race, as well as clinical attributes such as tests conducted, emergency/inpatient visits, etc.
#Read in data
data.all <- read.csv("diabetic.data.csv", header = T)
head(data.all)
sapply(data.all, function(x) sum(is.na(x)))
dim(data.all)
#Check for missing values
sapply(data.all, function(x) length(unique(x)))
sapply(data.all, function(x) sum(x == "?"))
The original dataset contains 101766 records with 50 variables each. The original dataset contains a large number of missing values for weight, medical specialty and payer codes of patients. Hence for the rest of the analysis, these variables are removed. Variables such as acetohexamide, glimepiride.pioglitazone, metformin.rosiglitazone, metformin.pioglitazone have little variability, and are as such excluded. This also includes the following variables: chlorpropamide, acetohexamide, tolbutamide, acarbose, miglitor,troglitazone, tolazamide, examide, citoglipton, glyburide.metformin, glipizide.metformin, and glimepiride.pioglitazone. Moreover, variables like age and ICD9 codes were binned to keep some original levels with large number of patients and aggregate other patients as others.
#Read in clean data
data.small <- read.csv("readmission.csv", header = T)
sapply(data.small, function(x) sum(x == "?"))
The now cleaner dataset contains missing values in race and tertiary diagnoses of patients. The missing race records were binned into a new category called Unknown to take care of the missing value issue. Moreover this dataset contained only about 1% of the tertiary diagnoses values missing and hence those were removed.
#Rectify missing race
levels(data.small$race)[1] <- "Unknown"
summary(data.small$race)
summary(data.small$diag3_mod)
data.test <- data.small
#Rectify missing tertiary diagnosis
data.test <- data.test %>% mutate(diag3_mod = as.character(diag3_mod)) %>% filter(diag3_mod!="?") %>% mutate(diag3_mod = as.factor(diag3_mod))
summary(data.test$diag3_mod)
data.clean <- data.test
sapply(data.clean, function(x) sum(x=="?"))
sapply(data.clean, function(x) sum(x==" "))
Finally, in the clean dataset, the response variable readmitted
which indicates whether a patient was readmitted after a particular admission was binned. Originally there were 3 levels for this variable: “NO” = no readmission, “< 30” = readmission within 30 days and “> 30” = readmission after more than 30 days. The 30 day distinction is of practical importance to hospitals because federal regulations penalize hospitals for an excessive proportion of such readmissions. This is why in the cleaned dataset this variable is modified to be binary wherein “YES” indicates readmission within 30 days and “NO” indicates readmission after more than 30 days or no readmission.
#Bin response variable
data.clean <- data.clean %>% mutate(readmitted = as.character(readmitted)) %>% mutate(readmitted = ifelse(readmitted == ">30","NO",readmitted)) %>% mutate(readmitted = ifelse(readmitted == "<30","YES",readmitted)) %>% mutate(readmitted = as.factor(readmitted))
levels(data.clean$readmitted) <- c(0,1)
summary(data.clean$readmitted)
The cleaned dataset contains 100343 records and 31 variables. A part of the clean dataset is shown below:
head(data.clean)
A quantitative summary of the data set is shown below:
xtable(summary(data.clean))
Encounter ID and patient ID should not play any role in the modeling and hence are removed. We now try to understand the data better by looking at some of the features individually and also at their relationships to other variables. First lets try and understand the demographics. Quantitatively, the dataset has a gender split with more Females than Males.
ggplot(data.clean)+geom_bar(aes(x=gender))
Moreover, majority of the patients seem to be Caucasian followed by African Americans.
ggplot(data.clean)+geom_bar(aes(x = race))
The female dominance over males is observed across all races.
ggplot(data.clean, aes(x = race, fill = gender))+geom_bar(position = "dodge")
Finally, in order to summarize the demographics and visualise the interactions between all demographic variables like age, gender and race, we create a parallel plot as shown below:
data.temp <- data.clean %>% group_by(race, gender, age_mod) %>% summarise(Freq = n())
#Parallel plot (Credits to a brilliant stack overflow answer - https://stats.stackexchange.com/questions/12029/is-it-possible-to-create-parallel-sets-plot-using-r)
parallelset <- function(..., freq, col="gray", border=0, layer,
alpha=0.5, gap.width=0.05) {
p <- data.frame(..., freq, col, border, alpha, stringsAsFactors=FALSE)
n <- nrow(p)
if(missing(layer)) { layer <- 1:n }
p$layer <- layer
np <- ncol(p) - 5
d <- p[ , 1:np, drop=FALSE]
p <- p[ , -c(1:np), drop=FALSE]
p$freq <- with(p, freq/sum(freq))
col <- col2rgb(p$col, alpha=TRUE)
if(!identical(alpha, FALSE)) { col["alpha", ] <- p$alpha*256 }
p$col <- apply(col, 2, function(x) do.call(rgb, c(as.list(x), maxColorValue = 256)))
getp <- function(i, d, f, w=gap.width) {
a <- c(i, (1:ncol(d))[-i])
o <- do.call(order, d[a])
x <- c(0, cumsum(f[o])) * (1-w)
x <- cbind(x[-length(x)], x[-1])
gap <- cumsum( c(0L, diff(as.numeric(d[o,i])) != 0) )
gap <- gap / max(gap) * w
(x + gap)[order(o),]
}
dd <- lapply(seq_along(d), getp, d=d, f=p$freq)
par(mar = c(0, 0, 2, 0) + 0.1, xpd=TRUE )
plot(NULL, type="n",xlim=c(0, 1), ylim=c(np, 1),
xaxt="n", yaxt="n", xaxs="i", yaxs="i", xlab='', ylab='', frame=FALSE)
for(i in rev(order(p$layer)) ) {
for(j in 1:(np-1) )
polygon(c(dd[[j]][i,], rev(dd[[j+1]][i,])), c(j, j, j+1, j+1),
col=p$col[i], border=p$border[i])
}
text(0, seq_along(dd), labels=names(d), adj=c(0,-2), font=2)
for(j in seq_along(dd)) {
ax <- lapply(split(dd[[j]], d[,j]), range)
for(k in seq_along(ax)) {
lines(ax[[k]], c(j, j))
text(ax[[k]][1], j, labels=names(ax)[k], adj=c(0, -0.25))
}
}
}
myt <- subset(data.temp, select=c("race","gender","age_mod","Freq"))
myt <- within(myt, {
color <- ifelse(gender=="Male","#008888","#330066")
})
with(myt, parallelset(race, gender, age_mod, freq=Freq, col=color, alpha=0.2))
From the above plot we can see that the majority race is Caucasian followed by African American. The majority gender is Female and the majority age is 60-79 years followed by 20 - 59 years. Furthermore we can see that of the Caucasians who are Males, most lie in the age group of 60-79 and the rest are equally split between ages 20 -59 and 80+. Moreover, the African Americans are predominantly Females and of them, the majority lie in the age group of 20-59 years. The African American Males lie either in the age group of 20-59 years or 60-79 years. The old patients, i.e. 80+ years comprose of Caucasians (men and women) and African American women.
Visualizing a few other features reveals that most patients were admitted in the hospital for upto 5 days. Moreover, visits by the patient(inpatient, outpatient or emergency) in the year prior to the current encounter is predominantly 0. Which leads us to conclude that most of the data is of first time patients
p1 <- ggplot(data.clean, aes(x = time_in_hospital)) + geom_bar() +
labs(title = "Time in Hospital", x = "Time in Hospital", y = "Count")
p2 <- ggplot(data.clean, aes(x = number_outpatient)) + geom_bar() +
labs(title = "Number of Outpatient Visits", x = "Number of Outpatient Visits", y = "Count")
p3 <- ggplot(data.clean, aes(x = number_inpatient)) + geom_bar() +
labs(title = "Number of Inpatient Visits", x = "Number of Inpatient Visits", y = "Count")
p4 <- ggplot(data.clean, aes(x = number_emergency)) + geom_bar() +
labs(title = "Number of Emergency Visits", x = "Number of Emergency Visits", y = "Count")
grid.arrange(p1, p2, p3, p4)
Lets now look at the response variable and its interaction with other variables. The data is heavily skewed towards patients not being readmitted within 30 days.
ggplot(data.clean) + geom_bar(aes(x = readmitted))
This immediately tells us that just Sensitivity or specificity can’t be used as mnodel evaluation parameters.
The parallel plot is used to visualise the interaction between the response variable and the demographics of the data.
data.temp2 <- data.clean %>% group_by(race, gender, age_mod, readmitted) %>% summarise(Freq = n())
#Parallel plot (Credits to a brilliant stack overflow answer - https://stats.stackexchange.com/questions/12029/is-it-possible-to-create-parallel-sets-plot-using-r)
parallelset <- function(..., freq, col="gray", border=0, layer,
alpha=0.5, gap.width=0.05) {
p <- data.frame(..., freq, col, border, alpha, stringsAsFactors=FALSE)
n <- nrow(p)
if(missing(layer)) { layer <- 1:n }
p$layer <- layer
np <- ncol(p) - 5
d <- p[ , 1:np, drop=FALSE]
p <- p[ , -c(1:np), drop=FALSE]
p$freq <- with(p, freq/sum(freq))
col <- col2rgb(p$col, alpha=TRUE)
if(!identical(alpha, FALSE)) { col["alpha", ] <- p$alpha*256 }
p$col <- apply(col, 2, function(x) do.call(rgb, c(as.list(x), maxColorValue = 256)))
getp <- function(i, d, f, w=gap.width) {
a <- c(i, (1:ncol(d))[-i])
o <- do.call(order, d[a])
x <- c(0, cumsum(f[o])) * (1-w)
x <- cbind(x[-length(x)], x[-1])
gap <- cumsum( c(0L, diff(as.numeric(d[o,i])) != 0) )
gap <- gap / max(gap) * w
(x + gap)[order(o),]
}
dd <- lapply(seq_along(d), getp, d=d, f=p$freq)
par(mar = c(0, 0, 2, 0) + 0.1, xpd=TRUE )
plot(NULL, type="n",xlim=c(0, 1), ylim=c(np, 1),
xaxt="n", yaxt="n", xaxs="i", yaxs="i", xlab='', ylab='', frame=FALSE)
for(i in rev(order(p$layer)) ) {
for(j in 1:(np-1) )
polygon(c(dd[[j]][i,], rev(dd[[j+1]][i,])), c(j, j, j+1, j+1),
col=p$col[i], border=p$border[i])
}
text(0, seq_along(dd), labels=names(d), adj=c(0,-2), font=2)
for(j in seq_along(dd)) {
ax <- lapply(split(dd[[j]], d[,j]), range)
for(k in seq_along(ax)) {
lines(ax[[k]], c(j, j))
text(ax[[k]][1], j, labels=names(ax)[k], adj=c(0, -0.25))
}
}
}
myt <- subset(data.temp2, select=c("race","gender","age_mod","readmitted","Freq"))
myt <- within(myt, {
color <- ifelse(gender=="Male","#008888","#330066")
})
with(myt, parallelset(race, gender, age_mod, readmitted, freq=Freq, col=color, alpha=0.2))
The plot reveals that most patients are not readmitted within 30 days. Furthermore, of the people who are readmitted within 30 days, 4 kinds of people dominante the proportion: Caucasian women in the age group of 60 -79, Caucasian women in the age group of 80+ Caucasian men in the age group of 60-79 and Caucasian men in the age group of 80+.
Furthermore, it was observed that irrespective of whether the patient was readmitted within 30 days or not, the large majority of them had No results for the glucose serum and teh A1C tests.
p5 <- ggplot(data.clean, aes(x = readmitted, fill = max_glu_serum))+geom_bar(position = "dodge")
p6 <- ggplot(data.clean, aes(x = readmitted, fill = A1Cresult))+geom_bar(position = "dodge")
grid.arrange(p5, p6)
Finally, An interesting relationship was observed between the readmission status and the total number of diagnoses entered for the patient. On average the patients who were readmitted within 30 days seem to have a larger nymber of diagnoses.
ggplot(data.clean, aes(x = readmitted, y = number_diagnoses)) + geom_boxplot() +
labs(y = "Number of diagnoses", x = "readmitted status")
Now that we have a sense of what the data contains, we can build models to predict the response variable, i.e. readmission status. The cleaned dataset had 30 variables aside from the response variable in it. Of these 30 variables, we can realise that Encounter ID and Patient Number shouldn’t have any real effect on the prediction. So we remove them from the dataset. This leaves us with 28 variables which act as input for predicting the readmission status.
3 Predictive Modeling
data.model <- data.clean %>% select(-c(encounter_id,patient_nbr))
head(data.model)
dim(data.model)
We will split the data into training and testing sets with an approximately 80-20 split. The training set thus contains 80343 records and the test set contains 20000 records. The records are assigned to the training and test sets randomly.
#Split into traiing and testing sets
set.seed(123)
train_id <- sample(nrow(data.model), 80343)
test_id <- setdiff(rownames(data.model),train_id)
train_data <- data.model[train_id,]
test_data <- data.model[test_id,]
Moreover, we estimate that it costs twice as much to mislabel a readmission than it does to mislabel a non-readmission. Based on this risk ratio, the threshold for the classifiers is computed to be, predict \(readmission = 1\) if \(P(readmission = 1|x) > 0.33\).
The models developed have been described below:
- Backward Selection
The first model is a Logistic regression model wherein the features were selected using the Backward Selection routine. First the model with all 28 predictors was developed. Thereafter the variable with the highest p-value was kicked out and the updated model was fit. This process was continued till all the variables left in the model were significant at the 0.05 level. The Backward selection model is shown below:
set.seed(123)
#Full model
fit.back <- glm(readmitted~., train_data, family = binomial(logit))
s <- summary(fit.back)
p <- data.frame(coef(s)[,4])
colnames(p) <- c("p")
var <- names(train_data)
var <- var[var!="readmitted"]
p_max <- max(p)
#Backward selection routine
while(p_max>0.05)
{
id <- vector("numeric", length(var))
rem <- which(p==max(p))
t <- rownames(p)[rem]
for (i in 1:length(var))
{
id[i] <- grepl(var[i],t)
}
to_upd <- var[which(id == T)]
var <- var[var!=to_upd]
f <- as.formula(paste("readmitted~",paste(var,collapse = "+")))
fit.back <- glm(f, train_data, family = binomial(logit))
s <- summary(fit.back)
p <- data.frame(coef(s)[,4])
colnames(p) <- c("p")
p_max <- max(p)
}
xtable(summary(fit.back),type='html')
var.back <- var
#Response variable of test set as numeric
test_resp <- as.numeric(test_data$readmitted)-1
test_X <- test_data %>% select(-readmitted)
#Predict on test set
predict.back <- predict(fit.back, test_X, type = "response")
predict.back.values <- ifelse(predict.back>0.33,1,0)
#Unweighted accuracy
acc_back <- 1- sum(abs(predict.back.values - test_resp))/length(test_resp)
The significant variables according to this model along with their p-values are listed below:
var.back
## [1] "time_in_hospital" "number_emergency" "number_inpatient"
## [4] "number_diagnoses" "insulin" "diabetesMed"
## [7] "disch_disp_modified"
an.back <- Anova(fit.back)
an.back
The predictions for the model are summarised as follows:
t <- table(predict.back.values,test_data$readmitted)
t
##
## predict.back.values 0 1
## 0 17562 2223
## 1 132 83
The unweighted misclassification error is:
1-acc_back
## [1] 0.11775
The weighted misclassification error is :
mce.back <- (2*t[1,2]+t[2,1])/length(predict.back.values)
acc_back_weight <- 1-mce.back
1-acc_back_weight
## [1] 0.2289
- LASSO
The second model is a Logistic Regression model with a feature selection done using \(L_1\) regularization. This is called LASSO. This model tend to drive all the insignificant features to zero and so the final model we would be left with would contain features that play a role in determining the readmission status.
set.seed(123)
x <- model.matrix(readmitted~., train_data)[,-1]
y <- train_data$readmitted
fit.lass <- cv.glmnet(x,y,nfolds = 10, family = "binomial", alpha = 1)
var.out <- rownames(coef(fit.lass, s = "lambda.1se"))[which(coef(fit.lass, s = "lambda.1se")!=0)][-1]
var <- names(train_data)
var <- var[var!="readmitted"]
for(i in 1:length(var)){
t <- which(grepl(var[i],var.out) ==T)
var.out[t] <- var[i]
}
var.lass <- unique(var.out)
plot(fit.lass)
f.lass <- as.formula(paste("readmitted~",paste(var.lass, collapse = "+")))
fit.lass.final <- glm(f.lass, train_data, family = binomial(logit))
predict.lass <- predict(fit.lass.final, test_X, type = "response")
predict.lass.values <- ifelse(predict.lass>0.33,1,0)
acc_lass <- 1- sum(abs(predict.lass.values - test_resp))/length(test_resp)
As seen from the above plot, lambda.min is 2.962118e-05 and lambda.1se is 0.0054. For our model we pick lambda.1se because it results in an effective reduction in the features. The model with lambda.min results in a complex model as it doesn’t result in effective fearture selection and might be overfitting the data.
The significant variables according to LASSO and their respective p-values are listed below:
var.lass
## [1] "time_in_hospital" "num_medications" "number_emergency"
## [4] "number_inpatient" "number_diagnoses" "insulin"
## [7] "diabetesMed" "disch_disp_modified" "age_mod"
## [10] "diag1_mod" "diag3_mod"
an.lass <- Anova(fit.lass.final)
an.lass
The LASSO model is summarised below:
xtable(summary(fit.lass.final))
The predictions for the model are summarised as follows:
t <- table(predict.lass.values,test_data$readmitted)
t
##
## predict.lass.values 0 1
## 0 17552 2206
## 1 142 100
The unweighted misclassification error is:
1-acc_lass
## [1] 0.1174
The weighted misclassification error is :
mce.lass <- (2*t[1,2]+t[2,1])/length(predict.lass.values)
acc_lass_weight <- 1-mce.lass
1-acc_lass_weight
## [1] 0.2277
- Elastic Net
The third model is a mixture of LASSO(\(L_1\) regularization) and Ridge regression (\(L_2\)) regularization. In order to develop the model we had to run cross validation over alpha which determines the relative contribution of the LASSO and Ridge models. For each value of alpha, the algorithm tests with different values of lambda to find the best value of lambda in terms of cross validation error. We try 6 values of alpha in the range 0 and 1 and they are summarised below:
alph <- seq(0,1, by = 0.2)
set.seed(123)
x <- model.matrix(readmitted~., train_data)[,-1]
y <- train_data$readmitted
cve.lambda.min <- vector("numeric", length(alph))
cve.lambda.1se <- vector("numeric", length(alph))
for (a in 1:length(alph)){
fit.lass <- cv.glmnet(x,y,nfolds = 10, family = "binomial", alpha = alph[a])
cve.lambda.min[a] <- fit.lass$cvm[which(fit.lass$lambda == fit.lass$lambda.min)]
cve.lambda.1se[a] <- fit.lass$cvm[which(fit.lass$lambda == fit.lass$lambda.1se)]
}
output.enet <- data.frame(alpha = alph, cve.min = cve.lambda.min, cve.1se = cve.lambda.1se)
output.enet %>% arrange(cve.1se)
fit.net <- cv.glmnet(x,y, nfolds = 10, alpha = 0.6, family = "binomial")
var.out <- rownames(coef(fit.net))[which(coef(fit.net, s = "lambda.1se")!=0)][-1]
var <- names(train_data)
var <- var[var!="readmitted"]
for(i in 1:length(var)){
t <- which(grepl(var[i],var.out) ==T)
var.out[t] <- var[i]
}
var.enet <- unique(var.out)
f.net <- as.formula(paste("readmitted~",paste(var.enet, collapse = "+")))
fit.net.final <- glm(f.net, train_data, family = binomial(logit))
predict.enet <- predict(fit.net.final, test_X, type = "response")
predict.enet.values <- ifelse(predict.enet>0.33,1,0)
acc_enet <- 1- sum(abs(predict.enet.values - test_resp))/length(test_resp)
Based on the above summary, the best value of alpha is either 0.6 or 0.8. Alpha of 0.8 results in severe underfitting as it churns out only a single feature in the model. Hence we use alpha of 0.6 as the final alpha.
The variables that are deemed significant based on the Elastic Net are summarised below:
var.enet
## [1] "time_in_hospital" "num_medications" "number_emergency"
## [4] "number_inpatient" "number_diagnoses" "insulin"
## [7] "diabetesMed" "disch_disp_modified" "age_mod"
## [10] "diag1_mod" "diag2_mod" "diag3_mod"
an.enet <- Anova(fit.net.final)
an.enet
The Elastic Net model is summaarised below:
xtable(summary(fit.net.final))
The predictions for the model are summarised as follows:
t <- table(predict.enet.values,test_data$readmitted)
t
##
## predict.enet.values 0 1
## 0 17546 2208
## 1 148 98
The unweighted misclassification error is:
1-acc_enet
## [1] 0.1178
The weighted misclassification error is :
mce.enet <- (2*t[1,2]+t[2,1])/length(predict.enet.values)
acc_enet_weight <- 1-mce.enet
1-acc_enet_weight
## [1] 0.2282
- Random Forest
The fourth model is the Random Forest model. Here,a bunch of decision trees were built where each tree uses a random set of features and also a random set of records. The number of features to randomly select for each tree is a hyperparameter that needs to be tuned.
set.seed(123)
#
#The following commented lines are hyperparameter tuning for mtry. 19 values are tried. The code is commented as it takes way too long to run during the knitting process. It was run once in order to figure out the best mtry value.
# par(mfrow=c(3,1))
# rf.error.p <- 1:19 # set up a vector of length 19
# for (p in 1:19) # repeat the following code inside { } 19 times
# {
# fit.rf <- randomForest(readmitted~., train_data, mtry=p, ntree=250)
# rf.error.p[p] <- mean(train_data$readmitted != fit.rf$predicted) # collecting oob mse based on 250 trees
# }
# rf.error.p # oob mse returned: should be a vector of 19
#
# plot(1:19, rf.error.p, pch=16,
# xlab="mtry",
# ylab="mse of mtry")
fit.rf <- randomForest(readmitted~., train_data, mtry = 4, ntree = 250)
predict.rf.y <- predict(fit.rf, newdata=test_data)
acc.rf <- 1-mean(test_data$readmitted != predict.rf.y)
predict.rf <- predict(fit.rf, newdata=test_data, type="prob") #probabilities
predict.rf.values <- ifelse(predict.rf[,2]>0.33,1,0)
Bassed on the hyperparameter tuning done on mtry we figured out that mtry of 4 gives the lowest OOB error and hence we chose that for our final model.
The predictions for the model are summarised as follows:
t <- table(predict.rf.values,test_data$readmitted)
t
##
## predict.rf.values 0 1
## 0 17384 2144
## 1 310 162
The unweighted misclassification error is:
1-acc.rf
## [1] 0.11515
The weighted misclassification error is :
mce.rf <- (2*t[1,2]+t[2,1])/length(predict.rf.values)
acc_rf_weight <- 1-mce.rf
1-acc_rf_weight
## [1] 0.2299
3.1 Model Comparison
Now that we have a few models developed we need to pick the best of the lot. To do this we use the evaluation criterion called AUC which is the area under the ROC curve. We generate the ROC curve and compute the associated AUC values for the test data using the model trained on the training data.
The ROC curves are shown below:
roc.back <- roc(test_data$readmitted, predict.back)
roc.lass <- roc(test_data$readmitted, predict.lass)
roc.enet <- roc(test_data$readmitted, predict.enet)
roc.rf <- roc(test_data$readmitted, predict.rf[,2])
plot(1-roc.back$specificities, roc.back$sensitivities, col="red", pch=16, cex=.3, xlab="False Positive", ylab="Sensitivity")
lines(1-roc.lass$specificities, roc.lass$sensitivities, col="blue", pch=16, cex=.3)
lines(1-roc.enet$specificities, roc.enet$sensitivities, col="green", pch=16, cex=.3)
lines(1-roc.rf$specificities, roc.rf$sensitivities, col="black", pch=16, cex=.3)
title("Comparison of out-sample ROC curves for the 4 Models")
legend("topleft", legend = c("Backward Selection", "LASSO","Elastic Net", "Random Forest"), lty=c(1,1),lwd = 4,col = c("red", "blue","green", "black"))
The AUCs of the models are summaried below:
auc.back <- auc(roc.back)
auc.lass <- auc(roc.lass)
auc.enet <- auc(roc.enet)
auc.rf <- auc(roc.rf)
auc.sum <- data.frame(model = c("Backward Selection", "LASSO", "Elastic Net", "Random Forest"), AUC = c(auc.back, auc.lass, auc.enet, auc.rf))
auc.sum %>% arrange(desc(AUC))
Based on the ROC curves and the associated AUC values we see that the LASSO gives the largest AUC of 0.656.
Therefore, we select the LASSO as our final model.
4 Conclusion
The modeling procedure involved training four different models on the training data which was around 80% of the total dataset. Hyperparameter tuning and feature selection was done on the training data for each of the models to come up with 4 final models. The performance of these 4 models were compared on the testing data by computing the area under the ROC curves. Based on this criterion, the LASSO performed the best with an AUC of 0.656 and hence was chosen as the final model.
According to LASSO, the following variables are deemed important in predicting the readmission status of a patient:
var.lass
## [1] "time_in_hospital" "num_medications" "number_emergency"
## [4] "number_inpatient" "number_diagnoses" "insulin"
## [7] "diabetesMed" "disch_disp_modified" "age_mod"
## [10] "diag1_mod" "diag3_mod"
The final model is given as:
f.final <- as.formula(paste("readmission~",paste(var.lass,collapse = "+")))
f.final
## readmission ~ time_in_hospital + num_medications + number_emergency +
## number_inpatient + number_diagnoses + insulin + diabetesMed +
## disch_disp_modified + age_mod + diag1_mod + diag3_mod
When tested on the testing data the following are some other performance measures:
t.lass <- table(predict.lass.values, test_data$readmitted)
t.lass
##
## predict.lass.values 0 1
## 0 17552 2206
## 1 142 100
Weighted misclassification error:
mce.lass
## [1] 0.2277
Unweighted misclassification error:
1-acc_lass
## [1] 0.1174
Sensitivity: 0.0433
t.lass[2,2]/sum(test_data$readmitted == "YES")
Specificity: 0.9919
t.lass[1,1]/sum(test_data$readmitted == "NO")
Positive Prediction Rate:
t.lass[2, 2] / (t.lass[2, 1] + t.lass[2, 2])
## [1] 0.4132231
Negative Prediction Rate:
t.lass[1, 1] / (t.lass[1, 1] + t.lass[1, 2])
## [1] 0.888349
It should be noted that the final choice of the model is heavily dependent on the evaluation criterion. Evaluating based on AUC does not necessarily give the same best model as compared to evaluation based on misclassification error. Hence, it is important to understand what evaluation criterion is best suited for a problem statement. Moreover, it is very important to make sure that the model does not overfit or underfit. This reasoning enabled us to not select models that had too many or too few features in them.
5 Appendix
5.1 Description of variables
The dataset used covers ~50 different variables to describe every hospital diabetes admission. In this section we give an overview and brief description of the variables in this dataset.
a) Patient identifiers:
encounter_id
: unique identifier for each admissionpatient_nbr
: unique identifier for each patient
b) Patient Demographics:
race
, age
, gender
, weight
cover the basic demographic information associated with each patient. Payer_code
is an additional variable that identifies which health insurance (Medicare /Medicaid / Commercial) the patient holds.
c) Admission and discharge details:
admission_source_id
andadmission_type_id
identify who referred the patient to the hospital (e.g. physician vs. emergency dept.) and what type of admission this was (Emergency vs. Elective vs. Urgent).discharge_disposition_id
indicates where the patient was discharged to after treatment.
d) Patient Medical History:
num_outpatient
: number of outpatient visits by the patient in the year prior to the current encounternum_inpatient
: number of inpatient visits by the patient in the year prior to the current encounternum_emergency
: number of emergency visits by the patient in the year prior to the current encounter
e) Patient admission details:
medical_specialty
: the specialty of the physician admitting the patientdiag_1
,diag_2
,diag_3
: ICD9 codes for the primary, secondary and tertiary diagnoses of the patient. ICD9 are the universal codes that all physicians use to record diagnoses. There are various easy to use tools to lookup what individual codes mean (Wikipedia is pretty decent on its own)time_in_hospital
: the patient’s length of stay in the hospital (in days)number_diagnoses
: Total no. of diagnosis entered for the patientnum_lab_procedures
: No. of lab procedures performed in the current encounternum_procedures
: No. of non-lab procedures performed in the current encounternum_medications
: No. of distinct medications prescribed in the current encounter
f) Clinical Results:
max_glu_serum
: indicates results of the glucose serum testA1Cresult
: indicates results of the A1c test
g) Medication Details:
diabetesMed
: indicates if any diabetes medication was prescribedchange
: indicates if there was a change in diabetes medication24 medication variables
: indicate whether the dosage of the medicines was changed in any manner during the encounter
h) Readmission indicator:
The is the variable we are trying to predict. Indicates whether a patient was readmitted after a particular admission. In the original dataset there are 3 levels for this variable: “NO” = no readmission, “< 30” = readmission within 30 days and “> 30” = readmission after more than 30 days. The 30 day distinction is of practical importance to hospitals because federal regulations penalize hospitals for an excessive proportion of such readmissions. This is why in the cleaned dataset this variable is modified to be binary wherein “YES” indicates readmission within 30 days and “NO” indicates readmission after more than 30 days or no readmission.