Can we develop a model that accurately determines whether a patient has diabetes based on particular medical variables? According to the National Institute of Diabetes and Digestive and Kidney Diseases, over 9.4% of the United States population has diabetes, with “more than 1 in 4 [individuals]” not knowing they have it (Source: https://www.niddk.nih.gov/health-information/diabetes/overview/what-is-diabetes). This dataset, pulled from Kaggle, contains data on 768 female patients aged 21 or older, with various medical variables including one that records whether or not the patient is diabetic.
The dataset includes nine variables:

- `Pregnancies` counts the number of times the patient has been pregnant.
- `Glucose` measures the patient's plasma glucose concentration after a glucose tolerance test.
- `BloodPressure` measures the patient's diastolic blood pressure (in mm Hg).
- `SkinThickness` measures the patient's triceps skin fold thickness (in mm).
- `Insulin` measures the patient's 2-hour serum insulin (in μU/mL).
- `BMI` measures the patient's body mass index (weight in kg / height in m^2).
- `DiabetesPedigreeFunction` scores the patient's likelihood of diabetes based on family history.
- `Age` is the age of the patient in years.
- `Outcome` is a boolean variable that classifies whether the patient has diabetes [1] or not [0].
### Read Dataset and Required Libraries
library(tidyverse)
library(cluster)
library(GGally)
library(caret)
library(rpart)
library(rpart.plot)
# hyperlink to dataset
# https://www.kaggle.com/mathchi/diabetes-data-set?select=diabetes.csv
diabetes <- read_csv("diabetes.csv")
diabetes <- diabetes %>% na.omit()  # drop rows with literal NAs (none here)
head(diabetes)
## # A tibble: 6 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigre…
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 148 72 35 0 33.6 0.627
## 2 1 85 66 29 0 26.6 0.351
## 3 8 183 64 0 0 23.3 0.672
## 4 1 89 66 23 94 28.1 0.167
## 5 0 137 40 35 168 43.1 2.29
## 6 5 116 74 0 0 25.6 0.201
## # … with 2 more variables: Age <dbl>, Outcome <dbl>
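Note that `na.omit()` keeps all 768 rows here because the file contains no literal `NA`s; in this dataset, missing measurements are commonly encoded as zeros in columns such as `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` (visible in the first rows above). A sketch of how those implausible zeros could be recoded as missing, if one wanted to treat them that way (not applied in the analysis that follows):

# Sketch (not applied below): recode physiologically impossible zeros as NA
diabetes_clean <- diabetes %>%
    mutate(across(c(Glucose, BloodPressure, SkinThickness, Insulin, BMI),
        ~na_if(.x, 0)))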
### Cluster Analysis
# Silhouette analysis: pick k = 2 based on the graph below
sil_width <- vector() #empty vector to hold mean sil width
for (i in 2:10) {
kms <- kmeans(diabetes, centers = i) #compute k-means solution for each k
sil <- silhouette(kms$cluster, dist(diabetes)) #get sil widths
sil_width[i] <- mean(sil[, 3]) #take averages (higher is better)
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k",
breaks = 1:10)
# PAM clustering algorithm
diabetes_pam <- diabetes %>% scale() %>% pam(k = 2)
# Medoids: representative patients for each cluster
diabetes %>% slice(diabetes_pam$id.med)
## # A tibble: 2 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigre…
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 134 70 23 130 35.4 0.542
## 2 2 112 68 22 94 34.1 0.315
## # … with 2 more variables: Age <dbl>, Outcome <dbl>
plot(diabetes_pam, which = 2)
The goodness of fit scored about 0.2 (the average silhouette width), which means that no substantial cluster structure was found. Furthermore, the two patients in the table above are the medoids, i.e. the representatives of their clusters. In terms of the original variables, Cluster 1 tends to have higher `Glucose` and `Insulin`.
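This figure can be read directly off the fitted PAM object; a quick check using the `diabetes_pam` object created above:

# Average silhouette width; values near 0.2 indicate weak structure
diabetes_pam$silinfo$avg.width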
# Pairwise plots colored by cluster
diabetes1 <- diabetes %>% mutate(cluster = as.factor(diabetes_pam$clustering))
ggpairs(diabetes1, columns = 1:6, aes(color = cluster))
Looking at the pairwise plots, the most distinct difference between Cluster 1 and Cluster 2 is in `Glucose` level: Cluster 1's distribution sits to the right of Cluster 2's, indicating that that group has a higher mean `Glucose` value. Although the shift is smaller, Cluster 1 also has a slightly higher `BMI`, which is consistent with other studies of diabetes indicators. Interestingly, Cluster 1 also has more pregnancies on average than Cluster 2. Furthermore, the two variables with the highest correlation are `SkinThickness` and `Insulin`. According to a study by the American Diabetes Association, there is a statistically significant relationship between skin thickness (determined by collagen content) and higher levels of insulin. (Source: https://care.diabetesjournals.org/content/12/5/309)
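These visual impressions can be checked numerically. A short sketch that compares the per-cluster means and the `SkinThickness`/`Insulin` correlation, using the `diabetes1` object created above:

# Mean of each variable within each cluster
diabetes1 %>% group_by(cluster) %>% summarize(across(where(is.numeric), mean))
# Correlation between skin fold thickness and insulin
cor(diabetes$SkinThickness, diabetes$Insulin)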
### Principal Component Analysis
pca1 <- princomp(na.omit(diabetes), cor = T)
summary(pca1, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.5337867 1.3320330 1.0584069 0.93912453 0.91903396
## Proportion of Variance 0.2613891 0.1971458 0.1244695 0.09799499 0.09384705
## Cumulative Proportion 0.2613891 0.4585348 0.5830043 0.68099929 0.77484634
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.85724456 0.69887295 0.64666926 0.6204113
## Proportion of Variance 0.08165203 0.05426927 0.04646457 0.0427678
## Cumulative Proportion 0.85649836 0.91076763 0.95723220 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Pregnancies 0.216 0.527 0.165 0.161 0.212 0.457
## Glucose 0.437 -0.391 -0.327 0.109 -0.383 -0.208
## BloodPressure 0.300 0.630 -0.608 0.327
## SkinThickness 0.307 -0.448 0.294 0.172 0.398 0.282
## Insulin 0.336 -0.355 -0.142 0.650
## BMI 0.397 -0.210 0.252 -0.163 -0.527 0.241 -0.573
## DiabetesPedigreeFunction 0.238 -0.175 -0.285 0.874 -0.157 -0.172
## Age 0.279 0.533 0.126 0.171 0.204 -0.299
## Outcome 0.416 0.155 -0.395 -0.182 -0.378 0.171 0.580
## Comp.8 Comp.9
## Pregnancies 0.541 0.265
## Glucose 0.581
## BloodPressure 0.163
## SkinThickness -0.435 0.395
## Insulin 0.269 -0.478
## BMI 0.166 -0.126
## DiabetesPedigreeFunction
## Age -0.615 -0.289
## Outcome -0.318
# Plot PC1 vs PC2 scores, colored by Glucose
pca_plot_data <- data.frame(PC1 = pca1$scores[, 1], PC2 = pca1$scores[, 2],
    Glucose = diabetes$Glucose)
ggplot(pca_plot_data, aes(PC1, PC2)) + geom_point(aes(color = Glucose)) +
    theme_minimal()
Six components are needed to explain most of the variance (i.e. >85% cumulatively): PC1 through PC6.

- `PC1` has positive loadings for all variables, so a patient who scores high on PC1 tends to have a high score for all 9 variables in this dataset, since they are positively correlated; conversely, a patient who scores low on PC1 tends to score low on all of them.
- `PC2` is negative for `SkinThickness`, `Insulin`, `BMI`, and `DiabetesPedigreeFunction`; a patient who scores high on PC2 tends to score low on those variables but high on `Age` and `Pregnancies`, and vice versa.
- `PC3` is negative for `Glucose`, `Insulin`, and `DiabetesPedigreeFunction`; scoring high on PC3 means scoring low on those variables but high on all the others.
- `PC4` is negative for `BMI` and `Glucose`; scoring high on PC4 means scoring low on those variables but high on `Pregnancies`, `Age`, and `DiabetesPedigreeFunction`.
- `PC5` is negative for `BMI` and `DiabetesPedigreeFunction`; scoring high on PC5 means scoring low on those variables but high on `Pregnancies`, `Glucose`, `SkinThickness`, and `Insulin`.
- `PC6` is negative for `Glucose`, `BloodPressure`, and `DiabetesPedigreeFunction`; scoring high on PC6 means scoring low on those variables but high on `Pregnancies`, `SkinThickness`, and `BMI`.

Furthermore, as seen in the plot, patients who scored high on PC1 tended to have higher values for the `Glucose` variable (indicated by lighter blue dots). Thus, PC1 is positively correlated with glucose level.
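This claim is easy to verify directly by correlating the PC1 scores with the raw `Glucose` values (a quick check, not part of the original output):

# Correlation between PC1 scores and Glucose
cor(pca1$scores[, 1], diabetes$Glucose)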
### Logistic Classification Trained on the Entire Dataset
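The diagnostics below use `class_diag()`, a course-provided helper rather than a function from any package loaded above. A minimal sketch of what such a helper plausibly computes from scores, true labels, and the positive class (the actual implementation may differ, e.g. in cutoff handling):

# Minimal sketch of a class_diag()-style helper (assumed implementation)
class_diag <- function(score, truth, positive, cutoff = 0.5) {
    # For log-odds scores (e.g. predict() on a glm), a cutoff of 0
    # corresponds to a probability of 0.5
    pred <- score > cutoff
    pos <- truth == positive
    tp <- sum(pred & pos); tn <- sum(!pred & !pos)
    fp <- sum(pred & !pos); fn <- sum(!pred & pos)
    sens <- tp/(tp + fn); spec <- tn/(tn + fp); ppv <- tp/(tp + fp)
    # AUC via the Mann-Whitney formulation: P(positive scores rank higher)
    auc <- mean(outer(score[pos], score[!pos], ">") +
        0.5 * outer(score[pos], score[!pos], "=="))
    data.frame(acc = (tp + tn)/length(truth), sens = sens, spec = spec,
        ppv = ppv, f1 = 2 * sens * ppv/(sens + ppv),
        ba = (sens + spec)/2, auc = auc, row.names = "Metrics")
}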
logistic_fit <- glm(Outcome ~ ., data = diabetes, family = "binomial")
score <- predict(logistic_fit)  # predicted log-odds for each patient
class_diag(score, truth = diabetes$Outcome, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.7721 0.4851 0.926 0.7784 0.5977 0.7055 0.8394
Using the entire dataset to train the model, we get an AUC of 83.94%. AUC is a single measure that quantifies how well the model predicts overall, and this AUC scores as “Good”.
### K-fold Cross Validation of Logistic Regression
set.seed(2)
# omit NAs
diabetes1 <- diabetes %>% na.omit()
k = 10 #choose number of folds
data <- diabetes1[sample(nrow(diabetes1)), ] #randomly order rows
folds <- cut(seq(1:nrow(diabetes1)), breaks = k, labels = F) #create 10 folds
diags <- NULL
for (i in 1:k) {
## Create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Outcome
## Train model on training set
fit <- glm(Outcome ~ ., data = train, family = "binomial")
probs <- predict(fit, newdata = test, type = "response")
## Test model on test set (save all k results)
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.77601 0.5729 0.88609 0.72739 0.63622 0.72949 0.82946
Using a k-fold cross-validation procedure, we split the data into 10 folds (randomizing the rows first to break any ordering in the data), train the model on the other nine folds, test it on each held-out fold, and average the results across all 10 folds. This procedure produced an AUC of 82.9%, only about one percentage point below the original logistic fit; therefore, the model likely does not suffer from overfitting.
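The same cross-validation can also be done with caret's built-in machinery; a sketch, where the "Not"/"Diabetic" labels are illustrative and `twoClassSummary` reports the cross-validated AUC in the `ROC` column:

# Sketch: 10-fold CV of the logistic model via caret
diab_cv <- diabetes1 %>%
    mutate(Outcome = factor(Outcome, labels = c("Not", "Diabetic")))
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
    summaryFunction = twoClassSummary)
cv_log <- train(Outcome ~ ., data = diab_cv, method = "glm",
    family = "binomial", trControl = ctrl, metric = "ROC")
cv_log$results  # the ROC column holds the cross-validated AUC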
### Non-parametric Classifier: k-Nearest Neighbors
# Fit k-Nearest Neighbors model with k = 5
knn_fit <- knn3(Outcome ~ ., data = diabetes1, k = 5)
y_hat_knn <- predict(knn_fit, diabetes1)  # class probabilities
class_diag(y_hat_knn[, 2], diabetes1$Outcome, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.8034 0.653 0.884 0.7511 0.6986 0.7685 0.8719
Using k-Nearest Neighbors to classify this dataset, we get an AUC of 87.19%, which is “Good”.
# k-fold cross-validation of k-Nearest Neighbors
set.seed(2)
k = 10 #choose number of folds
data <- diabetes1[sample(nrow(diabetes1)), ] #randomly order rows
folds <- cut(seq(1:nrow(diabetes1)), breaks = k, labels = F) #create folds
diags <- NULL
for (i in 1:k) {
## Create training and test sets
train <- data[folds != i, ]
test <- data[folds == i, ]
truth <- test$Outcome ## Truth labels for fold i
## Train model on training set (all but fold i)
fit <- knn3(Outcome ~ ., data = train, k = 5)  # same k as above
## Test model on test set (fold i)
probs <- predict(fit, newdata = test)[, 2]
## Get diagnostics for fold i
diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}
summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.72271 0.53612 0.82424 0.6334 0.5737 0.68016 0.7532
Using a k-fold cross-validation procedure, I obtained an AUC of 75.3%, which is much lower than the original knn3 fit (about 12 percentage points); therefore, the model suffers from overfitting. Furthermore, compared to the logistic fit and its k-fold cross-validation, the k-Nearest Neighbors model performed worse in terms of cross-validated AUC (about 8 percentage points lower).
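One standard response to this overfitting is to tune k rather than fixing it at 5. A sketch using caret, reusing the `diab_cv` and `ctrl` objects from the logistic CV sketch above (the grid of k values is illustrative):

# Sketch: choose k by 10-fold CV AUC over an illustrative grid
knn_tuned <- train(Outcome ~ ., data = diab_cv, method = "knn",
    trControl = ctrl, metric = "ROC",
    tuneGrid = data.frame(k = seq(5, 45, by = 10)))
knn_tuned$bestTune  # CV-selected number of neighbors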
### Regression Tree

# Refactor data for easier interpretation
diabetes2 <- diabetes1 %>% mutate(Outcome = ifelse(Outcome ==
1, "Diabetic", "Not"))
# Fit regression tree (caret tunes the cp pruning parameter)
reg_fit <- train(Glucose ~ ., data = diabetes1, method = "rpart")
rpart.plot(reg_fit$finalModel, digits = 4)
# Calculate MSE overall
y_hat_reg <- predict(reg_fit)
mean((diabetes1$Glucose - y_hat_reg)^2)
## [1] 740.8652
The average value of `Glucose` across the whole dataset is 120.9. The first split separates patients who have diabetes from those who do not: those with diabetes (`Outcome` not equal to 0) have an average glucose level of 141.3, while those without (`Outcome` equal to 0) average 110. The second split is on `Insulin`: patients with an insulin level below 126 average a glucose level of 105.4, while those above 126 average 129.2. This model chooses splits to reduce the mean squared error (MSE), which is 740.9 overall.
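The split values quoted above can also be read straight off the fitted tree; printing the final rpart model shows each node's split, sample size, and mean `Glucose`:

# Print the fitted tree: node splits, n, and mean Glucose per node
reg_fit$finalModel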
# Regression Tree Cross Validation
set.seed(2)
# classProbs applies only to classification, so it is omitted here
cv_tree <- trainControl(method = "cv", number = 5, savePredictions = TRUE)
fit <- train(Glucose ~ ., data = diabetes1, trControl = cv_tree,
method = "rpart")
# Calculate MSE from CV
min(fit$results$RMSE)^2
## [1] 743.406
The MSE after cross-validation is 743.4, a relatively small difference (an increase of about 2.5). Thus, the model does not appear to suffer from overfitting.
library(reticulate)
# select python3 version
use_python("/usr/bin/python3")
x <- 6
class(x)
## [1] "numeric"
# Python chunk
y = 2
# access R-defined objects through the r object, e.g. r.x
print(r.x*y)
## 12.0
print(type(r.x))
## <class 'float'>
In the first R chunk, we loaded the reticulate library so that variables can be shared between the R and Python environments, and assigned the number 6 to `x`. In the Python chunk, I assigned the number 2 to `y` and used Python's `print()` function to multiply the two variables together, accessing the R-defined variable as `r.x`. And in fact, 2 times 6 equals 12. Note that the value was converted to a Python float, even though it was numeric when assigned in R.
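Sharing also works in the other direction: Python objects are exposed to R through reticulate's `py` object. A short sketch, run from an R chunk:

# access the Python-defined y from R; reticulate converts it to numeric
py$y
class(py$y)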
In conclusion, `Glucose` on its own is a decent measure for classifying whether or not a patient is diabetic. However, the models improve when multiple variables are used together for classification.