Mining, Classification, Prediction

Samantha Uy Tesy SAU275

Introduction

Can we develop a model that accurately determines whether a patient has diabetes based on particular medical variables? According to the National Institute of Diabetes and Digestive and Kidney Diseases, over 9.4% of the United States population has diabetes, and "more than 1 in 4" of those individuals do not know they have it. This dataset, pulled from Kaggle, contains data on 768 female patients aged 21 or older, with various medical variables including one that records whether or not the patient is diabetic. (Source: https://www.niddk.nih.gov/health-information/diabetes/overview/what-is-diabetes)

The dataset contains nine variables:

  1. Pregnancies: the number of times the patient has been pregnant.
  2. Glucose: the patient's plasma glucose concentration after a glucose tolerance test.
  3. BloodPressure: the patient's diastolic blood pressure (in mm Hg).
  4. SkinThickness: the patient's triceps skin fold thickness (in mm).
  5. Insulin: the patient's 2-hour serum insulin (in mu U/ml).
  6. BMI: the patient's body mass index (weight in kg / height in m^2).
  7. DiabetesPedigreeFunction: a score of the patient's likelihood of diabetes based on family history.
  8. Age: the patient's age in years.
  9. Outcome: a boolean variable [1 or 0] that classifies whether the patient has diabetes [1] or not [0].

### Read Dataset and required libraries
library(tidyverse)
library(cluster)
library(GGally)
library(caret)
library(rpart)
library(rpart.plot)

# hyperlink to dataset
# https://www.kaggle.com/mathchi/diabetes-data-set?select=diabetes.csv

diabetes <- read_csv("diabetes.csv")

diabetes <- diabetes %>% na.omit()

# Preview the first few rows
head(diabetes)
## # A tibble: 6 x 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI DiabetesPedigre…
##         <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>            <dbl>
## 1           6     148            72            35       0  33.6            0.627
## 2           1      85            66            29       0  26.6            0.351
## 3           8     183            64             0       0  23.3            0.672
## 4           1      89            66            23      94  28.1            0.167
## 5           0     137            40            35     168  43.1            2.29 
## 6           5     116            74             0       0  25.6            0.201
## # … with 2 more variables: Age <dbl>, Outcome <dbl>
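
Before modeling, it also helps to check how balanced the response is, since the class split affects how sensitivity and specificity should be read later. A minimal sketch of that check (its output is not shown in the original report):

# Class balance of the response (sketch; output not shown)
table(diabetes$Outcome)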

Cluster Analysis

### Cluster Analysis

# Silhouette analysis, pick 2 clusters based on graph
sil_width <- vector()  #empty vector to hold mean sil width
for (i in 2:10) {
    kms <- kmeans(diabetes, centers = i)  #compute k-means solution for each k
    sil <- silhouette(kms$cluster, dist(diabetes))  #get sil widths
    sil_width[i] <- mean(sil[, 3])  #take averages (higher is better)
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k", 
    breaks = 1:10)
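
Note that the silhouette search above runs on the unscaled data, while the PAM step below clusters scaled data. For consistency, the same search could be run on scaled data; a minimal sketch, not part of the original analysis:

# Variant: silhouette search on scaled data, matching the PAM step below (sketch)
diabetes_scaled <- scale(diabetes)
sil_width_scaled <- vector()
for (i in 2:10) {
    kms <- kmeans(diabetes_scaled, centers = i)
    sil_width_scaled[i] <- mean(silhouette(kms$cluster, dist(diabetes_scaled))[, 3])
}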

# PAM clustering algorithm
diabetes_pam <- diabetes %>% scale() %>% pam(k = 2)

# Representatives for medoids
diabetes %>% slice(diabetes_pam$id.med)
## # A tibble: 2 x 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI DiabetesPedigre…
##         <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>            <dbl>
## 1           6     134            70            23     130  35.4            0.542
## 2           2     112            68            22      94  34.1            0.315
## # … with 2 more variables: Age <dbl>, Outcome <dbl>
plot(diabetes_pam, which = 2)

pamclust <- diabetes %>% mutate(cluster = as.factor(diabetes_pam$clustering))

The goodness of fit (average silhouette width) scored about 0.2, which means that no substantial structure was found. The two patients in the table above are the medoids, i.e. the representatives of their clusters. In terms of the original variables, Cluster 1 tends to have higher Glucose and Insulin.
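
The 0.2 figure is the average silhouette width, which the PAM object stores directly; a quick way to retrieve it rather than reading it off the plot:

# Average silhouette width for the 2-cluster PAM solution (~0.2)
diabetes_pam$silinfo$avg.width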

# Pairwise cluster plots

diabetes1 <- diabetes %>% mutate(cluster = as.factor(diabetes_pam$clustering))

ggpairs(diabetes1, columns = 1:6, aes(color = cluster))

Looking at the pairwise plots, the most distinct difference between Cluster 1 and Cluster 2 is in glucose level: Cluster 1's distribution sits to the right of Cluster 2's, indicating a higher mean Glucose for that group. Although the shift is less pronounced, Cluster 1 also has a slightly higher BMI, which is consistent with other studies of diabetes indicators. Interestingly, Cluster 1 also averages more pregnancies than Cluster 2. Furthermore, the two variables with the highest correlation are SkinThickness and Insulin. According to a study by the American Diabetes Association, there is a statistically significant relationship between skin thickness (determined by collagen content) and higher levels of insulin. (Source: https://care.diabetesjournals.org/content/12/5/309)
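
The relationship between SkinThickness and Insulin noted above can be quantified directly; a minimal sketch computing their Pearson correlation (the value is not shown in the original report):

# Correlation between the two most related predictors (sketch)
cor(diabetes$SkinThickness, diabetes$Insulin)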

Dimensionality Reduction with PCA

### Principal Component Analysis
pca1 <- princomp(na.omit(diabetes), cor = T)
summary(pca1, loadings = T)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     1.5337867 1.3320330 1.0584069 0.93912453 0.91903396
## Proportion of Variance 0.2613891 0.1971458 0.1244695 0.09799499 0.09384705
## Cumulative Proportion  0.2613891 0.4585348 0.5830043 0.68099929 0.77484634
##                            Comp.6     Comp.7     Comp.8    Comp.9
## Standard deviation     0.85724456 0.69887295 0.64666926 0.6204113
## Proportion of Variance 0.08165203 0.05426927 0.04646457 0.0427678
## Cumulative Proportion  0.85649836 0.91076763 0.95723220 1.0000000
## 
## Loadings:
##                          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Pregnancies               0.216  0.527  0.165  0.161  0.212  0.457       
## Glucose                   0.437        -0.391 -0.327  0.109 -0.383 -0.208
## BloodPressure             0.300         0.630               -0.608  0.327
## SkinThickness             0.307 -0.448  0.294         0.172  0.398  0.282
## Insulin                   0.336 -0.355 -0.142         0.650              
## BMI                       0.397 -0.210  0.252 -0.163 -0.527  0.241 -0.573
## DiabetesPedigreeFunction  0.238 -0.175 -0.285  0.874 -0.157 -0.172       
## Age                       0.279  0.533  0.126  0.171  0.204        -0.299
## Outcome                   0.416  0.155 -0.395 -0.182 -0.378  0.171  0.580
##                          Comp.8 Comp.9
## Pregnancies               0.541  0.265
## Glucose                          0.581
## BloodPressure             0.163       
## SkinThickness            -0.435  0.395
## Insulin                   0.269 -0.478
## BMI                       0.166 -0.126
## DiabetesPedigreeFunction              
## Age                      -0.615 -0.289
## Outcome                         -0.318
# Plot PC1 vs PC2, colored by Glucose
pca_plot_data <- data.frame(PC1 = pca1$scores[, 1], PC2 = pca1$scores[, 2],
    Glucose = diabetes$Glucose)

ggplot(pca_plot_data, aes(PC1, PC2, color = Glucose)) + geom_point() +
    theme_minimal()

Six components are needed to explain most of the variance (cumulative proportion above 85%): PC1 through PC6.

  1. PC1 is positive for all variables, indicating that a patient who scores high on PC1 tends to have a high score on all 9 variables in this dataset, since they are all positively correlated with PC1. Conversely, an individual who scores low on PC1 tends to have a low score on all 9 variables.
  2. PC2 is negative for SkinThickness, Insulin, BMI, and DiabetesPedigreeFunction, so a high score on PC2 means a tendency toward low scores on those variables but high scores on Pregnancies and Age, and vice versa for a low score on PC2.
  3. PC3 is negative for Glucose, Insulin, DiabetesPedigreeFunction, and Outcome, so a high score on PC3 means a tendency toward low scores on those variables but high scores on the remaining variables.
  4. PC4 is negative for Glucose, BMI, and Outcome, so a high score on PC4 means a tendency toward low scores on those variables but high scores on Pregnancies, Age, and DiabetesPedigreeFunction.
  5. PC5 is negative for BMI, DiabetesPedigreeFunction, and Outcome, so a high score on PC5 means a tendency toward low scores on those variables but high scores on Pregnancies, Glucose, SkinThickness, and Insulin.
  6. PC6 is negative for Glucose, BloodPressure, and DiabetesPedigreeFunction, so a high score on PC6 means a tendency toward low scores on those variables but high scores on Pregnancies, SkinThickness, BMI, and Outcome.

Furthermore, as seen in the plot, patients who scored high on PC1 tended to have higher values of the Glucose variable (indicated by the lighter blue dots). Thus, PC1 is positively correlated with glucose level.
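
That observation can be checked numerically; a short sketch correlating the PC1 scores with the raw Glucose values (not part of the original output):

# Correlation between PC1 scores and Glucose (sketch; expect a clearly positive value)
cor(pca1$scores[, 1], diabetes$Glucose)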

Linear Classifier

### Logistic Classification using entire data set to train
logistic_fit <- glm(Outcome ~ ., data = diabetes, family = "binomial")
score <- predict(logistic_fit)

class_diag(score, truth = diabetes$Outcome, positive = 1)
##            acc   sens  spec    ppv     f1     ba    auc
## Metrics 0.7721 0.4851 0.926 0.7784 0.5977 0.7055 0.8394

Using the entire dataset to train the model, we get an AUC of 83.94%. AUC is a single number summarizing how well the model's predicted scores discriminate between diabetic and non-diabetic patients overall; by common rule-of-thumb cutoffs, this AUC rates as "good".
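
class_diag() is a course-provided helper whose definition is not shown here. The AUC it reports can be cross-checked with the pROC package; a minimal sketch, assuming pROC is installed (it is not loaded in the original report):

# Cross-check of the AUC using pROC (sketch, not part of the original analysis)
library(pROC)
roc_obj <- roc(diabetes$Outcome, score, quiet = TRUE)
auc(roc_obj)  # AUC is rank-based, so the link-scale scores give the same ~0.8394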

### K-fold Cross Validation of Logistic Regression
set.seed(2)

# omit NAs
diabetes1 <- diabetes %>% na.omit()

k = 10  #choose number of folds

data <- diabetes1[sample(nrow(diabetes1)), ]  #randomly order rows
folds <- cut(seq(1:nrow(diabetes1)), breaks = k, labels = F)  #create 10 folds

diags <- NULL
for (i in 1:k) {
    ## Create training and test sets
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    truth <- test$Outcome
    
    ## Train model on training set
    fit <- glm(Outcome ~ ., data = train, family = "binomial")
    probs <- predict(fit, newdata = test, type = "response")
    
    ## Test model on test set (save all k results)
    diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}

summarize_all(diags, mean)
##       acc   sens    spec     ppv      f1      ba     auc
## 1 0.77601 0.5729 0.88609 0.72739 0.63622 0.72949 0.82946

In the k-fold cross-validation procedure, we split the data into 10 folds (randomizing the row order first to break any ordering pattern in the data), repeatedly train the model on nine folds and test it on the held-out fold, and then average the diagnostics across all 10 folds. This procedure produced an AUC of 82.9%, only about 1 percentage point below the original logistic fit; therefore, the model likely does not suffer from overfitting.
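
For comparison, the same kind of 10-fold cross-validation can be run through caret, which is already loaded. A minimal sketch, not the author's method; the factor labels "Not" and "Diabetic" are introduced here only because caret's classProbs option requires named levels:

# Equivalent CV via caret (sketch)
set.seed(2)
diabetes_f <- diabetes1 %>% mutate(Outcome = factor(Outcome, labels = c("Not", "Diabetic")))
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
    summaryFunction = twoClassSummary)
cv_glm <- train(Outcome ~ ., data = diabetes_f, method = "glm",
    trControl = ctrl, metric = "ROC")
cv_glm$results$ROC  # cross-validated AUC, expected near 0.83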

Non-Parametric Classifier

### Non-parametric Classifier: k-Nearest Neighbors

# Fit k-Nearest Neighbors model
knn_fit <- knn3(Outcome ~ ., data = diabetes1, k = 5)
y_hat_knn <- predict(knn_fit, diabetes1)


class_diag(y_hat_knn[, 2], diabetes1$Outcome, positive = 1)
##            acc  sens  spec    ppv     f1     ba    auc
## Metrics 0.8034 0.653 0.884 0.7511 0.6986 0.7685 0.8719

Using k-Nearest Neighbors to classify this dataset, we get an AUC of 87.19%, which rates as "good".
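
The fit above uses k = 5 neighbors. Since k is a tuning parameter, it could instead be chosen by cross-validation; a hedged caret sketch (the grid of odd k values and the centering/scaling step are illustrative assumptions, not part of the original analysis):

# Tuning k by cross-validation (sketch)
set.seed(2)
diabetes_f <- diabetes1 %>% mutate(Outcome = factor(Outcome, labels = c("Not", "Diabetic")))
knn_tune <- train(Outcome ~ ., data = diabetes_f, method = "knn",
    preProcess = c("center", "scale"),
    trControl = trainControl(method = "cv", number = 10),
    tuneGrid = data.frame(k = seq(3, 21, by = 2)))
knn_tune$bestTune  # cross-validated choice of k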

# k-Folds Cross-validation of k-Nearest Neighbors
set.seed(2)
k = 10  #choose number of folds

data <- diabetes1[sample(nrow(diabetes1)), ]  #randomly order rows
folds <- cut(seq(1:nrow(diabetes1)), breaks = k, labels = F)  #create folds

diags <- NULL
for (i in 1:k) {
    ## Create training and test sets
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    truth <- test$Outcome  ## Truth labels for fold i
    
    ## Train model on training set (all but fold i)
    fit <- knn3(Outcome ~ ., data = train, k = 5)  # same k as the fit above
    
    ## Test model on test set (fold i)
    probs <- predict(fit, newdata = test)[, 2]
    
    ## Get diagnostics for fold i
    diags <- rbind(diags, class_diag(probs, truth, positive = 1))
}

summarize_all(diags, mean)
##       acc    sens    spec    ppv     f1      ba    auc
## 1 0.72271 0.53612 0.82424 0.6334 0.5737 0.68016 0.7532

Using the k-fold cross-validation procedure, I obtained an AUC of 75.3%, roughly 12 percentage points below the original knn3 fit; therefore, the model suffers from overfitting. Furthermore, compared to the logistic fit and its cross-validation, the k-Nearest Neighbors model performed clearly worse in cross-validated AUC (about 7.6 percentage points lower).
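
Collecting the four AUCs reported above makes the comparison explicit; the values below are copied from the outputs earlier in this section (tribble() comes with the already-loaded tidyverse):

# AUC summary; values taken from the outputs above
tribble(
    ~model, ~in_sample_auc, ~cv_auc,
    "Logistic regression", 0.8394, 0.8295,
    "k-Nearest Neighbors", 0.8719, 0.7532
)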

Regression/Numeric Prediction

### Regression Tree

# Recode the Outcome labels for easier interpretation
diabetes2 <- diabetes1 %>% mutate(Outcome = ifelse(Outcome == 
    1, "Diabetic", "Not"))

# Fit regression tree (caret tunes the cp pruning parameter)
reg_fit <- train(Glucose ~ ., data = diabetes1, method = "rpart")
rpart.plot(reg_fit$finalModel, digits = 4)

# Calculate MSE overall
y_hat_reg <- predict(reg_fit)
mean((diabetes1$Glucose - y_hat_reg)^2)
## [1] 740.8652

The average value for Glucose across the whole dataset is 120.9. The first split separates patients with diabetes from those without. Among those who have diabetes (Outcome does not equal 0), the average glucose level is 141.3; among those who do not (Outcome equals 0), it is 110. The second split is on Insulin: patients with an insulin level below 126 have an average glucose level of 105.4, while those with an insulin level of 126 or more average 129.2.

The model chooses its splits to minimize the mean squared error (MSE); over the full dataset, the fitted tree's MSE is 740.9.
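
For intuition, the square root of the MSE (the RMSE) puts the same error back on the glucose scale; a one-line sketch:

# RMSE: typical prediction error on the original glucose scale
sqrt(mean((diabetes1$Glucose - y_hat_reg)^2))  # about 27.2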

# Regression Tree Cross Validation
set.seed(2)
# 5-fold cross-validation (class probabilities are not applicable to regression)
cv_tree <- trainControl(method = "cv", number = 5, savePredictions = T)
fit <- train(Glucose ~ ., data = diabetes1, trControl = cv_tree, 
    method = "rpart")

# Calculate MSE from CV
min(fit$results$RMSE)^2
## [1] 743.406

The MSE after cross-validating is 743.4, a relatively small difference (an increase of only about 2.5). Thus, the model does not appear to suffer from overfitting.

Python

library(reticulate)

# select python3 version
use_python("/usr/bin/python3")

x <- 6
class(x)
## [1] "numeric"
# Python code chunk: define a variable and combine it with the R object

y = 2

# Access R-defined objects through the r object
print(r.x * y)
## 12.0
print(type(r.x))
## <class 'float'>

In the R code chunk, we load the reticulate library so that objects can be shared between the R and Python environments, and assign the number 6 to x. In the Python code chunk, we assign the number 2 to y and use Python's print() function to multiply the two variables, accessing the R variable as r.x. And in fact, 2 times 6 equals 12. Note that the value arrives in Python as a float, even though it was assigned as a numeric in R.
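
Sharing also works in the other direction: back in an R chunk, reticulate exposes Python objects through py. A minimal sketch, not part of the original report:

# Back in R: read the Python-defined y through reticulate's py object (sketch)
py$y      # 2, converted from a Python int
x * py$y  # 12, computed entirely in R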

Concluding Remarks

In conclusion, Glucose alone is a decent predictor for classifying whether or not a patient is diabetic, but the models improve when multiple variables are used together.