Data Wrangling, Exploration, Visualization

Samantha Uy Tesy

Introduction

What effect access to health care insurance have on COVID-19 deaths in each state? This analysis explores how accessibility to health care, a growing problem in the United States, may have prevented individuals who were sick from seeking medical treatment.

“Indicators of Health Insurance Coverage” (insurance_data) supplies census-type data from interviews on insurance status during the COVID-19 pandemic. “Provisional Death Counts for Influenza, Pneumonia, and COVID-19” supplies demographic and state data over COVID-19 testing, including number of positives, negatives, and deaths as a result of COVID-19, Pneumonia, or Influenza from January 1, 2020 to date. Both datasets were acquired from Kaggle.com. In this analysis, we will be looking at “peak” COVID-19 months, so I expect to find high rates of infection and death as well as some relationship between proportion died to proportion insured for each state.

# load libraries
library(tidyverse)
library(dplyr)
library(kableExtra)
library(ggplot2)

# read csv files
covid_data <- read_csv("us_states_covid19_daily.csv")
insurance_data <- read_csv("insurance.csv")

# inspecting data
head(covid_data)
## # A tibble: 6 x 55
##     date state positive probableCases negative pending totalTestResult…
##    <dbl> <chr>    <dbl>         <dbl>    <dbl>   <dbl> <chr>           
## 1 2.02e7 AK       35720            NA  1042056      NA totalTestsViral 
## 2 2.02e7 AL      269877         45962  1421126      NA totalTestsPeopl…
## 3 2.02e7 AR      170924         22753  1614979      NA totalTestsViral 
## 4 2.02e7 AS           0            NA     2140      NA totalTestsViral 
## 5 2.02e7 AZ      364276         12590  2018813      NA totalTestsPeopl…
## 6 2.02e7 CA     1341700            NA 23853346      NA totalTestsViral 
## # … with 48 more variables: totalTestResults <dbl>,
## #   hospitalizedCurrently <dbl>, hospitalizedCumulative <dbl>,
## #   inIcuCurrently <dbl>, inIcuCumulative <dbl>, onVentilatorCurrently <dbl>,
## #   onVentilatorCumulative <dbl>, recovered <dbl>, dataQualityGrade <chr>,
## #   lastUpdateEt <chr>, dateModified <dttm>, checkTimeEt <chr>, death <dbl>,
## #   hospitalized <dbl>, dateChecked <dttm>, totalTestsViral <dbl>,
## #   positiveTestsViral <dbl>, negativeTestsViral <dbl>,
## #   positiveCasesViral <dbl>, deathConfirmed <dbl>, deathProbable <dbl>,
## #   totalTestEncountersViral <dbl>, totalTestsPeopleViral <dbl>,
## #   totalTestsAntibody <dbl>, positiveTestsAntibody <dbl>,
## #   negativeTestsAntibody <dbl>, totalTestsPeopleAntibody <dbl>,
## #   positiveTestsPeopleAntibody <dbl>, negativeTestsPeopleAntibody <dbl>,
## #   totalTestsPeopleAntigen <dbl>, positiveTestsPeopleAntigen <dbl>,
## #   totalTestsAntigen <dbl>, positiveTestsAntigen <dbl>, fips <chr>,
## #   positiveIncrease <dbl>, negativeIncrease <dbl>, total <dbl>,
## #   totalTestResultsIncrease <dbl>, posNeg <dbl>, deathIncrease <dbl>,
## #   hospitalizedIncrease <dbl>, hash <chr>, commercialScore <dbl>,
## #   negativeRegularScore <dbl>, negativeScore <dbl>, positiveScore <dbl>,
## #   score <dbl>, grade <lgl>
head(insurance_data)
## # A tibble: 6 x 15
##   Indicator Group State Subgroup Phase `Time Period` `Time Period La…
##   <chr>     <chr> <chr> <chr>    <dbl>         <dbl> <chr>           
## 1 Uninsure… Nati… Unit… United …     1             1 Apr 23 - May 5,…
## 2 Uninsure… By A… Unit… 18 - 24…     1             1 Apr 23 - May 5,…
## 3 Uninsure… By A… Unit… 25 - 34…     1             1 Apr 23 - May 5,…
## 4 Uninsure… By A… Unit… 35 - 44…     1             1 Apr 23 - May 5,…
## 5 Uninsure… By A… Unit… 45 - 64…     1             1 Apr 23 - May 5,…
## 6 Uninsure… By S… Unit… Male         1             1 Apr 23 - May 5,…
## # … with 8 more variables: `Time Period Start Date` <chr>, `Time Period End
## #   Date` <chr>, Value <dbl>, `Low CI` <dbl>, `High CI` <dbl>, `Confidence
## #   Interval` <chr>, `Quartile Range` <chr>, `Suppression Flag` <dbl>

Tidying

Preliminary Cleaning

Tidying and reshaping will be broken up into two steps: “Preliminary Cleaning” and “Reshaping”. I will first be cleaning the data because it take a sufficient amount of cleaning to properly join later on, then I will reshape in the subsequent next steps. The two tables have different date formats and time periods that do not match up. Therefore, I will first transform the date variable in covid_data into a numeric and match based on the formatting of Time Period Label in insurance_data.

# manually tidy data to for subsequent join

# turn date variable into numeric
covid_data$date <- covid_data$date %>% as.numeric()

# match date and relabel for subsequent match
# insurance_data.csv
first_week <- filter(covid_data, between(covid_data$date, "20200423", 
    "20200505")) %>% mutate(`Time Period Label` = "Apr 23 - May 5, 2020")
second_week <- filter(covid_data, between(covid_data$date, "20200507", 
    "20200512")) %>% mutate(`Time Period Label` = "May 7 - May 12, 2020")
third_week <- filter(covid_data, between(covid_data$date, "20200514", 
    "20200519")) %>% mutate(`Time Period Label` = "May 14 - May 19, 2020")
fourth_week <- filter(covid_data, between(covid_data$date, "20200521", 
    "20200526")) %>% mutate(`Time Period Label` = "May 21 - May 26, 2020")
fifth_week <- filter(covid_data, between(covid_data$date, "20200528", 
    "20200602")) %>% mutate(`Time Period Label` = "May 28 - June 2, 2020")
sixth_week <- filter(covid_data, between(covid_data$date, "20200604", 
    "20200609")) %>% mutate(`Time Period Label` = "June 4 - June 9, 2020")
seventh_week <- filter(covid_data, between(covid_data$date, "20200611", 
    "20200616")) %>% mutate(`Time Period Label` = "June 11 - June 16, 2020")
eighth_week <- filter(covid_data, between(covid_data$date, "20200618", 
    "20200623")) %>% mutate(`Time Period Label` = "June 18 - June 23, 2020")
ninth_week <- filter(covid_data, between(covid_data$date, "20200625", 
    "20200630")) %>% mutate(`Time Period Label` = "June 25 - June 30, 2020")

# combine data for dates of interest
covid_new_data <- rbind(rbind(rbind(rbind(rbind(rbind(rbind(rbind(first_week, 
    second_week), third_week), fourth_week), fifth_week), sixth_week), 
    seventh_week), eighth_week), ninth_week)

# remove subgroup for convenience and filter by state to
# remove other demographic data
insurance_data_1 <- insurance_data %>% select(-Subgroup) %>% 
    filter(Group == "By State")

# recode State data into state codes for joining
insurance_data_1$State <- state.abb[match(insurance_data_1$State, 
    state.name)]

Reshaping Data

Now, I will pivot the data to a more intuitive format. I used the pivot_wider() function to take the pivot the rows in “Indicator”, which describe insurance status (uninsured, publicly insured, and privately insured), and “Values” which describe the proportion of the population in each group, and transform them into a single column.

# reshaping data
insurance_data_clean <- insurance_data_1 %>% select(State, `Time Period Label`, 
    Indicator, Value) %>% # use pivot_wider to transform the row data into a column for
# the proportion uninsured
pivot_wider(names_from = Indicator, values_from = Value) %>% 
    # arrange by state for data validation
arrange(State)

# head(insurance_data_clean)

Joining/Merging

I will use a left join to join the tidy insurance data to the tidy covid data. Using the left_join() function insures that the necessary data from the covid dataset (e.g. positive [results], deaths, etc.) is preserved in the process. The data was joined by 2 unique IDs to ensure the data matched correctly by Time Period Label and State.

# join data employing a left join, join by 2 IDs that do not
# have same variable name
joined_data <- left_join(covid_new_data, insurance_data_clean, 
    by = "Time Period Label", c(state = "State"))

# final clean of joined data
joined_data_clean <- joined_data %>% select(state, State, "Time Period Label", 
    positive, negative, totalTestResults, death, "Uninsured at the Time of Interview") %>% 
    # remove duplicates
filter(state == State) %>% select(-State)

joined_data_clean
## # A tibble: 3,050 x 7
##    state `Time Period La… positive negative totalTestResults death
##    <chr> <chr>               <dbl>    <dbl>            <dbl> <dbl>
##  1 AK    Apr 23 - May 5,…      371    22321            22692     9
##  2 AL    Apr 23 - May 5,…     8285    98481           106766   313
##  3 AR    Apr 23 - May 5,…     3496    51139            54635    83
##  4 AZ    Apr 23 - May 5,…     9305    78955            88260   395
##  5 CA    Apr 23 - May 5,…    56212   723690           779902  2317
##  6 CO    Apr 23 - May 5,…    17364    68612            96666   903
##  7 CT    Apr 23 - May 5,…    30621    78022           108643  2633
##  8 DE    Apr 23 - May 5,…     5371    19309            35679   266
##  9 FL    Apr 23 - May 5,…    36183   428252           461682  1536
## 10 GA    Apr 23 - May 5,…    29711   171172           190322  1288
## # … with 3,040 more rows, and 1 more variable: `Uninsured at the Time of
## #   Interview` <dbl>

There are no missing IDs in the joined dataset. Since both datasets were formatted and organized in the preliminary cleaning stage, their information match up by demographic and time. However, the rows required some wrangling before joining. For example, the insurance data had a “Group” column, which I filtered into “States”, so the data was identifiable by state, only. Furthermore, the dates in covid_data were in a daily format, so they were wrangled to match the time periods in insurance_data.

There were a total of 15,633 observations in covid_data and 8,217 observations in the insurance data. After joining and cleaning, the final dataset has 3050 observations, which are each of the dates of interest for each of the 50 states. There were several observations that were dropped for the final dataset, but most if not all rows that were dropped were not variables of interest (e.g. gender, race) or fell out of the desired time interval of interest. The size of joined_data_clean is considerably smaller, but it is important to note it is a 3 month subsample and only state data was pulled.

Wrangling

In the following code block, I added two columns, prop tested positive, which finds the proportion of positives over the number of individuals who took the test, and prop died which finds the proportion of deaths over the number of individuals who tested positive (for each time period for every state). Using final_data, I applied various data wrangling techniques to create summary statistics by subgroup and for the entire data set.

  1. Table 1 shows the first 6 observations of the finalized dataset to be used in subsequent tables
  2. Table 2 shows summary statistics grouped by month
  3. Table 3 shows each variables mean for each state
  4. Table 4 is an interactive table that shows summary statistics for each state
  5. Table 5 shows how many NA variables there for each column in final_data

Note: The filter() function is not used here as there was some datawrangling in the preliminary cleaning stages.

# wrangling 1: add columns
final_data <- joined_data_clean %>% # create two variables as proportions of total tests
mutate(`prop tested positive` = positive/totalTestResults, `prop died` = death/positive, 
    month = substr(`Time Period Label`, 1, 3))

# examine column headings colnames(final_data)

# convert column names into similar format for convenient
# wrangling using regex
names(final_data)[5] <- gsub("([a-z])([A-Z])", "\\1 \\2", colnames(final_data[5])) %>% 
    str_to_lower()
names(final_data)[2] <- str_to_lower(colnames(final_data[2]))
names(final_data)[7] <- str_to_lower(colnames(final_data[7]))

# final data table
final_data %>% select(-`time period label`) %>% head()
## # A tibble: 6 x 9
##   state positive negative `total test res… death `uninsured at t…
##   <chr>    <dbl>    <dbl>            <dbl> <dbl>            <dbl>
## 1 AK         371    22321            22692     9             17  
## 2 AL        8285    98481           106766   313             14.1
## 3 AR        3496    51139            54635    83             10.3
## 4 AZ        9305    78955            88260   395             16.8
## 5 CA       56212   723690           779902  2317              9.8
## 6 CO       17364    68612            96666   903             10.6
## # … with 3 more variables: `prop tested positive` <dbl>, `prop died` <dbl>,
## #   month <chr>
# wrangling 2: grouped summary statistics

### this code was written as per the instructions, but since
### the data was collected at the state level for a specific
### time period, their are an identical time period
### observations for each 'category'
count_data <- final_data %>% group_by(state) %>% summarize(n = n())

# did not print as it is redundant count_data

# monthly means
monthly_mean <- final_data %>% group_by(month) %>% summarize_if(is.numeric, 
    mean, na.rm = T) %>% arrange(desc(positive))

monthly_mean
## # A tibble: 3 x 8
##   month positive negative `total test res… death `uninsured at t…
##   <chr>    <dbl>    <dbl>            <dbl> <dbl>            <dbl>
## 1 Jun     43585.  458733.          521906. 2240.             11.9
## 2 May     30865.  223908.          263178. 1763.             11.8
## 3 Apr     20832.  101589.          126038. 1136.             11.7
## # … with 2 more variables: `prop tested positive` <dbl>, `prop died` <dbl>
state_mean <- final_data %>% group_by(state) %>% summarize_if(is.numeric, 
    mean, na.rm = T) %>% arrange(desc(positive))

head(state_mean)
## # A tibble: 6 x 8
##   state positive negative `total test res…  death `uninsured at t…
##   <chr>    <dbl>    <dbl>            <dbl>  <dbl>            <dbl>
## 1 NY     353937. 1674977.         2028914. 22441.             8.08
## 2 NJ     149375.  571239.          720615. 12224.            10.2 
## 3 CA     108280. 1785596.         1893847.  3817.            11.9 
## 4 IL     103182.  709255.          812437.  4801.            11.2 
## 5 MA      89126.  449113.          689007.  6094              5.26
## 6 TX      66763.  835141.          901904.  1518.            24.9 
## # … with 2 more variables: `prop tested positive` <dbl>, `prop died` <dbl>
# wrangling 3: state summary statistics for number of deaths

state_summary <- final_data %>% group_by(state) %>% summarize(avg_perc_uninsured = mean(`uninsured at the time of interview`, 
    na.rm = T), avg_prop_death = mean(`prop died`), median_value = median(death), 
    sd_death = sd(death), var_death = var(death), max_death = max(death), 
    min_pos = min(positive))


# html kable summary statistics table
state_summary %>% group_by(state) %>% kbl() %>% kable_material(c("hover"))
state avg_perc_uninsured avg_prop_death median_value sd_death var_death max_death min_pos
AK 14.619672 0.0218176 10 1.444227 2.085792e+00 14 337
AL 14.700000 0.0348383 575 227.610917 5.180673e+04 950 5778
AR 11.145902 0.0180800 119 63.589441 4.043617e+03 270 2465
AZ 16.672131 0.0399095 807 409.135262 1.673917e+05 1632 5769
CA 11.914754 0.0373015 3814 1369.385782 1.875217e+06 5980 37369
CO 10.590164 0.0473720 1114 263.345136 6.935066e+04 1520 11262
CT 6.901818 0.0893212 3769 818.537193 6.700031e+05 4322 23100
DE 8.837705 0.0476221 444 119.509995 1.428264e+04 509 3308
FL 16.821312 0.0411334 2338 774.427684 5.997382e+05 3604 27662
GA 19.854098 0.0417442 1871 603.943903 3.647482e+05 2805 21512
HI 6.424490 0.0246474 17 1.185523 1.405464e+00 18 592
IA 5.559184 0.0249017 478 209.488121 4.388527e+04 715 3924
ID 12.922951 0.0273945 79 11.494308 1.321191e+02 91 1802
IL 11.242623 0.0459345 4923 1752.877182 3.072578e+06 7124 36934
IN 11.191803 0.0609090 2004 581.246859 3.378479e+05 2640 13039
KS 14.340984 0.0238718 188 50.609082 2.561279e+03 270 2482
KY 9.075410 0.0444148 391 117.541359 1.381597e+04 565 3373
LA 12.860656 0.0670653 2701 494.429634 2.444607e+05 3221 25739
MA 5.258182 0.0671215 6473 1700.431749 2.891468e+06 8095 47992
MD 8.876364 0.0506870 2417 727.701171 5.295490e+05 3190 15737
ME 10.220000 0.0405864 79 19.811158 3.924820e+02 105 937
MI 8.968853 0.0917032 5703 699.141262 4.887985e+05 6193 41694
MN 6.983607 0.0416035 908 414.240877 1.715955e+05 1476 4270
MO 16.237705 0.0513042 686 242.403100 5.875926e+04 1015 6321
MS 23.463934 0.0441867 652 272.767799 7.440227e+04 1073 5153
MT 9.532787 0.0322033 17 2.213841 4.901093e+00 22 442
NC 13.881967 0.0307995 766 332.604095 1.106255e+05 1343 7608
ND 10.322951 0.0235276 54 24.468671 5.987158e+02 88 709
NE 10.378689 0.0137361 150 69.749673 4.865017e+03 269 1813
NH 8.432787 0.0493504 210 105.055541 1.103667e+04 367 1588
NJ 10.222951 0.0810085 12886 2468.338193 6.092693e+06 14944 99989
NM 11.013115 0.0423959 325 133.645285 1.786106e+04 497 2379
NV 14.831148 0.0479837 412 90.242048 8.143627e+03 507 4208
NY 8.075410 0.0632667 23564 2663.431605 7.093868e+06 24855 263460
OH 10.236364 0.0576833 2002 697.276094 4.861940e+05 2863 14694
OK 20.729508 0.0499969 318 59.901997 3.588249e+03 387 3017
OR 9.301639 0.0351260 148 34.497082 1.190049e+03 207 2127
PA 7.516393 0.0688958 5152 1678.466484 2.817250e+06 6649 37053
RI 5.046939 0.0435220 634 243.760052 5.941896e+04 950 6773
SC 16.167213 0.0366315 446 169.426933 2.870549e+04 739 4917
SD 11.267273 0.0107805 50 25.328325 6.415240e+02 91 1956
TN 15.609836 0.0163350 343 131.177092 1.720743e+04 604 8266
TX 24.901639 0.0246452 1536 553.361317 3.062087e+05 2424 21944
UT 10.418033 0.0102223 101 42.634692 1.817717e+03 172 3612
VA 8.657377 0.0315922 1236 426.610516 1.819965e+05 1763 10998
VT 6.934884 0.0535064 54 3.169355 1.004481e+01 56 831
WA 8.383607 0.0469887 1070 188.307435 3.545969e+04 1320 13750
WI 8.100000 0.0312706 517 168.390606 2.835540e+04 784 6011
WV 12.196721 0.0381807 74 19.313547 3.730131e+02 93 967
WY 15.849180 0.0145268 13 5.364781 2.878087e+01 20 453
final_data %>% summarize_all(function(x) sum(is.na(x))) %>% kbl() %>% 
    kable_styling()
state time period label positive negative total test results death uninsured at the time of interview prop tested positive prop died month
0 0 0 0 0 0 96 0 0 0

Some interesting characteristics about the summary data: Table 2 shows that the number of uninsured at the time of interview did not change by much. Furthermore, the proportion of those who tested positive decreased (positive) from April to June while the proportion who died (prop died) remained approximately constant all 3 months. Overall, during this time period, New York had the highest number of positive cases from April to June. Lastly, there are only 96 NA values and it is only for the uninsured at the time of interview variable, but this is due to the considerable wrangling done at the preliminary stage.

Visualizing

  • Write a supporting paragraph (for each plot) describing what the plot depicts and any relationships/trends that are apparent (9 pts)
# scatter plot
state_summary %>% ggplot(aes(avg_perc_uninsured, avg_prop_death)) + 
    geom_point() + geom_smooth(method = "lm") + theme_classic() + 
    labs(title = "Relationship between Number of Deaths by Percent Uninsured per State", 
        subtitle = "From April to June of 2020", x = "Average Proportion of Uninsured", 
        y = "Average Proportion of Deaths") + # rename tick labels and remove white space
scale_x_continuous(labels = c(`5` = "5%", `10` = "10%", `15` = "15%", 
    `20` = "20%", `25` = "25%"), expand = c(0.01, 0))

This scatter plot, which plots average proportion of deaths to average proportion uninsured, appears to show that there is a negative linear relationship between the two variables. This is interesting because initially, I thought the inverse would be true. A possible explaination for this is that during peak pandemic, deaths may have been misclassified as “Unknown” due to the volume of patients in the hospital, so the data during this period could be inaccurate.

# histogram
final_data %>% ggplot(aes(x = `uninsured at the time of interview`)) + 
    geom_histogram(aes(y = ..density..), bins = 10) + geom_density(color = "black") + 
    facet_wrap(~month, scales = "free") + theme_classic() + labs(title = "Distribution of Uninsured by Month", 
    subtitle = "2020 Data", x = "Proportion of Uninsured", y = "Count") + 
    # change breaks and remove white space
scale_x_continuous(breaks = seq(0, 35, 5), expand = c(0.02, 0)) + 
    scale_y_continuous(breaks = seq(0, 1, 0.025), expand = c(0.01, 
        0))

This histogram shows the distribution of proportion unisured in the months of April, May, and June. April and June’s distribution is right-skewed while May is less skwewed.

# bar plot find mean deaths mean(final_data$`prop died`)

final_data %>% # reorder y-axis in order from greatest to least proportion
ggplot(aes(y = reorder(state, `prop died`), x = `prop died`)) + 
    geom_bar(stat = "summary") + # add a geom vertical line layer to show average proportion
# dead in U.S.
geom_vline(xintercept = 0.04159294, linetype = "dashed", color = "grey") + 
    # modify theme to resize font to fit text
theme(axis.text.y = element_text(size = 6)) + labs(title = "Proportion Died of Covid by State", 
    subtitle = "From April to June of 2020", x = "Proportion", 
    y = "") + # change the breaks from 0.005 to 0.0025 for accuracy, remove
# white space with expand function
scale_x_continuous(breaks = seq(0, 1, 0.01), expand = c(0, 0))

This last bar graph shows a summary of the proportion of individuals who tested positive and died per state. The vertical line shows the average proportion of deaths in the United States. It appears that there was a considerable amount of states with greater than 50% fatality. However, a possible explanation for this is at the time the data was collected, only certain individuals were tested for COVID-19. The tests were not yet publicly available to everyone, which may make the results appear more drastic.

Concluding Remarks

Overall, this analysis did not find any evidence to suggest that the state’s proportion insured increased the number of deaths per state, but further analysis into confounding variables may yield more accurate concluations about the true relationship.