What effect access to health care insurance have on COVID-19 deaths in each state? This analysis explores how accessibility to health care, a growing problem in the United States, may have prevented individuals who were sick from seeking medical treatment.
“Indicators of Health Insurance Coverage” (insurance_data
) supplies census-type data from interviews on insurance status during the COVID-19 pandemic. “Provisional Death Counts for Influenza, Pneumonia, and COVID-19” supplies demographic and state data over COVID-19 testing, including number of positives, negatives, and deaths as a result of COVID-19, Pneumonia, or Influenza from January 1, 2020 to date. Both datasets were acquired from Kaggle.com. In this analysis, we will be looking at “peak” COVID-19 months, so I expect to find high rates of infection and death as well as some relationship between proportion died to proportion insured for each state.
# load libraries
library(tidyverse)
library(dplyr)
library(kableExtra)
library(ggplot2)
# read csv files
covid_data <- read_csv("us_states_covid19_daily.csv")
insurance_data <- read_csv("insurance.csv")
# inspecting data
head(covid_data)
## # A tibble: 6 x 55
## date state positive probableCases negative pending totalTestResult…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2.02e7 AK 35720 NA 1042056 NA totalTestsViral
## 2 2.02e7 AL 269877 45962 1421126 NA totalTestsPeopl…
## 3 2.02e7 AR 170924 22753 1614979 NA totalTestsViral
## 4 2.02e7 AS 0 NA 2140 NA totalTestsViral
## 5 2.02e7 AZ 364276 12590 2018813 NA totalTestsPeopl…
## 6 2.02e7 CA 1341700 NA 23853346 NA totalTestsViral
## # … with 48 more variables: totalTestResults <dbl>,
## # hospitalizedCurrently <dbl>, hospitalizedCumulative <dbl>,
## # inIcuCurrently <dbl>, inIcuCumulative <dbl>, onVentilatorCurrently <dbl>,
## # onVentilatorCumulative <dbl>, recovered <dbl>, dataQualityGrade <chr>,
## # lastUpdateEt <chr>, dateModified <dttm>, checkTimeEt <chr>, death <dbl>,
## # hospitalized <dbl>, dateChecked <dttm>, totalTestsViral <dbl>,
## # positiveTestsViral <dbl>, negativeTestsViral <dbl>,
## # positiveCasesViral <dbl>, deathConfirmed <dbl>, deathProbable <dbl>,
## # totalTestEncountersViral <dbl>, totalTestsPeopleViral <dbl>,
## # totalTestsAntibody <dbl>, positiveTestsAntibody <dbl>,
## # negativeTestsAntibody <dbl>, totalTestsPeopleAntibody <dbl>,
## # positiveTestsPeopleAntibody <dbl>, negativeTestsPeopleAntibody <dbl>,
## # totalTestsPeopleAntigen <dbl>, positiveTestsPeopleAntigen <dbl>,
## # totalTestsAntigen <dbl>, positiveTestsAntigen <dbl>, fips <chr>,
## # positiveIncrease <dbl>, negativeIncrease <dbl>, total <dbl>,
## # totalTestResultsIncrease <dbl>, posNeg <dbl>, deathIncrease <dbl>,
## # hospitalizedIncrease <dbl>, hash <chr>, commercialScore <dbl>,
## # negativeRegularScore <dbl>, negativeScore <dbl>, positiveScore <dbl>,
## # score <dbl>, grade <lgl>
head(insurance_data)
## # A tibble: 6 x 15
## Indicator Group State Subgroup Phase `Time Period` `Time Period La…
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Uninsure… Nati… Unit… United … 1 1 Apr 23 - May 5,…
## 2 Uninsure… By A… Unit… 18 - 24… 1 1 Apr 23 - May 5,…
## 3 Uninsure… By A… Unit… 25 - 34… 1 1 Apr 23 - May 5,…
## 4 Uninsure… By A… Unit… 35 - 44… 1 1 Apr 23 - May 5,…
## 5 Uninsure… By A… Unit… 45 - 64… 1 1 Apr 23 - May 5,…
## 6 Uninsure… By S… Unit… Male 1 1 Apr 23 - May 5,…
## # … with 8 more variables: `Time Period Start Date` <chr>, `Time Period End
## # Date` <chr>, Value <dbl>, `Low CI` <dbl>, `High CI` <dbl>, `Confidence
## # Interval` <chr>, `Quartile Range` <chr>, `Suppression Flag` <dbl>
Tidying and reshaping will be broken up into two steps: “Preliminary Cleaning” and “Reshaping”. I will first be cleaning the data because it take a sufficient amount of cleaning to properly join later on, then I will reshape in the subsequent next steps. The two tables have different date formats and time periods that do not match up. Therefore, I will first transform the date
variable in covid_data into a numeric and match based on the formatting of Time Period Label
in insurance_data.
# manually tidy data to for subsequent join
# turn date variable into numeric
covid_data$date <- covid_data$date %>% as.numeric()
# match date and relabel for subsequent match
# insurance_data.csv
first_week <- filter(covid_data, between(covid_data$date, "20200423",
"20200505")) %>% mutate(`Time Period Label` = "Apr 23 - May 5, 2020")
second_week <- filter(covid_data, between(covid_data$date, "20200507",
"20200512")) %>% mutate(`Time Period Label` = "May 7 - May 12, 2020")
third_week <- filter(covid_data, between(covid_data$date, "20200514",
"20200519")) %>% mutate(`Time Period Label` = "May 14 - May 19, 2020")
fourth_week <- filter(covid_data, between(covid_data$date, "20200521",
"20200526")) %>% mutate(`Time Period Label` = "May 21 - May 26, 2020")
fifth_week <- filter(covid_data, between(covid_data$date, "20200528",
"20200602")) %>% mutate(`Time Period Label` = "May 28 - June 2, 2020")
sixth_week <- filter(covid_data, between(covid_data$date, "20200604",
"20200609")) %>% mutate(`Time Period Label` = "June 4 - June 9, 2020")
seventh_week <- filter(covid_data, between(covid_data$date, "20200611",
"20200616")) %>% mutate(`Time Period Label` = "June 11 - June 16, 2020")
eighth_week <- filter(covid_data, between(covid_data$date, "20200618",
"20200623")) %>% mutate(`Time Period Label` = "June 18 - June 23, 2020")
ninth_week <- filter(covid_data, between(covid_data$date, "20200625",
"20200630")) %>% mutate(`Time Period Label` = "June 25 - June 30, 2020")
# combine data for dates of interest
covid_new_data <- rbind(rbind(rbind(rbind(rbind(rbind(rbind(rbind(first_week,
second_week), third_week), fourth_week), fifth_week), sixth_week),
seventh_week), eighth_week), ninth_week)
# remove subgroup for convenience and filter by state to
# remove other demographic data
insurance_data_1 <- insurance_data %>% select(-Subgroup) %>%
filter(Group == "By State")
# recode State data into state codes for joining
insurance_data_1$State <- state.abb[match(insurance_data_1$State,
state.name)]
Now, I will pivot the data to a more intuitive format. I used the pivot_wider()
function to take the pivot the rows in “Indicator”, which describe insurance status (uninsured, publicly insured, and privately insured), and “Values” which describe the proportion of the population in each group, and transform them into a single column.
# reshaping data
insurance_data_clean <- insurance_data_1 %>% select(State, `Time Period Label`,
Indicator, Value) %>% # use pivot_wider to transform the row data into a column for
# the proportion uninsured
pivot_wider(names_from = Indicator, values_from = Value) %>%
# arrange by state for data validation
arrange(State)
# head(insurance_data_clean)
I will use a left join to join the tidy insurance
data to the tidy covid
data. Using the left_join()
function insures that the necessary data from the covid
dataset (e.g. positive
[results], deaths
, etc.) is preserved in the process. The data was joined by 2 unique IDs to ensure the data matched correctly by Time Period Label
and State
.
# join data employing a left join, join by 2 IDs that do not
# have same variable name
joined_data <- left_join(covid_new_data, insurance_data_clean,
by = "Time Period Label", c(state = "State"))
# final clean of joined data
joined_data_clean <- joined_data %>% select(state, State, "Time Period Label",
positive, negative, totalTestResults, death, "Uninsured at the Time of Interview") %>%
# remove duplicates
filter(state == State) %>% select(-State)
joined_data_clean
## # A tibble: 3,050 x 7
## state `Time Period La… positive negative totalTestResults death
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AK Apr 23 - May 5,… 371 22321 22692 9
## 2 AL Apr 23 - May 5,… 8285 98481 106766 313
## 3 AR Apr 23 - May 5,… 3496 51139 54635 83
## 4 AZ Apr 23 - May 5,… 9305 78955 88260 395
## 5 CA Apr 23 - May 5,… 56212 723690 779902 2317
## 6 CO Apr 23 - May 5,… 17364 68612 96666 903
## 7 CT Apr 23 - May 5,… 30621 78022 108643 2633
## 8 DE Apr 23 - May 5,… 5371 19309 35679 266
## 9 FL Apr 23 - May 5,… 36183 428252 461682 1536
## 10 GA Apr 23 - May 5,… 29711 171172 190322 1288
## # … with 3,040 more rows, and 1 more variable: `Uninsured at the Time of
## # Interview` <dbl>
There are no missing IDs in the joined dataset. Since both datasets were formatted and organized in the preliminary cleaning stage, their information match up by demographic and time. However, the rows required some wrangling before joining. For example, the insurance data had a “Group” column, which I filtered into “States”, so the data was identifiable by state, only. Furthermore, the dates in covid_data were in a daily format, so they were wrangled to match the time periods in insurance_data.
There were a total of 15,633 observations in covid_data and 8,217 observations in the insurance
data. After joining and cleaning, the final dataset has 3050 observations, which are each of the dates of interest for each of the 50 states. There were several observations that were dropped for the final dataset, but most if not all rows that were dropped were not variables of interest (e.g. gender, race) or fell out of the desired time interval of interest. The size of joined_data_clean is considerably smaller, but it is important to note it is a 3 month subsample and only state data was pulled.
In the following code block, I added two columns, prop tested positive
, which finds the proportion of positives over the number of individuals who took the test, and prop died
which finds the proportion of deaths over the number of individuals who tested positive (for each time period for every state). Using final_data, I applied various data wrangling techniques to create summary statistics by subgroup and for the entire data set.
Note: The filter()
function is not used here as there was some datawrangling in the preliminary cleaning stages.
# wrangling 1: add columns
final_data <- joined_data_clean %>% # create two variables as proportions of total tests
mutate(`prop tested positive` = positive/totalTestResults, `prop died` = death/positive,
month = substr(`Time Period Label`, 1, 3))
# examine column headings colnames(final_data)
# convert column names into similar format for convenient
# wrangling using regex
names(final_data)[5] <- gsub("([a-z])([A-Z])", "\\1 \\2", colnames(final_data[5])) %>%
str_to_lower()
names(final_data)[2] <- str_to_lower(colnames(final_data[2]))
names(final_data)[7] <- str_to_lower(colnames(final_data[7]))
# final data table
final_data %>% select(-`time period label`) %>% head()
## # A tibble: 6 x 9
## state positive negative `total test res… death `uninsured at t…
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AK 371 22321 22692 9 17
## 2 AL 8285 98481 106766 313 14.1
## 3 AR 3496 51139 54635 83 10.3
## 4 AZ 9305 78955 88260 395 16.8
## 5 CA 56212 723690 779902 2317 9.8
## 6 CO 17364 68612 96666 903 10.6
## # … with 3 more variables: `prop tested positive` <dbl>, `prop died` <dbl>,
## # month <chr>
# wrangling 2: grouped summary statistics
### this code was written as per the instructions, but since
### the data was collected at the state level for a specific
### time period, their are an identical time period
### observations for each 'category'
count_data <- final_data %>% group_by(state) %>% summarize(n = n())
# did not print as it is redundant count_data
# monthly means
monthly_mean <- final_data %>% group_by(month) %>% summarize_if(is.numeric,
mean, na.rm = T) %>% arrange(desc(positive))
monthly_mean
## # A tibble: 3 x 8
## month positive negative `total test res… death `uninsured at t…
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Jun 43585. 458733. 521906. 2240. 11.9
## 2 May 30865. 223908. 263178. 1763. 11.8
## 3 Apr 20832. 101589. 126038. 1136. 11.7
## # … with 2 more variables: `prop tested positive` <dbl>, `prop died` <dbl>
state_mean <- final_data %>% group_by(state) %>% summarize_if(is.numeric,
mean, na.rm = T) %>% arrange(desc(positive))
head(state_mean)
## # A tibble: 6 x 8
## state positive negative `total test res… death `uninsured at t…
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NY 353937. 1674977. 2028914. 22441. 8.08
## 2 NJ 149375. 571239. 720615. 12224. 10.2
## 3 CA 108280. 1785596. 1893847. 3817. 11.9
## 4 IL 103182. 709255. 812437. 4801. 11.2
## 5 MA 89126. 449113. 689007. 6094 5.26
## 6 TX 66763. 835141. 901904. 1518. 24.9
## # … with 2 more variables: `prop tested positive` <dbl>, `prop died` <dbl>
# wrangling 3: state summary statistics for number of deaths
state_summary <- final_data %>% group_by(state) %>% summarize(avg_perc_uninsured = mean(`uninsured at the time of interview`,
na.rm = T), avg_prop_death = mean(`prop died`), median_value = median(death),
sd_death = sd(death), var_death = var(death), max_death = max(death),
min_pos = min(positive))
# html kable summary statistics table
state_summary %>% group_by(state) %>% kbl() %>% kable_material(c("hover"))
state | avg_perc_uninsured | avg_prop_death | median_value | sd_death | var_death | max_death | min_pos |
---|---|---|---|---|---|---|---|
AK | 14.619672 | 0.0218176 | 10 | 1.444227 | 2.085792e+00 | 14 | 337 |
AL | 14.700000 | 0.0348383 | 575 | 227.610917 | 5.180673e+04 | 950 | 5778 |
AR | 11.145902 | 0.0180800 | 119 | 63.589441 | 4.043617e+03 | 270 | 2465 |
AZ | 16.672131 | 0.0399095 | 807 | 409.135262 | 1.673917e+05 | 1632 | 5769 |
CA | 11.914754 | 0.0373015 | 3814 | 1369.385782 | 1.875217e+06 | 5980 | 37369 |
CO | 10.590164 | 0.0473720 | 1114 | 263.345136 | 6.935066e+04 | 1520 | 11262 |
CT | 6.901818 | 0.0893212 | 3769 | 818.537193 | 6.700031e+05 | 4322 | 23100 |
DE | 8.837705 | 0.0476221 | 444 | 119.509995 | 1.428264e+04 | 509 | 3308 |
FL | 16.821312 | 0.0411334 | 2338 | 774.427684 | 5.997382e+05 | 3604 | 27662 |
GA | 19.854098 | 0.0417442 | 1871 | 603.943903 | 3.647482e+05 | 2805 | 21512 |
HI | 6.424490 | 0.0246474 | 17 | 1.185523 | 1.405464e+00 | 18 | 592 |
IA | 5.559184 | 0.0249017 | 478 | 209.488121 | 4.388527e+04 | 715 | 3924 |
ID | 12.922951 | 0.0273945 | 79 | 11.494308 | 1.321191e+02 | 91 | 1802 |
IL | 11.242623 | 0.0459345 | 4923 | 1752.877182 | 3.072578e+06 | 7124 | 36934 |
IN | 11.191803 | 0.0609090 | 2004 | 581.246859 | 3.378479e+05 | 2640 | 13039 |
KS | 14.340984 | 0.0238718 | 188 | 50.609082 | 2.561279e+03 | 270 | 2482 |
KY | 9.075410 | 0.0444148 | 391 | 117.541359 | 1.381597e+04 | 565 | 3373 |
LA | 12.860656 | 0.0670653 | 2701 | 494.429634 | 2.444607e+05 | 3221 | 25739 |
MA | 5.258182 | 0.0671215 | 6473 | 1700.431749 | 2.891468e+06 | 8095 | 47992 |
MD | 8.876364 | 0.0506870 | 2417 | 727.701171 | 5.295490e+05 | 3190 | 15737 |
ME | 10.220000 | 0.0405864 | 79 | 19.811158 | 3.924820e+02 | 105 | 937 |
MI | 8.968853 | 0.0917032 | 5703 | 699.141262 | 4.887985e+05 | 6193 | 41694 |
MN | 6.983607 | 0.0416035 | 908 | 414.240877 | 1.715955e+05 | 1476 | 4270 |
MO | 16.237705 | 0.0513042 | 686 | 242.403100 | 5.875926e+04 | 1015 | 6321 |
MS | 23.463934 | 0.0441867 | 652 | 272.767799 | 7.440227e+04 | 1073 | 5153 |
MT | 9.532787 | 0.0322033 | 17 | 2.213841 | 4.901093e+00 | 22 | 442 |
NC | 13.881967 | 0.0307995 | 766 | 332.604095 | 1.106255e+05 | 1343 | 7608 |
ND | 10.322951 | 0.0235276 | 54 | 24.468671 | 5.987158e+02 | 88 | 709 |
NE | 10.378689 | 0.0137361 | 150 | 69.749673 | 4.865017e+03 | 269 | 1813 |
NH | 8.432787 | 0.0493504 | 210 | 105.055541 | 1.103667e+04 | 367 | 1588 |
NJ | 10.222951 | 0.0810085 | 12886 | 2468.338193 | 6.092693e+06 | 14944 | 99989 |
NM | 11.013115 | 0.0423959 | 325 | 133.645285 | 1.786106e+04 | 497 | 2379 |
NV | 14.831148 | 0.0479837 | 412 | 90.242048 | 8.143627e+03 | 507 | 4208 |
NY | 8.075410 | 0.0632667 | 23564 | 2663.431605 | 7.093868e+06 | 24855 | 263460 |
OH | 10.236364 | 0.0576833 | 2002 | 697.276094 | 4.861940e+05 | 2863 | 14694 |
OK | 20.729508 | 0.0499969 | 318 | 59.901997 | 3.588249e+03 | 387 | 3017 |
OR | 9.301639 | 0.0351260 | 148 | 34.497082 | 1.190049e+03 | 207 | 2127 |
PA | 7.516393 | 0.0688958 | 5152 | 1678.466484 | 2.817250e+06 | 6649 | 37053 |
RI | 5.046939 | 0.0435220 | 634 | 243.760052 | 5.941896e+04 | 950 | 6773 |
SC | 16.167213 | 0.0366315 | 446 | 169.426933 | 2.870549e+04 | 739 | 4917 |
SD | 11.267273 | 0.0107805 | 50 | 25.328325 | 6.415240e+02 | 91 | 1956 |
TN | 15.609836 | 0.0163350 | 343 | 131.177092 | 1.720743e+04 | 604 | 8266 |
TX | 24.901639 | 0.0246452 | 1536 | 553.361317 | 3.062087e+05 | 2424 | 21944 |
UT | 10.418033 | 0.0102223 | 101 | 42.634692 | 1.817717e+03 | 172 | 3612 |
VA | 8.657377 | 0.0315922 | 1236 | 426.610516 | 1.819965e+05 | 1763 | 10998 |
VT | 6.934884 | 0.0535064 | 54 | 3.169355 | 1.004481e+01 | 56 | 831 |
WA | 8.383607 | 0.0469887 | 1070 | 188.307435 | 3.545969e+04 | 1320 | 13750 |
WI | 8.100000 | 0.0312706 | 517 | 168.390606 | 2.835540e+04 | 784 | 6011 |
WV | 12.196721 | 0.0381807 | 74 | 19.313547 | 3.730131e+02 | 93 | 967 |
WY | 15.849180 | 0.0145268 | 13 | 5.364781 | 2.878087e+01 | 20 | 453 |
final_data %>% summarize_all(function(x) sum(is.na(x))) %>% kbl() %>%
kable_styling()
state | time period label | positive | negative | total test results | death | uninsured at the time of interview | prop tested positive | prop died | month |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 96 | 0 | 0 | 0 |
Some interesting characteristics about the summary data: Table 2 shows that the number of uninsured at the time of interview did not change by much. Furthermore, the proportion of those who tested positive decreased (positive
) from April to June while the proportion who died (prop died
) remained approximately constant all 3 months. Overall, during this time period, New York had the highest number of positive cases from April to June. Lastly, there are only 96 NA values and it is only for the uninsured at the time of interview
variable, but this is due to the considerable wrangling done at the preliminary stage.
# scatter plot
state_summary %>% ggplot(aes(avg_perc_uninsured, avg_prop_death)) +
geom_point() + geom_smooth(method = "lm") + theme_classic() +
labs(title = "Relationship between Number of Deaths by Percent Uninsured per State",
subtitle = "From April to June of 2020", x = "Average Proportion of Uninsured",
y = "Average Proportion of Deaths") + # rename tick labels and remove white space
scale_x_continuous(labels = c(`5` = "5%", `10` = "10%", `15` = "15%",
`20` = "20%", `25` = "25%"), expand = c(0.01, 0))
This scatter plot, which plots average proportion of deaths to average proportion uninsured, appears to show that there is a negative linear relationship between the two variables. This is interesting because initially, I thought the inverse would be true. A possible explaination for this is that during peak pandemic, deaths may have been misclassified as “Unknown” due to the volume of patients in the hospital, so the data during this period could be inaccurate.
# histogram
final_data %>% ggplot(aes(x = `uninsured at the time of interview`)) +
geom_histogram(aes(y = ..density..), bins = 10) + geom_density(color = "black") +
facet_wrap(~month, scales = "free") + theme_classic() + labs(title = "Distribution of Uninsured by Month",
subtitle = "2020 Data", x = "Proportion of Uninsured", y = "Count") +
# change breaks and remove white space
scale_x_continuous(breaks = seq(0, 35, 5), expand = c(0.02, 0)) +
scale_y_continuous(breaks = seq(0, 1, 0.025), expand = c(0.01,
0))
This histogram shows the distribution of proportion unisured in the months of April, May, and June. April and June’s distribution is right-skewed while May is less skwewed.
# bar plot find mean deaths mean(final_data$`prop died`)
final_data %>% # reorder y-axis in order from greatest to least proportion
ggplot(aes(y = reorder(state, `prop died`), x = `prop died`)) +
geom_bar(stat = "summary") + # add a geom vertical line layer to show average proportion
# dead in U.S.
geom_vline(xintercept = 0.04159294, linetype = "dashed", color = "grey") +
# modify theme to resize font to fit text
theme(axis.text.y = element_text(size = 6)) + labs(title = "Proportion Died of Covid by State",
subtitle = "From April to June of 2020", x = "Proportion",
y = "") + # change the breaks from 0.005 to 0.0025 for accuracy, remove
# white space with expand function
scale_x_continuous(breaks = seq(0, 1, 0.01), expand = c(0, 0))
This last bar graph shows a summary of the proportion of individuals who tested positive and died per state. The vertical line shows the average proportion of deaths in the United States. It appears that there was a considerable amount of states with greater than 50% fatality. However, a possible explanation for this is at the time the data was collected, only certain individuals were tested for COVID-19. The tests were not yet publicly available to everyone, which may make the results appear more drastic.
Overall, this analysis did not find any evidence to suggest that the state’s proportion insured increased the number of deaths per state, but further analysis into confounding variables may yield more accurate concluations about the true relationship.