Project 1: Wrangling, Exploration, Visualization

Data Wrangling, Exploration, Visualization

Samantha Uy Tesy

Introduction

What effect access to health care insurance have on COVID-19 deaths in each state? This analysis explores how accessibility to health care, a growing problem in the United States, may have prevented individuals who were sick from seeking medical treatment.

“Indicators of Health Insurance Coverage” (insurance_data) supplies census-type data from interviews on insurance status during the COVID-19 pandemic. “Provisional Death Counts for Influenza, Pneumonia, and COVID-19” supplies demographic and state data over COVID-19 testing, including number of positives, negatives, and deaths as a result of COVID-19, Pneumonia, or Influenza from January 1, 2020 to date. Both datasets were acquired from Kaggle.com. In this analysis, we will be looking at “peak” COVID-19 months, so I expect to find high rates of infection and death as well as some relationship between proportion died to proportion insured for each state.

# load libraries
library(tidyverse)
library(dplyr)
library(kableExtra)
library(ggplot2)

# read csv files
covid_data <- read_csv("us_states_covid19_daily.csv")
insurance_data <- read_csv("insurance.csv")

# inspecting data
head(covid_data)

## # A tibble: 6 x 55
##     date state positive probableCases negative pending totalTestResult…
##    <dbl> <chr>    <dbl>         <dbl>    <dbl>   <dbl> <chr>           
## 1 2.02e7 AK       35720            NA  1042056      NA totalTestsViral 
## 2 2.02e7 AL      269877         45962  1421126      NA totalTestsPeopl…
## 3 2.02e7 AR      170924         22753  1614979      NA totalTestsViral 
## 4 2.02e7 AS           0            NA     2140      NA totalTestsViral 
## 5 2.02e7 AZ      364276         12590  2018813      NA totalTestsPeopl…
## 6 2.02e7 CA     1341700            NA 23853346      NA totalTestsViral 
## # … with 48 more variables: totalTestResults <dbl>,
## #   hospitalizedCurrently <dbl>, hospitalizedCumulative <dbl>,
## #   inIcuCurrently <dbl>, inIcuCumulative <dbl>, onVentilatorCurrently <dbl>,
## #   onVentilatorCumulative <dbl>, recovered <dbl>, dataQualityGrade <chr>,
## #   lastUpdateEt <chr>, dateModified <dttm>, checkTimeEt <chr>, death <dbl>,
## #   hospitalized <dbl>, dateChecked <dttm>, totalTestsViral <dbl>,
## #   positiveTestsViral <dbl>, negativeTestsViral <dbl>,
## #   positiveCasesViral <dbl>, deathConfirmed <dbl>, deathProbable <dbl>,
## #   totalTestEncountersViral <dbl>, totalTestsPeopleViral <dbl>,
## #   totalTestsAntibody <dbl>, positiveTestsAntibody <dbl>,
## #   negativeTestsAntibody <dbl>, totalTestsPeopleAntibody <dbl>,
## #   positiveTestsPeopleAntibody <dbl>, negativeTestsPeopleAntibody <dbl>,
## #   totalTestsPeopleAntigen <dbl>, positiveTestsPeopleAntigen <dbl>,
## #   totalTestsAntigen <dbl>, positiveTestsAntigen <dbl>, fips <chr>,
## #   positiveIncrease <dbl>, negativeIncrease <dbl>, total <dbl>,
## #   totalTestResultsIncrease <dbl>, posNeg <dbl>, deathIncrease <dbl>,
## #   hospitalizedIncrease <dbl>, hash <chr>, commercialScore <dbl>,
## #   negativeRegularScore <dbl>, negativeScore <dbl>, positiveScore <dbl>,
## #   score <dbl>, grade <lgl>

head(insurance_data)

## # A tibble: 6 x 15
##   Indicator Group State Subgroup Phase `Time Period` `Time Period La…
##   <chr>     <chr> <chr> <chr>    <dbl>         <dbl> <chr>           
## 1 Uninsure… Nati… Unit… United …     1             1 Apr 23 - May 5,…
## 2 Uninsure… By A… Unit… 18 - 24…     1             1 Apr 23 - May 5,…
## 3 Uninsure… By A… Unit… 25 - 34…     1             1 Apr 23 - May 5,…
## 4 Uninsure… By A… Unit… 35 - 44…     1             1 Apr 23 - May 5,…
## 5 Uninsure… By A… Unit… 45 - 64…     1             1 Apr 23 - May 5,…
## 6 Uninsure… By S… Unit… Male         1             1 Apr 23 - May 5,…
## # … with 8 more variables: `Time Period Start Date` <chr>, `Time Period End
## #   Date` <chr>, Value <dbl>, `Low CI` <dbl>, `High CI` <dbl>, `Confidence
## #   Interval` <chr>, `Quartile Range` <chr>, `Suppression Flag` <dbl>

Tidying

Preliminary Cleaning

Tidying and reshaping will be broken up into two steps: “Preliminary Cleaning” and “Reshaping”. I will first be cleaning the data because it take a sufficient amount of cleaning to properly join later on, then I will reshape in the subsequent next steps. The two tables have different date formats and time periods that do not match up. Therefore, I will first transform the date variable in covid_data into a numeric and match based on the formatting of Time Period Label in insurance_data.

# manually tidy data to for subsequent join

# turn date variable into numeric
covid_data$date <- covid_data$date %>% as.numeric()

# match date and relabel for subsequent match
# insurance_data.csv
first_week <- filter(covid_data, between(covid_data$date, "20200423", 
    "20200505")) %>% mutate(`Time Period Label` = "Apr 23 - May 5, 2020")
second_week <- filter(covid_data, between(covid_data$date, "20200507", 
    "20200512")) %>% mutate(`Time Period Label` = "May 7 - May 12, 2020")
third_week <- filter(covid_data, between(covid_data$date, "20200514", 
    "20200519")) %>% mutate(`Time Period Label` = "May 14 - May 19, 2020")
fourth_week <- filter(covid_data, between(covid_data$date, "20200521", 
    "20200526")) %>% mutate(`Time Period Label` = "May 21 - May 26, 2020")
fifth_week <- filter(covid_data, between(covid_data$date, "20200528", 
    "20200602")) %>% mutate(`Time Period Label` = "May 28 - June 2, 2020")
sixth_week <- filter(covid_data, between(covid_data$date, "20200604", 
    "20200609")) %>% mutate(`Time Period Label` = "June 4 - June 9, 2020")
seventh_week <- filter(covid_data, between(covid_data$date, "20200611", 
    "20200616")) %>% mutate(`Time Period Label` = "June 11 - June 16, 2020")
eighth_week <- filter(covid_data, between(covid_data$date, "20200618", 
    "20200623")) %>% mutate(`Time Period Label` = "June 18 - June 23, 2020")
ninth_week <- filter(covid_data, between(covid_data$date, "20200625", 
    "20200630")) %>% mutate(`Time Period Label` = "June 25 - June 30, 2020")

# combine data for dates of interest
covid_new_data <- rbind(rbind(rbind(rbind(rbind(rbind(rbind(rbind(first_week, 
    second_week), third_week), fourth_week), fifth_week), sixth_week), 
    seventh_week), eighth_week), ninth_week)

# remove subgroup for convenience and filter by state to
# remove other demographic data
insurance_data_1 <- insurance_data %>% select(-Subgroup) %>% 
    filter(Group == "By State")

# recode State data into state codes for joining
insurance_data_1$State <- state.abb[match(insurance_data_1$State, 
    state.name)]

Reshaping Data

Now, I will pivot the data to a more intuitive format. I used the pivot_wider() function to take the pivot the rows in “Indicator”, which describe insurance status (uninsured, publicly insured, and privately insured), and “Values” which describe the proportion of the population in each group, and transform them into a single column.

# reshaping data
insurance_data_clean <- insurance_data_1 %>% select(State, `Time Period Label`, 
    Indicator, Value) %>% # use pivot_wider to transform the row data into a column for
# the proportion uninsured
pivot_wider(names_from = Indicator, values_from = Value) %>% 
    # arrange by state for data validation
arrange(State)

# head(insurance_data_clean)

Joining/Merging

I will use a left join to join the tidy insurance data to the tidy covid data. Using the left_join() function insures that the necessary data from the covid dataset (e.g. positive [results], deaths, etc.) is preserved in the process. The data was joined by 2 unique IDs to ensure the data matched correctly by Time Period Label and State.

# join data employing a left join, join by 2 IDs that do not
# have same variable name
joined_data <- left_join(covid_new_data, insurance_data_clean, 
    by = "Time Period Label", c(state = "State"))

# final clean of joined data
joined_data_clean <- joined_data %>% select(state, State, "Time Period Label", 
    positive, negative, totalTestResults, death, "Uninsured at the Time of Interview") %>% 
    # remove duplicates
filter(state == State) %>% select(-State)

joined_data_clean

## # A tibble: 3,050 x 7
##    state `Time Period La… positive negative totalTestResults death
##    <chr> <chr>               <dbl>    <dbl>            <dbl> <dbl>
##  1 AK    Apr 23 - May 5,…      371    22321            22692     9
##  2 AL    Apr 23 - May 5,…     8285    98481           106766   313
##  3 AR    Apr 23 - May 5,…     3496    51139            54635    83
##  4 AZ    Apr 23 - May 5,…     9305    78955            88260   395
##  5 CA    Apr 23 - May 5,…    56212   723690           779902  2317
##  6 CO    Apr 23 - May 5,…    17364    68612            96666   903
##  7 CT    Apr 23 - May 5,…    30621    78022           108643  2633
##  8 DE    Apr 23 - May 5,…     5371    19309            35679   266
##  9 FL    Apr 23 - May 5,…    36183   428252           461682  1536
## 10 GA    Apr 23 - May 5,…    29711   171172           190322  1288
## # … with 3,040 more rows, and 1 more variable: `Uninsured at the Time of
## #   Interview` <dbl>

There are no missing IDs in the joined dataset. Since both datasets were formatted and organized in the preliminary cleaning stage, their information match up by demographic and time. However, the rows required some wrangling before joining. For example, the insurance data had a “Group” column, which I filtered into “States”, so the data was identifiable by state, only. Furthermore, the dates in covid_data were in a daily format, so they were wrangled to match the time periods in insurance_data.

There were a total of 15,633 observations in covid_data and 8,217 observations in the insurance data. After joining and cleaning, the final dataset has 3050 observations, which are each of the dates of interest for each of the 50 states. There were several observations that were dropped for the final dataset, but most if not all rows that were dropped were not variables of interest (e.g. gender, race) or fell out of the desired time interval of interest. The size of joined_data_clean is considerably smaller, but it is important to note it is a 3 month subsample and only state data was pulled.

Wrangling

In the following code block, I added two columns, prop tested positive, which finds the proportion of positives over the number of individuals who took the test, and prop died which finds the proportion of deaths over the number of individuals who tested positive (for each time period for every state). Using final_data, I applied various data wrangling techniques to create summary statistics by subgroup and for the entire data set.

Table 1 shows the first 6 observations of the finalized dataset to be used in subsequent tables
Table 2 shows summary statistics grouped by month
Table 3 shows each variables mean for each state
Table 4 is an interactive table that shows summary statistics for each state
Table 5 shows how many NA variables there for each column in final_data

Note: The filter() function is not used here as there was some datawrangling in the preliminary cleaning stages.

# wrangling 1: add columns
final_data <- joined_data_clean %>% # create two variables as proportions of total tests
mutate(`prop tested positive` = positive/totalTestResults, `prop died` = death/positive, 
    month = substr(`Time Period Label`, 1, 3))

# examine column headings colnames(final_data)

# convert column names into similar format for convenient
# wrangling using regex
names(final_data)[5] <- gsub("([a-z])([A-Z])", "\\1 \\2", colnames(final_data[5])) %>% 
    str_to_lower()
names(final_data)[2] <- str_to_lower(colnames(final_data[2]))
names(final_data)[7] <- str_to_lower(colnames(final_data[7]))

# final data table
final_data %>% select(-`time period label`) %>% head()

## # A tibble: 6 x 9
##   state positive negative `total test res… death `uninsured at t…
##   <chr>    <dbl>    <dbl>            <dbl> <dbl>            <dbl>
## 1 AK         371    22321            22692     9             17  
## 2 AL        8285    98481           106766   313             14.1
## 3 AR        3496    51139            54635    83             10.3
## 4 AZ        9305    78955            88260   395             16.8
## 5 CA       56212   723690           779902  2317              9.8
## 6 CO       17364    68612            96666   903             10.6
## # … with 3 more variables: `prop tested positive` <dbl>, `prop died` <dbl>,
## #   month <chr>

# wrangling 2: grouped summary statistics

### this code was written as per the instructions, but since
### the data was collected at the state level for a specific
### time period, their are an identical time period
### observations for each 'category'
count_data <- final_data %>% group_by(state) %>% summarize(n = n())

# did not print as it is redundant count_data

# monthly means
monthly_mean <- final_data %>% group_by(month) %>% summarize_if(is.numeric, 
    mean, na.rm = T) %>% arrange(desc(positive))

monthly_mean

## # A tibble: 3 x 8
##   month positive negative `total test res… death `uninsured at t…
##   <chr>    <dbl>    <dbl>            <dbl> <dbl>            <dbl>
## 1 Jun     43585.  458733.          521906. 2240.             11.9
## 2 May     30865.  223908.          263178. 1763.             11.8
## 3 Apr     20832.  101589.          126038. 1136.             11.7
## # … with 2 more variables: `prop tested positive` <dbl>, `prop died` <dbl>

state_mean <- final_data %>% group_by(state) %>% summarize_if(is.numeric, 
    mean, na.rm = T) %>% arrange(desc(positive))

head(state_mean)

## # A tibble: 6 x 8
##   state positive negative `total test res…  death `uninsured at t…
##   <chr>    <dbl>    <dbl>            <dbl>  <dbl>            <dbl>
## 1 NY     353937. 1674977.         2028914. 22441.             8.08
## 2 NJ     149375.  571239.          720615. 12224.            10.2 
## 3 CA     108280. 1785596.         1893847.  3817.            11.9 
## 4 IL     103182.  709255.          812437.  4801.            11.2 
## 5 MA      89126.  449113.          689007.  6094              5.26
## 6 TX      66763.  835141.          901904.  1518.            24.9 
## # … with 2 more variables: `prop tested positive` <dbl>, `prop died` <dbl>

# wrangling 3: state summary statistics for number of deaths

state_summary <- final_data %>% group_by(state) %>% summarize(avg_perc_uninsured = mean(`uninsured at the time of interview`, 
    na.rm = T), avg_prop_death = mean(`prop died`), median_value = median(death), 
    sd_death = sd(death), var_death = var(death), max_death = max(death), 
    min_pos = min(positive))


# html kable summary statistics table
state_summary %>% group_by(state) %>% kbl() %>% kable_material(c("hover"))

state	avg_perc_uninsured	avg_prop_death	median_value	sd_death	var_death	max_death	min_pos
AK	14.619672	0.0218176	10	1.444227	2.085792e+00	14	337
AL	14.700000	0.0348383	575	227.610917	5.180673e+04	950	5778
AR	11.145902	0.0180800	119	63.589441	4.043617e+03	270	2465
AZ	16.672131	0.0399095	807	409.135262	1.673917e+05	1632	5769
CA	11.914754	0.0373015	3814	1369.385782	1.875217e+06	5980	37369
CO	10.590164	0.0473720	1114	263.345136	6.935066e+04	1520	11262
CT	6.901818	0.0893212	3769	818.537193	6.700031e+05	4322	23100
DE	8.837705	0.0476221	444	119.509995	1.428264e+04	509	3308
FL	16.821312	0.0411334	2338	774.427684	5.997382e+05	3604	27662
GA	19.854098	0.0417442	1871	603.943903	3.647482e+05	2805	21512
HI	6.424490	0.0246474	17	1.185523	1.405464e+00	18	592
IA	5.559184	0.0249017	478	209.488121	4.388527e+04	715	3924
ID	12.922951	0.0273945	79	11.494308	1.321191e+02	91	1802
IL	11.242623	0.0459345	4923	1752.877182	3.072578e+06	7124	36934
IN	11.191803	0.0609090	2004	581.246859	3.378479e+05	2640	13039
KS	14.340984	0.0238718	188	50.609082	2.561279e+03	270	2482
KY	9.075410	0.0444148	391	117.541359	1.381597e+04	565	3373
LA	12.860656	0.0670653	2701	494.429634	2.444607e+05	3221	25739
MA	5.258182	0.0671215	6473	1700.431749	2.891468e+06	8095	47992
MD	8.876364	0.0506870	2417	727.701171	5.295490e+05	3190	15737
ME	10.220000	0.0405864	79	19.811158	3.924820e+02	105	937
MI	8.968853	0.0917032	5703	699.141262	4.887985e+05	6193	41694
MN	6.983607	0.0416035	908	414.240877	1.715955e+05	1476	4270
MO	16.237705	0.0513042	686	242.403100	5.875926e+04	1015	6321
MS	23.463934	0.0441867	652	272.767799	7.440227e+04	1073	5153
MT	9.532787	0.0322033	17	2.213841	4.901093e+00	22	442
NC	13.881967	0.0307995	766	332.604095	1.106255e+05	1343	7608
ND	10.322951	0.0235276	54	24.468671	5.987158e+02	88	709
NE	10.378689	0.0137361	150	69.749673	4.865017e+03	269	1813
NH	8.432787	0.0493504	210	105.055541	1.103667e+04	367	1588
NJ	10.222951	0.0810085	12886	2468.338193	6.092693e+06	14944	99989
NM	11.013115	0.0423959	325	133.645285	1.786106e+04	497	2379
NV	14.831148	0.0479837	412	90.242048	8.143627e+03	507	4208
NY	8.075410	0.0632667	23564	2663.431605	7.093868e+06	24855	263460
OH	10.236364	0.0576833	2002	697.276094	4.861940e+05	2863	14694
OK	20.729508	0.0499969	318	59.901997	3.588249e+03	387	3017
OR	9.301639	0.0351260	148	34.497082	1.190049e+03	207	2127
PA	7.516393	0.0688958	5152	1678.466484	2.817250e+06	6649	37053
RI	5.046939	0.0435220	634	243.760052	5.941896e+04	950	6773
SC	16.167213	0.0366315	446	169.426933	2.870549e+04	739	4917
SD	11.267273	0.0107805	50	25.328325	6.415240e+02	91	1956
TN	15.609836	0.0163350	343	131.177092	1.720743e+04	604	8266
TX	24.901639	0.0246452	1536	553.361317	3.062087e+05	2424	21944
UT	10.418033	0.0102223	101	42.634692	1.817717e+03	172	3612
VA	8.657377	0.0315922	1236	426.610516	1.819965e+05	1763	10998
VT	6.934884	0.0535064	54	3.169355	1.004481e+01	56	831
WA	8.383607	0.0469887	1070	188.307435	3.545969e+04	1320	13750
WI	8.100000	0.0312706	517	168.390606	2.835540e+04	784	6011
WV	12.196721	0.0381807	74	19.313547	3.730131e+02	93	967
WY	15.849180	0.0145268	13	5.364781	2.878087e+01	20	453

final_data %>% summarize_all(function(x) sum(is.na(x))) %>% kbl() %>% 
    kable_styling()

state	time period label	positive	negative	total test results	death	uninsured at the time of interview	prop tested positive	prop died	month
0	0	0	0	0	0	96	0	0	0

Some interesting characteristics about the summary data: Table 2 shows that the number of uninsured at the time of interview did not change by much. Furthermore, the proportion of those who tested positive decreased (positive) from April to June while the proportion who died (prop died) remained approximately constant all 3 months. Overall, during this time period, New York had the highest number of positive cases from April to June. Lastly, there are only 96 NA values and it is only for the uninsured at the time of interview variable, but this is due to the considerable wrangling done at the preliminary stage.

Visualizing

Write a supporting paragraph (for each plot) describing what the plot depicts and any relationships/trends that are apparent (9 pts)

# scatter plot
state_summary %>% ggplot(aes(avg_perc_uninsured, avg_prop_death)) + 
    geom_point() + geom_smooth(method = "lm") + theme_classic() + 
    labs(title = "Relationship between Number of Deaths by Percent Uninsured per State", 
        subtitle = "From April to June of 2020", x = "Average Proportion of Uninsured", 
        y = "Average Proportion of Deaths") + # rename tick labels and remove white space
scale_x_continuous(labels = c(`5` = "5%", `10` = "10%", `15` = "15%", 
    `20` = "20%", `25` = "25%"), expand = c(0.01, 0))

This scatter plot, which plots average proportion of deaths to average proportion uninsured, appears to show that there is a negative linear relationship between the two variables. This is interesting because initially, I thought the inverse would be true. A possible explaination for this is that during peak pandemic, deaths may have been misclassified as “Unknown” due to the volume of patients in the hospital, so the data during this period could be inaccurate.

# histogram
final_data %>% ggplot(aes(x = `uninsured at the time of interview`)) + 
    geom_histogram(aes(y = ..density..), bins = 10) + geom_density(color = "black") + 
    facet_wrap(~month, scales = "free") + theme_classic() + labs(title = "Distribution of Uninsured by Month", 
    subtitle = "2020 Data", x = "Proportion of Uninsured", y = "Count") + 
    # change breaks and remove white space
scale_x_continuous(breaks = seq(0, 35, 5), expand = c(0.02, 0)) + 
    scale_y_continuous(breaks = seq(0, 1, 0.025), expand = c(0.01, 
        0))

This histogram shows the distribution of proportion unisured in the months of April, May, and June. April and June’s distribution is right-skewed while May is less skwewed.

# bar plot find mean deaths mean(final_data$`prop died`)

final_data %>% # reorder y-axis in order from greatest to least proportion
ggplot(aes(y = reorder(state, `prop died`), x = `prop died`)) + 
    geom_bar(stat = "summary") + # add a geom vertical line layer to show average proportion
# dead in U.S.
geom_vline(xintercept = 0.04159294, linetype = "dashed", color = "grey") + 
    # modify theme to resize font to fit text
theme(axis.text.y = element_text(size = 6)) + labs(title = "Proportion Died of Covid by State", 
    subtitle = "From April to June of 2020", x = "Proportion", 
    y = "") + # change the breaks from 0.005 to 0.0025 for accuracy, remove
# white space with expand function
scale_x_continuous(breaks = seq(0, 1, 0.01), expand = c(0, 0))

This last bar graph shows a summary of the proportion of individuals who tested positive and died per state. The vertical line shows the average proportion of deaths in the United States. It appears that there was a considerable amount of states with greater than 50% fatality. However, a possible explanation for this is at the time the data was collected, only certain individuals were tested for COVID-19. The tests were not yet publicly available to everyone, which may make the results appear more drastic.

Concluding Remarks

Overall, this analysis did not find any evidence to suggest that the state’s proportion insured increased the number of deaths per state, but further analysis into confounding variables may yield more accurate concluations about the true relationship.