library(dplyr)
library(tidyverse)
library(skimr)
library(flextable)
library(mice)
library(naniar)
library(gt)
library(plotly)
library(ggplot2)
library(lubridate)
library(gtExtras)
library(gganimate)
library(gifski)
library(calendR)
library(janitor)
library(moments)
library(usmap)
library(magick)
library(here)Investigating the Impact of Poverty and Population on Gun Violence Rates.
Introduction.
Gun violence remains a major public health issue in the United States. While various factors contribute to it, economic and demographic indicators such as poverty and population density are consistently suspected to be influential. This report aims to explore the relationship between poverty levels, population characteristics, and the prevalence of gun violence across U.S. states over time.
Our data sources include:
American Community Survey (ACS)- contains ACS estimates for U.S. states, including population, income, poverty rates, and demographic characteristics -,
Gun violence data -Provides gun violence incidents by state and date.
This analysis was conducted using R and Quarto, incorporating visualization, statistical testing, and reproducible data science practices.
Source
Data was obtained via the tidycensus R package and the United States Census website.
Data dictionary
American Community Survey (ACS) dataset.
| Variable | Type | Description |
|---|---|---|
| geoid | character | Geographic region ID with the first 2 digits being the state Federal Information Processing Standard (FIPS) code and the last 3 digits the county FIPS code |
| county_state | character | Geographic region |
| year | double | Year |
| population | double | Population |
| median_income | double | Median income in dollars |
| median_monthly_rent_cost | double | Median monthly rent costs for renters in dollars |
| median_monthly_home_cost | double | Median monthly housing costs for homeowners in dollars |
| prop_female | double | Proportion of people who are female |
| prop_male | double | Proportion of people who are male |
| prop_white | double | Proportion of people who are white alone |
| prop_black | double | Proportion of people who are Black or African American alone |
| prop_native | double | Proportion of people who are American Indian and Alaska Native alone |
| prop_asian | double | Proportion of people who are Asian alone |
| prop_hawaiin_islander | double | Proportion of people who are Native Hawaiian and Other Pacific Islander alone |
| prop_other_race | double | Proportion of people who are some other race alone |
| prop_multi_racial | double | Proportion of people who are two or more races |
| prop_highschool | double | Proportion of people 25 and older whose highest education-level is high school |
| prop_GED | double | Proportion of people 25 and older whose highest education-level is a GED |
| prop_some_college | double | Proportion of people 25 and older whose highest education-level is some, but less than 1 year of college |
| prop_college_no_degree | double | Proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree |
| prop_associates | double | Proportion of people 25 and older whose highest education-level is an Associates degree |
| prop_bachelors | double | Proportion of people 25 and older whose highest education-level is a Bachelors degree |
| prop_masters | double | Proportion of people 25 and older whose highest education-level is a Masters degree |
| prop_professional | double | Proportion of people 25 and older whose highest education-level is a Professional degree |
| prop_doctoral | double | Proportion of people 25 and older whose highest education-level is a Doctoral degree |
| prop_poverty | double | Proportion of people 25 and older living in poverty, defined by the Census Bureau as having an income below the poverty threshold for their family size |
Gun violence dataset.
| Variable | Type | Description |
|---|---|---|
| geoid | character | Geographic region ID (FIPS code, matching the ACS geoid) |
| incident_id | double | Unique identifier for each gun violence incident |
| date | double | Date of the incident in ISO 8601 format (YYYY-MM-DDTHH:MM:SSZ) |
| state | character | Name of the U.S. state where the incident occurred |
| city | character | Name of the city where the incident occurred |
| county | character | Name of the county where the incident occurred (if available) |
| business_or_school | character | Indicates whether the incident occurred at a business or school(if specified) |
| address | character | Address or block location where the incident took place |
| latitude | double | Latitude coordinate of the incident location |
| longitude | double | Longitude coordinate of the incident location |
| number_victims_killed | double | Number of victims killed in the incident |
| number_victims_injured | double | Number of victims injured in the incident |
| number_suspects_killed | double | Number of suspects killed in the incident |
| number_suspects_injured | double | Number of suspects injured in the incident |
| number_suspects_arrested | double | Number of suspects arrested following the incident |
| incident_characteristics | character | Text description of the nature of the incident (may include multiple characteristics) |
Loading packages.
Importing data into R.
American Community Survey (ACS) dataset
# Importing state census data from 2008 - 2023
us_data <- read_csv("census_data_state_2008-2023.csv")Rows: 780 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): geoid, county_state
dbl (24): year, population, median_income, median_monthly_rent_cost, median_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Gun violence dataset
# Importing of gun violence data
gun_violence <- read_csv("gun_violence_geo.csv")Rows: 389850 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): geoid, state, city, county, business_or_school, address, incident_...
dbl (8): incident_id, latitude, longitude, number_victims_killed, number_vi...
dttm (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Exploring missingness
To check for missingness in the ACS data.
skim(us_data)| Name | us_data |
| Number of rows | 780 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 24 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| geoid | 0 | 1 | 2 | 2 | 0 | 52 | 0 |
| county_state | 0 | 1 | 4 | 20 | 0 | 52 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2015.20 | 4.61 | 2008.00 | 2011.00 | 2015.00 | 2019.00 | 2023.00 | ▇▆▆▃▆ |
| population | 0 | 1 | 6229692.05 | 7025083.48 | 532668.00 | 1786156.75 | 4257700.00 | 7053886.75 | 39557045.00 | ▇▂▁▁▁ |
| median_income | 0 | 1 | 58358.78 | 14075.73 | 18314.00 | 48362.75 | 56484.50 | 67147.25 | 108210.00 | ▁▇▇▂▁ |
| median_monthly_rent_cost | 0 | 1 | 945.44 | 269.03 | 419.00 | 749.50 | 884.00 | 1092.25 | 1992.00 | ▃▇▃▁▁ |
| median_monthly_home_cost | 0 | 1 | 1109.16 | 371.59 | 241.00 | 850.00 | 1020.00 | 1353.00 | 2561.00 | ▁▇▃▂▁ |
| prop_female | 0 | 1 | 0.51 | 0.01 | 0.47 | 0.50 | 0.51 | 0.51 | 0.53 | ▁▂▇▇▁ |
| prop_male | 0 | 1 | 0.49 | 0.01 | 0.47 | 0.49 | 0.49 | 0.50 | 0.53 | ▁▇▇▂▁ |
| prop_white | 0 | 1 | 0.75 | 0.15 | 0.22 | 0.67 | 0.77 | 0.85 | 0.96 | ▁▁▃▇▇ |
| prop_black | 0 | 1 | 0.11 | 0.11 | 0.00 | 0.03 | 0.07 | 0.15 | 0.53 | ▇▃▂▁▁ |
| prop_native | 0 | 1 | 0.02 | 0.03 | 0.00 | 0.00 | 0.00 | 0.01 | 0.16 | ▇▁▁▁▁ |
| prop_asian | 0 | 1 | 0.04 | 0.05 | 0.00 | 0.01 | 0.03 | 0.04 | 0.39 | ▇▁▁▁▁ |
| prop_hawaiin_islander | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.11 | ▇▁▁▁▁ |
| prop_other_race | 0 | 1 | 0.04 | 0.04 | 0.00 | 0.01 | 0.02 | 0.05 | 0.32 | ▇▁▁▁▁ |
| prop_multi_racial | 0 | 1 | 0.05 | 0.05 | 0.01 | 0.02 | 0.03 | 0.05 | 0.39 | ▇▁▁▁▁ |
| prop_highschool | 0 | 1 | 0.24 | 0.04 | 0.10 | 0.22 | 0.24 | 0.26 | 0.35 | ▁▂▇▆▁ |
| prop_GED | 0 | 1 | 0.04 | 0.01 | 0.02 | 0.04 | 0.04 | 0.05 | 0.07 | ▃▇▆▃▁ |
| prop_some_college | 0 | 1 | 0.06 | 0.01 | 0.01 | 0.06 | 0.06 | 0.07 | 0.09 | ▁▁▅▇▂ |
| prop_college_no_degree | 0 | 1 | 0.15 | 0.02 | 0.08 | 0.13 | 0.14 | 0.16 | 0.22 | ▁▅▇▃▁ |
| prop_associates | 0 | 1 | 0.09 | 0.02 | 0.03 | 0.08 | 0.08 | 0.10 | 0.15 | ▁▃▇▂▁ |
| prop_bachelors | 0 | 1 | 0.19 | 0.03 | 0.10 | 0.17 | 0.19 | 0.21 | 0.29 | ▁▅▇▃▁ |
| prop_masters | 0 | 1 | 0.08 | 0.03 | 0.04 | 0.06 | 0.08 | 0.09 | 0.24 | ▇▆▁▁▁ |
| prop_professional | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.10 | ▇▁▁▁▁ |
| prop_doctoral | 0 | 1 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.05 | ▇▃▁▁▁ |
| prop_poverty | 0 | 1 | 0.14 | 0.05 | 0.07 | 0.11 | 0.13 | 0.16 | 0.46 | ▇▃▁▁▁ |
Lollipop plot to explore ACS data for missingness.
gg_miss_var(us_data, show_pct = TRUE) +
labs(title= "Figure 1:Missingness for ACS data",
caption = "Source: American Community Survey via tidycensus")The dataset has no missing values.
To check for missingness in the second dataset.
skim(gun_violence)| Name | gun_violence |
| Number of rows | 389850 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 8 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| geoid | 46 | 1.00 | 5 | 5 | 0 | 2956 | 0 |
| state | 0 | 1.00 | 4 | 20 | 0 | 51 | 0 |
| city | 761 | 1.00 | 3 | 27 | 0 | 11548 | 0 |
| county | 389065 | 0.00 | 1 | 22 | 0 | 459 | 0 |
| business_or_school | 316795 | 0.19 | 1 | 85 | 0 | 47476 | 0 |
| address | 9665 | 0.98 | 3 | 61 | 0 | 325895 | 0 |
| incident_characteristics | 102 | 1.00 | 8 | 828 | 0 | 10939 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| incident_id | 0 | 1 | 1468977.53 | 796698.22 | 92114.00 | 756778.50 | 1529846.00 | 2165292.50 | 2976478.00 | ▇▆▇▇▅ |
| latitude | 2 | 1 | 37.33 | 4.82 | 19.03 | 33.76 | 38.31 | 41.07 | 71.34 | ▁▇▅▁▁ |
| longitude | 2 | 1 | -89.25 | 13.45 | -171.74 | -93.77 | -86.77 | -80.20 | -67.04 | ▁▁▂▃▇ |
| number_victims_killed | 0 | 1 | 0.37 | 0.56 | 0.00 | 0.00 | 0.00 | 1.00 | 60.00 | ▇▁▁▁▁ |
| number_victims_injured | 0 | 1 | 0.78 | 1.06 | 0.00 | 0.00 | 1.00 | 1.00 | 439.00 | ▇▁▁▁▁ |
| number_suspects_killed | 0 | 1 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | ▇▁▁▁▁ |
| number_suspects_injured | 0 | 1 | 0.05 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 5.00 | ▇▁▁▁▁ |
| number_suspects_arrested | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2014-01-01 | 2023-12-31 | 2019-09-26 | 3652 |
Plotting a lollipop plot to explore missingness
gg_miss_var(gun_violence, show_pct = TRUE) +
labs(title= "Figure 2:Missingness for Gun Violence data",
caption = "Source: Gun Violence Archive")While the majority of the dataset was complete, we did observe missing values in certain descriptive fields of the gun violence data, notably business_or_school, county, and address. This missingness does not affect the calculation of core metrics like gun violence rates or their relationship with poverty which are the main focus of our analysis.
Exploratory Data Analysis.
1. ACS dataset.
Lets explore the summary statistics
# summarizing mean of the median_income from the year 2014-2023 by county_state.
us_data |>
filter(year >= 2014, year <= 2023) |>
group_by(year, county_state) |>
summarize(average_median_income = round(mean(median_income, na.rm = TRUE))) |>
arrange(desc(average_median_income)) |>
ungroup() |>
slice_head(n = 10) |>
flextable() |>
colformat_double(big.mark = "", digits = 0) |>
theme_zebra() |>
add_header_lines("Table 1: Top 10 states by Average median income, 2014–2023")`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Table 1: Top 10 states by Average median income, 2014–2023 | ||
|---|---|---|
year | county_state | average_median_income |
2023 | District of Columbia | 108210 |
2022 | District of Columbia | 101027 |
2023 | Massachusetts | 99858 |
2023 | New Jersey | 99781 |
2023 | Maryland | 98678 |
2023 | New Hampshire | 96838 |
2022 | New Jersey | 96346 |
2023 | California | 95521 |
2023 | Hawaii | 95322 |
2022 | Maryland | 94991 |
Next, we can check the sample size of the population its mean and standard deviation per year.
# Summarizing sample size of the population its mean and standard deviation per year.
summary_stat <- us_data |>
group_by(year) |> # Groups the data by year.
summarize(pop_mean = mean(population, na.rm = TRUE), # Calculates the mean population for each year.
pop_size = n(), # Counts the number of entries in states per year.
pop_sd = round(sd(population, na.rm = TRUE))) |> # Calculates and rounds the standard deviation of pouplation per year.
ungroup() |>
slice_head(n = 10)
# Create a flextable for the summary statistics
flextable(summary_stat) |>
colformat_double(big.mark = "", digits = 0) |>
add_header_lines("Table 2: Summary statistics over the years") |>
theme_vanilla()Table 2: Summary statistics over the years | |||
|---|---|---|---|
year | pop_mean | pop_size | pop_sd |
2008 | 5923342 | 52 | 6664198 |
2009 | 5980266 | 52 | 6719871 |
2010 | 6020612 | 52 | 6781449 |
2011 | 6063435 | 52 | 6847219 |
2012 | 6107329 | 52 | 6914594 |
2013 | 6148922 | 52 | 6975056 |
2014 | 6200105 | 52 | 7063774 |
2015 | 6247942 | 52 | 7140930 |
2016 | 6279593 | 52 | 7183936 |
2017 | 6328007 | 52 | 7257007 |
This table summarizes the average state-level population statistics from 2008 to 2017.These values reflect state-level averages whereby the mean population represents the average population across 52 U.S. states and territories each year, while the standard deviation shows the variation in population sizes across them.
Visualize the relationship of median income vs poverty rate using ggplot.
# Filter the dataset for the year 2019 and create a scatterplot of median income vs poverty rate
us_data|>
filter(year == 2019) |>
ggplot(aes(x = median_income, y = prop_poverty)) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Figure 3: Median Income vs. Poverty Rate (2019)",
x = "Median Income",
y = "Proportion in Poverty",
caption = "Data source: American Community Survey (ACS)") +
theme_bw()`geom_smooth()` using formula = 'y ~ x'
This scatter plot shows a negative relationship between median income and poverty rate across U.S. states in 2019. States with higher incomes tend to have lower poverty rates, as highlighted by the downward trend line.
Visualizing trend of the median_income vs prop_poverty for the state of Michigan over time.
us_data |>
filter(county_state == "Michigan") |>
ggplot(aes(x = year)) +
geom_line(aes(y = median_income, color = "Income")) +
geom_line(aes(y = prop_poverty * 100000, color = "Poverty (×100k)")) +
scale_y_continuous(
name = "Median Income",
sec.axis = sec_axis(~ . / 100000, name = "Poverty Rate")
) +
scale_color_viridis_d() +
labs(title = "Figure 4:Michigan: Median Income vs. Poverty Rate Over Time",
x = "Year",
caption = "Figure 4: Median Income vs Poverty Rate,
Data source: American Community Survey (ACS)")+
theme_dark() +
theme(legend.title = element_blank(),
legend.position = "bottom")The line graph illustrates Michigan’s socioeconomic evolution from 2008 to 2023, with rising median income signaling economic progress and fluctuating poverty rates highlighting persistent challenges. The inverse relationship between these metrics suggests that economic growth alone doesn’t eliminate poverty, which may contribute to social issues like gun violence.
2. Gun Violence Dataset.
The code snippet transforms the gun_violence dataset by converting the date column to a Date object and creating new columns (day, year, month) using lubridate functions. This enables temporal aggregation and analysis, supporting visualizations (e.g., calendar heatmap, yearly heatmap) and data merging with ACS data. It’s a critical preprocessing step that ensures the dataset is ready for the project’s statistical and visual analyses, directly contributing to insights about gun violence patterns and their socioeconomic drivers.
# Aggregating data.
gun_violence <- gun_violence |>
mutate(date = ymd(date), # Changing the data variable to a date type variable.
day = day(date),
year = year(date),
month = month(date, label = TRUE, abbr = FALSE)) # extracting the day, year and month.Creation of a table summarizing the total number of victims killed in gun violence incidents across cities in California during May 2019.
# Checking number of killed victims in the year 2019.
gun_violence |>
filter(state == "California", year == 2019, month == "May") |>
group_by(city)|>
summarize(total_victims_killed = sum(number_victims_killed))|>
arrange(desc(total_victims_killed)) |>
head(15) |>
gt() |>
gt_theme_nytimes() |>
tab_caption("Table 3: Top 15 cities by total number of victims killed in month of May 2019")| city | total_victims_killed |
|---|---|
| Stockton | 9 |
| Los Angeles | 7 |
| Oakland | 7 |
| Fresno | 5 |
| Sacramento | 4 |
| San Diego | 4 |
| Bakersfield | 3 |
| San Francisco | 3 |
| Cathedral City | 2 |
| Grass Valley | 2 |
| Long Beach | 2 |
| Oroville | 2 |
| Richmond | 2 |
| San Bernardino | 2 |
| West Hills | 2 |
Visualizing the number of gun violence victims killed per day in California for 2019 with CalendR.
# Calculating number of gun violence incidents for each date.
gun_violence_counts <- gun_violence |>
filter(state == "California", year == 2019) |>
mutate(date = make_date(year, month, day)) |>
group_by(date) |> # Groups incidents by date
summarize(total_victims_killed = sum(number_victims_killed, na.rm = TRUE))
# Creating a date sequence.
all_dates <- tibble(date = seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")) # Generates a sequence of all dates in 2019 (365 days).
# Joining data to fill missing days with 0
gun_violence_counts <- all_dates |>
left_join(gun_violence_counts, by = "date") |>
mutate(total_victims_killed = replace_na(total_victims_killed, 0)) # Ensures every day has a value (0 for no victims).
# Creating heatmap calendar of number of victims
gun_violence_calendar <- calendR(year = 2019,
title = "Daily distribution of gun victims in California, 2019",
special.days = gun_violence_counts$total_victims_killed, # Maps the number of victims to each day’s color intensity.
gradient = TRUE,
monthnames = month.name, # Displays full month names
months.size = 11,
weeknames.size = 3.3,
legend.pos = "bottom",
legend.title = "Number of victims killed",
ncol = 3,
margin = 1) +
scale_fill_gradient(low = "white", high = "deeppink4",
guide = guide_colorbar(frame.colour = "black",
ticks.colour = "black",
title.position = "top")) + labs(caption = "Figure 5: Daily gun violence deaths in California, 2019.
Source: Gun Violence Archive")Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the calendR package.
Please report the issue to the authors.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
# Saving plot
size_mult <- 22
ggsave(plot = gun_violence_calendar,
filename = "calendar_plot.png",
units = "mm",
dpi = 120,
width = 7*size_mult,
height = 8.5*size_mult)
knitr::include_graphics("calendar_plot.png")The calendar heatmap reveals that gun violence deaths in California peak on summer weekends and holidays.
Aggregating gun violence incidents and cencus data by year and state creating a new dataset called yearly_violence.
# Aggregating by year and state.
yearly_violence <- gun_violence |>
filter(!is.na(state), !is.na(year)) |> # Checks for missing or inconsistent state or year values
rename(county_state = state) |> #renaming the state to county_state because the column represents the state.
group_by(year, county_state) |>
summarise(total_cases = n(), total_victims_killed = sum(number_victims_killed, na.rm = TRUE)) #Computes the number of incidents in each group`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Visualizing the distribution of gun violence cases across states and years.
# Assuming yearly_violence has columns: year, county_state, total_cases
# Extract state from county_state and aggregate total_cases by state and year
yearly_violence_state <- yearly_violence |>
mutate(state = sub(".*,\\s*", "", county_state)) |> # Extract state (e.g., "WA" from "King County, WA")
group_by(state, year) |>
summarise(total_cases = sum(total_cases, na.rm = TRUE)) |>
ungroup()`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
# Create a usmap heatmap for a single year (e.g., the most recent year in the data)
# You can change the year as needed
latest_year <- max(yearly_violence_state$year)
yearly_violence_single_year <- yearly_violence_state |>
filter(year == latest_year)
# Plot usmap heatmap
plot_usmap(
data = yearly_violence_single_year,
values = "total_cases",
color = "white" # White borders for states, matching geom_tile's color
) +
scale_fill_viridis_c(
option = "C",
name = "Total Cases"
) +
labs(
title = paste("Heatmap of Gun Violence Incidents by State,", latest_year),
caption = "Figure 6: Heatmap of Gun Violence for 2023
Source:Gun Violence Archive",
fill = "Total Cases"
) +
theme_minimal() +
theme(
legend.position = "bottom",
legend.key.width = unit(1, "cm")
)String Manipulation.
# Creating contains_robbery variables
gun_robbery <- gun_violence |>
filter(!is.na(incident_characteristics)) |>
mutate(contains_robbery = str_detect(incident_characteristics, regex("Armed robbery with injury", ignore_case = TRUE)))# Summarizing the count of "Armed robbery with injury"
gun_robbery |>
summarize(armed_robbery_count = sum(contains_robbery))# A tibble: 1 × 1
armed_robbery_count
<int>
1 20890
#Counting Occurrences of contains_robbery
gun_robbery |>
count(contains_robbery)# A tibble: 2 × 2
contains_robbery n
<lgl> <int>
1 FALSE 368858
2 TRUE 20890
# using janitor package to get percentages.
gun_robbery |>
tabyl(contains_robbery) contains_robbery n percent
FALSE 368858 0.94640126
TRUE 20890 0.05359874
Intepretation: About 5.36% of the filtered gun violence incidents are associated with armed robbery with injury, which is a notable but relatively small proportion.
Investigating which states have the highest counts or proportions ofarmed robbery incidents.
# By state and year
robbery_2019 <- gun_robbery |>
filter(year == 2019) |>
group_by(state) |>
summarise(robbery_count = sum(contains_robbery),
total_incidents = n(),
robbery_mean = mean(contains_robbery) * 100) |>
arrange(desc(robbery_count)) |>
head(16)
# Display the table
flextable(robbery_2019) |>
colformat_double(digits = 3) |>
set_caption("Table 4: States with the highest counts of armed robbery incidents") |>
autofit()state | robbery_count | total_incidents | robbery_mean |
|---|---|---|---|
Texas | 269 | 2,541 | 10.586 |
Florida | 135 | 2,041 | 6.614 |
Georgia | 111 | 1,209 | 9.181 |
North Carolina | 107 | 1,371 | 7.805 |
Ohio | 106 | 1,617 | 6.555 |
Pennsylvania | 97 | 2,015 | 4.814 |
Illinois | 93 | 2,829 | 3.287 |
Tennessee | 93 | 1,319 | 7.051 |
California | 82 | 2,447 | 3.351 |
Louisiana | 77 | 1,294 | 5.951 |
Missouri | 66 | 1,194 | 5.528 |
Maryland | 65 | 1,239 | 5.246 |
South Carolina | 61 | 1,043 | 5.849 |
Indiana | 56 | 955 | 5.864 |
Alabama | 52 | 978 | 5.317 |
Michigan | 46 | 1,083 | 4.247 |
Intepretation: Texas leads with the highest incidents among the listed states.Texas, Florida, Georgia, North Carolina, Tennessee, and Louisiana are all in the southern U.S., suggesting a regional concentration of armed robbery with injury incidents.
Visualizing the above results.
library(grid)
#load or download a gun image
# Replace with your local image path, e.g., "path/to/gun_image.png"
# Example using a public-domain image URL (replace with a suitable image if needed)
gun_image <- image_read("gun_image.jpg") |>
image_colorize(opacity = 50, color = "white")
# Save the image temporarily (or use local file directly)
temp_image <- tempfile(fileext = ".png")
image_write(gun_image, temp_image)
# Convert to raster for use in ggplot
gunimage_grob <- rasterGrob(gun_image, width = unit(1, "npc"), height = unit(1, "npc"))
robbery_2019 |>
ggplot(aes(x = reorder(state, robbery_count), y = robbery_count)) +
annotation_custom(gunimage_grob, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) +
geom_bar(stat = "identity", fill = "steelblue") +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
coord_flip() +
labs(title = "Figure 7 : Armed Robbery with Injury Incidents by State",
x = "State", y = "Incident Count") +
theme_minimal() +
theme(
panel.background = element_rect(fill = "transparent", color = NA),
plot.background = element_rect(fill = "transparent", color = NA)
)Investigating incident characteristic over the years for California.
# Clean incident_characteristics
gun_robbery_clean <- gun_robbery |>
filter(!is.na(incident_characteristics)) |>
mutate(incident_characteristics = str_replace_all(
incident_characteristics, "\\r\\n|\\n", ",")) |>
separate_rows(incident_characteristics, sep = ",") |>
mutate(incident_characteristics = str_trim(incident_characteristics)) |>
filter(incident_characteristics != "")
# Build data for animation
anim_data <- gun_robbery_clean |>
filter(state == "California") |>
count(year, incident_characteristics, sort = TRUE) |>
group_by(year) |>
slice_max(n, n = 10) |>
ungroup()
# Build plotly animation
anim_data |>
plot_ly(
x = ~n,
y = ~reorder(incident_characteristics, n),
frame = ~year,
type = "bar",
orientation = "h",
color = ~incident_characteristics,
colors = "Set3",
showlegend = FALSE
) |>
layout(
title = "Figure 8: Top 10 Incident Types in California by Year",
xaxis = list(title = "Count"),
yaxis = list(title = ""),
paper_bgcolor = "white",
plot_bgcolor = "white"
) |>
animation_opts(
frame = 800,
transition = 400,
easing = "cubic-in-out"
) |>
animation_button(
x = 0, xanchor = "left",
y = -0.1, yanchor = "bottom",
label = "▶ Play"
) |>
animation_slider(
currentvalue = list(prefix = "Year: ")
)Merging the two datasets.
Merging the two dataset by year and county_state with the function inner_join to create a new variable called’pop_gunviolence`.
economic_indic <- us_data |>
filter(year >= 2014, year <= 2023) |>
select(county_state, year, population, median_income, prop_poverty)pop_gunviolence <- yearly_violence |>
inner_join(economic_indic, by = c("year", "county_state"))skim(head(pop_gunviolence, 15))| Name | head(pop_gunviolence, 15) |
| Number of rows | 15 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | year |
Variable type: character
| skim_variable | year | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|---|
| county_state | 2014 | 0 | 1 | 5 | 20 | 0 | 15 | 0 |
Variable type: numeric
| skim_variable | year | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| total_cases | 2014 | 0 | 1 | 681.87 | 748.17 | 20.00 | 214.00 | 343.00 | 847.00 | 2231.00 | ▇▂▁▁▂ |
| total_victims_killed | 2014 | 0 | 1 | 278.20 | 325.58 | 6.00 | 55.50 | 145.00 | 375.00 | 1147.00 | ▇▂▁▁▁ |
| population | 2014 | 0 | 1 | 7810374.13 | 10083654.60 | 658893.00 | 1527012.50 | 4849377.00 | 8414413.50 | 38802500.00 | ▇▂▁▁▁ |
| median_income | 2014 | 0 | 1 | 56767.87 | 10665.98 | 41262.00 | 48591.00 | 57444.00 | 65762.50 | 71648.00 | ▃▇▂▅▆ |
| prop_poverty | 2014 | 0 | 1 | 0.15 | 0.03 | 0.11 | 0.12 | 0.15 | 0.18 | 0.19 | ▆▂▅▃▇ |
knitr::kable(head(pop_gunviolence,10),
caption = "Table 5: Population and Gun Violence Data")| year | county_state | total_cases | total_victims_killed | population | median_income | prop_poverty |
|---|---|---|---|---|---|---|
| 2014 | Alabama | 725 | 264 | 4849377 | 42830 | 0.1925258 |
| 2014 | Alaska | 63 | 24 | 736732 | 71583 | 0.1123372 |
| 2014 | Arizona | 343 | 175 | 6731484 | 50068 | 0.1824119 |
| 2014 | Arkansas | 295 | 145 | 2966369 | 41262 | 0.1886608 |
| 2014 | California | 2144 | 1147 | 38802500 | 61933 | 0.1644525 |
| 2014 | Colorado | 233 | 113 | 5355866 | 61303 | 0.1204445 |
| 2014 | Connecticut | 225 | 67 | 3596677 | 70048 | 0.1075192 |
| 2014 | Delaware | 203 | 44 | 935614 | 59716 | 0.1247874 |
| 2014 | District of Columbia | 353 | 82 | 658893 | 71648 | 0.1772565 |
| 2014 | Florida | 1672 | 722 | 19893297 | 47463 | 0.1649502 |
Creating a new variable called gun_violence_ratewhich standardizes the number of gun violence cases total_cases by population size. The multiplication by 100,000 turns it into a rate that’s easier to interpret and compare across different regions or years.
# Calculate the rate of gun violence per 100,000 population
pop_gunviolence <- pop_gunviolence |>
mutate(gun_violence_rate = total_cases / population * 100000)Randomization of pop_gunviolence data.
# Randomization test for population
randomization_test <- function(data, n_randomizations = 1000) {
observed_correlation <- cor(data$population, data$gun_violence_rate) # Computes the observed correlation between population and gun_violence_rate
randomized_correlations <- numeric(n_randomizations)
# Randomize the population and calculate correlation each time.
for (i in 1:n_randomizations) {
randomized_data <- data |>
mutate(population = sample(population))
randomized_correlations[i] <- cor(randomized_data$population, randomized_data$gun_violence_rate)
}
# Calculate p_value.
p_value <- mean(abs(randomized_correlations) >= abs(observed_correlation))
return(list(observed_correlation = observed_correlation, p_value = p_value,permuted_distribution = randomized_correlations))
}
# Run the test
result_pop <- randomization_test(pop_gunviolence) The observed correlation is -0.05159267, this indicates that population size has little to no linear association with the gun violence rate in this dataset.The p-value is 0.259. This shows that there is no statistical significance evident to suggest that the observed correlation from what could occur by random chance.
Visualizing null distribution of permuted correlations.
p <- ggplot(data.frame(correlation = result_pop$permuted_distribution),
aes(x = correlation)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
geom_vline(xintercept = result_pop$observed_correlation,
color = "red", linetype = "dashed", linewidth = 1.2) +
labs(
title = paste0("Null Distribution | Observed r = ",
round(result_pop$observed_correlation, 3),
" | p = ", round(result_pop$p_value, 3)),
x = "Correlation between Population and Gun Violence Rate",
y = "Frequency",
caption = "Figure 9: Null distribution of Permutation Correlations,
Data Source: American Community Survey (ACS), Gun Violence Archive"
) +
theme_minimal()
pThe permutation test results show a weak negative correlation (-0.052) between population and gun violence rate, with a p-value of 0.259. The visualization displays the null distribution of correlations that would occur by random chance, with our observed correlation (red dashed line) falling within this distribution. Since the p-value exceeds the conventional threshold of 0.05. This indicates that population alone may not be a reliable predictor of gun violence rates.
Permutation test for prop_poverty and gun_violence_rate.
Permutation test to evaluate the relationship between prop_poverty and gun_violence_rate
# Perform permutations and randomization
set.seed(1994) # Setting seed for reproducibility
# Permutation test for prop_poverty
# Defining permutation test function.
perm_test <- function(data, n_permutations = 1000) {
observed_correlation <- cor(data$prop_poverty, data$gun_violence_rate)
permuted_correlations <- numeric(n_permutations)
# Running the permutations
for (i in 1:n_permutations) {
permuted_data <- data |>
mutate(prop_poverty = sample(prop_poverty))
permuted_correlations[i] <- cor(permuted_data$prop_poverty, permuted_data$gun_violence_rate)
}
p_value <- mean(abs(permuted_correlations) >= abs(observed_correlation))
return(list(observed_correlation = observed_correlation, p_value = p_value))
}
# Perform the permutation test
result <- perm_test(pop_gunviolence)
print(result)$observed_correlation
[1] 0.3165579
$p_value
[1] 0
This provides strong evidence that poverty rates and gun violence rates are meaningfully related in our dataset, suggesting that socioeconomic factors may play an important role in explaining variations in gun violence across different areas.
Relationship between Poverty_rate vs. Gun_violence_rate.
Creating a new variable `poverty_group by splitting area by median poverty rate.
pop_gunviolence <- pop_gunviolence |>
mutate(poverty_group = ifelse(prop_poverty > median(prop_poverty, na.rm = TRUE),
"High Poverty", "Low Poverty"))Creating a density plot to check for skewness and normality.
ggplot(pop_gunviolence, aes(x = gun_violence_rate, fill = poverty_group)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("High Poverty" = "#e63946", "Low Poverty" = "#457b9d")) +
facet_grid(poverty_group ~ .) labs(
title = "Figure 10: Density Plot of Gun Violence Rates by Poverty Group",
x = "Gun Violence Rate (per 100,000 population)",
y = "Density",
fill = "Poverty Group"
) +
theme_minimal() +
scale_fill_manual(values = c("High Poverty" = "#e63946", "Low Poverty" = "#457b9d"))NULL
Two-sample t-test
Next, we state the hypotheses and implement Welch’s two-sample t-test to test whether the average gun violence rate is different between high poverty and low poverty areas. \[ H_0: \mu_{\text{High}} = \mu_{\text{Low}} \] \[ H_a: \mu_{\text{High}} = \mu_{\text{Low}} \] where \(\mu_{\text{High}}\) is the average gun violence rate in high poverty areas., and \(\mu_{\text{Low}}\) is the average gun violence rate in low poverty areas.
# Reorder factor levels for t-test
pop_gunviolence <- pop_gunviolence |>
dplyr::mutate(poverty_group = fct_relevel(poverty_group, "High Poverty", "Low Poverty"))
t_test_result <- t.test(gun_violence_rate ~ poverty_group, data = pop_gunviolence)
# Display results of t-test
t_test_result |>
broom::tidy() |>
flextable() |>
colformat_double(digits = 3) |>
set_formatter(p.value = function(x) {format.pval(x, digits = 3)}) |>
set_caption("Table 6: Results of two-sample t-test comparing gun violence rates between high and low poverty areas.") |>
autofit() |>
fit_to_width(max_width = 7)estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
6.962 | 15.879 | 8.917 | 6.633 | 1.53e-10 | 300.758 | 4.897 | 9.028 | Welch Two Sample t-test | two.sided |
We observed high positive skewness in the gun violence rate variable, particularly in the High Poverty group. Given the violation of the normality assumption required for the t-test, we proceeded with a non-parametric permutation test to assess the difference in means, which does not rely on distributional assumptions.
Calculating standard deviations and variances for each poverty group
pop_gunviolence |>
group_by(poverty_group) |>
summarise(
n = n(),
Mean = mean(gun_violence_rate, na.rm = TRUE),
SD = sd(gun_violence_rate, na.rm = TRUE),
Var = var(gun_violence_rate, na.rm = TRUE)) |>
flextable() |>
colformat_double(digits = 3) |>
set_caption("Table 7: Summary statistics of gun violence rate by poverty group") |>
autofit()poverty_group | n | Mean | SD | Var |
|---|---|---|---|---|
High Poverty | 225 | 15.879 | 14.520 | 210.822 |
Low Poverty | 234 | 8.917 | 6.209 | 38.552 |
The standard deviation of gun violence rates in the High Poverty group was 210.82, compared to 38.55 in the Low Poverty group. This indicates higher variability in one group. The difference in variance supports the use of Welch’s t-test, which does not assume equal variances between groups.
Assessing statistical significance difference
Performing a permutation test to assess whether there is a statistically significant difference in the gun violence rate between two groups defined by the poverty_group variable in the pop_gunviolence dataset.
# Number of permutations to do
n_permutations <- 10000
# Instantiating vector for test statistics
permutation_statistics <- numeric(n_permutations)
# Calculating t-test statistic for each permutation
for (p in 1:n_permutations) {
permuted_data <- pop_gunviolence |>
mutate(poverty_group = sample(poverty_group, replace = FALSE)) # Shuffle group labels
permutation_statistics[p] <- t.test(gun_violence_rate ~ poverty_group,
data = permuted_data,
alternative = "greater") |>
broom::tidy() |>
pull(statistic)
}
pp_value <- mean(permutation_statistics >= permutation_statistics[p], na.rm = TRUE)Interpreting we reject the null hypothesis since our p-value is 0.2785, suggesting that high-poverty areas have a significantly higher gun violence rate than low poverty areas.
Create a histogram displaying the null distribution obtained for the randomization test
# Organizing into a tibble
computed_statistics <- tibble(Value = permutation_statistics)
d <- computed_statistics |>
ggplot(aes(x = Value)) +
geom_histogram(aes(y= after_stat(density)), color = "white") +
geom_density(color = "maroon1", linewidth = 1) +
stat_function(fun = function(x) dt(x, df = t_test_result$parameter),color = "seagreen",linewidth = 1) +
labs(title = "Figure 11: Randomization null distribution") +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
geom_vline(xintercept = quantile(permutation_statistics, probs = 0.95), color = "red", linetype = "solid") +
geom_vline(xintercept = t_test_result$statistic, color= "blue",linetype = "dotted")
d`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
The distribution is approximately centered at zero, consistent with the null hypothesis of no group difference. The blue dotted line shows the observed t-statistic, while the red solid line marks the 95th percentile of the null distribution. If the observed statistic lies beyond this threshold, it provides evidence against the null hypothesis.
#Bootsrapping
# Subsetting to only Poverty groups classified as High Poverty
high_pdata <- pop_gunviolence |>
filter(poverty_group == "High Poverty")
# Sample size
nrow(high_pdata) # Returns the number of observations of HighPoverty[1] 225
# Sample median
median(high_pdata$gun_violence_rate)[1] 13.11814
Let’s implement the nonparametric bootstrap.
# Number of bootstrap samples
n_boots <- 10000
# Instantiating matrix for bootstrap samples
boots <- matrix(NA, nrow = nrow(high_pdata), ncol = n_boots)
# Sampling with replacement n_boots times
for(b in 1:n_boots) {
boots[, b] <- high_pdata |>
slice_sample(prop = 1, replace = TRUE) |>
dplyr::pull(gun_violence_rate)
}Using the generated bootstrap samples and the code below, create a bootstrap distribution of sample medians, and visualize this distribution using a histogram.
# Instantiating vector for bootstrap medians
boot_medians <- vector(length = n_boots)
# Calculating medians for bootstrap samples
for(b in 1:n_boots) {
boot_medians[b] <- median(boots[, b])
}Visualizing.
# Creating a tibble
tidy_boot_medians <- tibble(Median = boot_medians)
# Creating a histogram
tidy_boot_medians |>
ggplot(aes(x = Median)) +
geom_histogram(color = "black") +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(title = "Figure 12: Nonparametric Bootstrap Distribution of Median Gun Violence Incidents",
y = "Frequency") +
geom_vline(xintercept = quantile(boot_medians, probs = c(0.025, 0.975)),
color = "blue")`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
# Calculating an estimated standard error
sd(boot_medians)[1] 0.6189031
# Calculating a nonparametric confidence interval
nonparametric_ci <- quantile(boot_medians, probs = c(0.025, 0.975))
#Convert to data frame
np_ci <- data.frame(
Lower = nonparametric_ci[1],
Upper = nonparametric_ci[2]
)
# Create flextable
flextable(np_ci) |>
set_caption("Table 8: Nonparametic Confidence Interval") |>
autofit()Lower | Upper |
|---|---|
11.65563 | 14.18552 |
Interpretation: The 95% bootstrap confidence interval (11.65563, 14.18552) indicates that we are 95% confident that the true population median number of gun violence incidents in High poverty states lies between approximately 11.66 and 14.19 incidents.
Conclusion/Main Takeaways.
The analysis reveals a significant relationship between poverty and gun violence rates across U.S. states from 2014 to 2023, with high-poverty areas exhibiting significantly higher gun violence rates compared to low-poverty areas, as confirmed by a permutation test (p-value < 0.05). While population size showed a weak, non-significant correlation with gun violence rates (p = 0.259), the data highlighted regional patterns, such as higher armed robbery incidents in southern states like Texas and Florida. Visualizations, including heatmaps and density plots, underscored temporal and geographic variations, with gun violence peaking in California during summer weekends. The bootstrap analysis further indicated that the median gun violence rate in high-poverty states ranges between 11.66 and 14.19 incidents per 100,000 population, emphasizing the persistent socioeconomic drivers of gun violence and the need for targeted interventions in economically disadvantaged areas.
Contributions of Each Group Member
Contributions of Each Group Member Kate Wema Juma
- Data Dictionary Development: Authored the data dictionary for the ACS dataset. Summary Tables: Created and formatted Table 1 (Top 10 States by Average Median Income, 2014–2023) and Table 2 (Summary Statistics Over the Years).
- Data Cleaning and Manipulation: date/time processing with lubridate functions (ymd, year, month, day) for temporal analysis (e.g., Figure 5).
- Exploratory Data Analysis (EDA): Conducted missingness analysis for the ACS dataset using skimr and created Figure 1 (Lollipop Plot of Missingness for ACS Data).
- Visualizations: Created sophisticated visualizations, including Figure 5 (Calendar Heatmap of Gun Violence in California, 2019) using calendR, Figure 6 (US Map Heatmap of Gun Violence, 2023) using usmap, and Figure 8 (Animated Bar Chart of Incident Types in California) using gganimate.
- Statistical Analysis: Wrote and executed code for permutation tests (Figures 9 and 11) and bootstrap analysis (Figure 12, Table 8), including hypothesis formulation, null distribution visualization, and interpretation of results (e.g., p-value for poverty vs. gun violence rate).
- Report Writing: Drafted the introduction, exploratory data analysis, and conclusion sections.
- Code Organization: Added descriptive comments to R code chunks with required section headers (e.g., Data Dictionary, Exploratory Data Analysis).
- Extra Credit Contribution (Combined): Partnered to design and implement interactive visualizations using ggplotly for Figures 9 and 11 (null distributions of permutation tests). Contributed to brainstorming the animated bar chart (Figure 8) using gganimate. Contributed to dashboard.
Cynthia Mutua
- Data Dictionary Development: Authored the data dictionary for the Gun Violence dataset.
- Exploratory Data Analysis (EDA): Conducted missingness analysis for the Gun Violence dataset and created Figure 2 (Lollipop Plot of Missingness for Gun Violence Data).
- Data Cleaning and Manipulation: Implemented string manipulation using str_detect, str_replace_all, and str_trim to analyze incident_characteristics (e.g., Table 4, Figure 7).
- Visualizations: Created sophisticated visualizations, including Figure 5 (Calendar Heatmap of Gun Violence in California, 2019) using calendR, Figure 6 (US Map Heatmap of Gun Violence, 2023) using usmap, and Figure 8 (Animated Bar Chart of Incident Types in California) using gganimate.
- Statistical Analysis: Wrote and executed code for permutation tests (Figures 9 and 11) and bootstrap analysis (Figure 12, Table 8), including hypothesis formulation, null distribution visualization, and interpretation of results (e.g., p-value for poverty vs. gun violence rate).
- Report Writing: Drafted sections on data cleaning, permutation tests, and bootstrap analysis.
- Extra Credit Contribution (Combined): Partnered to develop interactive visualizations using ggplotly for Figures 9 and 11, ensuring seamless integration of interactive features. Co-designed the animated bar chart (Figure 8) using gganimate. Contributed to dashboard.
Combined Efforts
- Merging Datasets: Both members collaborated on merging the ACS and Gun Violence datasets using inner_join to create the pop_gunviolence dataset, enabling analyses like Figure 10 (Density Plot of Gun Violence Rates by Poverty Group) and Table 5.
- Extra Credit Focus: Worked together to implement ggplotly for interactive plots (Figures 9 and 11). Collaborated on the gganimate-based Figure 8, which animates changes in gun violence incident types in California.
- Project Review and Debugging: Both members reviewed the Quarto file for errors. We also cross-checked visualizations and tables.