Investigating the Impact of Poverty and Population on Gun Violence Rates.

Author

Cynthia Mutua & Kate Juma

Published

March 22, 2026

Introduction.

Gun violence remains a major public health issue in the United States. While various factors contribute to it, economic and demographic indicators such as poverty and population density are consistently suspected to be influential. This report aims to explore the relationship between poverty levels, population characteristics, and the prevalence of gun violence across U.S. states over time.

Our data sources include:

American Community Survey (ACS)- contains ACS estimates for U.S. states, including population, income, poverty rates, and demographic characteristics -,
Gun violence data -Provides gun violence incidents by state and date.

This analysis was conducted using R and Quarto, incorporating visualization, statistical testing, and reproducible data science practices.

Source

Data was obtained via the tidycensus R package and the United States Census website.

Data dictionary

American Community Survey (ACS) dataset.

Variable	Type	Description
geoid	character	Geographic region ID with the first 2 digits being the state Federal Information Processing Standard (FIPS) code and the last 3 digits the county FIPS code
county_state	character	Geographic region
year	double	Year
population	double	Population
median_income	double	Median income in dollars
median_monthly_rent_cost	double	Median monthly rent costs for renters in dollars
median_monthly_home_cost	double	Median monthly housing costs for homeowners in dollars
prop_female	double	Proportion of people who are female
prop_male	double	Proportion of people who are male
prop_white	double	Proportion of people who are white alone
prop_black	double	Proportion of people who are Black or African American alone
prop_native	double	Proportion of people who are American Indian and Alaska Native alone
prop_asian	double	Proportion of people who are Asian alone
prop_hawaiin_islander	double	Proportion of people who are Native Hawaiian and Other Pacific Islander alone
prop_other_race	double	Proportion of people who are some other race alone
prop_multi_racial	double	Proportion of people who are two or more races
prop_highschool	double	Proportion of people 25 and older whose highest education-level is high school
prop_GED	double	Proportion of people 25 and older whose highest education-level is a GED
prop_some_college	double	Proportion of people 25 and older whose highest education-level is some, but less than 1 year of college
prop_college_no_degree	double	Proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree
prop_associates	double	Proportion of people 25 and older whose highest education-level is an Associates degree
prop_bachelors	double	Proportion of people 25 and older whose highest education-level is a Bachelors degree
prop_masters	double	Proportion of people 25 and older whose highest education-level is a Masters degree
prop_professional	double	Proportion of people 25 and older whose highest education-level is a Professional degree
prop_doctoral	double	Proportion of people 25 and older whose highest education-level is a Doctoral degree
prop_poverty	double	Proportion of people 25 and older living in poverty, defined by the Census Bureau as having an income below the poverty threshold for their family size

Gun violence dataset.

Variable	Type	Description
geoid	character	Geographic region ID (FIPS code, matching the ACS geoid)
incident_id	double	Unique identifier for each gun violence incident
date	double	Date of the incident in ISO 8601 format (YYYY-MM-DDTHH:MM:SSZ)
state	character	Name of the U.S. state where the incident occurred
city	character	Name of the city where the incident occurred
county	character	Name of the county where the incident occurred (if available)
business_or_school	character	Indicates whether the incident occurred at a business or school(if specified)
address	character	Address or block location where the incident took place
latitude	double	Latitude coordinate of the incident location
longitude	double	Longitude coordinate of the incident location
number_victims_killed	double	Number of victims killed in the incident
number_victims_injured	double	Number of victims injured in the incident
number_suspects_killed	double	Number of suspects killed in the incident
number_suspects_injured	double	Number of suspects injured in the incident
number_suspects_arrested	double	Number of suspects arrested following the incident
incident_characteristics	character	Text description of the nature of the incident (may include multiple characteristics)

Loading packages.

library(dplyr)
library(tidyverse)
library(skimr)
library(flextable)
library(mice)
library(naniar)
library(gt)
library(plotly)
library(ggplot2)
library(lubridate)
library(gtExtras)
library(gganimate)
library(gifski)
library(calendR)
library(janitor)
library(moments)
library(usmap)
library(magick)
library(here)

Importing data into R.

American Community Survey (ACS) dataset

# Importing state census data from 2008 - 2023
us_data <- read_csv("census_data_state_2008-2023.csv")

Rows: 780 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): geoid, county_state
dbl (24): year, population, median_income, median_monthly_rent_cost, median_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Gun violence dataset

# Importing of gun violence data 
gun_violence <- read_csv("gun_violence_geo.csv")

Rows: 389850 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): geoid, state, city, county, business_or_school, address, incident_...
dbl  (8): incident_id, latitude, longitude, number_victims_killed, number_vi...
dttm (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exploring missingness

To check for missingness in the ACS data.

skim(us_data)

Data summary
Name	us_data
Number of rows	780
Number of columns	26
_______________________
Column type frequency:
character	2
numeric	24
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
geoid	0	1	2	2	0	52	0
county_state	0	1	4	20	0	52	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2015.20	4.61	2008.00	2011.00	2015.00	2019.00	2023.00	▇▆▆▃▆
population	1	6229692.05	7025083.48	532668.00	1786156.75	4257700.00	7053886.75	39557045.00	▇▂▁▁▁
median_income	1	58358.78	14075.73	18314.00	48362.75	56484.50	67147.25	108210.00	▁▇▇▂▁
median_monthly_rent_cost	1	945.44	269.03	419.00	749.50	884.00	1092.25	1992.00	▃▇▃▁▁
median_monthly_home_cost	1	1109.16	371.59	241.00	850.00	1020.00	1353.00	2561.00	▁▇▃▂▁
prop_female	1	0.51	0.01	0.47	0.50	0.51	0.51	0.53	▁▂▇▇▁
prop_male	1	0.49	0.01	0.47	0.49	0.49	0.50	0.53	▁▇▇▂▁
prop_white	1	0.75	0.15	0.22	0.67	0.77	0.85	0.96	▁▁▃▇▇
prop_black	1	0.11	0.11	0.00	0.03	0.07	0.15	0.53	▇▃▂▁▁
prop_native	1	0.02	0.03	0.00	0.00	0.00	0.01	0.16	▇▁▁▁▁
prop_asian	1	0.04	0.05	0.00	0.01	0.03	0.04	0.39	▇▁▁▁▁
prop_hawaiin_islander	1	0.00	0.01	0.00	0.00	0.00	0.00	0.11	▇▁▁▁▁
prop_other_race	1	0.04	0.04	0.00	0.01	0.02	0.05	0.32	▇▁▁▁▁
prop_multi_racial	1	0.05	0.05	0.01	0.02	0.03	0.05	0.39	▇▁▁▁▁
prop_highschool	1	0.24	0.04	0.10	0.22	0.24	0.26	0.35	▁▂▇▆▁
prop_GED	1	0.04	0.01	0.02	0.04	0.04	0.05	0.07	▃▇▆▃▁
prop_some_college	1	0.06	0.01	0.01	0.06	0.06	0.07	0.09	▁▁▅▇▂
prop_college_no_degree	1	0.15	0.02	0.08	0.13	0.14	0.16	0.22	▁▅▇▃▁
prop_associates	1	0.09	0.02	0.03	0.08	0.08	0.10	0.15	▁▃▇▂▁
prop_bachelors	1	0.19	0.03	0.10	0.17	0.19	0.21	0.29	▁▅▇▃▁
prop_masters	1	0.08	0.03	0.04	0.06	0.08	0.09	0.24	▇▆▁▁▁
prop_professional	1	0.02	0.01	0.01	0.02	0.02	0.02	0.10	▇▁▁▁▁
prop_doctoral	1	0.01	0.01	0.01	0.01	0.01	0.02	0.05	▇▃▁▁▁
prop_poverty	1	0.14	0.05	0.07	0.11	0.13	0.16	0.46	▇▃▁▁▁

Lollipop plot to explore ACS data for missingness.

gg_miss_var(us_data, show_pct = TRUE) +
labs(title= "Figure 1:Missingness for ACS data",
     caption = "Source: American Community Survey via tidycensus")

The dataset has no missing values.

To check for missingness in the second dataset.

skim(gun_violence)

Data summary
Name	gun_violence
Number of rows	389850
Number of columns	16
_______________________
Column type frequency:
character	7
numeric	8
POSIXct	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
geoid	46	1.00	5	5	2956
state	0	1.00	4	20	51
city	761	1.00	3	27	11548
county	389065	0.00	1	22	459
business_or_school	316795	0.19	1	85	47476
address	9665	0.98	3	61	325895
incident_characteristics	102	1.00	8	828	10939

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
incident_id	0	1	1468977.53	796698.22	92114.00	756778.50	1529846.00	2165292.50	2976478.00	▇▆▇▇▅
latitude	2	1	37.33	4.82	19.03	33.76	38.31	41.07	71.34	▁▇▅▁▁
longitude	2	1	-89.25	13.45	-171.74	-93.77	-86.77	-80.20	-67.04	▁▁▂▃▇
number_victims_killed	0	1	0.37	0.56	0.00	0.00	0.00	1.00	60.00	▇▁▁▁▁
number_victims_injured	0	1	0.78	1.06	0.00	0.00	1.00	1.00	439.00	▇▁▁▁▁
number_suspects_killed	0	1	0.06	0.24	0.00	0.00	0.00	0.00	4.00	▇▁▁▁▁
number_suspects_injured	0	1	0.05	0.23	0.00	0.00	0.00	0.00	5.00	▇▁▁▁▁
number_suspects_arrested	0	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2014-01-01	2023-12-31	2019-09-26	3652

Plotting a lollipop plot to explore missingness

gg_miss_var(gun_violence, show_pct = TRUE) +
  labs(title= "Figure 2:Missingness for Gun Violence data",
       caption = "Source: Gun Violence Archive")

While the majority of the dataset was complete, we did observe missing values in certain descriptive fields of the gun violence data, notably business_or_school, county, and address. This missingness does not affect the calculation of core metrics like gun violence rates or their relationship with poverty which are the main focus of our analysis.

Exploratory Data Analysis.

1. ACS dataset.

Lets explore the summary statistics

# summarizing mean of the median_income from the year 2014-2023 by county_state.
us_data |>
  filter(year >= 2014, year <= 2023) |>
  group_by(year, county_state) |>
  summarize(average_median_income =  round(mean(median_income, na.rm = TRUE))) |>
  arrange(desc(average_median_income)) |>
  ungroup() |>
  slice_head(n = 10) |>
  flextable() |>
  colformat_double(big.mark = "", digits = 0) |>
  theme_zebra() |>
  add_header_lines("Table 1: Top 10 states by Average median income, 2014–2023")

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Table 1: Top 10 states by Average median income, 2014–2023
year	county_state	average_median_income
2023	District of Columbia	108210
2022	District of Columbia	101027
2023	Massachusetts	99858
2023	New Jersey	99781
2023	Maryland	98678
2023	New Hampshire	96838
2022	New Jersey	96346
2023	California	95521
2023	Hawaii	95322
2022	Maryland	94991

Next, we can check the sample size of the population its mean and standard deviation per year.

# Summarizing sample size of the population its mean and standard deviation per year.
summary_stat <- us_data |>
  group_by(year) |> # Groups the data by year.
  summarize(pop_mean =  mean(population, na.rm = TRUE), # Calculates the mean population for each year.
            pop_size = n(), # Counts the number of entries in states per year.
            pop_sd = round(sd(population, na.rm = TRUE))) |> # Calculates and rounds the standard deviation of pouplation per year.
  ungroup() |>
  slice_head(n = 10)
  
# Create a flextable for the summary statistics
flextable(summary_stat) |>
   colformat_double(big.mark = "", digits = 0) |>
  add_header_lines("Table 2: Summary statistics over the years") |>
  theme_vanilla()

Table 2: Summary statistics over the years
year	pop_mean	pop_size	pop_sd
2008	5923342	52	6664198
2009	5980266	52	6719871
2010	6020612	52	6781449
2011	6063435	52	6847219
2012	6107329	52	6914594
2013	6148922	52	6975056
2014	6200105	52	7063774
2015	6247942	52	7140930
2016	6279593	52	7183936
2017	6328007	52	7257007

This table summarizes the average state-level population statistics from 2008 to 2017.These values reflect state-level averages whereby the mean population represents the average population across 52 U.S. states and territories each year, while the standard deviation shows the variation in population sizes across them.

Visualize the relationship of median income vs poverty rate using ggplot.

# Filter the dataset for the year 2019 and create a scatterplot of median income vs poverty rate
us_data|>
  filter(year == 2019) |>
  ggplot(aes(x = median_income, y = prop_poverty)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Figure 3: Median Income vs. Poverty Rate (2019)",
       x = "Median Income",
       y = "Proportion in Poverty",
       caption = "Data source: American Community Survey (ACS)") +
  theme_bw()

`geom_smooth()` using formula = 'y ~ x'

This scatter plot shows a negative relationship between median income and poverty rate across U.S. states in 2019. States with higher incomes tend to have lower poverty rates, as highlighted by the downward trend line.

Visualizing trend of the median_income vs prop_poverty for the state of Michigan over time.

us_data |>
  filter(county_state == "Michigan") |>
  ggplot(aes(x = year)) +
  geom_line(aes(y = median_income, color = "Income")) +
  geom_line(aes(y = prop_poverty * 100000, color = "Poverty (×100k)")) +
  scale_y_continuous(
    name = "Median Income",
    sec.axis = sec_axis(~ . / 100000, name = "Poverty Rate")
  ) +
  scale_color_viridis_d() +
  labs(title = "Figure 4:Michigan: Median Income vs. Poverty Rate Over Time",
       x = "Year",
  caption = "Figure 4: Median Income vs Poverty Rate,
  Data source: American Community Survey (ACS)")+
  theme_dark() +
  theme(legend.title = element_blank(),
        legend.position = "bottom")

The line graph illustrates Michigan’s socioeconomic evolution from 2008 to 2023, with rising median income signaling economic progress and fluctuating poverty rates highlighting persistent challenges. The inverse relationship between these metrics suggests that economic growth alone doesn’t eliminate poverty, which may contribute to social issues like gun violence.

2. Gun Violence Dataset.

The code snippet transforms the gun_violence dataset by converting the date column to a Date object and creating new columns (day, year, month) using lubridate functions. This enables temporal aggregation and analysis, supporting visualizations (e.g., calendar heatmap, yearly heatmap) and data merging with ACS data. It’s a critical preprocessing step that ensures the dataset is ready for the project’s statistical and visual analyses, directly contributing to insights about gun violence patterns and their socioeconomic drivers.

# Aggregating data.
gun_violence <- gun_violence |>
  mutate(date = ymd(date), # Changing the data variable to a date type variable.
         day = day(date),
         year = year(date), 
         month = month(date, label = TRUE, abbr = FALSE))  # extracting the day, year and month.

Creation of a table summarizing the total number of victims killed in gun violence incidents across cities in California during May 2019.

# Checking number of killed victims in the year 2019.
 gun_violence |>
  filter(state == "California", year == 2019, month == "May") |>
  group_by(city)|>
  summarize(total_victims_killed = sum(number_victims_killed))|>
  arrange(desc(total_victims_killed)) |>
  head(15) |>
  gt() |>
  gt_theme_nytimes() |>
  tab_caption("Table 3: Top 15 cities by total number of victims killed in month of May 2019")

Table 3: Top 15 cities by total number of victims killed in month of May 2019
city	total_victims_killed
Stockton	9
Los Angeles	7
Oakland	7
Fresno	5
Sacramento	4
San Diego	4
Bakersfield	3
San Francisco	3
Cathedral City	2
Grass Valley	2
Long Beach	2
Oroville	2
Richmond	2
San Bernardino	2
West Hills	2

Visualizing the number of gun violence victims killed per day in California for 2019 with CalendR.

# Calculating number of gun violence incidents for each date.
gun_violence_counts <- gun_violence |> 
  filter(state == "California", year == 2019) |>
  mutate(date = make_date(year, month, day)) |>
  group_by(date) |> # Groups incidents by date
  summarize(total_victims_killed = sum(number_victims_killed, na.rm = TRUE))

# Creating a date sequence.
all_dates <- tibble(date = seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")) # Generates a sequence of all dates in 2019 (365 days).

# Joining data to fill missing days with 0
gun_violence_counts <- all_dates |>
  left_join(gun_violence_counts, by = "date") |>
  mutate(total_victims_killed = replace_na(total_victims_killed, 0)) # Ensures every day has a value (0 for no victims).
  


# Creating heatmap calendar of number of victims
gun_violence_calendar <- calendR(year = 2019,
                            title = "Daily distribution of gun victims in California, 2019",
        special.days = gun_violence_counts$total_victims_killed, # Maps the number of victims to each day’s color intensity.
        gradient = TRUE,
        monthnames = month.name, # Displays full month names
        months.size = 11,
        weeknames.size = 3.3,
        legend.pos = "bottom",
        legend.title = "Number of victims killed",
        ncol = 3,
        margin = 1) +
    scale_fill_gradient(low = "white", high = "deeppink4",
  guide = guide_colorbar(frame.colour = "black", 
                         ticks.colour = "black",
                         title.position = "top")) + labs(caption = "Figure 5: Daily gun violence deaths in California, 2019. 
                                                         Source: Gun Violence Archive")

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the calendR package.
  Please report the issue to the authors.

Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

# Saving plot
size_mult <- 22
ggsave(plot = gun_violence_calendar,
       filename = "calendar_plot.png",
       units = "mm",
       dpi = 120,
       width = 7*size_mult,
       height = 8.5*size_mult)

knitr::include_graphics("calendar_plot.png")

The calendar heatmap reveals that gun violence deaths in California peak on summer weekends and holidays.

Aggregating gun violence incidents and cencus data by year and state creating a new dataset called yearly_violence.

# Aggregating by year and state.
yearly_violence <- gun_violence |>
  filter(!is.na(state), !is.na(year)) |> # Checks for missing or inconsistent state or year values
  rename(county_state = state) |> #renaming the state to county_state because the column represents the state.
  group_by(year, county_state) |>
  summarise(total_cases = n(), total_victims_killed = sum(number_victims_killed, na.rm = TRUE)) #Computes the number of incidents in each group

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Visualizing the distribution of gun violence cases across states and years.

# Assuming yearly_violence has columns: year, county_state, total_cases
# Extract state from county_state and aggregate total_cases by state and year
yearly_violence_state <- yearly_violence |>
  mutate(state = sub(".*,\\s*", "", county_state)) |> # Extract state (e.g., "WA" from "King County, WA")
  group_by(state, year) |>
  summarise(total_cases = sum(total_cases, na.rm = TRUE)) |>
  ungroup()

`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.

# Create a usmap heatmap for a single year (e.g., the most recent year in the data)
# You can change the year as needed
latest_year <- max(yearly_violence_state$year)
yearly_violence_single_year <- yearly_violence_state |>
  filter(year == latest_year)

# Plot usmap heatmap
plot_usmap(
  data = yearly_violence_single_year,
  values = "total_cases",
  color = "white" # White borders for states, matching geom_tile's color
) +
  scale_fill_viridis_c(
    option = "C",
    name = "Total Cases"
  ) +
  labs(
    title = paste("Heatmap of Gun Violence Incidents by State,", latest_year),
    caption = "Figure 6: Heatmap of Gun Violence for 2023
    Source:Gun Violence Archive",
    fill = "Total Cases"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.key.width = unit(1, "cm")
  )

String Manipulation.

# Creating contains_robbery variables
gun_robbery <- gun_violence |>
  filter(!is.na(incident_characteristics)) |>
  mutate(contains_robbery = str_detect(incident_characteristics, regex("Armed robbery with injury", ignore_case = TRUE)))

# Summarizing the count of "Armed robbery with injury"
gun_robbery |>
summarize(armed_robbery_count = sum(contains_robbery))

# A tibble: 1 × 1
  armed_robbery_count
                <int>
1               20890

#Counting Occurrences of contains_robbery
gun_robbery |>
count(contains_robbery)

# A tibble: 2 × 2
  contains_robbery      n
  <lgl>             <int>
1 FALSE            368858
2 TRUE              20890

# using janitor package to get percentages.
gun_robbery |>
  tabyl(contains_robbery)

 contains_robbery      n    percent
            FALSE 368858 0.94640126
             TRUE  20890 0.05359874

Intepretation: About 5.36% of the filtered gun violence incidents are associated with armed robbery with injury, which is a notable but relatively small proportion.

Investigating which states have the highest counts or proportions ofarmed robbery incidents.

# By state and year
robbery_2019 <- gun_robbery |>
  filter(year == 2019) |>
  group_by(state) |>
  summarise(robbery_count = sum(contains_robbery),
            total_incidents = n(),
            robbery_mean = mean(contains_robbery) * 100) |>
  arrange(desc(robbery_count)) |>
  head(16)

# Display the table
flextable(robbery_2019) |>
  colformat_double(digits = 3) |>
  set_caption("Table 4: States with the highest counts of armed robbery incidents") |>
  autofit()

state	robbery_count	total_incidents	robbery_mean
Texas	269	2,541	10.586
Florida	135	2,041	6.614
Georgia	111	1,209	9.181
North Carolina	107	1,371	7.805
Ohio	106	1,617	6.555
Pennsylvania	97	2,015	4.814
Illinois	93	2,829	3.287
Tennessee	93	1,319	7.051
California	82	2,447	3.351
Louisiana	77	1,294	5.951
Missouri	66	1,194	5.528
Maryland	65	1,239	5.246
South Carolina	61	1,043	5.849
Indiana	56	955	5.864
Alabama	52	978	5.317
Michigan	46	1,083	4.247

Intepretation: Texas leads with the highest incidents among the listed states.Texas, Florida, Georgia, North Carolina, Tennessee, and Louisiana are all in the southern U.S., suggesting a regional concentration of armed robbery with injury incidents.

Visualizing the above results.

library(grid)
#load or download a gun image
# Replace with your local image path, e.g., "path/to/gun_image.png"
# Example using a public-domain image URL (replace with a suitable image if needed)
gun_image <- image_read("gun_image.jpg") |>
  image_colorize(opacity = 50, color = "white")

# Save the image temporarily (or use local file directly)
temp_image <- tempfile(fileext = ".png")
image_write(gun_image, temp_image)

# Convert to raster for use in ggplot
gunimage_grob <- rasterGrob(gun_image, width = unit(1, "npc"), height = unit(1, "npc"))

robbery_2019 |>
  ggplot(aes(x = reorder(state, robbery_count), y = robbery_count)) +
  annotation_custom(gunimage_grob, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  coord_flip() +
  labs(title = "Figure 7 : Armed Robbery with Injury Incidents by State", 
       x = "State", y = "Incident Count") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "transparent", color = NA),
    plot.background = element_rect(fill = "transparent", color = NA)
  )

Investigating incident characteristic over the years for California.

# Clean incident_characteristics
gun_robbery_clean <- gun_robbery |>
  filter(!is.na(incident_characteristics)) |>
  mutate(incident_characteristics = str_replace_all(
    incident_characteristics, "\\r\\n|\\n", ",")) |>
  separate_rows(incident_characteristics, sep = ",") |>
  mutate(incident_characteristics = str_trim(incident_characteristics)) |>
  filter(incident_characteristics != "")

# Build data for animation
anim_data <- gun_robbery_clean |>
  filter(state == "California") |>
  count(year, incident_characteristics, sort = TRUE) |>
  group_by(year) |>
  slice_max(n, n = 10) |>
  ungroup()

# Build plotly animation
anim_data |>
  plot_ly(
    x = ~n,
    y = ~reorder(incident_characteristics, n),
    frame = ~year,
    type = "bar",
    orientation = "h",
    color = ~incident_characteristics,
    colors = "Set3",
    showlegend = FALSE
  ) |>
  layout(
    title = "Figure 8: Top 10 Incident Types in California by Year",
    xaxis = list(title = "Count"),
    yaxis = list(title = ""),
    paper_bgcolor = "white",
    plot_bgcolor = "white"
  ) |>
  animation_opts(
    frame = 800,
    transition = 400,
    easing = "cubic-in-out"
  ) |>
  animation_button(
    x = 0, xanchor = "left",
    y = -0.1, yanchor = "bottom",
    label = "▶ Play"
  ) |>
  animation_slider(
    currentvalue = list(prefix = "Year: ")
  )

Merging the two datasets.

Merging the two dataset by year and county_state with the function inner_join to create a new variable called’pop_gunviolence`.

economic_indic <- us_data |>
  filter(year >= 2014, year <= 2023) |>
  select(county_state, year, population, median_income, prop_poverty)

pop_gunviolence <- yearly_violence |>
  inner_join(economic_indic, by = c("year", "county_state"))

skim(head(pop_gunviolence, 15))

Data summary
Name	head(pop_gunviolence, 15)
Number of rows	15
Number of columns	7
_______________________
Column type frequency:
character	1
numeric	5
________________________
Group variables	year

Variable type: character

skim_variable	year	n_missing	complete_rate	min	max	empty	n_unique	whitespace
county_state	2014	0	1	5	20	0	15	0

Variable type: numeric

skim_variable	year	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_cases	2014	1	681.87	748.17	20.00	214.00	343.00	847.00	2231.00	▇▂▁▁▂
total_victims_killed	2014	1	278.20	325.58	6.00	55.50	145.00	375.00	1147.00	▇▂▁▁▁
population	2014	1	7810374.13	10083654.60	658893.00	1527012.50	4849377.00	8414413.50	38802500.00	▇▂▁▁▁
median_income	2014	1	56767.87	10665.98	41262.00	48591.00	57444.00	65762.50	71648.00	▃▇▂▅▆
prop_poverty	2014	1	0.15	0.03	0.11	0.12	0.15	0.18	0.19	▆▂▅▃▇

knitr::kable(head(pop_gunviolence,10),
             caption = "Table 5: Population and Gun Violence Data")

Table 5: Population and Gun Violence Data
year	county_state	total_cases	total_victims_killed	population	median_income	prop_poverty
2014	Alabama	725	264	4849377	42830	0.1925258
2014	Alaska	63	24	736732	71583	0.1123372
2014	Arizona	343	175	6731484	50068	0.1824119
2014	Arkansas	295	145	2966369	41262	0.1886608
2014	California	2144	1147	38802500	61933	0.1644525
2014	Colorado	233	113	5355866	61303	0.1204445
2014	Connecticut	225	67	3596677	70048	0.1075192
2014	Delaware	203	44	935614	59716	0.1247874
2014	District of Columbia	353	82	658893	71648	0.1772565
2014	Florida	1672	722	19893297	47463	0.1649502

Creating a new variable called gun_violence_ratewhich standardizes the number of gun violence cases total_cases by population size. The multiplication by 100,000 turns it into a rate that’s easier to interpret and compare across different regions or years.

# Calculate the rate of gun violence per 100,000 population
pop_gunviolence <- pop_gunviolence |>
 mutate(gun_violence_rate = total_cases / population * 100000)

Randomization of pop_gunviolence data.

# Randomization test for population
randomization_test <- function(data, n_randomizations = 1000) {
 observed_correlation <- cor(data$population, data$gun_violence_rate) # Computes the observed correlation between population and gun_violence_rate
 randomized_correlations <- numeric(n_randomizations)
 
# Randomize the population and calculate correlation each time.
 for (i in 1:n_randomizations) {
 randomized_data <- data |>
 mutate(population = sample(population))
 randomized_correlations[i] <- cor(randomized_data$population, randomized_data$gun_violence_rate)
 }
 
# Calculate p_value.
 p_value <- mean(abs(randomized_correlations) >= abs(observed_correlation))
 return(list(observed_correlation = observed_correlation, p_value = p_value,permuted_distribution = randomized_correlations))
}

# Run the test
result_pop <- randomization_test(pop_gunviolence)

The observed correlation is -0.05159267, this indicates that population size has little to no linear association with the gun violence rate in this dataset.The p-value is 0.259. This shows that there is no statistical significance evident to suggest that the observed correlation from what could occur by random chance.

Visualizing null distribution of permuted correlations.

p <- ggplot(data.frame(correlation = result_pop$permuted_distribution), 
            aes(x = correlation)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  geom_vline(xintercept = result_pop$observed_correlation, 
             color = "red", linetype = "dashed", linewidth = 1.2) +
  labs(
    title = paste0("Null Distribution | Observed r = ", 
                   round(result_pop$observed_correlation, 3),
                   " | p = ", round(result_pop$p_value, 3)),
    x = "Correlation between Population and Gun Violence Rate",
    y = "Frequency",
    caption = "Figure 9: Null distribution of Permutation Correlations,
    Data Source: American Community Survey (ACS), Gun Violence Archive"
  ) +
  theme_minimal()

p

The permutation test results show a weak negative correlation (-0.052) between population and gun violence rate, with a p-value of 0.259. The visualization displays the null distribution of correlations that would occur by random chance, with our observed correlation (red dashed line) falling within this distribution. Since the p-value exceeds the conventional threshold of 0.05. This indicates that population alone may not be a reliable predictor of gun violence rates.

Permutation test for `prop_poverty` and `gun_violence_rate`.

Permutation test to evaluate the relationship between prop_poverty and gun_violence_rate

# Perform permutations and randomization

set.seed(1994) # Setting seed for reproducibility

# Permutation test for prop_poverty
# Defining permutation test function.
perm_test <- function(data, n_permutations = 1000) {
  observed_correlation <- cor(data$prop_poverty, data$gun_violence_rate)
  permuted_correlations <- numeric(n_permutations)
 
 # Running the permutations
for (i in 1:n_permutations) {
  permuted_data <- data |>
    mutate(prop_poverty = sample(prop_poverty))
  permuted_correlations[i] <- cor(permuted_data$prop_poverty, permuted_data$gun_violence_rate)
 }
 
p_value <- mean(abs(permuted_correlations) >= abs(observed_correlation))
return(list(observed_correlation = observed_correlation, p_value = p_value))
}

# Perform the permutation test
result <- perm_test(pop_gunviolence)
print(result)

$observed_correlation
[1] 0.3165579

$p_value
[1] 0

This provides strong evidence that poverty rates and gun violence rates are meaningfully related in our dataset, suggesting that socioeconomic factors may play an important role in explaining variations in gun violence across different areas.

Relationship between Poverty_rate vs. Gun_violence_rate.

Creating a new variable `poverty_group by splitting area by median poverty rate.

pop_gunviolence <- pop_gunviolence |>
  mutate(poverty_group = ifelse(prop_poverty > median(prop_poverty, na.rm = TRUE),
                                "High Poverty", "Low Poverty"))

Creating a density plot to check for skewness and normality.

ggplot(pop_gunviolence, aes(x = gun_violence_rate, fill = poverty_group)) +
  geom_density(alpha = 0.5) +
   scale_fill_manual(values = c("High Poverty" = "#e63946", "Low Poverty" = "#457b9d")) +
  facet_grid(poverty_group ~ .)

  labs(
    title = "Figure 10: Density Plot of Gun Violence Rates by Poverty Group",
    x = "Gun Violence Rate (per 100,000 population)",
    y = "Density",
    fill = "Poverty Group"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("High Poverty" = "#e63946", "Low Poverty" = "#457b9d"))

NULL

Two-sample t-test

Next, we state the hypotheses and implement Welch’s two-sample t-test to test whether the average gun violence rate is different between high poverty and low poverty areas. \[ H_0: \mu_{\text{High}} = \mu_{\text{Low}} \] \[ H_a: \mu_{\text{High}} = \mu_{\text{Low}} \] where \(\mu_{\text{High}}\) is the average gun violence rate in high poverty areas., and \(\mu_{\text{Low}}\) is the average gun violence rate in low poverty areas.

# Reorder factor levels for t-test
pop_gunviolence <- pop_gunviolence |>
  dplyr::mutate(poverty_group = fct_relevel(poverty_group, "High Poverty", "Low Poverty"))


t_test_result <- t.test(gun_violence_rate ~ poverty_group, data = pop_gunviolence)

# Display results of t-test
t_test_result |> 
  broom::tidy() |> 
  flextable() |> 
  colformat_double(digits = 3) |> 
set_formatter(p.value = function(x) {format.pval(x, digits = 3)}) |> 
  set_caption("Table 6: Results of two-sample t-test comparing gun violence rates between high and low poverty areas.") |> 
  autofit() |> 
  fit_to_width(max_width = 7)

estimate	estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
6.962	15.879	8.917	6.633	1.53e-10	300.758	4.897	9.028	Welch Two Sample t-test	two.sided

We observed high positive skewness in the gun violence rate variable, particularly in the High Poverty group. Given the violation of the normality assumption required for the t-test, we proceeded with a non-parametric permutation test to assess the difference in means, which does not rely on distributional assumptions.

Calculating standard deviations and variances for each poverty group

pop_gunviolence |>
  group_by(poverty_group) |>
  summarise(
    n = n(),
    Mean = mean(gun_violence_rate, na.rm = TRUE),
    SD = sd(gun_violence_rate, na.rm = TRUE),
    Var = var(gun_violence_rate, na.rm = TRUE)) |>
    
flextable() |> 
  colformat_double(digits = 3) |> 
  set_caption("Table 7: Summary statistics of gun violence rate by poverty group") |>
  autofit()

poverty_group	n	Mean	SD	Var
High Poverty	225	15.879	14.520	210.822
Low Poverty	234	8.917	6.209	38.552

The standard deviation of gun violence rates in the High Poverty group was 210.82, compared to 38.55 in the Low Poverty group. This indicates higher variability in one group. The difference in variance supports the use of Welch’s t-test, which does not assume equal variances between groups.

Assessing statistical significance difference

Performing a permutation test to assess whether there is a statistically significant difference in the gun violence rate between two groups defined by the poverty_group variable in the pop_gunviolence dataset.

# Number of permutations to do
n_permutations <- 10000

# Instantiating vector for test statistics
permutation_statistics <- numeric(n_permutations)

# Calculating t-test statistic for each permutation
for (p in 1:n_permutations) {
  permuted_data <- pop_gunviolence |> 
    mutate(poverty_group = sample(poverty_group, replace = FALSE))  # Shuffle group labels
  
  permutation_statistics[p] <- t.test(gun_violence_rate ~ poverty_group, 
                                      data = permuted_data,
                                      alternative = "greater") |> 
    broom::tidy() |> 
    pull(statistic)
}

pp_value <- mean(permutation_statistics >= permutation_statistics[p], na.rm = TRUE)

Interpreting we reject the null hypothesis since our p-value is 0.2785, suggesting that high-poverty areas have a significantly higher gun violence rate than low poverty areas.

Create a histogram displaying the null distribution obtained for the randomization test

# Organizing into a tibble
computed_statistics <- tibble(Value = permutation_statistics)

d <- computed_statistics |>
  ggplot(aes(x = Value)) +
  geom_histogram(aes(y= after_stat(density)), color = "white") +
  geom_density(color = "maroon1", linewidth = 1) +
  stat_function(fun = function(x) dt(x, df = t_test_result$parameter),color = "seagreen",linewidth = 1) +
  labs(title = "Figure 11: Randomization null distribution") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  geom_vline(xintercept = quantile(permutation_statistics, probs = 0.95), color = "red", linetype = "solid") +
  geom_vline(xintercept = t_test_result$statistic, color= "blue",linetype = "dotted")

d

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

The distribution is approximately centered at zero, consistent with the null hypothesis of no group difference. The blue dotted line shows the observed t-statistic, while the red solid line marks the 95th percentile of the null distribution. If the observed statistic lies beyond this threshold, it provides evidence against the null hypothesis.

#Bootsrapping

# Subsetting to only Poverty groups classified as High Poverty
high_pdata <- pop_gunviolence |>
  filter(poverty_group == "High Poverty")

# Sample size
nrow(high_pdata) # Returns the number of observations of HighPoverty

[1] 225

# Sample median
median(high_pdata$gun_violence_rate)

[1] 13.11814

Let’s implement the nonparametric bootstrap.

# Number of bootstrap samples
n_boots <- 10000

# Instantiating matrix for bootstrap samples
boots <- matrix(NA, nrow = nrow(high_pdata), ncol = n_boots)

# Sampling with replacement n_boots times
for(b in 1:n_boots) {
  boots[, b] <- high_pdata |> 
    slice_sample(prop = 1, replace = TRUE) |> 
    dplyr::pull(gun_violence_rate)
}

Using the generated bootstrap samples and the code below, create a bootstrap distribution of sample medians, and visualize this distribution using a histogram.

# Instantiating vector for bootstrap medians
boot_medians <- vector(length = n_boots)

# Calculating medians for bootstrap samples
for(b in 1:n_boots) {
  boot_medians[b] <- median(boots[, b])
}

Visualizing.

# Creating a tibble
tidy_boot_medians <- tibble(Median = boot_medians)
  
# Creating a histogram
tidy_boot_medians |> 
  ggplot(aes(x = Median)) +
  geom_histogram(color = "black") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Figure 12: Nonparametric Bootstrap Distribution of Median Gun Violence Incidents",
       y = "Frequency") +
  geom_vline(xintercept = quantile(boot_medians, probs = c(0.025, 0.975)), 
             color = "blue")

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

# Calculating an estimated standard error
sd(boot_medians)

[1] 0.6189031

# Calculating a nonparametric confidence interval
nonparametric_ci <- quantile(boot_medians, probs = c(0.025, 0.975))

#Convert to data frame
np_ci <- data.frame(
  Lower = nonparametric_ci[1],
  Upper = nonparametric_ci[2]
)

# Create flextable
flextable(np_ci) |>
  set_caption("Table 8: Nonparametic Confidence Interval") |>
  autofit()

Lower	Upper
11.65563	14.18552

Interpretation: The 95% bootstrap confidence interval (11.65563, 14.18552) indicates that we are 95% confident that the true population median number of gun violence incidents in High poverty states lies between approximately 11.66 and 14.19 incidents.

Conclusion/Main Takeaways.

The analysis reveals a significant relationship between poverty and gun violence rates across U.S. states from 2014 to 2023, with high-poverty areas exhibiting significantly higher gun violence rates compared to low-poverty areas, as confirmed by a permutation test (p-value < 0.05). While population size showed a weak, non-significant correlation with gun violence rates (p = 0.259), the data highlighted regional patterns, such as higher armed robbery incidents in southern states like Texas and Florida. Visualizations, including heatmaps and density plots, underscored temporal and geographic variations, with gun violence peaking in California during summer weekends. The bootstrap analysis further indicated that the median gun violence rate in high-poverty states ranges between 11.66 and 14.19 incidents per 100,000 population, emphasizing the persistent socioeconomic drivers of gun violence and the need for targeted interventions in economically disadvantaged areas.

Contributions of Each Group Member

Contributions of Each Group Member Kate Wema Juma

Data Dictionary Development: Authored the data dictionary for the ACS dataset. Summary Tables: Created and formatted Table 1 (Top 10 States by Average Median Income, 2014–2023) and Table 2 (Summary Statistics Over the Years).
Data Cleaning and Manipulation: date/time processing with lubridate functions (ymd, year, month, day) for temporal analysis (e.g., Figure 5).
Exploratory Data Analysis (EDA): Conducted missingness analysis for the ACS dataset using skimr and created Figure 1 (Lollipop Plot of Missingness for ACS Data).
Visualizations: Created sophisticated visualizations, including Figure 5 (Calendar Heatmap of Gun Violence in California, 2019) using calendR, Figure 6 (US Map Heatmap of Gun Violence, 2023) using usmap, and Figure 8 (Animated Bar Chart of Incident Types in California) using gganimate.
Statistical Analysis: Wrote and executed code for permutation tests (Figures 9 and 11) and bootstrap analysis (Figure 12, Table 8), including hypothesis formulation, null distribution visualization, and interpretation of results (e.g., p-value for poverty vs. gun violence rate).
Report Writing: Drafted the introduction, exploratory data analysis, and conclusion sections.
Code Organization: Added descriptive comments to R code chunks with required section headers (e.g., Data Dictionary, Exploratory Data Analysis).
Extra Credit Contribution (Combined): Partnered to design and implement interactive visualizations using ggplotly for Figures 9 and 11 (null distributions of permutation tests). Contributed to brainstorming the animated bar chart (Figure 8) using gganimate. Contributed to dashboard.

Cynthia Mutua

Data Dictionary Development: Authored the data dictionary for the Gun Violence dataset.
Exploratory Data Analysis (EDA): Conducted missingness analysis for the Gun Violence dataset and created Figure 2 (Lollipop Plot of Missingness for Gun Violence Data).
Data Cleaning and Manipulation: Implemented string manipulation using str_detect, str_replace_all, and str_trim to analyze incident_characteristics (e.g., Table 4, Figure 7).
Visualizations: Created sophisticated visualizations, including Figure 5 (Calendar Heatmap of Gun Violence in California, 2019) using calendR, Figure 6 (US Map Heatmap of Gun Violence, 2023) using usmap, and Figure 8 (Animated Bar Chart of Incident Types in California) using gganimate.
Statistical Analysis: Wrote and executed code for permutation tests (Figures 9 and 11) and bootstrap analysis (Figure 12, Table 8), including hypothesis formulation, null distribution visualization, and interpretation of results (e.g., p-value for poverty vs. gun violence rate).
Report Writing: Drafted sections on data cleaning, permutation tests, and bootstrap analysis.
Extra Credit Contribution (Combined): Partnered to develop interactive visualizations using ggplotly for Figures 9 and 11, ensuring seamless integration of interactive features. Co-designed the animated bar chart (Figure 8) using gganimate. Contributed to dashboard.

Combined Efforts

Merging Datasets: Both members collaborated on merging the ACS and Gun Violence datasets using inner_join to create the pop_gunviolence dataset, enabling analyses like Figure 10 (Density Plot of Gun Violence Rates by Poverty Group) and Table 5.
Extra Credit Focus: Worked together to implement ggplotly for interactive plots (Figures 9 and 11). Collaborated on the gganimate-based Figure 8, which animates changes in gun violence incident types in California.
Project Review and Debugging: Both members reviewed the Quarto file for errors. We also cross-checked visualizations and tables.

Introduction.

Source

Data dictionary

American Community Survey (ACS) dataset.

Gun violence dataset.

Loading packages.

Importing data into R.

Exploring missingness

Lollipop plot to explore ACS data for missingness.

To check for missingness in the second dataset.

Plotting a lollipop plot to explore missingness

Exploratory Data Analysis.

1. ACS dataset.

Next, we can check the sample size of the population its mean and standard deviation per year.

Visualize the relationship of median income vs poverty rate using ggplot.

Visualizing trend of the median_income vs prop_poverty for the state of Michigan over time.

2. Gun Violence Dataset.

Visualizing the number of gun violence victims killed per day in California for 2019 with CalendR.

Aggregating gun violence incidents and cencus data by year and state creating a new dataset called yearly_violence.

Visualizing the distribution of gun violence cases across states and years.

String Manipulation.

Visualizing the above results.

Investigating incident characteristic over the years for California.

Merging the two datasets.

Randomization of pop_gunviolence data.

Visualizing null distribution of permuted correlations.

Permutation test for prop_poverty and gun_violence_rate.

Relationship between Poverty_rate vs. Gun_violence_rate.

Two-sample t-test

Calculating standard deviations and variances for each poverty group

Assessing statistical significance difference

Conclusion/Main Takeaways.

Contributions of Each Group Member

Permutation test for `prop_poverty` and `gun_violence_rate`.