Investigating the Impact of Poverty and Population on Gun Violence Rates.

Authors

Cynthia Mutua & Kate Juma

Published

March 22, 2026

Introduction.

Gun violence remains a major public health issue in the United States. While various factors contribute to it, economic and demographic indicators such as poverty and population density are consistently suspected to be influential. This report aims to explore the relationship between poverty levels, population characteristics, and the prevalence of gun violence across U.S. states over time.

Our data sources include:

  1. American Community Survey (ACS) - ACS estimates for U.S. states, including population, income, poverty rates, and demographic characteristics.

  2. Gun violence data - gun violence incidents by state and date.

This analysis was conducted using R and Quarto, incorporating visualization, statistical testing, and reproducible data science practices.

Source

Data was obtained via the tidycensus R package and the United States Census website.

Data dictionary

American Community Survey (ACS) dataset.

Variable Type Description
geoid character Geographic region ID (the 2-digit state Federal Information Processing Standard (FIPS) code)
county_state character Geographic region (state or territory name)
year double Year
population double Population
median_income double Median income in dollars
median_monthly_rent_cost double Median monthly rent costs for renters in dollars
median_monthly_home_cost double Median monthly housing costs for homeowners in dollars
prop_female double Proportion of people who are female
prop_male double Proportion of people who are male
prop_white double Proportion of people who are white alone
prop_black double Proportion of people who are Black or African American alone
prop_native double Proportion of people who are American Indian and Alaska Native alone
prop_asian double Proportion of people who are Asian alone
prop_hawaiin_islander double Proportion of people who are Native Hawaiian and Other Pacific Islander alone
prop_other_race double Proportion of people who are some other race alone
prop_multi_racial double Proportion of people who are two or more races
prop_highschool double Proportion of people 25 and older whose highest education level is high school
prop_GED double Proportion of people 25 and older whose highest education level is a GED
prop_some_college double Proportion of people 25 and older whose highest education level is some college, but less than 1 year
prop_college_no_degree double Proportion of people 25 and older whose highest education level is more than 1 year of college but no degree
prop_associates double Proportion of people 25 and older whose highest education level is an associate's degree
prop_bachelors double Proportion of people 25 and older whose highest education level is a bachelor's degree
prop_masters double Proportion of people 25 and older whose highest education level is a master's degree
prop_professional double Proportion of people 25 and older whose highest education level is a professional degree
prop_doctoral double Proportion of people 25 and older whose highest education level is a doctoral degree
prop_poverty double Proportion of people 25 and older living in poverty, defined by the Census Bureau as having an income below the poverty threshold for their family size

Gun violence dataset.

Variable Type Description
geoid character Geographic region ID (5-digit FIPS code: 2-digit state code followed by 3-digit county code)
incident_id double Unique identifier for each gun violence incident
date datetime Date and time of the incident in ISO 8601 format (YYYY-MM-DDTHH:MM:SSZ)
state character Name of the U.S. state where the incident occurred
city character Name of the city where the incident occurred
county character Name of the county where the incident occurred (if available)
business_or_school character Indicates whether the incident occurred at a business or school (if specified)
address character Address or block location where the incident took place
latitude double Latitude coordinate of the incident location
longitude double Longitude coordinate of the incident location
number_victims_killed double Number of victims killed in the incident
number_victims_injured double Number of victims injured in the incident
number_suspects_killed double Number of suspects killed in the incident
number_suspects_injured double Number of suspects injured in the incident
number_suspects_arrested double Number of suspects arrested following the incident
incident_characteristics character Text description of the nature of the incident (may include multiple characteristics)

Loading packages.

library(dplyr)
library(tidyverse)
library(skimr)
library(flextable)
library(mice)
library(naniar)
library(gt)
library(plotly)
library(ggplot2)
library(lubridate)
library(gtExtras)
library(gganimate)
library(gifski)
library(calendR)
library(janitor)
library(moments)
library(usmap)
library(magick)
library(here)

Importing data into R.

American Community Survey (ACS) dataset

# Importing state census data from 2008 - 2023
us_data <- read_csv("census_data_state_2008-2023.csv")
Rows: 780 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): geoid, county_state
dbl (24): year, population, median_income, median_monthly_rent_cost, median_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Gun violence dataset

# Importing of gun violence data 
gun_violence <- read_csv("gun_violence_geo.csv")
Rows: 389850 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): geoid, state, city, county, business_or_school, address, incident_...
dbl  (8): incident_id, latitude, longitude, number_victims_killed, number_vi...
dttm (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exploring missingness

To check for missingness in the ACS data.

skim(us_data)
Data summary
Name us_data
Number of rows 780
Number of columns 26
_______________________
Column type frequency:
character 2
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geoid 0 1 2 2 0 52 0
county_state 0 1 4 20 0 52 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2015.20 4.61 2008.00 2011.00 2015.00 2019.00 2023.00 ▇▆▆▃▆
population 0 1 6229692.05 7025083.48 532668.00 1786156.75 4257700.00 7053886.75 39557045.00 ▇▂▁▁▁
median_income 0 1 58358.78 14075.73 18314.00 48362.75 56484.50 67147.25 108210.00 ▁▇▇▂▁
median_monthly_rent_cost 0 1 945.44 269.03 419.00 749.50 884.00 1092.25 1992.00 ▃▇▃▁▁
median_monthly_home_cost 0 1 1109.16 371.59 241.00 850.00 1020.00 1353.00 2561.00 ▁▇▃▂▁
prop_female 0 1 0.51 0.01 0.47 0.50 0.51 0.51 0.53 ▁▂▇▇▁
prop_male 0 1 0.49 0.01 0.47 0.49 0.49 0.50 0.53 ▁▇▇▂▁
prop_white 0 1 0.75 0.15 0.22 0.67 0.77 0.85 0.96 ▁▁▃▇▇
prop_black 0 1 0.11 0.11 0.00 0.03 0.07 0.15 0.53 ▇▃▂▁▁
prop_native 0 1 0.02 0.03 0.00 0.00 0.00 0.01 0.16 ▇▁▁▁▁
prop_asian 0 1 0.04 0.05 0.00 0.01 0.03 0.04 0.39 ▇▁▁▁▁
prop_hawaiin_islander 0 1 0.00 0.01 0.00 0.00 0.00 0.00 0.11 ▇▁▁▁▁
prop_other_race 0 1 0.04 0.04 0.00 0.01 0.02 0.05 0.32 ▇▁▁▁▁
prop_multi_racial 0 1 0.05 0.05 0.01 0.02 0.03 0.05 0.39 ▇▁▁▁▁
prop_highschool 0 1 0.24 0.04 0.10 0.22 0.24 0.26 0.35 ▁▂▇▆▁
prop_GED 0 1 0.04 0.01 0.02 0.04 0.04 0.05 0.07 ▃▇▆▃▁
prop_some_college 0 1 0.06 0.01 0.01 0.06 0.06 0.07 0.09 ▁▁▅▇▂
prop_college_no_degree 0 1 0.15 0.02 0.08 0.13 0.14 0.16 0.22 ▁▅▇▃▁
prop_associates 0 1 0.09 0.02 0.03 0.08 0.08 0.10 0.15 ▁▃▇▂▁
prop_bachelors 0 1 0.19 0.03 0.10 0.17 0.19 0.21 0.29 ▁▅▇▃▁
prop_masters 0 1 0.08 0.03 0.04 0.06 0.08 0.09 0.24 ▇▆▁▁▁
prop_professional 0 1 0.02 0.01 0.01 0.02 0.02 0.02 0.10 ▇▁▁▁▁
prop_doctoral 0 1 0.01 0.01 0.01 0.01 0.01 0.02 0.05 ▇▃▁▁▁
prop_poverty 0 1 0.14 0.05 0.07 0.11 0.13 0.16 0.46 ▇▃▁▁▁

Lollipop plot to explore ACS data for missingness.

gg_miss_var(us_data, show_pct = TRUE) +
  labs(title = "Figure 1: Missingness for ACS data",
       caption = "Source: American Community Survey via tidycensus")

The dataset has no missing values.

To check for missingness in the second dataset.

skim(gun_violence)
Data summary
Name gun_violence
Number of rows 389850
Number of columns 16
_______________________
Column type frequency:
character 7
numeric 8
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geoid 46 1.00 5 5 0 2956 0
state 0 1.00 4 20 0 51 0
city 761 1.00 3 27 0 11548 0
county 389065 0.00 1 22 0 459 0
business_or_school 316795 0.19 1 85 0 47476 0
address 9665 0.98 3 61 0 325895 0
incident_characteristics 102 1.00 8 828 0 10939 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
incident_id 0 1 1468977.53 796698.22 92114.00 756778.50 1529846.00 2165292.50 2976478.00 ▇▆▇▇▅
latitude 2 1 37.33 4.82 19.03 33.76 38.31 41.07 71.34 ▁▇▅▁▁
longitude 2 1 -89.25 13.45 -171.74 -93.77 -86.77 -80.20 -67.04 ▁▁▂▃▇
number_victims_killed 0 1 0.37 0.56 0.00 0.00 0.00 1.00 60.00 ▇▁▁▁▁
number_victims_injured 0 1 0.78 1.06 0.00 0.00 1.00 1.00 439.00 ▇▁▁▁▁
number_suspects_killed 0 1 0.06 0.24 0.00 0.00 0.00 0.00 4.00 ▇▁▁▁▁
number_suspects_injured 0 1 0.05 0.23 0.00 0.00 0.00 0.00 5.00 ▇▁▁▁▁
number_suspects_arrested 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2014-01-01 2023-12-31 2019-09-26 3652

Plotting a lollipop plot to explore missingness

gg_miss_var(gun_violence, show_pct = TRUE) +
  labs(title = "Figure 2: Missingness for Gun Violence data",
       caption = "Source: Gun Violence Archive")

While the majority of the dataset is complete, we observed missing values in certain descriptive fields of the gun violence data, notably business_or_school, county, and address. This missingness does not affect the calculation of core metrics such as gun violence rates or their relationship with poverty, which are the main focus of our analysis.
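The per-variable missingness percentages underlying the lollipop plots can also be computed directly with base R. A minimal sketch on toy data (not the project datasets):

```r
# Toy data frame with some missing values
toy <- data.frame(a = c(1, NA, 3, 4), b = c("x", NA, NA, "w"))

# Proportion of NAs per column, scaled to a percentage
pct_missing <- colMeans(is.na(toy)) * 100
pct_missing  # a: 25, b: 50
```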

Exploratory Data Analysis.

1. ACS dataset.

Let's explore the summary statistics.

# Summarizing the mean of median_income from 2014-2023 by year and county_state.
us_data |>
  filter(year >= 2014, year <= 2023) |>
  group_by(year, county_state) |>
  summarize(average_median_income =  round(mean(median_income, na.rm = TRUE))) |>
  arrange(desc(average_median_income)) |>
  ungroup() |>
  slice_head(n = 10) |>
  flextable() |>
  colformat_double(big.mark = "", digits = 0) |>
  theme_zebra() |>
  add_header_lines("Table 1: Top 10 state-years by average median income, 2014–2023")
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Table 1: Top 10 state-years by average median income, 2014–2023

year county_state average_median_income
2023 District of Columbia 108210
2022 District of Columbia 101027
2023 Massachusetts 99858
2023 New Jersey 99781
2023 Maryland 98678
2023 New Hampshire 96838
2022 New Jersey 96346
2023 California 95521
2023 Hawaii 95322
2022 Maryland 94991

Next, we can check the sample size, mean, and standard deviation of the population per year.
# Summarizing the sample size, mean, and standard deviation of the population per year.
summary_stat <- us_data |>
  group_by(year) |> # Groups the data by year.
  summarize(pop_mean =  mean(population, na.rm = TRUE), # Calculates the mean population for each year.
            pop_size = n(), # Counts the number of entries in states per year.
            pop_sd = round(sd(population, na.rm = TRUE))) |> # Calculates and rounds the standard deviation of population per year.
  ungroup() |>
  slice_head(n = 10)
  
# Create a flextable for the summary statistics
flextable(summary_stat) |>
   colformat_double(big.mark = "", digits = 0) |>
  add_header_lines("Table 2: Summary statistics over the years") |>
  theme_vanilla()

Table 2: Summary statistics over the years

year pop_mean pop_size pop_sd
2008 5923342 52 6664198
2009 5980266 52 6719871
2010 6020612 52 6781449
2011 6063435 52 6847219
2012 6107329 52 6914594
2013 6148922 52 6975056
2014 6200105 52 7063774
2015 6247942 52 7140930
2016 6279593 52 7183936
2017 6328007 52 7257007

This table summarizes the average state-level population statistics from 2008 to 2017. These values reflect state-level averages: the mean population is the average across the 52 U.S. states and territories each year, while the standard deviation shows the variation in population sizes across them.

Visualize the relationship of median income vs poverty rate using ggplot.

# Filter the dataset for the year 2019 and create a scatterplot of median income vs poverty rate
us_data|>
  filter(year == 2019) |>
  ggplot(aes(x = median_income, y = prop_poverty)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Figure 3: Median Income vs. Poverty Rate (2019)",
       x = "Median Income",
       y = "Proportion in Poverty",
       caption = "Data source: American Community Survey (ACS)") +
  theme_bw()
`geom_smooth()` using formula = 'y ~ x'

This scatter plot shows a negative relationship between median income and poverty rate across U.S. states in 2019. States with higher incomes tend to have lower poverty rates, as highlighted by the downward trend line.

Visualizing trend of the median_income vs prop_poverty for the state of Michigan over time.

us_data |>
  filter(county_state == "Michigan") |>
  ggplot(aes(x = year)) +
  geom_line(aes(y = median_income, color = "Income")) +
  geom_line(aes(y = prop_poverty * 100000, color = "Poverty (×100k)")) +
  scale_y_continuous(
    name = "Median Income",
    sec.axis = sec_axis(~ . / 100000, name = "Poverty Rate")
  ) +
  scale_color_viridis_d() +
  labs(title = "Figure 4: Michigan Median Income vs. Poverty Rate Over Time",
       x = "Year",
       caption = "Data source: American Community Survey (ACS)") +
  theme_dark() +
  theme(legend.title = element_blank(),
        legend.position = "bottom")

The line graph illustrates Michigan’s socioeconomic evolution from 2008 to 2023, with rising median income signaling economic progress and fluctuating poverty rates highlighting persistent challenges. The inverse relationship between these metrics suggests that economic growth alone doesn’t eliminate poverty, which may contribute to social issues like gun violence.
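The dual-axis construction above works by rescaling the poverty proportion (multiplying by 100,000) so it can share the income axis, while sec_axis applies the inverse transform to label the secondary axis. The two transforms must be exact inverses; a minimal sketch of the round trip (scale_factor is an illustrative name, not part of the original code):

```r
scale_factor <- 100000                    # same factor used in the plot above
prop_poverty <- 0.13                      # e.g., a 13% poverty rate
plotted <- prop_poverty * scale_factor    # value drawn on the income axis
label_value <- plotted / scale_factor     # what sec_axis(~ . / 100000) displays
stopifnot(all.equal(label_value, prop_poverty))  # round trip recovers the rate
```

If the multiplier in geom_line and the divisor in sec_axis ever drift apart, the secondary axis silently mislabels the poverty series.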

2. Gun Violence Dataset.

The code below converts the date column to a Date object and derives day, year, and month columns using lubridate functions. This enables temporal aggregation and analysis, supporting visualizations such as the calendar and yearly heatmaps as well as merging with the ACS data.

# Aggregating data.
gun_violence <- gun_violence |>
  mutate(date = ymd(date), # Changing the data variable to a date type variable.
         day = day(date),
         year = year(date), 
         month = month(date, label = TRUE, abbr = FALSE))  # extracting the day, year and month.

Creating a table summarizing the total number of victims killed in gun violence incidents across California cities in May 2019.

# Counting victims killed by city in California, May 2019.
 gun_violence |>
  filter(state == "California", year == 2019, month == "May") |>
  group_by(city)|>
  summarize(total_victims_killed = sum(number_victims_killed))|>
  arrange(desc(total_victims_killed)) |>
  head(15) |>
  gt() |>
  gt_theme_nytimes() |>
  tab_caption("Table 3: Top 15 cities by total number of victims killed in month of May 2019")
Table 3: Top 15 cities by total number of victims killed in month of May 2019
city total_victims_killed
Stockton 9
Los Angeles 7
Oakland 7
Fresno 5
Sacramento 4
San Diego 4
Bakersfield 3
San Francisco 3
Cathedral City 2
Grass Valley 2
Long Beach 2
Oroville 2
Richmond 2
San Bernardino 2
West Hills 2

Visualizing the number of gun violence victims killed per day in California for 2019 with calendR.

# Calculating number of gun violence incidents for each date.
gun_violence_counts <- gun_violence |> 
  filter(state == "California", year == 2019) |>
  mutate(date = make_date(year, month, day)) |>
  group_by(date) |> # Groups incidents by date
  summarize(total_victims_killed = sum(number_victims_killed, na.rm = TRUE))

# Creating a date sequence.
all_dates <- tibble(date = seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")) # Generates a sequence of all dates in 2019 (365 days).

# Joining data to fill missing days with 0
gun_violence_counts <- all_dates |>
  left_join(gun_violence_counts, by = "date") |>
  mutate(total_victims_killed = replace_na(total_victims_killed, 0)) # Ensures every day has a value (0 for no victims).
  


# Creating heatmap calendar of number of victims
gun_violence_calendar <- calendR(year = 2019,
                            title = "Daily distribution of gun victims in California, 2019",
        special.days = gun_violence_counts$total_victims_killed, # Maps the number of victims to each day’s color intensity.
        gradient = TRUE,
        monthnames = month.name, # Displays full month names
        months.size = 11,
        weeknames.size = 3.3,
        legend.pos = "bottom",
        legend.title = "Number of victims killed",
        ncol = 3,
        margin = 1) +
    scale_fill_gradient(low = "white", high = "deeppink4",
  guide = guide_colorbar(frame.colour = "black", 
                         ticks.colour = "black",
                         title.position = "top")) + labs(caption = "Figure 5: Daily gun violence deaths in California, 2019. 
                                                         Source: Gun Violence Archive")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the calendR package.
  Please report the issue to the authors.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
# Saving plot
size_mult <- 22
ggsave(plot = gun_violence_calendar,
       filename = "calendar_plot.png",
       units = "mm",
       dpi = 120,
       width = 7*size_mult,
       height = 8.5*size_mult)

knitr::include_graphics("calendar_plot.png")

The calendar heatmap reveals that gun violence deaths in California peak on summer weekends and holidays.

Aggregating gun violence incidents by year and state, creating a new dataset called yearly_violence.

# Aggregating by year and state.
yearly_violence <- gun_violence |>
  filter(!is.na(state), !is.na(year)) |> # Checks for missing or inconsistent state or year values
  rename(county_state = state) |> # Renaming state to county_state to match the ACS column name.
  group_by(year, county_state) |>
  summarise(total_cases = n(), total_victims_killed = sum(number_victims_killed, na.rm = TRUE)) #Computes the number of incidents in each group
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Visualizing the distribution of gun violence cases across states and years.
# Aggregate total_cases by state and year
yearly_violence_state <- yearly_violence |>
  mutate(state = county_state) |> # county_state already holds the full state name
  group_by(state, year) |>
  summarise(total_cases = sum(total_cases, na.rm = TRUE)) |>
  ungroup()
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
# Create a usmap heatmap for a single year (e.g., the most recent year in the data)
# You can change the year as needed
latest_year <- max(yearly_violence_state$year)
yearly_violence_single_year <- yearly_violence_state |>
  filter(year == latest_year)

# Plot usmap heatmap
plot_usmap(
  data = yearly_violence_single_year,
  values = "total_cases",
  color = "white" # White borders between states
) +
  scale_fill_viridis_c(
    option = "C",
    name = "Total Cases"
  ) +
  labs(
    title = paste("Heatmap of Gun Violence Incidents by State,", latest_year),
    caption = "Figure 6: Heatmap of Gun Violence for 2023
    Source: Gun Violence Archive",
    fill = "Total Cases"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.key.width = unit(1, "cm")
  )

String Manipulation.

# Creating contains_robbery variables
gun_robbery <- gun_violence |>
  filter(!is.na(incident_characteristics)) |>
  mutate(contains_robbery = str_detect(incident_characteristics, regex("Armed robbery with injury", ignore_case = TRUE)))
# Summarizing the count of "Armed robbery with injury"
gun_robbery |>
summarize(armed_robbery_count = sum(contains_robbery))
# A tibble: 1 × 1
  armed_robbery_count
                <int>
1               20890
#Counting Occurrences of contains_robbery
gun_robbery |>
count(contains_robbery)
# A tibble: 2 × 2
  contains_robbery      n
  <lgl>             <int>
1 FALSE            368858
2 TRUE              20890
# using janitor package to get percentages.
gun_robbery |>
  tabyl(contains_robbery)
 contains_robbery      n    percent
            FALSE 368858 0.94640126
             TRUE  20890 0.05359874

Interpretation: About 5.36% of the filtered gun violence incidents involve armed robbery with injury, a notable but relatively small proportion.
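As a quick arithmetic check, the tabyl proportion can be reproduced directly from the counts reported above:

```r
# Reproducing the tabyl proportion from the counts in the output above
armed_robbery <- 20890   # incidents flagged as "Armed robbery with injury"
other <- 368858          # remaining incidents
prop_robbery <- armed_robbery / (armed_robbery + other)
round(prop_robbery * 100, 2)  # ≈ 5.36
```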

Investigating which states have the highest counts or proportions of armed robbery incidents.

# By state and year
robbery_2019 <- gun_robbery |>
  filter(year == 2019) |>
  group_by(state) |>
  summarise(robbery_count = sum(contains_robbery),
            total_incidents = n(),
            robbery_mean = mean(contains_robbery) * 100) |>
  arrange(desc(robbery_count)) |>
  head(16)

# Display the table
flextable(robbery_2019) |>
  colformat_double(digits = 3) |>
  set_caption("Table 4: States with the highest counts of armed robbery incidents") |>
  autofit()

Table 4: States with the highest counts of armed robbery incidents

state robbery_count total_incidents robbery_mean
Texas 269 2,541 10.586
Florida 135 2,041 6.614
Georgia 111 1,209 9.181
North Carolina 107 1,371 7.805
Ohio 106 1,617 6.555
Pennsylvania 97 2,015 4.814
Illinois 93 2,829 3.287
Tennessee 93 1,319 7.051
California 82 2,447 3.351
Louisiana 77 1,294 5.951
Missouri 66 1,194 5.528
Maryland 65 1,239 5.246
South Carolina 61 1,043 5.849
Indiana 56 955 5.864
Alabama 52 978 5.317
Michigan 46 1,083 4.247

Interpretation: Texas leads with the highest count among the listed states. Texas, Florida, Georgia, North Carolina, Tennessee, and Louisiana are all in the southern U.S., suggesting a regional concentration of armed robbery with injury incidents.

Visualizing the above results.
library(grid)
# Load a gun image from a local file and fade it for use as a plot background
gun_image <- image_read("gun_image.jpg") |>
  image_colorize(opacity = 50, color = "white")

# Save the image temporarily (or use local file directly)
temp_image <- tempfile(fileext = ".png")
image_write(gun_image, temp_image)

# Convert to raster for use in ggplot
gunimage_grob <- rasterGrob(gun_image, width = unit(1, "npc"), height = unit(1, "npc"))

robbery_2019 |>
  ggplot(aes(x = reorder(state, robbery_count), y = robbery_count)) +
  annotation_custom(gunimage_grob, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  coord_flip() +
  labs(title = "Figure 7 : Armed Robbery with Injury Incidents by State", 
       x = "State", y = "Incident Count") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "transparent", color = NA),
    plot.background = element_rect(fill = "transparent", color = NA)
  )

Investigating incident characteristic over the years for California.

# Clean incident_characteristics
gun_robbery_clean <- gun_robbery |>
  filter(!is.na(incident_characteristics)) |>
  mutate(incident_characteristics = str_replace_all(
    incident_characteristics, "\\r\\n|\\n", ",")) |>
  separate_rows(incident_characteristics, sep = ",") |>
  mutate(incident_characteristics = str_trim(incident_characteristics)) |>
  filter(incident_characteristics != "")

# Build data for animation
anim_data <- gun_robbery_clean |>
  filter(state == "California") |>
  count(year, incident_characteristics, sort = TRUE) |>
  group_by(year) |>
  slice_max(n, n = 10) |>
  ungroup()

# Build plotly animation
anim_data |>
  plot_ly(
    x = ~n,
    y = ~reorder(incident_characteristics, n),
    frame = ~year,
    type = "bar",
    orientation = "h",
    color = ~incident_characteristics,
    colors = "Set3",
    showlegend = FALSE
  ) |>
  layout(
    title = "Figure 8: Top 10 Incident Types in California by Year",
    xaxis = list(title = "Count"),
    yaxis = list(title = ""),
    paper_bgcolor = "white",
    plot_bgcolor = "white"
  ) |>
  animation_opts(
    frame = 800,
    transition = 400,
    easing = "cubic-in-out"
  ) |>
  animation_button(
    x = 0, xanchor = "left",
    y = -0.1, yanchor = "bottom",
    label = "▶ Play"
  ) |>
  animation_slider(
    currentvalue = list(prefix = "Year: ")
  )

Merging the two datasets.

Merging the two datasets by year and county_state with inner_join to create a new dataset called `pop_gunviolence`.

economic_indic <- us_data |>
  filter(year >= 2014, year <= 2023) |>
  select(county_state, year, population, median_income, prop_poverty)
pop_gunviolence <- yearly_violence |>
  inner_join(economic_indic, by = c("year", "county_state"))
skim(head(pop_gunviolence, 15))
Data summary
Name head(pop_gunviolence, 15)
Number of rows 15
Number of columns 7
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables year

Variable type: character

skim_variable year n_missing complete_rate min max empty n_unique whitespace
county_state 2014 0 1 5 20 0 15 0

Variable type: numeric

skim_variable year n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cases 2014 0 1 681.87 748.17 20.00 214.00 343.00 847.00 2231.00 ▇▂▁▁▂
total_victims_killed 2014 0 1 278.20 325.58 6.00 55.50 145.00 375.00 1147.00 ▇▂▁▁▁
population 2014 0 1 7810374.13 10083654.60 658893.00 1527012.50 4849377.00 8414413.50 38802500.00 ▇▂▁▁▁
median_income 2014 0 1 56767.87 10665.98 41262.00 48591.00 57444.00 65762.50 71648.00 ▃▇▂▅▆
prop_poverty 2014 0 1 0.15 0.03 0.11 0.12 0.15 0.18 0.19 ▆▂▅▃▇
knitr::kable(head(pop_gunviolence,10),
             caption = "Table 5: Population and Gun Violence Data")
Table 5: Population and Gun Violence Data
year county_state total_cases total_victims_killed population median_income prop_poverty
2014 Alabama 725 264 4849377 42830 0.1925258
2014 Alaska 63 24 736732 71583 0.1123372
2014 Arizona 343 175 6731484 50068 0.1824119
2014 Arkansas 295 145 2966369 41262 0.1886608
2014 California 2144 1147 38802500 61933 0.1644525
2014 Colorado 233 113 5355866 61303 0.1204445
2014 Connecticut 225 67 3596677 70048 0.1075192
2014 Delaware 203 44 935614 59716 0.1247874
2014 District of Columbia 353 82 658893 71648 0.1772565
2014 Florida 1672 722 19893297 47463 0.1649502

Creating a new variable called gun_violence_rate, which standardizes the number of gun violence cases (total_cases) by population size. Multiplying by 100,000 turns it into a rate that is easier to interpret and compare across regions and years.

# Calculate the rate of gun violence per 100,000 population
pop_gunviolence <- pop_gunviolence |>
 mutate(gun_violence_rate = total_cases / population * 100000)
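As a sanity check of the formula, the rate can be recomputed by hand from the Alabama 2014 row shown in Table 5 (725 cases, population 4,849,377):

```r
# Recomputing the per-100,000 rate for Alabama, 2014 (values from Table 5)
rate_alabama_2014 <- 725 / 4849377 * 100000
round(rate_alabama_2014, 2)  # ≈ 14.95 cases per 100,000 residents
```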

Randomization test on the pop_gunviolence data.

# Randomization test for population
randomization_test <- function(data, n_randomizations = 1000) {
 observed_correlation <- cor(data$population, data$gun_violence_rate) # Computes the observed correlation between population and gun_violence_rate
 randomized_correlations <- numeric(n_randomizations)
 
# Randomize the population and calculate correlation each time.
 for (i in 1:n_randomizations) {
 randomized_data <- data |>
 mutate(population = sample(population))
 randomized_correlations[i] <- cor(randomized_data$population, randomized_data$gun_violence_rate)
 }
 
# Calculate p_value.
 p_value <- mean(abs(randomized_correlations) >= abs(observed_correlation))
 return(list(observed_correlation = observed_correlation, p_value = p_value,permuted_distribution = randomized_correlations))
}

# Run the test
result_pop <- randomization_test(pop_gunviolence) 

The observed correlation is -0.0516, indicating that population size has little to no linear association with the gun violence rate in this dataset. The p-value is 0.259, so there is no statistically significant evidence that the observed correlation differs from what could occur by random chance.

Visualizing the null distribution of permuted correlations.
p <- ggplot(data.frame(correlation = result_pop$permuted_distribution), 
            aes(x = correlation)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  geom_vline(xintercept = result_pop$observed_correlation, 
             color = "red", linetype = "dashed", linewidth = 1.2) +
  labs(
    title = paste0("Null Distribution | Observed r = ", 
                   round(result_pop$observed_correlation, 3),
                   " | p = ", round(result_pop$p_value, 3)),
    x = "Correlation between Population and Gun Violence Rate",
    y = "Frequency",
    caption = "Figure 9: Null distribution of Permutation Correlations,
    Data Source: American Community Survey (ACS), Gun Violence Archive"
  ) +
  theme_minimal()

p

The permutation test results show a weak negative correlation (-0.052) between population and gun violence rate, with a p-value of 0.259. The visualization displays the null distribution of correlations that would occur by random chance, with our observed correlation (red dashed line) falling well within this distribution. Since the p-value exceeds the conventional threshold of 0.05, we fail to reject the null hypothesis: population alone may not be a reliable predictor of gun violence rates.

Permutation test for prop_poverty and gun_violence_rate.

Permutation test to evaluate the relationship between prop_poverty and gun_violence_rate

# Perform permutations and randomization

set.seed(1994) # Setting seed for reproducibility

# Permutation test for prop_poverty
# Defining permutation test function.
perm_test <- function(data, n_permutations = 1000) {
  observed_correlation <- cor(data$prop_poverty, data$gun_violence_rate)
  permuted_correlations <- numeric(n_permutations)
 
 # Running the permutations
for (i in 1:n_permutations) {
  permuted_data <- data |>
    mutate(prop_poverty = sample(prop_poverty))
  permuted_correlations[i] <- cor(permuted_data$prop_poverty, permuted_data$gun_violence_rate)
 }
 
p_value <- mean(abs(permuted_correlations) >= abs(observed_correlation))
return(list(observed_correlation = observed_correlation, p_value = p_value))
}

# Perform the permutation test
result <- perm_test(pop_gunviolence)
print(result)
$observed_correlation
[1] 0.3165579

$p_value
[1] 0

The observed correlation is 0.317 with a permutation p-value of 0, meaning none of the 1,000 permuted correlations was as extreme as the observed one. This provides strong evidence that poverty rates and gun violence rates are meaningfully related in our dataset, suggesting that socioeconomic factors may play an important role in explaining variations in gun violence across different areas.
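A permutation p-value of exactly 0 really means p < 1/1000 at this number of permutations. A common finite-sample adjustment adds one to both numerator and denominator, guaranteeing a nonzero, slightly conservative estimate. A sketch on simulated null correlations (permuted and observed here are stand-ins, not the project data):

```r
set.seed(1)
permuted <- rnorm(1000, mean = 0, sd = 0.1)  # stand-in for permuted correlations
observed <- 0.3165579                        # observed correlation from above

# Add-one adjustment: p can never be reported as exactly 0
p_adjusted <- (sum(abs(permuted) >= abs(observed)) + 1) / (length(permuted) + 1)
p_adjusted
```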

Relationship between poverty rate and gun violence rate.

Creating a new variable `poverty_group` by splitting areas at the median poverty rate.

pop_gunviolence <- pop_gunviolence |>
  mutate(poverty_group = ifelse(prop_poverty > median(prop_poverty, na.rm = TRUE),
                                "High Poverty", "Low Poverty"))

Creating a density plot to check for skewness and normality.

ggplot(pop_gunviolence, aes(x = gun_violence_rate, fill = poverty_group)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("High Poverty" = "#e63946", "Low Poverty" = "#457b9d")) +
  facet_grid(poverty_group ~ .) +
  labs(
    title = "Figure 10: Density Plot of Gun Violence Rates by Poverty Group",
    x = "Gun Violence Rate (per 100,000 population)",
    y = "Density",
    fill = "Poverty Group"
  ) +
  theme_minimal()

Two-sample t-test

Next, we state the hypotheses and implement Welch’s two-sample t-test to test whether the average gun violence rate differs between high poverty and low poverty areas. \[ H_0: \mu_{\text{High}} = \mu_{\text{Low}} \] \[ H_a: \mu_{\text{High}} \neq \mu_{\text{Low}} \] where \(\mu_{\text{High}}\) is the average gun violence rate in high poverty areas and \(\mu_{\text{Low}}\) is the average gun violence rate in low poverty areas.
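For reference, Welch’s statistic compares the group means without pooling the two sample variances (this is the standard formula, stated here with the group labels from our hypotheses):

\[ t = \frac{\bar{x}_{\text{High}} - \bar{x}_{\text{Low}}}{\sqrt{\frac{s^2_{\text{High}}}{n_{\text{High}}} + \frac{s^2_{\text{Low}}}{n_{\text{Low}}}}} \]

with degrees of freedom given by the Welch–Satterthwaite approximation, which is why the reported df (300.758) is not an integer.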

# Reorder factor levels for t-test
pop_gunviolence <- pop_gunviolence |>
  dplyr::mutate(poverty_group = fct_relevel(poverty_group, "High Poverty", "Low Poverty"))


t_test_result <- t.test(gun_violence_rate ~ poverty_group, data = pop_gunviolence)

# Display results of t-test
t_test_result |> 
  broom::tidy() |> 
  flextable() |> 
  colformat_double(digits = 3) |> 
set_formatter(p.value = function(x) {format.pval(x, digits = 3)}) |> 
  set_caption("Table 6: Results of two-sample t-test comparing gun violence rates between high and low poverty areas.") |> 
  autofit() |> 
  fit_to_width(max_width = 7)

Table 6: Results of two-sample t-test comparing gun violence rates between high and low poverty areas.

| estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| 6.962 | 15.879 | 8.917 | 6.633 | 1.53e-10 | 300.758 | 4.897 | 9.028 | Welch Two Sample t-test | two.sided |

We observed high positive skewness in the gun violence rate variable, particularly in the High Poverty group. Given the violation of the normality assumption required for the t-test, we proceeded with a non-parametric permutation test to assess the difference in means, which does not rely on distributional assumptions.
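The skewness judgment from Figure 10 can also be checked numerically. Below is a minimal base-R sketch of the moment estimator of sample skewness (the helper name and the toy right-skewed sample are ours, not the report's data):

```r
# Sample skewness via the moment estimator (base R; no package assumed)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

# Toy illustration: an exponential sample is right-skewed (theoretical value 2)
set.seed(42)
sample_skewness(rexp(1000))  # clearly positive
```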

Calculating standard deviations and variances for each poverty group:

pop_gunviolence |>
  group_by(poverty_group) |>
  summarise(
    n = n(),
    Mean = mean(gun_violence_rate, na.rm = TRUE),
    SD = sd(gun_violence_rate, na.rm = TRUE),
    Var = var(gun_violence_rate, na.rm = TRUE)) |>
    
flextable() |> 
  colformat_double(digits = 3) |> 
  set_caption("Table 7: Summary statistics of gun violence rate by poverty group") |>
  autofit()

Table 7: Summary statistics of gun violence rate by poverty group.

| poverty_group | n | Mean | SD | Var |
|---|---|---|---|---|
| High Poverty | 225 | 15.879 | 14.520 | 210.822 |
| Low Poverty | 234 | 8.917 | 6.209 | 38.552 |

The variance of gun violence rates in the High Poverty group was 210.82, compared to 38.55 in the Low Poverty group (standard deviations 14.52 and 6.21, respectively). This indicates much higher variability in the High Poverty group. The difference in variance supports the use of Welch’s t-test, which does not assume equal variances between groups.
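As a quick arithmetic check on Table 7, the ratio of the two group variances is far from 1:

```r
# Variance ratio from Table 7: High Poverty vs. Low Poverty
210.822 / 38.552  # ≈ 5.47, well away from 1, so the unpooled (Welch) test is the safer choice
```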

Assessing the statistical significance of the difference

Performing a permutation test to assess whether there is a statistically significant difference in the gun violence rate between two groups defined by the poverty_group variable in the pop_gunviolence dataset.

# Number of permutations to do
n_permutations <- 10000

# Instantiating vector for test statistics
permutation_statistics <- numeric(n_permutations)

# Calculating t-test statistic for each permutation
for (p in 1:n_permutations) {
  permuted_data <- pop_gunviolence |> 
    mutate(poverty_group = sample(poverty_group, replace = FALSE))  # Shuffle group labels
  
  permutation_statistics[p] <- t.test(gun_violence_rate ~ poverty_group, 
                                      data = permuted_data,
                                      alternative = "greater") |> 
    broom::tidy() |> 
    pull(statistic)
}

# p-value: proportion of permuted t-statistics at least as large as the observed Welch t-statistic
pp_value <- mean(permutation_statistics >= t_test_result$statistic, na.rm = TRUE)

Interpretation: the permutation p-value is well below 0.05, so we reject the null hypothesis, suggesting that high-poverty areas have a significantly higher gun violence rate than low-poverty areas, consistent with the Welch t-test above.

Create a histogram displaying the null distribution obtained for the randomization test

# Organizing into a tibble
computed_statistics <- tibble(Value = permutation_statistics)

d <- computed_statistics |>
  ggplot(aes(x = Value)) +
  geom_histogram(aes(y= after_stat(density)), color = "white") +
  geom_density(color = "maroon1", linewidth = 1) +
  stat_function(fun = function(x) dt(x, df = t_test_result$parameter),color = "seagreen",linewidth = 1) +
  labs(title = "Figure 11: Randomization null distribution") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  geom_vline(xintercept = quantile(permutation_statistics, probs = 0.95), color = "red", linetype = "solid") +
  geom_vline(xintercept = t_test_result$statistic, color= "blue",linetype = "dotted")

d

The distribution is approximately centered at zero, consistent with the null hypothesis of no group difference. The blue dotted line shows the observed t-statistic, while the red solid line marks the 95th percentile of the null distribution. If the observed statistic lies beyond this threshold, it provides evidence against the null hypothesis.

Bootstrapping

# Subsetting to only Poverty groups classified as High Poverty
high_pdata <- pop_gunviolence |>
  filter(poverty_group == "High Poverty")

# Sample size
nrow(high_pdata) # Returns the number of observations of HighPoverty
[1] 225
# Sample median
median(high_pdata$gun_violence_rate)
[1] 13.11814

Let’s implement the nonparametric bootstrap.

# Number of bootstrap samples
n_boots <- 10000

# Instantiating matrix for bootstrap samples
boots <- matrix(NA, nrow = nrow(high_pdata), ncol = n_boots)

# Sampling with replacement n_boots times
for(b in 1:n_boots) {
  boots[, b] <- high_pdata |> 
    slice_sample(prop = 1, replace = TRUE) |> 
    dplyr::pull(gun_violence_rate)
}

Using the generated bootstrap samples, we create a bootstrap distribution of sample medians and visualize it with a histogram.

# Instantiating vector for bootstrap medians
boot_medians <- vector(length = n_boots)

# Calculating medians for bootstrap samples
for(b in 1:n_boots) {
  boot_medians[b] <- median(boots[, b])
}

Visualizing.

# Creating a tibble
tidy_boot_medians <- tibble(Median = boot_medians)
  
# Creating a histogram
tidy_boot_medians |> 
  ggplot(aes(x = Median)) +
  geom_histogram(color = "black") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Figure 12: Nonparametric Bootstrap Distribution of Median Gun Violence Incidents",
       y = "Frequency") +
  geom_vline(xintercept = quantile(boot_medians, probs = c(0.025, 0.975)), 
             color = "blue")

# Calculating an estimated standard error
sd(boot_medians)
[1] 0.6189031
# Calculating a nonparametric confidence interval
nonparametric_ci <- quantile(boot_medians, probs = c(0.025, 0.975))

#Convert to data frame
np_ci <- data.frame(
  Lower = nonparametric_ci[1],
  Upper = nonparametric_ci[2]
)

# Create flextable
flextable(np_ci) |>
  set_caption("Table 8: Nonparametric Confidence Interval") |>
  autofit()

Table 8: Nonparametric Confidence Interval.

| Lower | Upper |
|---|---|
| 11.65563 | 14.18552 |

Interpretation: The 95% bootstrap confidence interval (11.65563, 14.18552) indicates that we are 95% confident that the true median gun violence rate in high-poverty areas lies between approximately 11.66 and 14.19 incidents per 100,000 population.
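As a sanity check on the percentile interval, the bootstrap SE reported above also supports a normal-approximation interval. A minimal sketch using only the printed values (sample median 13.118, bootstrap SE 0.619), not a refit of the data:

```r
# Normal-approximation bootstrap interval: median_hat ± z_{0.975} * SE_boot
boot_se <- 0.619
sample_median <- 13.118
ci_normal <- sample_median + c(-1, 1) * qnorm(0.975) * boot_se
ci_normal  # roughly (11.90, 14.33), close to the percentile interval (11.66, 14.19)
```

The two intervals agree closely, which suggests the bootstrap distribution of the median is nearly symmetric here.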

Conclusion/Main Takeaways.

The analysis reveals a significant relationship between poverty and gun violence rates across U.S. states from 2014 to 2023, with high-poverty areas exhibiting significantly higher gun violence rates compared to low-poverty areas, as confirmed by a permutation test (p-value < 0.05). While population size showed a weak, non-significant correlation with gun violence rates (p = 0.259), the data highlighted regional patterns, such as higher armed robbery incidents in southern states like Texas and Florida. Visualizations, including heatmaps and density plots, underscored temporal and geographic variations, with gun violence peaking in California during summer weekends. The bootstrap analysis further indicated that the median gun violence rate in high-poverty states ranges between 11.66 and 14.19 incidents per 100,000 population, emphasizing the persistent socioeconomic drivers of gun violence and the need for targeted interventions in economically disadvantaged areas.

Contributions of Each Group Member

Kate Wema Juma

  1. Data Dictionary Development and Summary Tables: Authored the data dictionary for the ACS dataset; created and formatted Table 1 (Top 10 States by Average Median Income, 2014–2023) and Table 2 (Summary Statistics Over the Years).
  2. Data Cleaning and Manipulation: Implemented date/time processing with lubridate functions (ymd, year, month, day) for temporal analysis (e.g., Figure 5).
  3. Exploratory Data Analysis (EDA): Conducted missingness analysis for the ACS dataset using skimr and created Figure 1 (Lollipop Plot of Missingness for ACS Data).
  4. Visualizations: Created sophisticated visualizations, including Figure 5 (Calendar Heatmap of Gun Violence in California, 2019) using calendR, Figure 6 (US Map Heatmap of Gun Violence, 2023) using usmap, and Figure 8 (Animated Bar Chart of Incident Types in California) using gganimate.
  5. Statistical Analysis: Wrote and executed code for permutation tests (Figures 9 and 11) and bootstrap analysis (Figure 12, Table 8), including hypothesis formulation, null distribution visualization, and interpretation of results (e.g., p-value for poverty vs. gun violence rate).
  6. Report Writing: Drafted the introduction, exploratory data analysis, and conclusion sections.
  7. Code Organization: Added descriptive comments to R code chunks with required section headers (e.g., Data Dictionary, Exploratory Data Analysis).
  8. Extra Credit Contribution (Combined): Partnered to design and implement interactive visualizations using ggplotly for Figures 9 and 11 (null distributions of permutation tests). Contributed to brainstorming the animated bar chart (Figure 8) using gganimate. Contributed to dashboard.

Cynthia Mutua

  1. Data Dictionary Development: Authored the data dictionary for the Gun Violence dataset.
  2. Exploratory Data Analysis (EDA): Conducted missingness analysis for the Gun Violence dataset and created Figure 2 (Lollipop Plot of Missingness for Gun Violence Data).
  3. Data Cleaning and Manipulation: Implemented string manipulation using str_detect, str_replace_all, and str_trim to analyze incident_characteristics (e.g., Table 4, Figure 7).
  4. Visualizations: Created sophisticated visualizations, including Figure 5 (Calendar Heatmap of Gun Violence in California, 2019) using calendR, Figure 6 (US Map Heatmap of Gun Violence, 2023) using usmap, and Figure 8 (Animated Bar Chart of Incident Types in California) using gganimate.
  5. Statistical Analysis: Wrote and executed code for permutation tests (Figures 9 and 11) and bootstrap analysis (Figure 12, Table 8), including hypothesis formulation, null distribution visualization, and interpretation of results (e.g., p-value for poverty vs. gun violence rate).
  6. Report Writing: Drafted sections on data cleaning, permutation tests, and bootstrap analysis.
  7. Extra Credit Contribution (Combined): Partnered to develop interactive visualizations using ggplotly for Figures 9 and 11, ensuring seamless integration of interactive features. Co-designed the animated bar chart (Figure 8) using gganimate. Contributed to dashboard.

Combined Efforts

  1. Merging Datasets: Both members collaborated on merging the ACS and Gun Violence datasets using inner_join to create the pop_gunviolence dataset, enabling analyses like Figure 10 (Density Plot of Gun Violence Rates by Poverty Group) and Table 5.
  2. Extra Credit Focus: Worked together to implement ggplotly for interactive plots (Figures 9 and 11). Collaborated on the gganimate-based Figure 8, which animates changes in gun violence incident types in California.
  3. Project Review and Debugging: Both members reviewed the Quarto file for errors. We also cross-checked visualizations and tables.