Welcome to my CMSC320 Data Science tutorial! In this tutorial, we will use the data science pipeline to analyze a dataset (retrieving data, cleaning data, visual exploration/analysis, and machine learning).
The data we will use covers US police shootings. Given the lack of gun control and the prevalence of systemic racism in the United States, incidents of shootings (mass or otherwise) and police brutality have become more and more visible. I want to analyze the data behind some of these shootings, investigate the various factors that contribute to these deaths, and determine how strong the correlation is between race and fatal shootings. Then, we will attempt to develop a model that predicts race from the given data.
Below you will find all the packages this tutorial will use. We will use pandas to create dataframes/tables, matplotlib and folium (with numpy assisting) to visualize the data, and sklearn/statsmodels for machine learning and significance testing.
import pandas as pd
import matplotlib.pyplot as plt
import folium
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats
from statsmodels.formula.api import ols
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
The data we will be looking at is from the Washington Post "Fatal Force" GitHub repository: https://github.com/washingtonpost/data-police-shootings, which can also be found on Kaggle. This database contains records of every fatal shooting in the United States by a police officer in the line of duty since Jan. 1, 2015. As stated in their GitHub: "The Post is documenting only those shootings in which a police officer, in the line of duty, shoots and kills a civilian — the circumstances that most closely parallel the 2014 killing of Michael Brown in Ferguson, Mo., which began the protest movement culminating in Black Lives Matter and an increased focus on police accountability nationwide. The Post is not tracking deaths of people in police custody, fatal shootings by off-duty officers or non-shooting deaths."
After downloading the CSV from their GitHub, we can retrieve it as follows:
#read the csv through pandas
raw_data = pd.read_csv("fatal-police-shootings-data.csv", sep = ",")
#show the table
raw_data.head()
  | id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera | longitude | latitude | is_geocoding_exact
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 3 | Tim Elliot | 2015-01-02 | shot | gun | 53.0 | M | A | Shelton | WA | True | attack | Not fleeing | False | -123.122 | 47.247 | True |
1 | 4 | Lewis Lee Lembke | 2015-01-02 | shot | gun | 47.0 | M | W | Aloha | OR | False | attack | Not fleeing | False | -122.892 | 45.487 | True |
2 | 5 | John Paul Quintero | 2015-01-03 | shot and Tasered | unarmed | 23.0 | M | H | Wichita | KS | False | other | Not fleeing | False | -97.281 | 37.695 | True |
3 | 8 | Matthew Hoffman | 2015-01-04 | shot | toy weapon | 32.0 | M | W | San Francisco | CA | True | attack | Not fleeing | False | -122.422 | 37.763 | True |
4 | 9 | Michael Rodriguez | 2015-01-04 | shot | nail gun | 39.0 | M | H | Evans | CO | False | attack | Not fleeing | False | -104.692 | 40.384 | True |
Above you can see the raw CSV data from the GitHub repository. The columns include the victim's name, date of death, age, race, state, etc. The next thing we want to do is tidy/clean the data to make it useful for analysis. Pandas has a lot of functionality that allows us to manipulate and edit dataframes.
The main thing we want to investigate is the relationship between race and fatal shootings. The current race column uses letters to symbolize the different groups, and the GitHub notes that missing ('NA') values mean the person's race was unknown. Let's replace these letters with the actual race labels, and replace all the missing values with the label "Unknown."
#note: replace() here operates on the whole dataframe, so missing values in every column (not just race) become 'Unknown' - later steps rely on this
raw_data = raw_data.replace({'A': 'Asian', 'W': 'White', 'B': 'Black', 'N': 'Native', 'H': 'Hispanic', 'O': 'Other', np.nan: 'Unknown'})
raw_data.head()
  | id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera | longitude | latitude | is_geocoding_exact
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 3 | Tim Elliot | 2015-01-02 | shot | gun | 53.0 | M | Asian | Shelton | WA | True | attack | Not fleeing | False | -123.122 | 47.247 | True |
1 | 4 | Lewis Lee Lembke | 2015-01-02 | shot | gun | 47.0 | M | White | Aloha | OR | False | attack | Not fleeing | False | -122.892 | 45.487 | True |
2 | 5 | John Paul Quintero | 2015-01-03 | shot and Tasered | unarmed | 23.0 | M | Hispanic | Wichita | KS | False | other | Not fleeing | False | -97.281 | 37.695 | True |
3 | 8 | Matthew Hoffman | 2015-01-04 | shot | toy weapon | 32.0 | M | White | San Francisco | CA | True | attack | Not fleeing | False | -122.422 | 37.763 | True |
4 | 9 | Michael Rodriguez | 2015-01-04 | shot | nail gun | 39.0 | M | Hispanic | Evans | CO | False | attack | Not fleeing | False | -104.692 | 40.384 | True |
#sort by race and date
raw_data.sort_values(by = ["race", "date"])
  | id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera | longitude | latitude | is_geocoding_exact
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 3 | Tim Elliot | 2015-01-02 | shot | gun | 53.0 | M | Asian | Shelton | WA | True | attack | Not fleeing | False | -123.122 | 47.247 | True |
70 | 346 | Matautu Nuu | 2015-01-28 | shot and Tasered | hammer | 35.0 | M | Asian | Stockton | CA | True | attack | Not fleeing | False | -121.316 | 38.029 | True |
153 | 195 | Hung Trieu | 2015-03-01 | shot | gun | 35.0 | M | Asian | Houston | TX | False | attack | Not fleeing | False | -95.622 | 29.704 | True |
160 | 249 | Carl Lao | 2015-03-04 | shot | gun | 28.0 | M | Asian | Stockton | CA | False | attack | Not fleeing | False | -121.286 | 37.948 | True |
265 | 359 | Joseph Jeremy Weber | 2015-04-08 | shot | knife | 28.0 | M | Asian | Sunnyvale | CA | True | other | Not fleeing | False | -121.995 | 37.404 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6212 | 6778 | Peyton Ham | 2021-04-13 | shot | knife | 16.0 | M | White | Leonardtown | MD | False | attack | Not fleeing | False | -76.634 | 38.3 | True |
6217 | 6792 | Jeffrey Guy Sacks | 2021-04-15 | shot | knife | 26.0 | M | White | North Lauderdale | FL | True | other | Not fleeing | False | -80.222 | 26.207 | True |
6223 | 6783 | Bradley Michael Olsen | 2021-04-18 | shot | gun | 30.0 | M | White | Burnsville | MN | False | attack | Car | True | -93.266 | 44.737 | True |
6238 | 6801 | Benjamin Ridley | 2021-04-24 | shot | gun | 29.0 | M | White | Webbers Falls | OK | False | other | Unknown | False | -95.171 | 35.553 | True |
6240 | 6804 | Richard Solitro | 2021-04-24 | shot | undetermined | 34.0 | M | White | Los Angeles | CA | True | other | Not fleeing | True | -118.362 | 34.098 | True |
6241 rows × 17 columns
Great. Now it is easier to read the races, but we still need to make the data easier to analyze by race. Rather than having one row per victim, we can make a new data frame where each row is a race category. That way we can compare values/characteristics between the races as groups.
#dictionary to collect the values. The key-value pairs will be Race: Num Killed for that Race.
race_dict = {}
#for each row in the raw data, increment the corresponding race value by 1
for index, row in raw_data.iterrows():
    race = raw_data['race'][index]
    if race in race_dict.keys():
        race_dict[race] = race_dict[race] + 1
    else:
        race_dict[race] = 1
tidied_data = pd.DataFrame(list(race_dict.items()), columns = ['Race', 'Number Killed'])
tidied_data.head()
  | Race | Number Killed
---|---|---
0 | Asian | 105 |
1 | White | 2886 |
2 | Hispanic | 1057 |
3 | Black | 1507 |
4 | Other | 47 |
Just like this. From here, we can add most of the other categories. For example, the raw data's age column can be bucketed by race into age groups: below 18 (children), 18 to 30 (young adults), 30 to 60 (older adults), and over 60 (seniors). The raw data's threat level column can be converted into a column counting, per race, the individuals who were or were not attacking during their encounter. We can also count the individuals who showed signs of mental illness, and the encounters where police were not wearing a body camera.
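(As a quick aside, pandas can produce the same per-race counts as the counting loop above in essentially one line; a minimal sketch, where race_counts and tidied_alt are just illustrative names:)
#one-line equivalent of the counting loop: value_counts tallies each race
race_counts = raw_data['race'].value_counts()
tidied_alt = race_counts.rename_axis('Race').reset_index(name = 'Number Killed')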
But before we add those columns, what would be the flaw of just counting and comparing the raw quantities? Let's quickly graph what we have to see:
#graph bar plot of race and total fatal shootings for that group
tidied_data.plot.bar(x = "Race", y = "Number Killed", title = "Count of Fatal Shootings")
<AxesSubplot:title={'center':'Count of Fatal Shootings'}, xlabel='Race'>
So at first glance, we see that white people have had the most fatal encounters. So does that mean we can conclude that white people suffer the most fatal shootings and the idea of police brutality and systemic racism is a farce? What is misleading about this graph?
The main misleading component is that these quantities are not out of the same total. There are far more white people in this country than any other race, so naturally the quantity will be larger. Therefore, it is misleading to compare the data this way.
Instead, we should look at the number killed as a proportion. Since these quantities do not share a common total, let's divide each count by a total that puts the races on equal footing. What total should that be? For that, we can look at demographics. We can use the 2019 data from the Census: https://www.census.gov/quickfacts/fact/table/US/PST045219 and see the following:
Percent White: 60.1%
Percent Black: 13.4%
Percent Asian: 5.9%
Percent Hispanic: 18.5%
Percent Native: 1.3%
Using these numbers, we can divide each count by the corresponding demographic percentage to get a proportional count of fatal shootings for each race. The result answers: "per 1% of the population of a given race, how many people suffer a fatal shooting?" This makes comparisons fairer and more accurate, as it "equalizes" the impact/significance of each value. We should apply this to all the quantity columns so that we can fairly compare the races. So rather than "number killed" or "number of children," each column will be "proportion killed/children."
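To make the idea concrete, here is the arithmetic on two counts we already have (the results match the table we are about to build):
#sanity check of the normalization: count divided by demographic percentage
print(2886 / 60.1)   #White: about 48.0 victims per 1% of the white population
print(1507 / 13.4)   #Black: about 112.5 victims per 1% of the Black population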
With this set, let's complete the table.
#separate the raw data by race
groups = raw_data.groupby('race')
#this dictionary consists of the demographics listed above
demogs = {'White': 60.1, 'Black': 13.4, 'Asian': 5.9, 'Hispanic': 18.5, 'Native': 1.3}
#iterate over each race category - for each one, collect the data for the new columns
for race in demogs.keys():
    df = groups.get_group(race)
    male, female = 0, 0
    below18, eighteen_thirty, thirty_sixty, sixty_greater = 0, 0, 0, 0
    num_shot, num_tasered = 0, 0
    mi = 0
    no_body_cam = 0
    not_attacking = 0
    fled = 0
    for index, row in df.iterrows():
        #note: rows whose gender is 'Unknown' fall into the female count here
        if df['gender'][index] == "M":
            male += 1
        else:
            female += 1
        age = df['age'][index]
        #ages that were missing (now 'Unknown') are excluded from the age buckets
        if age != 'Unknown':
            if age < 18:
                below18 += 1
            elif age >= 18 and age < 30:
                eighteen_thirty += 1
            elif age >= 30 and age < 60:
                thirty_sixty += 1
            else:
                sixty_greater += 1
        if df['manner_of_death'][index] == 'shot':
            num_shot += 1
        else:
            num_tasered += 1
        if df['signs_of_mental_illness'][index] == True:
            mi += 1
        #count encounters where police were NOT wearing a body camera
        if df['body_camera'][index] == False:
            no_body_cam += 1
        #count victims whose threat level was anything other than 'attack'
        if df['threat_level'][index] != 'attack':
            not_attacking += 1
        if df['flee'][index] == 'Car' or df['flee'][index] == 'Foot':
            fled += 1
    #add the data to the new columns, dividing by the race's demographic percentage
    tidied_data.loc[tidied_data.Race == race, 'Prop males'] = male/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop females'] = female/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop Age <18'] = below18/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop Age >=18, <30'] = eighteen_thirty/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop Age >=30, <60'] = thirty_sixty/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop Age >=60'] = sixty_greater/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop Num shot'] = num_shot/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop Num shot and tasered'] = num_tasered/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop Num mental illness signs'] = mi/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop No Body Camera on Police'] = no_body_cam/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop Person Not Attacking'] = not_attacking/demogs[race]
    tidied_data.loc[tidied_data.Race == race, 'Prop Fled by Car/Foot'] = fled/demogs[race]
tidied_data
  | Race | Number Killed | Prop males | Prop females | Prop Age <18 | Prop Age >=18, <30 | Prop Age >=30, <60 | Prop Age >=60 | Prop Num shot | Prop Num shot and tasered | Prop Num mental illness signs | Prop No Body Camera on Police | Prop Person Not Attacking | Prop Fled by Car/Foot
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Asian | 105 | 16.949153 | 0.847458 | 0.677966 | 4.406780 | 11.186441 | 0.847458 | 16.440678 | 1.355932 | 4.576271 | 14.576271 | 8.305085 | 3.389831 |
1 | White | 2886 | 45.174709 | 2.845258 | 0.582363 | 10.782030 | 31.913478 | 4.009983 | 45.723794 | 2.296173 | 14.143095 | 42.961730 | 16.222962 | 12.529118 |
2 | Hispanic | 1057 | 55.405405 | 1.729730 | 1.513514 | 21.081081 | 31.783784 | 1.189189 | 54.054054 | 3.081081 | 10.108108 | 49.081081 | 23.567568 | 17.621622 |
3 | Black | 1507 | 108.656716 | 3.805970 | 3.134328 | 47.686567 | 55.820896 | 3.805970 | 106.194030 | 6.268657 | 17.462687 | 91.641791 | 37.164179 | 40.447761 |
4 | Other | 47 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | Unknown | 551 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | Native | 88 | 63.846154 | 3.846154 | 0.769231 | 28.461538 | 37.692308 | 0.000000 | 64.615385 | 3.076923 | 11.538462 | 56.153846 | 29.230769 | 20.000000 |
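(For reference, much of this tallying can be done with vectorized pandas operations rather than iterrows. A minimal sketch of the idea for a few of the columns, with flags and per_race as illustrative names:)
#vectorized alternative: build boolean flags, then count True values per race
flags = pd.DataFrame({
    'race': raw_data['race'],
    'male': raw_data['gender'] == 'M',
    'mental_illness': raw_data['signs_of_mental_illness'] == True,
    'no_body_cam': raw_data['body_camera'] == False,
    'not_attacking': raw_data['threat_level'] != 'attack',
    'fled': raw_data['flee'].isin(['Car', 'Foot'])
})
per_race = flags.groupby('race').sum()                          #sum of booleans = count of True
per_race = per_race.div(pd.Series(demogs), axis = 0).dropna()   #normalize; drops Other/Unknown
This produces the same proportions for those columns without an explicit Python loop.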
Looking good so far. We still need to convert the first column we had, and it would also be a good idea to add the demographics to the table in case we need them later. Before that, we should drop the "Other" and "Unknown" race categories. We do not have accurate demographic percentages for these, so from here on we will not be analyzing these categories.
#drop unknown categories
tidied_data.drop(tidied_data.loc[tidied_data['Race']== 'Other'].index, inplace=True)
tidied_data.drop(tidied_data.loc[tidied_data['Race']== 'Unknown'].index, inplace=True)
#let's add the demographic percentages to the table in case we need it in the future
demographic = [5.9, 60.1, 18.5, 13.4, 1.3]
tidied_data['demo'] = demographic
#convert the first column we had (Number killed) to a proportion
tidied_data["Prop Killed"] = tidied_data['Number Killed']/tidied_data['demo']
tidied_data
  | Race | Number Killed | Prop males | Prop females | Prop Age <18 | Prop Age >=18, <30 | Prop Age >=30, <60 | Prop Age >=60 | Prop Num shot | Prop Num shot and tasered | Prop Num mental illness signs | Prop No Body Camera on Police | Prop Person Not Attacking | Prop Fled by Car/Foot | demo | Prop Killed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Asian | 105 | 16.949153 | 0.847458 | 0.677966 | 4.406780 | 11.186441 | 0.847458 | 16.440678 | 1.355932 | 4.576271 | 14.576271 | 8.305085 | 3.389831 | 5.9 | 17.796610 |
1 | White | 2886 | 45.174709 | 2.845258 | 0.582363 | 10.782030 | 31.913478 | 4.009983 | 45.723794 | 2.296173 | 14.143095 | 42.961730 | 16.222962 | 12.529118 | 60.1 | 48.019967 |
2 | Hispanic | 1057 | 55.405405 | 1.729730 | 1.513514 | 21.081081 | 31.783784 | 1.189189 | 54.054054 | 3.081081 | 10.108108 | 49.081081 | 23.567568 | 17.621622 | 18.5 | 57.135135 |
3 | Black | 1507 | 108.656716 | 3.805970 | 3.134328 | 47.686567 | 55.820896 | 3.805970 | 106.194030 | 6.268657 | 17.462687 | 91.641791 | 37.164179 | 40.447761 | 13.4 | 112.462687 |
6 | Native | 88 | 63.846154 | 3.846154 | 0.769231 | 28.461538 | 37.692308 | 0.000000 | 64.615385 | 3.076923 | 11.538462 | 56.153846 | 29.230769 | 20.000000 | 1.3 | 67.692308 |
Great! Now let's tackle the armed category:
The armed category indicates the 'weapon' the victim was carrying during the fatal encounter. Let's look at the list:
set(raw_data['armed'])
{'Airsoft pistol', 'BB gun', 'BB gun and vehicle', 'Taser', 'Unknown', 'air conditioner', 'air pistol', 'ax', 'barstool', 'baseball bat', 'baseball bat and bottle', 'baseball bat and fireplace poker', 'baseball bat and knife', 'baton', 'bayonet', 'bean-bag gun', 'beer bottle', 'binoculars', 'blunt object', 'bottle', 'bow and arrow', 'box cutter', 'brick', 'car, knife and mace', 'carjack', 'chain', 'chain saw', 'chainsaw', 'chair', 'claimed to be armed', "contractor's level", 'cordless drill', 'crossbow', 'crowbar', 'fireworks', 'flagpole', 'flashlight', 'garden tool', 'glass shard', 'grenade', 'gun', 'gun and car', 'gun and knife', 'gun and machete', 'gun and sword', 'gun and vehicle', 'guns and explosives', 'hammer', 'hand torch', 'hatchet', 'hatchet and gun', 'ice pick', 'incendiary device', 'knife', 'knife and vehicle', 'lawn mower blade', 'machete', 'machete and gun', 'meat cleaver', 'metal hand tool', 'metal object', 'metal pipe', 'metal pole', 'metal rake', 'metal stick', 'microphone', 'motorcycle', 'nail gun', 'oar', 'pellet gun', 'pen', 'pepper spray', 'pick-axe', 'piece of wood', 'pipe', 'pitchfork', 'pole', 'pole and knife', 'railroad spikes', 'rock', 'samurai sword', 'scissors', 'screwdriver', 'sharp object', 'shovel', 'spear', 'stapler', 'straight edge razor', 'sword', 'tire iron', 'toy weapon', 'unarmed', 'undetermined', 'unknown weapon', 'vehicle', 'vehicle and gun', 'vehicle and machete', 'walking stick', 'wasp spray', 'wrench'}
Clearly not all of these are worthy of being part of the 'armed' category. Some individuals were unarmed or carried harmless objects like a toy or an air conditioner. Later we will want to look at how many of these victims died despite not carrying anything threatening, so we want to make a column counting, per race, the victims who carried only a nonlethal 'weapon'.
The code will be similar to what we did to make the other columns, except we need an extra step. We have to divide the list of values in the 'armed' category into Lethal and Nonlethal. This may be fairly subjective, but the lists can be seen below. For this tutorial, weapons were included in the Lethal category if they were anything related to a gun, sharp object, or explosive.
#make the nonlethal and lethal lists
Nonlethal = ['walking stick', 'wasp spray', 'wrench', 'vehicle', 'toy weapon', 'unarmed', 'unknown',
             'stapler', 'shovel', 'scissors', 'screwdriver', 'rock', 'pole', 'piece of wood',
             'metal hand tool', 'metal object', 'metal pipe', 'metal pole', 'metal rake', 'metal stick',
             'motorcycle', 'nail gun', 'oar', 'pellet gun', 'pen', 'pepper spray', 'pipe', 'ice pick',
             'hammer', 'hand torch', 'crowbar', 'fireworks', 'flagpole', 'flashlight', 'garden tool',
             'glass shard', 'chair', "contractor's level", 'cordless drill', 'carjack', 'chain',
             'box cutter', 'brick', 'beer bottle', 'blunt object', 'baton', 'barstool', 'baseball bat',
             'baseball bat and bottle', 'baseball bat and fireplace poker', 'air conditioner',
             'air pistol', 'BB gun', 'BB gun and vehicle']
Lethal = ['vehicle and gun', 'vehicle and machete', 'straight edge razor', 'sword', 'spear',
          'sharp object', 'samurai sword', 'pole and knife', 'pitchfork', 'pick-axe',
          'incendiary device', 'knife', 'lawn mower blade', 'machete', 'machete and gun',
          'meat cleaver', 'hatchet', 'hatchet and gun', 'grenade', 'gun', 'gun and car',
          'gun and knife', 'gun and sword', 'gun and vehicle', 'guns and explosives', 'crossbow',
          'chain saw', 'chainsaw', 'car, knife and mace', 'bow and arrow', 'bean-bag gun', 'bayonet',
          'baseball bat and knife', 'ax', 'Taser', 'Airsoft pistol']
groups = raw_data.groupby('race')
#make a column to count the number of nonlethal weapons carried
for race in demogs.keys():
    df = groups.get_group(race)
    non_lethal = 0
    for index, row in df.iterrows():
        if df['armed'][index] in Nonlethal:
            non_lethal += 1
    tidied_data.loc[tidied_data.Race == race, 'Prop Carried nonlethal'] = non_lethal/demogs[race]
tidied_data.head()
  | Race | Number Killed | Prop males | Prop females | Prop Age <18 | Prop Age >=18, <30 | Prop Age >=30, <60 | Prop Age >=60 | Prop Num shot | Prop Num shot and tasered | Prop Num mental illness signs | Prop No Body Camera on Police | Prop Person Not Attacking | Prop Fled by Car/Foot | demo | Prop Killed | Prop Carried nonlethal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Asian | 105 | 16.949153 | 0.847458 | 0.677966 | 4.406780 | 11.186441 | 0.847458 | 16.440678 | 1.355932 | 4.576271 | 14.576271 | 8.305085 | 3.389831 | 5.9 | 17.796610 | 3.220339 |
1 | White | 2886 | 45.174709 | 2.845258 | 0.582363 | 10.782030 | 31.913478 | 4.009983 | 45.723794 | 2.296173 | 14.143095 | 42.961730 | 16.222962 | 12.529118 | 60.1 | 48.019967 | 7.554077 |
2 | Hispanic | 1057 | 55.405405 | 1.729730 | 1.513514 | 21.081081 | 31.783784 | 1.189189 | 54.054054 | 3.081081 | 10.108108 | 49.081081 | 23.567568 | 17.621622 | 18.5 | 57.135135 | 10.162162 |
3 | Black | 1507 | 108.656716 | 3.805970 | 3.134328 | 47.686567 | 55.820896 | 3.805970 | 106.194030 | 6.268657 | 17.462687 | 91.641791 | 37.164179 | 40.447761 | 13.4 | 112.462687 | 20.223881 |
6 | Native | 88 | 63.846154 | 3.846154 | 0.769231 | 28.461538 | 37.692308 | 0.000000 | 64.615385 | 3.076923 | 11.538462 | 56.153846 | 29.230769 | 20.000000 | 1.3 | 67.692308 | 11.538462 |
Great. We have tidied most of the data! We can start the visual analysis now, and if we need to manipulate the data any further we can do so as we encounter the need.
Let's start with the simplest question - what race has suffered the most fatal shootings?
tidied_data.plot.bar(x = "Race", y = "Prop Killed", xlabel = "Race", ylabel = "Proportion of Individuals per 1%", title = "Proportional Count of Fatal Shootings")
<AxesSubplot:title={'center':'Proportional Count of Fatal Shootings'}, xlabel='Race', ylabel='Proportion of Individuals per 1%'>
As we can see above, this graph tells quite a different story than the earlier graph with raw quantities. Black people are killed more per 1% of their population than any other race, more than twice the proportion for white people. This shows that even though more white people may suffer fatal shootings in absolute terms, Black people are disproportionately targeted. Hispanic and Native Americans also appear to be more disproportionately targeted than white people.
Let's visualize some of the other columns. We mainly want to investigate "unfair/unjust" shootings, i.e. incidents that may not necessarily have had to end with the victim fatally shot. Obviously, there are details of each event that remain unknown and may tell a different story. But, for example, we can visualize the number of people who carried a nonlethal weapon but were still shot, or the number of victims that were children (< 18) per race. We can then see if there is a disparity across races (e.g., is one race more disproportionately targeted for a given category?).
tidied_data.head()
  | Race | Number Killed | Prop males | Prop females | Prop Age <18 | Prop Age >=18, <30 | Prop Age >=30, <60 | Prop Age >=60 | Prop Num shot | Prop Num shot and tasered | Prop Num mental illness signs | Prop No Body Camera on Police | Prop Person Not Attacking | Prop Fled by Car/Foot | demo | Prop Killed | Prop Carried nonlethal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Asian | 105 | 16.949153 | 0.847458 | 0.677966 | 4.406780 | 11.186441 | 0.847458 | 16.440678 | 1.355932 | 4.576271 | 14.576271 | 8.305085 | 3.389831 | 5.9 | 17.796610 | 3.220339 |
1 | White | 2886 | 45.174709 | 2.845258 | 0.582363 | 10.782030 | 31.913478 | 4.009983 | 45.723794 | 2.296173 | 14.143095 | 42.961730 | 16.222962 | 12.529118 | 60.1 | 48.019967 | 7.554077 |
2 | Hispanic | 1057 | 55.405405 | 1.729730 | 1.513514 | 21.081081 | 31.783784 | 1.189189 | 54.054054 | 3.081081 | 10.108108 | 49.081081 | 23.567568 | 17.621622 | 18.5 | 57.135135 | 10.162162 |
3 | Black | 1507 | 108.656716 | 3.805970 | 3.134328 | 47.686567 | 55.820896 | 3.805970 | 106.194030 | 6.268657 | 17.462687 | 91.641791 | 37.164179 | 40.447761 | 13.4 | 112.462687 | 20.223881 |
6 | Native | 88 | 63.846154 | 3.846154 | 0.769231 | 28.461538 | 37.692308 | 0.000000 | 64.615385 | 3.076923 | 11.538462 | 56.153846 | 29.230769 | 20.000000 | 1.3 | 67.692308 | 11.538462 |
#graph the other categories
tidied_data.plot.bar(x = "Race",
                     y = ["Prop Carried nonlethal", "Prop Age <18", "Prop Num mental illness signs",
                          "Prop Num shot and tasered", "Prop No Body Camera on Police", "Prop Person Not Attacking"],
                     stacked = False, figsize = (20,10), title = "Visualization of Attributes of Shootings",
                     xlabel = "Race", ylabel = "Proportion of Individuals per 1%").legend(loc = 'best')
<matplotlib.legend.Legend at 0x7faa7bab08b0>
There is a lot of information we can gather from this graph. We see that Hispanic, Black, and Native Americans were disproportionately victims of fatal shootings despite carrying a nonlethal weapon, compared to white individuals. Compared to other races, Black people have a higher proportion of individuals under 18 (children) who were victims of fatal shootings. The proportion of individuals who showed signs of mental illness was highest among white and Black individuals, with Asian individuals having the lowest proportion. We also see that Hispanic, Native, and Black individuals had a higher proportion of people who were both shot and tasered (a possible sign of excessive force) compared to white individuals. With the exception of Asian individuals, all races have a high proportion of individuals who were fatally shot while police were not wearing a body camera (>40 per 1%), with Black people having the highest proportion. Black, Hispanic, and Native individuals also have a higher proportion of victims who were shot while not attacking, compared to white individuals. Overall, these attributes are indicative of the prevalence of colorism and systemic racism in policing.
Next, let's add in time as a factor. We can analyze shootings over time for each race. For that, we will have to extract data from the raw data table in a different way, taking the year into account.
We will make a dictionary of race: dictionary pairs. The inner dictionary will hold year: number killed pairs for each race. We can then take the number killed and divide by the demographics to get the proportion.
#create the dictionary
date_dict = {}
#loop through the raw data and fill in the dictionary as described above
for index, row in raw_data.iterrows():
    race = raw_data['race'][index]
    #extract the year from the date string (format YYYY-MM-DD)
    date = raw_data['date'][index].split('-')[0]
    if race in date_dict.keys():
        datedict = date_dict[race]
        if date in datedict.keys():
            datedict[date] = datedict[date] + 1
        else:
            datedict[date] = 1
    else:
        date_dict[race] = {}
        date_dict[race][date] = 1
date_list = []
for race in date_dict.keys():
    #exclude Unknown and Other as they do not have demographics
    if race != 'Unknown' and race != 'Other':
        dates = date_dict[race]
        for date in dates.keys():
            date_list.append(tuple((race, date, dates[date], dates[date]/demogs[race])))
date_df = pd.DataFrame(date_list, columns = ['Race', 'Date', 'Num Killed', 'Proportion'])
date_df['Date'] = date_df['Date'].apply(int)
date_df.head()
  | Race | Date | Num Killed | Proportion
---|---|---|---|---
0 | Asian | 2015 | 15 | 2.542373 |
1 | Asian | 2016 | 15 | 2.542373 |
2 | Asian | 2017 | 16 | 2.711864 |
3 | Asian | 2018 | 22 | 3.728814 |
4 | Asian | 2019 | 20 | 3.389831 |
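(As before, a vectorized groupby version is possible; a brief sketch, with years and alt as illustrative names:)
#vectorized year/race tally: extract the year, group, count, normalize
years = raw_data['date'].str.split('-').str[0].astype(int)
alt = raw_data.assign(Date = years).groupby(['race', 'Date']).size().reset_index(name = 'Num Killed')
alt = alt[alt['race'].isin(demogs)].rename(columns = {'race': 'Race'})
alt['Proportion'] = alt['Num Killed'] / alt['Race'].map(demogs)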
Now that we have this table, we can group by race and plot shootings over time. We will exclude the year 2021 from the plot, as its count would show an artificial dip: the year has not ended yet, so not all the data has been collected.
fig, ax = plt.subplots()
groups = date_df.groupby('Race')
for race in groups.groups.keys():
    df = groups.get_group(race)
    df.plot(x = 'Date', y = 'Proportion', ax = ax, xlim = ([2015, 2020]), legend = True, label = race,
            title = "Shootings over time for each race (excluding 2021)", xlabel = "Year",
            ylabel = "Proportion of individuals per 1%").legend(loc = 'best')
From this graph, we can see that the proportion of individuals killed remained relatively constant for each race except Native Americans, who had a significant peak in 2017. Overall, I personally found the statistics for Native Americans to be higher than I expected. In terms of explaining the peak, perhaps there was some sort of event, law, etc. that led to this trend.
Finally, let's graph the total number of shootings over time to see overall how that number has changed.
total_dict = {}
#count total shootings per year, excluding the incomplete year 2021
for index, row in raw_data.iterrows():
    date = raw_data['date'][index].split('-')[0]
    if date != '2021':
        if date in total_dict.keys():
            total_dict[date] = total_dict[date] + 1
        else:
            total_dict[date] = 1
df = pd.DataFrame.from_dict(total_dict, columns = ['total killed'], orient = 'index')
df.plot(title = "Total Fatal Shootings over Time", xlabel = "year", ylabel = "number shootings")
<AxesSubplot:title={'center':'Total Fatal Shootings over Time'}, xlabel='year', ylabel='number shootings'>
We can see from the graph above that total fatal shootings each year since 2015 have hovered around 1000 per year, with 2016 having the lowest total of about 960. 2020 had the highest total, with about 1020 shootings; it is surprising to see that this number increased despite Covid-19 and lockdown measures.
For the last part of our analysis, let's look at location and fatal shootings. We are curious to see if there is a certain area/state where these fatal shootings are concentrated, and later we can take that a step further to see how these locations differ for each race.
To visualize this, we can make a choropleth map. A choropleth map is a type of thematic map in which a set of pre-defined areas is colored or patterned based on a statistical variable representing a characteristic of each area, in this case the total number of shootings in a state. To make maps, we will use the Folium library, which makes it fairly simple to visualize the data the way we want. However, before we do that we need to extract the data in preparation for the map. Earlier, our tidied table's rows were based on race. This time, we need a row for each state, with a column for the total number of fatal shootings in that state. We will also make columns for each race so that we can make choropleth maps based on state and race as well.
#dictionary to collect the values.
state_dict = {}
#each state will have a dictionary mapping to race/total: count pairs
#first collect the total and counts for each race from the raw data, excluding unknown/other since again we are using proportions
for index, row in raw_data.iterrows():
    state = raw_data['state'][index]
    race = raw_data['race'][index]
    if state in state_dict.keys():
        state_dict[state]["total"] = state_dict[state]["total"] + 1
        if race in demogs.keys():
            if race in state_dict[state].keys():
                state_dict[state][race] = state_dict[state][race] + 1
            else:
                state_dict[state][race] = 1
    else:
        state_dict[state] = {}
        state_dict[state]["total"] = 1
        if race in demogs.keys():
            state_dict[state][race] = 1
#to account for missing values, if the state does not have a count for a certain race, input 0
for state in state_dict.keys():
    if "White" not in state_dict[state].keys():
        state_dict[state]["White"] = 0
    if "Asian" not in state_dict[state].keys():
        state_dict[state]["Asian"] = 0
    if "Black" not in state_dict[state].keys():
        state_dict[state]["Black"] = 0
    if "Native" not in state_dict[state].keys():
        state_dict[state]["Native"] = 0
    if "Hispanic" not in state_dict[state].keys():
        state_dict[state]["Hispanic"] = 0
    #calculate the proportion for the races so that they are comparable
    for race in state_dict[state].keys():
        if race != "total":
            state_dict[state][race] = state_dict[state][race]/demogs[race]
state_data = pd.DataFrame.from_dict(state_dict, orient = 'index')
state_data["State"] = state_data.index
state_data.head()
  | total | Asian | Hispanic | White | Native | Black | State
---|---|---|---|---|---|---|---
WA | 175 | 1.525424 | 1.297297 | 1.331115 | 5.384615 | 1.865672 | WA |
OR | 100 | 0.000000 | 0.432432 | 1.214642 | 0.000000 | 0.522388 | OR |
KS | 56 | 0.000000 | 0.486486 | 0.632280 | 0.769231 | 0.522388 | KS |
CA | 927 | 5.932203 | 19.783784 | 4.259567 | 3.076923 | 11.417910 | CA |
CO | 219 | 0.847458 | 3.027027 | 1.830283 | 3.846154 | 1.417910 | CO |
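(Worth noting: pd.crosstab could build this state-by-race table in a few lines; a minimal sketch, with ct as an illustrative name:)
#crosstab version of the state/race counts
ct = pd.crosstab(raw_data['state'], raw_data['race'])
ct = ct[list(demogs.keys())].div(pd.Series(demogs))   #keep races with demographics, normalize
ct['total'] = raw_data['state'].value_counts()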
Now that our table is ready, we can use Folium to make the choropleth map. Let's start by visualizing the total number of fatal shootings in each state.
#get the data for the states geolocation
url = (
"https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
#make a map of the US
m = folium.Map(location=[48, -102], zoom_start=4)
#make the chloropleth, binding the state geolocation data to the state column of our table
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_data,
columns=["State", "total"],
key_on="feature.id",
fill_color="YlOrRd",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Total Shootings",
).add_to(m)
folium.LayerControl().add_to(m)
m
We can see from this map that overall, more fatal shootings have occurred in California than in any other state, followed by Texas and Florida. Let's see how this map changes as we break it down by race.
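Since we are about to draw the same map once per race, the folium calls above could be wrapped in a small helper function; a sketch using the same parameters (race_choropleth is an illustrative name):
#hypothetical helper to avoid repeating the choropleth boilerplate for each race column
def race_choropleth(column, legend):
    m = folium.Map(location=[48, -102], zoom_start=4)
    folium.Choropleth(
        geo_data=state_geo,
        name="choropleth",
        data=state_data,
        columns=["State", column],
        key_on="feature.id",
        fill_color="YlOrRd",
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=legend,
    ).add_to(m)
    folium.LayerControl().add_to(m)
    return m
Below, each map's code is kept written out in full for clarity.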
m = folium.Map(location=[48, -102], zoom_start=4)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_data,
columns=["State", "Black"],
key_on="feature.id",
fill_color="YlOrRd",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Prop Black Shootings",
).add_to(m)
folium.LayerControl().add_to(m)
m
Looking at the proportion of Black fatal shootings, the top three states are still California, Texas, and Florida.
m = folium.Map(location=[48, -102], zoom_start=4)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_data,
columns=["State", "White"],
key_on="feature.id",
fill_color="YlOrRd",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Prop White Shootings",
).add_to(m)
folium.LayerControl().add_to(m)
m
The same top three states remain for analyzing the proportion of white fatal shootings.
m = folium.Map(location=[48, -102], zoom_start=4)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_data,
columns=["State", "Asian"],
key_on="feature.id",
fill_color="YlOrRd",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Prop Asian Shootings",
).add_to(m)
folium.LayerControl().add_to(m)
m
The distribution looks a little different for Asian shootings. The top three states appear to be California, Texas, and Washington; however, the shootings are heavily concentrated in California. This may be due to the large Asian population in the state.
m = folium.Map(location=[48, -102], zoom_start=4)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_data,
columns=["State", "Hispanic"],
key_on="feature.id",
fill_color="YlOrRd",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Prop Hispanic Shootings",
).add_to(m)
folium.LayerControl().add_to(m)
m
For Hispanic shootings, we see that the top three states are California, Texas, and New Mexico. However, it is also heavily concentrated in California.
m = folium.Map(location=[48, -102], zoom_start=4)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_data,
columns=["State", "Native"],
key_on="feature.id",
fill_color="YlOrRd",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Prop Native Shootings",
).add_to(m)
folium.LayerControl().add_to(m)
m
The map of Native American fatal shootings looks the most different. The top three states appear to be Oklahoma, Arizona, and Washington. Going back to our earlier observation about the trend in Native American shootings over time, events or policies that occurred in these states during 2017 may have contributed to that peak.
For more information about Folium, their documentation is a great place to start! https://python-visualization.github.io/folium/quickstart.html#Choropleth-maps
To sum up our visual analysis, we created multiple bar plots to view different attributes of fatal shootings and compare the proportions of fatal shootings across races. We observed a clear bias towards people of color being victims of fatal shootings. We analyzed how trends changed over time and saw that Native Americans had the most varied pattern. Finally, we made choropleth maps to get an idea of where most of these shootings occur in the United States and saw that most seem to occur in California.
Now that we have done some visual analysis, let's get into some hypothesis testing and machine learning.
Hypothesis testing is a way to "test" our data to see if we have meaningful, i.e. statistically significant, results. Our main focus is the correlation between race and fatal shootings, and we saw from our visual analysis that the proportion of people killed differs by race, but is that difference significant? To determine this, we have to do a hypothesis test.
However, given that one of our variables, race, is categorical, the type of test we can use is limited. We will use the ANOVA test, i.e. Analysis of Variance. This compares the means of different groups to determine whether at least one group mean differs significantly from the others. The null (baseline) hypothesis assumes that all means are equal; if race had no effect on the proportion killed, we would expect this hypothesis to hold. However, if race does contribute, then we expect to reject the null hypothesis, and our correlation is significant. You can read more about ANOVA testing here: https://www.reneshbedre.com/blog/anova.html
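Concretely, with $k$ groups and $N$ total observations, the one-way ANOVA F statistic is the ratio of between-group to within-group variability:

$$F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}$$

The larger F is, the more the group means differ relative to the variation within each group; the p-value is the probability of seeing an F at least this large if the null hypothesis were true.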
To begin the ANOVA test, we have to prep the data. This is simple, as the data table we used to graph proportions over time has everything we need. We can drop the Num Killed column, since we are looking at proportions.
date_df = date_df.drop(columns = "Num Killed")
date_df.head()
  | Race | Date | Proportion
---|---|---|---
0 | Asian | 2015 | 2.542373 |
1 | Asian | 2016 | 2.542373 |
2 | Asian | 2017 | 2.711864 |
3 | Asian | 2018 | 3.728814 |
4 | Asian | 2019 | 3.389831 |
For the ANOVA test, we will take the variance/mean of each race's proportions, so we want each race to be its own column. For that, we can pivot the data table like so:
anova_df = date_df.reset_index().groupby(['Date', 'Race'])['Proportion'].aggregate('first').unstack()
anova_df.head()
Date | Asian | Black | Hispanic | Native | White
---|---|---|---|---|---
2015 | 2.542373 | 19.253731 | 9.297297 | 6.923077 | 8.369384 |
2016 | 2.542373 | 17.611940 | 8.702703 | 12.307692 | 7.770383 |
2017 | 2.711864 | 16.716418 | 9.783784 | 16.923077 | 7.653910 |
2018 | 3.728814 | 17.313433 | 9.027027 | 12.307692 | 7.670549 |
2019 | 3.389831 | 18.805970 | 9.081081 | 10.000000 | 7.054908 |
Now we can use scipy's f_oneway function and statsmodels' anova_lm to calculate the F and p values for the data. If the p value is < .05, that indicates we would not expect this result if the null hypothesis were true, so we can reject the null hypothesis and say the correlation is statistically significant.
# scipy's f_oneway function takes the groups as input and returns the ANOVA F and p value
fvalue, pvalue = stats.f_oneway(anova_df['Asian'], anova_df['Black'], anova_df['Hispanic'], anova_df['Native'], anova_df['White'])
print(fvalue, pvalue)
# Ordinary Least Squares (OLS) model
model = ols('Proportion ~ C(Race)', data=date_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table
13.560838253894635 2.002043061277953e-06
  | sum_sq | df | F | PR(>F)
---|---|---|---|---
C(Race) | 677.489931 | 4.0 | 13.560838 | 0.000002 |
Residual | 374.694719 | 30.0 | NaN | NaN |
As we can see from above, the p-value is .000002, which is less than .05. This indicates that there is a significant difference among the means, and that there is an association between race and the proportion of individuals killed in fatal shootings!
Finally, let's use our data to practice some machine learning. Given all the categorical values in our raw data, what if we could build a model that predicts the race of an individual from these categories? We can attempt this with a Decision Tree. A decision tree recursively splits the data into nodes: each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf yields a prediction/classification.
#get the relevant categories from our raw data
ml_data = pd.DataFrame(raw_data, columns = ['age', 'gender', 'race', 'state','signs_of_mental_illness', 'threat_level', 'flee', 'body_camera'])
#for ages that are unknown, we can drop those rows
ml_data.drop(ml_data.loc[ml_data['age']=='Unknown'].index, inplace=True)
ml_data.head()
  | age | gender | race | state | signs_of_mental_illness | threat_level | flee | body_camera
---|---|---|---|---|---|---|---|---
0 | 53.0 | M | Asian | WA | True | attack | Not fleeing | False |
1 | 47.0 | M | White | OR | False | attack | Not fleeing | False |
2 | 23.0 | M | Hispanic | KS | False | other | Not fleeing | False |
3 | 32.0 | M | White | CA | True | attack | Not fleeing | False |
4 | 39.0 | M | Hispanic | CO | False | attack | Not fleeing | False |
Since these values are categorical, we need to represent them numerically so that they can be fed to the algorithm. We can do this with sklearn's LabelEncoder:
#encode the categorical features with LabelEncoder
categorical = ['age', 'gender', 'race', 'state', 'signs_of_mental_illness', 'threat_level', 'flee', 'body_camera']
le = LabelEncoder()
for column in categorical:
    #fit a fresh integer encoding for each column's values
    ml_data[column] = le.fit_transform(ml_data[column])
ml_data.head()
  | age | gender | race | state | signs_of_mental_illness | threat_level | flee | body_camera
---|---|---|---|---|---|---|---|---
0 | 42 | 1 | 0 | 47 | 1 | 0 | 2 | 0 |
1 | 36 | 1 | 6 | 37 | 0 | 0 | 2 | 0 |
2 | 12 | 1 | 2 | 16 | 0 | 1 | 2 | 0 |
3 | 21 | 1 | 6 | 4 | 1 | 0 | 2 | 0 |
4 | 28 | 1 | 2 | 5 | 0 | 0 | 2 | 0 |
Now we can split the data into training and test sets. The training data is used to fit the model, and the test data is used to evaluate how well the model generalizes to data it has not seen.
y = ml_data['race']
X = ml_data.drop('race',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Now we can make the model with the training data:
# instantiate the model
tree = DecisionTreeClassifier()
# fit the model
tree.fit(X_train, y_train)
#predicting the target value from the model for the samples
y_test_tree = tree.predict(X_test)
y_train_tree = tree.predict(X_train)
#we can compute the accuracy of our model
acc_train_tree = tree.score(X_train, y_train)
acc_test_tree = tree.score(X_test, y_test)
#we can also compute the Root Mean Squared Error of the model (the error)
rmse_train_tree = np.sqrt(mean_squared_error(y_train, y_train_tree))
rmse_test_tree = np.sqrt(mean_squared_error(y_test, y_test_tree))
print("Decision Tree: Accuracy on training Data: {:.3f}".format(acc_train_tree))
print("Decision Tree: Accuracy on test Data: {:.3f}".format(acc_test_tree))
print('\nDecision Tree: The RMSE of the training set is:', rmse_train_tree)
print('Decision Tree: The RMSE of the testing set is:', rmse_test_tree)
Decision Tree: Accuracy on training Data: 0.906
Decision Tree: Accuracy on test Data: 0.410

Decision Tree: The RMSE of the training set is: 1.1529434172976283
Decision Tree: The RMSE of the testing set is: 2.956374617692101
As we can see, the accuracy of our model on the training data is 90%, but only 41% on the test data, a gap that suggests the tree is overfitting the training set. (RMSE is also of limited use here, since the race labels are arbitrary integer codes, so distances between them are not meaningful.) In the future, we can tune this model to get a higher accuracy on the test data.
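One common first step, sketched below with the features kept as they are, is to limit tree depth (to curb overfitting) and compare settings with cross-validation; the max_depth values here are arbitrary choices for illustration:
#tuning sketch: shallower trees often generalize better when a deep tree overfits
from sklearn.model_selection import cross_val_score
for depth in [3, 5, 10, None]:
    candidate = DecisionTreeClassifier(max_depth = depth)
    scores = cross_val_score(candidate, X_train, y_train, cv = 5)
    print("max_depth =", depth, "mean CV accuracy =", round(scores.mean(), 3))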
This concludes our walkthrough of the data science pipeline! I hope you learned a lot. From this tutorial, we were able to gain a lot of insight into various factors surrounding the victims of fatal shootings since 2015. We visualized the many differences in values based on race, and were able to show that there is a statistically significant correlation between race and the proportion of individuals killed in fatal shootings. It is evident that a lot of reform is needed to reshape our policing and justice system and combat systemic racism. For future steps, one could continue to refine the decision tree, or create other machine learning models that could be more helpful for investigating this data.
Thank you!