CMSC320 Data Science Tutorial - Analyzing US Fatal Shootings

Katherine Kim, UID: 115928268, May 17, 2021

Welcome to my CMSC320 Data Science tutorial! In this tutorial, we will use the data science pipeline to analyze a dataset (retrieving data, cleaning data, visual exploration/analysis, and machine learning).

The data we will use covers US police shootings. Given the lack of gun control and the persistence of systemic racism in the United States, incidents of shootings (mass or otherwise) and police brutality have become increasingly visible. I want to analyze the data behind some of these shootings, investigate the various factors that contribute to these deaths, and determine how strong the correlation is between race and fatal shootings. Then, we will attempt to develop a model to predict race from the given data.


Below you will find all the packages this tutorial will use. We will use pandas to create dataframes/tables, matplotlib and folium to visualize the data with numpy to assist, and sklearn/statsmodels for machine learning and significance testing.
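A minimal sketch of the core imports is below; folium, sklearn, and statsmodels are imported later, in the sections where they are actually used.

```python
# Core imports for the tutorial. The Agg backend line is only needed when
# running headless (outside a notebook).
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
```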

Part 1 - Retrieving the Data

The data we will be looking at is from the Washington Post "Fatal Force" github: https://github.com/washingtonpost/data-police-shootings, which can also be found on Kaggle. This database contains records of every fatal shooting in the United States by a police officer in the line of duty since Jan. 1, 2015. As stated in their Github: "The Post is documenting only those shootings in which a police officer, in the line of duty, shoots and kills a civilian — the circumstances that most closely parallel the 2014 killing of Michael Brown in Ferguson, Mo., which began the protest movement culminating in Black Lives Matter and an increased focus on police accountability nationwide. The Post is not tracking deaths of people in police custody, fatal shootings by off-duty officers or non-shooting deaths."

After downloading the CSV from their GitHub, we can retrieve it as follows:
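Loading the file is a one-liner with pandas. In the tutorial this is simply `pd.read_csv("fatal-police-shootings-data.csv")` (the filename is an assumption; use whatever name you saved the download under). To keep this sketch self-contained, a tiny inline sample with a few of the real columns stands in for the file:

```python
import io
import pandas as pd

# Stand-in for: data = pd.read_csv("fatal-police-shootings-data.csv")
# The two rows below are made-up examples, not records from the dataset.
csv_text = (
    "name,date,age,race,state,armed\n"
    "John Doe,2015-01-02,34,W,CA,gun\n"
    "Jane Doe,2015-01-03,25,B,TX,toy weapon\n"
)
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
```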

Above you can see the raw CSV data from the GitHub. We can see that the columns include the victim's name, date of death, age, race, state, etc. The next thing we will want to do is tidy/clean the data so it is useful for analysis. Pandas has a lot of functionality that allows us to manipulate and edit dataframes.

Part 2 - Tidying Data

The main thing we want to investigate is the relationship between race and fatal shootings. The current race column has letters to symbolize the different groups, and it is noted in the GitHub that missing ('NA') values are for when the person's race was unknown. Let's replace these letters with the actual race labels, and replace all the missing values with the label "unknown."
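The replacement can be done with a small mapping dictionary (the code-to-label mapping follows the README on the Fatal Force GitHub; the sample rows are placeholders):

```python
import pandas as pd

# Map the one-letter race codes to full labels, and fill missing
# values with "Unknown".
race_labels = {
    "W": "White", "B": "Black", "A": "Asian",
    "H": "Hispanic", "N": "Native American", "O": "Other",
}
data = pd.DataFrame({"race": ["W", "B", None, "H"]})  # stand-in for the real table
data["race"] = data["race"].map(race_labels).fillna("Unknown")
print(data["race"].tolist())
```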

Great. Now it is easier to read the races, but we still need to make it easier to analyze each race. Rather than having each row be by name, we can make a new data frame where each row will be a race category. That way we can compare values/characteristics between each race as a group.
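One way to build that per-race frame is with `value_counts` (column names here are assumptions; sample rows stand in for the real table):

```python
import pandas as pd

# Count fatal shootings per race and put the result in a new frame
# with one row per race.
data = pd.DataFrame({"race": ["White", "White", "Black", "Hispanic"]})
by_race = (
    data["race"].value_counts()
    .rename_axis("Race")
    .reset_index(name="Num Killed")
)
print(by_race)
```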

^^ Like this. Now from here, we can add most of the other categories. For example, the raw data's age column can be sorted by race and age group: we can add columns dividing the ages into below 18 (children), 18 to 30 (young adults), 30 to 60 (older adults), and over 60 (seniors). The raw data's threat level column can be converted into a column counting the number of individuals in each race who attacked or did not attack during their encounter. We can also count the number of individuals who showed signs of mental illness, and the number of encounters where police wore a body camera.
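The age bucketing described above is a natural fit for `pd.cut`; a sketch with placeholder rows (bin edges match the groups above):

```python
import pandas as pd

# Bucket ages into the four groups, then count victims per race
# and age group with a crosstab.
data = pd.DataFrame({
    "race": ["White", "Black", "White"],
    "age": [15, 25, 67],
})
data["age_group"] = pd.cut(
    data["age"],
    bins=[0, 18, 30, 60, 120],
    labels=["Child", "Young Adult", "Older Adult", "Senior"],
)
counts = pd.crosstab(data["race"], data["age_group"])
print(counts)
```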

But before we do that, what would be the flaw of just counting and comparing the quantities? Let's quickly graph what we have to see:
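A quick bar chart of the raw counts might look like this (the numbers below are placeholders, not the dataset's actual counts):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe; unnecessary in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Bar chart of raw counts per race (placeholder values).
by_race = pd.DataFrame({
    "Race": ["White", "Black", "Hispanic", "Asian", "Native American"],
    "Num Killed": [2500, 1300, 900, 100, 80],
})
ax = by_race.plot.bar(x="Race", y="Num Killed", legend=False)
ax.set_ylabel("Number killed")
plt.tight_layout()  # plt.show() in a notebook
```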

So at first glance, we see that white people have had the most fatal encounters. So does that mean we can conclude that white people suffer the most fatal shootings and the idea of police brutality and systemic racism is a farce? What is misleading about this graph?

The main misleading component is that these quantities are not out of the same total. There are far more white people in this country than any other race, so naturally the quantity will be larger. Therefore, it is misleading to compare the data this way.

Instead, we should look at the number killed as a proportion. Since these quantities do not share the same total, let's divide each value by the appropriate total to get a proportion. What is that total? For that, we can look at demographics, using the 2019 data from the Census: https://www.census.gov/quickfacts/fact/table/US/PST045219 which gives the following:

Percent White: 60.1%
Percent Black: 13.4%
Percent Asian: 5.9%
Percent Hispanic: 18.5%
Percent Native: 1.3%

Using these numbers, we can take each quantity and divide it by the demographic percentage to get a proportional count of fatal shootings for each race. The result answers: "per 1% of the population of a given race, how many people suffer a fatal shooting?" This makes comparisons fairer and more accurate, as it "equalizes" the impact/significance of each value. We should apply this to all the quantity columns so that we can fairly compare the races. So rather than "Number killed" or "Number children," each column will be "Proportion killed/children."
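The proportion computation is just an elementwise division by the Census percentages above (the counts below are placeholders chosen for readability):

```python
import pandas as pd

# 2019 Census demographic percentages (from the quickfacts link above).
demographics = {
    "White": 60.1, "Black": 13.4, "Asian": 5.9,
    "Hispanic": 18.5, "Native American": 1.3,
}
# Placeholder counts; divide each by the race's share of the population
# to get "killed per 1% of that race's population".
by_race = pd.DataFrame({
    "Race": ["White", "Black"],
    "Num Killed": [601, 268],
})
by_race["Proportion Killed"] = by_race.apply(
    lambda r: r["Num Killed"] / demographics[r["Race"]], axis=1
)
print(by_race["Proportion Killed"].tolist())
```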

With this set, let's complete the table.

Looking good so far. We still need to convert the first column we had, and it would also be a good idea to add the demographics to the table in case we need them later. Before we do that, we should drop the "Other" and "Unknown" race categories. We do not have accurate demographic percentages for these, so from here on we will not be analyzing these categories.

Great! Now let's tackle the armed category:

The armed category indicates the 'weapon' the victim was carrying during the fatal encounter. Let's look at the list:

Clearly, not all of these are worthy of being part of the 'armed' category. Some individuals were unarmed or carried harmless objects like a toy or an air conditioner. Later, we will probably want to look at how many of these victims died despite not carrying anything threatening, so we want to make columns counting the number of lethal or nonlethal weapons carried per race.

The code will be similar to what we did to make the other columns, except we need an extra step. We have to divide the list of values in the 'armed' category into Lethal and Nonlethal. This may be fairly subjective, but the lists can be seen below. For this tutorial, weapons were included in the Lethal category if they were anything related to a gun, sharp object, or explosive.
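The classification step could be sketched like this; the short lists here are illustrative only, while the real lists cover every value appearing in the 'armed' column:

```python
import pandas as pd

# A (subjective) split of 'armed' values into lethal vs. nonlethal.
lethal = {"gun", "knife", "machete", "crossbow"}
nonlethal = {"unarmed", "toy weapon", "air conditioner", "flashlight"}

data = pd.DataFrame({"armed": ["gun", "toy weapon", "knife", "unarmed"]})
data["lethality"] = data["armed"].apply(
    lambda a: "Lethal" if a in lethal
    else ("Nonlethal" if a in nonlethal else "Unknown")
)
print(data["lethality"].tolist())
```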

Great. We have tidied most of the data! We can start the visual analysis now, and if we need to manipulate the data any further we can do so as we encounter the need.


Part 3 - Visual Analysis

Let's start with the simplest question - what race has suffered the most fatal shootings?

As we can see above, this graph tells quite a different story than the previous graph with just the quantities. Black people are killed more per 1% of their population than any other race, at more than twice the rate for white people. This shows that even though more white people may suffer fatal shootings in raw numbers, Black people are disproportionately targeted. Hispanic and Native Americans also appear to be more disproportionately targeted than white people.

Let's visualize some of the other columns. We mainly want to investigate "unfair/unjust" shootings, i.e., incidents that may not have necessarily had to end with the victim fatally shot. Obviously, there are details of each event that remain unknown and may tell a different story. But, for example, we can visualize the number of people who carried a nonlethal weapon but still got shot, or the number of victims who were children (< 18) per race. We can then see if there is a disparity across races (e.g., is one race more disproportionately targeted for a category?).

There is a lot of information we can gather from this graph. We see that Hispanic, Black, and Native Americans were disproportionately victims of fatal shootings despite carrying a nonlethal weapon, compared to white individuals. Compared to other races, Black people have a higher proportion of individuals under 18 (children) who were victims of fatal shootings. The proportion of individuals who showed signs of mental illness was highest among white and Black individuals, with Asian individuals having the lowest proportion. We also see that Hispanic, Native, and Black individuals had a higher proportion of individuals who were both shot and tasered (a possible sign of excessive force) compared to white individuals. With the exception of Asian individuals, all races have a high proportion of individuals fatally shot while police were not wearing a body camera (>40 per 1%), with Black people having the highest proportion. Black, Hispanic, and Native individuals also have a higher proportion of individuals who were shot but were not attacking, compared to white individuals. Overall, these attributes are indicative of the prevalence of colorism and systemic racism in policing.

Next, let's add in time as a factor. We can analyze shootings over time for each race. For that, we are going to have to extract data from the raw data table in a different way, taking the year into account.

We will make a dictionary of race: dictionary pairs. The inner dictionary will be date: number killed pairs for each race. We can then take the number killed and divide by the demographics to get the proportion.
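The nested-dictionary construction described above might look like this (sample rows and a trimmed demographics dict stand in for the real data):

```python
import pandas as pd

# Count fatal shootings per race per year, then divide by the race's
# demographic percentage to get a proportion.
demographics = {"White": 60.1, "Black": 13.4}
data = pd.DataFrame({
    "race": ["White", "White", "Black"],
    "date": pd.to_datetime(["2015-03-01", "2016-07-14", "2015-05-20"]),
})

by_year = {}
for race, group in data.groupby("race"):
    counts = group["date"].dt.year.value_counts().to_dict()
    by_year[race] = {yr: n / demographics[race] for yr, n in counts.items()}
print(by_year)
```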

Now that we have this table, we can group by race and make a plot showing shootings over time. We will exclude the year 2021, as we expect those values to show a significant dip since the year has not ended yet (so not all the data has been collected).
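Filtering out 2021 and pivoting to one line per race could be sketched as follows (the proportion values are placeholders):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe; unnecessary in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Long-format (Race, Year, Proportion) table with placeholder values.
df = pd.DataFrame({
    "Race": ["White", "White", "Black", "Black", "White", "Black"],
    "Year": [2015, 2016, 2015, 2016, 2021, 2021],
    "Proportion": [8.2, 8.0, 17.5, 17.9, 3.1, 6.0],
})
df = df[df["Year"] < 2021]  # drop the incomplete 2021 data
pivoted = df.pivot(index="Year", columns="Race", values="Proportion")
ax = pivoted.plot()  # one line per race
ax.set_ylabel("Killed per 1% of population")
```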

From this graph, we can see the proportion of individuals killed remained relatively constant for each race except Native Americans, who had a significant peak in 2017. Overall, I personally found the statistics for Native Americans to be higher than I expected. As for explaining the peak, perhaps there was some event, law, etc. that led to this trend.

Finally, let's graph the total number of shootings over time to see overall how that number has changed.

We can see from the graph above that total fatal shootings each year since 2015 have hovered around 1000 per year, with 2016 having the lowest total of about 960. 2020 had the highest total at 1020 shootings in a year - it is surprising to see that this number increased despite Covid-19 and lockdown measures.

For the last part of our analysis, let's look at location and fatal shootings. We are curious to see whether there is a certain area/state where these fatal shootings are concentrated - and later we can take that a step further to see how these locations differ for each race.

To visualize this, we can make a choropleth map. A choropleth map is a type of thematic map in which a set of pre-defined areas is colored or patterned based on a statistical variable representing a characteristic of that area - in this case, the total number of shootings in a state. To make maps, we will be using the Folium library, which makes it fairly simple to visualize the data the way we want. However, before we do that, we need to extract the data in preparation for the map. Earlier, our tidied table's rows were based on race. This time, we need a row for each state, with the column values corresponding to the total number of fatal shootings in that state. We will also make columns for each race so that we can make choropleth maps based on state and race as well.
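The per-state table could be built with a crosstab (column names are assumptions; sample rows stand in for the real data):

```python
import pandas as pd

# One row per state: a count per race plus a Total column.
data = pd.DataFrame({
    "state": ["CA", "CA", "TX"],
    "race": ["Hispanic", "White", "White"],
})
by_state = (
    pd.crosstab(data["state"], data["race"])
    .assign(Total=lambda t: t.sum(axis=1))
    .reset_index()
)
print(by_state)
```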

Now that our table is ready, we can use Folium to make the choropleth map. Let's start by visualizing the total number of fatal shootings in each state.

We can see from this map that overall, more fatal shootings have occurred in California than in any other state, followed by Texas and Florida. Let's see how this map changes as we break it down by race.

Looking at the proportion of Black fatal shootings, the top three states are still California, Texas, and Florida.

The same top three states remain for analyzing the proportion of white fatal shootings.

The distribution looks a little different with Asian shootings. The top three states appear to be California, Texas, and Washington - however it is heavily concentrated in California. This may be due to the large Asian population in the state.

For Hispanic shootings, we see that the top three states are California, Texas, and New Mexico. However, it is also heavily concentrated in California.

The map of Native American fatal shootings looks the most different. The top three states appear to be Oklahoma, Arizona, and Washington. Going back to the observation we made earlier about the trend of Native American shootings over time, events or policies that occurred in these states during 2017 may have contributed to that peak.

For more information about Folium, their documentation is a great place to start! https://python-visualization.github.io/folium/quickstart.html#Choropleth-maps

To sum up our visual analysis, we created multiple bar plots to view different attributes of fatal shootings and compare the proportions across races. We observed a clear bias toward people of color being victims of fatal shootings. We analyzed how trends changed over time and saw that Native Americans had the most varied pattern. Finally, we made choropleth maps to get an idea of where most of these shootings occur in the United States and saw that most seem to occur in California.


Part 4: Hypothesis Testing and Machine Learning

Now that we have done some visual analysis, let's get into some hypothesis testing and machine learning.

Hypothesis testing is a way to "test" our data to see whether we have meaningful, or statistically significant, results. Our main focus is the correlation between race and fatal shootings, and we saw from our visual analysis that the proportion of people killed differs by race, but is that difference significant? To determine this, we have to do a hypothesis test.

However, given that one of our variables, race, is categorical, that limits the type of test we can use. We will be using the ANOVA test, i.e., Analysis of Variance. This compares the means among different groups to determine whether one or more groups differ in a statistically significant way. The null (baseline) hypothesis assumes that all means are equal; if race had no effect on proportion killed, we would expect this hypothesis to hold. However, if race does contribute, then we expect to reject the null hypothesis, meaning our correlation is significant. You can read more about ANOVA testing here: https://www.reneshbedre.com/blog/anova.html

To begin the ANOVA test, we have to prep the data. This is simple, as the data table we used to graph proportions over time has everything we need. We can drop the Num Killed column, as we are looking at proportions.

For the ANOVA test, we are going to take the variances/means of each race, so we want each race to be a column. For that, we can pivot the data table like so:

Now we can use the anova function from statsmodels to calculate the F and p values of the data. If the p value is < .05, that indicates we would not expect this result if the null hypothesis were true, so we can reject the null hypothesis and say the correlation is statistically significant.
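The tutorial uses statsmodels for this step; as a compact, self-contained sketch, scipy's `f_oneway` computes the same one-way F statistic directly from the pivoted per-race columns (the proportion values below are placeholders):

```python
from scipy import stats
import pandas as pd

# One-way ANOVA across per-race proportion columns: each column is one
# race's yearly proportions (placeholder values).
pivoted = pd.DataFrame({
    "White": [8.2, 8.0, 8.1, 8.3],
    "Black": [17.5, 17.9, 18.2, 17.1],
    "Asian": [2.9, 3.1, 3.0, 2.8],
})
f_val, p_val = stats.f_oneway(*[pivoted[c] for c in pivoted.columns])
print(f_val, p_val)
```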

As we can see from above, the p-value is .000002, which is less than .05. This indicates that there is a significant difference among means and that there is an association between Race and proportion killed in a fatal shooting!

Finally, let's use our data to practice some machine learning. Given all the categorical values in our raw data, what if there were a way to build a model that predicts the race of an individual based on these categories? We can attempt this using a decision tree. A decision tree uses an algorithm to take data and split it into nodes; the algorithm chooses a node/direction to go toward based on the model and ultimately ends up with a prediction/classification.

Since these values are categorical, we need to find a way to represent them numerically so that they can be input into the algorithm. We can do this with the sklearn LabelEncoder:
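Encoding each column might look like this (the column names and rows are illustrative stand-ins for the raw table):

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Fit one encoder per categorical column and replace the strings
# with integer codes.
data = pd.DataFrame({
    "race": ["White", "Black", "White"],
    "armed": ["gun", "unarmed", "knife"],
})
encoders = {}
for col in data.columns:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])
print(data)
```

Note that `LabelEncoder` is technically intended for target labels; sklearn's `OrdinalEncoder` is the feature-oriented equivalent and would work here too.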

Now we can split the data into training and test data. The training data is used to fit the model, and the test data is used to check that the model we created generalizes accurately.
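sklearn's `train_test_split` handles the split; a sketch with placeholder arrays:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Placeholder feature matrix X and label vector y.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))
```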

Now we can make the model with the training data:
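Fitting and scoring the tree could be sketched as below. The iris dataset stands in for our encoded shootings table so the example runs on its own; with the real data, `X` would be the encoded category columns and `y` the encoded race column.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Stand-in data; replace with the encoded shootings features/labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the decision tree on the training split, then compare accuracy
# on training vs. test data.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(round(clf.score(X_train, y_train), 2), round(clf.score(X_test, y_test), 2))
```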

As we can see, the accuracy of our model with the training data is 90%, but 41% with the test data. In the future, we can hone in and tweak this model to get a higher accuracy on test data.


Part 5: Communication/Wrap Up

This concludes our walkthrough of the data science pipeline! I hope you learned a lot. From this tutorial, we were able to gain insight into various factors surrounding the victims of fatal shootings over the past 5 years. We visualized the many differences in values based on race, and were able to show that there is a statistically significant correlation between race and the proportion of individuals killed in fatal shootings. It is evident that a lot of reform is needed to reshape our policing and justice system and combat systemic racism. For future steps, one could continue to hone the decision tree, or create other machine learning models that could better serve the purpose of investigating this data.

Thank you!