Predicting arrest in Berkeley PD stops
Using the tidymodels framework and data that I accessed from the City of Berkeley Open Data portal, I built models for predicting whether police stops conducted by the Berkeley Police Department resulted in any arrests.
Welcome! 👋 If you need to skip around to different parts of the site, you can use the side navigation bar and the header tabs!
Content by header tab
💡 Project summary: Context and a summary without code
🧩 Data and methods: Information about original files, final variables and their ties to original files (without showing code), and the machine learning exercise (without showing code)
📊 Analysis code: Exploratory data analysis with code; classification modeling code and results
🧹 Data prep code: Code for preparing data from original files - including geospatial data transformation and data-quality checks
Context
A citizen-scientist exercise like this turned out to be particularly timely in July 2025 (and admittedly by coincidence: I wasn’t anticipating any policy updates on the topics of these files when I picked them up for analysis).
On July 22, 2025, the Berkeley Police Department and the Police Accountability Board jointly recommended that the City Council formally close the Fair & Impartial Policing item as a concluded policy initiative, in light of the collaborations and reforms that resulted (e.g. reforms based on Police Accountability Board recommendations; see the official documents and news coverage headlined “Berkeley wraps up policy work tackling racial disparities in police stops”). The joint letter mentioned that police-stop data would continue to be published and that Berkeley PD’s annual reports would include data-driven analyses of traffic enforcement.
I learned about this development only after finishing my analysis; accessing recent documents related to what, at the time, had seemed to be just archived data made this citizen-scientist exercise more interesting after the fact.
When requesting a qualified positive recommendation from the Public Safety Policy Committee, the July 2025 joint letter (mentioned above) referred to “the current traffic safety approach, which has … improved equity in stop outcomes across demographic groups.” Relevant to this quote, my project examined outcomes of police stops - both on and off the roads - during the period of 2020 Q4 to mid-2023 and built models for predicting whether stops resulted in any arrests (a specific police-stop outcome) with demographic variables among the model features. Analyses like mine could help to confirm whether equity in stop outcomes indeed has improved across demographic groups.
And, last year, in its April 24, 2024 report, the Police Accountability Board remarked that merely publishing the data does not endow the public with an understanding of local policing: “As noted in the BPD quarterly reports, a Transparency Hub was developed that provides raw data through an Open Data Portal. While this allows members of the public who have the time and ability to analyze the data, the BPD’s Data Analyst should provide more detailed analysis of these data with a focus on racial disparities, as we have done in this report.” For this project, I found the time to analyze one of the data sets, and my analysis did involve variables based on civilian race.
When I picked up files for this exercise, I was focused on data offerings via the City of Berkeley Open Data portal. Berkeley PD’s transparency hub through ESRI/ArcGIS was not on my radar at the time - this site surfaced for me while I was poking about, looking for documents related to Berkeleyside’s reporting, after I finished my analysis. Raw files available through the BPD transparency hub (including data for 2024 police stops) could provide valuable substrate for a related future analysis (and police-stop data from both sources could be useful for other analyses - e.g. ones focusing less on arrest, on whether police officers drew firearms, or on traffic violations).
Data preparation
The main police-stops file contained data from October 1, 2020 to July 12, 2023, with one row per civilian per police stop. About 2% of stops were of bicyclists, ~38% were of pedestrians/off the roadways, and ~60% involved civilians operating a vehicle; to focus on the pedestrian/vehicular contrast, I dropped police stops of bicyclists. About 5.6% of the remaining rows shared a police stop with another row (a police stop can involve multiple civilians), which would have required assumptions about independence between observations; for a cleaner analysis, I aggregated the data to the level of the police stop.
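The filtering and aggregation steps can be sketched with dplyr. This is a minimal sketch on toy data - the column names (`stop_id`, `mode`, `arrested`) are hypothetical, not the raw file's actual column names:

```r
library(dplyr)

# Toy civilian-level rows (hypothetical columns; two civilians share stop "S2")
civilians <- tibble(
  stop_id  = c("S1", "S2", "S2", "S3"),
  mode     = c("vehicle", "pedestrian", "pedestrian", "bicycle"),
  arrested = c(FALSE, TRUE, FALSE, FALSE)
)

stops <- civilians |>
  filter(mode != "bicycle") |>           # drop police stops of bicyclists
  group_by(stop_id, mode) |>
  summarise(any_arrest  = any(arrested), # collapse to one row per police stop
            n_civilians = n(),
            .groups = "drop")

stops
# 2 rows: S1 (vehicle, no arrest, 1 civilian) and S2 (pedestrian, any arrest, 2 civilians)
```

Collapsing civilian-level indicators with `any()` mirrors the outcome definition used later: a stop counts as positive if *any* of its civilians were arrested.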
The police-stops file contained columns with coordinates and geocode-able location descriptions; these enabled me to perform spatial joins to enrich the data set with information about police beat polygons and city council district boundaries from other files available via the City of Berkeley Open Data portal.
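A spatial join of this kind can be sketched with the sf package. The geometries and names below (`beat`, `Beat A`, the coordinates) are made up for illustration - the real join would use the portal's police beat and council district polygon files:

```r
library(sf)

# Toy "beat" polygon: a unit square (hypothetical geometry)
beat <- st_sf(
  beat_name = "Beat A",
  geometry  = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
  ))))
)

# Two stop points: one inside the beat polygon, one outside it
stops_sf <- st_sf(
  stop_id  = c("S1", "S2"),
  geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(2, 2)))
)

# Spatial join: attach the containing polygon's attributes to each point
joined <- st_join(stops_sf, beat, join = st_within)
joined$beat_name  # "Beat A" for S1, NA for S2
```

With real data, the point layer would be built from the file's coordinate columns via `st_as_sf()`, and both layers would need matching coordinate reference systems before joining.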
Exploratory data analysis
I performed some EDA using mostly tidyverse tools; the following were among the plots I generated.
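One such plot - arrest proportion by pedestrian/vehicular mode - can be sketched with tidyverse tools. The data here are simulated stand-ins, not figures from the project:

```r
library(dplyr)
library(ggplot2)

# Hypothetical stop-level data (made-up counts, not the real file)
stops <- tibble(
  mode       = rep(c("pedestrian", "vehicle"), times = c(40, 60)),
  any_arrest = c(rep(c(TRUE, FALSE), times = c(12, 28)),
                 rep(c(TRUE, FALSE), times = c(6, 54)))
)

# Proportion of stops with any arrest, by mode
arrest_rates <- stops |>
  count(mode, any_arrest) |>
  group_by(mode) |>
  mutate(prop = n / sum(n)) |>
  ungroup() |>
  filter(any_arrest)

ggplot(arrest_rates, aes(mode, prop)) +
  geom_col() +
  labs(x = NULL, y = "Proportion of stops with any arrest")
```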
Classification models
I built a main-effects logistic regression model and a logistic regression model with interaction terms; the two performed very similarly when assessed on the training set using 10-fold cross-validation (mean accuracies of 0.915 and 0.916, and mean ROC AUCs of 0.959 and 0.960, respectively).
I valued the presence of the interaction terms, so - although the second model was less parsimonious and barely improved performance - I proceeded to evaluate the model with interaction terms on the testing set; it scored 0.916 for accuracy, 0.956 for ROC AUC, 0.747 for sensitivity, and 0.948 for specificity.
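The resampling-then-final-evaluation workflow can be sketched with tidymodels. Everything here is a stand-in - the simulated data and predictor names (`cuffs`, `vehicle`) are hypothetical, and the real models used many more features:

```r
library(tidymodels)

# Simulated stand-in data (hypothetical; the real features differ)
set.seed(2023)
dat <- tibble(
  any_arrest = factor(sample(c("yes", "no"), 600, replace = TRUE)),
  cuffs      = rnorm(600),
  vehicle    = rnorm(600)
)
split    <- initial_split(dat, strata = any_arrest)
stops_tr <- training(split)
folds    <- vfold_cv(stops_tr, v = 10, strata = any_arrest)

spec <- logistic_reg() |> set_engine("glm")

# Main-effects model vs. a model with an interaction term
wf_main <- workflow(any_arrest ~ cuffs + vehicle, spec)
wf_int  <- workflow(any_arrest ~ cuffs * vehicle, spec)

cv_main <- fit_resamples(wf_main, folds, metrics = metric_set(accuracy, roc_auc))
cv_int  <- fit_resamples(wf_int,  folds, metrics = metric_set(accuracy, roc_auc))
collect_metrics(cv_main)  # metrics averaged across the 10 folds
collect_metrics(cv_int)

# Final evaluation of the chosen model on the held-out testing set
final_res <- last_fit(wf_int, split, metrics = metric_set(accuracy, roc_auc))
collect_metrics(final_res)
```

`fit_resamples()` keeps the testing set untouched while comparing candidate models; `last_fit()` then refits the chosen workflow on the full training set and scores it once on the testing set.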
Selected findings from fitting the model with interaction terms to the training data
One advantage of sticking to logistic regression models was how easily the fitted coefficients/odds ratios lent themselves to interpretation. Here, I cover some of the odds ratios estimated from fitting the model to the training data.
Here, odds ratios are the ratios of odds (i.e. not of probabilities or of rates) of having any arrests in a stop between two groups that are determined by a model term - controlling for the rest of the model’s terms (ORs are model-specific). Take, for example, an odds ratio for whether any of a police stop’s civilians appeared fluent in English: the numerator refers to odds of arrests in stops where any civilians were perceived to be fluent in English, and the denominator is the odds of arrests in stops where none of the civilians were perceived to be fluent in English.
Significant ORs > 1 have greater numerators; adjusting for all other model terms, the odds of arrests were found to be greater when the model term applied.
Significant ORs < 1 have greater denominators; in this model, odds of having any arrests were less when the model term applied.
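As a worked illustration of this definition, an odds ratio can be computed from a 2x2 table. The counts below are made up, not figures from the data:

```r
# Hypothetical 2x2 counts: arrest vs. no arrest, by whether any civilian
# appeared fluent in English (numbers invented for illustration)
arrest_fluent    <- 40;  no_arrest_fluent    <- 160   # odds = 40/160 = 0.25
arrest_nonfluent <- 10;  no_arrest_nonfluent <- 190   # odds = 10/190

odds_fluent    <- arrest_fluent / no_arrest_fluent
odds_nonfluent <- arrest_nonfluent / no_arrest_nonfluent

or <- odds_fluent / odds_nonfluent
or  # 4.75: odds of any arrest are 4.75x when the term applies
```

A fitted logistic regression adjusts these odds for the other model terms (its ORs come from exponentiating the fitted coefficients), so model-based ORs generally differ from raw 2x2 ORs like this one.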
Demographics of stopped civilians and the odds of a police stop resulting in any arrests
Model term | Significant OR estimate |
---|---|
Whether a stop had only civilians who were perceived to have no disabilities (related to vision, hearing, intellect/development/dementia, mental health, or speech/language according to the raw file) | 3.95 |
Interaction term: Whether a stop was of civilians operating a vehicle ✖️ whether any of the civilians were perceived to be Black and had their race perceived prior to the stop | 3.07 |
Whether the stop had any Berkeley residents | N/A |
Whether the stop had any Oakland residents | N/A |
Whether the stop had only civilians of perceived age 17-24 | N/A |
Whether any civilians were recorded as a cis man | N/A |
Whether the stop had only civilians perceived to be white (excluding multiracial) | N/A |
Whether any civilians were perceived to be white (excluding multiracial) | N/A
Whether any civilians were perceived to be Black | N/A |
The variable for whether any of a stop’s civilians were perceived to have a disability (related to vision, hearing, intellect/development/dementia, mental health, or speech/language according to the raw file) was a significant predictor of arrest in this model: for stops that had only civilians who were perceived to have no disabilities, the odds ratio was 3.95 (compared to stops where any civilians were perceived to have a disability).
Related to the first plots on this page, a significant OR was estimated for an interaction term for whether a stop was of civilians operating a vehicle ✖️ whether any of the civilians were perceived to be Black and had their race perceived prior to the stop. When the model with interaction terms was fitted on the training data, this odds ratio was 3.07 (odds in numerator was for police stops where all conditions applied).
Separate variables for whether the stop had any Berkeley residents or any Oakland residents did not have significant odds ratios when this model was fit on the training data.
Odds of arrest were not significantly different in this model when partitioning by any of the following demographic variables: whether the stop had only civilians of perceived age 17-24, whether any civilians were recorded as a cis man, whether the stop had only civilians perceived to be white (excluding multiracial), whether any civilians were perceived to be white (excluding multiracial), and whether any civilians were perceived to be Black (though other model terms related to race did have significant ORs).
Actions taken by police officers and the odds of a police stop resulting in any arrests
Model term | Significant OR estimate |
---|---|
Whether handcuffs/flex cuffs were used on any civilians | 38.63 |
Whether any stopped civilians were searched (a search of their person) | 2.85 |
Whether any civilians had their property searched | 1.59 |
Whether any civilians were detained in a patrol car | N/A |
- The variable for whether handcuffs/flex cuffs were used on any civilians had an odds ratio of 38.63 (p ≈ 0) when this model was fit on the training data (holding everything else in this model constant, the odds of arrest for stops where handcuffs/flex cuffs were used on any civilians were 38.63 times the odds for other stops).
- The odds of a stop having any arrests were 2.85 times as great for stops where any stopped civilians were searched (a search of their person) and 1.59 times as great for stops where any civilians had their property searched.
- A significant OR was not found for the model term referring to whether any civilians were detained in a patrol car.
Evidence identified and the odds of a police stop resulting in any arrests
Model term | Significant OR estimate |
---|---|
Whether evidence linked to any civilians included identified “drugs/narcotics” | 2.20 |
Whether evidence linked to any civilians included identified suspected stolen property | 2.48 |
Interaction term: Whether alcohol was identified as evidence for any civilians ✖️ whether the stop was vehicular | 3.86 |
Whether any weapons were found | N/A |
Significant odds ratios of 2.20 and 2.48 were estimated for model terms representing whether evidence linked to any civilians included identified “drugs/narcotics” or suspected stolen property respectively.
One of the plots on this page concerned whether alcohol was identified as evidence for any civilians and whether the stop was pedestrian/vehicular. In this model, the interaction term for whether alcohol was identified as evidence for any civilians ✖️ whether the stop was vehicular had a significant OR of 3.86 (both conditions applied to stops that contributed to the OR numerator).
Whether any weapons were found was not a significant predictor for arrest in this model.
Characteristics of the police stop (time, place, pedestrian/vehicular mode) and the odds of a police stop resulting in any arrests
Model term | Significant OR estimate |
---|---|
Whether the stop was reported to last under 30 minutes | 0.65 |
Whether a police stop occurred during 7PM-4:59AM | 0.78 |
Interaction term: Whether a stop was of civilians operating a vehicle ✖️ whether any of the civilians were perceived to be Black and had their race perceived prior to the stop (repeated in demographics section) | 3.07 |
Interaction terms: Year (levels of 2021, 2022, and 2023 using 2020 as a reference) ✖️ whether the stop was of a pedestrian/off the roadways | N/A, N/A, N/A |
Whether the stop occurred in Berkeley (the original file had over 100 rows for each of neighboring cities of Albany, Emeryville, and Oakland) | N/A |
Whether the stop occurred in City Council District 4 (BPD station, Downtown Berkeley BART station for rail transit, highly commercial segments of Shattuck Ave and University Ave, homeless encampments during 2023-2025) | N/A
Whether the stop occurred in any of 4 contiguous southwest police beat polygons of interest | N/A |
Holding everything else in this model constant, the odds of arrest for stops reported to last under 30 minutes were 0.65 times the odds for longer stops (OR=0.65).
A model term for whether a police stop occurred during 7PM-4:59AM had a significant odds ratio of 0.78 when the model was fit on training data, pointing to lower odds of arrest for stops occurring during evening/night hours.
Stated earlier in the Demographics of stopped civilians and the odds of a police stop resulting in any arrests section: an interaction term for whether a stop was of civilians operating a vehicle ✖️ whether any of the civilians were perceived to be Black and had their race perceived prior to the stop had a significant OR of 3.07 (numerator pertained to police stops where all conditions applied).
Interaction terms for teasing out the relationship between year ✖️ whether the stop was pedestrian/vehicular did not have significant odds ratios when this model was fit on the training data, though the interaction term for the 2023 dummy variable ✖️ whether the stop was of a pedestrian/off the roadways ranked middle-of-the-pack for variable importance.
None of the three variables relating to the location of the stop was a significant predictor of arrest when this model was fit on the training data: whether the stop occurred in Berkeley (the original file had over 100 rows for each of the neighboring cities of Albany, Emeryville, and Oakland), whether it occurred in City Council District 4 (BPD station, Downtown Berkeley BART station for rail transit, highly commercial segments of Shattuck Ave and University Ave, homeless encampments during 2023-2025), and whether it occurred in any of 4 contiguous southwest police beat polygons of interest.
Supplementary materials
For anyone looking to code along in R, reproduce this analysis, or prepare their own follow-up analysis, I organized raw files and geocoding output used for this project on GitHub and Google Drive (same content on each site).