Data-Driven Analysis of Motor Vehicle Collisions
1. Introduction
The increasing rate of motor vehicle collisions has prompted the need for a data-driven analysis to understand patterns, contributing factors, and severity levels of these incidents. This research aims to leverage machine learning models to provide insights into the factors influencing collisions and develop predictive models to assess the risk levels associated with different scenarios.
Research Objectives:
· Analyze and visualize patterns in motor vehicle collisions.
· Develop machine learning models to predict collision risk levels.
· Provide recommendations based on the analysis to improve road safety.
Data Source and Variable Explanation
The dataset used in this study is sourced from NYC Open Data and is provided by the New York City Police Department (NYPD). It includes comprehensive information on motor vehicle collisions recorded and reported by the NYPD. Key variables include geographical information (such as ‘borough’), contributing factors (‘contributing_factor_vehicle_1’ through ‘contributing_factor_vehicle_5’), vehicle types (‘vehicle_type_code_1’ through ‘vehicle_type_code_5’), and counts of the different types of injuries and fatalities.
import pandas as pd
df = pd.read_csv('motor_vehicle_collisions.csv')  # file name assumed for illustration
pd.set_option('display.max_columns', None)  # Show all columns
df.head(15)
Note: Before I start working with data, I make sure to thoroughly understand it. This involves checking how big the dataset is, what type of information it contains, and looking for any missing pieces. I don’t just stop at finding missing values; I also try to figure out why they’re missing and what it means for each part of the data. This way, I can make smart decisions about how to handle those gaps and ensure the data is reliable when I process it.
df.shape
df.dtypes
print(df.isnull().sum())
# Display descriptive statistics of numerical columns
print("\nDescriptive statistics of numerical columns:")
print(df.describe())
2. Data Processing Explanation
The dataset underwent thorough preprocessing to ensure its suitability for analysis and modeling:
Standardization of Column Names:
Column names were standardized by removing spaces, special characters, and converting to lowercase.
# Clean up column names by removing spaces and converting to lowercase
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print(df.columns)
Time-Related Features:
‘crash_date’ and ‘crash_time’ columns were converted to date-time format. Among many other reasons, this is done to explore and visualize the distribution of collisions over different months and hours.
# Explore time-related features: 'crash_date' and 'crash_time'
df['crash_date'] = pd.to_datetime(df['crash_date'])
df['crash_time'] = pd.to_datetime(df['crash_time'], format='%H:%M')  # time-of-day format assumed
Then we can extract ‘month’ and ‘hour’ columns to analyze collisions over months and hours.
# Extract month and hour information
df['month'] = df['crash_date'].dt.month
df['hour'] = df['crash_time'].dt.hour
Total Injuries and Killed:
‘total_injuries’ and ‘total_killed’ columns were created by summing the relevant injury and fatality columns.
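The summation step can be sketched as follows. This is a minimal illustration: the tiny frame stands in for the real dataset, and the injury/fatality column names are assumed from the NYC Open Data schema.

```python
import pandas as pd

# Toy stand-in for the real dataset; column names assumed from NYC Open Data
df = pd.DataFrame({
    'number_of_persons_injured':     [0, 2, 1],
    'number_of_pedestrians_injured': [0, 1, 0],
    'number_of_persons_killed':      [0, 0, 1],
    'number_of_pedestrians_killed':  [0, 0, 0],
})

# Sum the relevant columns row-wise to get combined totals
injury_cols = [c for c in df.columns if c.endswith('_injured')]
killed_cols = [c for c in df.columns if c.endswith('_killed')]
df['total_injuries'] = df[injury_cols].sum(axis=1)
df['total_killed'] = df[killed_cols].sum(axis=1)
df['combined_injured_killed'] = df['total_injuries'] + df['total_killed']
print(df['combined_injured_killed'].tolist())
```

Selecting the columns by suffix keeps the step robust if additional injury or fatality columns (cyclists, motorists) are present.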
Risk Level Categorization:
A new column, ‘risk_level,’ was introduced by categorizing the combined injuries and killed into risk levels based on predefined thresholds.
low_threshold = 0
medium_threshold = 2
high_threshold = 3
df['risk_level'] = pd.cut(df['combined_injured_killed'],
                          bins=[low_threshold, medium_threshold, high_threshold, float('inf')],
                          labels=['low risk', 'medium risk', 'high risk'],
                          right=False)
One-Hot Encoding:
Categorical variables such as ‘borough’ and contributing factors were one-hot encoded to facilitate machine learning model input.
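The encoding step can be sketched with pandas’ `get_dummies`; the two example category values below are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical slice of the categorical columns
df = pd.DataFrame({
    'borough': ['BROOKLYN', 'QUEENS', 'BROOKLYN'],
    'contributing_factor_vehicle_1': ['Driver Inattention/Distraction',
                                      'Following Too Closely',
                                      'Unspecified'],
})

# One-hot encode the categorical columns so models receive numeric input
encoded = pd.get_dummies(df, columns=['borough', 'contributing_factor_vehicle_1'])
print(encoded.columns.tolist())
```

Each category value becomes its own indicator column (e.g. `borough_BROOKLYN`), and the original string columns are dropped.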
Handling of Missing Values:
Removal of Rows with All Null Values:
Rows that were null across all vehicle-type and contributing-factor columns were removed, as they contained no usable information about the vehicles involved or the contributing factors.
Rows with missing values in the ‘borough’ column were also dropped.
Imputation and Replacement:
Missing values in categorical columns, including vehicle types and contributing factors, were replaced with the label ‘Unknown’ to retain these instances in the analysis while acknowledging the absence of specific information.
Missing values in numerical columns like ‘number_of_persons_injured’ and ‘number_of_persons_killed’ were imputed with median values.
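Both imputation rules can be sketched in a few lines; the miniature frame below is illustrative only.

```python
import pandas as pd
import numpy as np

# Toy frame with the two kinds of gaps described above
df = pd.DataFrame({
    'vehicle_type_code_2': ['Sedan', np.nan, 'Taxi'],
    'number_of_persons_injured': [1.0, np.nan, 3.0],
})

# Categorical gaps become an explicit 'Unknown' label
df['vehicle_type_code_2'] = df['vehicle_type_code_2'].fillna('Unknown')

# Numerical gaps are imputed with the column median
median_injured = df['number_of_persons_injured'].median()
df['number_of_persons_injured'] = df['number_of_persons_injured'].fillna(median_injured)
```

The ‘Unknown’ label keeps incomplete records in the analysis instead of discarding them, while the median is robust to the heavy right skew of injury counts.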
Exploratory Data Analysis (EDA):
The distribution of the top 10 contributing factors and the top 10 vehicle types was explored using count plots.
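The input to those count plots can be derived with `value_counts`; a minimal sketch with made-up factor values:

```python
import pandas as pd

# Toy contributing-factor column; the real dataset has many more categories
df = pd.DataFrame({'contributing_factor_vehicle_1': [
    'Unspecified', 'Driver Inattention/Distraction', 'Unspecified',
    'Following Too Closely', 'Unspecified']})

# Top categories by frequency (head(10) on the full data gives the top 10)
top_factors = df['contributing_factor_vehicle_1'].value_counts().head(10)
print(top_factors.index[0], int(top_factors.iloc[0]))
```

Passing this series (or the raw column with an `order=` argument) to a seaborn `countplot` reproduces the figures.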
Distribution of Injuries and Fatalities (Log Scale)
The majority of collisions result in no injuries or fatalities, as indicated by the large counts for 0.0 in both figures. Collisions resulting in injuries are more common than those resulting in fatalities, which is expected. The distribution provides a sense of the severity of collisions, with most being minor (few or no injuries) and a smaller number involving more severe injuries or fatalities.
Collisions Over Months and Hours
Number of Accidents Over Time by Borough (2021–2022)
By grouping the data by year, month, and borough, the line plot provides a dynamic representation of how the frequency of accidents varied across different boroughs throughout the specified time frame.
As shown in the figure, Brooklyn consistently exhibited the highest number of collisions compared to other boroughs.
Collisions on Public Holidays vs Non-Public Holidays
Collisions on public holidays were examined, but the observed effect on collision rates appeared to be minimal.
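One way to flag holiday collisions, sketched here under the assumption that US federal holidays were used as the holiday set, is pandas’ built-in holiday calendar:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Two example crash dates: Thanksgiving 2021 and the day after
df = pd.DataFrame({'crash_date': pd.to_datetime(['2021-11-25', '2021-11-26'])})

# Build the set of observed federal holidays for the study period
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2021-01-01', end='2022-12-31')

# Flag collisions whose date falls on a holiday
df['is_holiday'] = df['crash_date'].isin(holidays)
print(df['is_holiday'].tolist())
```

Grouping collision counts by this flag then gives the holiday vs non-holiday comparison.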
3. Modeling
In this section, we explore three distinct models aimed at extracting valuable insights from the dataset: K-means clustering to identify the major vehicle types tied to collisions, linear regression to predict the number of accidents in an area over time, and a decision tree classifier to predict the risk level associated with accidents, which aids in understanding the factors contributing to their severity.
3.1 Cluster Analysis Model:
· K-means clustering is a popular unsupervised machine learning algorithm used for grouping similar data points into clusters.
· It is chosen here for identifying major vehicle types tied to collisions as it can segregate data into distinct groups based on feature similarity.
To discern major vehicle types associated with collisions, K-means clustering was employed. The process involved:
Feature Selection:
The primary vehicle type (‘vehicle_type_code_1’) was selected as the key feature.
Determining Optimal Clusters (k):
The Elbow Method was utilized to identify the optimal number of clusters (k).
Applying K-means Clustering and Identifying Major Vehicle Types:
K-means clustering was executed with the chosen k of 3 to group similar vehicle types, and the major vehicle types in each cluster were determined by selecting the mode within each cluster.
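The pipeline above can be sketched as follows. The toy vehicle-type values are assumptions, and the label-encoding of a nominal feature for K-means reproduces the report's approach as described rather than endorsing it as best practice.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for vehicle_type_code_1
vehicle_types = np.array(['Sedan', 'Sedan', 'Taxi', 'Taxi',
                          'Station Wagon/Sport Utility Vehicle'] * 10)
X = LabelEncoder().fit_transform(vehicle_types).reshape(-1, 1)

# Elbow method: inspect inertia across candidate k values
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 4)}

# Fit with the chosen k = 3 and take the modal vehicle type per cluster
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
major = (pd.Series(vehicle_types)
         .groupby(km.labels_)
         .agg(lambda s: s.mode().iloc[0]))
print(sorted(major.tolist()))
```

Plotting `inertias` against k and looking for the bend gives the elbow plot used to justify k = 3.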
Results
The clustering process revealed three distinct clusters, each associated with major vehicle types tied to collisions:
Clusters Outcome Analysis:
Cluster 0 analysis:
This cluster predominantly consists of collisions involving Station Wagons/Sport Utility Vehicles as the primary vehicle type.
Sedans are also notable within this cluster, suggesting a combination of these two vehicle types in collisions.
Cluster 1 Analysis:
This cluster is characterized by collisions where Sedans are the primary vehicle type involved.
It indicates a specific pattern where accidents are mainly between two sedans, possibly representing typical urban collisions.
Cluster 2 Analysis:
This cluster highlights collisions involving Taxis as the primary vehicle type.
Sedans also appear frequently in these collisions, indicating potential scenarios where taxis collide with sedans.
3.2 Linear Regression for Accidents Prediction
· Linear regression is a supervised machine learning algorithm used for predicting a continuous outcome.
· It is chosen here for predicting the number of accidents in an area over time as it models the relationship between input features (year, month, borough) and the target variable (accident count).
· It can provide insights into how changes in input features relate to changes in the target variable.
Grouping and Counting:
The dataset was grouped by year, month, and borough, and the number of accidents was counted.
Modeling Data Preparation:
A modeling dataset was created, featuring ‘Year,’ ‘Month,’ ‘borough,’ and ‘Accident Count.’
Training and Testing Sets:
The dataset was split into training and testing sets.
Linear Regression Model:
A linear regression model was built using the training data.
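The grouping, encoding, splitting, and fitting steps can be sketched end to end. The accident counts below are invented placeholders; on the real data the counts come from grouping by year, month, and borough.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy monthly accident counts per borough (values are illustrative only)
counts = pd.DataFrame({
    'year':    [2021] * 4 + [2022] * 4,
    'month':   [1, 2, 1, 2, 1, 2, 1, 2],
    'borough': ['BROOKLYN', 'BROOKLYN', 'QUEENS', 'QUEENS'] * 2,
    'accident_count': [300, 310, 200, 205, 320, 330, 210, 215],
})

# One-hot encode borough, then split and fit the regression
X = pd.get_dummies(counts[['year', 'month', 'borough']], columns=['borough'])
y = counts['accident_count']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)
```

The fitted coefficients indicate how each feature (year trend, seasonality, borough) relates to accident counts.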
Model Evaluation and Results
The model’s performance was assessed using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
The linear regression model's performance on the held-out test set was reported using these metrics.
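The three metrics are computed as below; the two prediction/truth pairs are illustrative values, not the report's actual test-set results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Illustrative values only
y_true = np.array([300.0, 210.0])
y_pred = np.array([310.0, 205.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
print(mse, rmse, mae)
```

RMSE is reported alongside MSE because it is in the same units as the accident count, which makes the error magnitude easier to interpret.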
Interpretation and Acceptance of Results
Both models contribute valuable insights to our understanding of collisions. The clustering model identifies major vehicle types associated with collisions, aiding in targeted interventions and policy decisions. The linear regression model predicts the number of accidents over time, providing a tool for proactive planning and resource allocation.
The identified major vehicle types from clustering and the predictive capability of the linear regression model collectively contribute to a comprehensive understanding of collision patterns.
3.3 Decision Tree Classification Model
One modeling method chosen is decision tree modeling. We chose it to classify the “level of risk” from vehicle-related factors (i.e., incident contributing factors, vehicle types, number of injuries, number of fatalities, and the area in which the incident occurred).
low_threshold = 0
medium_threshold = 2
high_threshold = 3
For the model's risk classification, outcomes are grouped into three levels of risk based on the total number of injuries and/or fatalities, with thresholds chosen by our own judgement. Per the thresholds above, “Low Risk” covers totals below 2, typically accidents with no or minimal injuries that were still reported because they met the minimum damage cost required for a police report (damage totaling over $1,000). “Medium Risk” covers a total of exactly 2 injuries and/or fatalities, and “High Risk” covers any total of 3 or more. The model aims to predict the most likely risk category for a reported vehicle incident.
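The labeling and classifier fit can be sketched together. The two-feature toy frame below is an assumption for brevity; the actual model also receives the one-hot encoded contributing factors, vehicle types, and borough.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy feature frame (the real model has many more columns)
df = pd.DataFrame({
    'total_injuries': [0, 0, 1, 2, 3, 5],
    'total_killed':   [0, 0, 0, 0, 1, 0],
})
df['combined_injured_killed'] = df['total_injuries'] + df['total_killed']

# Same thresholds as above: [0, 2) low, [2, 3) medium, [3, inf) high
df['risk_level'] = pd.cut(df['combined_injured_killed'],
                          bins=[0, 2, 3, float('inf')],
                          labels=['low risk', 'medium risk', 'high risk'],
                          right=False)

# Fit the classifier and predict the risk category of a new incident
X = df[['total_injuries', 'total_killed']]
y = df['risk_level']
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
new_point = pd.DataFrame({'total_injuries': [4], 'total_killed': [0]})
print(clf.predict(new_point)[0])
```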
The results presented by the model were promising, but a few issues need to be addressed before we can accept it. The low-risk category performed extremely well in prediction accuracy and reliability, but the other two categories did not. To address this imbalance, we would either limit the training and testing data to an even amount across all categories or upsample the minority classes. This would yield a “truer” result set and help us identify and remove noise in the dataset to maintain better accuracy. Due to time constraints on this project, we could not address this issue before the presentation and report. Another option would be to automatically categorize any fatality as high risk and keep the current “buckets” for total injuries alone. This would make it easier to regroup the large dataset into a slightly fairer set of classes, but upsampling would still be needed, since there is unlikely to be enough medium- and high-risk data to train and test with.
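The upsampling remedy discussed above can be sketched with scikit-learn's `resample`; the class counts in the toy frame are illustrative.

```python
import pandas as pd
from sklearn.utils import resample

# Illustrative imbalanced frame: 'low risk' dominates
df = pd.DataFrame({
    'x': range(12),
    'risk_level': ['low risk'] * 8 + ['medium risk'] * 3 + ['high risk'] * 1,
})

# Upsample each class (with replacement) to match the majority class size
majority_size = df['risk_level'].value_counts().max()
parts = [resample(group, replace=True, n_samples=majority_size, random_state=0)
         for _, group in df.groupby('risk_level')]
balanced = pd.concat(parts)
print(balanced['risk_level'].value_counts().to_dict())
```

Upsampling should be applied to the training split only, so that the test set still reflects the true class distribution.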
Conclusion:
In conclusion, this data-driven analysis of motor vehicle collisions employs machine learning models to understand patterns and predict risk levels, using a comprehensive dataset from NYC Open Data. Thorough preprocessing, exploratory data analysis, and diverse modeling techniques yield valuable insights. While the clustering and linear regression models offer useful findings, recognizing the need to improve the decision tree model's predictions is crucial. Continuous refinement and adaptation of methodology are essential for a more effective understanding of collision patterns and improved road safety interventions. Overall, this study emphasizes the importance of ongoing efforts to enhance accuracy and insight in addressing motor vehicle collisions.