Have you ever been ticketed?
Case
How fast can I go before I end up getting a ticket in Montgomery County, MD?
To figure out how fast you can go in Maryland before you get ticketed, we will use a Random Forest. An RF is an ensemble of decision trees, each fit to a bootstrapped sample of the data, that can be used for classification or, as here, regression. It is well known as one of the more efficient, accurate algorithms for handling thousands of input variables; it provides estimates of variable importance, copes well with missing data, and is good at capturing interactions between variables.
Also known as a Random Decision Forest, this model builds some number N of trees, each grown on a bootstrap sample (sampling with replacement) of the training data. For classification, each tree casts a "vote" and the forest chooses the class with the most votes; for regression, which is what we do below, the forest averages the trees' predictions.
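Here is a minimal sketch of that bootstrap-and-average idea in base R, with a made-up response vector and simple means standing in for fitted trees:
set.seed(1)
y <- rnorm(100, mean = 15, sd = 7)  # toy response (made up for illustration)
# each "tree" here is just the mean of one bootstrap sample (with replacement)
tree.preds <- replicate(50, mean(sample(y, replace = TRUE)))
mean(tree.preds)  # the "forest" prediction: the average over all trees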
Let’s give it a try.
EDA and Preparation
Montgomery County has a constantly updated dataset of traffic violation information. Download the data here.
Reading this dataset can take a while depending on your connection and hardware; mine took about 20 minutes. Below, we take a glimpse()
of the data and see that there are a total of 1,255,953 rows and 35 variables.
library(dplyr)
stops.full <- read.csv("http://data.montgomerycountymd.gov/api/views/4mse-ku6q/rows.csv?accessType=DOWNLOAD",header=TRUE,as.is=TRUE)
glimpse(stops.full)
## Observations: 1,255,953
## Variables: 35
## $ Date.Of.Stop <chr> "09/24/2013", "08/29/2017", "12/01/2014", "08/29...
## $ Time.Of.Stop <chr> "17:11:00", "10:19:00", "12:52:00", "09:22:00", ...
## $ Agency <chr> "MCP", "MCP", "MCP", "MCP", "MCP", "MCP", "MCP",...
## $ SubAgency <chr> "3rd district, Silver Spring", "2nd district, Be...
## $ Description <chr> "DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGIS...
## $ Location <chr> "8804 FLOWER AVE", "WISCONSIN AVE@ ELM ST", "CHR...
## $ Latitude <dbl> NA, 38.98172, 39.16289, 39.05698, NA, NA, 39.093...
## $ Longitude <dbl> NA, -77.09276, -77.22909, -76.95463, NA, NA, -77...
## $ Accident <chr> "No", "No", "No", "No", "No", "No", "No", "No", ...
## $ Belts <chr> "No", "No", "No", "No", "No", "No", "No", "No", ...
## $ Personal.Injury <chr> "No", "No", "No", "No", "No", "No", "No", "No", ...
## ...
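Since the read is slow, you may want to cache the raw data locally after the first download. A small sketch (the stops_raw.rds path is just an example):
# cache the raw data so the slow download happens only once
if (file.exists("stops_raw.rds")) {
  stops.full <- readRDS("stops_raw.rds")
} else {
  stops.full <- read.csv("http://data.montgomerycountymd.gov/api/views/4mse-ku6q/rows.csv?accessType=DOWNLOAD",
                         header = TRUE, as.is = TRUE)
  saveRDS(stops.full, "stops_raw.rds")
}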
Next, we'll filter the dataset down to the most recent full calendar year (2017) and keep these variables: SubAgency, Accident, Belts (seat belt), Personal Injury, Property Damage, Commercial License, Commercial Vehicle, Alcohol, Work Zone, Color of car, whether the driver contributed to an accident, Race, Gender, Auto Year, Month, Hour, whether the car is out of state, and the speed over the posted limit. The glimpse information is below (9,773 rows by 18 variables). Not all of these are strictly explanatory variables, but they are key indicators for predicting how far over the limit a stopped driver was going.
# subset to last year (2017)
last.year <- 2017
stops.full$AutoYear <- as.numeric(stops.full$Year) # the raw Year column is the vehicle model year
stops.full$Year <- as.numeric(substr(stops.full$Date.Of.Stop,7,10)) # year of the stop
stops.last <- subset(stops.full,Year==last.year)
# delete the really big data set ... don't need to tie up the memory
rm(stops.full)
# Make Month and Hour variables
stops.last$Month <- as.numeric(substr(stops.last$Date.Of.Stop,1,2))
stops.last$Hour <- as.numeric(substr(stops.last$Time.Of.Stop,1,2))
# clean up dataset
stops.last$AutoState <- stops.last$State
stops.last$Out.of.State <- (stops.last$AutoState!="MD")
stops.last$Color <- as.character(stops.last$Color)
stops.last$Color[stops.last$Color %in% c("CAMOUFLAGE","CHROME","COPPER","CREAM","MULTICOLOR","N/A","PINK")] <- "OTHER"
stops.last$Color <- factor(stops.last$Color)
# Remove N/A and blank colors, unknown gender, HAZMAT vehicles, and implausible auto years
stops.last <- subset(stops.last,Color != "N/A")
stops.last <- subset(stops.last,Color != "")
stops.last <- subset(stops.last,Gender != "U")
stops.last <- subset(stops.last,HAZMAT == "No")
stops.last <- subset(stops.last,AutoYear > 1990 & AutoYear < last.year+2)
# convert character variables to factors
stops.last$SubAgency <- factor(stops.last$SubAgency)
stops.last$Accident <- factor(stops.last$Accident)
stops.last$Belts <- factor(stops.last$Belts)
stops.last$Personal.Injury <- factor(stops.last$Personal.Injury)
stops.last$Property.Damage <- factor(stops.last$Property.Damage)
stops.last$Commercial.License <- factor(stops.last$Commercial.License)
stops.last$Commercial.Vehicle <- factor(stops.last$Commercial.Vehicle)
stops.last$Alcohol <- factor(stops.last$Alcohol)
stops.last$Work.Zone <- factor(stops.last$Work.Zone)
stops.last$Contributed.To.Accident <- factor(stops.last$Contributed.To.Accident)
stops.last$Race <- factor(stops.last$Race)
stops.last$Gender <- factor(stops.last$Gender)
stops.last$Out.of.State <- factor(stops.last$Out.of.State)
# Create dataset for Speeding
speed.last1 <- subset(stops.last,substr(Description,1,23)=="EXCEEDING MAXIMUM SPEED")
# difference between cited speed and posted speed limit
speed.last1$speed <- as.numeric(substr(speed.last1$Description,26,27))-as.numeric(substr(speed.last1$Description,45,46))
speed.last1 <- subset(speed.last1,!is.na(speed))
# example: EXCEEDING POSTED MAXIMUM SPEED LIMIT: 39 MPH IN A POSTED 30 MPH ZONE
speed.last2 <- subset(stops.last,substr(Description,1,30)=="EXCEEDING POSTED MAXIMUM SPEED")
# difference between cited speed and posted speed limit
speed.last2$speed <- as.numeric(substr(speed.last2$Description,39,40))-as.numeric(substr(speed.last2$Description,58,59))
speed.last2 <- subset(speed.last2,!is.na(speed))
# combine and subset to columns of interest
speed.last <- rbind(speed.last1,speed.last2)
speed.last <- speed.last[,c(4,9:12,14,16:18,24,28:30,36:38,40,41)]
glimpse(speed.last)
## Observations: 9,773
## Variables: 18
## $ SubAgency <fct> 4th district, Wheaton, 4th district, Wheaton, 3r...
## $ Accident <fct> No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Belts <fct> No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Personal.Injury <fct> No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Property.Damage <fct> No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Commercial.License <fct> No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Commercial.Vehicle <fct> No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Alcohol <fct> No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Work.Zone <fct> No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Color <fct> BLUE, RED, WHITE, WHITE, SILVER, BLUE, RED, BLUE...
## $ Contributed.To.Accident <fct> No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Race <fct> HISPANIC, HISPANIC, BLACK, BLACK, BLACK, WHITE, ...
## $ Gender <fct> M, M, F, M, M, M, M, M, M, M, M, F, M, M, M, M, ...
## $ AutoYear <dbl> 2012, 2002, 2000, 2012, 2005, 2016, 2009, 1995, ...
## $ Month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Hour <dbl> 10, 11, 11, 21, 2, 0, 20, 3, 4, 0, 23, 16, 21, 2...
## $ Out.of.State <fct> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, ...
## $ speed <dbl> 9, 9, 10, 21, 5, 11, 19, 50, 25, 17, 9, 9, 40, 2...
Here are the summary statistics and the histogram of the speed data. Speed in this set is the mph over the posted limit when stopped for a violation. The distribution is skewed to the right, with a mean of about 15 mph over the limit and a standard deviation of 7.36 mph. There are, of course, other considerations to make. If you've ever had a speeding citation, you would know: officers often write down a lower speed than the one they actually clocked, so the recorded values may not be a clear reflection of the violation. We could probably go faster than these numbers suggest.
c(summary(speed.last$speed),"sd"=sd(speed.last$speed))
hist(speed.last$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max. sd
## 1.000000 9.000000 14.000000 15.027013 19.000000 61.000000 7.364514
Now we will use a train/test split (a simple form of cross-validation, comparing performance on held-out data) to measure how well the model does. Training and test datasets are used heavily in machine-learning algorithms (Random Forests are a type of machine learning; really, regressions are a type of machine learning as well). Simply stated, the training data pairs each input with a known output so the model can learn the mapping, and the held-out test data estimates how well the trained model performs on data it has never seen (how well your model trains depends on the size of your data, the value you would like to predict, the inputs, etc.). After we test it, we have a model specific to this situation: a little sacrifice of predictive generality in exchange for information, given the heavy dependence on these particular variables, but worthwhile nonetheless.
As a rule of thumb, the training set usually consists of about 70-80% of the complete dataset, and the test set comprises the rest. Both are created by randomly sampling rows without replacement from the full dataset. For this step, make sure to set a random seed so the split is reproducible and the same rows always end up in train and test. After confirming that the train and test datasets look similar, we're good to fit the model!
# Check length of dataset
length(speed.last$SubAgency)*.80 # 80% of the data = close to 8,000 observations
# Create Train and Test
# Create a SRS without replacement of n.train=8000
set.seed(14) # choose whatever seed; setting one makes the train/test split reproducible
train.rows <- sample(length(speed.last$SubAgency),8000)
speed.train <- speed.last[train.rows,]
speed.test <- speed.last[-train.rows,]
# confirm train and test are similar
rbind("Train" = summary(speed.train$speed), "Test" = summary(speed.test$speed))
## [1] 7818.4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Train 1 9 14 15.01062 19 61
## Test 4 9 15 15.10096 19 56
The Model
The package you need to install is randomForest. It offers a variety of helper functions for classification and regression with random forests, including MDSplot()
and na.roughfix()
. If you are concerned about visualization or fixing missing values, you may want to look into those; random forests can handle missing data remarkably well without deleting any rows.
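As a quick illustration of na.roughfix() on a made-up data frame (not needed for our already-cleaned dataset): numeric NAs are filled with the column median, factor NAs with the most frequent level.
library(randomForest)
toy <- data.frame(x = c(1, NA, 3, 4), g = factor(c("a", "a", NA, "b")))  # toy data, made up
na.roughfix(toy)  # x's NA becomes the median of x (3); g's NA becomes the modal level ("a")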
Now, the RF model call is quite simple. For the x argument, we input the training dataset EXCLUDING the response variable we are predicting (speed, in this case). The y argument is the response variable itself (the same as column 18 here). The same goes for the xtest and ytest arguments (make sure the test dataset goes with xtest/ytest and the training data with x/y; pretty obvious, but easy to mix up). Another specification to make is replace=TRUE, which bootstraps your samples so you can grow as many trees as you like (with n observations there are n^n possible ordered bootstrap samples).
The rest of the parameter tuning can get a little messy depending on how specific you want the model to be. Since a lot of websites do not specify which parameters to use, I will do my best to give a general outline of what you can play around with. More information here.
ntree: The number of trees you want to create.
- The more trees you grow, the more thorough your analysis will be, yet there is always a point where marginal benefit meets marginal cost. Averaging more trees shrinks the variance of the prediction (roughly like sd/sqrt(n)), so a large forest is good in principle, but every extra tree costs computation while contributing less and less accuracy. So, depending on the size of the dataset, we should be careful. You can look at the OOB (out-of-bag) error rate to find where the error curve flattens and pick the tree count and number of sampled predictors there. As a general rule of thumb, about 50-130 trees gives reasonable accuracy without letting your CPU die from too much data.
mtry: The number of predictors sampled for each tree.
- At each split, a tree considers a random subset of the predictors as candidates. In this model, we grow 50 trees that sample 5 candidate predictors at each split. The best value can also be chosen via the OOB error rate. If you aren't being too meticulous, just use about a third of the number of explanatory variables (the regression default). Keeping the per-split candidate set relatively small is ideal, as it decorrelates the trees and lets interactions with specific variables come through.
nodesize: The minimum number of data you can have in each node.
- The minimum number of observations allowed in a terminal node. Since we are working with recursive partitions, our data is hierarchical: branches of n observations split into further branches of various sizes. The smaller the nodesize, the larger (deeper) the trees. One way to find the optimum is to take nodesizes of 10, 20, 30, 40, and 50 and compare the RMSE (Root Mean Square Error) on the train and test data; a sketch of that loop appears right after this list. If the RMSE of the train dataset is similar to the test's, we can be fairly confident we have a good model. If not, keep increasing the nodesize. Larger datasets will usually require larger nodesizes.
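Here is a minimal sketch of that nodesize comparison, reusing the speed.train/speed.test split built above (the loop refits the forest for each candidate, so it takes a few moments):
library(randomForest)
# try several nodesizes and compare the resulting test RMSE
for (ns in c(10, 20, 30, 40, 50)) {
  fit <- randomForest(x = speed.train[,-18], y = speed.train$speed,
                      xtest = speed.test[,-18], ytest = speed.test$speed,
                      ntree = 50, mtry = 5, nodesize = ns)
  # test$mse holds the test MSE after each tree; the last entry is the full forest
  cat("nodesize =", ns, " test RMSE =", round(sqrt(tail(fit$test$mse, 1)), 3), "\n")
}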
library("randomForest")
names(speed.train) # to check which one is speed. Take out speed no. 18
out.speed <- randomForest(x=speed.train[,-18], y=speed.train$speed,
xtest=speed.test[,-18], ytest=speed.test$speed,
replace=TRUE, # Bootstrap SRS with replacement
keep.forest=TRUE, ntree=50, mtry=5, nodesize=30)
out.speed
## Call:
## randomForest(x = speed.train[, -18], y = speed.train$speed,
## xtest = speed.test[,-18], ytest = speed.test$speed,
## ntree = 50, mtry = 5, replace = TRUE,
## nodesize = 30, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 50
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 49.08494
## % Var explained: 9.09
## Test set MSE: 50.25
## % Var explained: 9.19
Looking at the output from out.speed, we can see that the train MSE (the mean of squared residuals) is 49.08 and the test MSE is 50.25, assuring us that their root values (the RMSEs) are also similar. Of course, an RMSE of about 7 is rather large considering the speeds range from 1 all the way to 61: roughly 95% of observations will fall within about +-14 mph (two RMSEs) of the predicted ticketing speed. We can also see that the % variance explained (R^2) is small, which indicates fairly poor predictive performance. For the sake of this example I will not tune the parameters further, but you get the gist (try increasing ntree and nodesize).
If there is one thing we need to be careful of, it is overfitting: when the trees latch onto distinctive or peculiar features of the observed data, fitting noise rather than signal. The way to address it is by tuning the parameters above. Usually these problems ease as the forest captures more general patterns in the data, so you may be able to resolve them by increasing the number of trees, decreasing mtry, and increasing the nodesize. This shouldn't be too big a concern with Random Forests, but it depends on what your dataset looks like.
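One quick diagnostic, sketched below using the out.speed object we just fit: plot the OOB and test MSE as trees are added. If the test curve drifts upward while the OOB curve keeps falling, the forest is starting to fit noise.
# out.speed$mse and out.speed$test$mse each contain one MSE value per tree
plot(out.speed$mse, type = "l", xlab = "number of trees", ylab = "MSE",
     ylim = range(c(out.speed$mse, out.speed$test$mse)))
lines(out.speed$test$mse, col = "red")
legend("topright", c("OOB (train)", "test"), col = c("black", "red"), lty = 1)
sqrt(tail(out.speed$mse, 1))      # train (OOB) RMSE, about 7.0
sqrt(tail(out.speed$test$mse, 1)) # test RMSE, about 7.1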
Now let's see how much each explanatory variable contributed to the forest. The more a variable is used to make good splits, the more important it is; the less it is used, the less important. One point to note: importance is not the same as significance. Just because a variable was the number-one predictor does not mean the differences it picks up are statistically significant. That can be checked with a simple test such as an ANOVA (a quick sketch follows the importance output below).
# Model Interpretability(how often were explanatory variables used in the forest?)
head(importance(out.speed))
varImpPlot(out.speed)
## IncNodePurity
## SubAgency 26307.18816
## Accident 0.00000
## Belts 5346.83283
## Personal.Injury 73.66466
## Property.Damage 245.90873
## Commercial.License 1310.37107
The importance()
function and the plot highlight a telling feature of the random forest: as we can see, the accident and commercial vehicle measures were essentially not used at all. Intuitively, accidents are a result of bad driving rather than an explanation for it. We left those variables in to show that the rf is rather intelligent: it seeks out the variables that fit best and omits unnecessary ones (unless it overfits, of course, in which case you'll have the rf making ludicrous connections).
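As a quick sketch of the significance check mentioned above, a one-way ANOVA asks whether mean speeds actually differ across the top-ranked variable, SubAgency:
# does mean over-the-limit speed differ across police districts?
summary(aov(speed ~ SubAgency, data = speed.train))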
Now that we have our model, let's make a prediction! First, we will grab an observation from the dataset that is very similar to my situation and tweak it. This is NOT a reflection of myself, just a hypothetical person (me). Afterwards, we can use the predict()
function to answer the question.
Prediction
Let's now pretend that you are a male driver with the following characteristics. We will estimate how fast you can drive around Maryland before getting a ticket.
# make a prediction at a new observation
# note:
new.obs<-speed.last[25,]
new.obs$AutoYear<-2017
new.obs$Month<-8
new.obs$Hour<-18
glimpse(new.obs)
## Observations: 1
## Variables: 18
## $ SubAgency <fct> 2nd district, Bethesda
## $ Accident <fct> No
## $ Belts <fct> No
## $ Personal.Injury <fct> No
## $ Property.Damage <fct> No
## $ Commercial.License <fct> No
## $ Commercial.Vehicle <fct> No
## $ Alcohol <fct> No
## $ Work.Zone <fct> No
## $ Color <fct> BLUE
## $ Contributed.To.Accident <fct> No
## $ Race <fct> WHITE
## $ Gender <fct> M
## $ AutoYear <dbl> 2017
## $ Month <dbl> 8
## $ Hour <dbl> 18
## $ Out.of.State <fct> FALSE
## $ speed <dbl> 17
You don't look like such a bad guy (whatever "bad guy" even means here)! Just a normal dude going 17 over the speed limit with no seat belt on!
predict(out.speed,newdata=new.obs)
hist(speed.last$speed)
abline(v=predict(out.speed,newdata=new.obs), col="red")
Wouldn't that be neat? You can go about 15 mph over the limit before getting caught in Montgomery County, MD! Of course, we do have to keep in mind that the RMSE is about 7 mph, which makes any prediction interval rather wide. Regardless, we have ourselves an rf prediction!
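To make that uncertainty concrete, here is a rough sketch of an interval using the +-2*RMSE heuristic from earlier (a heuristic band, not a formal random-forest prediction interval):
pred <- unname(predict(out.speed, newdata = new.obs))
rmse <- sqrt(tail(out.speed$test$mse, 1))  # test RMSE, about 7 mph
c(lower = pred - 2*rmse, estimate = pred, upper = pred + 2*rmse)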
Conclusion
As you can see, the Random Forest is quite the attractive model. While this example used a quantitative response variable, we can use it for categorical responses as well. If we wanted, we could also answer the question: if pulled over in MD, will you get a ticket or a warning? That is a classification problem, closer in spirit to a logistic regression. If you want to see the code for that, send me a message and I will contact you directly; a rough sketch of the idea is below.
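For flavor, here is an untested sketch of what that classification forest might look like. It assumes the raw data carry a Violation.Type column distinguishing citations from warnings; that column name and its levels are assumptions about the source dataset, so check names(stops.last) before running.
# Hypothetical sketch: classify the outcome of a stop instead of predicting speed.
# Violation.Type (e.g. "Citation" vs. "Warning") is assumed, not verified above.
stops.last$Ticket <- factor(stops.last$Violation.Type)
out.ticket <- randomForest(Ticket ~ SubAgency + Belts + Race + Gender +
                             AutoYear + Month + Hour + Out.of.State,
                           data = stops.last, ntree = 50, mtry = 3, nodesize = 30)
out.ticket # for classification, prints a confusion matrix and OOB error rate instead of MSE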
We have learned that the RF model grows each tree on a bootstrap sample of the data, takes a random subset of predictors at each node, and finds the variable that gives the best split. This technique can predict both quantitative and categorical responses, and missing values do not affect the analysis very much. All we need to do is supply explanatory variables suited to whichever type of response we want to predict. Of course, there are caveats: 1. we cannot predict anything outside the range of the data we have (trees cannot extrapolate); 2. the relationships learned from the training data are assumed to carry over to new observations; 3. the +-2*RMSE band we used above leans on roughly homoscedastic errors.
In the future, maybe we should use more trees; that would most likely reduce the chance of stumbling on a forest that performs poorly because of the particular train/test split. Maybe we'll give it a try when we have time. Maybe not?