There’s a 30% Chance that it’s Already Raining: Inside a CEE Capstone Project
By Rachel Galowich ’18 and Jill Dressler ’18
Your two favorite blogging seniors are back!
In our last post, Jill gave a great introduction to the background and significance of our project. With our next two posts, we hope to dive a little deeper into the computer science going on behind the scenes. Jill and I are working with the same data set (a refresher: MassDOT traffic camera images, and the labels applied to each image by the Google Cloud Vision API), but focusing on different objectives. My goal is to predict the probability of future events given historical data.
My first order of business was to sort through the thousands of labels Google produced. Most labels can be applied to any image, so they don’t provide new or significant information that can help MassDOT respond to an accident – “road”, “asphalt”, and “lane” are a few examples. Other labels appear to be random and, in some cases, are pretty comical, like “Arctic Monkeys” and “bodybuilding.” And finally, buried within the sea of unhelpful labels, are the 43 that I am actually working with. They might indicate a reason to be concerned for driver safety (i.e. an event that might lead to an accident), the presence of an accident, or an emergency response to an accident.
To test whether the events represented by image labels can be predicted, we can introduce a lag in the data, or “shift” the data back in time. Our data set includes images taken every three minutes, so shifting the data 20 times represents one hour of total lag. I have been using two tools to show the relationship between the data and its lag terms: covariance and partial autocorrelation plots. Covariance plots reveal the labels that are strongly correlated with their lag terms, and partial autocorrelation plots show how that relationship changes with the number of lags. From this, I can learn which labels might be more easily predictable, and how far into the future they can reasonably be predicted.
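For readers who want to see what "shifting" looks like in practice, here is a minimal sketch in plain Python. It is not our project code: the label series is synthetic, the helper names (`lag_pairs`, `lag_correlation`, `synthetic_label_series`) are made up for illustration, and it computes the ordinary lagged correlation rather than the partial autocorrelation shown in our plots.

```python
import random

MINUTES_PER_IMAGE = 3                      # from the post: one image every three minutes
LAGS_PER_HOUR = 60 // MINUTES_PER_IMAGE    # 20 shifts = one hour of lag


def lag_pairs(series, lag):
    """Pair each observation with the one taken `lag` images earlier."""
    return list(zip(series[lag:], series[:len(series) - lag]))


def lag_correlation(series, lag):
    """Pearson correlation between a series and its copy shifted by `lag`."""
    pairs = lag_pairs(series, lag)
    n = len(pairs)
    mean_now = sum(now for now, _ in pairs) / n
    mean_past = sum(past for _, past in pairs) / n
    cov = sum((now - mean_now) * (past - mean_past) for now, past in pairs) / n
    var_now = sum((now - mean_now) ** 2 for now, _ in pairs) / n
    var_past = sum((past - mean_past) ** 2 for _, past in pairs) / n
    if var_now == 0 or var_past == 0:
        return 0.0  # a constant series carries no usable signal
    return cov / (var_now * var_past) ** 0.5


def synthetic_label_series(length, stay_probability, seed):
    """Hypothetical stand-in for one label: 1 if applied to an image, else 0.

    The label keeps its previous value with probability `stay_probability`,
    which mimics conditions (like snow) that persist across many images.
    """
    rng = random.Random(seed)
    state = 0
    series = []
    for _ in range(length):
        if rng.random() > stay_probability:
            state = 1 - state
        series.append(state)
    return series


series = synthetic_label_series(5000, 0.95, seed=0)
for lag in (1, 5, LAGS_PER_HOUR):
    minutes = lag * MINUTES_PER_IMAGE
    print(f"lag {lag:2d} ({minutes:2d} min): correlation = {lag_correlation(series, lag):.2f}")
```

On a persistent series like this one, the correlation should fade as the lag grows, which is the same kind of pattern I look for when deciding how far ahead a label can reasonably be predicted.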
The next step is to build a statistical model that can more concretely demonstrate whether prediction is possible. I have been using logistic regression, which gives the conditional probability that an event will occur. In this case, the output is a binary variable, which takes the value of one if a label is applied to an image, and zero if it is not. The input is the series of lagged data for a label. I am still working on refining this model, but am making steady progress with Jeff’s guidance (and Jill’s support, of course). The eventual hope for this part of the project is to use the logistic regression model and another machine learning technique, hidden Markov models, to infer the probability that one event, represented by a label (e.g. snowing outside → “snow”), might transition to another event (“snow” → “accident”).
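To make the setup concrete, here is a rough sketch of a logistic regression on lagged label data, written in plain Python with simple gradient descent. Everything in it is an assumption for illustration: the data is synthetic, the function names are hypothetical, and the number of lags, learning rate, and number of epochs are arbitrary choices rather than the settings from our actual model.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_lagged_dataset(series, n_lags):
    """Inputs: the previous `n_lags` label values. Output: the current value."""
    features, targets = [], []
    for t in range(n_lags, len(series)):
        features.append([series[t - k] for k in range(1, n_lags + 1)])
        targets.append(series[t])
    return features, targets


def predict_proba(weights, bias, row):
    """Conditional probability that the label is applied, given its lags."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def fit_logistic(features, targets, learning_rate=0.5, epochs=300):
    """Fit weights by batch gradient descent on the average log loss."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(features, targets):
            error = predict_proba(weights, bias, row) - target
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Synthetic, persistent 0/1 series standing in for one label (not real data).
rng = random.Random(0)
state, series = 0, []
for _ in range(2000):
    if rng.random() > 0.95:   # flip state about 5% of the time
        state = 1 - state
    series.append(state)

N_LAGS = 3  # arbitrary: use the last three images (nine minutes) as inputs
features, targets = make_lagged_dataset(series, N_LAGS)
weights, bias = fit_logistic(features, targets)

p_after_label = predict_proba(weights, bias, [1, 1, 1])
p_after_no_label = predict_proba(weights, bias, [0, 0, 0])
print(f"P(label now | label on last 3 images)  = {p_after_label:.2f}")
print(f"P(label now | label absent on last 3)  = {p_after_no_label:.2f}")
```

The two printed probabilities are the kind of output I am after: how likely a label is to show up on the next image, given what the camera saw over the last few.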
Hope you enjoyed reading this post, and that Jill can live up to my blogging prowess next week! Cheers until the semester finale!