Tools: R - partykit, caret, randomForest, pROC, tree, gbm, glmnet, tidyr, plus all the usual
While I’ve gotten comfortable with, and enjoy, using machine learning and data mining techniques to solve problems, this was my first experience coding up an entire model from scratch with multiple methods and a huge amount of data pre-processing. My awesome teammates, Yoni and Mark, and I easily put in 50+ hours to produce our report.
We spent roughly 60-70% of our time on data pre-processing. This included choosing the outcome variable we wanted to predict, general variable selection, data wrangling, and exploratory data analysis. We then ran five data mining/machine learning models, including:
● Regularized Logistic Regression
● Random Forest
● Simple Decision Tree
● Bagging Classification Trees
We chose the bagging model for our final prediction because it performed best on the ROC curve and AUC, and offered the best compromise between specificity, sensitivity, and positive predictive value (PPV). We were happy with our work and, after feedback from our professor, know that with more time we could improve our results.
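To make the selection criteria above concrete, here is a minimal sketch of how sensitivity, specificity, PPV, and AUC can be computed from predicted scores. The original analysis was done in R (caret, pROC); this Python version is only an illustration. The function names, the 0.5 threshold, and the sample labels and scores are all hypothetical and do not come from the report.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, and PPV at a given score threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else nan,  # true negative rate
        "ppv": tp / (tp + fp) if (tp + fp) else nan,          # precision
    }


def auc(y_true, y_score):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of the ROC area).
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels and scores, for illustration only.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.6, 0.1]
    print(classification_metrics(labels, scores))
    print(auc(labels, scores))
```

AUC summarizes ranking quality across all thresholds, while sensitivity, specificity, and PPV depend on the threshold chosen, which is why comparing models on both kinds of measure is useful.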
Smart Street Sweeping in Pittsburgh
As part of my final master's thesis, I worked on a team evaluating street sweeping in the city of Pittsburgh, proposing policy recommendations and digitizing posted parking restrictions that will be used to create a street sweeping notification system.
Street sweeping is an important municipal function because it improves the cleanliness and appearance of the street, prevents catch basin clogging, and keeps nutrient pollutants from entering nearby waterways. Street sweeping in Pittsburgh requires residents to move their cars for a 5.5-hour period twice a month (once for each side of the street) so that the sweeper can access the curb, where 80% of the debris resides. Last year, over 40,000 parking tickets were issued to residents who did not move their cars during this period, resulting in negative interactions between city residents and government and, due to a municipal code, delayed sweeping operations.
We believed that a large majority of violators were residents who simply forgot, both because sweeping is so infrequent and because residents had nowhere to find this information other than the metal signs that adorn Pittsburgh streets.
Our project goal was to decrease parking violations by 25% and make the issuing of the remaining tickets less of an impediment to sweeping operations. The project team created a database that digitized all of the street sweeping parking restrictions. The City of Pittsburgh can leverage this database to build an SMS text notification system for residents, which would yield a 10-15% annual reduction in parking violations.
We also recommended increasing the deterrent (the fine) for street sweeping violations from $30 to $45, which would yield an additional 10-15% annual reduction in parking violations.
Finally, by changing operations to decouple sweeping from parking enforcement, the City can save an additional 1,000 labor hours per year.
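As a rough check on how the two projected reductions relate to the 25% goal, the sketch below combines them in two ways. The write-up does not say whether the effects add or compound, so both readings are shown as assumptions. The function name is hypothetical, and the 40,000 baseline is the "over 40,000" figure above used as a round lower bound.

```python
def combined_reduction(r1, r2, compound=False):
    """Combine two fractional reductions in violations.

    compound=False: effects simply add (r1 + r2).
    compound=True:  the second reduction applies to what remains after the first.
    """
    if compound:
        return 1 - (1 - r1) * (1 - r2)
    return r1 + r2


BASELINE_TICKETS = 40_000  # assumption: round lower bound from the text
GOAL = 0.25                # the project's stated 25% goal

if __name__ == "__main__":
    # Low and high ends of the 10-15% range quoted for each recommendation.
    for label, r in (("low", 0.10), ("high", 0.15)):
        additive = combined_reduction(r, r)
        compounded = combined_reduction(r, r, compound=True)
        print(
            f"{label}: additive {additive:.2%} "
            f"({BASELINE_TICKETS * additive:,.0f} tickets), "
            f"compounded {compounded:.2%} "
            f"({BASELINE_TICKETS * compounded:,.0f} tickets), "
            f"goal {GOAL:.0%}"
        )
```

If the effects add, the combined reduction ranges from 20% to 30%; if they compound, from 19% to 27.75%. Under either reading the 25% goal falls within the projected range.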
Investor Ownership in Hazelwood
This project looked at the amount of investor activity occurring in the Hazelwood neighborhood of Pittsburgh, an area adjacent to the Almono site, the last large vacant parcel of land in the city of Pittsburgh that is slated to receive nearly $1 billion in future investment.
Using data from the Western PA Regional Data Center, I was able to show not only the volume of sales that have occurred, but also how those sales were distributed across space and time.
I found that investor activity has increased steadily in Hazelwood and has been concentrated around areas of recently announced planned development, including an Uber autonomous vehicle test track and a closed school whose future was uncertain for a time. For the full analysis, view it here.
Food Desert Mapping
This project used ArcGIS Pro to analyze food buffers in Plano, TX.
This is a story map showing 1-mile buffers around retail grocery stores and wholesale clubs located in Plano, TX to indicate areas of the city that are defined as “food deserts” (i.e. not located within one mile of a grocery store).
All geographic and population data were provided by the U.S. Census Bureau, and grocery store location data are from the Reference USA database.
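The buffer logic described above can be sketched without GIS software. The actual project used ArcGIS Pro buffers over areas of the city; the Python below is a simplified stand-in that tests single points using great-circle (haversine) distance. The function names and all coordinates are hypothetical placeholders, not real store locations.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_food_desert(point, stores, radius_miles=1.0):
    """True if no store lies within radius_miles of the point."""
    lat, lon = point
    return all(
        haversine_miles(lat, lon, s_lat, s_lon) > radius_miles
        for s_lat, s_lon in stores
    )


if __name__ == "__main__":
    # Placeholder coordinates roughly in the Plano, TX area (illustrative only).
    stores = [(33.020, -96.700), (33.050, -96.750)]
    for point in [(33.025, -96.700), (33.100, -96.650)]:
        print(point, "food desert" if is_food_desert(point, stores) else "served")
```

A point-based check like this is cruder than polygon buffers, but it applies the same one-mile rule used to define a food desert in this analysis.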
This analysis looks at the relationship between the amount of personal crime and average temperatures in the city of Pittsburgh. Studies have shown that higher temperatures can lead to higher numbers of violent crimes in an area. I was particularly curious about this relationship in light of the unseasonably warm weather that most of the East Coast enjoyed during the winter of 2015-2016. Using data gathered from the Western PA Regional Data Center and the National Climatic Data Center, I first plotted average daily temperatures against the total number of personal crimes (assaults, public drunkenness, etc.) to illustrate the relationship visually, and found that there is a statistically significant relationship between the two.
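One common way to quantify a relationship like the one described above is the Pearson correlation coefficient together with its t statistic. The write-up does not state which test was actually used, so this is only a sketch of one possible approach. The function names are hypothetical, and the sample temperatures and crime counts are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r != 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r * r))


if __name__ == "__main__":
    # Invented daily values, for illustration only.
    avg_temp_f = [28, 35, 41, 47, 52, 60, 66]
    personal_crimes = [14, 13, 18, 17, 22, 21, 27]
    r = pearson_r(avg_temp_f, personal_crimes)
    t = t_statistic(r, len(avg_temp_f))
    print(f"r = {r:.3f}, t = {t:.3f}, df = {len(avg_temp_f) - 2}")
```

The t statistic would then be compared against a t distribution with n - 2 degrees of freedom to judge significance. A significant correlation describes association only and does not by itself show that warmer weather causes more crime.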
The personal sensing project was an opportunity to work with mobile phone data for the first time. Over a period of two weeks in January 2015, my movements and cell phone usage were tracked to provide data for a study being done by the Human-Computer Interaction Institute at Carnegie Mellon. We were then given our own data and were able to create interesting visualizations. I used this data to plot my hour-by-hour movements on a map, using color to represent whether I was on foot, in a vehicle, or actively using my phone. You can view the full interactive map here.
Crime in Neighborhoods Surrounding CMU
For this project, I looked at crime in Pittsburgh during the month of January 2016 in neighborhoods surrounding Carnegie Mellon. All data were made available by the Western PA Regional Data Center, a regional data repository that hosts various publicly available datasets, including crime, 311, and property assessments. Each red caution icon represents an offense for which no arrest was made, and each police icon represents an arrest.
This public transit study sought to identify areas in Allegheny County that have high public transit ridership and to measure the percentage difference in commute time between those who use public transit and those who commute by private vehicle.
I worked in a team to analyze the impact that “shocks” (such as natural disasters or new technology implementation like cell phone apps) can have on the ability to use New York City 311 data for prediction of future events. For this analysis, we used Naïve Bayes, Bayes Nets, and decision tree models and found that, since the introduction of the ability to submit 311 reports through a cell phone app, there has been a significantly higher proportion of “visual” complaints (e.g. complaints that are likely to be reported while someone is outside of their home, such as a pothole) than “non-visual” complaints (e.g. heating or apartment maintenance requests).