In 2008, Washington, D.C. became the first place in North America to launch a bike sharing program. Since then, bike sharing programs such as New York’s Citi Bike and San Francisco’s Bay Area Bike Share have popped up in major cities across the U.S. A popular new form of “public” transportation, bike share programs let residents and tourists enjoy the benefits of bicycle commuting (exercise, avoiding traffic, low cost) without the hassle of owning, maintaining, and, perhaps most importantly, parking a bicycle of their own.
Most bike shares have fleets of several hundred bicycles located at stations across their city. Members pay a monthly or yearly fee for free trips under 30 minutes and discounts on longer ones, while casual users (typically tourists) subsidize the system by paying more per ride. There are now more than 30 bike sharing programs in North America.
One of the biggest problems that bike shares face is maintaining the right number of bikes at every station. Each station needs enough bikes to meet demand, but not so many that it fills up and leaves no room for riders to return bikes currently in use. Most programs employ dispatchers to move bikes from one station to another, but it’s often difficult to tell exactly where the bikes should go.
Seattle’s Pronto is one such program. With some 500 bikes across 50 stations throughout the city, Pronto often struggles to keep enough, but not too many, bikes at each station. To help with the problem, Galvanize data scientist in residence Evan Sadler built a model to predict bike traffic across the city. It predicts the number of available bikes at a given station by modeling the hourly incoming supply and outgoing demand. Essentially, it reveals the natural flow of bicycles.
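To make the supply-and-demand idea concrete, here is a minimal sketch of how a station's bike count could be projected forward from hourly predictions. It is not Sadler's code: the function name, its parameters, and the sample numbers are all invented for illustration, and clamping the count to the number of docks is an assumption about how a full or empty station behaves.

```python
def project_bike_count(current_bikes, hourly_arrivals, hourly_departures, docks):
    """Project a station's bike count hour by hour.

    All names and inputs here are hypothetical. `hourly_arrivals` is the
    predicted incoming supply and `hourly_departures` the predicted
    outgoing demand for each upcoming hour.
    """
    counts = []
    bikes = current_bikes
    for arrivals, departures in zip(hourly_arrivals, hourly_departures):
        # Net flow for the hour: bikes returned minus bikes taken out.
        bikes = bikes + arrivals - departures
        # Assumption: a station cannot hold fewer than 0 bikes or more than its docks.
        bikes = max(0, min(docks, bikes))
        counts.append(bikes)
    return counts


# Invented example: a 12-dock station that currently holds 8 bikes.
forecast = project_bike_count(
    current_bikes=8,
    hourly_arrivals=[2, 1, 6],
    hourly_departures=[5, 7, 0],
    docks=12,
)
print(forecast)  # [5, 0, 6]
```

In this made-up case the station runs empty in the second hour, which is the kind of shortfall a dispatcher would want to know about in advance.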
“From clustering, I discovered two distinct ecosystems of bike stations—Seattle, and the University District—based on traffic flows from station to station,” Sadler said. “It turned out that having separate models for each lent itself to much better predictions.”
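The article does not say which clustering method Sadler used, so the sketch below swaps in a deliberately simple stand-in: stations are linked whenever enough trips run between them, and each connected group becomes a cluster. The station names, trip counts, and the `min_trips` threshold are all invented.

```python
def cluster_by_flow(trip_counts, min_trips):
    """Group stations linked by at least `min_trips` trips on some route.

    A simple connected-components stand-in (union-find), not the
    clustering method used in the project described in this article.
    """
    parent = {}

    def find(station):
        parent.setdefault(station, station)
        while parent[station] != station:
            parent[station] = parent[parent[station]]  # path compression
            station = parent[station]
        return station

    for (origin, destination), trips in trip_counts.items():
        root_a, root_b = find(origin), find(destination)
        if trips >= min_trips and root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    clusters = {}
    for station in list(parent):
        clusters.setdefault(find(station), set()).add(station)
    return sorted(clusters.values(), key=lambda members: sorted(members))


# Invented trip counts keyed by (origin, destination).
trip_counts = {
    ("downtown_a", "downtown_b"): 120,
    ("downtown_b", "downtown_c"): 90,
    ("udistrict_a", "udistrict_b"): 75,
    ("udistrict_b", "udistrict_a"): 60,
    ("downtown_c", "udistrict_a"): 4,  # too few trips to link the two groups
}
clusters = cluster_by_flow(trip_counts, min_trips=50)
```

With these invented numbers, the weak link between `downtown_c` and `udistrict_a` falls below the threshold, so the stations split into two groups, loosely mirroring the two ecosystems Sadler describes.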
Sadler modeled hourly supply and hourly demand separately for each of the two ecosystems, then combined the two predictions to estimate the change in each station’s bike count, starting from the current counts reported by the Pronto API. To do this, he used multiple random forest algorithms, each tuned for a specific task.
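A rough sketch of that arrangement, one supply forest and one demand forest per ecosystem, might look like the following. It assumes scikit-learn and NumPy are installed, trains on randomly generated stand-in data rather than Pronto trips, and uses invented feature columns and hyperparameters, so it shows only the shape of the approach, not Sadler's actual models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)


def make_synthetic_hours(n_rows):
    """Invented training data: hour of day, weekend flag, dock count."""
    features = np.column_stack([
        rng.integers(0, 24, n_rows),   # hour of day
        rng.integers(0, 2, n_rows),    # 1 if weekend, else 0
        rng.integers(10, 21, n_rows),  # number of docks
    ])
    arrivals = rng.poisson(3, n_rows)    # stand-in for hourly supply
    departures = rng.poisson(3, n_rows)  # stand-in for hourly demand
    return features, arrivals, departures


# One pair of forests per ecosystem (labels are hypothetical).
models = {}
for ecosystem in ["seattle", "university_district"]:
    X, arrivals, departures = make_synthetic_hours(200)
    models[ecosystem] = {
        "supply": RandomForestRegressor(n_estimators=50, random_state=0).fit(X, arrivals),
        "demand": RandomForestRegressor(n_estimators=50, random_state=0).fit(X, departures),
    }


def predict_net_change(ecosystem, features):
    """Predicted hourly change in bike count: supply minus demand."""
    forests = models[ecosystem]
    row = np.asarray([features])
    supply = forests["supply"].predict(row)[0]
    demand = forests["demand"].predict(row)[0]
    return supply - demand


# 8 a.m. on a weekday at a 16-dock station (invented inputs).
change = predict_net_change("seattle", [8, 0, 16])
```

Keeping the forests in a dictionary keyed by ecosystem and task makes it easy to tune each one independently, which is the property Sadler credits for the better predictions.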
“Having groups of smaller random forests worked much better than having a single large random forest try to predict everything,” Sadler said. “This is probably due to the different ecosystems having vastly different signals and different types of noise.”
The model is actually two models, one random forest for each ecosystem, and the branches of each are themselves composed of additional random forests. It draws on historical demand for the current season, the current hour, and whether it is currently a weekend. It also uses metadata about each station, such as elevation, size, and proximity to other stations. The model uses this information to discover patterns in ride usage, then makes its predictions from the patterns it finds.
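As an illustration of the inputs just described, the sketch below assembles one feature row from a timestamp and a station record. The field names, the month-to-season mapping, and the sample station are assumptions made for this example; the article does not give the model's actual feature definitions.

```python
from datetime import datetime


def season_of(month):
    """Map a month number to a season (a simple, assumed definition)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"


def build_features(timestamp, station):
    """Combine time-based features with station metadata (hypothetical fields)."""
    return {
        "season": season_of(timestamp.month),
        "hour": timestamp.hour,
        "is_weekend": timestamp.weekday() >= 5,  # Saturday is 5, Sunday is 6
        "elevation_m": station["elevation_m"],
        "docks": station["docks"],
        "nearest_station_km": station["nearest_station_km"],
    }


# Invented station record and timestamp (a Saturday afternoon in July).
station = {"elevation_m": 40.0, "docks": 16, "nearest_station_km": 0.3}
features = build_features(datetime(2015, 7, 18, 17, 30), station)
print(features["season"], features["hour"], features["is_weekend"])  # summer 17 True
```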
The information Sadler’s model provides is incredibly useful, but it could lead to entirely different bike shortage or surplus problems if misinterpreted. To that end, Sadler is currently working on a user-facing feature that will use machine learning to analyze the predicted bike counts and suggest where and when bikes should be manually shuffled around. He plans to make the model and app available for anyone to use, including Pronto’s bike dispatchers. Hopefully they’ll take him up on the offer, ensuring no Seattle rider finds themselves forced to hoof it again.