How the Bill & Melinda Gates Foundation Saves Lives with Python

Billions, if not trillions, of dollars are spent each year to help prevent death and disease, funding research and development of cures, preventative care, and other health initiatives. But the funding is not infinite, so organizations like the Bill & Melinda Gates Foundation, a leading contributor to global health funding, need to prioritize where to spend their money. The University of Washington’s Institute for Health Metrics and Evaluation (IHME) helps answer that question, and it does so using data science.

One of the major projects at IHME is the Global Burden of Disease (GBD), a “systematic scientific effort to quantify the comparative magnitude of health loss due to diseases, injuries, and risk factors.” It tracks more than 300 diseases and 50 risk factors across 188 countries, modeling data from 1990 to the present day and projecting into the future. In other words, the GBD paints a global picture of the ways people could be disabled or die.

For example, running the GBD for a developed country would give you a tree map that looks something like this.

[Tree map of GBD results for a developed country: communicable diseases in red, injuries in green, non-communicable diseases in blue]

The red section represents communicable diseases, things like LRIs (lower respiratory infections such as pneumonia), diarrhea, HIV, tuberculosis, and malaria. The green represents injuries, primarily made up of road traffic incidents, as well as other forms of violence such as wars and disasters. Finally, the large blue section is non-communicable diseases: ischemic heart disease, stroke, cancer, and many others.

The GBD is then combined with the Disease Control Priorities, a collection of projects that assesses costs, effectiveness, and access to healthcare around the world. Together, these two systems aim to advise policymakers and fundraising initiatives on where to prioritize their limited resources. Essentially, they help organizations like the Gates Foundation determine where they can get the most “bang for their buck”: where they can save the most lives.

IHME works with a lot of data. A single run of the GBD produces more than 25 TB of data, and some newer projects exceed a petabyte. To handle all this, IHME’s scientific computing group uses a 20,000-core Sun Grid Engine cluster, as well as a newer thousand-core cluster running Spark.

“We use Python because it’s free and open source,” said Kyle Foreman, IHME assistant director of scientific computing, during a talk at PyData Seattle 2015. “We’re a nonprofit and don’t want to spend a lot of money on software, but also we’re working with people around the world, and it’s great for collaborating.”

Python is scalable (critical for those massive datasets), allows for rapid prototyping, and is easy to use. That last point matters because IHME deals specifically with health metrics: many of its employees are epidemiologists and medical doctors, not statisticians or computer scientists. Resources such as dashboard integration and IPython notebooks are essential for helping these less technical collaborators.

Doing the Most Good

When looking at ways to help a country’s health, one of the biggest focuses is on determining cause of death. After all, you want to keep people from dying, and the best way to do that is to figure out what’s killing them. To estimate causes of death, IHME has developed a statistical model known as CODEm—short for Cause of Death Ensemble model.

“There’s a huge amount of debate on what the best modeling strategy is,” Foreman said. “Especially since we have such a huge collaborator network, we’ve hit upon an ensemble strategy to combine all these different approaches, selectively evaluate and pick the best, then combine them into an ensemble that sort of pleases everyone.”

For example: to create a model of ischemic heart disease in 65-year-olds in Russia, you might start with linear regression, incorporate covariates such as different risk factors, neighboring countries, and neighboring age groups, smooth it over with residual information, then run it through Gaussian process regression (using PyMC) to fit your data and make better estimates. Together, these techniques form a model for a particular set of covariates, one of possibly thousands of other models that are then selectively cross-validated into an ensemble model for, in this case, ischemic heart disease.
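PyMC’s API has changed considerably across versions, so here is a minimal NumPy-only sketch of the smoothing step, with made-up data: fit a first-stage linear regression, then smooth its residuals with Gaussian process regression.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=5.0, variance=1.0):
    # squared-exponential covariance between two sets of points
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_smooth(x_train, residuals, x_pred, noise=0.1):
    # posterior mean of a GP fit to the first-stage regression residuals
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_pred, x_train)
    return K_s @ np.linalg.solve(K, residuals)

# toy data: a mortality rate declining over time with a smooth wiggle
years = np.arange(1990, 2016, dtype=float)
rate = 0.02 - 0.0003 * (years - 1990) + 0.002 * np.sin((years - 1990) / 3)

coef = np.polyfit(years, rate, 1)        # stage 1: linear regression
trend = np.polyval(coef, years)
smoothed = trend + gp_smooth(years, rate - trend, years)  # stage 2: GP on residuals
```

The two-stage structure mirrors the description above: the regression captures the broad covariate-driven trend, and the GP borrows strength from nearby years to soak up structured residual variation.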

Another project is focused on disease modeling: the aptly named DisMod. There are many different ways to measure diseases: you could count the number of people who have the disease, how many people contracted it this year, or how long a person usually has it, for example. DisMod takes the epidemiological data, along with a few covariates, for a particular disease and runs it through a Bayesian meta-regression using PyMC to generate an internally consistent dataset.
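To see what “internally consistent” means, consider the steady-state relationship between those measures. The numbers here are invented for illustration:

```python
# At steady state, for a relatively rare disease, the measures must agree:
#   prevalence ≈ incidence rate × mean duration
incidence = 0.002        # new cases per person-year (assumed)
mean_duration = 5.0      # average years lived with the condition (assumed)

implied_prevalence = incidence * mean_duration   # 0.01, i.e. about 1% of people

# A survey reporting, say, 3% prevalence would be inconsistent with these
# inputs; the Bayesian meta-regression pulls all estimates toward a
# mutually compatible set.
```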

Here’s how it works: A generic disease model places a person into one of four categories: susceptible, with-condition, dead from the condition, and dead from a cause other than the condition. It then uses a series of differential equations to estimate the various probabilities (based on existing data, covariates, etc.) that someone will transition from one category to another.
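A minimal sketch of that compartment model, using made-up transition rates and a simple Euler integration (DisMod’s actual rates are estimated from data, not hard-coded):

```python
# Compartments: S = susceptible, C = with-condition,
# D = dead from the condition, M = dead from other causes.
# All rates below are illustrative, not real estimates.
incidence, remission, excess_mort, other_mort = 0.02, 0.01, 0.05, 0.01

def step(state, dt=0.1):
    # one Euler step of the transition differential equations
    S, C, D, M = state
    dS = -(incidence + other_mort) * S + remission * C
    dC = incidence * S - (remission + excess_mort + other_mort) * C
    dD = excess_mort * C
    dM = other_mort * (S + C)
    return (S + dS * dt, C + dC * dt, D + dD * dt, M + dM * dt)

state = (1.0, 0.0, 0.0, 0.0)     # everyone starts susceptible
for _ in range(1000):            # integrate 100 years in 0.1-year steps
    state = step(state)
```

Because every flow out of one compartment is a flow into another, the four proportions always sum to one, which is exactly the kind of internal consistency DisMod enforces on its inputs.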

Looking to the Future

With these fantastic death and disease models built, IHME is able to look to the future, attempting to forecast what’s likely to happen to global health under various scenarios.

IHME’s forecasting generates possible scenarios for every single thing tracked in the Global Burden of Disease, tracking the correlations between different conditions and risk factors. For example: a simulation in which both blood pressure and heart disease go up would also look at whether or not sodium consumption increased. Similarly, it could track a country’s income against tobacco consumption, mortality rate, and population. The simulation would run these correlations tens of thousands of times, resulting in a gigantic mess of information to apply your Python-based statistical models to.
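The correlated-draws idea can be sketched with NumPy. The quantities and correlation values below are invented for illustration, not IHME’s actual estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed correlation structure between three tracked quantities:
# sodium intake, blood pressure, heart disease incidence.
corr = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.7],
    [0.3, 0.7, 1.0],
])
means = np.array([3.5, 120.0, 0.05])     # g/day, mmHg, incidence (illustrative)
sds = np.array([0.8, 12.0, 0.01])
cov = corr * np.outer(sds, sds)          # covariance from correlations and spreads

# tens of thousands of correlated draws, one row per simulated scenario
draws = rng.multivariate_normal(means, cov, size=20000)
```

Each row is one internally coherent scenario: a draw with high blood pressure tends to also carry high sodium intake and high heart disease incidence, which is what lets the downstream models reason about them jointly rather than one at a time.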

These large-scale simulations generate a ton of data (more than a petabyte), so they need to be run in a modular, high-performance way. To do this, IHME uses a modular YAML schema to describe a directed acyclic graph: each quantity (GDP, mortality rate, etc.) is assigned to a node, while the edges between nodes describe how those quantities influence each other (using equations expressed via SymPy).
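Here is a hypothetical, standard-library-only sketch of that idea. IHME’s real schema uses YAML files and SymPy expressions; the node names and equations below are invented to show how a dependency graph of quantities gets evaluated in order:

```python
# Each node is a quantity; its "eq" says how upstream quantities influence it.
# (Invented example values: GDP drives down tobacco use, and both feed mortality.)
graph = {
    "gdp":       {"deps": [], "eq": lambda v: 1000.0},
    "tobacco":   {"deps": ["gdp"], "eq": lambda v: 50.0 - 0.01 * v["gdp"]},
    "mortality": {"deps": ["gdp", "tobacco"],
                  "eq": lambda v: 0.02 - 1e-6 * v["gdp"] + 1e-4 * v["tobacco"]},
}

def evaluate(graph):
    # topological evaluation: compute a node once all its dependencies are done
    values, pending = {}, dict(graph)
    while pending:
        for name, node in list(pending.items()):
            if all(d in values for d in node["deps"]):
                values[name] = node["eq"](values)
                del pending[name]
    return values

values = evaluate(graph)
```

Keeping the equations on the edges of a declarative graph, rather than buried in code, is what makes the system modular: swapping in a new relationship between two quantities means editing one schema entry, not rewriting the simulation.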

“The idea behind this,” Foreman said, “is Bill Gates wants to know where to spend his money, now, in order to save the most lives in 2040.”
