Bad science can damage your reputation—as Pons and Fleischmann discovered—and it costs companies real money: faulty forecasts lead to hundred-million-dollar capacity planning errors, and incorrect analyses send marketers to advertise in ineffective channels. Verification, Validation, and Uncertainty Quantification (VVUQ) is a framework from the modeling and simulation community for thinking about the correctness of scientific models. Using this framework will help you have more confidence in your results and avoid costly blunders.
There are two rules of VVUQ:
- Provide evidence of correctness for every model and prediction.
- In the absence of this evidence, assume that the results are false.
In The Logic of Scientific Discovery (1959), Karl Popper argued that scientific methodology must focus on falsifiability: you cannot prove that a model is correct—you can only disprove it, or fail to disprove it. Consequently, VVUQ requires that you provide evidence of correctness appropriate to the decision at hand. For example: you would demand a different level of evidence for a direct mail campaign than for a nuclear reactor, because the costs of being wrong are radically different.
This approach is inductive by nature: we furnish evidence to assert that a general condition—the correctness of the model—is likely to be true. This differs from the scientific method, which is deductive and proceeds from general principles, like conservation laws, to specific results, like the Hall effect.
VVUQ, as the name suggests, consists of three steps, according to Verification and Validation in Scientific Computing, by William L. Oberkampf and Christopher J. Roy.
- Verification: “solving the equations right” – i.e., check that your code correctly implements the model.
- Validation: “solving the right equations” – i.e., check that the model is faithful to reality.
- Uncertainty quantification: “[the] process of identifying, characterizing, and quantifying those factors in an analysis which could affect accuracy of computational results” — i.e., check that possible sources of error won’t confound your results.
Let’s examine these steps in order.
Start with verification: your mission is to ensure that you’ve correctly implemented the model in code, whether that is SQL or a programming language like Python or C++.
Check Your SQL!
Checking your SQL (or whatever code you use to pull raw data) is crucial. If you assemble a garbage dataset, no amount of machine learning cleverness will save you. Seventy percent of the people I have interviewed for data-related jobs assume that if their SQL passes an explain plan, it must be correct. But an explain plan only tells you how a query will execute, not whether it answers the question you meant to ask. Here are some other things you should verify:
• Check simple cases you can compute by hand
• Check the join plan
• Check aggregate statistics
• Sanity-check the answer—is it compatible with reality?
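To make these checks concrete, here is a sketch in Python using an in-memory SQLite database (the `orders`/`users` schema is hypothetical): build a dataset small enough to compute the answer by hand, then confirm the join and aggregate behave as expected—including what happens to rows that don’t match the join.

```python
import sqlite3

# Hypothetical toy schema: verify a join + aggregate against a case
# small enough to compute by hand before running the query at scale.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    CREATE TABLE users  (user_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.0), (3, 2.0);
    INSERT INTO users  VALUES (1, 'west'), (2, 'east');
""")

rows = conn.execute("""
    SELECT u.region, SUM(o.amount)
    FROM orders o
    JOIN users u ON o.user_id = u.user_id
    GROUP BY u.region
    ORDER BY u.region
""").fetchall()

# By hand: east = 7.0; west = 10.0 + 5.0 = 15.0.
# User 3 has no users row, so an inner join must drop that order--
# if you expected 24.0 in total, the join type is wrong.
assert rows == [("east", 7.0), ("west", 15.0)]
```

A few rows like these catch the wrong join type or a bad join key immediately, long before the query runs against a billion-row table.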
Now that you have confidence that you have assembled the right dataset, it is time to verify your code.
Unit Test Your Code
Some people believe that code is correct if it produces a number instead of crashing, or that it’s impossible to test scientific code. This is patently false. You must verify your code—even if it appears to run correctly. It may be hard to test scientific code, but you must make the effort. Numerical code, especially, can be tricky to get right.
There are several ways to test:
• Write unit tests
• Write specification tests that check if different modules interface correctly
• Test special cases
• Test cases with analytic solutions
• Test that you can correctly estimate the parameters from a synthetic data set
Unit tests are a great place to start, especially when combined with Test Driven Development. Whatever language you prefer should have a unit test framework, which will enable you to test your code as you build it. Identifying bugs as you go will let you fix them when they are easiest to correct, and you can automate these tests to ensure that defects don’t creep back in as your team continues to write more code.
For example, it’s easy to implement something as simple as moving averages incorrectly:
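Here is a minimal sketch (the `trailing_mean` helper is hypothetical) that pins the window down with a hand-checkable unit test:

```python
import numpy as np

def trailing_mean(x, w):
    """Trailing moving average: y[t] averages x[t-w+1] .. x[t]."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(w) / w
    # A 'full' convolution truncated to len(x) keeps the window
    # trailing; mode='same' would silently center it instead.
    # The first w-1 outputs average a partial (edge) window.
    return np.convolve(x, kernel, mode="full")[: len(x)]

x = [1.0, 2.0, 3.0, 4.0]
y = trailing_mean(x, 2)
# Hand-checkable case: y[1] = (1+2)/2 = 1.5 and y[3] = (3+4)/2 = 3.5.
assert abs(y[1] - 1.5) < 1e-12 and abs(y[3] - 3.5) < 1e-12
```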
Are you off by one index? Is your moving average centered when it should be trailing? These problems are easy to miss if you don’t test. You may think that you don’t have time to test, but you don’t have time not to test!
Build A Pipeline
Similarly, building a good data pipeline facilitates testing and maintenance. Design your pipeline according to the Unix filters and pipes philosophy: each stage should do one thing and do it well, with known inputs and outputs, and be easy to combine with other stages. Then you can intervene between any two stages—either to verify that a stage functions correctly or to insert new stages to handle special cases, such as outliers, cleaning, aggregating to the right grain, and so on. And remember: an interface is a contract.
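As a sketch of that philosophy (stage names are hypothetical), each stage can be a pure function with known inputs and outputs, composed in sequence—so any stage can be tested in isolation or swapped out:

```python
from functools import reduce

# Hypothetical stages: each does one thing, takes rows in, gives
# rows out, and can be unit-tested or replaced independently.
def drop_outliers(rows):
    return [r for r in rows if abs(r["value"]) < 100]

def clean(rows):
    return [{**r, "value": round(r["value"], 2)} for r in rows]

def pipeline(rows, stages):
    """Apply each stage in order, like Unix pipes."""
    return reduce(lambda acc, stage: stage(acc), stages, rows)

raw = [{"value": 3.14159}, {"value": 1e6}]
out = pipeline(raw, [drop_outliers, clean])
assert out == [{"value": 3.14}]
```

Inserting a new stage—say, aggregation to a different grain—is just adding one more function to the list, without touching the others.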
I also check cases where I know the answer, such as special cases with known (analytic) solutions or where I know the data generating process (DGP). When possible, I generate a Monte Carlo dataset with a known DGP. Then I can check that my estimation strategy can recover the DGP’s parameters. These simulations can also indicate how important finite sample bias and other statistical problems are for your data and methodology.
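A minimal sketch of this idea (the DGP and numbers are illustrative): generate data from a known linear DGP and check that ordinary least squares recovers the true parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Known DGP: y = 2.0 + 3.0 * x + noise. If the estimator cannot
# recover (2.0, 3.0) from data we generated ourselves, it has no
# business running on real data.
n = 10_000
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# With n = 10,000 and noise sd 0.5, the standard errors are tiny,
# so the estimates should land very close to the true parameters.
assert abs(beta_hat[0] - 2.0) < 0.05
assert abs(beta_hat[1] - 3.0) < 0.05
```

Running the same simulation at smaller n also shows how quickly finite sample bias and variance shrink for your particular estimator.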
After you’ve ensured that your code correctly implements your model, validate how faithful your model is to reality. At this stage, you should worry about:
• What sources of uncertainty could affect your results?
• Do your assumptions hold? When do they fail?
• Does your model apply to the data/business problem?
• Where does your model break down? What are its limits? Where does it have power?
To answer these questions, you:
• Conduct an experiment
• Test the model’s assumptions
• Check for information leakage
• Perform specification testing for the model
Traditionally, practitioners run an experiment to check the fidelity of their models.
Popular methods include:
• A/B test
• Multi-armed bandit
• Bayesian A/B test
• Wald sequential analysis
No doubt you’re familiar with some of these approaches. Of these techniques, Wald sequential analysis is worth mentioning because it is not well-known. Use it to test whether you’ve collected enough data to conclude that you know the answer at some pre-specified confidence level or if you need to continue the experiment.
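Here is a sketch of Wald’s sequential probability ratio test for a Bernoulli conversion rate (the `sprt` helper and the hypothesized rates are illustrative):

```python
import math

def sprt(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for a Bernoulli rate.

    H0: rate = p0 vs H1: rate = p1. Instead of fixing n in advance,
    stop as soon as the log-likelihood ratio crosses a boundary.
    """
    lower = math.log(beta / (1 - alpha))
    upper = math.log((1 - beta) / alpha)
    llr = 0.0
    for i, x in enumerate(observations, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "keep sampling", len(observations)

# A run of conversions pushes the ratio up fast: decided after 3 obs.
decision, n_used = sprt([1] * 20, p0=0.05, p1=0.2)
assert decision == "accept H1" and n_used == 3
```

The appeal is exactly what the text describes: the test tells you, at a pre-specified confidence level, whether you already have enough data or must keep the experiment running.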
In addition to conducting experiments, you should compute statistical tests to provide evidence that the model’s assumptions are valid. When do they hold? When do they fail? Are the distributions of the features the same in the treatment and control groups? If you’re using linear regression, is endogeneity present, biasing your parameter estimates?
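For example, a two-sample Kolmogorov–Smirnov test (`scipy.stats.ks_2samp`) can flag a feature whose distribution differs between arms; the data below are simulated for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Did randomization actually balance a feature across arms? Compare
# its distribution in treatment vs control with a two-sample KS test.
treatment = rng.normal(35, 10, size=2000)   # hypothetical ages
control = rng.normal(35, 10, size=2000)     # drawn from the same DGP
skewed = rng.normal(45, 10, size=2000)      # a broken randomizer

_, p_balanced = ks_2samp(treatment, control)
_, p_skewed = ks_2samp(treatment, skewed)

# A tiny p-value flags that the arms differ: investigate before
# trusting the experiment's results.
assert p_skewed < 0.01
```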
Often data science deals with human behavior; consequently, endogeneity is likely to be present because people self-select based on their preferences. This selection bias will appear in many datasets you analyze. If you blindly throw all of the features into the black box of machine learning, turn the crank, and trust the results, your model will probably perform poorly on out-of-sample data because the label and features are co-determined by some unobserved factor. Beware of using endogenous features in your models!
Sometimes we must work with data where we don’t fully understand all of the features. In such cases it is particularly important to check for information leakage, where some of the features are derived from the label. One way to detect leakage is to build simple models with each individual feature: if a single feature predicts the label suspiciously well, you probably have leakage. Similarly, make sure you perform cross-validation across the appropriate margins of the data. In particular, for time series (and panel data) models, you must split the data temporally, not randomly. In general, beware of any model that performs too well out of the box.
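The single-feature probe can be sketched as follows (the threshold classifier and the simulated features are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Leakage probe: score each feature *alone* against the label.
# A feature that predicts the label almost perfectly is suspicious.
n = 5000
label = rng.integers(0, 2, size=n)
honest = rng.normal(size=n) + 0.3 * label       # weak, real signal
leaky = label + rng.normal(scale=0.01, size=n)  # derived from label

def solo_accuracy(feature, label):
    """Accuracy of a one-feature median-threshold classifier."""
    pred = (feature > np.median(feature)).astype(int)
    acc = (pred == label).mean()
    return max(acc, 1 - acc)  # direction of the split doesn't matter

# Near-perfect solo accuracy is a red flag for leakage; a modest
# edge over chance is what genuine signal usually looks like.
assert solo_accuracy(leaky, label) > 0.95
assert solo_accuracy(honest, label) < 0.70
```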
When working with a classical statistical model, you can also perform specification testing to test whether you can restrict a more complex model to a simpler one. The type of model will determine which test—likelihood ratio, score, or Wald—is best. You can also check for missing variables and incorrect functional form.
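As a sketch of a likelihood ratio test under Gaussian errors (data simulated so that the true model omits `x2`), twice the gap in maximized log-likelihoods is compared against a chi-squared distribution with one degree of freedom:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)

# True DGP omits x2, so restricting the full model (drop x2)
# should not be rejected by the likelihood ratio test.
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def gaussian_loglik(y, X):
    """Concentrated Gaussian log-likelihood of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)
    return -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)

ll_restricted = gaussian_loglik(y, np.column_stack([np.ones(n), x1]))
ll_full = gaussian_loglik(y, np.column_stack([np.ones(n), x1, x2]))

lr = 2 * (ll_full - ll_restricted)   # always >= 0: nesting
p_value = chi2.sf(lr, df=1)          # one restriction (beta2 = 0)
# A large p-value means the simpler model is an acceptable restriction.
```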
In short: experiment, test, and question your assumptions.
The final step is uncertainty quantification. Take a step back and think about what factors could confound your results. There are two primary types of uncertainty: aleatoric and epistemic. Aleatoric uncertainty arises from inherent randomness in the system, whereas epistemic uncertainty stems from ignorance about the true data generating process. Epistemic uncertainty is particularly vicious because it includes confounding factors we aren’t even aware of, and which are therefore not captured by the model.
The ongoing problems with the new San Francisco Bay Bridge are an expensive example of a project that ignored UQ. Most of the bridge design committee dismissed the concerns of its engineering-oriented members about using an untried design for an application to which it was ill-suited. Now Caltrans is spending millions to determine how reliable the bridge is, and to develop post-hoc fixes for bridge parts that are failing prematurely.