The ideal hypothesis:
- Has a basis in a reasonable model, whether engineering, physical, economic, or otherwise.
- Is as simple as possible in terms of the number of variables, i.e., Occam's Razor has been applied.
- Either has been vetted against a number of other hypotheses and selected as the most reasonable, or will be tested along with other reasonable hypotheses.
- Will be tested in the gold standard, the randomized controlled experiment.
- Is actionable.
Basis in a Model
As discussed in my causality blog entry, the only way to assign causality is to develop a rational model of how things really work; it cannot come just from the output of some multivariate correlation done in R. The best hypotheses are rooted in causation, though it is of course possible to hypothesize any conjecture at all, including one drawn from statistical correlations discovered during data exploration. Discovery from data as a source of hypotheses is better than pulling them from thin air, but hypothesizing from a model is best of all. Hypothesizing from data is called induction and hypothesizing from a model is called deduction.
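To make the gap between correlation and causation concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than anything taken from real data: a hidden driver `z` influences both `x` and `y`, while `x` has no effect on `y` at all. In the observational data the two still correlate strongly; once `x` is set at random, independently of `z`, the correlation should largely disappear.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=0):
    """Return (observational correlation, randomized correlation) of x and y."""
    rng = random.Random(seed)
    # Hypothetical hidden driver that a causal model would make explicit.
    z = [rng.gauss(0, 1) for _ in range(n)]
    # Observational data: x and y both depend on z, but not on each other.
    x_observed = [zi + rng.gauss(0, 0.5) for zi in z]
    y_observed = [zi + rng.gauss(0, 0.5) for zi in z]
    # "Experiment": x is assigned at random, so it no longer tracks z.
    x_randomized = [rng.gauss(0, 1) for _ in range(n)]
    y_after = [zi + rng.gauss(0, 0.5) for zi in z]
    return pearson(x_observed, y_observed), pearson(x_randomized, y_after)


observational, randomized = simulate()
print(f"observational correlation: {observational:.2f}")
print(f"randomized correlation:    {randomized:.2f}")
```

A correlation-only analysis of the observational columns would suggest acting on `x` to move `y`; a model that includes `z` would not.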
Simplicity
The fewer the variables, the stronger and more robust the hypothesis, meaning the more likely it is to hold up under a variety of conditions. E.g., suppose we induce from data exploration the hypothesis that teenage girls who use Twitter like Justin Bieber. A stronger hypothesis (if it turns out to be true) would drop the Twitter condition, not only because that broadens the potential market for Justin Bieber products, but also because the hypothesis is then more resilient in varied circumstances, such as a time when (assuming some sort of unlikely calamity befalls Twitter) Twitter is no longer popular and something else takes its place.
Vetted Against Competing Hypotheses
When forming a hypothesis, it is important to brainstorm as many different plausible hypotheses as possible, from a variety of sources:
- As with conventional brainstorming, ask fellow team members and associates for their creative hypotheses.
- Formulate as complete a model as possible, and from that model identify explanations. E.g., when modeling a consumer:
  - What is the consumer's budget?
  - What is the pay schedule of the consumer?
  - Are there upcoming holidays that would either enhance purchases (in anticipation) or hinder them (due to store or bank closures)?
  - What products complement the products the consumer already owns?
  - What products would enhance the social standing of the consumer?
  - Does the consumer carry credit cards that are accepted?
  - Is the consumer a student?
- Identify leading hypotheses and test them. This is easier said than done. "Identifying" is a nice way of saying "going with a hunch," because the alternative, testing every candidate, is very expensive if done by the gold standard, the controlled randomized experiment.
Controlled Randomized Experiment
Controlled randomized experiments are the gold standard, but they are expensive and time-consuming. It is much more convenient and quicker to find and test correlations in existing data sets, but such correlations are fraught with problems: the population is not randomized across the independent variables of the new hypothesis, limited data for training vs. testing effectively turns test data into training data, experimental conditions differ, etc.
But from a practical standpoint, "quasi-experiments" (experiments from an existing data set) are the general rule and true "experiments" are, realistically, the exception. Compensating for the shortcomings of quasi-experiments will be the subject of a future article.
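One of the pitfalls of mining an existing data set, test data effectively becoming training data, can be illustrated with a small Python sketch. This is one reading of that pitfall, under stated assumptions: every candidate "feature" below is pure noise, the row and feature counts are arbitrary, and the function name `best_noise_feature` is made up for the illustration. Picking the best-looking of many hypotheses on one small data set, and then reporting its score on that same data, tends to produce an impressive correlation that should not survive fresh data.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_noise_feature(n_rows=50, n_features=200, n_fresh=2000, seed=1):
    """Return (|r| on the reused data, |r| on fresh data) for the 'winner'."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    # Every candidate hypothesis is noise: none is truly related to outcome.
    features = [[rng.gauss(0, 1) for _ in range(n_rows)]
                for _ in range(n_features)]
    # Selecting the winner on this data uses it as training data ...
    winner = max(features, key=lambda f: abs(pearson(f, outcome)))
    # ... so scoring the winner on the same rows is not an honest test.
    reused_score = abs(pearson(winner, outcome))
    # Fresh draws of the same (unrelated) feature and outcome.
    fresh_feature = [rng.gauss(0, 1) for _ in range(n_fresh)]
    fresh_outcome = [rng.gauss(0, 1) for _ in range(n_fresh)]
    fresh_score = abs(pearson(fresh_feature, fresh_outcome))
    return reused_score, fresh_score


reused, fresh = best_noise_feature()
print(f"|r| on the data used for selection: {reused:.2f}")
print(f"|r| on fresh data:                  {fresh:.2f}")
```

The more hypotheses are screened against the same limited data, the larger the first number tends to get, which is one reason a short list of model-based hypotheses is cheaper to trust than a long list of mined ones.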
Actionable
You can have the most interesting, perhaps even insightful, hypothesis, but if there is no reasonable course of action to take once it is proven, it's a waste of time to prove it.
Conclusion
Good hypothesis formation:
- Avoids wasting time testing bad hypotheses.
- Saves time that can be redirected toward testing the best hypotheses, including testing hypotheses adjacent to the leading hypotheses to avoid spurious correlations.
- Results in more resilient, more actionable insights.