What are some of the pitfalls of using statistics?

In an older post on Quora, my short answer linked the output of data-based analysis to the need to make a recommendation or a decision:

A common problem is that people may use a statistical exercise to make a decision with a false sense of comfort that the decision is "correct" because it was "based on data".

... even if you have precise data and large samples, the impression that we have reached "the final answer" is too tempting, and almost never warranted. For example, suppose that you surveyed all economists in the world. Suppose that 95% of them said they "support policy X." What have we actually learned?

We do not know 1) how costly it would be to pick the wrong policy, 2) whether respondents relied on their priors and biases, or how (un)certain they are, and 3) whether the majority of respondents even reported their opinions truthfully.

About 1.5 years later, I still think that:

  • It is good to collect data and look at what it says.
  • But it can be very hard to decide what the relevant data is.
  • It is too easy to forget that the data may be biased.
  • It is easy to downplay the flaws of the analysis, especially if you crunched the numbers yourself.
  • It can be hard to justify leaving out a chart or other type of output once you have "done the work". Beware of the sunk cost fallacy.

So when we say that "it is good to collect data", does that imply that it is always better to have some data than to have no data? That's not so clear: I think it depends on how the data will "calibrate your confidence". If a table, chart, or regression makes you forget about the remaining uncertainty, it might have been better not to have the data in the first place.

It may sound obvious, but it bears saying: data analysis can produce hubris. A mindset that pushes you to always collect more information is better than most alternative approaches, but a decision does not automatically become correct just because it was based on a large dataset.
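
To make the point concrete, here is a minimal sketch in Python of how a large but biased sample can produce a very narrow confidence interval around the wrong number. Everything in it is an illustrative assumption rather than real data: the function name `biased_survey`, the true support rate of 50%, and the idea that supporters are twice as likely to answer the survey as everyone else.

```python
import math
import random


def biased_survey(n_respondents, true_rate=0.50, response_bias=2.0, seed=0):
    """Simulate a survey where supporters are over-represented among respondents.

    All parameters are hypothetical: `true_rate` is the real share of
    supporters in the population, and `response_bias` is how many times
    more likely a supporter is to respond than a non-supporter.
    Returns (estimated share, half-width of a naive 95% interval).
    """
    rng = random.Random(seed)

    # Share of supporters among those who actually respond.
    weighted_yes = true_rate * response_bias
    p_observed = weighted_yes / (weighted_yes + (1 - true_rate))

    supporters = sum(rng.random() < p_observed for _ in range(n_respondents))
    estimate = supporters / n_respondents

    # Normal-approximation interval that ignores the selection bias entirely.
    half_width = 1.96 * math.sqrt(estimate * (1 - estimate) / n_respondents)
    return estimate, half_width


for n in (100, 10_000, 1_000_000):
    estimate, half_width = biased_survey(n)
    print(f"n={n:>9,}: estimated support {estimate:.3f} +/- {half_width:.3f} (true rate 0.500)")
```

In this sketch, adding respondents shrinks the interval but does not move the estimate toward the true rate, because the bias is in who responds rather than in how many respond. The output looks more precise as the sample grows, which is exactly when it is most tempting to forget the remaining uncertainty.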

A few days ago, David Leonhardt wrote of this concern as well:

statistics — and its synonyms, like data — are both the most overrated and underrated kind of information. Overrated, especially now, because phrases like “big data” and “data journalism” are fashionable. Saying that a conclusion is based on big data seems to lend it a sheen of rigor that mere observation lacks.