Illustration: Angelica Alzona
Normally, we teach you how to avoid misinterpreting statistics, but knowing how numbers are manipulated can help you spot when it happens. To that end, we're going to show you how to make data say whatever the hell you want to back up any wrong idea you have.
It's Evil Week at Lifehacker, which means we're looking into less-than-seemly methods for getting shit done. We like to think we're shedding light on these tactics as a way to help you do the opposite, but if you are, in fact, evil, you might find this week unironically helpful. That's up to you.
Gather Sample Data That Adds Bias to Your Findings
The first step to building statistics is determining what you want to analyse. Statisticians refer to this as the "population". Then you define a sample: a subset of that population to collect which, when analysed, should be representative of the population as a whole. The larger and more representative the sample, the more precise your conclusions can be.
Of course, there are a few big ways to screw up this type of statistical sampling, either by accident or intentionally. If the sample data you gather is bad, you'll end up with false conclusions no matter what. There are a lot of ways you can mess up your data, but here are a few of the big ones:
- Self-Selection Bias: This type of bias occurs when the people or data points you're studying volunteer themselves into a group that isn't representative of your whole population (simulated in the sketch after this list). For example, when we ask our readers questions like "What's your favourite texting app?" we only get responses from people who choose to read Lifehacker. The results of an informal poll like this likely won't be representative of the population at large because all our readers are smarter, funnier and more attractive than the average person.
- Convenience Sampling: This bias occurs when a study analyses whatever data it has available, instead of trying to find representative data. For example, a pay TV news network might poll its viewers about a political candidate. Without polling people who watch other networks (or don't watch TV at all), it's impossible to say that the results of the poll would represent reality.
- Non-Response Bias: This happens when some people in a chosen sample don't respond to a statistical survey, skewing the results. For example, if a survey on sexual activity asked, "Have you ever cheated on your spouse?" some people may not want to admit to infidelity, making it look like cheating is rarer than it really is.
- Open-Access Polls: These types of polls allow anyone to submit answers and, in many cases, don't even verify that each person answers only once. While common, they're fundamentally biased because they don't attempt to control the input in any meaningful way. For example, online polls that just ask you to click your preferred option fall under this bias. While they can be fun and useful, they're not good at objectively proving a point.
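To see how much self-selection can skew a poll, here's a toy Python simulation. All the numbers in it are invented for illustration: a population of 10,000 people where 30% genuinely prefer a hypothetical "App A", and where App A fans are assumed to be three times as likely to respond to the poll.

```python
import random

random.seed(42)

# Hypothetical population of 10,000 people; 30% genuinely prefer "App A".
population = ["App A"] * 3000 + ["App B"] * 7000

# An unbiased simple random sample of 500 people.
random_sample = random.sample(population, 500)

# A self-selected sample: assume App A fans are three times as likely
# to bother responding (60% vs 20% response rates, invented numbers).
def responds(preference):
    return random.random() < (0.6 if preference == "App A" else 0.2)

self_selected = [p for p in random.sample(population, 2000) if responds(p)]

for label, sample in (("Random sample", random_sample),
                      ("Self-selected", self_selected)):
    share = sample.count("App A") / len(sample)
    print(f"{label}: {share:.1%} prefer App A (true share: 30.0%)")
```

The random sample lands near the true 30%, while the self-selected one reports roughly 56% support for App A, purely because of who chose to answer.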
Choose the Analysis That Supports Your Ideas
Since statistics use numbers, it's easy to assume that they're hard proof of the ideas they claim to support. In reality, the maths behind statistics is complex, and analysing the same data in different ways can yield different or even entirely contradictory conclusions. If you want to twist a statistic to suit your needs, fudge the maths.
Anscombe's quartet: four different charts that have nearly the exact same statistical summaries.
To demonstrate the flaws in relying on numerical analysis alone, statistician Francis Anscombe created Anscombe's quartet (diagrammed above). It consists of four datasets that, when graphed, show wildly different trends. The X1 chart shows a basic scatter plot with an upwards linear trend. X2 shows a curved trend that rises and then turns downward. X3 shows a tight upwards trend, but with one outlier high on the Y axis. X4 shows points that all share the same X value, save for one outlier that sits far out on both axes.
Here's where it gets crazy. For all four of these charts, the following statements are true:
- The average x value is 9 for each dataset
- The average y value is 7.50 for each dataset
- The variance for x is 11 and the variance for y is 4.12 for each dataset
- The correlation between x and y is 0.816 for each dataset
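Don't take our word for it. Here's a short Python sketch that recomputes those summaries from the quartet's published values, using the standard library's statistics module (the correlation function needs Python 3.10 or newer):

```python
from statistics import mean, variance, correlation  # correlation: Python 3.10+

# Anscombe's quartet: datasets I-III share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Each dataset prints essentially the same summary line.
for name, (xs, ys) in quartet.items():
    print(f"{name}: mean(x)={mean(xs):.2f}  mean(y)={mean(ys):.2f}  "
          f"var(x)={variance(xs):.2f}  var(y)={variance(ys):.2f}  "
          f"corr={correlation(xs, ys):.3f}")
```

All four rows come out nearly identical. Only a plot reveals how different the datasets actually are.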
Anscombe suggested that to avoid misleading people, you should always visualise your data before drawing conclusions and be aware of how outliers influence the analysis. It's hard to miss an outlier on a properly graphed chart, but it can have a massive yet invisible effect on your summary statistics. Of course, if your goal is to mislead people, you can just skip this step.
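To make that concrete, here's a small sketch using dataset III from the quartet above. Its points lie almost perfectly on a straight line except for one outlier, and dropping that single point changes the correlation dramatically:

```python
from statistics import correlation  # Python 3.10+

# Anscombe's dataset III: near-perfectly linear except one outlier.
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Drop the single outlier at (13, 12.74) and compare correlations.
pairs = [(a, b) for a, b in zip(xs, ys) if (a, b) != (13, 12.74)]
tx, ty = zip(*pairs)
print(f"with outlier:    corr={correlation(xs, ys):.3f}")  # ~0.816
print(f"without outlier: corr={correlation(tx, ty):.3f}")  # ~1.000
```

One point out of eleven drags the correlation from a near-perfect 1.0 down to 0.816, and nothing in the summary numbers alone would tell you why.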