Excerpt from Correlation Is Not Causation by Lee Baker, available in its entirety at Smashwords

Correlation Is Not Causation

Learn How to Avoid the 5 Traps That Even Pros Fall Into


Correlation Is Not Causation (2nd Edition)

By Lee Baker

Copyright 2018 Lee Baker

Smashwords Edition

Thank you for downloading this ebook. You are welcome to share it with your friends.

This book may be reproduced, copied and distributed for non-commercial purposes, provided the book remains in its complete original form.

If you enjoyed this book, please return to your favourite ebook retailer to discover other works by this author.

Thank you for your support.



The Nature of Correlation and Causation

Correlation and Causation – Definitions

Direct Causation

1. Wrong Direction Causation

2. The Third Cause Fallacy

3. Indirect Causation

4. Cyclic Causation

5. Coincidental Causation

Avoiding the 5 Traps of Correlation v Causation

The Dangers of Declaring Direct Causation

1. Slaying the Beast of Wrong Direction Causation

2. Slaying the Beast of The Third Cause Fallacy

3. Slaying the Beast of Indirect Causation

4. Slaying the Beast of Cyclic Causation

5. Slaying the Beast of Coincidental Causation


About the Author

Claim Your FREE eBook Now!

Leave a Review


As this book is hot off the press, it hasn’t had time to gather any reviews or testimonials yet. Not to worry – when it’s got some lovely reviews I’ll add them here.

So chop chop – if you get a move on and leave a juicy review you might end up getting a mention right here!

I’ll remind you at the end of the book to leave a review or testimonial.

For now, enjoy!

Claim Your FREE eBook Now!

Beginner’s Guide to Correlation Analysis is the sister book to Correlation Is Not Causation, and shows you why your correlation results are probably wrong!

Download your FREE copy right here:



We have a problem with storytelling.

It’s not our fault though – as human beings we are hard-wired from birth to look for patterns and explain why they happen. This problem doesn’t go away when we grow up, though; it becomes worse the more intelligent we think we are. We convince ourselves that now we are older, wiser and smarter, our conclusions are closer to the mark than when we were younger (the faster the wind blows the faster the windmill blades turn, not the other way around).

We get better at this when we learn about statistics and correlations though, don’t we?

How many times have you heard that ‘correlation does not imply causation’? Many times, I’m sure. So you know not to say things like ‘wow, this correlation between A and B has a p-value of 0.000001 so A must be causing B…’. I wish I had a pound for every time a very experienced, intelligent scientist with PhDs and titles and stuff has said this.

“Yeah, but just because you’ve got strong evidence of a correlation…” I say, “…it doesn’t mean that A causes B”.

“But look at that small p-value”, they say, “surely when the p-value is so small, then A must be causing B…”.

Well, no it doesn’t. Not at all.

You see what I mean about our storytelling problem? Even really smart people see a pattern and insist on putting an explanation to it, even when they don’t have enough information to reach such a conclusion. They can’t help it.

OK, so if correlation does not necessarily imply causation, there must be a reason for that, and there must be something that is causing what we observe. That is what this book is all about.

If we uncover a correlation between A and B, there are five alternatives to A being the direct cause of B:

  • Wrong Direction Causation

  • The Third Cause Fallacy

  • Indirect Causation

  • Cyclic Causation

  • Coincidental Causation

Once we understand each of these alternatives, we can formulate a plan to discover whether we have a direct causal link between A and B or whether there is some other explanation.

The Nature of Correlation and Causation

We are constantly being reminded that “correlation doesn’t imply causation”, and yet so often in our daily lives we see a pair of events, by common sense link them together and declare that the first event must have caused the other.

At times we might whimsically moan that it always rains right after washing the car, or that the phone only seems to ring when you step into the bath. Does the act of washing the car cause it to rain? Did the phone ring because you got in the bath? Of course not!

This is the thing about being human. We seek explanation for the events that happen around us. If something defies logic, we try to find a reason why it might make sense. If something doesn’t add up, we make it up.

This is very familiar in the world of scientific exploration. A researcher, on analysing her data, discovers an unusual correlation between a pair of variables. What she should do is be sceptical and try to find a reason why this correlation is not plausible, but she doesn’t. She tries to find a reason to support the correlation, a reason why one event caused the other, allowing her to declare a new discovery.

The bottom line is that she’s biased. She can’t help it. We all are – it’s human nature.

This is why we need to arm ourselves with the knowledge and the tools to analyse why correlation sometimes implies causation, but most of the time does not.

Correlation and Causation – Definitions

Let’s start by defining what correlation and causation each are. Up first, correlation:

Did you notice that there is no mention of the word ‘causation’ in the definition? This is our first clue that correlation and causation may not necessarily be connected.

OK, so let’s move on to a definition of causation:

Here there is no mention of the word ‘correlation’.

On the other hand, the word ‘relationship’ is mentioned in both definitions. The difference is that in a causal connection between variables, the nature of the relationship is such that one causes the other. For this to happen, the cause must precede the effect – chronologically the cause must come first – and the effect happens some time later as a direct consequence. Therefore there must be an element of time involved.

For example, being stabbed in the chest causes severe blood loss. Conversely, suffering severe blood loss does not cause you to get stabbed in the chest. One event is the cause (being stabbed) and the other (blood loss) is the effect.

When we talk about correlations though, statistical correlation tests do not involve the element of time. The relationship between variables is just that – a relationship – and nothing implies a passage of time. Nothing suggests that one event precedes another, and therefore there is nothing to suggest a cause-and-effect relationship. There may be one, but then again there may not.
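To see why a correlation coefficient on its own cannot point from cause to effect, here is a minimal sketch in Python. The `pearson_r` helper and the height and weight figures are invented for illustration and do not come from the book. Swapping the two variables gives the same coefficient, so nothing in the number says which variable came first.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented illustrative data: heights (cm) and weights (kg) of six children.
heights = [120, 125, 130, 135, 140, 145]
weights = [23, 25, 28, 31, 33, 38]

r_height_weight = pearson_r(heights, weights)
r_weight_height = pearson_r(weights, heights)

# Both orderings print the same value: the coefficient is symmetric.
print(r_height_weight, r_weight_height)
```

Whichever variable you list first, the answer is the same, which is why a correlation test cannot tell you about the order of events.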

This is not to say that correlations are useless. They might not be able to tell us about cause-and-effect, but they can tell us how observations are related to one another. In essence, we use correlations to find out about relationships, but this is just a precursor step to running some carefully constructed trials – trials that do involve time and can tell us which events are the cause and which are the effect.

Before we do set up lengthy and expensive trials though, we need to be sure that we’ve understood our correlations as well as we can. If we are careful with our analyses, we can identify the relationships that are our prime candidates for cause-and-effect trials, but if we are not, then we are likely to get the wrong answers.

There are five typical correlation-causation traps that even experienced professionals fall into from time to time. We’ll have a look at these in a moment, but first, let’s look at what we mean when we describe observations as having a direct causal relationship.

Direct Causation

When we analyse our data and find a correlation between A and B, our first thought, naturally, is that A causes B (A → B).

This may be the case, but more often than not it won’t be. As I mentioned earlier, there are five reasons why this might not be the case, and it is important to assess each of them in turn. By diligent investigation, if we can eliminate all of the 5 alternatives then, and only then, can we be confident that A might cause B – but we can never be entirely sure, and we’ll find out why.

1. Wrong Direction Causation

The name says it all, really. Wrong Direction Causation, also known as Reverse Causation, is where cause and effect seem to be reversed: B causes A (B → A).

Let’s have a look at a couple of examples to understand how this might come about.

You find that there is a correlation between height and weight of schoolchildren. Which would be the likely causal factor, height or weight?

It would be preposterous to suggest that the heavier a child is the taller they will grow (unless you’re conducting a study on malnourishment). This would be an example of Wrong Direction Causation.

More likely, the effect is the other way around, that the taller they are the heavier they would be – an example of Right Direction Causation (if such a thing existed).

On the other hand, a historical example of Wrong Direction Causation is that Europeans in the Middle Ages believed that lice were beneficial to your health. They had observed that healthy people were infested with lice, whereas few, if any, lice were to be found on sick people. They reasoned that people got sick because the lice left, ergo lice keep you healthy.

Of course, the reverse is true. Lice are extremely sensitive to body temperature and left when their host began to develop a fever, even before the host noticed their symptoms. The observation was that the lice left, then the illness started; however, the illness was the cause that had the effect of the lice leaving. Only later did the fever symptoms appear.

2. The Third Cause Fallacy

Very common in statistics is the situation where a correlation has been discovered between A and B, but in fact A and B do not cause each other; both are consequences of a common cause, X (A ← X → B). This is known as the Third Cause Fallacy.

The variable X is known as a Confounding Variable (a variable that is present in the study, so can be discovered and corrected for) or a Lurking Variable (one that is not part of the study, so cannot be discovered, identified or corrected for).

A classic example of this is to be found in epidemiology.

Let’s say that in your analysis you find that there is a correlation between people who have lung cancer and people carrying matches.

The obvious inference is that matches cause lung cancer, but we all know that this explanation is probably not true. So we need to look deeper into the dataset to see what is responsible for this result, to find out which particular variable is likely to confound (cause surprise or confusion in) this result.

Digging deeper we find that smoking is significantly correlated with both lung cancer and carrying matches. The smoking variable now causes confusion in the relationship between matches and lung cancer, confounding our initial observations. The observation that there is a relationship between matches and lung cancer has become distorted because both are correlated with smoking. In this example smoking is called the confounding variable.

If you were to take your matches and lung cancer data and split them into separate layers (data subsets or strata), one for ‘smoking’ and the other for ‘non-smoking’, the correlation between matches and lung cancer can be tested within the smoking population and within the non-smoking population separately.

You will find that there is no correlation between matches and lung cancer in either of the layers.

People carry matches because they smoke and smoking causes lung cancer; it is the third variable, smoking, that is the confounding variable and is the cause of both observations:

Of course, if you had not collected smoking data in your study, you could not have discovered it. Smoking would not be a confounding variable but instead would be a lurking variable, and would really mess with your conclusions!
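The matches example can be played out with a small simulation. This is only a sketch under stated assumptions: the probabilities below are invented, smoking is made to drive both match-carrying and lung cancer, and matches have no effect on cancer at all. The `pearson_r` and `matches_cancer_r` helpers are hypothetical names.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the run is repeatable
people = []
for _ in range(20000):
    smokes = rng.random() < 0.30
    # Assumed probabilities: smoking raises the chance of both outcomes.
    matches = rng.random() < (0.80 if smokes else 0.10)
    cancer = rng.random() < (0.20 if smokes else 0.02)
    people.append((int(smokes), int(matches), int(cancer)))

def matches_cancer_r(rows):
    """Correlation between carrying matches and lung cancer in the given rows."""
    return pearson_r([m for _, m, _ in rows], [c for _, _, c in rows])

r_all = matches_cancer_r(people)
r_smokers = matches_cancer_r([p for p in people if p[0] == 1])
r_non_smokers = matches_cancer_r([p for p in people if p[0] == 0])

print(f"everyone:    r = {r_all:.3f}")
print(f"smokers:     r = {r_smokers:.3f}")
print(f"non-smokers: r = {r_non_smokers:.3f}")
```

In the pooled data matches and lung cancer are clearly correlated, yet within each stratum the correlation all but disappears, which is the signature of a confounding variable.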

3. Indirect Causation

Like the Third Cause Fallacy, Indirect Causation is also very common in statistics. A correlation has been found between A and B, but there actually exists an intermediate factor, X, such that A causes X, which in turn causes B (A → X → B).

Let’s go back to an earlier example, where the lice left the host because the host had contracted a virus. Were the lice directly affected by the virus? Did the virus attack the lice, causing the lice to leave? No. In this case, the virus caused the body temperature of the host to rise. Lice, being acutely sensitive to temperature, decided to leave and find an alternative host. Host temperature, then, was the intermediate factor.

4. Cyclic Causation

Very common in predator-prey relationships is cyclic causation, where A causes B and in turn B causes A (A ⇄ B).

For example, cheetah numbers on the African savannah affect gazelle numbers, and vice versa. When cheetahs become numerous relative to gazelles, gazelle numbers (the cheetahs’ food supply) fall. The resulting food shortage then reduces the cheetah population, giving the gazelle population a chance to recover.

5. Coincidental Causation

With coincidental causation, there is no connection between A and B; the significant correlation that you found in your statistical analysis is a coincidence.

In this case, even though your correlation p-value tells you that there is a significant correlation between A and B, the apparent relationship is not real and has arisen by chance.

I guess we’ve all heard about the p-value, but let’s rewind a little and review what it is and how well we understand it.

When you want to assess whether there is a relationship (correlation or association) between a pair of variables, what you do is make an assumption that there is not a relationship between them (we call this the null hypothesis). Then you test this assumption.

All relationship tests give you a number between 0 and 1 that tells you how compatible your data are with the null hypothesis. This number is called the p-value, and it is the probability of seeing a relationship at least as strong as the one in your data if the null hypothesis were correct. It is not the probability that the null hypothesis itself is correct.

If the p-value is large, your data are quite consistent with the null hypothesis and you have insufficient evidence to claim that there is a relationship between the variables. This is not the same as evidence that no relationship exists.

On the other hand, if the p-value is small, data like yours would be unlikely if the null hypothesis were correct. You can then reject the null hypothesis and conclude that there is evidence that a relationship may exist.

Typically, a p-value of 0.05 is used as the cut-off value between ‘do not reject the null hypothesis’ (anything larger than 0.05) and ‘reject it’ (smaller than 0.05).

If your test p-value is 0.15, a relationship at least this strong would turn up about 15% of the time even if the null hypothesis were correct. We do not reject the null hypothesis (since 0.15 is larger than 0.05) and we say that there is insufficient evidence to conclude that there is a relationship between the pair of variables.
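One way to make the p-value concrete is a permutation test, sketched below in Python. This is a swapped-in technique for illustration, not a test named in the book: it shuffles one variable many times to mimic a world in which the null hypothesis holds, then counts how often the shuffled data show a correlation at least as strong as the real one. The function names and the toy data are assumptions.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling breaks any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)

rng = random.Random(1)
x = [float(i) for i in range(30)]
related = [2.0 * v + rng.gauss(0, 5) for v in x]  # built to depend on x
unrelated = [rng.gauss(0, 5) for _ in x]          # built without reference to x

p_related = permutation_p_value(x, related)
p_unrelated = permutation_p_value(x, unrelated)
print(p_related, p_unrelated)
```

The p-value here is literally a count of how often chance alone matches or beats your result. It says nothing directly about how likely the null hypothesis is to be true.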

This cut-off value has become synonymous with Ronald A Fisher (the guy who created Fisher’s Exact Test, among other things), who said that:

In other words, if there really is no relationship between a pair of variables, a test with a 0.05 cut-off will still report a statistically significant result about 5% of the time. What you can say for sure is that the result is correct in this dataset, but you cannot be sure that it will be reproducible, because it may be one of those chance results (and that is assuming your analyses are correct and you’ve been diligent in seeking alternative causes for the relationship).

Another way of looking at it is to say that if you ran the experiment 100 times, you would get 100 variations of the result – each result would be slightly different, since your data would be a different sample each time. If there were really no relationship between the variables, on average about five of those 100 experiments would still show a significant correlation, purely by chance.

The odds may be in your favour, but there’s no getting away from the fact that sometimes, just sometimes, your significant correlation has occurred because of coincidence. Nothing more, nothing less.
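A short simulation shows coincidence at work. In this sketch every pair of variables is generated independently, so any ‘significant’ correlation is a fluke by construction. The p-value uses the Fisher z-transform, a standard normal approximation that I have chosen here for brevity; the sample sizes and function names are assumptions.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z-transform (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(7)
n_experiments, n_samples = 2000, 50
false_positives = 0
for _ in range(n_experiments):
    a = [rng.gauss(0, 1) for _ in range(n_samples)]
    b = [rng.gauss(0, 1) for _ in range(n_samples)]  # generated independently of a
    if approx_p_value(pearson_r(a, b), n_samples) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / n_experiments
print(f"{false_positive_rate:.1%} of unrelated pairs looked 'significant'")
```

Roughly one unrelated pair in twenty clears the 0.05 cut-off, which is exactly the rate of coincidental results that the cut-off allows.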


Looking over the virus-lice-host example from earlier, it’s interesting to note how our thinking evolved.

In the beginning, we thought there was a direct causal link between lice and health – if you had lice you were healthy, when the lice left you became unwell, therefore lice keep you healthy. Then we discovered that lice were leaving because the host had a virus, so we must have understood the relationship between lice and health the wrong way round – lice were leaving because the host was unhealthy. Then, when we discovered that lice were extremely sensitive to temperature we realised that they were leaving the host because the impending fever (that the host didn’t yet know about) was causing a rise in host temperature.

What we thought was a direct causal link became a reverse causation, became an indirect causation.

It’s really easy to believe your instincts and your preliminary results. The trouble is that we’re very easily deceived because we want to believe what we see. It’s only when we’re diligent enough to dig deeper that we discover that what we thought we knew may well have been wrong all along.

Test Yourself

Look at the following scenarios. Can you figure out why there is not a direct causal link, and which type of error is responsible?

  • Sleeping with your shoes on is strongly correlated with waking up with a headache. Therefore, sleeping with your shoes on causes headaches.

  • There is a direct positive correlation between drinking alcohol and earnings. Therefore, the more you drink the more money you will earn.

  • As ice cream sales increase, the rate of drowning deaths increases sharply. Therefore, ice cream consumption causes drowning.

Learn More

If you want to learn more about correlations, causation and statistical correlation tests, visit the companion resources webpage. Here you’ll find recommendations for relevant books, video courses and other useful stuff – and I update it regularly so you always have the best learning material that I can find.

You can find the resources webpage here:

Avoiding the 5 Traps of Correlation v Causation

As we’ve seen, there are five basic traps that we can fall into when making a link between correlation and causation, and you know the old saying – forewarned is forearmed. That’s not enough though. Knowing about the dangers won’t necessarily stop us falling into them.

It’s easy enough to make a causal link between height and weight in schoolchildren because the evidence is right there in front of our eyes. What we see could be deceiving us, but at least we can see what’s going on.

On the other hand, what about assessing the correlation between the expression of p53 and e-cadherin in breast cancer? We can’t actually see p53 and e-cadherin expression, so instead we use some very complicated methods to ‘stain’ tissue samples, then guesstimate how much of the stain ‘sticks’ to the sample. It’s all rather imprecise and lots of things can go wrong, which is why it has to be repeated many times.

If we find a correlation, how confident are we that there is a causal link between them? Since we can’t see what’s going on inside cancer cells, we can’t use our ‘best judgement’ to suggest whether correlation equals causation.

What is needed, of course, is some more statistical analysis, and that’s where we’re going now.

If our statistical analyses are incomplete we’ll likely fall prey to one of the five errors of correlation v causation, but if we’re diligent enough we can slay all five beasts and come out on top.

Just so you know, we’re not going to go into depth about individual statistical tests in this Bite-Size book (otherwise it won’t be Bite-Size any more). We’re concerned here only with the general concepts of correlation and causation, and with the general principles of how to avoid saying silly correlation-causation things that will make you look foolish in front of your boss.

If you want to learn more about the statistical mechanisms of correlations I suggest you visit the resources that accompany this book – you’ll find everything you need right there…

You can find the resources webpage here:

Ready? OK, let’s dive in…

The Dangers of Declaring Direct Causation

I hope you’ve figured out by now that quite a lot of the time what we think is a direct causal link between a pair of variables is nothing of the sort.

The question is how you assessed the correlation. If you did a univariate analysis between your pair of variables, such as a Pearson correlation, the result actually tells you a lot less than you think it does.

Let’s assume you have sufficient data to adequately test your hypothesis (this is often not the case), your study is well designed (it may not be) and your data is accurate (you’d be surprised at how often it isn’t). If you get a non-significant p-value (larger than 0.05), you have no good evidence of a relationship between your variables, and so no grounds for claiming that one causes the other. That’s not to say that one does not influence the other indirectly, it may do, but you have found nothing to suggest a direct causal link.

So far so good.

On the other hand, if you get a significant p-value (smaller than 0.05), the best you can say is that there may be a relationship between them. The relationship might be independent and it might not, and there could be a causal link, but equally there may not be. Is that vague enough for you?

Univariate statistical analysis is actually pretty poor at informing us whether a correlation between a pair of variables exists, and even worse at telling us about causal links.

What we need is to delve into the world of multivariate analysis to see if that can help us.

1. Slaying the Beast of Wrong Direction Causation

One feature of univariate statistics is that they don’t tell you the direction of the correlation. Remember the height and weight hypothesis of school kids? It’s pretty obvious to us that the taller you are the heavier you’re likely to be, but the opposite is not necessarily true. If one thing causes another, univariate analysis can help you find out if a correlation might exist, but it won’t tell you anything about which is the cause and which is the effect.

On the other hand, multivariate analysis can give us clues about the direction of the correlation.

In multivariate analysis, such as a Multiple Linear Regression (MLR), one or more variables are regressed against a single ‘target’ variable X, like this:

For Reverse Causation, the important thing is that the multivariate test is directional. It tells you which way the correlation runs.

If you want to better understand the relationship between A and X, and investigate the likelihood of whether A causes X or X causes A, first you need to find out if there is an independent correlation between them. For this you run the appropriate univariate tests. Then you use a multivariate test to assess A versus X with A as the target and then again with X as the target, like this:

It is important to note that at least one more predictor variable must also be used in these analyses, otherwise the multivariate analysis simply reduces to a univariate analysis.

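If there remains a significant relationship when X is the target (but not when A is the target), this is a hint that the correlation, and hence the causation (if it exists), runs from A to X and not from X to A. It is only a hint, though: to be sure of the direction you need data in which the cause is known to come before the effect.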

2. Slaying the Beast of The Third Cause Fallacy

The example of people carrying matches and lung cancer that I worked through earlier gives a clue about one way to solve this problem – you do a stratified analysis. In this example, we found a significant relationship between people who carried matches and lung cancer. Looking deeper, we found that this relationship did not hold true in the population (stratum) of those that smoked, nor in the population that didn’t.

Stratification has a drawback, though: the amount of data in each stratum shrinks as the depth of your strata increases. Imagine performing these analyses in the strata:

  • smoking

  • men that smoke

  • men over 50 that smoke

  • left-handed men over 50 that smoke

  • left-handed men over 50 with diabetes that smoke

To these factors, we can also add in fitness level, family history, body mass index, and a whole host of environmental and genetic factors. By the time you get to the bottom of these strata there will be very few samples left and in the time it’s taken you to do the analyses you will likely have grown old, grey and very frustrated!

Rather than do a stratified analysis, it is quicker and easier to use the more powerful multivariate analyses. Multivariate analysis allows you to test for relationships while simultaneously assessing the impact of multiple variables on the outcome without having to limit the pool of data.

It tells you of the various risk factors and their relative contribution to outcome, and gets round the issue of confounding by adjusting for it.

If you used smoking status and carrying matches as predictor variables in a multivariate analysis with lung cancer status as the target, you would find that smoking is significantly (p<0.05) and independently associated with lung cancer, while carrying matches is non-significant (p>0.05).

Remember though, you can only correct for confounding variables in this way – variables that are not part of the study (lurking) cannot be corrected for, so design your study carefully!
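As a rough stand-in for the multivariate analysis described above, the sketch below uses a first-order partial correlation, which measures the link between two variables after removing the linear effect of a third. To be clear, this is a swapped-in technique rather than a full regression, and the simulated probabilities and the `partial_r` helper are assumptions made for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation between x and y after adjusting both for z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(42)
smoking, matches, cancer = [], [], []
for _ in range(20000):
    s = rng.random() < 0.30
    smoking.append(int(s))
    # Assumed probabilities: smoking drives both match-carrying and cancer.
    matches.append(int(rng.random() < (0.80 if s else 0.10)))
    cancer.append(int(rng.random() < (0.20 if s else 0.02)))

r_unadjusted = pearson_r(matches, cancer)
r_matches_adjusted = partial_r(matches, cancer, smoking)
r_smoking_adjusted = partial_r(smoking, cancer, matches)

print(f"matches vs cancer, unadjusted:            {r_unadjusted:.3f}")
print(f"matches vs cancer, adjusted for smoking:  {r_matches_adjusted:.3f}")
print(f"smoking vs cancer, adjusted for matches:  {r_smoking_adjusted:.3f}")
```

Adjusting for smoking wipes out the matches relationship, while smoking keeps its link to lung cancer after adjusting for matches. All of the data are used in each calculation, so nothing is lost to ever-smaller strata.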

3. Slaying the Beast of Indirect Causation

With The Third Cause Fallacy, you can get a clue that something is not quite right by running a univariate analysis in your strata. However, the Indirect Causation problem is very difficult to identify by using univariate analysis alone.

If you have a correlation between A and B, it may appear that there is a direct causal link. If we now discover that A and B are each also correlated with another variable, X, it is a very complicated task to try to figure out by univariate analysis alone which (if any) of these correlations are direct and which are indirect.

For this, we again turn to multivariate analysis for the solution.

With multivariate analysis, the strategy for determining if any of these variables is an intermediate is quite straightforward. If B is the target variable, simply enter A and X as predictor variables. If X is an intermediate, the p-value for X will remain significant (meaning that X is independently related to B) and the p-value for A will become non-significant (A is not independently related to B).

In the earlier example of the host, the virus and the lice, in univariate analysis you would find a significant relationship between the presence of the virus and the temperature of the host, between the presence of the virus and the number of lice on the host, and between the temperature of the host and the number of lice present. Multivariate analysis will clearly show that there is a direct correlation between the host temperature and the numbers of lice present (significant p-value), and that the presence of the virus is not directly correlated to the number of lice present on the host (non-significant p-value).
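Here is a hedged sketch of the host, virus and lice example. The numbers are invented: the virus raises temperature, and lice numbers respond to temperature only. In place of a full multivariate regression it uses partial correlation, a closely related technique that I have swapped in to keep the code short. The `partial_r` helper and all parameter values are assumptions.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation between x and y after adjusting both for z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(3)
virus, temperature, lice = [], [], []
for _ in range(10000):
    has_virus = int(rng.random() < 0.5)
    temp = 37.0 + 2.0 * has_virus + rng.gauss(0, 0.5)        # virus raises temperature
    n_lice = 50.0 - 10.0 * (temp - 37.0) + rng.gauss(0, 3)   # lice react to temperature only
    virus.append(has_virus)
    temperature.append(temp)
    lice.append(n_lice)

r_virus_lice = pearson_r(virus, lice)
r_temp_lice_given_virus = partial_r(temperature, lice, virus)
r_virus_lice_given_temp = partial_r(virus, lice, temperature)

print(f"virus vs lice, unadjusted:                {r_virus_lice:.3f}")
print(f"temperature vs lice, adjusted for virus:  {r_temp_lice_given_virus:.3f}")
print(f"virus vs lice, adjusted for temperature:  {r_virus_lice_given_temp:.3f}")
```

On its own the virus looks strongly related to lice numbers, but once temperature is accounted for that relationship vanishes, while temperature stays strongly related to lice numbers. That is the pattern you would expect when temperature is the intermediate.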

4. Slaying the Beast of Cyclic Causation

Hmmm, this is a tough one.

Take the example of cheetahs and gazelles. At any given point in time, the ratio of cheetahs to gazelles will likely be different to any other point in time. This means that time plays a critical role in assessing if there is a cyclic cause to a particular relationship. Correlation tests do not take time into consideration, which makes them quite unsuitable for figuring out cyclicity (is that a real word or did I just make it up?).

Let’s plot the population sizes of cheetahs and gazelles over time on the same set of axes. We might end up with something that looks a little like this:

At first glance, it looks a little complicated, but let’s take it one step at a time.

The first thing to notice is that both populations are cyclical. That’ll give you your first clue as to whether there could be a cyclic causal connection in the relationship between cheetah and gazelle.

Next, notice that the peaks and troughs of the cheetah population size are offset from those of the gazelle.

What this means is that between certain time points the populations will be positively correlated, such as between B and C, where both populations are declining. This is because the cheetah population became too large, over-hunted the gazelle, causing a decrease in the gazelle population thereby decreasing their own food supply to the point where their own population began to dwindle. The gazelle population has not yet had a chance to recover.

On the other hand, there will be other periods where the cheetah and gazelle populations are negatively correlated. We can see this between points A and B, where the cheetah population is recovering from a lean period and is now beginning to over-hunt the gazelle. The cheetahs are on the rise, causing the gazelle population to fall. Similarly between points C and D the cheetahs, having over-hunted gazelle, are experiencing a decline in their numbers thereby giving the gazelle population an opportunity to recover.

If your study is a snapshot of a single period in time, you won’t detect this, but if you repeat your study across different periods, you should notice something odd about the relationship between cheetahs and gazelles.
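To illustrate how the sign of the correlation depends on when you look, here is a deliberately idealised sketch. It assumes the two populations follow smooth waves, with the cheetah peak a quarter of a cycle behind the gazelle peak. This is not a real population model, and every number in it is invented.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

n_points = 400  # evenly spaced observations across one full cycle
times = [2 * math.pi * i / n_points for i in range(n_points)]
gazelles = [100 + 50 * math.cos(t) for t in times]
cheetahs = [20 + 10 * math.sin(t) for t in times]  # peak a quarter-cycle after the gazelles

quarter = n_points // 4
# First quarter: cheetahs rising while gazelles fall.
r_cheetahs_rising = pearson_r(cheetahs[:quarter], gazelles[:quarter])
# Second quarter: both populations falling.
r_both_falling = pearson_r(cheetahs[quarter:2 * quarter], gazelles[quarter:2 * quarter])
# Whole cycle: the positive and negative stretches cancel out.
r_full_cycle = pearson_r(cheetahs, gazelles)

print(f"cheetahs rising, gazelles falling: r = {r_cheetahs_rising:.2f}")
print(f"both falling:                      r = {r_both_falling:.2f}")
print(f"full cycle:                        r = {r_full_cycle:.2f}")
```

A study confined to one window would report a strong negative correlation, a study in the next window a strong positive one, and a study spanning the whole cycle almost none at all.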

This is a pretty simple example, but what if we were talking about a cyclic causation that is in equilibrium? Here, your pair of variables will be correlated with each other, but the nature of the correlation will not change over time. This is an incredibly difficult thing to spot, and there are whole branches of very complicated mathematics dedicated to explaining and quantifying it.

If you suspect you might have a cyclic causal connection between a pair of variables, go and speak to an experienced statistician or a specialist mathematician. Or both. Trust me – you need all the help you can get!

And that’s all I’m going to say on cyclic causality, otherwise this Bite-Size book will turn into a PhD thesis…

5. Slaying the Beast of Coincidental Causation

Remember when we said that, even though you had a correlation with a significant p-value, the result might still be incorrect and arose because of coincidence?

I gave you a clue how to deal with this earlier when I said that the result is correct in this dataset, but may not be reproducible. I suspect you can guess where this is going – you need to repeat the experiment to see if your result stacks up!

If there is really no relationship between your variables and you do the experiment once, there is a 1 in 20 chance of getting a significant result by coincidence. On the other hand, if you repeat your experiment independently and confirm the result, the chance that coincidence produced both results drops to 1 in 400. I don’t know about you, but I’m starting to like those odds much better!

What would be the odds if you found the same result when you repeated the experiment again?

This is why repeatability is one of the central foundations of modern science – if you can’t repeat your result, forget it and move on…
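The arithmetic behind those odds is simple multiplication, sketched below. It rests on two assumptions: that there is really no relationship, and that each repeat of the experiment is independent of the others. The function name is hypothetical.

```python
from fractions import Fraction

def coincidence_chance(alpha, n_experiments):
    """Chance that n independent experiments are all significant by coincidence."""
    return alpha ** n_experiments

alpha = Fraction(1, 20)  # the 0.05 cut-off, kept as an exact fraction
for n in (1, 2, 3):
    chance = coincidence_chance(alpha, n)
    print(f"{n} experiment(s): 1 in {chance.denominator}")
# prints:
# 1 experiment(s): 1 in 20
# 2 experiment(s): 1 in 400
# 3 experiment(s): 1 in 8000
```

Each successful repeat multiplies the odds against coincidence by twenty, which is why a replicated result is so much more convincing than a one-off.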


When analysing your data you can never be absolutely sure that A causes B. Even if you’ve used multivariate analysis to figure out the direction of correlation, corrected for confounding and intermediate variables, and eliminated the possibility of cyclic causation, there’s nothing you can do about lurking variables or coincidental causation without a new hypothesis and new data.

If something has not yet been discovered, for example a new biomarker for cancer, you won’t have tested for it, detected it, measured it or used it in your analysis. Its lurking effects are present in your data but there’s no way of knowing until you collect data on it and include it in your analysis.

Similarly, if you repeat your analysis three times and have a one in 8000 chance of the result being due to coincidence, you still can’t declare A causes B, because there’s still that one in 8000 chance that it’s wrong.

This is why statistics can never prove causation – because it is based upon probabilities, not certainties. If you want to prove a cause-and-effect relationship, you need to set up a carefully thought through scientific study that is specifically designed to show that the cause must always precede the effect.

Despite all this, statistics is a much better tool than our intuition. As human beings we have a storytelling problem in that we’re very good at coming up with explanations for things we can’t really explain. Statistics, on the other hand, is great for pointing us in the right direction. It is extremely good at telling us that something isn’t true, and although it’s not very good at telling us that something is true, it does at least tell us how confident we can be about the result.

Correlation might not necessarily imply causation, but it does give you a helluva hint…

Learn More

Just a quick reminder that if you want to learn more about correlations, causation and statistical correlation tests, visit the companion resources webpage – it is updated regularly so you always have the best learning material that I can find.

You can find the resources webpage here:


Well, I hope you got something useful out of Correlation Is Not Causation.

We discovered that:

  • If there is a correlation between a pair of variables, there is a cause, but it’s not necessarily what you think it is

  • There are 5 reasons why your correlation might not be a direct causation

  • Correlation might not necessarily imply causation, but it does give you a helluva hint…

Next time you’re doing correlation and association analyses and you’re about to suggest a causal link, take a moment to consider that maybe, just maybe, your first instincts are leading you astray and you need to dig deeper to find out what’s going on with your data.

Remind yourself of the five reasons why your first instincts might be wrong. Rummage around in your data and stats. If something doesn’t look or feel right, keep digging until you find out what’s bothering you. There really is no substitute for getting your hands dirty!


About The Author

Lee Baker is an award-winning software creator who lives behind a keyboard in a darkened room. Illuminated only by the light from his monitor, he aspires to finding the light switch.

With decades of experience in science, statistics and artificial intelligence, he has a passion for telling stories with data. Despite explaining it a dozen times, his mother still doesn’t understand what he does for a living.

Insisting that data analysis is much simpler than we think it is, he authors friendly, easy-to-understand books that teach the fundamentals of data analysis and statistics.

His mission is to unleash your inner data ninja!

As the CEO of Chi-Squared Innovations, one day he’d like to retire to do something simpler, like crocodile wrestling.


Leave a Review

Thank you for reading Correlation Is Not Causation.

I hope you enjoyed reading it as much as I enjoyed writing it. If you did, please take a moment to leave a review. The best reviews will be featured at the beginning of the book.

Thank you!

Lee Baker
