Genome-wide association studies are unfit for causal inference

Genome-wide association studies (GWAS) are observational studies of a genome-wide set of genetic variants in different individuals, designed to see whether any variant is associated with a trait. This type of study is very new, and it shows how far computer science has come: we can now sequence the entire genome of hundreds of thousands of individuals, if not over a million, and study them together.

However, although these new studies are very interesting, one has to keep in mind that they are observational. In other words, they are correlational studies: they let us find which gene variants correlate with a given phenotype. But as anyone who has taken an introductory statistics class knows, correlation does not imply causation.

Despite all of this, to my surprise, Robert Plomin, an eminent behavioural geneticist, has made the claim that “Predictions from polygenic scores are an exception to the rule that correlations do not imply causation” in his book Blueprint. This is not true. What is probably happening is that Plomin is exaggerating his findings to acquire recognition. In this piece, I will show that GWAS are not causal, give examples of how they can be confounded and then proceed to provide some closing thoughts on this matter.

GWAS are confounded by population stratification

Let’s imagine, for argument’s sake, that Swedish-Americans earn significantly more money than the average American. Let’s also assume that they do so for purely cultural reasons (Protestant work ethic, avoidance of ostentatious spending, unlike their Mediterranean counterparts, etc.). Since Swedish-Americans represent a genetically distinct group, their distinctive gene variants will also tend to be associated with higher income, despite there being no causal link. This would be an example of population stratification confounding a GWAS.

Indeed, population stratification is defined as the existence of differences in allele frequencies between sub-populations of a population as a result of non-random mating between individuals. I used the example of ethnicity above, but it can also apply to class endogamy or to any genetically distinct group. The issue is that sub-populations that share genes will also tend to share a culture and an environment. This makes it hard to disentangle these factors when trying to assess the cause of a group’s outcome, be it social or health related.
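To make the mechanism concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption chosen for illustration, not data from any real study: two sub-populations, allele frequencies of 0.3 and 0.7, and a trait that depends only on group membership (the "cultural" effect) and never on the genotype.

```python
import numpy as np

def simulate_stratified_gwas(n_per_group=5000, seed=0):
    """Simulate one variant with NO causal effect on the trait.

    Hypothetical setup: two sub-populations differ in allele frequency
    and, for purely cultural reasons, in their average trait value.
    """
    rng = np.random.default_rng(seed)
    # Group label: 0 = majority population, 1 = distinct sub-population.
    group = np.repeat([0, 1], n_per_group)
    # Allele frequency differs between the groups (assumed values).
    allele_freq = np.where(group == 1, 0.7, 0.3)
    # Genotype = number of copies of the allele carried (0, 1 or 2).
    genotype = rng.binomial(2, allele_freq)
    # The trait depends on group membership only, never on the genotype.
    trait = 1.0 * group + rng.normal(0.0, 1.0, size=group.size)
    return group, genotype, trait

def slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))

group, genotype, trait = simulate_stratified_gwas()
naive = slope(genotype, trait)
within = [slope(genotype[group == g], trait[group == g]) for g in (0, 1)]
print(f"naive slope, groups pooled: {naive:.3f}")
print(f"slopes within each group:   {within[0]:.3f}, {within[1]:.3f}")
```

When the two groups are pooled, the variant looks clearly associated with the trait, while within each group the slope stays close to zero. The association comes entirely from group membership, which is exactly what stratification means.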

Now that you have an idea of how GWAS can get confounded, let us look at a few concrete examples of GWAS that are very likely confounded.

Genes associated with ice cream flavor preference?

The now famous DNA testing company 23andMe has conducted a GWAS in which it claims to have found genes associated with ice cream flavor preference. Although I don’t believe it’s impossible for genes to influence our sense of taste and our preferences, it seems impossible to me that our DNA determines which artificial ice cream flavor we prefer. Throughout most of our evolutionary history none of these flavors were available, and certainly not artificial copies of these flavors. What is probably happening here is that the study is picking up cultural groups that have a certain preference, or perhaps the study is simply not good and not reproducible.

A GWAS finds genes associated with walking pace

Another study found genes that explain roughly 9% of the variance in walking pace after controlling for body mass index. The individuals included range from 40 to 69 years old. One might think that age was the confounding variable, but the authors claim to have controlled for it as well, along with other things. Nonetheless, even controlling for confounds in a regression (or in a GWAS) does not constitute a true causal method. When we lack an alternative and have a good idea of what the possible confounds are, such a method might be used to make a decision, albeit with a lower level of confidence than an RCT. But in this case, population stratification can happen in so many different ways that it is not warranted, in my opinion, to make causal claims from this data. It should also be noted that this study uses UK Biobank data, which has been reported to have stratification problems.

Alleles correlated with which side of the face you use your phone on

To give a last reductio ad absurdum argument, another GWAS found gene associations for using your cellphone on the left or right side of the face… As you have seen so far, these studies can yield truly absurd results and should always be interpreted critically. Now, how can we prove that GWAS are actually confounded?

Heritability reduced by in-family GWAS

A possible way to partly control for population stratification is to use in-family GWAS. Indeed, members of a family, although they can experience very different environments, will tend to share an ethnicity, a culture and a social class, among other things. This paper shows that using such a methodology instead of classical GWAS decreases the heritability estimates significantly. This has been shown for height, IQ, educational attainment, smoking and more. What this study suggests is that most of the heritability estimates derived from previous GWAS that did not use the within-sibling methodology are overestimated. One can conclude not only that GWAS can be confounded, but that most are. Add to that the fact that in-family GWAS are not perfect and that environments can differ widely even within a family, so even the heritability estimates thus derived are probably too high.
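The logic of the within-sibling design can be sketched with a toy simulation. This is not the method of the paper mentioned above, and it estimates a per-allele effect rather than a heritability; all numbers (a true effect of 0.1, a cultural group effect of 1.0, allele frequencies of 0.3 and 0.7) are assumptions picked for illustration. Siblings share their family's sub-population, so comparing siblings with each other cancels that shared component.

```python
import numpy as np

rng = np.random.default_rng(1)
n_families = 20_000
true_effect = 0.1   # assumed small causal effect per allele copy
group_effect = 1.0  # assumed cultural advantage of sub-population 1

# Each family belongs to one of two sub-populations with different allele frequencies.
family_group = rng.integers(0, 2, size=n_families)
allele_freq = np.where(family_group == 1, 0.7, 0.3)
mother = rng.binomial(2, allele_freq)
father = rng.binomial(2, allele_freq)

def child_genotype():
    """Each parent transmits one of their two alleles at random."""
    return rng.binomial(1, mother / 2) + rng.binomial(1, father / 2)

def child_trait(genotype):
    """Trait = causal genetic effect + shared family environment + noise."""
    noise = rng.normal(0.0, 1.0, size=n_families)
    return true_effect * genotype + group_effect * family_group + noise

def slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))

# Two siblings per family.
g1, g2 = child_genotype(), child_genotype()
t1, t2 = child_trait(g1), child_trait(g2)

# Classical design: pool everyone and regress trait on genotype.
population_estimate = slope(np.concatenate([g1, g2]), np.concatenate([t1, t2]))
# Within-family design: regress sibling trait differences on genotype differences.
within_family_estimate = slope(g1 - g2, t1 - t2)

print(f"true effect:            {true_effect:.3f}")
print(f"population estimate:    {population_estimate:.3f}")
print(f"within-family estimate: {within_family_estimate:.3f}")
```

In this toy setup the pooled estimate is several times larger than the true effect, while the sibling-difference estimate lands near it. Note that the simulation gives siblings an identical shared environment, which is the favourable case; if environments differ within families in ways that track genotype, the within-family estimate can still be off.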

Closing thoughts

GWAS are a brand new technology with definite potential. If we can figure out which diseases have a genetic component, and to what extent, medicine can start imagining new treatments accordingly. Nonetheless, one should remember one’s Stats 101 course: correlation is not causation. New, improved GWAS methods, such as the in-family method, will most likely keep emerging, enabling us to control for more and more possible confounds. Be wary, however: even that does not constitute a robust causal method, though it will at least get us a bit closer to the answers we want. Perhaps one day we will develop true genetic causal methods, although at the moment I have no idea how this could be possible. Will future science prove me wrong?

On the difference between statistically significant and statistically non-significant results

I’ve heard and read several statisticians, such as Andrew Gelman on his blog or Nassim Taleb on his YouTube channel, use the following phrase or an equivalent:

The difference between “significant” and “non-significant” is not itself statistically significant

But what does this mean exactly? I did not fully understand it at first. I decided to research it and think I now have a good grasp of the idea, so I am writing this short post to share my understanding, which might still be imperfect. Statistical significance is a tool that is very often misinterpreted, and I think it is important to have an honest discussion about what it can and cannot achieve. Here is my explanation of the issue; hopefully you’ll find it succinct and clear:

When you run a hypothesis test, you’re trying to figure out whether you should reject the null hypothesis or not. If we take the example of the Z-test, you first fix the value that the parameter you’re interested in takes under the null hypothesis. You then gather data from a sample of the population and estimate the same parameter from that sample. Assuming the estimate follows a Gaussian distribution, you reject the null hypothesis if the estimate and the null value are separated by enough standard errors. The number of standard errors separating the two values is often called the z-score, and the probability, computed under the null hypothesis, of observing a z-score at least as extreme as the one you obtained is called the p-value. How many standard errors count as “enough” depends on the significance level chosen: if the p-value is smaller than the significance level, the result is deemed statistically significant.

But here is the catch: if you draw many samples of the same size from the same population, the parameter estimates will vary from sample to sample and follow their own probability distribution. Their z-scores and p-values will also vary and follow their own probability distributions! In fact, p-values of samples of the same size from the same population can vary a lot, and the difference between a significant result and a non-significant result is itself not necessarily statistically significant in the p-value distribution.
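As a small worked example of the procedure just described, here is a sketch of a one-sample Z-test in Python. I assume the two-sided version of the test and a known population standard deviation; the function name `z_test` and the numbers (a sample mean of 103 against a null mean of 100, a standard deviation of 15, 100 observations) are made up for illustration.

```python
import math

def z_test(sample_mean, null_mean, sigma, n):
    """One-sample, two-sided Z-test with a known population standard deviation."""
    standard_error = sigma / math.sqrt(n)
    # z-score: distance between estimate and null value, in standard errors.
    z = (sample_mean - null_mean) / standard_error
    # Two-sided p-value: P(|Z| >= |z|) under the null, for a standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers, not taken from any study.
z, p = z_test(sample_mean=103.0, null_mean=100.0, sigma=15.0, n=100)
alpha = 0.05
print(f"z = {z:.2f}, p = {p:.4f}, significant at {alpha}: {p < alpha}")
```

Here the estimate sits two standard errors away from the null value, which gives a p-value just under 0.05, so the result would be called significant at the 5% level but not at the 1% level.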

In his paper on this subject, Nassim Taleb generated the p-value distribution through a Monte Carlo generator. He found that if the “true” p-value of the population is 0.12, 60% of the estimated p-values from the samples could be below the traditional significance threshold of 0.05.
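To get a feel for how much p-values move around, here is a rough Monte Carlo sketch of my own. It does not attempt to reproduce Taleb's setup or his numbers: I simply assume a one-sided Z-test of a null mean of 0, a true mean of 0.2 standard deviations, and samples of 50 observations, then repeat the same experiment many times.

```python
import math
import numpy as np

def simulate_p_values(true_mean=0.2, sigma=1.0, n=50, n_experiments=20_000, seed=0):
    """P-values of a one-sided Z-test (null mean 0) over many repeated samples."""
    rng = np.random.default_rng(seed)
    # Each row is one experiment: n draws from the same population.
    samples = rng.normal(true_mean, sigma, size=(n_experiments, n))
    z = samples.mean(axis=1) / (sigma / math.sqrt(n))
    # One-sided p-value: P(Z >= z) for a standard normal Z.
    return np.array([0.5 * math.erfc(zi / math.sqrt(2)) for zi in z])

p_values = simulate_p_values()
share_significant = float((p_values < 0.05).mean())
low, median, high = np.quantile(p_values, [0.05, 0.5, 0.95])

print(f"share of experiments with p < 0.05: {share_significant:.2f}")
print(f"5th percentile / median / 95th percentile of p: "
      f"{low:.4f} / {median:.3f} / {high:.3f}")
```

Under these assumptions the experiments are identical in every respect except the random draw, yet a sizeable minority of them come out "significant" while the rest do not, and the p-values span several orders of magnitude.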

This problem has serious consequences. Very often people will say “statistical significance does not imply practical significance”, but in fact, finding a statistically significant result in your sample does not even imply that it is truly “statistically significant” at the population level.

P-values and statistical significance are tools that are misunderstood by a lot of researchers, and I think this information needs to be spread. The fact that, as we’ve shown, p-values can vary greatly makes p-hacking much easier than it would otherwise be, and this has terrible consequences for the scientific literature. One solution that has been advanced by several statisticians is to lower the significance threshold greatly, to 0.01 or 0.005. This might be a good start, but will it be sufficient? Time will tell, hopefully.

Post-publication modifications:

Corrected a few imprecisions.