Head of Data and AI at CentralNic Group PLC.
We live in an era where scientific knowledge is consolidated like never before. Yet skepticism about science, and especially about data, abounds. In the data science domain, we're seeing a lot of doubt surrounding the fast-paced development of AI-based tools and ever-more-powerful algorithms. While some of this concern is certainly valid, it is also fueling misguided but increasingly prominent notions such as the cliche that AI is going to replace human workers. That presumption ignores the fact that AI is a tool built to enhance human skills and intelligence, not replace them, as fields ranging from coding to medicine are demonstrating.
Science skepticism can be quite multifaceted. A book published by Haverford College argues that many reports in the news and on social media claiming that, "according to scientists," a change in diet or habits will lead to immediate weight loss, an IQ increase or similar results are almost always produced to attract as large an audience as possible and rest on a loose, unverified correlation between two variables. Worse, it is not unusual for these variables to be connected with the clear intention of spreading misinformation.
Seeing as believers and skeptics alike will point to data to support their truths, it's hugely important for data scientists to raise their communication standards in order to rebuild trust in the scientific community. A large part of this is addressing the problem of "false causality" and how it can distort the public's perception of data and AI.
Cause and effect is easy to observe in practice. A city going through a massive snowstorm is going to have slower traffic even with a perfectly designed road system; drivers will naturally be more cautious and slow down.
However, if someone suggested a cause-and-effect relationship between the divorce rate in Maine and the per-capita consumption of margarine in the U.S., the idea would seem absurd even if presented on a graph showing a correlation of 99.26%. This and other incongruous correlations illustrate a fallacy known as false causality, an issue not just in data science but in all scientific fields.
Despite the understanding that "correlation does not imply causation," it is still fairly easy to fall into this trap when you consider all of the correlations that sit somewhere between the absurd and the obvious. Take, for example, the rise of global temperatures and the decline in the number of pirates over the last 150 years. The two show an inverse correlation, yet it would be incorrect to claim a causal relationship between them. In this specific case, industrialization is a third variable that impacted them both, but it's not unusual to encounter instances where the confounding variable is far harder to spot. This leads to countless examples in nutrition, for instance, where a type of food is associated with a disease without other factors being taken into consideration.
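The pirates-and-temperatures pattern is easy to reproduce. The sketch below (a toy simulation, not real data; the numbers are made up for illustration) builds two series that never influence each other but both respond to a shared upward trend, a stand-in for industrialization, and they come out strongly correlated anyway:

```python
import numpy as np

rng = np.random.default_rng(42)

# A hidden confounder: a steady upward trend (think industrialization).
trend = np.linspace(0, 10, 200)

# Two otherwise unrelated series that both respond to the trend.
temperature = 0.5 * trend + rng.normal(0, 0.3, 200)
pirates = 4000 - 300 * trend + rng.normal(0, 100, 200)

# Strongly inversely correlated, even though neither causes the other.
r = np.corrcoef(temperature, pirates)[0, 1]
print(f"correlation between temperature and pirates: {r:.2f}")
```

The correlation comes out close to -1, which is exactly why a correlation coefficient alone, however impressive, says nothing about causation.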
Much of the evidence scientists gather begins with a correlation between variables, and for good reason. It is extremely useful to discover an association between two variables before running an experiment. It is not good practice, however, to make that correlation the foundation of your scientific argument. Good scientific practice will always try to confirm the correlation through tests and experiments, no matter how tempting it might be to treat the correlation as obvious, and all other variables related to the two correlated ones must also be tested. This can prove challenging in data science given the large scale of the datasets and variables involved; the machine can still do the heavy lifting, but it's up to the scientist not to fall for any traps.
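One simple way to "test the other variables" is a partial correlation: regress the suspected confounder out of both variables and correlate the residuals. The sketch below uses simulated data in which a hypothetical food-intake measure and a disease-risk score are both driven by a third lifestyle variable; the raw correlation looks meaningful, while the partial correlation collapses toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated data: both measures are driven by the same hidden factor.
lifestyle = rng.normal(size=n)
food_intake = lifestyle + rng.normal(size=n)
disease_risk = lifestyle + rng.normal(size=n)

raw_r = np.corrcoef(food_intake, disease_risk)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial correlation: what remains once the confounder is controlled for.
partial_r = np.corrcoef(residuals(food_intake, lifestyle),
                        residuals(disease_risk, lifestyle))[0, 1]

print(f"raw r = {raw_r:.2f}, partial r = {partial_r:.2f}")
```

In a real study the confounder is rarely handed to you; domain knowledge is what tells you which third variables are worth regressing out in the first place.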
False Causality Pitfalls In Data Science
Research in data and AI is especially prone to false causality for a simple reason: datasets tend to be largely observational, vast in sample size and, more often than not, messy and unorganized. With massive amounts of data across numerous variables, processing them opens up an enormous number of possible correlations. Add the human observer's tendency toward patternicity, and we have a recipe for false causality in multiple instances. AI and ML systems can identify countless correlations in a large enough dataset, but it is the scientist's task to apply the scientific method and draw the right conclusions from these patterns.
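The scale problem is easy to demonstrate with pure noise. In the sketch below, none of the 1,000 simulated features has any real relationship with the target, yet scanning all of them still surfaces "impressive" correlations by chance alone:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 100, 1000

# Pure noise: no feature has any real relationship with the target.
X = rng.normal(size=(n_samples, n_features))
target = rng.normal(size=n_samples)

# Correlate every feature with the target and keep the strongest one.
corrs = np.array([np.corrcoef(X[:, j], target)[0, 1]
                  for j in range(n_features)])
strongest = np.abs(corrs).max()
print(f"strongest |r| among {n_features} random features: {strongest:.2f}")
```

With enough variables, something always correlates; guarding against this (holding out validation data, correcting for multiple comparisons) is part of the scientist's job, not the machine's.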
Data-driven organizations make virtually all decisions based on insight extracted from data. Allowing false causality to creep in can result in wrong decisions. In order to fight false causality, it is critical for data professionals to:
• Work hand in hand with domain experts to ground data-driven findings in real-world phenomena.
• Clearly state all of their assumptions when presenting their findings.
• Conduct an "assumptions review" with their peers, just as developers invite a second pair of eyes to do a "code review."
• Avoid leaning on the Occam's razor philosophy. Data science often tackles complex real-world problems, and contrary to what Occam's razor suggests, settling for the simplest possible explanation often leads to false causality.
Ultimately, making effective use of communication and media tools to reinforce thoroughly verified findings should help in the long run, not just for data and AI but for all the value scientific knowledge can bring to society.