A more suitable analysis (which we will get to know later) revealed that new customers in particular were more likely to recommend the product due to their initial euphoria. However, customer hotline contacts also occurred mainly with new customers, as they had the most questions about setting up the devices. The service contacts thus turned out to be a second consequence of being a new customer. This produced a correlation between willingness to recommend and service satisfaction, because both were consequences of being a new customer. There was, however, no causal relationship between them.
These and other examples suggest one thing: correlation analysis is not suitable for tracking down causes. What methodology has statistics (or, as we say today, data science) developed for this? These are, in particular, econometric models and structural equation models. Let’s take a look at an example of how useful these methods are in practice.
The company Mintel collects data on new product launches worldwide. In fact, thousands of employees “roam” through supermarkets worldwide to find new products, evaluate them subjectively and send them to the headquarters in London, where they are objectively evaluated and categorized. For this purpose, sales figures, distribution figures, prices and much more data are collected in a huge database.
From time to time, professionals ask bold questions like these: Why do only 5% of all new product launches survive the first two years? How can I predict whether my product will make it? How can I manage the whole thing at an early stage?
So the company’s data scientists set about unearthing this treasure trove of data. An econometric model was set up. It was tweaked and tinkered with. But its explanatory power was sobering, to say the least.
That was the reason why I received this email one day asking if we didn’t have better methods here.
We had those. Once the work was done, we were able to predict with 81% accuracy whether a new product would be a winner or a loser.
How was that possible? Is classic modeling really that “bad”?
No, “bad” is the wrong word. Classic modeling is not practicable. It does not have the methodological properties that AI offers us today and that are needed to gain useful insights in a predictable way and in a limited amount of time.
Specifically, there are problems in the following three areas:
Even in classical research, it is still considered the gold standard to always proceed on the basis of hypotheses. Most people have learned this from their studies. The reason behind this is that (without using the methods we will get to know) only a good hypothesis prevents a spurious correlation from being declared true, i.e. causal.
The only practical problem is that there is usually a lack of good hypotheses. The greater the need for useful insights, the fewer hypotheses are available. And the more solid hypotheses there are, the more likely marketers are to say to themselves: well, what we know is enough for us to make decisions. There is always talk of the “last number after the decimal point that you don’t need”.
Collecting hypotheses with the help of expert interviews is a lot of work and takes time. Nevertheless, there are still gaps, big gaps. The hypothesis-based approach leads to small models because there are only a few solid hypotheses. These small models then explain less. What is worse, however, is the associated invisibly higher risk of delivering incorrect results. We will come back to why this is the case.
In practice, as well as in science, people therefore “cheat” behind the scenes. You simply look at the data you have and see if you can come up with a hypothesis. The clean, classic process would be the other way around.
It is not uncommon for hypotheses to be “knitted” after the analysis has been carried out. The whole thing is then idealized as the high art of “storytelling”.
It was the same with Mintel. “All variables in” and then “let’s see”. Even respondents’ subjective statements about whether or not they would buy a product had no explanatory power for product success.
Did this disprove the hypothesis that a higher purchase intention leads to purchases?
Yes and no. If we assume that all the assumptions made in the model are correct, then yes. This leads us to the second point.
For example, many new products are more likely to be considered if they are perceived as unique. But it turns out that this uniqueness can be exaggerated. A “very unique” becomes “kind of weird”.
The standard methods of classical modeling assume that the more pronounced an explanatory variable is (e.g. the more unique), the greater the target variable (e.g. sales figures for the product). A fixed relationship is assumed. This is a linear relationship. Only the extent is determined by the parameters.
The second standard assumption is that of independence. This means that a price reduction of 1 euro has a certain fixed sales effect, e.g. 50%, regardless of which brand the product is from. Even if this does not seem very realistic in this example, it is at the core of all standard methods.
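The uniqueness example above shows what the linearity assumption misses. The following sketch uses entirely invented data (the inverted-U relationship between "uniqueness" and "appeal" is my own illustrative assumption, not Mintel's actual data): a straight-line model cannot capture a relationship in which "very unique" tips over into "kind of weird", while a model that allows curvature can.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: "uniqueness" scores and a product appeal that first
# rises, then falls ("very unique" becomes "kind of weird").
uniqueness = rng.uniform(0, 10, 200)
appeal = -(uniqueness - 5) ** 2 + 25 + rng.normal(0, 1, 200)  # inverted U

# A purely linear model assumes: more uniqueness -> more appeal.
slope, intercept = np.polyfit(uniqueness, appeal, 1)
pred = slope * uniqueness + intercept
r2_linear = 1 - np.sum((appeal - pred) ** 2) / np.sum((appeal - appeal.mean()) ** 2)

# Allowing curvature (a quadratic term) captures the real relationship.
coeffs = np.polyfit(uniqueness, appeal, 2)
pred2 = np.polyval(coeffs, uniqueness)
r2_quadratic = 1 - np.sum((appeal - pred2) ** 2) / np.sum((appeal - appeal.mean()) ** 2)

print(f"linear R^2:    {r2_linear:.2f}")
print(f"quadratic R^2: {r2_quadratic:.2f}")
```

The linear fit explains almost nothing here, although the data contain a strong, perfectly regular relationship.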
Can these assumptions be relaxed? Sure. With econometric methods, it is possible to make the models non-linear. It is also possible to map the dependencies between the causes in the model. The whole procedure has just one catch: it is hypothesis-based.
The data scientist needs to know what kind of non-linearity to build in. Do we have a saturation function here? A U-function? A growth function? An S-function?
He also needs to know what kind of dependency he should “build in”. Do we have an AND link here, i.e. sales only increase if the price falls AND the brand is strong? Or an OR link? Or an either/or link? Or something in between?
The Mintel model had 200 variables. Even if you only have a hundred variables, the question arises: who goes through them all to determine the non-linearity correctly? And who goes through all 100 times 100 (= 10,000) combinations to consider how they are related?
This makes it plausible how impractical classical statistical modeling is. The aim should be to use methods that only need to be fed the hypotheses you can actually support. The methods should help you learn what you don’t yet know, not just validate what you already know.
There are other challenges in business practice under which traditional methods break down.
Suppose you paint a 10 cm long line on a sheet of paper with a brush. Then paint a 10 x 10 cm area with it. How much more paint do you need? With a thick brush, perhaps 10 times as much. Now let’s go from two dimensions to three. How much paint do we need to fill a 10 x 10 x 10 cm box?
The paint is the data we need. The dimensions are the variables we have. The point is this: the more explanatory variables we have, the larger the possibility space becomes. This space contains our data. As the number of variables increases, we theoretically need exponentially more data. This phenomenon is called the “curse of dimensionality”.
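The paint analogy can be made concrete with a small simulation (the sample size and the number of cells per axis are arbitrary choices of mine): a fixed number of data points is dropped into spaces of growing dimension, and we count how many "cells" of each space the data actually cover.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points = 1000        # fixed amount of data ("paint")
bins_per_axis = 10     # 10 cells per dimension, like the 10 cm line

coverage = {}
for dim in (1, 2, 3, 6):
    total_cells = bins_per_axis ** dim
    # Scatter the points into the space and count distinct occupied cells.
    points = rng.uniform(0, 1, size=(n_points, dim))
    cells = (points * bins_per_axis).astype(int)
    occupied = len({tuple(c) for c in cells})
    coverage[dim] = occupied / total_cells
    print(f"{dim}D: {occupied}/{total_cells} cells covered "
          f"({100 * coverage[dim]:.2f}%)")
```

In one dimension, 1000 points cover every cell; in six dimensions, the same 1000 points leave more than 99.9% of the space empty.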
The only tool that classical methods have to overcome the curse of dimensionality is hypotheses and assumptions. We have seen that this is not very practical.
In the course of the development of artificial intelligence, intelligent methods have been developed that tame the curse of dimensionality without strict hypotheses and assumptions.
When AI algorithms today identify a cat in an image with 1000 x 1000 pixels, for example, they process 1 million (1000 x 1000) explanatory variables. The possibility space here is significantly larger than the number of elementary particles in the entire universe (around 10^81). Even the millions of cats that the algorithm has seen are a drop in the ocean.
In the same way, it is also possible to map high-dimensional challenges in corporate marketing.
Another limitation of classic modeling is differently scaled variables. There are binary variables such as gender or segment affiliation. And there are continuously scaled variables such as customer satisfaction or turnover. Classical statistics cannot mix these.
Data sets are divided into women and men, estimated separately and then compared. The sample is thus halved, and with it the statistical power, and the gender comparison remains purely correlative (instead of causal).
The requirements in business practice are different. But if you only have a hammer, every problem looks like a nail.
This was also the case in the Mintel project. If classical modeling is hypothesis-based and postulates linearity and independence of effect, it is intuitively plausible that the approach has its limits. This becomes all the clearer when we look at what a modern AI-based method has discovered:
The key finding of the model was that the success levers are interdependent. To sell a product, it has to be on the shelf. A good-looking product is of no use if it has a low level of distribution. A high degree of distribution is useless if the product is not so good that consumers want to buy it again. A good product is of no use if the price is not in an acceptable range. An acceptable price is of no use if the brand is not recognized on the shelf.
Generate 100 random numbers between 0 and 1 for each of 4 variables. In each of these 4 series, half of the cases are greater than 0.5. If you require two of the variables to be greater than 0.5 at the same time, only 25% of the cases qualify. With a third variable, 12.5% remain, and with the fourth around 6%. This is pretty much the percentage of new products that survive two years.
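This back-of-the-envelope calculation is easy to verify with a short simulation. I use far more draws than the 100 in the text so that the estimate is stable, and I count the cases in which the levers clear the 0.5 hurdle simultaneously, which is exactly what multiplying the 0/1 indicators expresses:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # more draws than 100, for a stable estimate

# Four independent "success levers" (think distribution, appearance,
# price acceptance, repurchase appeal), each uniform between 0 and 1.
levers = rng.uniform(0, 1, size=(4, n))

shares = []
for k in (1, 2, 3, 4):
    # Share of cases in which the first k levers are ALL above 0.5.
    share = np.mean(np.all(levers[:k] > 0.5, axis=0))
    shares.append(share)
    print(f"{k} lever(s) required: {share:.1%}")
```

The shares come out very close to 50%, 25%, 12.5% and 6.25%, matching the halving logic described above.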
Requiring all conditions at once (mathematically, multiplying the corresponding 0/1 indicators) is logically equivalent to an AND link. Success only occurs if a new product has a high degree of distribution, an attractive overall appearance, a reasonable price, if it can be easily recognized and if it is so good that customers want to buy it again after the first purchase.
A new type of modeling was able to discover this relationship in the data, despite the fact that there were over 200 variables, both binary and metric, and, above all, that no one expected this result in advance.
It’s not that classic modeling methods are “bad”. Quite the opposite. Within the scope of their assumptions, the methods are extremely good and extremely accurate. Just like a Formula 1 car: it is extremely optimized, accelerates hard, has a high top speed, and its tires can be changed within seconds.
If you order such a car as a company car, you will find that you won’t get 100 meters. There is no trunk and you can’t get gasoline at the filling station. But above all, every single bump on a normal road will destroy the underbody.
A modern AI-based analysis system is more like an off-road vehicle. It may not be as fast as a Formula 1 car. However, it drives from A to B, no matter what the surface looks like, whether there is a stream in between or a hill to cross.
What is artificial intelligence and what is machine learning? The answer is quite simple: Machine Learning is written in Python and Artificial Intelligence in PowerPoint.
Artificial intelligence originally described all technical systems whose behavior gives the impression of being controlled by human intelligence. For our purposes, a different definition is more appropriate, because in many applications AI systems are far more “intelligent” than humans. In data analysis in particular, the original understanding of AI does not help us, because what AI can do here exceeds even the most ingenious human being many times over.
What we want to do with AI is to gain insights from data and make predictions. In this context, we differentiate between statistical modeling and artificial intelligence:
Statistical modeling finds the parameters of a fixed, predefined formula.
Artificial intelligence finds the formula itself and its parameters.
“Machine learning” is often used synonymously with AI, but for data scientists in particular, statistical modeling is also part of machine learning. The machine somehow learns by finding the parameters. This is why the majority of “AI” start-ups in Europe do not use AI at all, as a study revealed a few years ago.
What exactly does the term “formula” mean in this context? Every rational explanatory approach, such as a forecasting system, can be expressed as a mathematical function in which the result (the forecast) is derived mathematically from the explanatory variables (numbers that stand for certain characteristics of the causes).
The classic linear regression has this formula:
Result = Variable_1 x Coefficient_1 + Variable_2 x Coefficient_2 + … + Variable_N x Coefficient_N + Constant
The formula consists of added terms and a constant. The coefficients and the constant are calculated by the algorithm in such a way that the estimated values for the sample data in the data set match the real results as closely as possible.
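As a sketch of this fitting step, here is the formula above estimated by least squares on invented example data (the "true" coefficients are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented data: two explanatory variables and a noisy linear result.
n = 500
X = rng.normal(size=(n, 2))
true_coeffs = np.array([2.0, -1.5])
true_const = 0.5
y = X @ true_coeffs + true_const + rng.normal(0, 0.1, n)

# The formula (a weighted sum plus a constant) is fixed in advance;
# only the coefficients and the constant are estimated by least squares.
design = np.column_stack([X, np.ones(n)])  # last column for the constant
params, *_ = np.linalg.lstsq(design, y, rcond=None)
coeff_1, coeff_2, constant = params
print(f"Coefficient_1 ~ {coeff_1:.2f}, Coefficient_2 ~ {coeff_2:.2f}, "
      f"Constant ~ {constant:.2f}")
```

The algorithm recovers the coefficients that generated the data; what it can never do is discover a different formula.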
In addition to addition, there are other basic arithmetic operations, such as multiplication. The basic arithmetic operations are basic building blocks with which you can construct ANY function. There are also other basic building blocks that can be used to build arbitrary functions. A neural network uses an S-shaped function as its basic building block, and by adding such functions you can likewise build any other function (a result that goes back to the mathematician Kolmogorov in the 1950s).
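A minimal sketch of this building-block idea: randomly placed S-curves are added together, and only their output weights are found by least squares. The target function, the number of units and the parameter ranges are my own arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    # The S-shaped basic building block of a neural network.
    return 1.0 / (1.0 + np.exp(-z))

# Target: a wiggly function we pretend not to know in closed form.
x = np.linspace(-3, 3, 400)
target = np.sin(2 * x) + 0.5 * x

# Basic building blocks: S-curves with random slopes and shifts.
n_units = 40
slopes = rng.uniform(-8, 8, n_units)
shifts = rng.uniform(-3, 3, n_units)
features = sigmoid(np.outer(x, slopes) + shifts)  # shape (400, n_units)

# Adding scaled S-curves: the output weights come from least squares.
design = np.column_stack([features, np.ones(len(x))])
weights, *_ = np.linalg.lstsq(design, target, rcond=None)
approx = design @ weights

rmse = np.sqrt(np.mean((approx - target) ** 2))
print(f"RMSE of the sigmoid-sum approximation: {rmse:.4f}")
```

Even with only a few dozen S-curves, the sum reproduces the target function closely, which is the intuition behind a neural network finding "the formula itself".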
The following image of the mountains has always helped me: The lines of longitude and latitude represent the explanatory variables. The height of a mountain at a certain point (= combination of longitude and latitude) is the desired result. There is now a mathematical function that represents any mountain (except for a small error).
AI can find this unknown function.
To stay with our image of the mountains: AI is like a forestry company that fells trees in the mountains and notes the longitude, latitude and height of each tree. The sawmill can then use the data on the trees to estimate the shape of the mountain with AI.
Ok, there are a few gaps. Where no trees grow, for example. This is also the case in corporate practice. There have been no major crashes in the stock market data for the last ten years (2014 – 2024). This data cannot be used to anticipate new crashes.
If you use a forecasting system to select target customers in order to acquire them, one day this system may no longer work. If you don’t monitor the system, you will quickly go out of business.
It is important to be aware of these framework conditions. Otherwise, you run the risk of falling victim to black-swan phenomena.
But there are even more serious difficulties.
When I started studying in Berlin in 1993, I was always fascinated by the Reuters terminal in the middle of the Mensa foyer. The student stock exchange association had set it up there. The monitor showed current stock market prices in real time, delivered via satellite (remember: there was no internet back then!).
Then one day, Germany’s biggest daily newspaper ran the headline “Artificial intelligence predicts stock market”. I was hooked. Shortly afterwards, I joined the stock exchange association and read up on how the professionals make their investment decisions. That’s when I met Harun. He was also an electrical engineering student and had caught wind of the newspaper article.
For the next few years, we were to meet every week and, fortified by ready-made spaghetti, talk through the night and program neural networks. Successes and setbacks followed one another in regular succession.
I still remember it well. We had built a system that not only learned the learning data with high precision, it also predicted the test data with good results. This was data from a more recent time horizon with which the neural network had not been trained. I ran the model for two weeks during my vacation.
But the performance on the live data was sobering. How could that be?
It turned out that our model was suffering from a phenomenon called “model drift”. Data scientists all over the world are familiar with it. And most of them still don’t have a solution for it. They simply retrain the model more frequently, which often only masks the problem.
If I want to predict the career success of managers and use shoe size to do so, this will work halfway at first. For well-known reasons, men climb the career ladder more often than women. And they have the bigger shoes. If shoe fashion changes and women wear long shoes, the model will falter. Why? Because shoe size is not the cause of career success.
Model drift occurs when the explanatory variables/data no longer explain the target variable in the same way over time. But how can this be? Why does the way the world works change so quickly?
The world does not change. Model drift occurs when acausal variables are used for forecasting.
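The shoe-size story can be turned into a toy simulation. Everything here is invented for illustration: career success is driven by a (biased) gender effect, shoe size is only a fashion-dependent proxy for gender, and the "model" is the trivial rule "large shoes → success".

```python
import numpy as np

rng = np.random.default_rng(5)

def sample(n, p_large_shoes_women):
    """Invented world: success is caused by gender bias; shoe size is
    only correlated with gender, via shoe fashion."""
    is_man = rng.random(n) < 0.5
    success = np.where(is_man, rng.random(n) < 0.7, rng.random(n) < 0.3)
    large_shoes = np.where(is_man,
                           rng.random(n) < 0.9,
                           rng.random(n) < p_large_shoes_women)
    return large_shoes, success

# Training era: women rarely wear large shoes -> the proxy "works".
shoes, success = sample(50_000, p_large_shoes_women=0.1)
acc_train_era = np.mean(shoes == success)

# Fashion changes: women now wear large shoes too -> the proxy breaks.
shoes_new, success_new = sample(50_000, p_large_shoes_women=0.9)
acc_after_drift = np.mean(shoes_new == success_new)

print(f"accuracy before fashion change: {acc_train_era:.2f}")
print(f"accuracy after fashion change:  {acc_after_drift:.2f}")
```

Nothing about how careers work has changed between the two samples; only the acausal proxy has drifted, and the model's accuracy collapses to chance level.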
The other day, my sons asked me whether we wanted to go outside and play soccer. A fleeting glance showed me a wet terrace and I said: “No, it’s raining.” A wet terrace is information that often coincides with rain. But right after the rain has stopped, it is no longer a good predictor, and after the terrace has been mopped, neither.
And so it was: the sky was bright with sunshine. My brain had fallen victim to model drift.
Many banks implement forecasting systems that are supposed to predict the credit default risk of a loan applicant. There have already been many discrimination scandals in this area. What happened?
These AI systems used all available information about a customer and tried to predict the probability of credit default observed in the past. The fact is that these pieces of information correlate strongly with each other. Higher earners live in different zip-code areas, and people with darker skin color have a lower income on average.
Machine learning – whether AI or not – traditionally has only one goal: to reproduce the target variable as accurately as possible. If two explanatory variables correlate strongly, then the algorithm “doesn’t care” which variable is used to reproduce the result. As a result, skin color has an explanatory contribution to loan default, even though this is (causally) unfounded. It is income and job security that are causal, not skin color.
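A small sketch with synthetic data illustrates the point (income, group and repayment are invented stand-ins, not real credit data): an acausal proxy that merely correlates with the causal driver "explains" the outcome on its own, even though its coefficient vanishes once the true cause is in the model.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000

# Invented setup: income causes repayment; "group" (e.g. a zip-code or
# demographic proxy) is correlated with income but has no causal effect.
income = rng.normal(0, 1, n)
group = (income + rng.normal(0, 0.5, n) > 0).astype(float)  # correlated proxy
repayment = 1.0 * income + rng.normal(0, 0.5, n)            # caused by income only

# Least squares on BOTH variables: with the true cause present,
# the proxy's coefficient is (correctly) close to zero.
design = np.column_stack([income, group, np.ones(n)])
w_income, w_group, _ = np.linalg.lstsq(design, repayment, rcond=None)[0]

# Least squares on the proxy ALONE: it "predicts" repayment,
# although this is causally unfounded.
design_g = np.column_stack([group, np.ones(n)])
w_group_alone = np.linalg.lstsq(design_g, repayment, rcond=None)[0][0]

print(f"proxy coefficient alongside income: {w_group:.3f}")
print(f"proxy coefficient on its own:       {w_group_alone:.3f}")
```

An algorithm that only minimizes prediction error has no way of telling these two situations apart; that distinction is precisely the job of causal methods.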
The situation was similar with Amazon’s applicant scoring model, which systematically disadvantaged female applicants. The learning data set was not only male-dominated; the women it contained often had less professional experience. The algorithm only aimed to predict job success, and the gender variable was useful for identifying the supposed “underperformers”. This feature was not merely “politically incorrect”. It was simply factually incorrect because, all other things being equal, women were just as successful.
The technical reason for the failure of classical AI and ML is that they are not designed to use only variables that have a causal influence.
The result is not just unfair models. The result is models that deliver suboptimal or even incorrect forecasts and findings in regular operation.
In recent years, the main criticism of AI systems has often been their black-box properties. Work has been done on this: methods called “Explainable AI” have been developed, and freely available open-source libraries such as SHAP have been created. AI-based driver analyses have been developed that use a random forest to tell the user which variables carry which importance.
But Explainable AI cannot repair an acausal neural network (or an acausal random forest). Wrong remains wrong.
Explainable AI offers a dangerous illusion of transparency.
Distilling alcohol in the cellar at home used to be commonplace. In some parts of the world, this is still the case today. It regularly happened that too much methanol was produced during distillation. The result: at best, blindness and at worst, immediate death.
Methanol is produced by the splitting and fermentation of pectin, which is found in the cell walls of the fruit. If the mash is not filtered properly before distillation, so that plenty of cell-wall material remains in it, the spirits will be rich in methanol.
Today, AI and machine learning are like distilling unfiltered schnapps. It often works; sometimes it goes really wrong.
What we need is a filtration system – also for AI.
Causal AI methods are one such “filter system”. They attack the core of the problem: acausal explanatory data.
The challenge you face with Causal AI is a tough one. You can achieve a lot with “filter algorithms”. However, it turns out that good specialist knowledge of the real-world facts described by the data is also quite useful here.
For example, when we implemented the first causal direction detection algorithms in the NEUSREL software in 2012, the results were astonishing. The first use case was data from the American Satisfaction Index. In the structural equation model commonly used in marketing, satisfaction influences loyalty and not the other way around. Satisfaction is a short-term changing opinion about a brand. Loyalty, on the other hand, is an attitude that only changes in the long term.
But our algorithms revealed a clear, and different, picture: loyalty influences satisfaction, not the other way around! Could there really be nothing wrong with the algorithms?
Then it clicked for me: both were right, each in its own way.
The data are responses from consumers in a survey. If someone is loyal but not satisfied, they tend to indicate a higher level of satisfaction than they actually feel due to their loyalty. This is a kind of psychological response bias. In this sense, current loyalty has a causal influence on the current level of satisfaction.
Things look different on a different time horizon. If we were to survey the same people again with a time lag, we would find that a low level of satisfaction – over a certain period of time – reduces loyalty.
Ergo: causality always refers to a time horizon. Understanding and demonstrating this is also the task of those responsible for marketing.
If you wade barefoot through the cold November rain, you risk catching a cold. The cold does not make you ill. However, it does increase the likelihood of a viral infection breaking out because the immune system is weakened by the cold at this time.
However, it is also true that those who regularly expose themselves to the cold strengthen their resistance, promote the functional strength of the mitochondria and maintain a stronger immune system, so that they catch fewer colds in the long term. Wading barefoot leads to fewer colds in the long term, not more.
There are many such examples.
Those who wash their hair often have well-groomed, grease-free hair. If you don’t wash your hair, you’ll soon get the proof, because your hair quickly becomes greasy.
But if you never wash your hair, the roots will, in the long term, not regrease it as quickly, as the hair already carries a healthy film of natural oil. It will still look healthy and tidy.
My simplified formula is:
Causal AI = artificial intelligence + expertise + X
We will discuss what exactly constitutes this “X” in the next chapter. This will give us a better understanding of what a good filtration system for AI needs to look like.