
Chapter 3

THE NEED
What should this mouse trap be able to do?

Dr. Frank Buckler

Founder, Success Drivers

Building a powerful Causal AI system takes three steps: the construction of a holistic data set, the selection of a suitable machine learning algorithm, and the application of certain algorithmic and human assessment procedures. In the following, we will discuss why this is the case and what to look out for. It will be the most methodical chapter of this book, but I will try to keep it exciting, understandable and vivid.

STEP #1 - Collect a holistic dataset

Swiss insurance companies often take a closer look than others. I love it when customers ask “is that possible?” questions. Then the inventor in me comes out and starts tinkering. It was the same here. 

We had just built a Causal AI system that explained what drives NPS and how to optimize it strategically. In the dashboard simulator, it was possible to set improvements in certain CX dimensions and then see by how many NPS points the insurer would improve.

But then came the question: “What good is one more NPS point anyway?” Melanie asked point-blank. It is questions like these that are the starting point for any innovation.

In our case, we decided to request a dataset from the data team containing the characteristics of the customers surveyed in the past, and then to find out whether those customers had canceled in the following year.

An initial analysis puzzled us. The NPS rating correlated positively with customer churn. The more loyal the customers were, the more likely they were to churn? “Can’t be,” we said to ourselves and built an initial forecasting model that used the NPS rating and some information about the customer to predict whether the customer would churn. The impact of the NPS was now very weak, but still positive.

I looked at the first model and saw that many of the available variables were not included in the model. In particular, the internal segmentation of customers had not yet been taken into account. 

Now you might ask yourself: “What does this have to do with the influence of loyalty on churn?” The answer became abundantly clear in this project.

The resulting model not only had better predictive quality, it also showed a clear negative influence of loyalty (NPS rating) on customer churn.

What had happened?

By integrating additional variables, we had taken so-called “confounder” effects into account. This confounder was called “customer segment”.

The insurance company had a higher-value customer segment that was evidently more selective and expressed this in lower ratings. Given the same actual loyalty, these customers tended to give a lower rating. The causal influence of segment affiliation on the loyalty measurement was therefore negative.

At the same time, this segment was less inclined to churn because these customers were generally better looked after. The causal influence of segment affiliation on churn was therefore also negative.

If an external variable (segment affiliation) influences two variables at the same time, then these variables will correlate. In this case, the correlation between the loyalty measurement and churn turned out positive, because the segment pushes both downwards, i.e. in the same direction, which creates the false impression that higher loyalty goes hand in hand with more churn.

If this external variable – this so-called confounder – is not in the model, the model will show a causal relationship that does not actually exist.

In science and in data science practice, a procedure known as “p-value hacking” is common. The p-value expresses the statistical significance of a relationship. To obtain more significant relationships in statistical models, it is tempting to remove more and more variables from the model. However, each removal not only increases the significance of the remaining relationships, but also the probability that the model will produce causally incorrect results.

It has taken a few decades, but even the American Statistical Association has now clarified the limitations of significance tests as a key quality measure in an official statement. However, I believe it will take another two or three generations before this becomes common practice.

Whenever we cannot fall back on robust theories (which is almost always the case in marketing), causal models are those that take many variables into account. They do this with the aim of reducing the risk of confounders by modeling their effects.

It’s a bit like the hare and the hedgehog. The hedgehog crosses the finish line first and you automatically conclude that he must be faster. But the confounder is the length of the path that the hedgehog cunningly shortened. 

Don’t be fooled by the hedgehog. Don’t just look at the data. You can’t measure cause-and-effect relationships and you can’t see them. You can only infer them indirectly. You have to be very careful and make sure that a hedgehog is not trying to trick you.

Marketing & business expertise helps with variable selection

At Success Drivers, we have been analyzing Microsoft’s worldwide B2B customer satisfaction survey data for many years. I remember well the day Angelika, a project manager with us, came to see me. She had built an AI driver analysis model that explained what drives satisfaction. In line with the guidelines, she had used not only the measured partial satisfactions but also various other attributes available in the data set, in order to make the model more holistic.

She proudly said, “Frank, we now have an explanation quality of 0.88.” That made me prick up my ears. “What are the main drivers?” I asked. “It’s IRO,” she said.

Neither she nor I knew what this mysterious IRO was supposed to be. That is exactly how it should not be. We inquired at Microsoft and learned that particularly dissatisfied customers were flagged and then contacted. This variable was not a cause, but a consequence of the very quantity we were trying to explain.

The AI model didn’t care. It does what it is told: Find the function with which we can predict the result from the input variables.

The entire model was worthless. IRO carries part of the target variable within itself. The explanatory contributions shown for the remaining drivers can no longer be interpreted.

It is therefore part of the guideline that only driver variables that can be logically causal are used.

“Do you always know that?” I am often asked. Of course you don’t always know. But marketing research provides a framework in which the majority of variables in the marketing context can be classified.

There are variables that relate to outcomes and to the status in the marketing funnel. These are Purchase Intention, Consideration, Brand Distinctiveness and Awareness. These variables follow a logical sequence among themselves: Purchase Intention is a downstream stage of Consideration.

The status of the marketing funnel is influenced by the perception of the products and the brand. Do consumers perceive the product as tasting good, healthy or trustworthy? These are product and brand-specific attitudes. And here, too, there are logical causal sequences. The brand influences product perception. Of course, there is also a reverse causal effect, but this takes place on a longer-term timeline. 

These perceptions and attitudes change with experiences with the product and the brand. This can be the consumption of the product, an advertising contact, a conversation with a friend or a TV report. We often call these things “touchpoints” in marketing.

How the touchpoints influence the attitude towards the product and the marketing funnel can vary depending on the target groups and the situation. It therefore makes sense to integrate characteristics of the person (e.g. demographics) and the situation (e.g. seasonality) as possible moderating variables.

There are hypotheses about what works and hypotheses about what does not work. According to Nassim Taleb, the latter are more likely to apply. It is easier for us humans to know, for example, that overall satisfaction does not influence service satisfaction than to know that service satisfaction is a decisive driver of overall satisfaction.

It is therefore not the aim of the above framework to define which variable influences which other. Rather, the aim is to tell the model which causal relationships can be excluded with a high degree of probability from a logical point of view.

To speak in images: It is comparatively safe to say that a pope is a human being. However, identifying a person as a pope requires more intensive testing. We want to hand this check over to the AI if we cannot do it ourselves with certainty.

You don't know what you don't know

The aid organization “Kindernothilfe” wanted to revise its marketing strategy and as a first step set out to better understand how donors choose the aid organization. 

At Success Drivers, we usually approach this by designing a questionnaire that collects the data we need according to the above framework from interviews with the target customers. For Kindernothilfe, for example, we conducted a workshop that filled the categories of the framework. As a rule, the process is actually quite quick, as many of these things can be found in old questionnaires and documents, which we just have to categorize.

Nevertheless, it is worth investing a lot of time in brainstorming. In this project, I realized this again by a happy coincidence: at the beginning of the questionnaire, we asked respondents which aid organizations they knew. From the list of those the respondent knew, one other organization was selected in addition to Kindernothilfe. These two brands were then evaluated in order to understand what motivates donors to prefer one aid organization over another.

This question was therefore only intended as a control question. When we set up the model, the information about which aid organizations a respondent knew was now in the data set. I noticed that some respondents knew many aid organizations and some only very few. My intuition told me that this could be important information for the marketing funnel. So this variable was included in the model.

In fact, it turned out that this variable plays a central role. Donors who know many providers turned out to be much more selective. They are much harder to win over because they have many points of comparison.

This is another realization where you say to yourself “Yes, of course, it’s logical”. But it didn’t cross anyone’s lips in the workshop beforehand. 

These kinds of “it’s obvious” aha experiences have been with me ever since I started using Causal AI for companies and we’ll talk about a few more.

The wider we cast the net, the more data we collect on potentially influential facts, the better the causal model of reality becomes and the more amazing the “aha” moments become.

It was not only useful to realize that donors who know only a few providers are easier to recruit. It also turned out that younger potential donors naturally know fewer providers. “Also obvious”, but unfortunately only in retrospect. In addition, people who know few providers can be found in other places and at other touchpoints. In short: the entire marketing strategy was turned upside down.

We overestimate the relevance of what we know and underestimate the relevance of what we don’t know. That is why it is so useful to think outside the box. It is precisely this process that will remain the job of human experts for some time to come; at most, they will lead it inspired by LLMs. The expertise required here has nothing to do with data science.

To put it metaphorically. Everyone knows this situation. You have a problem but can’t find a solution. No matter how hard you try. Then the idea comes to you in the shower. You can’t think of a word, no matter how hard you try. In a relaxed moment, it comes out of nowhere. If we focus too much, we focus on what we know, not what we don’t know. We lose the chance to find solutions that lie outside the current paradigm. 

Most scientific breakthroughs have not been made possible by a research plan, but by unplanned “coincidences”. Whether it’s penicillin, Post-its, airbags, microwave ovens or Teflon, great inventions are the result of lucky coincidences and a look “outside the box”.

If you want to usher in a new phase of growth for your company, it is useful not only to focus on what you know, but also to seek insights where your own knowledge is limited. This is exactly the idea of Causal AI – using knowledge to explore ignorance.

STEP #2 - Choose a suitable machine learning algorithm

When I started experimenting with neural networks in the nineties of the last century, I was full of anticipation. Harun and I collected stock market data. We “scraped” the spot prices that were broadcast via teletext on German television, because the Internet was not yet available for home use. We thought about the best way to pre-process the data so that we could finally feed it to a neural network.

The performance on test data (i.e. data sets that the system had not learned from) was atrocious. Even a linear regression was better. Something was going wrong. What could it be? There were so many possible levers: network architecture, learning methods, number of parameters, pre-initialization, better pre-processing, and so on. We tried out a lot. A lot. I learned that trying things out without questioning your paradigm can waste a lot of time. Fortunately, we were students and had the time, and it took us a year or two to realize that all the methods, even those so promisingly described in the textbooks, are of little use if they lack so-called “regularization”.

What is that? Machine learning methods have one goal in common: they try to minimize the prediction error. This goal is precisely the problem. The prediction error naturally relates only to the data that is available for learning. However, the real aim must be to minimize the prediction error on situations (i.e. input data) that have not yet been seen.

Here you run into a dilemma. You could also use the test data for learning to make the model better. But in the live application, there will always be new, unseen input data. You have to accept that a model built from limited data must not only achieve a good fit, but also the ability to generalize. Regularization achieves this by following the philosophical principle of Occam’s razor: “When in doubt, prefer the simpler model.”

Regularization methods try to make a model simpler while sacrificing as little predictive accuracy as possible on the learning data. 
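To make this tangible, here is a minimal Python sketch (my own illustration with scikit-learn on invented toy data, not the software described in this book): the same noisy sample is fitted once purely for fit and once with an L2 penalty (ridge), and both are compared on unseen test data.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 20)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, 20)          # noisy learning data ("the crosses")

# Flexible model that only minimizes the error on the learning data -> tends to overfit
overfit = make_pipeline(PolynomialFeatures(degree=12), LinearRegression()).fit(x, y)

# Same flexibility, but an L2 penalty keeps the model simpler (Occam's razor)
regularized = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=1.0)).fit(x, y)

# On unseen input data the regularized model typically generalizes better
x_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_test = np.sin(x_test).ravel()
print("test error without regularization:", np.mean((overfit.predict(x_test) - y_test) ** 2))
print("test error with regularization:   ", np.mean((regularized.predict(x_test) - y_test) ** 2))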

The figure shows the learning data as crosses. The thin line is the model that is only aimed at minimizing the forecast error. The thick line is a regularized model.

These methods form the basis of almost all AI systems today.

As soon as we tried out regularization methods in our student programming circle, the system became usable. However, the results only became really good after a further step towards causality.

Dependencies between causes

Marketing is one of the most complicated fields you can choose. It is often ridiculed by technical professions. They think “they’re just babbling”. I have to admit, I thought the same thing at the beginning of my studies. We looked down on the business graduates who didn’t have to solve really complicated higher mathematics like we did. 

But as the semesters went on, I realized that in reality the natural sciences are comparatively simple. You can experiment relatively easily and get immediate feedback. That’s why we know so much in the natural sciences and so little in marketing. Marketing is like open-heart surgery, on millions of hearts at the same time.

There are so many uncontrollable variables. That’s what Causal AI is all about: being able to capture and evaluate the complexity. What you then realize is that most of the variables correlate with each other. And that is a problem. Because it makes it more difficult to crystallize which variables are causal.

This was also the case with the Kindernothilfe aid organization. The older the potential donors are, the more often and the more they donate. This is well known and leads to senior citizens being at the heart of the marketing. Was this justified?

Many other variables also correlate. Various components of brand perception correlate strongly. Even wealth and income correlate.

It turns out that although classic machine learning and AI produce precise estimates for the learning data, the more variables are involved, the more they fall prey to multicollinearity. As an undesirable side effect, variables are used in the model that have no direct causal influence. This in turn leads to unstable forecasts and distorted attributions of causes.

In particular, there are two algorithmic methods that Causal AI systems use to measure causal effects more accurately.

 

1. Double Machine Learning (DML)

Let’s come back to the SONOS example. If we predict loyalty from the other data with an AI model, then the prediction contains all the information of the explanatory variables that the algorithm could use. What remains (the difference between the prediction and the real value) is called “noise”, i.e. an unexplained random component. If the explanatory variables (the causes) do not fully explain loyalty (the effect), part of this “noise” is the intrinsic information contained in the loyalty variable.

The same applies when we explain the “service evaluation” with an AI model. Double Machine Learning now explains the intrinsic information of the target variable with the intrinsic information of the other variables; it works because it operates on adjusted variables.

The method consists of two stages of machine learning. Hence the “double”. In stage one, machine learning models, such as a neural network, are trained for each variable that is influenced by others. This includes the target variable, such as loyalty. 

In the second step, the difference between the value predicted by the machine learning model and the actual value is calculated (this difference is called the residual). The second stage then estimates a machine learning model on the residuals rather than on the actual values.
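As an illustration of this two-stage logic, here is a compact Python sketch (my own, not the NEUSREL implementation; the gradient-boosting first stage, the cross-fitting and the variable names are assumptions made for the example).

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def dml_effect(controls, treatment, outcome):
    """Estimate the effect of `treatment` on `outcome`, adjusted for `controls`."""
    # Stage 1: predict the outcome and the treatment from the other variables
    # (cross-fitted predictions to avoid overfitting bias).
    pred_outcome = cross_val_predict(GradientBoostingRegressor(), controls, outcome, cv=5)
    pred_treatment = cross_val_predict(GradientBoostingRegressor(), controls, treatment, cv=5)

    # Residuals = the "intrinsic information" not explained by the controls.
    res_outcome = outcome - pred_outcome
    res_treatment = treatment - pred_treatment

    # Stage 2: relate residual to residual; the slope is the adjusted effect.
    stage2 = LinearRegression().fit(res_treatment.reshape(-1, 1), res_outcome)
    return stage2.coef_[0]

# Toy data: a confounder drives both the service evaluation and loyalty.
rng = np.random.default_rng(1)
confounder = rng.normal(size=(2000, 3))
service = confounder @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=2000)
loyalty = 0.7 * service + confounder @ np.array([0.5, 1.0, -0.3]) + rng.normal(size=2000)
print("adjusted effect (should be close to the true 0.7):", dml_effect(confounder, service, loyalty))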

It’s a bit like searching for tracks. You won’t see the snow fox, which only roams around at night. But its tracks in the snow can tell us where it has come from and where it is going.

2. Automated Relevance Detection (ARD)

There is another method for dealing with the interdependencies of the explanatory variables. The idea is to sort out redundant explanatory variables during the iterative learning process of the AI itself, instead of proceeding in two stages, and to do so without losing precision.

The idea is therefore to integrate this goal into the objective function of the artificial intelligence. What does objective function mean? Neural networks are optimized by setting up a function (a formula) that expresses what you want to achieve. In the simplest case, this is the sum of the absolute differences between the real value and the predicted value over all cases/data records. This sum should become small.

This formula is now dependent on the weights (=parameters) of the neural network. The learning algorithm, in turn, is a method that knows how to iteratively change the weights bit by bit so that the sum of this formula becomes smaller.

Automated Relevance Detection (ARD), known in the Bayesian literature as Automatic Relevance Determination, changes this objective function so that the aim is not only to minimize the prediction error, but also to omit redundant variables. This is not about black or white, not about in or out, but about a trade-off: weighing fit against the simplicity of the model. This weighing is itself a learning process, which the procedure implements using the principles of Bayesian statistics.
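As a simple stand-in for this idea, the following sketch uses scikit-learn’s linear Bayesian ARD model on invented data with one redundant variable; the book applies the same relevance principle inside neural networks, so this illustrates only the behavior, not the actual method.

import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.5 * rng.normal(size=n)   # strongly correlated with x1, no effect of its own
x3 = rng.normal(size=n)                    # independent and truly relevant
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
model = ARDRegression().fit(X, y)
print("estimated weights:", model.coef_)   # the redundant x2 typically ends up with a weight near zero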

It’s a bit like the task of hunters: they have to maintain a balance in the food chain. How many predators do you need? How much prey? How much pasture grass? It should be balanced, but what exactly the right balance is, is a non-trivial question. The ARD algorithm investigates this question in an iterative search process.

Modern “Causal AI” has therefore typically implemented Automated Relevance Detection (ARD) and / or Double Machine Learning (DML).

Unexpected non-linearities

“Marketing works a little differently in the pharmaceutical sector,” Daniel explained to me, pointing to a chart. His company (then part of Solvay) produced prescription drugs. Marketing is mainly done through channels and campaigns to convince doctors to consider a drug. Of course, there are also advertisements in specialist journals. But the bulk of the budget is spent on equipping the sales force to keep the doctor well informed and make a good impression. The whole thing is then not called marketing, but “commercial excellence”. In essence, however, the questions are the same: which channels and campaigns are effective? What can I do to sell more? 

We helped Daniel to structure the problem and then collect data. For one pilot country, all sales representatives were asked to collect data for their sales territory covering the past 24 months. In addition to the target figure “number of prescriptions”, we compiled the most important actions. These included the number of visits, participation in workshops, conference invitations, handing out brand reminders such as pens, product samples and much more. Across 20 sales territories, 480 data records were collected, which made it possible to evaluate 14 channels.

We applied our Causal AI methodology and once again the results amazed me: It was to be expected that the number of sales visits was a key driver. But the overall impact of product samples was zero. “Product samples are important” was still ringing in my ears.

I looked at the plots for non-linear relationships. It looked strange. The plots showed the result of a simulation: how many additional prescriptions could be expected if the number of product samples per doctor were increased or reduced to a certain value.

The plot showed an inverted U-shape. There was a number of product samples at which the effect was at its maximum. It took me a while, then it clicked. “It’s logical,” I said to myself. When sales reps hand out too many product samples, doctors reach the point where they no longer have enough patients to try them on. The samples are then handed out in place of a prescription; they substitute prescriptions instead of promoting them.

The software had found something that we hadn’t thought of before. In retrospect, it was as clear as day. 

This example is intended to show one thing: Reality is often different than we think. It is also usually more complex than we think, because as humans we are used to thinking one-dimensionally and linearly. Because this is the case, we need Causal AI methods that detect unknown non-linearities without us specifying them in advance with hypotheses.

It’s a bit like a toddler playing with a shape-sorting toy. It can be so frustrating when the cube doesn’t fit into the round hole. No amount of kicking or hammering will help. A Causal AI of the kind we need first looks at what kind of hole there is and can then insert the right object.

Unexpected interactions

Do you remember the Mintel case study above? Here, the Causal AI discovered interaction effects that we didn’t have on our radar beforehand.

The phenomenon of “interactions” behaves in the same way as non-linearities. Unfortunately, many managers do not intuitively understand what exactly is meant by the term “interaction” in methodological terms. So here is a definition:

Interaction or moderation effect: we speak of an interaction or moderation when the extent or the way in which a causal variable acts depends on another causal variable. Two (or more) variables then “interact” with each other in their effect. Distribution coverage only promotes sales if the product looks attractive, and the product is only bought again if it tastes good. The effect of each component depends on the strength of the others.

Intermediation or mediation effect: in common parlance, the term interaction is often used when one causal variable (say, friendliness) influences another (say, service quality), and this in turn influences the result (say, loyalty). However, this must be distinguished from a genuine interaction. That’s why it has a different name: (inter)mediation. In this example, the mediator is service quality.
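To make the distinction concrete, here is a small synthetic Python example (the variable names and numbers are invented for illustration): in the first block the effect of distribution depends on attractiveness, which is an interaction; in the second block friendliness acts on loyalty only through service quality, which is a mediation.

import numpy as np

rng = np.random.default_rng(3)
n = 10000

# Interaction: the effect of distribution depends on the level of attractiveness.
distribution = rng.uniform(0, 1, n)
attractiveness = rng.uniform(0, 1, n)
sales = distribution * attractiveness + 0.05 * rng.normal(size=n)

low = attractiveness < 0.5
slope_low = np.polyfit(distribution[low], sales[low], 1)[0]
slope_high = np.polyfit(distribution[~low], sales[~low], 1)[0]
print("effect of distribution at low attractiveness: ", round(slope_low, 2))   # roughly 0.25
print("effect of distribution at high attractiveness:", round(slope_high, 2))  # roughly 0.75

# Mediation, by contrast, is a chain: friendliness -> service quality -> loyalty.
friendliness = rng.normal(size=n)
service = 0.8 * friendliness + 0.5 * rng.normal(size=n)
loyalty = 0.6 * service + 0.5 * rng.normal(size=n)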

To find interactions, I again need a machine learning approach that is flexible enough to find whatever is there. Many causal AI approaches, by contrast, work on the basis of so-called “structural equations” or “causal graphs”. Here, the analyst determines which variable may affect which. Unconsciously, however, a fatal assumption is made: the assumption that the effects of the variables simply add up. Each cause is considered separately and its effects are summed. This rules out unknown interactions.

STEP #3 - Simulate, Test, Repeat

In step #1, I described how important it is to build a holistic data set, drawing on expertise about the real-world topic that the data describes. In step #2, I described what the AI should be able to do in order to build causal models and obtain causal insights.

Step #3 is now about how we should use these AI models to derive useful causal insights.

 

Illuminating the black box

The AI finds the formula hidden in the data. As such, it follows the flexible structural logic of a neural network and does not fit directly into the framework of human thinking. 

Human thinking consists of logical connections that fall back on categorizations (black/white). Continuous relationships can only be grasped as “the more of this, the more of that”. The requirement to make the findings of AI comprehensible to humans is therefore the requirement to simplify these findings and transfer them into the structure of human language.

A neural network tells us neither how important the input variables are nor how they are related. The weights of a neural network have no fixed meaning; meaning only emerges in the context of all the other weights. For example, the first hidden neuron is interchangeable with the second; its position plays no role. It is only the result of all neurons combined that has meaning. In this respect, an analysis of the weights is of limited use.

What you can do is explore the properties of the unknown function by simulation.

Let’s stick with Daniel and his Pharma Commercial Excellence success model and run a few simulations together on the variable “number of product samples”.

Average Simulated Effect (ASE): let’s simply increase the number of product samples by 1 for every data record we have (a sales territory in a given month) and then see what number of prescriptions the neural network predicts. If the product samples work, the average predicted number of prescriptions should be higher. In Daniel’s case, it was not. The so-called average simulated effect was close to zero. So did the variable have no effect at all? No. To understand this, let’s run these other simulations.
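Expressed as a small Python sketch (assuming any fitted model with a predict method and a pandas DataFrame of the learning data; the column name is a placeholder):

import pandas as pd

def average_simulated_effect(model, X: pd.DataFrame, column: str, delta: float = 1.0) -> float:
    X_plus = X.copy()
    X_plus[column] = X_plus[column] + delta          # increase the cause by +1 in every record
    return float((model.predict(X_plus) - model.predict(X)).mean())

# usage (hypothetical): ase = average_simulated_effect(model, X, "product_samples")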

Overall Explain Absolute Deviation (OEAD): to do this, we manipulate the product sample variable again. This time we replace the real data of this variable with a constant value: the average number of product samples. The output of the neural network now yields different values. The predicted values based on the real data are close to the actual number of prescriptions (low error). The forecasts obtained with the manipulated data are no longer as accurate; they produce a larger error. By measuring how much explanation of prescribing behavior we lose when we no longer have the information from the product samples, we can measure how important the variable is to us. In Daniel’s case, this value was quite large. So it was an important variable. However, there was no easily explainable (monotonic) relationship. But what does the relationship look like?

Non-linear plots: to see this, we simply look at the individual values of the OEAD simulation in a graph. The graph plots the number of product samples on the horizontal (X) axis. We create a point in the graph for each data record. On the vertical (Y) axis, we plot the CHANGE in the target variable (number of prescriptions) that occurs for this data record when we replace the value of the number of product samples with its mean value. What we see in Daniel’s example is a U-shaped relationship. The points do not form a clear line but a cloud, yet the relationship can be recognized. The point cloud is not created by the estimation error of the neural network, because we subtract two forecast values of the neural network, so that random component cancels out. The point cloud is created by interactions with other variables (and by model inaccuracies, which then appear as interactions).
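The following sketch shows one plausible way to compute such a relevance measure and the per-record changes behind the plot (my reading of the description above; the exact NEUSREL formulas may differ).

import numpy as np
import pandas as pd

def oead_and_changes(model, X: pd.DataFrame, y, column: str):
    X_fixed = X.copy()
    X_fixed[column] = X[column].mean()               # freeze the variable at its mean value
    pred_real = model.predict(X)
    pred_fixed = model.predict(X_fixed)
    # One reading of the OEAD: how much absolute prediction error do we gain
    # when the variable's information is removed?
    relevance = np.mean(np.abs(pred_fixed - np.asarray(y))) - np.mean(np.abs(pred_real - np.asarray(y)))
    change = pred_real - pred_fixed                  # per-record CHANGE used for the plot
    return relevance, change

# usage (hypothetical):
# relevance, change = oead_and_changes(model, X, y, "product_samples")
# import matplotlib.pyplot as plt
# plt.scatter(X["product_samples"], change)          # point cloud revealing the shape
# plt.xlabel("number of product samples"); plt.ylabel("change in predicted prescriptions")
# plt.show()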

Interaction plots: we can proceed in a similar way to visualize interaction effects. We simply take the model manipulated as in the OEAD above and additionally set a second variable that may interact, in this case the number of sales visits, to a constant. We can then display the result in a similar way in a 3D plot. The two horizontal dimensions are the number of product samples and the number of sales visits. The vertical dimension is again the CHANGE resulting from the replacement by the mean value. If the change triggered by the product samples is greater when there are more sales visits, we have an interaction. This is visible in the plot and can also be captured in a key figure.

Figuratively speaking, these simulations work like the human eye. In reality, the human eye only ever sees a small section: the section it focuses on. It perceives what is there through tiny eye movements. It is through these differences that we understand what is there and can filter out the irrelevant. If our muscles, including the eye muscles, were paralyzed, we would no longer see anything. These simulations are our eyes; they enable us to look into a complex world.

All of these simulations can be condensed into key figures. With the help of bootstrapping, we can also calculate significance values for them. The procedure is very simple: N data records are randomly drawn, with replacement, from the sample of N data records. This means that some records appear twice and others not at all. This creates a bootstrap data set. It represents a possible alternative sample that could just as well have been drawn next time. You now draw dozens or hundreds of these bootstrap data sets. For each of them, slightly different key figures result. If the key figures are close to each other (or, for example, all greater than zero), the significance is high.
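A generic sketch of this bootstrap procedure (illustrative only; it assumes a pandas DataFrame and whatever key-figure function you care about, such as an ASE recomputed on the resampled data):

import numpy as np

def bootstrap_key_figure(key_figure, X, y, n_boot: int = 100, seed: int = 0):
    """key_figure(X_sample, y_sample) -> float; X is assumed to be a pandas DataFrame."""
    rng = np.random.default_rng(seed)
    n = len(X)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)             # draw N records with replacement
        values.append(key_figure(X.iloc[idx], np.asarray(y)[idx]))
    values = np.array(values)
    share_above_zero = np.mean(values > 0)           # "significant" if the bulk lies on one side of zero
    return values.mean(), values.std(), share_above_zero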

But be careful: “significant” does not mean “important” and does not mean “relevant”. It only means “an effect is present”. Because many users (even in science) do not understand the real meaning of significance, significance hacking is practiced in both science and business. The smaller (and therefore more unrealistic) a model becomes, the higher the significance values. In this way, variables that do not fit the picture are filtered out just to create the appearance of quality.

In fact, “significance” is largely irrelevant for business practice. Even the American Statistical Association recently confirmed what I had found in practice, in a six-point statement.

What we want to know in practice is whether a cause is “relevant”. The OEAD above measures such relevance and represents a kind of effect-size measure. In Bayesian statistics, there is the concept of evidence. A relationship is evident if it is statistically relevant and the model makes sense (in accordance with the wisdom of your field). A small, overly simple model is not very meaningful and can therefore only produce limited evidence, even if we can demonstrate an effect size. In STEP #1 above we ensure meaningfulness, in STEP #2 we model the actual effects, and in STEP #3 we measure their RELEVANCE.

To illustrate this figuratively, let’s look at weightlifters and bodybuilders. It is highly significant that bodybuilders are strong and powerful. However, this strength is not necessarily relevant. It would be relevant if, for example, it allowed you to lift particularly heavy weights. In the picture you can see what the record holder in weightlifting looks like. You would actually think the bodybuilder could lift much more, but much of that muscle is just for show.

Deriving total effects

We’ll stick with Daniel and his model to explain the prescription figures. Every sales visit to a doctor was accompanied by the handing out of product samples as well as brand reminders and new information sheets. There were times when no product samples were given and there were times when more were given. But the typical picture was that it was customary to hand out a few samples.

Methodologically, this habit is reflected in an indirect causal effect. This is because when the sales managers increased the frequency of their visits, they also increased the number of samples given out, in line with the usual and expected ritual. The effect also went the other way around. As a consequence, both variables correlated strongly.

In order to determine the total effect of sales visits, it is necessary to consider both the direct effect (all other things being equal, i.e. ceteris paribus) and the indirect effect. This indirect effect arises because more visits result in more samples being handed out. This matters because management ultimately asks only one question: “What happens if I change cause X?”

The total ASE is the direct ASE plus the indirect ASE. The indirect ASE is the ASE of the visits on the number of samples times the ASE of the number of samples on the number of prescriptions.
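A worked numeric example of this composition rule, with invented numbers purely for illustration:

ase_visits_direct = 2.0        # extra prescriptions per extra visit, samples held constant
ase_visits_on_samples = 3.0    # extra samples handed out per extra visit
ase_samples_on_rx = 0.5        # extra prescriptions per extra sample

ase_visits_total = ase_visits_direct + ase_visits_on_samples * ase_samples_on_rx
print(ase_visits_total)        # 2.0 + 3.0 * 0.5 = 3.5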

The total OEAD can be calculated in the same way.  Of course, there are many indirect effects in complex causal networks, some of which are nested or even circular. However, all of these can be calculated with software and combined into an overall effect.

The total OEAD tells me whether a variable is relevant. The total ASE tells me how large the (monotonic) effect of an incremental increase in a cause is on average.

Orchestra musicians and conductors know this better than almost anyone else. The room, the concert hall, is decisive for the sound. Only some of the sound waves reach the listener’s ear directly. There are many indirect reflections that make up the fullness of the sound. The sound measured at the instruments stands for the variables describing the actions of the business, the reflections off the walls of the hall stand for the intermediary variables, and the coughing of the person sitting next to you stands for the situational variables.

Checking the causal direction

When we implemented the first algorithms for recognizing causal direction in the NEUSREL software in 2012, we encountered astonishing results. The first use case was data from the American Customer Satisfaction Index. In the structural equation model commonly used in marketing, satisfaction influences loyalty and not the other way around. Satisfaction is a short-term, changeable opinion about a brand. Loyalty, on the other hand, is an attitude that only changes in the long term. This is what marketing science has established.

But our algorithms revealed a clearly different picture: loyalty influences satisfaction, not the other way around! Was there really nothing wrong with the algorithms?

Then it clicked for me: both were right! Each in its own way.

The data are responses from consumers in a survey. If someone is loyal but not satisfied, they tend to indicate a higher level of satisfaction than they actually feel due to their loyalty. This is a kind of psychological response bias. In this sense, current loyalty has a causal influence on the current level of satisfaction.

Things look different on a different time horizon. If we were to survey the same people again with a time lag, we would find that a low level of satisfaction – over a certain period of time – reduces loyalty.

Ergo: causality always refers to a time horizon. Understanding and specifying this is also the task of marketing managers. A purely data-driven view is blind here.

Another example? If you wade barefoot through the cold November rain, you risk catching a cold. The cold doesn’t make you ill. But it does weaken the immune system. In the case of a virus, the likelihood of an outbreak increases. 

However, it is also true that regular exposure to the cold strengthens the immune system and boosts the functioning of the mitochondria, meaning that you will catch fewer colds in the long term. Wading barefoot leads to fewer, not more, colds in the long term.

There are many of these examples. People who wash their hair often have well-groomed, grease-free hair.  If you don’t wash your hair, you’ll soon get the proof, because your hair quickly becomes greasy.

But if you never wash your hair, the hair roots will not regrease as quickly in the long term, as the hair already has a healthy greasy film. The hair will also look healthy and tidy (if you comb it). 

This shows that the topic of causal direction requires human supervision. First of all, we need to define for ourselves which impact horizon we are interested in. 

Furthermore, most causal directions in marketing follow from pure common sense and specialist knowledge. For the remaining relationships, testing methods can be used. I would like to discuss the two most relevant concepts here: the PC algorithm and the Additive Noise Model.

The PC algorithm

The PC algorithm (named after Peter Spirtes and Clark Glymour, who contributed significantly to its development) is a method in machine learning that is used to determine the structure of causal networks from data. The algorithm attempts to discover causal relationships between variables by analyzing so-called “conditional independencies” in the data. 

The illustration shows how the algorithm examines triples of variables and derives which causal directions follow logically from their “conditional (in)dependencies”. If A and C as well as B and C are dependent, but A and B are not, then it follows that A and B act on C and not vice versa. Otherwise A and B would be interdependent.
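This logic can be reproduced in a few lines of Python on synthetic data; a simple (partial) correlation serves as the independence test here, whereas real implementations use proper conditional-independence tests.

import numpy as np

rng = np.random.default_rng(4)
n = 5000
A = rng.normal(size=n)
B = rng.normal(size=n)
C = A + B + 0.5 * rng.normal(size=n)         # collider: A -> C <- B

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print("corr(A, C):", round(corr(A, C), 2))   # clearly dependent
print("corr(B, C):", round(corr(B, C), 2))   # clearly dependent
print("corr(A, B):", round(corr(A, B), 2))   # close to zero: marginally independent

# Conditioning on C (here: regressing C out) makes A and B dependent,
# exactly the signature from which the algorithm infers A -> C <- B.
A_res = A - corr(A, C) * (np.std(A) / np.std(C)) * C
B_res = B - corr(B, C) * (np.std(B) / np.std(C)) * C
print("partial corr(A, B | C):", round(corr(A_res, B_res), 2))   # clearly non-zero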

The method focuses on linear relationships, but can be extended to non-linear ones. However, studies show that the rate of incorrect decisions increases rapidly with the size of the causal network. It is therefore only recommended for small models of fewer than 20 variables. The example above is an extreme case. In reality, we deal less with black and white and more with shades of gray, and with weak effect sizes it becomes increasingly difficult to establish the causal directions cleanly.

Additive Noise Modeling 

Additive noise modeling is a way of finding out whether one thing (let’s call it A) causes another thing (B) or vice versa. It is based on the idea that if A causes B, then B is not determined by A alone; a bit of “noise” (unforeseen influences or chance) is added as well. Importantly, this noise has nothing to do with A.

The figure below shows the same data on the left and right, except that the variable X is plotted once horizontally and once vertically. A simple linear regression is shown by the red line. The quality of the regression (coefficient of determination R²) is exactly the same in both cases. However, the deviation (spread) around the regression line is constant in the left-hand case and not in the right-hand case. If we assume independent noise, then X must be the cause of Y and not vice versa.

To decide whether A causes B or B causes A, we look at both possibilities and check in which direction the noise is truly independent of its supposed cause. If the noise only appears independent when we assume that A causes B (and not the other way around), then we conclude that A most likely does cause B.
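A rough Python sketch of this decision procedure on synthetic data; the simple heteroscedasticity check used here is only a stand-in for the stronger independence tests (such as HSIC) used in practice.

import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
n = 2000
X = rng.uniform(0, 2, n)
Y = X ** 3 + rng.normal(scale=1.0, size=n)      # true direction: X causes Y

def residual_dependence(cause, effect):
    """Fit effect = f(cause) + noise and measure how strongly the residual
    spread still depends on the supposed cause (near 0 = looks independent)."""
    model = GradientBoostingRegressor().fit(cause.reshape(-1, 1), effect)
    residuals = effect - model.predict(cause.reshape(-1, 1))
    return abs(spearmanr(cause, np.abs(residuals))[0])

print("assuming X -> Y:", round(residual_dependence(X, Y), 3))   # typically small
print("assuming Y -> X:", round(residual_dependence(Y, X), 3))   # typically clearly larger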

The method is based on the assumption that the noise is truly independent and that the relationship between A and B is adequately captured by the model. If this is not the case, the method can lead to false conclusions. Modeling that is as close to reality as possible is therefore an important prerequisite for this test.

It’s a bit like trying to find out where up and down are, assuming that the gravitational force comes from the earth. If you turn your head, your hair should hang down. 

In summary, there are several methods for testing the causal direction. None of them offers a silver bullet. Therefore, a conscious approach and the integration of specialist knowledge is essential.

Check whether confounders are at work

A mobile communications company asked us to predict how vulnerable its customers are to churn. The company had already implemented a number of customer retention measures and also wanted to know how well they were working. In particular, the focus was on “cuddle calls”. Customers were called as a precautionary measure to ask them whether they were satisfied, simply to show appreciation and, if necessary, to hear whether someone was at risk of churning.

The figures were frightening: the households that had accepted a cuddle call had twice as high a cancellation rate in the following year as those that had not. The program was about to be axed. The assumption was that these calls were waking up sluggish customers at risk of churning (so-called sleepers).

Even our first churn modeling seemed to confirm this. The flag variable “cuddle call” had a positive effect on the churn probability.

We then enriched the data with sociodemographic data and expanded the modeling. The result: the cuddle calls now reduced the probability of termination! How could that be?

We had integrated a confounder into the model. It turned out that most households cannot be reached by phone during the day and that this accessibility is a strong filter for making calls. Socially disadvantaged households in particular were reachable and these households had a significantly higher probability of termination.

The cuddle call correlated positively with the probability of termination because churn-prone target groups were easier to reach by phone, not because the call was ineffective. Quite the opposite.

It really is like a puppet show. As children, we see the puppets move. The robber hits the farmer and he falls over. Only apparently. In reality, there is a cause hovering above it that we cannot see: the puppeteer.

It is the same with the data. If we do not see the puppeteer (we have no data about him in the model), then we infer erroneous cause-and-effect relationships from correlations.

So how can we check whether confounders are distorting our model? We use two methods in the NEUSREL software:

 

The Hausman Test

This test is similar to additive noise modeling and also to double machine learning. Its thesis: if the residuals of the target variable, the so-called noise (i.e. the difference between the forecast and the real value), can be explained by the supposed causes themselves, then they are not clean causes. So if the explanatory variables can be used to predict the residuals with the help of an AI model, this is an indicator of confounders.

Following the Hausman test, the same modeling is therefore carried out again, only with the residuals as the target variable. The same AI algorithms are used, and the same simulation algorithms are applied, in particular to calculate the effect size OEAD.
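A simplified sketch of this residual check (a generic implementation of the idea described above, not the NEUSREL test itself): if a cross-validated model can still predict the residuals from the explanatory variables, this is taken as a warning sign.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score

def residual_check(X, y, cv: int = 5) -> float:
    # Step 1: cross-fitted prediction of the target and its residuals
    y_hat = cross_val_predict(GradientBoostingRegressor(), X, y, cv=cv)
    residuals = np.asarray(y) - y_hat
    # Step 2: try to explain the residuals with the same explanatory variables
    score = cross_val_score(GradientBoostingRegressor(), X, residuals, cv=cv, scoring="r2").mean()
    return score   # around 0: no indication; clearly above 0: possible confounding

# usage (hypothetical): print(residual_check(X, y))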

If we then have clear indications that a confounder is at work, we can do some soul-searching, and now and again it becomes clear which data sources we may have forgotten to use.

However, if no other data can be obtained for practical reasons, the question arises as to whether there is any way of avoiding the distorting effects.

 

Confounder elimination

From 2011 to 2013, I worked on this topic with Dr. Dominik Janzing (then at the Max Planck Institute, now at Amazon Research), and together we came up with an idea.

If you plot the data of two (or more) explanatory variables in a two-dimensional plot, you will get some distribution, for example a Gaussian one. This distribution may be elongated if the variables correlate with each other, or it may form an arc if there is a non-linear association.

However, if two (or more) separate distributions result, i.e. if the data gather in clusters, then there must be a cause for this. This unknown cause IS the confounder. It is the influence of this confounder that pulls a previously single distribution apart. The information of the confounder is the vector between the clusters. This vector consists of the differences between the cluster centers.

Our procedure for “confounder elimination” now proceeds as follows (a Python sketch follows the list):

  • Finding clusters and calculating the vectors between the clusters
  • Projection of all data onto this/these vector(s) using vector multiplication
  • Using the result as an additional variable in the AI model
  • Recalculation of the AI model.
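Here is that sketch, using k-means for the clustering step (an assumption made for illustration; the NEUSREL implementation may differ in detail).

import numpy as np
from sklearn.cluster import KMeans

def confounder_proxy(X: np.ndarray, n_clusters: int = 2) -> np.ndarray:
    """Return an extra column approximating the hidden confounder (two-cluster case)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    direction = km.cluster_centers_[1] - km.cluster_centers_[0]   # vector between the cluster centers
    direction = direction / np.linalg.norm(direction)
    return X @ direction                                          # project every record onto that vector

# usage (hypothetical):
# proxy = confounder_proxy(X_explanatory)
# X_augmented = np.column_stack([X_explanatory, proxy])           # then re-estimate the AI model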

To speak in images again: confounders are like an Italian mom who distributes the spaghetti onto the plates. By watching the plates, we can understand that they have been filled by the mom and are not filling each other.

Repeat: Adjusting the model on the basis of new findings

In the first step, based on existing knowledge and wisdom, we collected and processed data and built a modeling approach. In the second step, we modeled each influenced variable with a causally appropriate AI method. In step three, we opened the black box and ran some tests. We will learn from the analysis of these results. 

We may find that we were mistaken: mistaken in some assumptions or in the processing of a variable. It is therefore usually necessary to refine, optimize and remodel.

The key figures for model fit, causal direction or confounders are not the only criteria for optimization. Ultimately, a person decides whether the model is meaningful and useful. This requires good specialist knowledge, wisdom and a solid foundation in causal machine learning.

What if you don’t have the confidence to do this? Of course you can get help from external experts, but there is another way: standardization.

Let’s assume you are building a marketing mix modeling. Once you have gone through the process cleanly, you could try to define all stages in a standard. If you prepare the same type of data in the same way, model and simulate it with the same AI methods, then the results will be interpretable in the same way. If this is the case, you can cast (or have cast) this interpretation and consulting service in software. The entire process can then be repeated with little effort for other business units, countries or at other times.

In my opinion, the development of standardized causal AI-based problem solutions is the future for 80% of marketing issues.  

The term standardization is often equated with “loss of individuality” and therefore lower quality. However, this view is short-sighted and one-sided. A good standard process is crystallized wisdom. It also ensures quality because it prevents errors. Individualization is always possible, but it comes at a significant cost.

Key Learnings

  1. Start with a blank sheet of paper and collect everything that could influence your target values. Then get the data.

  2. Model the data with an AI algorithm that ensures, during the learning process, that forecasts are based only on genuine causes.

  3. Open the AI black box with suitable simulation algorithms.

  4. Algorithmically check the causal directions and test for confounders.

  5. Optimize your model and recalculate until it makes sense and is useful enough.

  6. Standardize everything in one process – from the data, the preparation, the method to the preparation of the results.
