Predicting Customer Attrition with Machine Learning Tools

Rene Reid
5 min read · Dec 1, 2020

Can a machine learning model effectively predict customer attrition?

Subscription-based businesses generally rely on both the flow of new customers and the retention of existing ones for revenue generation. Acquiring new customers can be expensive: think marketing costs to reach new audiences and costly promotions for client conversion. Retaining existing customers is therefore an important consideration for the health of a business.

If a business can predict which customers are susceptible to cancellation, it can take preventative countermeasures, such as targeted special offers, to retain those customers.

Machine learning models can be trained to predict — on the basis of customer attributes — whether a particular customer will cancel a subscription.

The dataset used to illustrate this is Kaggle’s Churn in Telecom’s dataset (“customer churn” signifies a customer leaving a service). The dataset includes approximately 3,000 customer profiles with both qualitative and quantitative attributes: for example, subscription duration, whether the customer has an international plan, total charges, the number of customer service calls, and the customer’s churn status.

The model utilizes a logistic regression classifier, which analyzes customer attributes to predict a customer’s churn classification. To ensure the model’s performance is not inflated by making predictions on data it has already “seen”, the dataset was split into a training set and a test set. The test set was left unused until the very end, to evaluate the performance of the model, while the training set was further subdivided in two: one part was used to train the model, and the other to experiment with different combinations of conditions on the machine learning process. These high-level conditions are referred to as hyperparameters.

Because the dataset mixes quantitative and qualitative data, it had to be transformed before the model could use it. Qualitative attributes were encoded as numbers, and numerical attributes were scaled so that differences in units did not distort the results.

Below is a sketch of the steps described above, assuming scikit-learn’s train_test_split, OneHotEncoder, and StandardScaler; the file and column names are illustrative.
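```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the Kaggle churn dataset (file name is illustrative).
df = pd.read_csv("telecom_churn.csv")

X = df.drop(columns=["churn"])   # assumes the label column is named "churn"
y = df["churn"]

# Hold out a test set, stratified to preserve the ~85/15 class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Encode the qualitative columns; scale the numerical ones.
categorical = X.select_dtypes(include="object").columns
numerical = X.select_dtypes(exclude="object").columns

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),
])

# Fit the transformers on the training data only, then apply to both sets.
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)
```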

Consideration was given to the fact that, typically, a much higher proportion of subscription customers retain a service than cancel it within a given time period. In the Telecom dataset, approximately 85% of customers maintained their subscription and the remaining 15% cancelled. A “dummy classifier” that ignores customer attributes and predicts every customer as non-cancelling therefore achieves 85% accuracy; however, such a classifier would not correctly identify any churning customers.
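Such a baseline is easy to reproduce; a minimal sketch with scikit-learn’s DummyClassifier, reusing the split from above:

```python
from sklearn.dummy import DummyClassifier

# Always predicts the majority class, i.e. "did not cancel".
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_prepared, y_train)

print(baseline.score(X_test_prepared, y_test))  # ~0.85 accuracy
# Yet recall on the churn class is 0: no cancelling customer is flagged.
```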

Furthermore, given the objectives of the business — maximizing profit — optimizing the model’s accuracy may not be appropriate. The negative consequences of losing a customer tend to outweigh the cost of giving unnecessary promotions to a non-cancelling customer.

Thus, this model is biased towards successfully identifying cancelling customers. This comes at the cost of decreased overall accuracy, because some customers who would not have cancelled are also predicted as cancelling. In other words, the model is optimized for the F1 score, the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).

To help achieve this, the model’s logistic classifier used a “balanced” class-weight option that makes it seem as if the two predicted outcomes, cancel and not-cancel, occur equally often. This way, greater weight is put on “capturing” those customers who actually cancelled.
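In scikit-learn terms this corresponds to class_weight="balanced", which reweights each class inversely to its frequency; a quick sketch of the resulting weights:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# "balanced" weight per class = n_samples / (n_classes * class_count),
# so the rarer churn class receives the larger weight.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train),
                               y=y_train)
print(weights)  # roughly [0.59, 3.33] for an 85/15 split
```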

Illustrated below is the effect of the balanced option. The first histogram shows the model’s predicted cancellation probabilities without the “balanced” option; blue represents true non-cancel incidents and orange represents true cancel incidents. Notice that a significant number of true cancelling incidents are given a cancellation probability of less than 50%. These represent false negatives.

The second histogram, directly below, shows the model’s predicted cancellation probabilities using the “balanced” option. A clear majority of true cancel incidents now have a predicted cancellation probability of over 50%. There are also true non-cancel incidents with a predicted cancellation probability of over 50%. This illustrates the decrease in false negatives at the cost of an increase in false positives.
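A plot along these lines can be produced from the classifier’s predicted probabilities; a sketch with matplotlib, assuming the fitted classifier `model` from the training code shown further below and boolean churn labels:

```python
import matplotlib.pyplot as plt

# Predicted probability of the "cancel" class for each test customer.
proba = model.predict_proba(X_test_prepared)[:, 1]

mask = y_test.to_numpy(dtype=bool)  # True where the customer cancelled
plt.hist(proba[~mask], bins=20, alpha=0.6, label="true non-cancel")
plt.hist(proba[mask], bins=20, alpha=0.6, label="true cancel")
plt.axvline(0.5, linestyle="--", color="grey")  # default decision threshold
plt.xlabel("Predicted cancellation probability")
plt.ylabel("Number of customers")
plt.legend()
plt.show()
```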

This shift in predicted probabilities is reflected in the confusion matrix, directly below. Notice that a large majority of true churn incidents are predicted as such, with a recall rate of 82.5%.
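A matrix like this can be generated with scikit-learn’s metrics; a sketch, again assuming the fitted `model`:

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, recall_score

y_pred = model.predict(X_test_prepared)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["non-cancel", "cancel"]).plot()

print(recall_score(y_test, y_pred))  # share of true churners caught
```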

Using this model, a classification accuracy of approximately 77% is achieved, lower than the 85% of the baseline “dummy classifier”. However, more than 80% of cancelling customers were correctly predicted; given the objectives of the business, the model’s trade-offs are justified.

Below is a sketch of the code used to create and train such a model; it assumes scikit-learn’s LogisticRegression and GridSearchCV, and the hyperparameter grid shown is illustrative.
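```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Logistic regression with balanced class weights, tuned for F1 rather
# than raw accuracy.
logreg = LogisticRegression(class_weight="balanced", max_iter=1000)

# Illustrative hyperparameter grid, searched with 5-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1, 10]}  # inverse regularization strength

search = GridSearchCV(logreg, param_grid, scoring="f1", cv=5)
search.fit(X_train_prepared, y_train)

model = search.best_estimator_
print(search.best_params_)
print(model.score(X_test_prepared, y_test))  # overall accuracy on the test set
```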

However, one should not assume that the model will make predictions with the same reliability in deployment.

Firstly, the amount of data the model was trained and tested on (roughly 3,000 observations) is relatively small. There is a significant likelihood that the dataset does not adequately represent the overall population. In addition, because of the small data size and the number of times the model was refined and tested on the same data, overfitting is possible.

Also, the model’s performance varied over a wide range of values during cross-validation. This lack of consistency should make the user hesitant to over-assert the model’s predictions.
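This spread is easy to inspect; a sketch using cross_val_score on the training data:

```python
from sklearn.model_selection import cross_val_score

# F1 score per fold; a wide spread signals unstable performance.
scores = cross_val_score(model, X_train_prepared, y_train,
                         scoring="f1", cv=10)
print(scores.round(2))
print(f"min {scores.min():.2f}, max {scores.max():.2f}, std {scores.std():.2f}")
```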

Secondly, the dataset the model trained on lacks date information (e.g. the year of collection), which could affect the data’s relevance to the current market. Changes in the telecommunications market over the last 10–20 years may have changed consumer behavior; for instance, a market of predominantly landlines has transitioned to cellphone subscriptions. We cannot assume the data the model trained on reflects current consumer behavior.

Finally, the model trained on data geographically specific to the United States; conditions within that market may be unique and may not generalize. Hence, the model may see a significant deterioration in performance if used to predict customer churn in other markets.
