Predicting Customer Churn with Logistic Regression in Python


In the world of business, churn refers to losing a customer who was previously engaged and could have been profitable. The loss of such customers can be costly, as acquiring new customers is often more expensive than retaining existing ones. To tackle this issue, we often want to predict which customers are likely to leave the company in the future. In this blog post, we'll explore a Python code example that addresses this churn prediction problem step by step.


Preventing churn is therefore key to a sustained revenue stream, so the goal here is to predict which customers are likely to leave the company, or churn, in the future.

This is a binary classification problem, which we can write as g(x_i) ≈ y_i, where:

x_i is the feature vector describing the details of customer i

y_i is the target value we want to predict

g is the model that maps the features to a prediction

In this case y_i ∈ {0, 1}: 0 is the negative class (the customer does not churn), and 1 is the positive class (the customer churns).


Importing Necessary Libraries

Before diving into the problem, we need to import the necessary libraries to perform data manipulation, analysis, and machine learning.

 

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

 

Loading the Dataset

The first step is to load the dataset containing customer information. We're assuming the dataset is stored in a file named "Churn-prediction.csv."

 

path = 'Churn-prediction.csv'

df = pd.read_csv(path)

df.head(2)

         


Data Preprocessing

Before building a predictive model, it's crucial to preprocess the data to ensure its quality and consistency. In this code, the following preprocessing steps are performed:

 

Normalizing Column Names: The column names in the dataset might use different formats. To keep them consistent, we convert all column names to lowercase and replace spaces with underscores.


df.columns = df.columns.str.lower().str.replace(" ", "_")

 

Handling Categorical Columns: The string values inside the categorical columns are normalized the same way: converted to lowercase, with spaces replaced by underscores.


categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:

    df[c] = df[c].str.lower().str.replace(" ", "_")
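
To see what these two normalization steps do, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not taken from the actual dataset; string columns are detected with pd.api.types.is_string_dtype rather than the dtype comparison used above, which is an alternative chosen so the sketch does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Toy frame with hypothetical columns, only for illustration
toy = pd.DataFrame({
    "Customer ID": ["0001-A", "0002-B"],
    "Payment Method": ["Electronic check", "Mailed check"],
    "Monthly Charges": [29.85, 56.95],
})

# Normalize column names: lowercase, spaces -> underscores
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Normalize the values of every string column the same way
string_columns = [c for c in toy.columns if pd.api.types.is_string_dtype(toy[c])]
for c in string_columns:
    toy[c] = toy[c].str.lower().str.replace(" ", "_")

print(list(toy.columns))
print(toy["payment_method"].tolist())
```

The numeric column is left untouched, while both the headers and the text values end up in one consistent format.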

 

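Converting Total Charges to Numeric: The "totalcharges" column is intended to be numeric, but it might be stored as an object (string) column. We convert it to numeric with errors='coerce', which turns any value that cannot be parsed into NaN, and then fill those missing values with 0.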


df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')

df.totalcharges = df.totalcharges.fillna(0)
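
As a small illustration of how errors='coerce' behaves, the sketch below runs the same two calls on a made-up Series. The sample values are assumptions for demonstration, with a blank string standing in for an unparseable entry.

```python
import pandas as pd

# Hypothetical raw values; " " represents a blank, non-numeric entry
raw = pd.Series(["29.85", " ", "1889.5"])

# Unparseable entries become NaN instead of raising an error
as_numbers = pd.to_numeric(raw, errors="coerce")
n_missing = int(as_numbers.isnull().sum())

# Replace the NaN values with 0, as in the post
filled = as_numbers.fillna(0)

print(n_missing)
print(filled.tolist())
```

Counting the NaN values before filling them is a cheap way to check how many rows are affected by the replacement.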

 

Converting Churn Labels: The "churn" column contains values like "yes" and "no." We convert them to binary values (1 for "yes" and 0 for "no").


df.churn = (df.churn == 'yes').astype(int)

 

Data Splitting

Next, we split the dataset into training, validation, and test sets. This division allows us to train the model, tune hyperparameters using the validation set, and evaluate the final model's performance on the test set.

 

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

# Split the remaining 80% into train (60% of the total) and validation (20% of the total)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
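
The second split uses test_size=0.25 because 25% of the remaining 80% is 20% of the whole dataset. A quick sketch on a synthetic 100-row frame (the data is made up; only the proportions matter) shows the resulting 60/20/20 split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with 100 rows, for checking the proportions only
toy = pd.DataFrame({"x": range(100), "churn": [0, 1] * 50})

# 20% held out for testing, 80% kept as the full training set
df_full_train, df_test = train_test_split(toy, test_size=0.2, random_state=1)

# 25% of the 80% becomes validation, the rest is the training set
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

print(len(df_train), len(df_val), len(df_test))
```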

 

Exploratory Data Analysis (EDA)

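Before diving into modeling, it's important to gain insights from the data through exploratory data analysis. This helps us understand the characteristics of the dataset and identify potential trends. Here we compare the churn rate within each group of a feature to the global churn rate: a difference above 0, or a risk ratio above 1, means the group churns more often than average.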

 

global_churn_rate = df_full_train.churn.mean()

df_group = df_full_train.groupby('gender').churn.agg(['mean', 'count'])

df_group['difference'] = df_group['mean'] - global_churn_rate

df_group['risk'] = df_group['mean'] / global_churn_rate
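
To make the difference and risk columns concrete, here is a minimal sketch on a tiny made-up table. The "partner" column and its values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Eight hypothetical customers: 1 of 4 with a partner churned, 3 of 4 without
toy = pd.DataFrame({
    "partner": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "churn":   [0, 0, 0, 1, 1, 1, 0, 1],
})

global_churn_rate = toy.churn.mean()  # 4 churned out of 8

df_group = toy.groupby("partner").churn.agg(["mean", "count"])
df_group["difference"] = df_group["mean"] - global_churn_rate  # absolute gap
df_group["risk"] = df_group["mean"] / global_churn_rate        # relative gap

print(df_group)
```

In this toy table the "no" group has a risk ratio of 1.5 (it churns 50% more often than average), while the "yes" group has a risk ratio of 0.5.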

 

Feature Importance: Mutual Information Score

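To understand how informative each categorical feature is for predicting churn, we can calculate the mutual information score. This score measures the dependence between two variables: the higher it is, the more we learn about churn by observing the feature. Here, categorical is assumed to be a list holding the names of the categorical feature columns.

 

from sklearn.metrics import mutual_info_score

# Mutual information between one categorical column and the churn label
def mutual_info_churn_score(series):
    return mutual_info_score(series, df_full_train.churn)

mutual_info = df_full_train[categorical].apply(mutual_info_churn_score)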
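
The following sketch applies the same idea to a toy table, to show how the scores rank features. The "contract" and "gender" columns and their values are made up: "contract" is built to determine churn completely, while "gender" is only weakly related to it.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical data: contract separates churners perfectly, gender does not
toy = pd.DataFrame({
    "contract": ["monthly", "monthly", "monthly", "yearly", "yearly", "yearly"],
    "gender":   ["f", "m", "f", "m", "f", "m"],
    "churn":    [1, 1, 1, 0, 0, 0],
})
categorical = ["contract", "gender"]

def mutual_info_churn_score(series):
    # Dependence between one categorical column and the churn label
    return mutual_info_score(series, toy.churn)

# One score per column, most informative feature first
mutual_info = toy[categorical].apply(mutual_info_churn_score)
mutual_info = mutual_info.sort_values(ascending=False)

print(mutual_info)
```

Sorting the scores in descending order gives a quick ranking of which categorical features are worth a closer look.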

Feature Importance: Correlation


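Correlation measures the linear relationship between two variables. We can calculate the correlation between each numerical feature and the churn label. Here, numeric is assumed to be a list holding the names of the numerical feature columns, such as totalcharges.

 

numeric_correlation = df_full_train[numeric].corrwith(df_full_train.churn)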
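
As an illustration of how to read the result, here is a small sketch on made-up numbers. The column names "tenure" and "monthlycharges" are assumptions used only for this example.

```python
import pandas as pd

# Hypothetical customers: churners have short tenure and high monthly charges
toy = pd.DataFrame({
    "tenure":         [1, 2, 3, 10, 20, 30],
    "monthlycharges": [90, 80, 85, 40, 50, 30],
    "churn":          [1, 1, 1, 0, 0, 0],
})
numeric = ["tenure", "monthlycharges"]

# Pearson correlation of each numerical column with the churn label
numeric_correlation = toy[numeric].corrwith(toy.churn)

print(numeric_correlation)
```

A negative value means churn becomes less likely as the feature grows, and a positive value means the opposite. The sign gives the direction and the absolute value gives the strength of the linear relationship.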

 

One-Hot Encoding

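Categorical variables need to be transformed into a format that machine learning algorithms can process. One-hot encoding converts each categorical value into a binary vector. DictVectorizer does this for string values and passes numerical values through unchanged, so we first turn each row into a dictionary.

 

from sklearn.feature_extraction import DictVectorizer

# Each row becomes a {column: value} dictionary
train_dicts = df_train[categorical + numeric].to_dict(orient='records')
val_dicts = df_val[categorical + numeric].to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)  # learn the encoding on the training set only
X_val = dv.transform(val_dicts)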

 

Logistic Regression Model

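Logistic regression is a common algorithm for binary classification problems. It models the probability of the positive class (here, churn) as a sigmoid function of a weighted sum of the features.

 

from sklearn.linear_model import LogisticRegression

# Target array taken from the churn column of the training split
y_train = df_train.churn.values

model = LogisticRegression()
model.fit(X_train, y_train)

# Column 1 holds the predicted probability of churn
y_pred = model.predict_proba(X_val)[:, 1]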
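
To make the link between the fitted weights and the predicted probabilities explicit, the sketch below recomputes predict_proba by hand on synthetic data. The data is randomly generated and unrelated to the churn dataset, and the sigmoid helper is defined here only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, unrelated to the churn dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

# Weighted sum of the features plus the bias term, passed through the sigmoid
manual = sigmoid(X @ model.coef_[0] + model.intercept_[0])
from_sklearn = model.predict_proba(X)[:, 1]

print(np.allclose(manual, from_sklearn))
```

For a binary problem the two arrays agree, which is why the coefficients can be read as contributions to the score that the sigmoid turns into a churn probability.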

 

Model Interpretation

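Interpreting a machine learning model helps us understand which features contribute to its predictions. In logistic regression, a positive coefficient pushes the predicted churn probability up, and a negative one pushes it down.

 

# Names of the encoded columns, in the same order as the model coefficients
feature_names = dv.get_feature_names_out()

feature_coefficients = dict(zip(feature_names, model.coef_[0]))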

 

Using the Model

After training and interpreting the model, we can use it to make predictions on new data.

 

dicts_test = df_test[categorical + numeric].to_dict(orient='records')

X_test = dv.transform(dicts_test)

y_pred = model.predict_proba(X_test)[:, 1]
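
The predicted probabilities still have to be turned into yes/no decisions and compared with the true labels. A minimal sketch follows, using made-up probabilities and labels; the 0.5 threshold is an assumption, since the post does not fix a cut-off.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels, for illustration only
y_pred = np.array([0.10, 0.80, 0.35, 0.65, 0.55])
y_test = np.array([0, 1, 0, 1, 0])

threshold = 0.5  # assumed cut-off, not specified in the post
churn_decision = y_pred >= threshold

# Share of customers whose predicted class matches the true label
accuracy = (churn_decision == y_test).mean()

print(churn_decision.astype(int).tolist())
print(accuracy)
```

In practice the threshold can be tuned on the validation set, for example lowered if missing a churning customer is more costly than contacting a loyal one.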

 

Conclusion

In this blog post, we walked through a Python code example for predicting customer churn using logistic regression. We covered data preprocessing, exploratory data analysis, feature importance analysis, model training, interpretation, and using the trained model for predictions. Predicting churn is essential for businesses to retain valuable customers and ensure a stable revenue stream.
