Understanding Customer Churn Using Spark for Big Data

The demand for audio streaming services has increased, the supply of these services has increased too, and the competition is fierce. Customer loyalty to a specific app or service is crucial for this business model.

photo by https://unsplash.com/@blocks

Sparkify is a fictional digital media service created by Udacity, similar to Spotify or Pandora, that many users use every day. It offers two tiers, a free tier and a premium subscription, and a user is free to upgrade, downgrade, or cancel the service whenever they like. Users need to like the service to stay loyal to Sparkify.

We track every event: each time a user plays a song, visits a specific page, or gets an error, a record is generated, and the data easily scales to gigabytes. Fortunately, big data technologies like Spark can handle data of this size, so we can analyze the events and predict customer churn before users leave.

What is customer churn?

Customer churn is when a user stops using a company’s product, in this case the Sparkify service. It is a metric that helps show whether the business is growing, and it is important because it normally costs more to acquire new customers than to retain existing ones. To reduce customer churn, Sparkify has decided to analyze the data and offer incentives to keep its customers.

Project definition

The project uses Spark on a realistic dataset to engineer relevant features for predicting churn, and Spark MLlib to build machine learning models on large datasets. The full dataset is 12 GB (s3n://udacity-dsnd/sparkify/sparkify_event_data.json), of which you can analyze a 128 MB mini subset (s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json).

The code for this project is in the GitHub repository.

Create a Spark session:

from pyspark.sql import SparkSession

# create a Spark session running locally on all available cores
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .config("spark.ui.port", 3000) \
    .appName("Sparkify") \
    .getOrCreate()

Load and Clean Dataset

The mini subset is a small dataset for machine learning; it contains 286,500 rows and the following columns:

list of columns of the dataset

Some rows have a missing userId, normally because cookies are disabled or we cannot track that specific user, so the data related to them has been removed. There are 225 unique users, with the following distribution:
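A minimal plain-Python sketch of this cleaning step follows, assuming that a missing userId appears as an empty string or as None (an assumption; the text only says the userIds are missing). The sample events and the name `drop_missing_users` are hypothetical and only illustrate the logic.

```python
def drop_missing_users(events):
    """Keep only the events that carry a usable userId."""
    return [e for e in events if e.get("userId") not in (None, "")]

# hypothetical sample events, not taken from the real dataset
raw_events = [
    {"userId": "10", "page": "NextSong"},
    {"userId": "", "page": "Home"},        # untracked visitor
    {"userId": None, "page": "Login"},     # untracked visitor
    {"userId": "20", "page": "Thumbs Up"},
    {"userId": "10", "page": "Logout"},
]

clean_events = drop_missing_users(raw_events)
unique_users = {e["userId"] for e in clean_events}
print(len(clean_events), len(unique_users))
```

On a Spark DataFrame the same rule would typically be expressed with `DataFrame.filter` on the `userId` column.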

Customer count free/paid distribution by gender(F/M) and Churn (cancelled)

Exploratory Data Analysis

We have defined churn as a visit to the Cancellation Confirmation page, and each user has been labeled as churned or not according to this rule.
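As a minimal sketch of this labeling rule, the plain-Python snippet below flags every user who has at least one `Cancellation Confirmation` event. The sample events and the function name `label_churn` are hypothetical; only the page name comes from the dataset.

```python
def label_churn(events):
    """Return {userId: 1 or 0}, where 1 means the user reached the cancellation page."""
    churn = {}
    for event in events:
        user = event["userId"]
        churn.setdefault(user, 0)  # every user starts as "not churned"
        if event["page"] == "Cancellation Confirmation":
            churn[user] = 1
    return churn

# hypothetical sample events, not taken from the real dataset
sample_events = [
    {"userId": "10", "page": "NextSong"},
    {"userId": "10", "page": "Cancellation Confirmation"},
    {"userId": "20", "page": "Thumbs Up"},
    {"userId": "20", "page": "NextSong"},
]

labels = label_churn(sample_events)
churn_rate = sum(labels.values()) / len(labels)  # share of users who churned
print(labels, churn_rate)
```

The churn rate is computed per user, not per event, which is how a figure such as the share of users who canceled should be read.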

23.11% of users have canceled the service, and their interactions with the pages visited are the following:

box plot of the number of pages visited in a session (1 = churn, 0 = not churn)

There is a small difference in behavior between users who have churned and those who have not, mainly for Thumbs Up, Thumbs Down, and Roll Advert; however, the difference is not statistically significant enough to draw a conclusion. The box plot was used to visualize the percentiles and the mean and to check the distribution.

To avoid overfitting the model, we have removed the pages related to account cancellation, since visiting them directly reveals the churn label.

The behavior of the users by session (duration in hours and number of songs played):

# event counts per day of week and hour; ts is in milliseconds, hence ts/1000
hour_data_Churn = spark.sql("""
    SELECT
        dayOfWeek(from_unixtime(ts/1000.00)) dayofweek,
        hour(from_unixtime(ts/1000.00)) hour,
        count(*) count
    FROM Sparkify_churn_view
    GROUP BY
        dayOfWeek(from_unixtime(ts/1000.00)),
        dayofmonth(from_unixtime(ts/1000.00)),
        hour(from_unixtime(ts/1000.00))
""")

hour_data_scaled = scale_data_by_Churn(hour_data_Churn,'dayofweek','count')
hour_data_scaled.plot.bar(rot=0,title = 'event count(scaled) by day of week', figsize=(5,5))
behavior of users by session

Using the userAgent information, we have also determined the composition of users by device, operating system, and browser. The largest differences appear by browser and by operating system (the data has been scaled for comparison):

Looking at the browser, more of the users who churned were using Firefox; looking at the operating system, more of them were using Ubuntu.
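How the browser and operating system can be derived from a userAgent string is sketched below with simple substring checks. This is an illustrative heuristic, not the project's extraction code: the function name `parse_user_agent`, the category names, and the sample strings (typical user agent formats, not rows from the dataset) are assumptions.

```python
def parse_user_agent(user_agent):
    """Guess (browser, os) from a userAgent string using substring checks."""
    ua = user_agent.lower()

    # order matters: Chrome user agents also contain the word "safari"
    if "firefox" in ua:
        browser = "Firefox"
    elif "chrome" in ua:
        browser = "Chrome"
    elif "trident" in ua or "msie" in ua:
        browser = "IE"
    elif "mobile" in ua and "safari" in ua:
        browser = "Mobile Safari"
    elif "safari" in ua:
        browser = "Safari"
    else:
        browser = "Other"

    # iOS user agents contain "like Mac OS X", so check iPhone/iPad first
    if "windows" in ua:
        os_name = "Windows"
    elif "iphone" in ua or "ipad" in ua:
        os_name = "iOS"
    elif "mac os x" in ua:
        os_name = "Mac OS X"
    elif "ubuntu" in ua:
        os_name = "Ubuntu"
    elif "linux" in ua:
        os_name = "Linux"
    else:
        os_name = "Other"
    return browser, os_name

# illustrative user agent strings
firefox_ubuntu = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"
chrome_windows = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36")
safari_iphone = ("Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 "
                 "(KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53")

print(parse_user_agent(firefox_ubuntu))
print(parse_user_agent(chrome_windows))
print(parse_user_agent(safari_iphone))
```

A dedicated user agent parsing library would be more robust than substring checks; the sketch only shows why the order of the checks matters.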

Feature Engineering

  1. Categorical Features: all categorical features have been encoded as 1/0 (dummy columns) so that they can be used in the model
  • gender
  • pages (Cancellation Confirmation and Cancel removed)
  • Browser (extracted from userAgent)
  • OS (extracted from userAgent)
example of categorical feature labeled

  2. Numerical Features: these values have been scaled to the range 0 to 1

  • mean of songs in a session
  • mean of session duration (in hours)
  • mean of events (records) in a session
  • days of use (behavior)
example of numerical features scaled
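Both transformations can be illustrated with a small plain-Python sketch: dummy (1/0) columns for a categorical feature and min-max scaling for a numerical one. The helper names `one_hot` and `min_max_scale` and the sample values are hypothetical, and the project's own Spark implementation may differ.

```python
def one_hot(values, prefix):
    """Encode a list of category labels as rows of 1/0 dummy columns."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{c.lower().replace(' ', '_')}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows

def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - low) / (high - low) for v in values]

# hypothetical per-user values
browsers = ["Firefox", "Chrome", "Mobile Safari", "Chrome"]
columns, rows = one_hot(browsers, "browser")
print(columns)  # ['browser_chrome', 'browser_firefox', 'browser_mobile_safari']
print(rows)     # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]

songs_by_session = [10.0, 25.0, 40.0]
print(min_max_scale(songs_by_session))  # [0.0, 0.5, 1.0]
```

Scaling every numerical feature to the same range keeps features with large raw values, such as event counts, from dominating features with small ones.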

All the data has been aggregated per user, giving one row per user.

The final columns are:

|-- userId: string (nullable = true)
|-- Churn: integer (nullable = true)
|-- level: integer (nullable = true)
|-- gender: integer (nullable = true)
|-- lenght_avg_scaled: double (nullable = true)
|-- day_of_week_1_scaled: double (nullable = true)
|-- day_of_week_2_scaled: double (nullable = true)
|-- day_of_week_3_scaled: double (nullable = true)
|-- day_of_week_4_scaled: double (nullable = true)
|-- day_of_week_5_scaled: double (nullable = true)
|-- day_of_week_6_scaled: double (nullable = true)
|-- day_of_week_7_scaled: double (nullable = true)
|-- songs_by_session_scaled: double (nullable = true)
|-- session_duration_scaled: double (nullable = true)
|-- event_count_by_session_scaled: double (nullable = true)
|-- total_sessions_scaled: double (nullable = true)
|-- browser_chrome: integer (nullable = true)
|-- browser_firefox: integer (nullable = true)
|-- browser_ie: integer (nullable = true)
|-- browser_mobile_safari: integer (nullable = true)
|-- browser_safari: integer (nullable = true)
|-- OS_linux: integer (nullable = true)
|-- OS_mac_os_x: integer (nullable = true)
|-- OS_ubuntu: integer (nullable = true)
|-- OS_windows: integer (nullable = true)
|-- OS_ios: integer (nullable = true)
|-- page_about_scaled: double (nullable = true)
|-- page_add_friend_scaled: double (nullable = true)
|-- page_add_to_playlist_scaled: double (nullable = true)
|-- page_downgrade_scaled: double (nullable = true)
|-- page_error_scaled: double (nullable = true)
|-- page_help_scaled: double (nullable = true)
|-- page_home_scaled: double (nullable = true)
|-- page_logout_scaled: double (nullable = true)
|-- page_nextsong_scaled: double (nullable = true)
|-- page_roll_advert_scaled: double (nullable = true)
|-- page_save_settings_scaled: double (nullable = true)
|-- page_settings_scaled: double (nullable = true)
|-- page_submit_downgrade_scaled: double (nullable = true)
|-- page_submit_upgrade_scaled: double (nullable = true)
|-- page_thumbs_down_scaled: double (nullable = true)
|-- page_thumbs_up_scaled: double (nullable = true)
|-- page_upgrade_scaled: double (nullable = true)

Model building

The data has been split into training and testing datasets, with 70% and 30% of the data, respectively. The classifiers used for this analysis are the following:

  • DecisionTreeClassifier
  • GBTClassifier
  • RandomForestClassifier
  • LinearSVC
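The 70/30 split can be sketched in plain Python as a seeded shuffle followed by a cut. The function name `split_train_test` and the seed value are hypothetical. Spark DataFrames offer `randomSplit` for the same purpose; it assigns rows at random, so the resulting proportions are only approximately 70/30.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows with a fixed seed and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# hypothetical user ids standing in for the 225 per-user feature rows
user_ids = list(range(225))
train_ids, test_ids = split_train_test(user_ids)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the split reproducible, so the four classifiers are compared on exactly the same training and testing users.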

For each model, the following operations are performed:

model = classifier.fit(df_train) #train the model
pred_test = model.transform(df_test) #predict the data
#evaluate the metrics

Model evaluation

We test the performance of the trained models and select the best one based on the F1-score.
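For reference, the F1-score is the harmonic mean of precision and recall. Written in terms of true positives (TP), false positives (FP), and false negatives (FN):

```latex
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
    = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
```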

obtained from https://en.wikipedia.org/wiki/F-score

The reason for using this metric is the imbalance in the class distribution of the dataset: only a small portion of users have churned, and the purpose of this analysis is to correctly identify the users who are likely to churn from Sparkify’s service.
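A small worked example shows why accuracy is misleading under this imbalance. The counts below are hypothetical (100 test users, 23 of them churned, loosely mirroring the churn rate reported earlier), and `f1_score` is an illustrative helper that computes F1 for the churn class only.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall); 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# a trivial model that always predicts "no churn" on the hypothetical test set
tp, fp, fn, tn = 0, 0, 23, 77
accuracy_trivial = (tp + tn) / (tp + fp + fn + tn)  # high, yet no churner is found
f1_trivial = f1_score(tp, fp, fn)

# a hypothetical model that finds 15 of the 23 churners with 5 false alarms
f1_model = f1_score(tp=15, fp=5, fn=8)

print(accuracy_trivial, f1_trivial, round(f1_model, 4))
```

The trivial model reaches 77% accuracy while identifying no churners at all, and its F1-score of 0 exposes that; the second model is rewarded for the churners it actually finds.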

result of the models


Machine learning and Spark allow processing large amounts of data; this is helpful for scalable analysis and for keeping track of the users. Once we have identified the users with possible churn behavior, a good business strategy to reduce churn is A/B testing to evaluate and increase user engagement. This starts with research into the current performance, followed by observing the users, formulating a hypothesis, and defining variations for two groups (one receiving the new incentive and a control group with the existing ones), and ends with an evaluation of the new behavior.

The F1-score was the metric to optimize, and the best result was obtained with Linear Support Vector Classification (LinearSVC), with a score of 0.8061. Despite its training time, once the model has been trained the prediction time for new users is not significant. The model can be run once a week to identify new users at risk of churning, and the tuning of the model and the features to extract can be reviewed on the same weekly schedule.

For future work, evaluating other time windows and extracting additional features could improve performance, while checking that the model is not overfitting. With more data, it will also be useful to split the data into training, validation, and test sets.

