Official account: Special House
Author: Peter
Editor: Peter
Hello, I'm Peter~
Many readers have asked me: are there any good examples of data analysis and data mining? The answer is, of course: they are all on Kaggle.
You just have to spend time studying them, and perhaps even entering competitions. Peter has no competition experience himself, but he often goes to Kaggle to learn the problem-solving ideas and methods of the top competitors.
To record these good methods and improve himself, Peter has decided to start a column: Kaggle case sharing.
Case analyses will be posted from time to time. The ideas come from people sharing online, especially the Top 1 solutions; Peter is mainly responsible for organizing the ideas and learning the techniques.
Today I am sharing a clustering case based on a supermarket customer segmentation dataset. The official page is here: supermarket
To make it easy to practice, you can also reply "supermarket" to the official account to receive this dataset.
What follows is based on the Top 1 ranked Notebook source code. Welcome to learn from it~
Import libraries
# Data processing
import numpy as np
import pandas as pd

# KMeans clustering
from sklearn.cluster import KMeans

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.express as px
import plotly.graph_objects as go

py.offline.init_notebook_mode(connected=True)
Data EDA
Import data
First we import the dataset:
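The read-in step appears as a screenshot in the original notebook; below is a minimal sketch, assuming the Kaggle CSV has been saved locally as Mall_Customers.csv (the file name/path is an assumption):

df = pd.read_csv("Mall_Customers.csv")   # assumed file name/path
df.head()                                # preview the first 5 rows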
We can see that the data has five fields: customer ID, gender, age, annual income, and spending score.
Data exploration
1. Data shape
df.shape

# result
(200, 5)
A total of 200 rows and 5 columns of data
2. Missing values
df.isnull().sum()

# result
CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
You can see that all fields are complete without missing values
3. Data type
df.dtypes

# result
CustomerID                 int64
Gender                    object
Age                        int64
Annual Income (k$)         int64
Spending Score (1-100)     int64
dtype: object
Among the field types, Gender is a string (object); the others are int64 numeric types.
4. Descriptive statistics
Descriptive statistics summarize the numerical fields: count, mean, standard deviation, minimum, maximum, quartiles, and so on.
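The output table appears as a screenshot in the original; it is presumably produced with the standard call:

df.describe()   # count, mean, std, min, quartiles, max for the numeric columns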
For the convenience of subsequent processing and plotting, two things are done:
# 1. Set the plotting style
plt.style.use("fivethirtyeight")

# 2. Take out the three key analysis fields
cols = df.columns[2:].tolist()
cols

# result
['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
Histograms of the 3 attributes
Check the histograms of Age, Annual Income (k$) and Spending Score (1-100) to observe the overall distributions:
# Plot
plt.figure(1, figsize=(15, 6))                    # canvas size
n = 0
for col in cols:
    n += 1                                        # subplot position
    plt.subplot(1, 3, n)                          # subplot
    plt.subplots_adjust(hspace=0.5, wspace=0.5)   # adjust spacing
    sns.distplot(df[col], bins=20)                # draw histogram
    plt.title(f'Distplot of {col}')               # title
plt.show()                                        # display the figure
Gender factor
Gender Statistics
Check the number of men and women in this dataset. Whether gender has an impact on the overall analysis will be considered in the follow-up.
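The counts themselves are easy to obtain; the original shows them as a chart, so the pie below is only an illustrative sketch, not the original code:

gender_counts = df["Gender"].value_counts()   # number of Female / Male customers
gender_counts

# Optional: show the same counts as a pie chart
px.pie(values=gender_counts.values, names=gender_counts.index, title="Gender Ratio")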
Data distribution by gender
sns.pairplot(df.drop(["CustomerID"], axis=1),
             hue="Gender",   # grouping field
             aspect=1.5)
plt.show()
From the pairwise distribution plots above, gender appears to have little effect on the other three fields.
Relationship between age and annual income by gender
plt.figure(1, figsize=(15, 6))   # figure size
for gender in ["Male", "Female"]:
    plt.scatter(x="Age", y="Annual Income (k$)",     # the two fields to analyze
                data=df[df["Gender"] == gender],     # data for one gender at a time
                s=200, alpha=0.5, label=gender)      # marker size, transparency, legend label

# Axis labels, title and legend
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
plt.title("Age vs Annual Income w.r.t Gender")
plt.legend()   # needed so the Male/Female labels actually appear
plt.show()
Relationship between annual income and spending score by gender
plt.figure(1, figsize=(15, 6))
for gender in ["Male", "Female"]:   # same pattern as above
    plt.scatter(x='Annual Income (k$)', y='Spending Score (1-100)',
                data=df[df["Gender"] == gender],
                s=200, alpha=0.5, label=gender)

plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title("Annual Income vs Spending Score w.r.t Gender")
plt.legend()
plt.show()
Data distribution by gender
Observe the data distribution with violin plots and swarm plots:
# Swarm plots: swarmplot
# Violin plots: violinplot
plt.figure(1, figsize=(15, 7))
n = 0
for col in cols:
    n += 1                                        # subplot order
    plt.subplot(1, 3, n)                          # n-th subplot
    plt.subplots_adjust(hspace=0.5, wspace=0.5)   # adjust spacing
    # Draw both plots for one column, grouped by Gender
    sns.violinplot(x=col, y="Gender", data=df, palette="vlag")
    sns.swarmplot(x=col, y="Gender", data=df)
    # Axis and title settings
    plt.ylabel("Gender" if n == 1 else '')
    plt.title("Violinplots & Swarmplots" if n == 2 else '')
plt.show()
The results are as follows:
- View the distribution of different fields under different genders
- Observe whether there are outliers or other anomalies
Attribute correlation analysis
This mainly looks at the pairwise regression relationships between attributes:
# Correlation analysis of these three attributes
cols = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
plt.figure(1, figsize=(15, 6))
n = 0
for x in cols:
    for y in cols:
        n += 1                                            # move to the next subplot each iteration
        plt.subplot(3, 3, n)                              # 3 x 3 grid, n-th plot
        plt.subplots_adjust(hspace=0.5, wspace=0.5)       # spacing between subplots
        sns.regplot(x=x, y=y, data=df, color="#AE213D")   # data and color
        plt.ylabel(y.split()[0] + " " + y.split()[1] if len(y.split()) > 1 else y)
plt.show()
The resulting figure:
The figure above shows two points:
- The main diagonal shows each attribute against itself, so it is a perfect linear relationship
- The other plots show pairs of attributes: the scatter of the data together with a fitted regression trend line
Clustering between two attributes
The principles and workflow of the clustering algorithm are not explained in detail here; some basic familiarity with K-Means is assumed.
K value selection
We determine the value of k by drawing an elbow plot of the data. For an explanation of the KMeans parameters, see:
1. Parameter interpretation from the official website: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
2. Chinese interpretation reference: https://blog.csdn.net/qq_34104548/article/details/79336584
df1 = df[['Age', 'Spending Score (1-100)']].iloc[:, :].values   # data to be fitted

inertia = []   # empty list to store the within-cluster sum of squared distances for each k
for k in range(1, 11):   # try k from 1 to 10
    algorithm = KMeans(n_clusters=k,        # number of clusters
                       init="k-means++",    # initialization method
                       n_init=10,           # number of random restarts
                       max_iter=300,        # maximum number of iterations
                       tol=0.0001,          # convergence tolerance
                       random_state=111,    # random seed
                       algorithm="full")    # options: "auto", "full", "elkan"
    algorithm.fit(df1)                      # fit the data
    inertia.append(algorithm.inertia_)      # sum of squared distances to the closest centroid
Plot the relationship between k and the inertia (within-cluster sum of squared distances):
plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 11), inertia, 'o')            # the same data is drawn twice with different markers
plt.plot(np.arange(1, 11), inertia, '-', alpha=0.5)
plt.xlabel("Choice of K")
plt.ylabel("Inertia")
plt.show()
From the elbow plot, k=4 looks appropriate, so k=4 is used for the actual fit.
Cluster modeling
algorithm = KMeans(n_clusters=4,        # k = 4
                   init="k-means++",
                   n_init=10,
                   max_iter=300,
                   tol=0.0001,
                   random_state=111,
                   algorithm="elkan")
algorithm.fit(df1)   # fit the data
After fitting the data, we get the cluster labels and the four centroids:
labels1 = algorithm.labels_              # cluster assignment of each sample (4 clusters)
centroids1 = algorithm.cluster_centers_  # final centroid positions
print("labels1:", labels1)
print("centroids1:", centroids1)
To visualize the clustering result on the original data, the official notebook does the following, which I personally find a little cumbersome:
Data consolidation:
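This step appears as a screenshot in the original; below is a sketch of what it likely contains, producing the xx, yy and Z used by the plotting code that follows (the mesh step h is an assumed value):

# Build a grid over the (Age, Spending Score) plane and predict the cluster of every grid point
h = 0.02   # mesh step size (assumed)
x_min, x_max = df1[:, 0].min() - 1, df1[:, 0].max() + 1
y_min, y_max = df1[:, 1].min() - 1, df1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])   # cluster label of every grid point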
Show classification effect:
plt.figure(1, figsize=(14, 5))
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z, interpolation="nearest",
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Pastel2,
           aspect='auto',
           origin='lower')

plt.scatter(x="Age", y='Spending Score (1-100)', data=df, c=labels1, s=200)
plt.scatter(x=centroids1[:, 0], y=centroids1[:, 1], s=300, c='red', alpha=0.5)

plt.xlabel("Age")
plt.ylabel("Spending Score (1-100)")
plt.show()
If it were me, what would I do? Pandas + Plotly handles this nicely:
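The merging step is shown as a screenshot in the original; a sketch of what it likely does (the name df3 and the columns "Spending Score(1-100)" and "Labels" are taken from the plotting call below):

df3 = df[["Age", "Spending Score (1-100)"]].copy()
df3.columns = ["Age", "Spending Score(1-100)"]   # column name as used in the plot below
df3["Labels"] = labels1                          # cluster label of each sample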
See the results of classification visualization:
px.scatter(df3,x="Age",y="Spending Score(1-100)",color="Labels",color_continuous_scale="rainbow")
The above clusters the data on Age and Spending Score (1-100). The official notebook also clusters on Annual Income (k$) and Spending Score (1-100) using the same method, as sketched below.
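That variant is not reproduced in this post; here is a sketch under the same settings, with k=5 to match the five clusters mentioned next (the names X2, km_income and labels_income are mine):

X2 = df[['Annual Income (k$)', 'Spending Score (1-100)']].values   # the two fields
km_income = KMeans(n_clusters=5, init="k-means++", n_init=10,
                   max_iter=300, tol=0.0001, random_state=111, algorithm="elkan")
km_income.fit(X2)
df["labels_income"] = km_income.labels_   # cluster label per customer
px.scatter(df, x="Annual Income (k$)", y="Spending Score (1-100)",
           color="labels_income", color_continuous_scale="rainbow")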
The result, with the data divided into five clusters, looks like this:
Clustering of three attributes
Cluster according to Age, Annual Income and Spending Score, and finally draw a three-dimensional graph.
K value selection
The method is the same, except that three fields are selected instead of the two used above:
X3 = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].iloc[:, :].values   # data for the 3 fields

inertia = []
for n in range(1, 11):
    algorithm = KMeans(n_clusters=n,
                       init='k-means++',
                       n_init=10,
                       max_iter=300,
                       tol=0.0001,
                       random_state=111,
                       algorithm='elkan')
    algorithm.fit(X3)                     # fit the data
    inertia.append(algorithm.inertia_)
Draw elbow diagram to determine k:
plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 11), inertia, 'o')
plt.plot(np.arange(1, 11), inertia, '-', alpha=0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
We finally choose k=6 to cluster
Model fitting
algorithm = KMeans(n_clusters=6,        # the chosen k
                   init="k-means++",
                   n_init=10,
                   max_iter=300,
                   tol=0.0001,
                   random_state=111,
                   algorithm="elkan")
algorithm.fit(X3)   # fit the three-field data defined above

labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_
print(labels2)
print(centroids2)
Get labels and centroids:
labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_
Plotting
For the 3-attribute clustering we use Plotly to draw a 3D scatter plot:
df["labels2"] = labels2 trace = go.Scatter3d( x=df["Age"], y= df['Spending Score (1-100)'], z= df['Annual Income (k$)'], mode='markers', marker = dict( color=df["labels2"], size=20, line=dict(color=df["labels2"],width=12), opacity=0.8 ) ) data = [trace] layout = go.Layout( margin=dict(l=0,r=0,b=0,t=0), title="six Clusters", scene=dict( xaxis=dict(title="Age"), yaxis = dict(title = 'Spending Score'), zaxis = dict(title = 'Annual Income') ) ) fig = go.Figure(data=data,layout=layout) fig.show()
The following is the final clustering effect: