# Plotly + pandas + sklearn: firing the first shot on Kaggle

Posted by buddhika2010 on Fri, 26 Nov 2021 15:59:17 +0100

Official account: Special House
Author: Peter
Editor: Peter

Hello, I'm Peter~

Many readers have asked me whether there are any good case studies for data analysis and data mining. There are, of course, and they are all on Kaggle.

You just have to spend time studying them, or even enter the competitions. Peter has no competition experience, but he often goes to Kaggle to learn the problem-solving ideas and methods of the experts who compete there.

To record those methods and improve his own skills, Peter has decided to start a column: Kaggle case sharing.

New case analyses will be added from time to time. The ideas come from experts on the Internet, especially the top-ranked notebooks; Peter is mainly responsible for organizing the ideas and learning the techniques.

Today's case is about clustering, using the supermarket customer segmentation dataset. The dataset is available at the official website address: supermarket

To make it easier to practice, you can reply "supermarket" to the official account to receive this dataset. What follows is the top-ranked Notebook source code. You are welcome to learn from it~

## Import library

```python
# Data processing
import numpy as np
import pandas as pd

# KMeans clustering
from sklearn.cluster import KMeans

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.express as px
import plotly.graph_objects as go

py.offline.init_notebook_mode(connected=True)
```

## Data EDA

### Import data

First we import the dataset. It has five attribute fields: customer ID, gender, age, annual income and spending score.
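The loading step itself is not shown in the text. A minimal sketch, assuming the data is a CSV file whose five columns are named as in the outputs later in this post; the file name `Mall_Customers.csv` and the helper `load_customers` are assumptions for illustration, not taken from the original notebook:

```python
import pandas as pd

# Column names as they appear in the outputs later in this post
EXPECTED_COLUMNS = [
    "CustomerID", "Gender", "Age",
    "Annual Income (k$)", "Spending Score (1-100)",
]


def load_customers(path_or_buffer):
    """Read the customer CSV and check that the five expected columns are present."""
    data = pd.read_csv(path_or_buffer)
    missing = [c for c in EXPECTED_COLUMNS if c not in data.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return data


# Hypothetical usage; the file name is an assumption:
# df = load_customers("Mall_Customers.csv")
# df.head()
```

Checking the column names up front makes a wrong or renamed file fail immediately instead of in the middle of a plot.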

### Data exploration

1. Data shape

```python
df.shape

# result
(200, 5)
```

The data has 200 rows and 5 columns in total.

2. Missing values

```python
df.isnull().sum()

# result
CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
```

All fields are complete, with no missing values.

3. Data type

```python
df.dtypes

# result
CustomerID                 int64
Gender                    object
Age                        int64
Annual Income (k$)         int64
Spending Score (1-100)     int64
dtype: object
```

Apart from Gender, which is a string (object), all fields are of the numeric type int64.

4. Descriptive statistics

Descriptive statistics are mainly used to view summary values of the numerical fields, such as the count, median, variance, maximum and quartiles. For the convenience of subsequent data processing and display, two things are set up first:

```python
# 1. Set the plotting style
plt.style.use("fivethirtyeight")

# 2. Take out the three key analysis fields
cols = df.columns[2:].tolist()
cols
# result
['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
```
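The summary statistics described above are not printed in the text. One way to get them is `DataFrame.describe()`, sketched here on a few made-up rows rather than the real dataset; with the real data the call would be `df[cols].describe()`:

```python
import pandas as pd

# Made-up rows for illustration only; they are NOT taken from the dataset
demo = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Annual Income (k$)": [15, 40, 60, 85],
    "Spending Score (1-100)": [10, 50, 50, 90],
})

# describe() returns count, mean, std, min, the quartiles (25%, 50%, 75%) and max
# for every numeric column
summary = demo.describe()
print(summary)

# The 50% row is the median of each column
print(summary.loc["50%"])
```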

## Histograms of the three attributes

Check the histograms of 'Age', 'Annual Income (k$)' and 'Spending Score (1-100)' to observe their overall distributions:

```python
# Plotting
plt.figure(1, figsize=(15, 6))  # canvas size
n = 0

for col in cols:
    n += 1  # subplot position
    plt.subplot(1, 3, n)  # n-th subplot
    # sns.distplot is deprecated in recent seaborn releases;
    # histplot with a KDE overlay is used here as the closest replacement
    sns.histplot(df[col], bins=20, kde=True, stat="density")  # draw histogram
    plt.title(f'Distplot of {col}')  # title
plt.show()  # display the figure
```

## Gender factor

### Gender Statistics

Check the number of men and women in this dataset; whether gender affects the overall analysis is considered in the sections that follow.

### Data distribution by gender

```python
sns.pairplot(df.drop(["CustomerID"], axis=1),
             hue="Gender",  # grouping field
             aspect=1.5)
plt.show()
```

From the pairwise distribution plots above, we observe that gender has little effect on the other three fields.
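The conclusion above is drawn by eye from the pair plot. A numeric cross-check is to count the customers per gender and compare the per-gender means of each field. A minimal sketch on made-up rows (not the real data); with the real data the same calls would be `df["Gender"].value_counts()` and `df.groupby("Gender")[cols].mean()`:

```python
import pandas as pd

# Made-up rows for illustration only; they are NOT taken from the dataset
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [20, 30, 40, 50],
    "Spending Score (1-100)": [40, 60, 60, 40],
})

# Number of customers per gender
counts = demo["Gender"].value_counts()

# Mean of each numeric field per gender; similar means suggest gender has little effect
group_means = demo.groupby("Gender")[["Age", "Spending Score (1-100)"]].mean()

print(counts)
print(group_means)
```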

### Relationship between age and annual income by gender

```python
plt.figure(1, figsize=(15, 6))  # figure size

for gender in ["Male", "Female"]:
    plt.scatter(x="Age", y="Annual Income (k$)",  # the two fields to compare
                data=df[df["Gender"] == gender],  # rows for one gender
                s=200, alpha=0.5, label=gender)  # marker size, transparency, legend label

# Axis labels, title and legend
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
plt.title("Age vs Annual Income w.r.t Gender")
plt.legend()
# Display the figure
plt.show()
```

### Relationship between annual income and spending score by gender

```python
plt.figure(1, figsize=(15, 6))

for gender in ["Male", "Female"]:  # same approach as above
    plt.scatter(x="Annual Income (k$)", y="Spending Score (1-100)",
                data=df[df["Gender"] == gender],
                s=200, alpha=0.5, label=gender)

plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.title("Annual Income vs Spending Score w.r.t Gender")
plt.legend()
plt.show()
```

### Data distribution by gender: violin and swarm plots

Observe the data distribution with violin plots and swarm plots:

```python
# Swarm plot: swarmplot
# Violin plot: violinplot

plt.figure(1, figsize=(15, 7))
n = 0

for col in cols:
    n += 1  # subplot order
    plt.subplot(1, 3, n)  # n-th subplot
    # Draw both plots for this column, grouped by Gender
    sns.violinplot(x=col, y="Gender", data=df, palette="vlag")
    sns.swarmplot(x=col, y="Gender", data=df)
    # Axis label and title settings
    plt.ylabel("Gender" if n == 1 else '')
    plt.title("Violinplots & Swarmplots" if n == 2 else '')

plt.show()
```

The resulting figure can be used to:

• View the distribution of each field for each gender
• Check whether there are outliers

## Attribute correlation analysis

Here we mainly look at the regression relationship between each pair of attributes:

```python
cols = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']  # the three attributes to correlate

plt.figure(1, figsize=(15, 6))
n = 0

for x in cols:
    for y in cols:
        n += 1  # n increases on every iteration, moving to the next subplot
        plt.subplot(3, 3, n)  # 3 x 3 grid, n-th plot
        plt.subplots_adjust(hspace=0.5, wspace=0.5)  # vertical and horizontal spacing between subplots
        sns.regplot(x=x, y=y, data=df, color="#AE213D")  # data and colour
        # Shorten multi-word column names to their first two words
        plt.ylabel(y.split()[0] + " " + y.split()[1] if len(y.split()) > 1 else y)

plt.show()
```

The figure above shows two things:

• The main diagonal plots each attribute against itself, so the points lie on a straight line
• The other plots show pairs of different attributes: a scatter of the data together with the fitted regression trend line
## Clustering between two attributes

The principle and process of the clustering algorithm are not explained in detail here; basic familiarity with it is assumed.

### K value selection

We determine the value of k by drawing an elbow diagram of the data. Reference material:

1. Parameter interpretation from the official website: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

2. Chinese interpretation reference: https://blog.csdn.net/qq_34104548/article/details/79336584
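The quantity plotted in an elbow diagram is the inertia: the sum of squared distances from each sample to its nearest cluster centre, which scikit-learn exposes as `inertia_`. A minimal sketch on synthetic points (not the customer data) that recomputes it by hand to show what the attribute means:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points for illustration only: two loose groups
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=1.0, size=(50, 2)),
    rng.normal(loc=(8, 8), scale=1.0, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=111).fit(points)

# Squared distance from every point to the centroid of its assigned cluster
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()

# The two values agree up to floating-point error
print(manual_inertia, model.inertia_)
```

Inertia always decreases as k grows, which is why the elbow (the point where the decrease slows down) is used rather than the minimum.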

```python
df1 = df[['Age', 'Spending Score (1-100)']].values  # data to be fitted
inertia = []  # empty list to store the inertia for each k

for k in range(1, 11):  # try k from 1 to 10
    algorithm = KMeans(n_clusters=k,  # k value
                       init="k-means++",  # initialization method
                       n_init=10,  # number of runs with different centroid seeds
                       max_iter=300,  # maximum number of iterations
                       tol=0.0001,  # convergence tolerance
                       random_state=111,  # random seed
                       algorithm="lloyd")  # called "full" in older scikit-learn; renamed "lloyd" in 1.1
    algorithm.fit(df1)  # fit the data
    inertia.append(algorithm.inertia_)  # sum of squared distances to the nearest centroid
```

Plot how the inertia (the sum of squared distances to the nearest centroid) changes with k:

```python
plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 11), inertia, 'o')  # the data is drawn twice: once as markers...
plt.plot(np.arange(1, 11), inertia, '-', alpha=0.5)  # ...and once as a connecting line

plt.xlabel("Choice of K")
plt.ylabel("Inertia")
plt.show()
```

From the elbow diagram, k=4 looks the most appropriate, so k=4 is used for the actual fit.
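Reading the elbow by eye is somewhat subjective. A complementary check, which is not used in the original notebook, is the silhouette score from `sklearn.metrics`, where higher is better. A minimal sketch on synthetic blobs (not the customer data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups, for illustration only
points, _ = make_blobs(n_samples=150,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0,
                       random_state=111)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=111).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On real data the elbow and the silhouette score do not always agree, so it is worth treating both as hints rather than as a definitive answer.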

### Cluster modeling

```python
algorithm = KMeans(n_clusters=4,  # k=4
                   init="k-means++",
                   n_init=10,
                   max_iter=300,
                   tol=0.0001,
                   random_state=111,
                   algorithm="elkan")
algorithm.fit(df1)  # fit the data
```

After fitting, we get the labels and the four centroids:

```python
labels1 = algorithm.labels_  # cluster assignment of each sample (4 clusters)
centroids1 = algorithm.cluster_centers_  # final centroid positions

print("labels1:", labels1)
print("centroids1:", centroids1)
```

To show the classification result on the original data, the original notebook does the following, which I personally find a little cumbersome: it first prepares the data, then draws the classification result:

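```python
# NOTE: the step that defines xx, yy and Z is missing from the text above.
# The grid below is an assumed reconstruction (the step size h is a guess):
# predict a cluster for every point of a mesh covering the feature plane.
h = 0.02
x_min, x_max = df1[:, 0].min() - 1, df1[:, 0].max() + 1
y_min, y_max = df1[:, 1].min() - 1, df1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])

plt.figure(1, figsize=(14, 5))
plt.clf()

Z = Z.reshape(xx.shape)

# Background: the cluster predicted for each grid point
plt.imshow(Z, interpolation="nearest",
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Pastel2,
           aspect='auto',
           origin='lower')

# Original samples, coloured by cluster label
plt.scatter(x="Age", y='Spending Score (1-100)', data=df, c=labels1, s=200)

# Centroids
plt.scatter(x=centroids1[:, 0], y=centroids1[:, 1], s=300, c='red', alpha=0.5)

plt.xlabel("Age")
plt.ylabel("Spending Score (1-100)")

plt.show()
```

What would I do instead? Pandas + Plotly, of course, which solves the problem neatly. Here is the classification result visualized: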

```python
df3 = df[["Age", "Spending Score (1-100)"]].assign(Labels=labels1)  # assumed: df3 is not defined in the text above
px.scatter(df3, x="Age", y="Spending Score (1-100)", color="Labels", color_continuous_scale="rainbow")
```

The process above clusters on Age and Spending Score (1-100). The original notebook also clusters on Annual Income (k$) and Spending Score (1-100) using the same method.

In that case the data is divided into five clusters.

## Clustering of three attributes

Cluster according to Age, Annual Income and Spending Score, and finally draw a three-dimensional graph.

### K value selection

The method is the same, except that three fields are selected instead of the two used above:

```python
X3 = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].values  # select the data for 3 fields
inertia = []
for n in range(1, 11):
    algorithm = KMeans(n_clusters=n,
                       init='k-means++',
                       n_init=10,
                       max_iter=300,
                       tol=0.0001,
                       random_state=111,
                       algorithm='elkan')
    algorithm.fit(X3)  # fit the data
    inertia.append(algorithm.inertia_)
```

Draw the elbow diagram to determine k:

```python
plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 11), inertia, 'o')
plt.plot(np.arange(1, 11), inertia, '-', alpha=0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
```

We finally choose k=6 for clustering.

### Model fitting

```python
algorithm = KMeans(n_clusters=6,  # the chosen k value
                   init="k-means++",
                   n_init=10,
                   max_iter=300,
                   tol=0.0001,
                   random_state=111,
                   algorithm="elkan")
algorithm.fit(X3)  # X3 holds the three fields selected above

labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_

print(labels2)
print(centroids2)
```

Get labels and centroids:

```python
labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_
```

### Plotting

For the 3D clustering, we use Plotly to display the result:

```python
df["labels2"] = labels2

trace = go.Scatter3d(
    x=df["Age"],
    y=df['Spending Score (1-100)'],
    z=df['Annual Income (k$)'],
    mode='markers',
    marker=dict(
        color=df["labels2"],  # colour each point by its cluster label
        size=20,
        line=dict(color=df["labels2"], width=12),
        opacity=0.8
    )
)

data = [trace]
layout = go.Layout(
    margin=dict(l=0, r=0, b=0, t=0),
    title="Six Clusters",
    scene=dict(
        xaxis=dict(title="Age"),
        yaxis=dict(title='Spending Score'),
        zaxis=dict(title='Annual Income')
    )
)

fig = go.Figure(data=data, layout=layout)

fig.show()
```

The following is the final clustering effect:  