1 Source data
In this example, we use data on the words that commonly appear in teenagers' social-network profiles to carry out a market-segmentation study.
The method used is K-Means clustering; its underlying principle is not covered here.
2 Data exploration and preprocessing
# Data preprocessing
import pandas as pd

teenager_sns = pd.read_csv('teenager_sns.csv')
# View the last 20 rows of data
teenager_sns.tail(20)
# 1.1) Observe the data: are there any missing values?
teenager_sns.info()

# 2.1) Count the number of samples with missing gender values
teenager_sns["gender"].value_counts(dropna=False)

# 2.2) Count the missing values of age and describe the age variable
print(f'Number of missing values for the age variable: {teenager_sns["age"].isnull().sum()}')
teenager_sns["age"].describe()
Set unreasonable age values (outliers) to NaN
import numpy as np

def tag_nan(value):
    # Teenagers are 13 to 19 years old; values outside this range are set to NaN
    if (value >= 13) and (value < 20):
        return value
    else:
        return np.nan

# map() applies the function element-wise
teenager_sns["age"] = teenager_sns["age"].map(tag_nan)
teenager_sns["age"].describe()
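As a hedged alternative sketch (not part of the original steps), the same rule can be written in vectorized form with pandas between() and where():

# Hedged sketch: vectorized version of the same 13 <= age < 20 rule,
# equivalent to the map() above (inclusive="left" needs pandas >= 1.3)
teenager_sns["age"] = teenager_sns["age"].where(
    teenager_sns["age"].between(13, 20, inclusive="left"))
teenager_sns["age"].describe()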
2.1 Handling missing values of categorical variables with dummy coding
In this dataset, too many gender values are missing, so we add a new "unknown" category and then one-hot encode the resulting three values.
# replace('NaN', ...) does not work here because the missing values are actual NaN,
# not the string 'NaN'; use fillna() instead
# teenager_sns["gender"] = teenager_sns["gender"].replace('NaN', 'unknown')
teenager_sns["gender"] = teenager_sns["gender"].fillna('unknown')
teenager_sns["gender"].value_counts()

# Dummy-variable (one-hot) encoding:
# get_dummies(column to convert, prefix for the new column names)
gender_dummies = pd.get_dummies(teenager_sns["gender"], prefix="gender")
gender_dummies.head(5)
The gender variable has been converted into three binary variables: gender_F, gender_M and gender_unknown. Now merge them into the original data.
# Use pandas concat() to splice the teenager_sns and gender_dummies data frames horizontally
teenager_sns = pd.concat([teenager_sns, gender_dummies], axis=1)
teenager_sns.head(5)
2.2 Handling missing values of numerical variables by imputation
For the missing age values, we can impute them with a single value. Commonly used imputation values include a fixed value, the mean and the median. Here we use the mean.
age_mean = teenager_sns["age"].mean()
teenager_sns["age_avg_imputated"] = teenager_sns["age"].fillna(age_mean)
teenager_sns.head(10)
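The median mentioned above works the same way; here is a minimal sketch (the column name age_median_imputated is made up for illustration and not used later):

# Hedged sketch: median imputation as an alternative (not used in the rest of this example)
age_median = teenager_sns["age"].median()
teenager_sns["age_median_imputated"] = teenager_sns["age"].fillna(age_median)
teenager_sns[["age", "age_avg_imputated", "age_median_imputated"]].describe()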
2.3 Data standardization
Because the K-means clustering algorithm needs to compute distances between samples, we need to standardize the data before building the model. Commonly used methods include min-max scaling and z-score standardization; in this example we use z-score standardization directly.
from sklearn import preprocessing

filtered_columns = ["gradyear", "friends", "basketball", "football", "soccer", "softball",
                    "volleyball", "swimming", "cheerleading", "baseball", "tennis", "sports",
                    "cute", "sex", "sexy", "hot", "kissed", "dance", "band", "marching",
                    "music", "rock", "god", "church", "jesus", "bible", "hair", "dress",
                    "blonde", "mall", "shopping", "clothes", "hollister", "abercrombie",
                    "die", "death", "drunk", "drugs",
                    "gender_M", "gender_F", "gender_unknown", "age_avg_imputated"]

# teenager_sns[filtered_columns].values        -> NumPy array
# .astype('float32')                           -> cast the element type
# preprocessing.scale(...)                     -> z-score standardization
# pd.DataFrame(...)                            -> back to a DataFrame
teenager_sns_zscore = pd.DataFrame(
    preprocessing.scale(teenager_sns[filtered_columns].values.astype('float32')))
teenager_sns_zscore.columns = filtered_columns  # restore the column names
teenager_sns_zscore.head(5)
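For comparison, a rough sketch of the min-max scaling mentioned above (not used in the rest of this example), assuming sklearn's MinMaxScaler:

# Hedged sketch: min-max scaling as an alternative to z-score standardization
from sklearn.preprocessing import MinMaxScaler

teenager_sns_minmax = pd.DataFrame(
    MinMaxScaler().fit_transform(teenager_sns[filtered_columns].values.astype('float32')),
    columns=filtered_columns)
teenager_sns_minmax.head(5)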
3 Model training
In order to segment our youth data, we use the KMeans class from the sklearn.cluster package. The n_clusters parameter of KMeans is the number of clusters; in this case we set the number of market segments, n_clusters, to 5.
from sklearn.cluster import KMeans

# Parameters such as n_jobs and precompute_distances were removed in newer
# scikit-learn versions, so only the essential settings are kept here
teenager_cluster_model = KMeans(n_clusters=5, init='k-means++', n_init=10,
                                max_iter=300, tol=0.0001, random_state=42)
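Setting n_clusters to 5 is a modelling choice here; as a hedged sketch (not part of the original workflow), the within-cluster sum of squares (inertia_) for several values of k could be compared to sanity-check it:

# Hedged sketch: compare inertia_ for several candidate numbers of clusters
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(teenager_sns_zscore)
    print(k, km.inertia_)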
4 Analysis of clustering results
teenager_cluster_model = teenager_cluster_model.fit(teenager_sns_zscore)
teenager_clusters = pd.DataFrame({'class': teenager_cluster_model.predict(teenager_sns_zscore)})
teenager_clusters['class'].value_counts().sort_index()
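Because the model was fitted on the same data, the labels_ attribute holds the same assignments as predict(); as a small sketch, the labels could also be attached back to the original data for later inspection (the column name cluster is made up for illustration):

# Hedged sketch: attach the cluster labels to the original (unscaled) data
teenager_sns["cluster"] = teenager_cluster_model.labels_
teenager_sns["cluster"].value_counts().sort_index()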
In order to better understand the characteristics of the adolescent group represented by each class, we look at the cluster center of each class. The cluster centers are saved in the cluster_centers_ attribute of teenager_cluster_model.
centers = pd.DataFrame(teenager_cluster_model.cluster_centers_, columns=filtered_columns)
centers.head()
Because the data has been standardized with the z-score method, we can interpret each cluster center directly from its values on each variable: if a variable's value in a cluster center is greater than 0, that cluster is above the overall average on that variable.
First, transpose the data frame of cluster centers, then sort the variable values of each cluster center in descending order. We characterize the group each cluster represents by looking at its top 10 variables (only one cluster is analyzed here):
centers_t = centers.T
centers_t.columns = ["cluster_0", "cluster_1", "cluster_2", "cluster_3", "cluster_4"]
centers_t["cluster_0"].sort_values(ascending=False).head(10)
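To repeat the same analysis for the remaining clusters, a simple sketch loops over all columns of centers_t:

# Sketch: top 10 variables of every cluster, not only cluster_0
for col in centers_t.columns:
    print(col)
    print(centers_t[col].sort_values(ascending=False).head(10), "\n")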