Data science guidance on computer

Posted by Jedi Legend on Tue, 14 Dec 2021 04:45:16 +0100

1 Source data

This time we use data on the common words found on teenagers' social networks to carry out market research.

The method used is K-Means clustering; its underlying principle is not covered here.
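
Although the principle is not covered in this post, the core loop is short. The following is a minimal NumPy sketch of the standard K-Means iteration (assign each sample to its nearest center, then move each center to the mean of its samples). It is not the scikit-learn implementation used later; the function name kmeans_sketch and the toy data are assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy K-Means: alternate between assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every sample to every center, shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned samples
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```

Because every step relies on distances between samples and centers, the scale of each variable matters, which is why the data is standardized before clustering.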

2 Data exploration and preprocessing

# Data preprocessing
import pandas as pd
teenager_sns = pd.read_csv('teenager_sns.csv')
# View the last 20 rows of data
teenager_sns.tail(20)
# 1.1) Inspect the data and check whether any columns have missing values
teenager_sns.info()

# 2.1) Count the samples per gender value, including missing values (NaN)
teenager_sns["gender"].value_counts(dropna=False)

# 2.2) Count the samples with a missing age and summarize the age variable
print(f'Number of missing values for variable age: {teenager_sns["age"].isnull().sum()}')
teenager_sns["age"].describe()

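Set unreasonable age values (outliers) to NaN

import numpy as np

def tag_nan(value):
    # Keep ages in [13, 20); anything else, including NaN, becomes NaN
    if 13 <= value < 20:
        return value
    else:
        return np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
# Teenagers are taken to be 13 to 19 years old (13 <= age < 20); values outside this range are set to NaN
# Apply the function to every element with map
teenager_sns["age"] = teenager_sns["age"].map(tag_nan)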

teenager_sns["age"].describe()
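
The same rule can also be written without a Python-level function. The sketch below uses Series.where, which keeps the values where the condition holds and writes NaN elsewhere. The ages are made-up values for illustration, not taken from teenager_sns.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including outliers and a missing value
age = pd.Series([12.5, 13.0, 17.4, 19.99, 20.0, 106.9, np.nan])

# Keep ages in [13, 20); everything else (including NaN) becomes NaN
age_clean = age.where((age >= 13) & (age < 20))

print(age_clean.tolist())
```

On a large column this is usually faster than map with a custom function, because the comparison runs on the whole Series at once.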

2.1 Handling missing values of categorical variables through dummy coding

In this data there are too many missing gender values, so a new category for unknown gender is added, and the three values are then one-hot encoded.

# Replace the missing values with "unkown"
# teenager_sns["gender"] = teenager_sns["gender"].replace('NaN', 'unkown')
# replace('NaN', ...) has no effect here: missing values are NaN, not the string 'NaN'. Use fillna() instead
teenager_sns["gender"] = teenager_sns["gender"].fillna('unkown')
teenager_sns["gender"].value_counts()
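
To see why the commented-out replace() call changes nothing, the sketch below runs both calls on a small made-up Series (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical gender column with two missing entries
gender = pd.Series(["F", "M", np.nan, "F", np.nan])

# Missing entries are NaN, not the string 'NaN', so nothing matches
replaced = gender.replace("NaN", "unkown")

# fillna targets the missing entries directly
filled = gender.fillna("unkown")

print(replaced.isna().sum())  # still 2 missing values
print(filled.value_counts())
```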

# One-hot (dummy) encoding of the gender variable
gender_dummies = pd.get_dummies(teenager_sns["gender"], prefix="gender")
# Arguments: the column to convert into dummy variables, and the prefix for the new column names
gender_dummies.head(5)

The gender variable has been converted into three binary variables: gender_F, gender_M and gender_unkown. Now merge them into the original data.

# Use the pandas concat() function to join the teenager_sns and gender_dummies data frames horizontally (axis=1)
teenager_sns = pd.concat([teenager_sns,gender_dummies],axis=1)

teenager_sns.head(5)

2.2 Handling missing values of numerical variables through imputation

For the missing age values, we can fill each gap with a single substitute value. Commonly used substitutes include a given constant, the mean, and the median. Here we use the mean.

age_mean = teenager_sns["age"].mean()
teenager_sns["age_avg_imputated"] = teenager_sns["age"].fillna(age_mean)
teenager_sns.head(10)
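
To make the three options concrete, the sketch below fills the same made-up Series with a given value, the mean, and the median. The ages and the constant 17.0 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
age = pd.Series([14.0, 15.0, 16.0, 19.0, np.nan, np.nan])

filled_const = age.fillna(17.0)           # a given value
filled_mean = age.fillna(age.mean())      # mean of the observed ages
filled_median = age.fillna(age.median())  # median of the observed ages

print(pd.DataFrame({"const": filled_const,
                    "mean": filled_mean,
                    "median": filled_median}))
```

mean() and median() skip NaN by default, so both statistics are computed from the observed ages only. The median is less affected by a few extreme values than the mean.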

2.3 Data standardization

Because the K-Means algorithm computes distances between samples, variables on larger scales would dominate those distances, so we standardize the data before building the model. Commonly used methods include min-max standardization and z-score standardization. In this example we use z-score standardization directly.

from sklearn import preprocessing
filtered_columns = ["gradyear","friends","basketball",
                     "football","soccer","softball","volleyball","swimming",
                     "cheerleading","baseball","tennis","sports","cute","sex",
                     "sexy","hot","kissed","dance","band","marching","music",
                     "rock","god","church",
                      "jesus","bible","hair","dress","blonde","mall","shopping","clothes",
                     "hollister","abercrombie","die","death","drunk","drugs",
                     "gender_M","gender_F","gender_unkown","age_avg_imputated"]

# Step by step:
#   teenager_sns[filtered_columns].values  -> NumPy array of the selected columns
#   .astype('float32')                     -> cast every element to float32
#   preprocessing.scale(...)               -> z-score standardization of each column
#   pd.DataFrame(...)                      -> wrap the result in a DataFrame again
teenager_sns_zscore = pd.DataFrame(preprocessing.scale(teenager_sns[filtered_columns].values.astype('float32')))
# Restore the column names
teenager_sns_zscore.columns = filtered_columns
teenager_sns_zscore.head(5)
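
For reference, both standardization methods mentioned above can be written in a few lines of NumPy. This is a sketch on a made-up array, not the code used in this post. It uses the population standard deviation (NumPy's default, ddof=0), which should match what preprocessing.scale does.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# z-score: subtract the column mean, divide by the column standard deviation
z = (X - X.mean(axis=0)) / X.std(axis=0)

# min-max: rescale each column to the range [0, 1]
mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(z)
print(mm)
```

After z-score standardization every column has mean 0 and standard deviation 1, so no single variable dominates the distance calculation.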

3 Model training

To segment the teenager data, we use the KMeans class from the sklearn.cluster package. The n_clusters parameter of KMeans is the number of clusters. In this case, we set the number of market segments, n_clusters, to 5.

from sklearn.cluster import KMeans
# The n_jobs, precompute_distances and algorithm='auto' arguments of older
# scikit-learn versions are no longer accepted by recent releases, so they are omitted
teenager_cluster_model = KMeans(n_clusters=5, init='k-means++', n_init=10,
     max_iter=300, tol=0.0001, copy_x=True, random_state=42, verbose=0)

4 Analysis of clustering results

teenager_cluster_model = teenager_cluster_model.fit(teenager_sns_zscore)
teenager_clusters = pd.DataFrame({'class':teenager_cluster_model.predict(teenager_sns_zscore)})
teenager_clusters['class'].value_counts().sort_index()

To better understand the characteristics of the teenage group represented by each class, we look at the cluster center of each class. The cluster centers are stored in the cluster_centers_ attribute of teenager_cluster_model.

centers = pd.DataFrame(teenager_cluster_model.cluster_centers_,columns = filtered_columns)
centers.head()

Because the data has been standardized with the z-score method, every variable has an overall mean of 0, so we can read the meaning of each cluster center directly from its value on each variable. If a cluster center's value on a variable is greater than 0, the members of that cluster score above the overall average on that variable.
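
The reading above can be checked on a toy table. In the sketch below the values and cluster labels are made up (each column has mean 0, as z-scored data would). Each cluster center is computed as the mean of its members, and a positive coordinate marks a variable on which the cluster is above the overall average.

```python
import pandas as pd

# Hypothetical standardized values with cluster labels already assigned
df = pd.DataFrame({
    "football": [1.2, 0.8, -1.0, -1.0],
    "shopping": [-0.9, -1.1, 1.0, 1.0],
    "class":    [0, 0, 1, 1],
})

# For K-Means, each center is the mean of the samples in its cluster
toy_centers = df.groupby("class").mean()
print(toy_centers)

# Variables on which each cluster sits above the overall average (0)
above_avg = {
    int(c): toy_centers.columns[(toy_centers.loc[c] > 0).to_numpy()].tolist()
    for c in toy_centers.index
}
print(above_avg)
```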

First, the data frame of cluster centers above is transposed, and then the variable values of each cluster center are sorted from largest to smallest. We characterize the group represented by each cluster by looking at its top 10 variables (only one class is analyzed here):

centers_t = centers.T
centers_t.columns = ["cluster_0","cluster_1","cluster_2","cluster_3","cluster_4"]
centers_t["cluster_0"].sort_values(ascending = False, inplace = False).head(10)
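
The same ranking can be repeated for every cluster in one pass. The sketch below uses a small made-up centers_t table (the variable names and values are assumptions) and collects the top variables of each cluster into a dictionary.

```python
import pandas as pd

# Hypothetical transposed centers: rows are variables, columns are clusters
centers_t = pd.DataFrame(
    {"cluster_0": [1.5, -0.2, 0.3], "cluster_1": [-0.4, 2.0, 0.1]},
    index=["football", "shopping", "music"],
)

top_n = 2  # the post uses 10; 2 is enough for this toy table
top_vars = {
    name: centers_t[name].sort_values(ascending=False).head(top_n).index.tolist()
    for name in centers_t.columns
}
print(top_vars)
```

With the real centers_t from this post, setting top_n = 10 would list the ten most characteristic variables of all five clusters.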
