The first two posts covered scraping Python data analyst job postings from Lagou (https://www.cnblogs.com/lyuzt/p/10636501.html) and visually analyzing the scraped data (https://www.cnblogs.com/lyuzt/p/10643941.html). This time we use sklearn to build a simple salary prediction for Python data analysts with different education levels and amounts of work experience. Since the previous two posts already gave an overview of the dataset, we can get straight to the topic.
1. Salary conversion
Before anything else, import the modules and read in the files: not only the training data file, but also a small self-built test data file.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

train_file = "analyst.csv"
test_file = "test.csv"

# read the files to get the data
train_data = pd.read_csv(train_file, encoding="gbk")
train_data = train_data.drop('ID', axis=1)
test_data = pd.read_csv(test_file, encoding="gbk")

train_data.shape, test_data.shape
For better analysis, we need to preprocess the salary column. Its values are widely scattered, and many of them occur only once. To avoid introducing too much error, we bin salaries according to their distribution into [below 5k, 5k-10k, 10k-20k, 20k-30k, 30k-40k, above 40k]. To make this easier, we take the midpoint of each posted salary range and map it into one of these bins.
salarys = train_data['salary'].unique()  # get the distinct salary values

for salary in salarys:
    # split on '-', strip the trailing 'k', and convert both ends to integers
    min_sa = int(salary.split('-')[0][:-1])
    max_sa = int(salary.split('-')[1][:-1])
    # midpoint of the posted range
    median_sa = (min_sa + max_sa) / 2
    # map the midpoint into one of the specified bins
    if median_sa < 5:
        train_data.replace(salary, 'below 5k', inplace=True)
    elif median_sa < 10:
        train_data.replace(salary, '5k-10k', inplace=True)
    elif median_sa < 20:
        train_data.replace(salary, '10k-20k', inplace=True)
    elif median_sa < 30:
        train_data.replace(salary, '20k-30k', inplace=True)
    elif median_sa < 40:
        train_data.replace(salary, '30k-40k', inplace=True)
    else:
        train_data.replace(salary, 'above 40k', inplace=True)
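As a quick sanity check (not in the original post), we can look at how the rows now distribute across the bins:

# each bin should now hold a reasonable number of rows
train_data['salary'].value_counts()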
Once the conversion is complete, we can pop the 'salary' column off the training set to use as the label:
y_train = train_data.pop('salary').values
2. Converting variables
Converting categorical variables into a numeric representation
Since these variables are not numeric, the model cannot work with them directly during training, so they need to be converted. When we use numbers to encode categories, note that numbers carry their own meaning (order and magnitude), so assigning them carelessly can mislead the model later. Instead, we can use One-Hot encoding to represent the categories.
The get_dummies method built into pandas does One-Hot encoding in one call. Here is how I understand One-Hot: suppose data['education'] takes the values 'junior college', 'bachelor', 'master', and 'unspecified'. A row where data['education'] == 'bachelor' can then be expressed as the dictionary {'junior college': 0, 'bachelor': 1, 'master': 0, 'unspecified': 0}, or as the vector [0, 1, 0, 0].
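To make that concrete, here is a tiny toy sketch (column name and values invented for illustration) of what get_dummies produces:

toy = pd.DataFrame({'education': ['junior college', 'bachelor', 'master', 'unspecified']})
# each row becomes a 0/1 vector with a single 1 in its category's column;
# note the dummy columns come out sorted alphabetically (education_bachelor first),
# so the ordering may differ from the vector written in the prose above
pd.get_dummies(toy)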
Before encoding, it is easiest to concatenate the test set onto the training set, so that both are encoded with the same dummy columns.
data = pd.concat((train_data, test_data), axis=0)
dummied_data = pd.get_dummies(data)
dummied_data.head()
To give a better sense of the One-Hot result, the processed data is shown in the figure:
Of course, there are other ways to do this, such as mapping each distinct value to an integer (label encoding).
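For instance, a minimal sketch of that alternative using pandas' factorize (purely illustrative; the column name 'education' is assumed, and this post sticks with One-Hot):

# map each distinct value to an integer code
codes, uniques = pd.factorize(data['education'])
# codes is an integer array like [0, 1, 0, 2, ...];
# uniques holds the category corresponding to each code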
From the visual analysis last time, we already know the dataset has no missing values, but to follow the usual process and be safe, we check again:
dummied_data.isnull().sum().sort_values(ascending=False).head(10)
OK, good: no missing values. These features are simple and need no further work, so we just split the combined data back into a training set and a test set.
X_train = dummied_data[:train_data.shape[0]].values
X_test = dummied_data[-test_data.shape[0]:].values
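A quick shape check (a small addition) confirms the split lines up with the original files:

# row counts should match the original train and test data
X_train.shape, X_test.shape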
3. Selection of parameters
1. Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

features_scores = []
max_features = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
for max_feature in max_features:
    clf = DecisionTreeClassifier(max_features=max_feature)
    features_score = cross_val_score(clf, X_train, y_train, cv=5)  # 5-fold CV scores
    features_scores.append(np.mean(features_score))

plt.plot(max_features, features_scores)
This step uses cross-validation to pick a parameter value that makes the model better. Roughly, cross-validation splits the training set into several folds, uses each fold in turn as the validation set while training on the rest, and averages the scores over all rounds. Emmm... that is still a bit hand-wavy; you can look up the details online, or see the sketch below.
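For the curious, here is a rough sketch of what cross_val_score does internally, written out by hand with plain KFold (note: for classifiers, cross_val_score actually defaults to stratified folds; this is just an illustration):

from sklearn.model_selection import KFold

scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_train):
    clf = DecisionTreeClassifier(max_features=0.2)
    clf.fit(X_train[train_idx], y_train[train_idx])
    # accuracy on the held-out fold
    scores.append(clf.score(X_train[val_idx], y_train[val_idx]))
np.mean(scores)  # the averaged score, as cross_val_score reports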
The parameters and values we get are shown in the figure:
As the plot shows, the score peaks at max_features = 0.2, at about 0.5418.
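Rather than eyeballing the plot, the best value can also be read off programmatically (a small addition):

best_idx = int(np.argmax(features_scores))
max_features[best_idx], features_scores[best_idx]  # should give 0.2 and roughly 0.5418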
2. Ensemble algorithms
Ensemble learning simply means combining the predictions of multiple classifiers to improve the overall model's ability to generalize. Here we use sklearn's AdaBoostClassifier (adaptive boosting), which learns a sequence of classifiers by reweighting the training samples after each round and then combines them linearly to improve generalization performance.
from sklearn.ensemble import AdaBoostClassifier

# base learner: the decision tree with the max_features value tuned above
# (note: in sklearn >= 1.2 this argument is named `estimator` rather than `base_estimator`)
dtc = DecisionTreeClassifier(max_features=0.2)

n_scores = []
estimator_nums = [5, 10, 15, 20, 25, 30, 35, 40]
for estimator_num in estimator_nums:
    clf = AdaBoostClassifier(n_estimators=estimator_num, base_estimator=dtc)
    n_score = cross_val_score(clf, X_train, y_train, cv=5)
    n_scores.append(np.mean(n_score))

plt.plot(estimator_nums, n_scores)
The score is highest at n_estimators = 20, at about 0.544. That is not much better than the single decision tree's score, but it is higher overall.
4. Modeling
Once you have selected the parameters, you are ready to build your model.
dtc = DecisionTreeClassifier(max_features=0.2)
abc = AdaBoostClassifier(n_estimators=20, base_estimator=dtc)  # same base learner as during tuning
# train
abc.fit(X_train, y_train)
dtc.fit(X_train, y_train)
# Forecast
y_dtc = dtc.predict(X_test)
y_abc = abc.predict(X_test)
test_data['salary(Single Decision Tree)'] = y_dtc
test_data['salary(boosting)'] = y_abc
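Since the self-built test set has no ground-truth salaries, the predictions above cannot be scored directly. As a sanity check (my addition, not part of the original post), one can hold out part of the labeled training data and measure accuracy on it with fresh copies of the two models:

from sklearn.model_selection import train_test_split

# hold out 20% of the labeled data purely for evaluation
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

for name, model in [('single decision tree', DecisionTreeClassifier(max_features=0.2)),
                    ('boosting', AdaBoostClassifier(n_estimators=20,
                                                    base_estimator=DecisionTreeClassifier(max_features=0.2)))]:
    model.fit(X_tr, y_tr)
    print(name, model.score(X_val, y_val))  # mean accuracy on the held-out 20%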
As for the results: perfect prediction is impossible, different models will give different answers, and whether the predictions match common sense is open to debate, so just treat this as a small practice project. The full code is here: https://github.com/MaxLyu/Lagou_Analyze