This article walks through data mining on a bike-sharing dataset in detail, covering data analysis and model development. It consists of the following steps:

Bike-sharing data mining
Dataset introduction
About the bike-sharing dataset
Bike-sharing systems are a new generation of traditional bicycle rentals in which the whole process, from membership registration through rental and return, is automated. Through these systems, users can easily rent a bicycle at one location and return it at another. There are currently more than 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. Because of the important role these systems play in traffic, environmental, and health issues, interest in them is high today.
Beyond their interesting real-world applications, the data generated by bike-sharing systems attract many researchers. Unlike other transport services (such as buses or subways), the duration, departure time, and arrival location of each shared-bicycle trip are explicitly recorded in the system. This turns a bike-sharing system into a virtual sensor network that can be used to sense mobility in a city, so by monitoring these data, most important events in the city can be expected to be detectable.
In this article we use these datasets to mine the useful information they contain, moving step by step from exploring the data attributes and cleaning the data to developing models.
Note that this is an overseas bike-sharing dataset rather than a domestic one, but that does not affect learning the relevant data-mining knowledge and techniques.
Attribute information
Both hour.csv and day.csv contain the following fields; the only difference is that day.csv has no hr field.
- instant: record index
- dteday: date
- season: season (1: spring, 2: summer, 3: autumn, 4: winter)
- yr: year (0: 2011, 1: 2012)
- mnth: month (1 to 12)
- hr: hour (0 to 23)
- holiday: whether the day is a holiday
- weekday: day of the week
- workingday: working day; 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit:
    - 1: clear, few clouds, partly cloudy
    - 2: mist + cloudy, mist + broken clouds, mist + few clouds, mist
    - 3: light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds
    - 4: heavy rain + ice pellets + thunderstorm + mist, snow + fog
- temp: normalized temperature in degrees Celsius; values are computed as (t - t_min)/(t_max - t_min) with t_min = -8, t_max = +39 (hourly scale only); see the snippet after this list
- atemp: normalized feeling temperature in degrees Celsius; values are computed as (t - t_min)/(t_max - t_min) with t_min = -16, t_max = +50 (hourly scale only)
- hum: normalized humidity; the values are divided by 100 (the maximum)
- windspeed: normalized wind speed; the values are divided by 67 (the maximum)
- casual: count of casual (unregistered) users
- registered: count of registered users
- cnt: total count of rented bikes, including both casual and registered users
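Since temp, atemp, hum, and windspeed are all min-max scaled, the raw values can be recovered by inverting the scaling. A minimal sketch using the constants from the attribute descriptions above (the new column names here are only illustrative):

```python
import pandas as pd

df = pd.read_csv('../data/bike/hour.csv')

# Invert the min-max scaling described above to recover raw units
df['temp_c'] = df['temp'] * (39 - (-8)) + (-8)      # air temperature in °C
df['atemp_c'] = df['atemp'] * (50 - (-16)) + (-16)  # feeling temperature in °C
df['hum_pct'] = df['hum'] * 100                     # relative humidity in %
df['windspeed_raw'] = df['windspeed'] * 67          # wind speed in original units
```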
Advance preparation
Import modules
```python
import random

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from prettytable import PrettyTable

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_log_error
from sklearn.linear_model import Lasso, ElasticNet, Ridge, SGDRegressor, LinearRegression
from sklearn.svm import SVR, NuSVR
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              RandomForestClassifier, GradientBoostingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

%matplotlib inline
random.seed(100)
```
Define the data-loading class
```python
class Dataloader():
    '''Data loader for the bike-sharing dataset.'''

    def __init__(self, csv_path):
        '''
        Initialize the bike-sharing dataset data loader.

        param: csv_path {str} -- path to the bike-sharing dataset CSV file.
        '''
        self.csv_path = csv_path
        self.data = pd.read_csv(self.csv_path)
        # Shuffle the rows deterministically (sample() returns a new frame,
        # so the result must be assigned back)
        self.data = self.data.sample(frac=1.0, random_state=1).reset_index(drop=True)

    def getHeader(self):
        '''
        Get the column names of the bike-sharing CSV file.

        return: [list of str] -- column names of the CSV file
        '''
        return list(self.data.columns.values)

    def getData(self):
        '''
        Split the data into training, validation and test sets.

        return: pandas DataFrames -- the three subsets after splitting
        '''
        # Split into training, validation and test sets at a 60:20:20 ratio
        split_train = int(60 / 100 * len(self.data))
        split_val = int(80 / 100 * len(self.data))
        train = self.data[: split_train]
        val = self.data[split_train: split_val]
        test = self.data[split_val:]
        return train, val, test

    def getFullData(self):
        '''
        Get the complete dataset as a single DataFrame.

        return: pandas DataFrame -- the full bike-sharing dataset
        '''
        return self.data
```
Descriptive analysis
Partition training, validation and test datasets
```python
dataloader = Dataloader('../data/bike/hour.csv')
train, val, test = dataloader.getData()
fullData = dataloader.getFullData()

category_features = ['season', 'holiday', 'mnth', 'hr', 'weekday', 'workingday', 'weathersit']
number_features = ['temp', 'atemp', 'hum', 'windspeed']
features = category_features + number_features
target = ['cnt']
```
features
['season', 'holiday', 'mnth', 'hr', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed']
Get the column names of the DataFrame:
print(list(fullData.columns))
['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
Print the first five rows of the dataset to explore the data:
fullData.head(5)

Get statistics for the numeric columns:
fullData[number_features].describe()

```python
for col in category_features:
    fullData[col] = fullData[col].astype('category')
fullData[category_features].describe()
```

Missing value analysis
For more detail on missing-value analysis, see the earlier article: "Missing value processing, can you really handle it?"
Check for NULL values in the data:
print(fullData.isnull().any())
```
instant       False
dteday        False
season        False
yr            False
mnth          False
hr            False
holiday       False
weekday       False
workingday    False
weathersit    False
temp          False
atemp         False
hum           False
windspeed     False
casual        False
registered    False
cnt           False
dtype: bool
```
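No column contains missing values, so no imputation is needed here. If there were any, a minimal sketch of a typical treatment could look like the following (the choice of median for numeric columns and mode for categorical ones is an assumption, not part of this pipeline):

```python
# Illustrative only: this dataset has no missing values
fullData['windspeed'] = fullData['windspeed'].fillna(fullData['windspeed'].median())      # numeric: fill with median
fullData['weathersit'] = fullData['weathersit'].fillna(fullData['weathersit'].mode()[0])  # categorical: fill with mode
```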
Outlier analysis
Box plots
```python
sns.set(font_scale=1.0)
fig, axes = plt.subplots(nrows=3, ncols=2)
fig.set_size_inches(15, 15)

sns.boxplot(data=train, y="cnt", orient="v", ax=axes[0][0])
sns.boxplot(data=train, y="cnt", x="mnth", orient="v", ax=axes[0][1])
sns.boxplot(data=train, y="cnt", x="weathersit", orient="v", ax=axes[1][0])
sns.boxplot(data=train, y="cnt", x="workingday", orient="v", ax=axes[1][1])
sns.boxplot(data=train, y="cnt", x="hr", orient="v", ax=axes[2][0])
sns.boxplot(data=train, y="cnt", x="temp", orient="v", ax=axes[2][1])

axes[0][0].set(ylabel='Count', title="Box Plot On Count")
axes[0][1].set(xlabel='Month', ylabel='Count', title="Box Plot On Count Across Months")
axes[1][0].set(xlabel='Weather Situation', ylabel='Count', title="Box Plot On Count Across Weather Situations")
axes[1][1].set(xlabel='Working Day', ylabel='Count', title="Box Plot On Count Across Working Day")
axes[2][0].set(xlabel='Hour Of The Day', ylabel='Count', title="Box Plot On Count Across Hour Of The Day")
axes[2][1].set(xlabel='Temperature', ylabel='Count', title="Box Plot On Count Across Temperature")

for tick in axes[2][1].get_xticklabels():
    tick.set_rotation(90)
```

Analysis: the box plots for working days and holidays show that more bicycles are rented on ordinary working days than on weekends or holidays. The hourly box plot shows maxima at 8 a.m. and 5 p.m., indicating that most users of the bike rental service ride to work or school. Another important factor appears to be temperature: higher temperatures lead to an increase in the number of bicycles rented, while lower temperatures not only reduce the average rental count but also produce more outliers in the data.
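The commute peaks can also be checked numerically rather than visually; a one-line sketch using the training frame from above:

```python
# Mean rental count per hour; the largest values fall at hours 8 and 17
print(train.groupby('hr')['cnt'].mean().sort_values(ascending=False).head())
```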
Remove outliers from data
sns.distplot(train[target[-1]]);

The distribution plot of the count values shows that they do not follow a normal distribution. We will use the quartiles and the interquartile range (IQR) to identify and remove outliers from the data. (Another approach would be to transform the target values to a normal distribution and use the mean and standard deviation.)
print("Samples in the train set with outliers: {}".format(len(train))) q1 = train.cnt.quantile(0.25) q3 = train.cnt.quantile(0.75) iqr = q3 - q1 lower_bound = q1 -(1.5 * iqr) upper_bound = q3 +(1.5 * iqr) train_preprocessed = train.loc[(train.cnt >= lower_bound) & (train.cnt <= upper_bound)] print("Training sample set without outliers: {}".format(len(train_preprocessed))) sns.distplot(train_preprocessed.cnt);
```
Samples in the train set with outliers: 10427
Training sample set without outliers: 10151
```

Correlation analysis
```python
matrix = train[number_features + target].corr()
heat = np.array(matrix)
heat[np.tril_indices_from(heat)] = False  # mask the lower triangle

fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
sns.set(font_scale=1.0)
sns.heatmap(matrix, mask=heat, vmax=1.0, vmin=0.0, square=True, annot=True, cmap="Reds")
```

Conclusion: the descriptive analysis can be summarized in the following points:
- The variables "casual" and "registered" contain direct information about the bike-sharing count; using them for prediction would constitute data leakage, so they are not included in the feature set.
- The variables "temp" and "atemp" are highly correlated. To reduce the dimensionality of the prediction model, the feature "atemp" can be dropped.
- The variables "hr" and "temp" appear to be the features that contribute most to predicting the bike-sharing count.
features.remove('atemp')
Overview of evaluation metrics
Mean Squared Error (MSE)
MSE = \frac{1}{N} \sum_{i=1}^N (x_i - y_i)^2

Root Mean Squared Logarithmic Error (RMSLE)

RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^N (\log(x_i + 1) - \log(y_i + 1))^2}

R² Score
R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
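To make the metrics concrete, here is a minimal sketch (the array values are made up purely for illustration) that computes them by hand and cross-checks MSE and RMSLE against the sklearn functions imported earlier:

```python
# Illustrative arrays only; any positive values work
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mean_squared_error(y_true, y_pred))                 # the two values match
print(rmsle, np.sqrt(mean_squared_log_error(y_true, y_pred)))  # the two values match
print(r2)
```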
Model selection
The problem presented has the following characteristics:
- Regression: the target variable is a continuous value.
- Small data set: sample size less than 100K.
- Only a few features are likely important: the correlation matrix indicates that only a few features carry information for predicting the target variable.
These characteristics make methods such as ridge regression, support vector regression, ensemble regression, and random forest regression good candidates. For background on regression models, see these earlier articles:
- Multicollinearity and ridge regression in linear regression
- Machine learning | A simple and powerful detailed explanation of linear regression
- Machine learning | A deep understanding of Lasso regression analysis
- Mastering the support vector machine in sklearn
- Ensemble algorithms: the random forest regression model
- A ten-thousand-word deep dive into the strongest summary of eight linear regression algorithms!
- Principles + code: a summary of 11 regression models
- The basic mathematical principles of regression algorithms in machine learning
We will evaluate the performance of these models as follows:
```python
x_train = train_preprocessed[features].values
y_train = train_preprocessed[target].values.ravel()

# Sort the validation set for plots
val = val.sort_values(by=target)
x_val = val[features].values
y_val = val[target].values.ravel()

x_test = test[features].values

table = PrettyTable()
table.field_names = ["Model", "Mean Squared Error", "R² score"]

models = [
    SGDRegressor(max_iter=1000, tol=1e-3),
    Lasso(alpha=0.1),
    ElasticNet(random_state=0),
    Ridge(alpha=.5),
    SVR(gamma='auto', kernel='linear'),
    SVR(gamma='auto', kernel='rbf'),
    BaggingRegressor(),
    # Bagging with a KNN classifier as the base estimator (treats counts as classes)
    BaggingRegressor(KNeighborsClassifier(), max_samples=0.5, max_features=0.5),
    NuSVR(gamma='auto'),
    RandomForestRegressor(random_state=0, n_estimators=300)
]

for model in models:
    model.fit(x_train, y_train)
    y_res = model.predict(x_val)
    mse = mean_squared_error(y_val, y_res)
    score = model.score(x_val, y_val)
    table.add_row([type(model).__name__, format(mse, '.2f'), format(score, '.2f')])

print(table)
```
```
+-----------------------+--------------------+----------+
| Model                 | Mean Squared Error | R² score |
+-----------------------+--------------------+----------+
| SGDRegressor          | 43654.10           | 0.06     |
| Lasso                 | 43103.36           | 0.07     |
| ElasticNet            | 54155.92           | -0.17    |
| Ridge                 | 42963.88           | 0.07     |
| SVR                   | 50794.91           | -0.09    |
| SVR                   | 41659.68           | 0.10     |
| BaggingRegressor      | 18959.70           | 0.59     |
| BaggingRegressor      | 52475.34           | -0.13    |
| NuSVR                 | 41517.67           | 0.11     |
| RandomForestRegressor | 18949.93           | 0.59     |
+-----------------------+--------------------+----------+
```
Random forest
The random forest model
For the theory and practice of the random forest model, refer to these two earlier articles: Ensemble algorithms: the random forest classification model, and Ensemble algorithms: the random forest regression model.
```python
# Table setup
table = PrettyTable()
table.field_names = ["Model", "Dataset", "MSE", "MAE", "RMSLE", "R² score"]

# Model training
model = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                              max_features='auto', max_leaf_nodes=None,
                              min_impurity_decrease=0.0, min_impurity_split=None,
                              min_samples_leaf=1, min_samples_split=4,
                              min_weight_fraction_leaf=0.0, n_estimators=200,
                              n_jobs=None, oob_score=False, random_state=None,
                              verbose=0, warm_start=False)
model.fit(x_train, y_train)

def evaluate(x, y, dataset):
    pred = model.predict(x)
    mse = mean_squared_error(y, pred)
    mae = mean_absolute_error(y, pred)
    score = model.score(x, y)
    rmsle = np.sqrt(mean_squared_log_error(y, pred))
    table.add_row([type(model).__name__, dataset, format(mse, '.2f'),
                   format(mae, '.2f'), format(rmsle, '.2f'), format(score, '.2f')])

evaluate(x_train, y_train, 'training')
evaluate(x_val, y_val, 'validation')
print(table)
```
```
+-----------------------+------------+----------+-------+-------+----------+
| Model                 | Dataset    | MSE      | MAE   | RMSLE | R² score |
+-----------------------+------------+----------+-------+-------+----------+
| RandomForestRegressor | training   | 298.21   | 10.93 | 0.21  | 0.98     |
| RandomForestRegressor | validation | 18967.46 | 96.43 | 0.47  | 0.59     |
+-----------------------+------------+----------+-------+-------+----------+
```
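The hyperparameters above are fixed by hand. RandomizedSearchCV was imported at the start but never used; below is a minimal sketch of how it could be used to tune the forest instead (the candidate values are illustrative assumptions, not tuned settings):

```python
# Candidate hyperparameter distributions (illustrative values)
param_dist = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [None, 10, 20, 40],
    'min_samples_split': [2, 4, 8],
    'min_samples_leaf': [1, 2, 4],
}

search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions=param_dist,
                            n_iter=20,                         # parameter settings sampled
                            scoring='neg_mean_squared_error',
                            cv=3, random_state=0, n_jobs=-1)
search.fit(x_train, y_train)
print(search.best_params_)
```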
Feature importance
```python
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking: ")
for f in range(x_val.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, features[indices[f]], importances[indices[f]]))
```
```
Feature ranking:
1. feature hr (0.633655)
2. feature temp (0.160043)
3. feature hum (0.049495)
4. feature workingday (0.046797)
5. feature weathersit (0.027105)
6. feature windspeed (0.026471)
7. feature weekday (0.020293)
8. feature mnth (0.019940)
9. feature season (0.012995)
10. feature holiday (0.003206)
```
```python
# Plot the random forest feature importances
plt.figure(figsize=(14, 5))
plt.title("Feature importances")
plt.bar(range(x_val.shape[1]), importances[indices],
        color="cornflowerblue", yerr=std[indices], align="center")
plt.xticks(range(x_val.shape[1]), [features[i] for i in indices])
plt.xlim([-1, x_val.shape[1]])
plt.show()
```

Analysis: this result is consistent with the descriptive analysis above, where the variables "hr" and "temp" showed the strongest relationship with the bike-sharing count.
Closing remarks
Here are some ideas for further improving the performance of the model:
- Adjusting the distribution of the target variable: some prediction models assume a normally distributed target, and a transformation during data preprocessing can improve their performance (see the sketch after this list).
- Random forest implementations for large-scale datasets: for very large datasets (> 10 million samples) that cannot be held entirely in working memory, sklearn's Python implementation of random forests becomes very slow or runs into serious memory problems. One solution is a woody implementation, which combines a top tree for pre-partitioning with flat random forests, implemented in C, at the leaves of the top tree.
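As a minimal sketch of the first idea, the target can be trained in log space and the predictions transformed back for evaluation (np.log1p/np.expm1 handle zero counts; the model settings here are illustrative, not tuned):

```python
# Train on a log-transformed target, then invert the transform
y_train_log = np.log1p(y_train)
model_log = RandomForestRegressor(n_estimators=200, random_state=0)
model_log.fit(x_train, y_train_log)

pred = np.expm1(model_log.predict(x_val))  # back to the original count scale
print(mean_squared_error(y_val, pred))
```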