Machine learning project (I): survival probability analysis of Titanic passengers
Project background:
In 1912, the Titanic, then the world's largest and most luxuriously appointed passenger ship, reputedly "unsinkable", struck an iceberg on her maiden voyage and sank to the bottom of the Atlantic Ocean, roughly 3,700 meters down, killing more than 1,500 people on board. In 1985, a joint American-French search team found the wreck of the Titanic.
In 1997, James Cameron directed the film Titanic, which for the first time recreated the whole harrowing event on screen.
Now suppose time travel were possible and one day your soul suddenly found itself in a passenger on the sailing Titanic: you are in a cabin on some deck, your companions address you as Mr or Miss, and you have only your modern memories. Suddenly there is a loud noise, the ship shakes violently, and people panic. What are your chances of surviving at this moment? In what follows, I will use data analysis to show which characteristics affect your probability of survival.
Purpose of the project:
Using the Titanic passenger data, explore which characteristics (such as gender, age, cabin class, boarding port, family size and fare) affected a passenger's probability of survival, then build and compare machine learning models that predict survival.
Dataset:
train.csv, genderclassmodel.csv, gendermodel.csv, test.csv
Meaning of each column name in the file:
PassengerId: unique ID of a passenger
Survived: whether the passenger was rescued; 0 means not rescued, 1 means rescued
Pclass: cabin class; 1 indicates first class, 2 second class, 3 third class
Name: name
Sex: gender
Age: age
SibSp: number of siblings/spouses aboard; Parch: number of parents/children aboard
Embarked: boarding port; S indicates Southampton, C indicates Cherbourg, Q indicates Queenstown
Read data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Import data:
data = pd.read_csv('./train.csv')
data.head(10)
Check for missing values
data.isnull().sum() #Check for missing values
View summary statistics
data.describe() #View summary statistics of the numeric columns
View the rescued proportion
f,ax=plt.subplots(1,2,figsize=(18,8))
data['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
sns.countplot('Survived',data=data,ax=ax[1])
ax[1].set_title('Survived')
plt.show()
Clearly, not many passengers survived the accident. Of the 891 passengers in the training set, only 342 (38.4%) survived the sinking. We need to mine more information from the data to see which categories of passengers survived and which did not. We will examine survival against different features of the dataset, such as gender, age and boarding port, but first we have to understand the features in the data.
Analyze data characteristics
Data feature classification
The features fall into a few broad types: categorical (Sex, Embarked), ordinal (Pclass) and continuous (Age, Fare); we will treat them accordingly below.
Check the relationship between gender and survival
data.groupby(['Sex','Survived'])['Survived'].count()
f,ax=plt.subplots(1,2,figsize=(18,8)) # Draw 1 * 2 sub images with a size of 18 * 8
data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survived vs Sex')
sns.countplot('Sex',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Sex:Survived vs Dead')
plt.show()
Check the relationship between cabin class and survival
pd.crosstab(data.Pclass,data.Survived,margins=True).style.background_gradient(cmap='summer_r')
We can see from the table that of the 216 passengers in first class, 136 were rescued (about 63%); of the 184 in second class, 87 were rescued (about 47%); and of the 491 in third class, only 119 were rescued (about 24%), the lowest proportion. We may say that all lives are equal, but the reality was cruel: the rich had a much better chance of being saved than the poor.
Bar charts of cabin class and survival counts
f,ax=plt.subplots(1,2,figsize=(18,8)) # Draw 1 * 2 sub images with a size of 18 * 8
data['Pclass'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'],ax=ax[0])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
sns.countplot('Pclass',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')
plt.show()
People say money can't buy everything, yet we can clearly see that passengers in Pclass 1 were given high priority in the rescue. Although Pclass 3 had far more passengers, the proportion who survived there is still very low, about 25%.
Does this have anything to do with gender? Next, let's look at the impact of cabin class and gender on the results.
Check the impact of cabin class and gender on the results
pd.crosstab([data.Sex,data.Survived],data.Pclass,margins=True).style.background_gradient(cmap='summer_r')
It can be seen from the table above that the proportion of women rescued in first class is very high: of 94 women, 91 were rescued (about 97%).
Survival rate by cabin class and gender (point plot)
sns.factorplot('Pclass','Survived',hue='Sex',data=data)
plt.show()
Influence of continuous features on survival
print('Oldest Passenger was of:',data['Age'].max(),'Years')
print('Youngest Passenger was of:',data['Age'].min(),'Years')
print('Average Age on the ship:',data['Age'].mean(),'Years')
f,ax=plt.subplots(1,2,figsize=(18,8))
sns.violinplot("Pclass","Age", hue="Survived", data=data,split=True,ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot("Sex","Age", hue="Survived", data=data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()
- Children under 10 years old had a relatively good survival rate regardless of cabin class.
- Passengers aged 20-50 in Pclass 1 had a higher probability of being rescued.
- For men, the chance of survival decreases with age.
Data cleaning and preprocessing
As we saw earlier, the age feature has 177 null values. To replace these missing values, we can assign them the average age of the data set.
But the problem is that passengers span many different ages, and assigning a small child the overall mean age would be misleading. A better approach is to find a suitable group for each passenger.
We can use the Name feature: names contain titles like Mr. or Mrs., so we can assign the average age of each title group to the missing values in that group.
Missing value fill
Extract the title from each name
data['Initial']=data.Name.str.extract(r'([A-Za-z]+)\.', expand=False) # extract the title: the word immediately before the '.'
pd.crosstab(data.Initial,data.Sex).T.style.background_gradient(cmap='summer_r') # cross-tabulate the titles against gender to check them
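For reference, a quick check of where the hard-coded ages used in the fill step come from (a sketch, not a cell from the original flow; the values are approximately the mean age of each title group, easiest to read after the rare titles are merged in the next cell):
data.groupby('Initial')['Age'].mean() # roughly Mr 33, Mrs 36, Master 5, Miss 22, Other 46 after merging rare titles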
Fill in missing values
data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
## Fill with the mean of each group
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46
Screening valuable features
Observe the age distributions of survivors and non-survivors
f,ax=plt.subplots(1,2,figsize=(20,10))
data[data['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('Survived= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
data[data['Survived']==1].Age.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()
Observation:
1) A large number of young children (under the age of 5) were rescued (the "women and children first" policy).
2) The oldest passenger rescued was 80 years old.
3) The highest number of deaths was in the 30-40 age group.
Survival rate by cabin class, split by title (Initial)
sns.factorplot('Pclass','Survived',col='Initial',data=data)
plt.show()
Effect of boarding location on results
pd.crosstab([data.Embarked,data.Pclass],[data.Sex,data.Survived],margins=True).style.background_gradient(cmap='summer_r')
sns.factorplot('Embarked','Survived',data=data)
fig=plt.gcf()
fig.set_size_inches(5,3)
plt.show()
Passengers who boarded at port C had the highest survival probability, about 0.55, while those who boarded at port S had the lowest.
Boarding port versus other features
f,ax=plt.subplots(2,2,figsize=(20,15))
sns.countplot('Embarked',data=data,ax=ax[0,0])
ax[0,0].set_title('No. Of Passengers Boarded')
sns.countplot('Embarked',hue='Sex',data=data,ax=ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')
sns.countplot('Embarked',hue='Survived',data=data,ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')
sns.countplot('Embarked',hue='Pclass',data=data,ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')
plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()
Observation:
- Most passengers were in cabin class 3.
- Passengers who boarded at C appear luckier: a good proportion of them survived.
- Many of the wealthy (first-class) passengers boarded at S, yet its overall survival rate is still low.
- Almost 95% of the passengers who boarded at Q were in third class.
Relationship between cabin class, survival rate, gender, boarding place and survival rate
sns.factorplot('Pclass','Survived',hue='Sex',col='Embarked',data=data) plt.show()
Observation:
- The probability of survival is almost 1 for women in pclass1 and pclass2.
- The survival rate of male and female passengers in pclass3 is very low.
- Port Q is unfortunate because it is full of third-class passengers.
The Embarked column also has missing values. I fill them with the mode, S, since the largest number of passengers boarded there.
data['Embarked'].fillna('S',inplace=True)
data.Embarked.isnull().any()
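Equivalently (a sketch), the fill value need not be hard-coded; the column's own mode can be used:
data['Embarked'].fillna(data['Embarked'].mode()[0],inplace=True) # same effect as above, since the most common value of Embarked is 'S'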
Relationship between the number of siblings/spouses (SibSp) and survival rate
pd.crosstab([data.SibSp],data.Survived).style.background_gradient(cmap='summer_r')
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot('SibSp','Survived',data=data,ax=ax[0])
ax[0].set_title('SibSp vs Survived')
sns.pointplot('SibSp','Survived',data=data,ax=ax[1])
ax[1].set_title('SibSp vs Survived')
plt.close(2)
plt.show()
pd.crosstab(data.SibSp,data.Pclass).style.background_gradient(cmap='summer_r')
Observation:
The barplot and pointplot show that a passenger travelling alone, with no siblings or spouse on board, had about a 35% survival rate, and that the survival rate roughly decreases as the number of siblings increases. In other words, a family on board would try to save each other rather than themselves first. Surprisingly, though, the survival rate for families with 5-8 members is 0%; the reason may be that they were almost all in Pclass 3.
Relationship between the number of parents and children and survival rate
pd.crosstab(data.Parch,data.Pclass).style.background_gradient(cmap='summer_r')
Compare the relationship between passengers with parents and survival
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot('Parch','Survived',data=data,ax=ax[0])
ax[0].set_title('Parch vs Survived')
sns.pointplot('Parch','Survived',data=data,ax=ax[1])
ax[1].set_title('Parch vs Survived')
plt.close(2)
plt.show()
Observation:
The results here are similar: passengers travelling with parents or children had a greater chance of survival, but the chance decreases as the number goes up.
Passengers with 1-3 parents/children on board had a good chance of survival; being alone proved comparatively fatal, and the chance of survival drops again for passengers with 4 or more parents/children on board.
Ticket fare
Fare is also a continuous feature, so we will need to discretize it into bands later, in the feature engineering step.
print('Highest Fare was:',data['Fare'].max())
print('Lowest Fare was:',data['Fare'].min())
print('Average Fare was:',data['Fare'].mean())
Fare distribution within each cabin class
f,ax=plt.subplots(1,3,figsize=(20,8))
sns.distplot(data[data['Pclass']==1].Fare,ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.distplot(data[data['Pclass']==2].Fare,ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.distplot(data[data['Pclass']==3].Fare,ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
plt.show()
Observations for all features:
Pclass: there is a clear trend that first-class passengers had a better chance of survival, while the survival rate for Pclass 3 is very low; for women, the chance of survival in Pclass 1 is almost 1.
Age: children younger than about 5-10 years old had a high survival rate, while many passengers between 15 and 35 died.
Embarked: passengers who boarded at C had better survival chances even though most first-class passengers boarded at S; passengers who boarded at Q were almost all from Pclass 3.
Family: having 1-2 siblings, spouses or parents/children aboard gave a greater chance of survival than travelling alone or with a large family.
Correlation between features
# Correlation heat map
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths=0.2) # data.corr() --> correlation matrix
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()
Interpreting the correlation heat map
The first thing to note is that only numerical features are compared.
Positive correlation: if an increase in feature A leads to an increase in feature B, they are positively correlated; a value of 1 means a perfect positive correlation.
Negative correlation: if an increase in feature A leads to a decrease in feature B, they are negatively correlated; a value of -1 means a perfect negative correlation.
Now suppose two features are highly or perfectly correlated, so an increase in one leads to an increase in the other. That means both features contain very similar information and there is little or no extra variance between them; this is known as multicollinearity, and such a pair adds little value.
So should we use both of them? When building or training models, we should try to eliminate redundant features, since doing so reduces training time and brings other advantages.
Now, from the figure above, we can see that the features are not significantly correlated.
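As a minimal sketch (not part of the original analysis), one way to flag pairs of numeric features whose correlation exceeds a chosen threshold, here assumed to be 0.8:
# List numeric feature pairs with |correlation| above the threshold
corr = data.corr()
threshold = 0.8
high_corr_pairs = [(a, b, round(corr.loc[a, b], 2))
                   for i, a in enumerate(corr.columns)
                   for b in corr.columns[i+1:]
                   if abs(corr.loc[a, b]) > threshold]
print(high_corr_pairs) # expected to be empty here, since the heat map above shows no strong correlations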
Feature Engineering and data cleaning
When we get a dataset with features, are all the features important? There may be many redundant features that should be eliminated. We can also obtain or add new features by observing or extracting information from other features.
The Age feature
As I mentioned earlier, Age is a continuous feature, and continuous variables are awkward to use directly in many machine learning models.
For example, if I ask you to group athletes by sex, you can easily split them into male and female.
But what if I ask you to group them by age? If there are 30 people, there may be 30 different age values.
We need to discretize the continuous values into bins in order to group them.
The maximum passenger age is 80, so we will split the range 0-80 into 5 bins, each of width 80 / 5 = 16.
data['Age_band']=0
data.loc[data['Age']<=16,'Age_band']=0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head(2)
data['Age_band'].value_counts().to_frame().style.background_gradient(cmap='summer') # Check how many passengers fall into each age band
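Equivalently (a sketch, not a cell from the original notebook), pandas' cut function can produce the same bands in one step:
# Assumed-equivalent one-liner: 5 equal-width bands over 0-80; labels=False returns the band index 0-4
data['Age_band'] = pd.cut(data['Age'], bins=[0,16,32,48,64,80], labels=False, include_lowest=True)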
Relationship between age, cabin class and survival rate
sns.factorplot('Age_band','Survived',data=data,col='Pclass')
plt.show()
Family_Size: total size of the family on board
Looking at siblings/spouses (SibSp) and parents/children (Parch) separately is indirect, so we combine them into a single feature for the size of the whole family, plus an Alone flag.
data['Family_Size']=0
data['Family_Size']=data['Parch']+data['SibSp'] # family size
data['Alone']=0
data.loc[data.Family_Size==0,'Alone']=1 # travelling alone
f,ax=plt.subplots(1,2,figsize=(18,6))
sns.pointplot('Family_Size','Survived',data=data,ax=ax[0])
ax[0].set_title('Family_Size vs Survived')
sns.pointplot('Alone','Survived',data=data,ax=ax[1])
ax[1].set_title('Alone vs Survived')
plt.close(2)
plt.show()
Relationship between travelling alone, cabin class, gender and survival rate
sns.factorplot('Alone','Survived',data=data,hue='Sex',col='Pclass')
plt.show()
Ticket price
data['Fare_Range']=pd.qcut(data['Fare'],4)
data.groupby(['Fare_Range'])['Survived'].mean().to_frame().style.background_gradient(cmap='summer_r')
data['Fare_cat']=0
data.loc[data['Fare']<=7.91,'Fare_cat']=0
data.loc[(data['Fare']>7.91)&(data['Fare']<=14.454),'Fare_cat']=1
data.loc[(data['Fare']>14.454)&(data['Fare']<=31),'Fare_cat']=2
data.loc[(data['Fare']>31)&(data['Fare']<=513),'Fare_cat']=3
Relationship between ticket price, gender and survival rate
sns.factorplot('Fare_cat','Survived',data=data,hue='Sex')
plt.show()
Convert string features to numeric values
data['Sex'].replace(['male','female'],[0,1],inplace=True)
data['Embarked'].replace(['S','C','Q'],[0,1,2],inplace=True)
data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4],inplace=True)
Remove unnecessary features
Name -> we don't need the name itself, since it cannot be converted into a useful categorical value (the title has already been extracted).
Age -> we already have the Age_band feature, so this is not required.
Ticket -> an arbitrary string that cannot be categorized.
Fare -> we already have the Fare_cat feature, so this is not required.
Cabin -> contains many missing values, so it is dropped.
PassengerId -> just an identifier and cannot be categorized.
Heat map after data type conversion
data.drop(['Name','Age','Ticket','Fare','Cabin','Fare_Range','PassengerId'],axis=1,inplace=True)
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths=0.2,annot_kws={'size':20})
fig=plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
Machine learning modeling
Import all ML packages
#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive Bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
train,test=train_test_split(data,test_size=0.3,random_state=0,stratify=data['Survived'])
train_X=train[train.columns[1:]]
train_Y=train[train.columns[:1]]
test_X=test[test.columns[1:]]
test_Y=test[test.columns[:1]]
X=data[data.columns[1:]]
Y=data['Survived']
Algorithms
Support vector machine
Radial support vector machine
model=svm.SVC(kernel='rbf',C=1,gamma=0.1)
model.fit(train_X,train_Y)
prediction1=model.predict(test_X)
print('Accuracy for rbf SVM is ',metrics.accuracy_score(prediction1,test_Y))
Program running results:
Accuracy for rbf SVM is 0.835820895522388
Linear support vector machine
model=svm.SVC(kernel='linear',C=0.1,gamma=0.1)
model.fit(train_X,train_Y)
prediction2=model.predict(test_X)
print('Accuracy for linear SVM is',metrics.accuracy_score(prediction2,test_Y))
Program running results:
Accuracy for linear SVM is 0.8171641791044776
logistic regression
model = LogisticRegression()
model.fit(train_X,train_Y)
prediction3=model.predict(test_X)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(prediction3,test_Y))
Program running results:
The accuracy of the Logistic Regression is 0.8134328358208955
Decision tree
model=DecisionTreeClassifier()
model.fit(train_X,train_Y)
prediction4=model.predict(test_X)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction4,test_Y))
Program running results:
The accuracy of the Decision Tree is 0.8097014925373134
K-nearest neighbor algorithm
model=KNeighborsClassifier()
model.fit(train_X,train_Y)
prediction5=model.predict(test_X)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction5,test_Y))
Program running results:
The accuracy of the KNN is 0.832089552238806
Check the accuracy for different values of n_neighbors in KNN
a_index=list(range(1,11))
a=pd.Series()
x=[0,1,2,3,4,5,6,7,8,9,10]
for i in list(range(1,11)):
    model=KNeighborsClassifier(n_neighbors=i)
    model.fit(train_X,train_Y)
    prediction=model.predict(test_X)
    a=a.append(pd.Series(metrics.accuracy_score(prediction,test_Y)))
plt.plot(a_index, a)
plt.xticks(x)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()
print('Accuracies for different values of n are:',a.values,'with the max value as ',a.values.max())
Program running results:
Naive Bayes
model=GaussianNB()
model.fit(train_X,train_Y)
prediction6=model.predict(test_X)
print('The accuracy of the NaiveBayes is',metrics.accuracy_score(prediction6,test_Y))
Program running results:
The accuracy of the NaiveBayes is 0.8134328358208955
Random forest algorithm
model=RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_Y)
prediction7=model.predict(test_X)
print('The accuracy of the Random Forests is',metrics.accuracy_score(prediction7,test_Y))
Program running results:
The accuracy of the Random Forests is 0.8059701492537313
The accuracy on a single train/test split is not the only thing that determines how good a classifier is. The classifier is trained on the training data and evaluated on a test set; suppose its accuracy there is 90%. Can we be sure it will be 90% accurate on every new test set? No, because we cannot know in advance which data the classifier will be used on. As the training and test data change, the accuracy will also change, up or down.
To overcome this and obtain a more generalized estimate of performance, we use cross validation.
Cross validation
A single test set isn't enough; averaging over multiple rounds is a better strategy.
- K-fold cross validation works by first dividing the dataset into k subsets.
- Say we split the dataset into k = 5 parts: we reserve one part for testing and train on the other four parts.
- We repeat the process, changing which part is held out for testing in each iteration and training on the rest, then average the scores to get the mean accuracy of the algorithm.
- This is called k-fold cross validation (a minimal manual sketch follows below).
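For intuition, here is a minimal manual sketch of the same idea with k = 5, using logistic regression as the example model (an illustration only; the actual comparison below uses scikit-learn's cross_val_score with k = 10):
from sklearn.model_selection import KFold
import numpy as np
# Train on k-1 folds, score on the held-out fold, then average the fold accuracies
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    fold_model = LogisticRegression()
    fold_model.fit(X.iloc[train_idx], Y.iloc[train_idx])
    scores.append(metrics.accuracy_score(Y.iloc[test_idx], fold_model.predict(X.iloc[test_idx])))
print('Mean 5-fold CV accuracy for Logistic Regression:', np.mean(scores))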
Verification process
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
kfold = KFold(n_splits=10) # k=10, split the data into 10 equal parts
xyz=[]
accuracy=[]
std=[]
classifiers=['Linear Svm','Radial Svm','Logistic Regression','KNN','Decision Tree','Naive Bayes','Random Forest']
models=[svm.SVC(kernel='linear'),svm.SVC(kernel='rbf'),LogisticRegression(),KNeighborsClassifier(n_neighbors=9),DecisionTreeClassifier(),GaussianNB(),RandomForestClassifier(n_estimators=100)]
for i in models:
    model = i
    cv_result = cross_val_score(model,X,Y, cv = kfold,scoring = "accuracy")
    xyz.append(cv_result.mean())
    std.append(cv_result.std())
    accuracy.append(cv_result)
new_models_dataframe2=pd.DataFrame({'CV Mean':xyz,'Std':std},index=classifiers)
new_models_dataframe2
Program running results:
Box plot of the cross-validation accuracies
plt.subplots(figsize=(12,6))
box=pd.DataFrame(accuracy,index=[classifiers])
box.T.boxplot()
Program running results:
CV Mean bar chart
new_models_dataframe2['CV Mean'].plot.barh(width=0.8)
plt.title('Average CV Mean Accuracy')
fig=plt.gcf()
fig.set_size_inches(8,5)
plt.show()
Program running results:
Confusion matrix
It gives the number of correct and incorrect classifications made by each classifier.
f,ax=plt.subplots(3,3,figsize=(12,10))
y_pred = cross_val_predict(svm.SVC(kernel='rbf'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,0],annot=True,fmt='2.0f')
ax[0,0].set_title('Matrix for rbf-SVM')
y_pred = cross_val_predict(svm.SVC(kernel='linear'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,1],annot=True,fmt='2.0f')
ax[0,1].set_title('Matrix for Linear-SVM')
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,2],annot=True,fmt='2.0f')
ax[0,2].set_title('Matrix for KNN')
y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,0],annot=True,fmt='2.0f')
ax[1,0].set_title('Matrix for Random-Forests')
y_pred = cross_val_predict(LogisticRegression(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,1],annot=True,fmt='2.0f')
ax[1,1].set_title('Matrix for Logistic Regression')
y_pred = cross_val_predict(DecisionTreeClassifier(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,2],annot=True,fmt='2.0f')
ax[1,2].set_title('Matrix for Decision Tree')
y_pred = cross_val_predict(GaussianNB(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[2,0],annot=True,fmt='2.0f')
ax[2,0].set_title('Matrix for Naive Bayes')
plt.subplots_adjust(hspace=0.2,wspace=0.2)
plt.show()
Program running results:
Interpreting the confusion matrix (look at the first plot, for rbf-SVM):
1) The correct predictions are 491 (died) + 247 (survived), so the mean CV accuracy is (491 + 247) / 891 = 82.8%.
2) The off-diagonal entries, 58 and 95, are the errors: 58 passengers who died were predicted as survived, and 95 who survived were predicted as died (a programmatic way to read off these counts is sketched below).
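For reference, the same counts can be read off programmatically (a sketch using the rbf-SVM predictions, matching the first matrix above):
# Unpack the 2x2 confusion matrix: rows are the true labels, columns the predictions
y_pred = cross_val_predict(svm.SVC(kernel='rbf'),X,Y,cv=10)
tn, fp, fn, tp = confusion_matrix(Y, y_pred).ravel()
print('correctly predicted died:',tn,' correctly predicted survived:',tp)
print('died but predicted survived:',fp,' survived but predicted died:',fn)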
Hyperparameter tuning
A machine learning model is like a black box with some default parameter values that we can adjust to get a better model. For example, C and gamma in the support vector machine model are hyperparameters, and they can have a large impact on the results.
Support vector machine (SVM)
from sklearn.model_selection import GridSearchCV
C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
kernel=['rbf','linear']
hyper={'kernel':kernel,'C':C,'gamma':gamma}
gd=GridSearchCV(estimator=svm.SVC(),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Program running results:
Fitting 5 folds for each of 240 candidates, totalling 1200 fits
0.8282593685267716
SVC(C=0.4, gamma=0.3)
Random forest
n_estimators=range(100,1000,100)
hyper={'n_estimators':n_estimators}
gd=GridSearchCV(estimator=RandomForestClassifier(random_state=0),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Program running results:
Fitting 5 folds for each of 9 candidates, totalling 45 fits
0.819327098110602
RandomForestClassifier(n_estimators=300, random_state=0)
Ensembling
Ensembling is a good way to improve the accuracy and performance of a model. Simply put, it combines several simple models into a single, more powerful one. Common types are:
1) Bagging (parallel ensembles, e.g. random forest)
2) Boosting
3) Stacking
Voting classifier
This is the simplest way to combine the predictions of many different simple machine learning models: it gives an averaged prediction based on the prediction of each sub-model.
from sklearn.ensemble import VotingClassifier
ensemble_lin_rbf=VotingClassifier(estimators=[('KNN',KNeighborsClassifier(n_neighbors=10)),
                                              ('RBF',svm.SVC(probability=True,kernel='rbf',C=0.5,gamma=0.1)),
                                              ('RFor',RandomForestClassifier(n_estimators=500,random_state=0)),
                                              ('LR',LogisticRegression(C=0.05)),
                                              ('DT',DecisionTreeClassifier(random_state=0)),
                                              ('NB',GaussianNB()),
                                              ('svm',svm.SVC(kernel='linear',probability=True))],
                                  voting='soft').fit(train_X,train_Y)
print('The accuracy for ensembled model is:',ensemble_lin_rbf.score(test_X,test_Y))
cross=cross_val_score(ensemble_lin_rbf,X,Y, cv = 10,scoring = "accuracy")
print('The cross validated score is',cross.mean())
Program running results:
The accuracy for ensembled model is: 0.8246268656716418
The cross validated score is 0.8237952559300874
Bagged KNN
from sklearn.ensemble import BaggingClassifier
model=BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=3),random_state=0,n_estimators=700)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged KNN is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged KNN is:',result.mean())
Program running results:
The accuracy for bagged KNN is: 0.835820895522388
The cross validated score for bagged KNN is: 0.8160424469413232
Bagged decision tree
model=BaggingClassifier(base_estimator=DecisionTreeClassifier(),random_state=0,n_estimators=100)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged Decision Tree is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged Decision Tree is:',result.mean())
Program running results:
The accuracy for bagged Decision Tree is: 0.8208955223880597
The cross validated score for bagged Decision Tree is: 0.8171410736579275
Boosting is a sequential technique that gradually strengthens a weak model: the model is first trained on the complete dataset and will get some instances right and some wrong. In the next iteration, the learner pays more attention to the wrongly predicted instances, i.e. gives them more weight.
AdaBoost (adaptive boosting)
Here the weak learner, or base estimator, is a decision tree by default, but we can change the base estimator to any algorithm of our choice (a sketch follows after the results below).
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.1)
result=cross_val_score(ada,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for AdaBoost is:',result.mean())
Program running results:
The cross validated score for AdaBoost is: 0.8249188514357055
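As noted above, the base estimator of AdaBoost can be swapped; a minimal sketch of that idea (an assumed variant, not part of the original analysis, and not tuned):
# Hypothetical variant: AdaBoost over logistic regression base estimators instead of the default decision-tree stumps
ada_lr=AdaBoostClassifier(base_estimator=LogisticRegression(),n_estimators=100,random_state=0)
result=cross_val_score(ada_lr,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for AdaBoost with a LogisticRegression base is:',result.mean())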
Gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
grad=GradientBoostingClassifier(n_estimators=500,random_state=0,learning_rate=0.1)
result=cross_val_score(grad,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for Gradient Boosting is:',result.mean())
Program running results:
The cross validated score for Gradient Boosting is: 0.8115230961298376
Hyperparameter tuning for AdaBoost
n_estimators=list(range(100,1100,100))
learn_rate=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
hyper={'n_estimators':n_estimators,'learning_rate':learn_rate}
gd=GridSearchCV(estimator=AdaBoostClassifier(),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Program running results:
Fitting 5 folds for each of 120 candidates, totalling 600 fits
0.8293892411022534
AdaBoostClassifier(learning_rate=0.1, n_estimators=100)
Confusion matrix
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.05)
result=cross_val_predict(ada,X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,result),cmap='winter',annot=True,fmt='2.0f')
plt.show()
Program running results:
The highest accuracy we obtained with AdaBoost here is 83.16%, with n_estimators = 200 and learning_rate = 0.05.
Feature importance
import xgboost as xg
f,ax=plt.subplots(2,2,figsize=(15,12))
model=RandomForestClassifier(n_estimators=500,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
ax[0,0].set_title('Feature Importance in Random Forests')
model=AdaBoostClassifier(n_estimators=200,learning_rate=0.05,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1],color='#ddff11')
ax[0,1].set_title('Feature Importance in AdaBoost')
model=GradientBoostingClassifier(n_estimators=500,learning_rate=0.1,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,0],cmap='RdYlGn_r')
ax[1,0].set_title('Feature Importance in Gradient Boosting')
model=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,1],color='#FD0F00')
ax[1,1].set_title('Feature Importance in XgBoost')
plt.show()
Program running results: