# Titanic

Posted by gorgo666 on Mon, 03 Jan 2022 22:54:52 +0100

Pay attention to the entrance of warehouse 1 here

The captain asked the women and children to get on the ship first, but the women and children who got on the ship first were first class.

Draw a simple conclusion: 1 First class has a good chance of survival. 2. Women and children have a high probability of survival. 3. The first two factors are man-made

# I Data exploration

Let's import the data first

```import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print(train_df.columns)
```
```PassengerId => passenger ID
Pclass => Passenger class(1/2/3 Class space)
Name => Passenger name
Sex => Gender
Age => Age
SibSp => male cousins/Number of sisters
Parch => Number of parents and children
Ticket => Ticket information
Fare => Ticket Price
Cabin => passenger cabin
Embarked => Boarding port
```

Data characteristics are divided into continuous value and discrete value

• Discrete value: gender (male, female) boarding place (S,Q,C) cabin class (1, 2, 3)

• Continuous value: age, ticket price

Then let's look at the survival of personnel

```# View survival ratio
fig,axes = plt.subplots(1,2,figsize=(18,8))
axes[0].set_title('Survived')
axes[0].set_ylabel(' ')

sns.countplot('Survived',data=train_df,ax=axes[1])
axes[1].set_title('Survived')
```

Here is only the data of the training set

Explore discrete features
Single feature
Only 38.4% of the personnel on the ship survived. Next, check the relationship between various attributes and retention!

```# Relationship between gender and survival
fig,axes = plt.subplots(1,2,figsize=(18,8))
train_df[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=axes[0])
axes[0].set_title('Survived vs Sex')

sns.countplot('Sex',hue='Survived',data=train_df,ax=axes[1])
```

It can be seen from the first picture that the proportion of women rescued is 75%, while that of men is less than 20%. From the right picture, it can be observed that there are more men on board than women, but the survival rate is very low. This is a very distinguishing feature and must be used

Next, let's look at the relationship between cabin class and survival

```# Relationship between cabin class and survival rate

# Relationship between cabin class and survival rate
fig,axes = plt.subplots(1,2,figsize=(18,8))
train_df['Pclass'].value_counts().plot.bar(ax=axes[0])
axes[0].set_title('Number Of Passengers By Pclass')
axes[0].set_ylabel('Count')

sns.countplot('Pclass',hue='Survived',data=train_df,ax=axes[1])
```

It can be seen from the figure on the left that the passengers with cabin registration of 3 are the lowest, but the corresponding survival rate in the figure on the right is very low. The survival rates of cabin registration of 1 and 2 are higher. We can analyze that the passengers in cabins 1 and 2 are high-level, rich and have a great chance of being rescued.

```# Relationship between landing port and survival rate
fig,axes = plt.subplots(1,2,figsize=(18,8))
train_df['Embarked'].value_counts().plot.bar(ax=axes[0])
axes[0].set_title('Number Of Passengers By Embarked')
axes[0].set_ylabel('Count')

sns.countplot('Embarked',hue='Survived',data=train_df,ax=axes[1])
```

S. There is no significant difference between Q and S. s has the largest number of boarding. C. There are obvious differences between boarding ports. The number of boarding is the least, and the rescued proportion is higher than 50%. Let's remember this feature for the time being.

Let's now look at boarding ports and gender and survival

```sns.barplot('Pclass','Survived',hue='Sex',data=train_df)
```

The survival rate of port 1 is relatively high, which is consistent with the previous analysis. The male survival rate of port 3 is the lowest, about 15%

Boarding port and cabin class and survival rate

```sns.factorplot('Pclass','Survived',hue='Sex',col='Embarked',data=train_df)
plt.show()
```

Explore continuous features

```# Age
print('Oldest Passenger was of:',train_df['Age'].max(),'Years')
print('Youngest Passenger was of:',train_df['Age'].min(),'Years')
print('Average Age on the ship:',train_df['Age'].mean(),'Years')
```
```Oldest Passenger was of: 80.0 Years
Youngest Passenger was of: 0.42 Years
Average Age on the ship: 29.69911764705882 Years
```
```# The effects of age, cabin class and gender on survival were visualized
fig,axes = plt.subplots(1,3,figsize=(18,8))
sns.violinplot('Pclass','Age',hue='Survived',data=train_df,split=True,ax=axes[0])
axes[0].set_title('Pclass and Age vs Survived')
axes[0].set_yticks(range(0,110,10))

sns.violinplot("Sex","Age", hue="Survived", data=train_df,split=True,ax=axes[1])
axes[1].set_title('Sex and Age vs Survived')
axes[1].set_yticks(range(0,110,10))

sns.violinplot("Embarked","Age", hue="Survived", data=train_df,split=True,ax=axes[2])
axes[2].set_title('Embarked and Age vs Survived')
axes[2].set_yticks(range(0,110,10))
```

Conclusion:

• The survival rate of children under 10 years old increased with the increase of the number of passenger s
• 20-50 year olds are more likely to be rescued
• Q port over 45 and under 5 years old are not likely to survive

Next, fill in the data

```# Average age by group
train_df.groupby('Initial')['Age'].mean()
```
```Initial
Master     4.574167
Miss      21.860000
Mr        32.739609
Mrs       35.981818
Other     45.888889
Name: Age, dtype: float64
```
```# Fill in the missing values using the mean of each group
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Mr'),'Age']=33
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Mrs'),'Age']=36
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Master'),'Age']=5
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Miss'),'Age']=22
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Other'),'Age']=46
```
```# Drawing visualization
fig,axes = plt.subplots(1,2,figsize=(20,10))
train_df[train_df['Survived']==0].Age.plot.hist(ax=axes[0],bins=20,edgecolor='black',color='red')
axes[0].set_title('Survived= 0')
x1 = list(range(0,85,5))
axes[0].set_xticks(x1)

train_df[train_df['Survived']==1].Age.plot.hist(ax=axes[1],color='green',bins=20,edgecolor='black')
axes[1].set_title('Survived= 1')
x2 = list(range(0,85,5))
axes[1].set_xticks(x2)
```

```sns.factorplot('Pclass','Survived',col='Initial',data=train_df)
```

```# Populate the embanked property with missing values
# Use the mode to fill in the missing value of port, and the mode is port S
train_df['Embarked'].fillna('S',inplace=True)
```
```# Number of brothers and sisters
```
```# The plot shows the relationship between the number of siblings and survival
fig,axes = plt.subplots(1,2,figsize=(20,8))
sns.barplot('SibSp','Survived',data=train_df,ax=axes[0])
axes[0].set_title('SibSp vs Survived')

sns.factorplot('SibSp','Survived',data=train_df,ax=axes[1])
axes[1].set_title('SibSp vs Survived')

plt.close(2)
```

```# Is there any relationship between brothers and sisters and cabin class
# Number of parents
# The drawing visualizes whether there is a relationship between parents and children and survival
f,ax = plt.subplots(1,2,figsize=(20,8))
sns.barplot('Parch','Survived',data=train_df,ax=ax[0])
ax[0].set_title('Parch vs Survived')
sns.factorplot('Parch','Survived',data=train_df,ax=ax[1])
ax[1].set_title('Parch vs Survived')
plt.close(2)
plt.show()
```

```# Ticket price
print('Highest Fare was:',train_df['Fare'].max())
print('Lowest Fare was:',train_df['Fare'].min())
print('Average Fare was:',train_df['Fare'].mean())
```
```Highest Fare was: 512.3292
Lowest Fare was: 0.0
Average Fare was: 32.204207968574636
```
```# Drawing visualization
f,ax = plt.subplots(1,3,figsize=(20,8))
sns.distplot(train_df[train_df['Pclass']==1].Fare,ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.distplot(train_df[train_df['Pclass']==2].Fare,ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.distplot(train_df[train_df['Pclass']==3].Fare,ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
```

Summary:
• Gender: compared with men, women have a high chance of survival.
• Pclass: the survival rate of class I cabin is very high. Unfortunately, the survival rate of pclass3 is very low.
• Age: children younger than 5-10 years old have a high survival rate. Passengers between the ages of 15 and 35 die a lot.
• Port: there are also differences in the positions coming up, and the mortality is also great!
• Family: if you have 1-2 siblings, spouses or parents, you have a greater chance of survival than if you travel alone or with a large family.

# II. Data cleaning

```import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
```
```train_df = pd.read_csv(r'C:\Users\dahuo\Desktop\Dataset 2\train.csv')
print(train_df.columns)
```
```# Draw a thermal map to see the correlation between the previous features
sns.heatmap(train_df.corr(),annot=True,cmap='RdYlGn',linewidths=0.2)
```

It can be seen from the above figure that the characteristics are not significantly correlated.

Missing value fill
Filling age missing value

```train_df['Initial'] = train_df.Name.str.extract('([A-Za-z]+)\.')
# Average age by group
train_df.groupby('Initial')['Age'].mean()
```
```# Fill in the missing values using the mean of each group
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Mr'),'Age']=33
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Mrs'),'Age']=36
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Master'),'Age']=5
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Miss'),'Age']=22
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Other'),'Age']=46
```

Fill in missing values

```# Use mode S
train_df.Embarked.fillna('S',inplace=True)
train_df.Embarked.isnull().any()
```

Data cleaning

```# Discretization of age characteristics
# The age distribution of passengers was divided into 5 groups, and the age interval of one group was 16
train_df.loc[train_df['Age']<=16,'Age_band'] = 0;
train_df.loc[(train_df['Age']>16)&(train_df['Age']<=32),'Age_band'] = 1;
train_df.loc[(train_df['Age']>32)&(train_df['Age']<=48),'Age_band'] = 2;
train_df.loc[(train_df['Age']>48)&(train_df['Age']<=64),'Age_band'] = 3;
train_df.loc[train_df['Age']>64,'Age_band'] = 4;
```
```# Increase total household population
train_df['Family_num'] = train_df['Parch'] + train_df['SibSp']
# Whether to increase the characteristics of being alone
train_df['Alone'] = 0
train_df.loc[train_df['Family_num']==0,'Alone'] = 1;
```
```fig,ax = plt.subplots(1,2,figsize=(18,6))
sns.factorplot('Family_num','Survived',data=train_df,ax=ax[0])
ax[0].set_title('Family_Size vs Survived')

sns.factorplot('Alone','Survived',data=train_df,ax=ax[1])
ax[1].set_title('Alone vs Survived')

plt.close(2)
plt.close(3)
```

It can be seen from the above figure that the possibility of surviving a lonely life is only 30%, which is quite different from those with relatives. At the same time, it can be seen that when the number of relatives > = 4, the possibility of surviving begins to decline

```sns.factorplot('Alone','Survived',data=train_df,hue='Sex',col='Pclass')
```

```# Ticket discretization
train_df['Fare_range'] = pd.qcut(train_df['Fare'],4)
train_df.groupby(['Fare_range'])['Survived'].mean().to_frame()
```
```train_df.loc[train_df['Fare']<=7.91,'Fare_cat'] = 0
train_df.loc[(train_df['Fare']>7.91)&(train_df['Fare']<=14.454),'Fare_cat']=1
train_df.loc[(train_df['Fare']>14.454)&(train_df['Fare']<=31),'Fare_cat']=2
train_df.loc[(train_df['Fare']>31)&(train_df['Fare']<=513),'Fare_cat']=3
```
```sns.lineplot('Fare_cat','Survived',hue='Sex',data=train_df)
plt.show()
```

As can be seen from the above figure, the survival probability increases with the increase of ticket price

```# Convert discrete features to numbers
train_df['Sex'].replace(['male','female'],[0,1],inplace=True)
train_df['Embarked'].replace(['S','C','Q'],[0,1,2],inplace=True)
train_df['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4],inplace=True)
```
```# Remove the characteristics of name, age, ticket number, ticket price, cabin number, etc
train_df.drop(['Name','Age','Ticket','Fare','Cabin','Fare_range','PassengerId'],axis=1,inplace=True)
train_df.info()
```
```sns.heatmap(train_df.corr(),annot=True,cmap='RdYlGn',linewidths=0.2)
fig = plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
```

```# Save the processed data temporarily
train_df.to_csv(r'C:\Users\dahuo\Desktop\Dataset 2\processed_train.csv',encoding='utf-8',index=False)
```

# Three training model

```import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

from sklearn.model_selection import GridSearchCV
```

Partition dataset

```data = pd.read_csv(r'C:\Users\dahuo\Desktop\Dataset 2\processed_train.csv')
train_df,test_df = train_test_split(data,test_size=0.3,random_state=0,stratify=data['Survived'])

train_X = train_df[train_df.columns[1:]]
train_y = train_df[train_df.columns[:1]]

test_X = test_df[test_df.columns[1:]]
test_y = test_df[test_df.columns[:1]]

X = data[data.columns[1:]]
y = data['Survived']
```

SVM (Gaussian kernel)

```model = svm.SVC(kernel='rbf',C=1,gamma=0.1)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
```

SVM (linear)

```model = svm.SVC(kernel='linear',C=0.1,gamma=0.1)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
```

** LR**

```model = LogisticRegression()
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
```

Decision tree

```model = DecisionTreeClassifier()
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
```

KNN

```model = KNeighborsClassifier(n_neighbors=9)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
```
```0.835820895522388
```
```predict_arr = pd.Series()
for i in range(1,11):
model = KNeighborsClassifier(n_neighbors=i)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
predict_arr = predict_arr.append(pd.Series(metrics.accuracy_score(prediction,test_y)))

# Drawing display
index = list(range(1,11))
plt.plot(index,predict_arr)
plt.xticks(list(range(0,11)))
print(predict_arr.values.max)
```

Naive Bayes

```model = GaussianNB()
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
```
```0.8134328358208955
```

Random forest

```model = RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
```

Cross validation

```kfold = KFold(n_splits=10,random_state=2019)
classifiers=['Linear Svm','Radial Svm','Logistic Regression','KNN','Decision Tree','Naive Bayes','Random Forest']
models = [svm.SVC(kernel='linear'),svm.SVC(kernel='rbf'),LogisticRegression(),KNeighborsClassifier(n_neighbors=9),DecisionTreeClassifier(),GaussianNB(),RandomForestClassifier(n_estimators=100)]

mean = []
std = []
auc = []
for model in models:
res = cross_val_score(model,X,y,cv=kfold,scoring='accuracy')
mean.append(res.mean())
std.append(res.std())
auc.append(res)

res_df = pd.DataFrame({'Mean':mean,'Std':std},index=classifiers)
```

Draw the box diagram corresponding to each algorithm

```plt.subplots(figsize=(12,6))
box = pd.DataFrame(auc,index=[classifiers])
box.T.boxplot()
```

Draw the confusion matrix and give the number of correct and incorrect classifications

```fig,ax = plt.subplots(3,3,figsize=(12,10))

y_pred = cross_val_predict(svm.SVC(kernel='rbf'),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[0,0],annot=True,fmt='2.0f')
ax[0,0].set_title('Matrix for rbf-SVM')

y_pred = cross_val_predict(svm.SVC(kernel='linear'),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[0,1],annot=True,fmt='2.0f')
ax[0,1].set_title('Matrix for Linear-SVM')

y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[0,2],annot=True,fmt='2.0f')
ax[0,2].set_title('Matrix for KNN')

y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[1,0],annot=True,fmt='2.0f')
ax[1,0].set_title('Matrix for Random-Forests')

y_pred = cross_val_predict(LogisticRegression(),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[1,1],annot=True,fmt='2.0f')
ax[1,1].set_title('Matrix for Logistic Regression')

y_pred = cross_val_predict(DecisionTreeClassifier(),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[1,2],annot=True,fmt='2.0f')
ax[1,2].set_title('Matrix for Decision Tree')

y_pred = cross_val_predict(GaussianNB(),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[2,0],annot=True,fmt='2.0f')
ax[2,0].set_title('Matrix for Naive Bayes')
plt.show()
```

Super parameter setting using grid search

```C = [0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
kernel = ['rbf','linear']

hyper = {'kernel':kernel,'C':C,'gamma':gamma}
gd = GridSearchCV(estimator=svm.SVC(),param_grid=hyper,verbose=True)
gd.fit(X,y)
print(gd.best_estimator_)
print(gd.best_score_)
C = 0.5
gamma = 0.1
n_estimators = range(100,1000,100)
hyper = {'n_estimators':n_estimators}

gd = GridSearchCV(estimator=RandomForestClassifier(random_state=0),param_grid=hyper,verbose=True)
gd.fit(X,y)
print(gd.best_score_)
print(gd.best_estimator_)
```
```                         Mean       Std
Linear Svm           0.790075  0.033587
Logistic Regression  0.809213  0.022927
KNN                  0.809176  0.026277
Decision Tree        0.817004  0.046226
Naive Bayes          0.803620  0.037507
Random Forest        0.815880  0.038200
Fitting 5 folds for each of 240 candidates, totalling 1200 fits
SVC(C=0.4, gamma=0.3)
0.8282593685267716
Fitting 5 folds for each of 9 candidates, totalling 45 fits
0.819327098110602
RandomForestClassifier(n_estimators=300, random_state=0)
```

# Four model fusion

```import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix

from sklearn.ensemble import VotingClassifier
import xgboost as xgb

train_df,test_df = train_test_split(data,test_size=0.3,random_state=0,stratify=data['Survived'])

train_X = train_df[train_df.columns[1:]]
train_y = train_df[train_df.columns[:1]]

test_X = test_df[test_df.columns[1:]]
test_y = test_df[test_df.columns[:1]]

X = data[data.columns[1:]]
y = data['Survived']

#Voting classifier
estimators = [
('linear_svm',svm.SVC(kernel='linear',probability=True)),
('rbf_svm',svm.SVC(kernel='rbf',C=0.5,gamma=0.1,probability=True)),
('lr',LogisticRegression(C=0.05)),
('knn',KNeighborsClassifier(n_neighbors=10)),
('rf',RandomForestClassifier(n_estimators=500,random_state=0)),
('dt',DecisionTreeClassifier(random_state=0)),
('nb',GaussianNB())
]
vc = VotingClassifier(estimators=estimators,voting='soft')
vc.fit(train_X,train_y)
print('The accuracy for ensembled model is:',vc.score(test_X,test_y))

# cross = cross_val_score(vc,X,y,cv=10,scoring='accuracy')
# mean=cross.mean()
# print(mean)

#xgboost
fig,ax = plt.subplots(2,2,figsize=(15,12))
model = xgb.XGBClassifier(n_estimators=900,learning_rate=0.1)
model.fit(X,y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
ax[0,0].set_title('Feature Importance in XgBoost')
plt.grid()

model = RandomForestClassifier(n_estimators=500,random_state=0)
model.fit(X,y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1])
ax[0,1].set_title('Feature Importance in Random Forest')
plt.grid()