[Kaggle Transcription] EDA To Prediction (DieTanic)

EDA To Prediction(DieTanic)

www.kaggle.com

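The Titanic dataset is so famous that, as the saying goes, some people may not know Kaggle, but nobody who uses Kaggle does not know Titanic.

This is not the tutorial analysis that Kaggle provides, but I want to transcribe a notebook that analyzes the data under the title DieTanic.

After colliding with an iceberg, the Titanic lost 1502 of the 2224 passengers and crew on board. That is reportedly why the notebook is named DieTanic.

This notebook is divided into three parts.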

Part1: Exploratory Data Analysis (EDA)

1) Analysis of the features.

2) Finding any relations or trends considering multiple features.

Part2: Feature Engineering and Data Cleaning (preprocessing)

1) Adding any few features.

2) Removing redundant features

3) Converting features into suitable form for modeling

Part3: Predictive Modeling

1) Running Basic Algorithm

2) Cross Validation.

3) Ensembling.

4) Important Features Extraction

Part1: Exploratory Data Analysis(EDA)

Library Import

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

plt.style.use('fivethirtyeight')

Changes the matplotlib style sheet.

Many other style sheets are available; the post below summarizes them well.

[Python] matplotlib stylesheet types and changing settings

hong-yp-ml-records.tistory.com

warnings.filterwarnings('ignore')

When a warning is raised, Jupyter Notebook prints a message in the output, and these warning messages are sometimes long enough to be distracting. Passing the 'ignore' option suppresses them.

Image source: https://rfriend.tistory.com/346

%matplotlib inline

This makes plots render directly in the browser that is running Jupyter Notebook.

Data Loading

data=pd.read_csv('../input/train.csv')

data.head()

The train data has the following columns.

PassengerId : passenger number
Survived : survival ( 0 = died, 1 = survived ) - Target
Pclass : ticket class ( 1 = 1st class, 2 = 2nd class, 3 = 3rd class )
Name : name
Sex : sex
Age : age
SibSp : number of siblings/spouses aboard
Parch : number of parents/children aboard
Ticket : ticket number
Fare : fare paid
Cabin : cabin number
Embarked : port of embarkation ( C = Cherbourg, Q = Queenstown, S = Southampton )

Checking Missing Values

data.isnull().sum()

isnull() flags whether each element is missing, and chaining sum() gives the number of missing values in each column.

Here Age has 177 missing values, Cabin has 687, and Embarked has 2.

How Many Survived?

f, ax = plt.subplots(1, 2, figsize=(18, 8))
data['Survived'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
sns.countplot(x='Survived', data=data, ax=ax[1])
ax[1].set_title('Survived')
plt.show()

Passing explode to pie puts space between the wedges of the pie chart.
Passing autopct to pie sets the format of the displayed values.

Visualizing survival with a pie plot and a countplot shows 38.4% survivors and 61.6% deaths, so more passengers died than survived.

Types Of Features

This notebook classifies the variables as follows.

Categorical Features : Sex (male, female), Embarked (Cherbourg, Queenstown, Southampton)

Ordinal Features : Pclass (1st class, 2nd class, 3rd class)

Continuous Features : Age

For more detail on variable types, see the post below.

[Data Science] Statistics - Types of variables (qualitative and quantitative variables)

foreverhappiness.tistory.com

Analysing The Features

Sex → Categorical Feature

Let's see how sex affects the Target.

data.groupby(['Sex','Survived'])['Survived'].count()

Counting survival by sex shows, from the raw numbers alone, that women survived at a noticeably higher rate.

f, ax = plt.subplots(1, 2, figsize=(18, 8))
data[['Sex', 'Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survived vs Sex')
sns.countplot(x='Sex', hue='Survived', data=data, ax=ax[1])
ax[1].set_title('Sex:Survived vs Dead')
plt.show()

For a categorical feature coded as 1 and 0, the mean can be read as a probability.
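To make the point about the mean concrete, here is a minimal sketch on made-up data (the `toy` frame is hypothetical and is not the Titanic file). The mean of a 0/1 column is simply the fraction of ones in each group.

```python
import pandas as pd

# Toy data (hypothetical, not the Titanic file): 1 = survived, 0 = died.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
print(rate_by_sex)
```

In this toy frame 2 of 3 women and 1 of 4 men survived, so the group means are those fractions.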

Visualizing survival by sex with a bar plot and a countplot shows a survival rate of about 75% for women and about 18-19% for men.

Sex is therefore closely related to the Target.

Pclass → Ordinal Feature

Next, let's see how Pclass affects the Target.

pd.crosstab(data.Pclass, data.Survived, margins=True).style.background_gradient(cmap='summer_r')

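Setting margins=True in crosstab adds the totals of each row and column.
For the available colormaps (cmap), see the matplotlib colormap reference.

The crosstab of Pclass against the Target shows that 1st class had the highest survival rate and that the death rate rises as the class gets lower.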

f, ax = plt.subplots(1, 2, figsize=(18, 8))
data['Pclass'].value_counts().plot.bar(color=['#CD7F32', '#FFDF00', '#D3D3D3'], ax=ax[0])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
sns.countplot(x='Pclass', hue='Survived', data=data, ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')
plt.show()

Plotting the number of passengers and the dead/survivor counts per Pclass confirms that the higher the class, the higher the survival rate.

Now let's combine Sex and Pclass and see how they relate to the Target.

pd.crosstab([data.Sex, data.Survived], data.Pclass, margins=True).style.background_gradient(cmap='summer_r')

sns.factorplot('Pclass','Survived',hue='Sex',data=data)
plt.show()

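The hue argument of factorplot usually takes a categorical variable, and one line is drawn per category.

These are the crosstab and factorplot views of how the two columns relate to the Target.

As the earlier results already suggested, women survived at a noticeably higher rate in every Pclass, and among 1st-class women only 3 of 94 died while 91 survived.

So Pclass is also an important variable for the Target.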

Age → Continuous Feature

Now let's see how age affects the Target.

print('Oldest Passenger was of:', data['Age'].max(), 'Years')
print('Youngest Passenger was of:', data['Age'].min(), 'Years')
print('Average Passenger was of:', data['Age'].mean(), 'Years')

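In the data set the oldest passenger is 80 years old, the youngest is 0.42 years old, and the mean is about 29.7 years.

It is a little puzzling that age is stored as a real number rather than an integer, but let's accept it for now.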

f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=data, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot(x="Sex", y="Age", hue='Survived', data=data, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 110, 10))
plt.show()

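With split=True, violinplot draws the two hue categories as the two halves of a single violin. Without this option each hue value is drawn as its own violin, which makes the comparison harder to read.

The violin plots pair Age with Pclass and Age with Sex to show the relationship to the Target, and lead to the following conclusions.

Children aged 10 or under have a high survival rate.
1st-class passengers aged 20-50 have a high survival rate, and it is higher for women than for men.
The survival rate of men decreases with age.

The missing-value check above showed that Age has 177 missing values.

There are several ways to handle missing values in a continuous variable, and the mean or the mode is commonly used.

As seen above the mean is 29.7 years, but filling every missing value with 29.7 is not acceptable because it may be far from the actual age.

The approach used here is to infer a rough age group from the title in the name, such as Mr. or Mrs., which the notebook calls the Initial.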

# Extract the title (the letters right before a period) from each name
data['Initial'] = data.Name.str.extract(r'([A-Za-z]+)\.', expand=False)

A regular expression pulls the Initial out of the name.
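As a small illustration of what the pattern matches, the sketch below applies the same regular expression to a few names written in the dataset's "Surname, Title. Given names" format. The `names` and `titles` variables are introduced here only for the example.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# ([A-Za-z]+)\. captures the run of letters immediately before a period.
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

The surname is skipped because it is followed by a comma rather than a period, so the first match in each name is the title.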

pd.crosstab(data.Initial, data.Sex).style.background_gradient(cmap='summer_r')

Viewing Initial in a crosstab gives the following result.

Miss, Mr, and Mrs are the most common and the rest have very few rows, so they need to be handled.

data['Initial'] = data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],
                                          ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])

replace folds the rare Initial values into the common ones.

data.groupby('Initial')['Age'].mean()

This gives the mean age of each cleaned-up Initial.

data.loc[(data.Age.isnull())&(data.Initial=='Mr'), 'Age'] = 33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'), 'Age'] = 36
data.loc[(data.Age.isnull())&(data.Initial=='Master'), 'Age'] = 5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'), 'Age'] = 22
data.loc[(data.Age.isnull())&(data.Initial=='Other'), 'Age'] = 46

Replacing the missing values with the approximate mean of each Initial solves the missing Age problem.
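The notebook hard-codes the rounded group means. As a variant, and not the notebook's own method, the same idea can be sketched with groupby and transform so the means are computed rather than typed in. The `toy` frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: some ages are missing within each Initial group.
toy = pd.DataFrame({
    'Initial': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Master', 'Master'],
    'Age': [30.0, 36.0, np.nan, 20.0, np.nan, 4.0, np.nan],
})

# Mean age within each Initial group, broadcast back to every row.
group_mean = toy.groupby('Initial')['Age'].transform('mean')

# Fill only the missing ages; observed ages are left untouched.
toy['Age'] = toy['Age'].fillna(group_mean)
print(toy)
```

Unlike the hard-coded version, this fills with the exact, unrounded group mean, so results on the real data would differ slightly from the notebook's.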

data.Age.isnull().any()

A final check confirms whether any missing values remain.

f, ax=plt.subplots(1, 2, figsize = (20, 10))
data[data['Survived']==0].Age.plot.hist(ax=ax[0], bins=20, edgecolor='black', color='red')
ax[0].set_title('Survived= 0')
x1=list(range(0, 85, 5))
ax[0].set_xticks(x1)
data[data['Survived']==1].Age.plot.hist(ax=ax[1], bins=20, edgecolor='black', color='green')
ax[1].set_title('Survived= 1')
x2=list(range(0, 85, 5))
ax[1].set_xticks(x2)
plt.show()

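The bins option of hist sets the number of bars in the histogram.

The histograms of Age (with missing values filled) against the Target Survived can be read as follows.

Most toddlers under 5 were rescued.
The oldest passenger (80 years old) was rescued.
The largest number of deaths was in the 30-40 age group.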

sns.factorplot('Pclass', 'Survived', col='Initial', data=data)
plt.show()

This is the factorplot of survival rate by Pclass for each Initial.

Here too, children and women clearly have high survival rates.

Embarked → Categorical Value

pd.crosstab([data.Embarked, data.Pclass], [data.Sex, data.Survived], margins=True).style.background_gradient(cmap='summer_r')

This time the crosstab includes Embarked as well.

Port S had the most passengers boarding and port Q the fewest.

sns.factorplot('Embarked', 'Survived', data=data)
fig = plt.gcf()
fig.set_size_inches(5, 3)
plt.show()

Plotting the survival rate per Embarked value with factorplot shows that passengers from port C had the highest survival rate at about 55%, and passengers from port S the lowest at about 33%.

f,ax=plt.subplots(2, 2, figsize=(20, 15))
sns.countplot(x='Embarked', data=data, ax=ax[0, 0])
ax[0, 0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked', hue='Sex', data=data, ax=ax[0, 1])
ax[0, 1].set_title('Male-Female Split for Embarked')
sns.countplot(x='Embarked', hue='Survived', data=data, ax=ax[1, 0])
ax[1, 0].set_title('Embarked vs Survived')
sns.countplot(x='Embarked', hue='Pclass', data=data, ax=ax[1, 1])
ax[1, 1].set_title('Embarked vs Pclass')
plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

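The countplots for each port lead to the following interpretations.

Port S has the most passengers, and most of them are in 3rd class.
Passengers from port C have a high survival rate, possibly because many of them were in 1st class.
Port S also has the largest number of 1st- and 2nd-class passengers, yet its survival rate is low because it has so many 3rd-class passengers.
About 95% of the passengers from port Q are in 3rd class.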

sns.factorplot('Pclass', 'Survived', hue='Sex', col='Embarked', data=data)
plt.show()

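The factorplot per port can be interpreted as follows.

Women in 1st and 2nd class show a high survival rate of over 90% regardless of port.
3rd-class passengers from port S show a low survival rate regardless of sex.
Passengers from port Q were mostly in 3rd class, but the difference in survival rate by sex is still large.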

data['Embarked'] = data['Embarked'].fillna('S')

Embarked had only 2 missing values, so they were filled with the mode, 'S'.

data.Embarked.isnull().any()

Check that the missing values were handled.

SibSp→Discrete Value

Now let's see how the number of siblings/spouses aboard affects the survival rate.

pd.crosstab([data.SibSp], data.Survived).style.background_gradient(cmap='summer_r')

Let's start with a crosstab.

Passengers with no sibling/spouse aboard or with exactly one were the most common, and everyone with 5 or 8 died.

sns.barplot(x='SibSp', y='Survived', data=data).set_title('SibSp vs Survived')

f = sns.factorplot('SibSp', 'Survived', data=data)
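f.fig.suptitle('SibSp vs Survived')

The barplot and factorplot of survival rate by SibSp show that a passenger travelling with no sibling or spouse had a 34.5% survival rate, and that the survival rate tends to fall as the number of siblings/spouses grows.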

pd.crosstab(data.SibSp, data.Pclass).style.background_gradient(cmap='summer_r')

The 0% survival rate for SibSp of 5 and 8 may well be due to Pclass.

Parch

Parch is the number of parents/children aboard.

pd.crosstab(data.Parch, data.Pclass).style.background_gradient(cmap='summer_r')

The crosstab of Parch against Pclass shows that passengers with Parch of 5 or 6 were all in 3rd class.

sns.barplot(x='Parch', y='Survived', data=data).set_title('Parch vs Survived')

f = sns.factorplot('Parch', 'Survived', data=data)
f.fig.suptitle('Parch vs Survived')

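The barplot and factorplot of survival rate by Parch show a survival rate close to 0% when 4 or more parents/children were aboard, and a higher survival rate for passengers travelling with parents/children than for those travelling alone.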

Fare→Continuous Feature

print('Highest Fare was:', data['Fare'].max())
print('Lowest Fare was:', data['Fare'].min())
print('Average Fare was:', data['Fare'].mean())

This prints the maximum, minimum, and mean fare paid by the passengers.

So some passengers even travelled for free.

f, ax = plt.subplots(1, 3, figsize=(20, 8))
sns.distplot(data[data['Pclass']==1].Fare, ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.distplot(data[data['Pclass']==2].Fare, ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.distplot(data[data['Pclass']==3].Fare, ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
plt.show()

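Plotting the fare per Pclass with distplot shows, unsurprisingly, that 1st class has the highest fares, and the 1st-class fares also appear to be widely spread.

Summary

Sex : Women have a higher survival rate than men.

Pclass : 1st-class passengers have a higher survival rate than 2nd-class passengers, and 2nd-class passengers have a higher survival rate than 3rd-class passengers.

Age : Children under 10 have a high survival rate, while passengers aged 15-35 have a low one.

Embarked : More 1st-class passengers boarded at port S than at port C, yet passengers who boarded at port C had a much higher survival rate. Most of the passengers who boarded at port Q were in 3rd class.

SibSp, Parch : Passengers travelling with 1-2 siblings/spouses or 1-3 parents/children have a higher survival rate than those travelling alone, and the survival rate of large families is close to 0%.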

Correlation Between the Features

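# Correlate the numeric columns only; string columns such as Name cannot be correlated
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, cmap='RdYlGn', linewidths=0.2)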

fig = plt.gcf()
fig.set_size_inches(10, 8)
plt.show()

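Setting annot=True in heatmap displays the value in each cell.
The linewidths argument of heatmap sets the spacing between cells.
plt.gcf returns the current figure object.

DataFrame.corr() runs the correlation analysis, and by default it computes the Pearson correlation coefficient.

Pearson correlation measures the linear relationship between two variables: a value close to +1 means a positive linear relationship and a value close to -1 means a negative linear relationship.

Here, if the value of Feature B increases as the value of Feature A increases, the two are in a positive linear relationship; if it decreases, they are in a negative linear relationship.

So the closer the absolute value is to 1, the more meaningful the relationship. The variable with the largest absolute correlation with the Target Survived is Pclass at -0.34, but a value of this size is hard to call a strong relationship.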
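To show what the Pearson coefficient measures, here is a minimal sketch that computes it by hand and compares it with DataFrame.corr(). The `pearson` helper and the `toy` columns A, B, C are hypothetical names used only for this example.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson r: co-variation of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

toy = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],   # rises as A rises
    'C': [10, 8, 6, 4, 2],   # falls as A rises
})

r_ab = pearson(toy['A'], toy['B'])   # perfectly positive linear relationship
r_ac = pearson(toy['A'], toy['C'])   # perfectly negative linear relationship
print(r_ab, r_ac)
print(toy.corr())                    # pandas uses Pearson by default
```

Real features rarely line up this cleanly, which is why a value such as -0.34 indicates only a moderate linear trend.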

So the next step is additional preprocessing to extract more meaningful variables.

Part2: Feature Engineering and Data Cleaning

What is Feature Engineering?

We do not need to consider every feature in a given data set. Some features are removed because they are unimportant, and conversely new features can be created.

Age_band

DieTanic creates a new feature called Age_band. It turns the continuous feature Age into a categorical feature by binning.

data['Age_band'] = 0
data.loc[data['Age'] <= 16, 'Age_band'] = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age_band'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age_band'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age_band'] = 3
data.loc[data['Age'] > 64, 'Age_band'] = 4

data.head(2)

The oldest passenger is 80 years old, so the range splits into 5 bins of 16 years each (80 / 5 = 16).
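The five loc assignments can also be expressed with pd.cut. This is an alternative sketch, not the notebook's code, and the `ages` values are made up to sit on the bin edges.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0.42, 16, 17, 32, 33, 48, 64, 65, 80])

# Right-closed bins (0, 16], (16, 32], (32, 48], (48, 64], (64, 80],
# labelled 0..4 to mirror the Age_band codes.
bands = pd.cut(ages, bins=[0, 16, 32, 48, 64, 80], labels=[0, 1, 2, 3, 4])
print(bands.tolist())
```

One difference to keep in mind: pd.cut returns NaN for values outside the outer edges (0 and 80 here), whereas the loc version assigns everything at or below 16 to band 0 and everything above 64 to band 4.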

data['Age_band'].value_counts().to_frame().style.background_gradient(cmap='summer')

Band 1 (older than 16, up to 32) has the most passengers, and band 4 (older than 64) has the fewest.

sns.factorplot('Age_band', 'Survived', data=data, col='Pclass')
plt.show()

Looking at the survival rate by Age_band and Pclass shows that, regardless of Pclass, the survival rate tends to fall as Age_band rises.

Family_Size and Alone

data['Family_Size']=0
data['Family_Size']=data['Parch']+data['SibSp']
data['Alone']=0
data.loc[data.Family_Size==0,'Alone']=1

sns.factorplot('Family_Size','Survived',data=data)
plt.show()

sns.factorplot('Alone','Survived',data=data)
plt.show()

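Adding Parch and SibSp shows whether a passenger travelled alone or with family.

A Family_Size of 0 means the passenger travelled alone, and the survival rate in that case is about 0.3.

Family_Size of 3 shows the highest survival rate, and it is also worth noting that large families with Family_Size of 7 or more have a survival rate of 0.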

sns.factorplot('Alone', 'Survived', data=data, hue='Sex', col='Pclass')
plt.show()

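It stands out that women in Pclass 1 and 2 have a survival rate close to 1 whether or not they are alone, and it is also interesting that in Pclass 3 women who are alone have a higher survival rate.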

Fare_Range

data['Fare_Range'] = pd.qcut(data['Fare'], 4)
data.groupby(['Fare_Range'])['Survived'].mean().to_frame().style.background_gradient(cmap='summer_r')

Age was binned into ranges of equal width, but Fare is binned with qcut.

Splitting with qcut makes each bin hold approximately the same number of rows.
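The difference between equal-width and equal-count binning is easiest to see on skewed data. The sketch below uses made-up fare-like values (the `fares` Series is hypothetical) and compares pd.cut with pd.qcut.

```python
import pandas as pd

# Hypothetical skewed values: many small fares and a few large ones.
fares = pd.Series([5, 6, 7, 8, 9, 10, 12, 15, 20, 30, 100, 500])

equal_width = pd.cut(fares, 4)    # 4 bins of equal width
equal_count = pd.qcut(fares, 4)   # 4 bins split at the quartiles

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With equal-width bins almost every value lands in the first bin, while qcut places three values in each of the four bins. On real data with many repeated values the qcut counts are only approximately equal.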

Consistent with the EDA so far, and unsurprisingly, the survival rate rises as the fare gets more expensive.

data['Fare_cat']=0
data.loc[data['Fare']<=7.91,'Fare_cat']=0
data.loc[(data['Fare']>7.91)&(data['Fare']<=14.454),'Fare_cat']=1
data.loc[(data['Fare']>14.454)&(data['Fare']<=31),'Fare_cat']=2
data.loc[(data['Fare']>31)&(data['Fare']<=513),'Fare_cat']=3

sns.factorplot('Fare_cat','Survived',data=data,hue='Sex')
plt.show()

After labelling each Fare range, a factorplot with sex as the hue shows that the survival rate rises with Fare_cat for both men and women.

Converting String Values into Numeric

data['Sex'] = data['Sex'].replace(['male', 'female'], [0, 1])
data['Embarked'] = data['Embarked'].replace(['S', 'C', 'Q'], [0, 1, 2])
data['Initial'] = data['Initial'].replace(['Mr', 'Mrs', 'Miss', 'Master', 'Other'], [0, 1, 2, 3, 4])

Strings cannot be fed to the learning algorithms as they are, so they are label encoded.

Dropping UnNeeded Features

data.drop(['Name', 'Age', 'Ticket', 'Fare', 'Cabin', 'Fare_Range', 'PassengerId'], axis=1, inplace=True)

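The unneeded features were dropped.

Name -> It cannot be converted into a categorical feature, and it has already served its purpose: the Initial was extracted from it to fill the missing Age values.

Age -> It can be replaced by the categorical feature Age_band.

Ticket -> Ticket numbers are close to arbitrary strings that cannot be grouped into categories, so analyzing them is not meaningful.

Fare -> It can be replaced by Fare_cat.

Fare_Range -> It can be replaced by Fare_cat.

Cabin -> It has too many missing values and there is no way to fill them.

PassengerId -> It is unique to each passenger, so analyzing it is not meaningful.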

sns.heatmap(data.corr(), annot=True, cmap='RdYlGn', linewidths=0.2, annot_kws={'size': 20})
fig=plt.gcf()
fig.set_size_inches(18, 15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

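annot_kws passes additional parameters to the text drawn in each cell.

Running the Pearson correlation again after preprocessing, the features with the largest absolute correlation with the Target are Sex, Initial, and Fare_cat, but these values alone are still not enough to say they are strongly related to the Target. Moreover, the best-correlated features are label encoded, and for a feature such as Initial the integer codes given to the categories are arbitrary, so this is hard to call a good result.

What is worth noting is that SibSp and Parch have a strong positive linear relationship with Family_Size, and Alone has a strong negative linear relationship with Family_Size.

That is to be expected, since Family_Size is the sum of SibSp and Parch and Alone is derived from Family_Size.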

Part3: Predictive Modeling

#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix

To predict survival on the test data set, the following 6 basic models are applied.

Logistic Regression
SVM (Support Vector Machine - Linear and radial)
Random Forest
KNN (K-Nearest Neighbours)
Naive Bayes
Decision Tree

train, test = train_test_split(data, test_size=0.3, random_state=0, stratify=data['Survived'])
train_X = train[train.columns[1:]]
train_Y = train[train.columns[:1]]
test_X = test[test.columns[1:]]
test_Y = test[test.columns[:1]]
X = data[data.columns[1:]]
Y = data['Survived']

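Passing stratify to train_test_split splits the data while preserving the class proportions of the given column.
As the very first part of the EDA showed, the ratio of deaths to survivors in the Titanic data set is roughly 6 to 4. This ratio is kept when the data is split into train and test.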
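A quick way to see the effect of stratify is to split made-up labels with a 60/40 class mix and compare the class share on each side. The arrays below are hypothetical stand-ins for the Titanic features and the Survived column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 60/40 mix, similar in spirit to Survived.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Share of class 1 in the train part and in the test part.
print(y_tr.mean(), y_te.mean())
```

Both parts keep a class-1 share of about 40%. Without stratify the two shares would generally drift apart by chance.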

Radial Support Vector Machines(rbf-SVM)

model = svm.SVC(kernel='rbf', C=1, gamma=0.1)
model.fit(train_X, train_Y)
prediction1=model.predict(test_X)
print('Accuracy for rbf SVM is', metrics.accuracy_score(prediction1, test_Y))

Linear Support Vector Machine(Linear-SVM)

model = svm.SVC(kernel='linear', C=0.1, gamma=0.1)
model.fit(train_X, train_Y)
prediction2=model.predict(test_X)
print('Accuracy for linear SVM is', metrics.accuracy_score(prediction2, test_Y))

Logistic Regression

model = LogisticRegression()
model.fit(train_X, train_Y)
prediction3=model.predict(test_X)
print('Accuracy for Logistic Regression is', metrics.accuracy_score(prediction3, test_Y))

Decision Tree

model=DecisionTreeClassifier()
model.fit(train_X, train_Y)
prediction4=model.predict(test_X)
print('Accuracy for Decision Tree is', metrics.accuracy_score(prediction4, test_Y))

K-Nearest Neighbours(KNN)

model = KNeighborsClassifier()
model.fit(train_X, train_Y)
prediction5=model.predict(test_X)
print('Accuracy for KNN is', metrics.accuracy_score(prediction5, test_Y))

Judging by the results of the basic models, rbf-SVM scored the highest.

a_index = list(range(1, 11))
scores = []  # accuracy for each n_neighbors value
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for i in a_index:
    model=KNeighborsClassifier(n_neighbors=i)
    model.fit(train_X, train_Y)
    prediction = model.predict(test_X)
    scores.append(metrics.accuracy_score(prediction, test_Y))

a = pd.Series(scores, index=a_index)

plt.plot(a_index, a)
plt.xticks(x)
fig=plt.gcf()
fig.set_size_inches(12, 6)
plt.show()
print('Accuracy for different values of n are:', a.values, 'with the max value as', a.values.max())

n_neighbors is a hyperparameter of KNN that can be tuned. Here the accuracy was highest when n_neighbors was 9.

Gaussian Naive Bayes

model = GaussianNB()
model.fit(train_X, train_Y)
prediction6 = model.predict(test_X)
print('Accuracy of the NaiveBayes is', metrics.accuracy_score(prediction6, test_Y))

Random Forest

model = RandomForestClassifier(n_estimators=100)
model.fit(train_X, train_Y)
prediction7 = model.predict(test_X)
print('Accuracy of the Random Forest is', metrics.accuracy_score(prediction7, test_Y))

Cross Validation

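The accuracies so far were all measured on a single test split. Cross validation is needed to check whether those accuracies hold up.

Cross validation proceeds as follows.

1) For K-Fold cross validation, split the data set into K subsets.

2) With 5 subsets, for example, 1 is used for testing and the remaining 4 for training.

3) Repeat the same procedure while changing which subset is the test set, then compute the mean accuracy and mean error.

These are the steps of K-Fold cross validation.

4) A model may underfit on one split and overfit on another. Averaging over the folds through cross validation therefore gives a more general estimate of the model.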
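The steps above can be sketched as an explicit loop. This is roughly what cross_val_score automates; the data comes from make_classification as a stand-in for the Titanic features, and `X_demo`, `y_demo`, and `fold_scores` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data standing in for the prepared Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=22)  # step 1: 5 subsets
fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    model = LogisticRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])        # step 2: train on 4 folds
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))  # test on the 5th

fold_scores = np.array(fold_scores)                        # step 3: summarise the folds
print('per-fold accuracy:', fold_scores)
print('mean:', fold_scores.mean(), 'std:', fold_scores.std())
```

Each row is used for testing exactly once, so the mean reflects performance over the whole data set rather than one particular split.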

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

kfold = KFold(n_splits=10, shuffle=True, random_state=22)
xyz=[]
accuracy = []
std = []

classifiers = ['Linear Svm', 'Radial Svm', 'Logistic Regression', 'KNN', 'Decision Tree', 'Naive Bayes', 'Random Forest']
models=[svm.SVC(kernel='linear'), svm.SVC(kernel='rbf'), LogisticRegression(), KNeighborsClassifier(n_neighbors=9), DecisionTreeClassifier(), GaussianNB(), RandomForestClassifier(n_estimators=100)]

for i in models:
    model = i
    cv_result = cross_val_score(model, X, Y, cv = kfold, scoring='accuracy')
    xyz.append(cv_result.mean())
    std.append(cv_result.std())
    accuracy.append(cv_result)

new_models_dataframe = pd.DataFrame({'CV Mean': xyz, 'Std': std}, index=classifiers)
new_models_dataframe

Radial Svm has the highest CV mean, and KNN has the lowest standard deviation.

plt.subplots(figsize=(12, 6))
box = pd.DataFrame(accuracy, index=[classifiers])
box.T.boxplot()

Drawing the CV accuracies as a boxplot shows the center and spread of each model at a glance.

new_models_dataframe['CV Mean'].plot.barh(width=0.8)
plt.title('Average CV Mean Accuracy')
fig=plt.gcf()
fig.set_size_inches(8,5)
plt.show()

Accuracy can sometimes be misleading when the classes are imbalanced. A confusion matrix makes it possible to check, class by class, how well the model predicted.

Confusion Matrix

f, ax = plt.subplots(3, 3, figsize=(12, 10))

y_pred = cross_val_predict(svm.SVC(kernel='rbf'), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[0, 0], annot=True, fmt='2.0f')
ax[0, 0].set_title('Matrix for rbf-SVM')

y_pred = cross_val_predict(svm.SVC(kernel='linear'), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[0, 1], annot=True, fmt='2.0f')
ax[0, 1].set_title('Matrix for Linear-SVM')

y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[0, 2], annot=True, fmt='2.0f')
ax[0, 2].set_title('Matrix for KNN')

y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[1, 0], annot=True, fmt='2.0f')
ax[1, 0].set_title('Matrix for Random Forest')

y_pred = cross_val_predict(LogisticRegression(), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[1, 1], annot=True, fmt='2.0f')
ax[1, 1].set_title('Matrix for Logistic Regression')

y_pred = cross_val_predict(DecisionTreeClassifier(), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[1, 2], annot=True, fmt='2.0f')
ax[1, 2].set_title('Matrix for Decision Tree')

y_pred = cross_val_predict(GaussianNB(), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[2, 0], annot=True, fmt='2.0f')
ax[2, 0].set_title('Matrix for Naive Bayes')

plt.subplots_adjust(hspace=0.2, wspace=0.2)
plt.show()

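The confusion matrices, computed from cross-validated predictions with cv set to 10, show more clearly how well each model predicted.

Rows are the actual values and columns are the predicted values. In the first matrix, 491 deaths and 247 survivals were predicted correctly, 738 in total, while 58 passengers who died were predicted as survivors and 95 survivors were wrongly predicted as dead, 153 in total.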
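The four counts quoted above are enough to recompute the accuracy and the per-class hit rates. The short sketch below only does that arithmetic; the variable names are introduced for the example.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes (0 = died, 1 = survived).
cm = np.array([[491, 58],
               [95, 247]])

accuracy = np.trace(cm) / cm.sum()          # (491 + 247) / 891
recall_died = cm[0, 0] / cm[0].sum()        # 491 of the 549 actual deaths
recall_survived = cm[1, 1] / cm[1].sum()    # 247 of the 342 actual survivors

print(accuracy, recall_died, recall_survived)
```

The overall accuracy hides the fact that this model recognises deaths noticeably better than survivals, which is exactly what the per-class view is for.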

rbf-SVM is the best at predicting passengers who died, and NaiveBayes is the best at predicting passengers who survived.

Hyper-Parameters Tuning

The SVM model above was given the parameters C and gamma. Adjusting such values can produce a better classifier, and this is called hyperparameter tuning.

SVM

from sklearn.model_selection import GridSearchCV

C = [0.05, 0.1, 0.2, 0.3, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
gamma = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
kernel = ['rbf', 'linear']

hyper = {'kernel': kernel, 'C': C, 'gamma': gamma}

gd = GridSearchCV(estimator=svm.SVC(), param_grid=hyper, verbose=True)
gd.fit(X, Y)
print(gd.best_score_)
print(gd.best_estimator_)

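GridSearchCV performs hyperparameter tuning: pass the model to tune as estimator and the hyperparameters to search as param_grid.
The verbose option prints the tuning log. (default=0, no output)

Tuning the SVM gives a best C of 0.4 and a best gamma of 0.3. The printed best estimator does not show a kernel. Recent scikit-learn versions print only the parameters that differ from their defaults, and rbf is the default kernel of SVC, so the best kernel here is rbf; gd.best_params_ lists all three values explicitly.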
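One way to read the chosen kernel directly is the best_params_ attribute of a fitted GridSearchCV. The sketch below runs a much smaller grid on synthetic data (make_classification stands in for the Titanic features, and `X_demo`, `y_demo`, `gd_demo` are hypothetical names), so the selected values are not the ones reported above.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

hyper = {'kernel': ['rbf', 'linear'], 'C': [0.1, 1], 'gamma': [0.1, 1.0]}
gd_demo = GridSearchCV(estimator=svm.SVC(), param_grid=hyper)
gd_demo.fit(X_demo, y_demo)

# best_params_ is a plain dict with every searched parameter,
# including values that happen to equal the estimator's defaults.
print(gd_demo.best_params_)
print(gd_demo.best_estimator_.kernel)
```

Printing best_params_ avoids guessing from the estimator's printed form, which may leave out default values.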

Random Forest

n_estimators = range(100, 1000, 100)
hyper={'n_estimators': n_estimators}

gd = GridSearchCV(estimator=RandomForestClassifier(), param_grid=hyper, verbose=True)
gd.fit(X, Y)

print(gd.best_score_)
print(gd.best_estimator_)

For Random Forest, the best_score was highest at 81.48% with n_estimators of 500.

So across SVM and Random Forest, the rbf-SVM model with C of 0.4 and gamma of 0.3 is the best-performing model.

Ensembling

Ensembling is a very effective way to improve performance: it combines several models into a single, stronger model.

There are the following 3 types of ensembling.

1) Voting Classifier

2) Bagging

3) Boosting

Voting Classifier

from sklearn.ensemble import VotingClassifier

ensemble_lin_rbf = VotingClassifier(estimators=[('KNN', KNeighborsClassifier(n_neighbors=10)),
                                               ('RBF', svm.SVC(probability=True, kernel='rbf', C=0.4, gamma=0.3)),
                                               ('RFor', RandomForestClassifier(n_estimators=500, random_state=0)),
                                               ('LR', LogisticRegression(C=0.05)),
                                               ('DT', DecisionTreeClassifier(random_state=0)),
                                               ('NB', GaussianNB()),
                                               ('svm', svm.SVC(kernel='linear', probability=True))
                                               ],
                                    voting='soft').fit(train_X, train_Y)

print('The accuracy for ensembled model is:', ensemble_lin_rbf.score(test_X, test_Y))
cross=cross_val_score(ensemble_lin_rbf, X, Y, cv=10, scoring='accuracy')
print('The cross validated score is', cross.mean())

The first method is the Voting Classifier, and here a soft Voting Classifier is used.

A soft Voting Classifier averages the class probabilities predicted by each estimator and returns the class with the highest average probability.
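To illustrate how soft voting can differ from a simple majority vote, the sketch below combines made-up probabilities from three classifiers for a single passenger. The numbers are hypothetical and no real model is fitted.

```python
import numpy as np

# Hypothetical predicted probabilities [P(died), P(survived)] for one passenger
# from three already-fitted classifiers.
probas = np.array([
    [0.45, 0.55],   # model 1 leans "survived"
    [0.40, 0.60],   # model 2 leans "survived"
    [0.90, 0.10],   # model 3 is confident in "died"
])

# Hard voting: each model casts one vote for its most likely class.
hard_votes = probas.argmax(axis=1)
hard_pred = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the probabilities, then take the most likely class.
mean_proba = probas.mean(axis=0)
soft_pred = mean_proba.argmax()

print(hard_pred, soft_pred, mean_proba)
```

Two of the three models vote "survived", so hard voting picks class 1, but the third model's confidence pulls the averaged probability toward "died", so soft voting picks class 0. This is also why the SVC estimators in the ensemble are created with probability=True.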

Bagging

Bagging is a general ensemble method that works well for high-variance models and also helps prevent overfitting.

Unlike the Voting Classifier, it uses many instances of the same model.

Bagged KNN

Bagging is applied here with KNN.

from sklearn.ensemble import BaggingClassifier

model=BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=3), random_state=0, n_estimators=700)
model.fit(train_X, train_Y)
prediction=model.predict(test_X)

print('The accuracy for bagged KNN is:', metrics.accuracy_score(prediction, test_Y))
result=cross_val_score(model, X, Y, cv=10, scoring='accuracy')
print('The cross validated score for bagged KNN is: ',result.mean())

Bagged DecisionTree

Bagging is applied here with the DecisionTree model.

model=BaggingClassifier(base_estimator=DecisionTreeClassifier(), random_state=0, n_estimators=100)
model.fit(train_X, train_Y)
prediction=model.predict(test_X)

print('The accuracy for bagged Decision Tree is:', metrics.accuracy_score(prediction, test_Y))
result=cross_val_score(model, X, Y, cv=10, scoring='accuracy')
print('The cross validated score for bagged Decision Tree is:', result.mean())

Boosting

Boosting is a sequential learning method: it combines weak models step by step, with each step improving on the errors of the previous one.

AdaBoost(Adaptive Boosting)

This is the result of applying Adaptive Boosting.

from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=200, random_state=0, learning_rate=0.1)
result = cross_val_score(ada, X, Y, cv=10, scoring='accuracy')
print('The cross validated score for AdaBoost is:', result.mean())

Stochastic Gradient Boosting

This is the result of applying Stochastic Gradient Boosting.

from sklearn.ensemble import GradientBoostingClassifier
grad=GradientBoostingClassifier(n_estimators=500, random_state=0, learning_rate=0.1)
result=cross_val_score(grad, X, Y, cv=10, scoring='accuracy')
print('The cross validated score for Gradient Boosting is:', result.mean())

XGBoost

This is the result of applying XGBoost.

import xgboost as xg
xgboost=xg.XGBClassifier(n_estimators=900, learning_rate=0.1)
result=cross_val_score(xgboost, X, Y, cv=10, scoring='accuracy')
print('The cross validated score for XGBoost is:', result.mean())

Hyper-Parameter Tuning

n_estimators=list(range(100, 1100, 100))
learning_rate=[0.05, 0.1, 0.2, 0.3, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
hyper={'n_estimators': n_estimators, 'learning_rate': learning_rate}

gd=GridSearchCV(estimator=AdaBoostClassifier(), param_grid=hyper, verbose=True)
gd.fit(X, Y)

print(gd.best_score_)
print(gd.best_estimator_)

AdaBoost scored the highest among the boosting methods, so it was tuned; the result was 82.93% with learning_rate=1 and n_estimators=100.

Confusion Matrix for the Best Model

ada=AdaBoostClassifier(n_estimators=100, random_state=0, learning_rate=0.1)
result=cross_val_predict(ada, X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, result), cmap='winter', annot=True, fmt='2.0f')
plt.show()

This is the confusion matrix for the best model. It is a very slight improvement over rbf-SVM.

Feature Importance

f, ax = plt.subplots(2, 2, figsize=(15, 12))
model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X, Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8, ax=ax[0,0])
ax[0,0].set_title('Feature Importance in Random Forests')

model = AdaBoostClassifier(n_estimators=200, learning_rate=0.05, random_state=0)
model.fit(X, Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8, ax=ax[0,1], color='#ddff11')
ax[0,1].set_title('Feature Importance in AdaBoost')

model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1, random_state=0)
model.fit(X, Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8, ax=ax[1,0], cmap='RdYlGn_r')
ax[1,0].set_title('Feature Importance in Gradient Boosting')

model=xg.XGBClassifier(n_estimators=900, learning_rate=0.1)
model.fit(X, Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8, ax=ax[1,1], color='#FD0F00')
ax[1,1].set_title('Feature Importance in XgBoost')

plt.show()

Checking the feature importance per model gives the following results.

Features such as Initial, Fare_cat, Pclass, and Family_Size are important, and Initial has the highest importance in every model.
Sex shows high importance in the Random Forest model.
