博文

Kaggle之路：Titanic

已有 3174 次阅读 2018-12-28 08:43 |系统分类:科研笔记

Overview

Titanic可谓是Kaggler的必经之路。我们以其为例，走一个完整的机器学习分析流程。

Step 1: 问题分析

关于Titanic的相关描述可参考官网，这是一个二分类的基本问题。

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Step 2: 数据整理

首先加载必要的库。数据这块儿用pandas，模型用scikit-learn和xgboost，计算的库numpy和scipy，作图用matplotlib和seaborn等。

#-*- coding: UTF-8 -*- #!/usr/bin/env python# system parametersimport sysprint("Python version: {}". format(sys.version))# functions for data processing and analysisimport pandas as pdprint("pandas version: {}". format(pd.__version__))# machine learning algorithmsimport sklearnfrom sklearn.preprocessing import OneHotEncoder, LabelEncoderfrom sklearn.preprocessing import StandardScaler, RobustScalerfrom sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process, neural_networkfrom sklearn import feature_selection, model_selection, metricsimport xgboostfrom xgboost import XGBClassifierprint("scikit-learn version: {}". format(sklearn.__version__))print("xgboost version: {}". format(xgboost.__version__))# scientific computingimport numpy as npimport scipy as spprint("NumPy version: {}". format(np.__version__))print("SciPy version: {}". format(sp.__version__)) # data visualizationfrom pandas.tools.plotting import scatter_matriximport matplotlib as mplimport matplotlib.pyplot as pltimport matplotlib.pylab as pylabimport seaborn as snsimport IPythonfrom IPython import displayprint("matplotlib version: {}". format(mpl.__version__))print("seaborn version: {}". format(sns.__version__))print("IPython version: {}". format(IPython.__version__))# Configure Visualization Defaults# %matplotlib inline = show plots in Jupyter Notebook browser%matplotlib inlinempl.style.use("ggplot")sns.set_style("white")pylab.rcParams["figure.figsize"] = 12,8# misc librariesimport randomimport time# ignore warningsimport warningswarnings.filterwarnings("ignore")print('-'*25)# Input data files are available in the "datasets/titanic/" directory.# Listing all filesfrom subprocess import check_outputdata_dir = "datasets/titanic/"print(check_output(["ls", data_dir]).decode("utf8"))

接下来我们看看数据的基本情况，可以使用info、describe等函数。

train = pd.read_csv(data_dir+"train.csv")test = pd.read_csv(data_dir+"test.csv")train.info()#test.info()#train.describe(include = "all")#data_train.sample(10)

个人习惯将train和test合并做数据整理和特征工程，另外PassengerId仅仅用于数据标识，Ticket意义不太明确，我们将这两项去掉。

# saving passenger id in advance in order to submit later.passengerId = test.PassengerIdtrain_len = len(train)dataset =  pd.concat([train, test], axis=0).reset_index(drop=True)#dataset.info()#dataset.head()drop_column = ["PassengerId", "Ticket"]dataset.drop(drop_column, axis=1, inplace = True)dataset.info()

我们统计一下原数据中的缺失值。

print("Data columns with null values:\n", dataset.isnull().sum())

处理缺失值是数据预处理重要的一环，往往要综合考虑，多次尝试。常见的方法除均值、中值、众值外，还包括关联性考察、重新编码、去除缺失值过多的特征项、利用机器学习赋值等。

这里我们先看看Embarked特征项的两个缺失值，考察下Embarked特征项的整体情况。

dataset[dataset.Embarked.isnull()]pd.DataFrame(dataset.Embarked.value_counts(dropna=False))

可以看到，出现频率最高的为S，所以一种处理方式就是将缺失值设为众数S。不过，仔细看下Embarked和Pclass/Fare之间的关系会发现，Pclass=1，Fare=80时，Embarked最有可能是C，所以这里我们也可将缺失值设为C。

h = sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=dataset)

后续数据中，我们暂时用类似处理Embarked的方法填补Fare，用中值填补Age，用N填补Cabin。

dataset.Embarked.fillna("C", inplace=True)missing_value = dataset[(dataset.Pclass == 3) & (dataset.Embarked == "S") & (dataset.Sex == "male")].Fare.mean()dataset.Fare.fillna(missing_value, inplace=True)dataset.Age.fillna(dataset.Age.median(), inplace=True)dataset.Cabin.fillna("N", inplace=True)dataset.info()#dataset.describe(include = "all")

接下来，我们简单看看train中各特征值之间的关系，包括与目标值Survived之间的关系等。

data_plot = dataset[:train_len]#histogram comparison of sex, class, and age by survivalh = sns.FacetGrid(data_plot, row = "Sex", col = "Pclass", hue = "Survived")h.map(plt.hist, "Age", alpha = .75)h.add_legend()#pair plots of entire datasetpp = sns.pairplot(data_plot, hue = "Survived", palette = "deep", size=1.2, diag_kind = "kde", diag_kws=dict(shade=True), plot_kws=dict(s=10) )pp.set(xticklabels=[])#correlation heatmap of datasetdef correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = colormap,
        square=True, 
        cbar_kws={"shrink":.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor="white",
        annot_kws={"fontsize":12 }
    )
    
    plt.title("Pearson Correlation of Features", y=1.05, size=15)correlation_heatmap(data_plot)

Step 3: 特征工程

特征工程可谓是个见仁见智见能力的事情。

我们逐个特征项来做吧，首先是Age。

g = sns.kdeplot(data_plot.Age[data_plot.Survived == 0], color="Red", shade = True)g = sns.kdeplot(data_plot.Age[data_plot.Survived == 1], ax =g, color="Blue", shade= True)g.set_xlabel("Age")g.set_ylabel("Frequency")g = g.legend(["Not Survived","Survived"])

可以看到整体差异不大，但儿童的生存率出现峰值，是人性的光辉吧。我们不妨抽提儿童作为新的特征，整体还可以按年龄分组。

dataset["AgeBin"] = pd.cut(dataset.Age.astype(int), 5)dataset["IsChild"] = [1 if i<16 else 0 for i in dataset.Age]#dataset.info()

Cabin项描述了客舱编号，其开头字母可能具有信息，我们不妨抽提。

#dataset.Cabin.head()dataset.Cabin = [i[0] for i in dataset.Cabin]g = sns.factorplot(y="Survived", x="Cabin", data=dataset, kind="bar")g = g.set_ylabels("Survival Probability")#dataset = pd.get_dummies(dataset, columns = ["Cabin"], prefix="Cabin")#dataset.sample(10)

对于Fare，我们查看下分布，并使用不同于Age的分组方式。

g = sns.kdeplot(data_plot.Fare[data_plot.Survived == 0], color="Red", shade = True)g = sns.kdeplot(data_plot.Fare[data_plot.Survived == 1], ax =g, color="Blue", shade= True)g.set_xlabel("Fare")g.set_ylabel("Frequency")g = g.legend(["Not Survived","Survived"])dataset["FareBin"] = pd.qcut(dataset.Fare, 4)

对于Name，我们参考论坛的一些帖子，抽提Title作为特征项。

dataset["Title"] = dataset.Name.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]title_names = (dataset.Title.value_counts() < 5)dataset.Title = dataset.Title.apply(lambda x: "Misc" if title_names.loc[x] == True else x)#dataset.Title.value_counts()

其它的特征项，我们也相应处理。

dataset["FamilySize"] = dataset.SibSp + dataset.Parch + 1dataset["IsAlone"] = 1dataset["IsAlone"].loc[dataset.FamilySize > 1] = 0dataset["IsMother"] = np.where((dataset.Age > 18) & 
                               (dataset.Title != 'Miss') & 
                               (dataset.Parch > 0) & 
                               (dataset.Sex == 'female'), 
                               1, 0)#dataset.info()

为方便处理，我们需要进行一些编码、转换等工作。

label = LabelEncoder()dataset["Cabin_Code"] = label.fit_transform(dataset.Cabin)dataset["Sex_Code"] = label.fit_transform(dataset.Sex)dataset["Embarked_Code"] = label.fit_transform(dataset.Embarked)dataset["Title_Code"] = label.fit_transform(dataset.Title)dataset["AgeBin_Code"] = label.fit_transform(dataset.AgeBin)dataset["FareBin_Code"] = label.fit_transform(dataset.FareBin)drop_column = ["Cabin", "Sex", "Embarked", "Title", "AgeBin", "FareBin", "Name"]dataset.drop(drop_column, axis=1, inplace = True)dataset.info()

Step 4: 模型建立

先将train和test分开，并设好cv。

train_x = dataset[:train_len]test_x = dataset[train_len:]train_y = pd.DataFrame(train_x.Survived)train_x.drop("Survived", axis=1, inplace = True)test_x.drop("Survived", axis=1, inplace = True)ss = StandardScaler()train_x = ss.fit_transform(train_x)test_x = ss.transform(test_x)cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)#train_x.info()

我们一次性尝试多个模型的效果。

#Machine Learning Algorithm (MLA) Selection and InitializationMLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.SGDClassifier(),
    #linear_model.Perceptron(),
    
    #Navies Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    #naive_bayes.MultinomialNB(),
    #naive_bayes.ComplementNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    #neighbors.RadiusNeighborsClassifier(radius=50),
    
    #NN
    neural_network.MLPClassifier(),
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    #svm.LinearSVC(),
    
    #Trees 
    #tree.DecisionTreeClassifier(),
    #tree.ExtraTreeClassifier(),
    
    #Discriminant Analysis
    #discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    XGBClassifier()    
    ]

下面评价一下各个模型的优劣。

MLA_columns = ["MLA Name","MLA Parameters","MLA Train Accuracy Mean","MLA Test Accuracy Mean","MLA Test Accuracy 3*STD","MLA Time"]MLA_compare = pd.DataFrame(columns = MLA_columns)MLA_predict = train_y.copy()#index through MLA and save performance to tablerow_index = 0for alg in MLA:
    #set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, "MLA Name"] = MLA_name
    MLA_compare.loc[row_index, "MLA Parameters"] = str(alg.get_params())
    
    cv_results = model_selection.cross_validate(alg, train_x, train_y, cv  = cv_split)

    MLA_compare.loc[row_index, "MLA Time"] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, "MLA Train Accuracy Mean"] = cv_results["train_score"].mean()
    MLA_compare.loc[row_index, "MLA Test Accuracy Mean"] = cv_results["test_score"].mean()   
    #if this is a non-bias random sample, then +/-3 standard deviations (std) from the mean, should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, "MLA Test Accuracy 3*STD"] = cv_results["test_score"].std()*3   #let's know the worst that can happen!
    
    #save MLA predictions
    alg.fit(train_x, train_y)
    MLA_predict[MLA_name] = alg.predict(train_x)
    
    row_index+=1
    MLA_compare.sort_values(by = ["MLA Test Accuracy Mean"], ascending = False, inplace = True)MLA_compare

我们还可以看看各模型预测结果间的相关性。

correlation_heatmap(MLA_predict)

Step 5: 模型优化

常见的优化方法主要有调参、特征选择等。这里我们尝试使用GridSearchCV进行调参。

vote_est = [
    ('ada', ensemble.AdaBoostClassifier()),
    ('bc',  ensemble.BaggingClassifier()),
    ('etc', ensemble.ExtraTreesClassifier()),
    ('gbc', ensemble.GradientBoostingClassifier()),
    ('rfc', ensemble.RandomForestClassifier()),
    ('gpc', gaussian_process.GaussianProcessClassifier()),
    ('lr',  linear_model.LogisticRegressionCV()),
    ('sgd', linear_model.SGDClassifier()),
    #('lrp', linear_model.Perceptron()),
    ('bnb', naive_bayes.BernoulliNB()),
    ('gnb', naive_bayes.GaussianNB()),
    ('knn', neighbors.KNeighborsClassifier()),
    ('mlp', neural_network.MLPClassifier()),
    ('svc', svm.SVC(probability=True)),
    ('nu',  svm.NuSVC(probability=True)),
    #('lsv', svm.LinearSVC()),
    ('qd', discriminant_analysis.QuadraticDiscriminantAnalysis()),
    ('xgb', XGBClassifier())]grid_seed = [5362]grid_param = [
            # AdaBoostClassifier
            [{
            'n_estimators': [50, 100, 200],
            'learning_rate': [.05, .1, .2],
            'random_state': grid_seed
            }],
       
            # BaggingClassifier
            [{
            'n_estimators': [200, 300, 400],
            'max_samples': [.2, .3, .5],
            'random_state': grid_seed
            }],
            
            # ExtraTreesClassifier
            [{
            'n_estimators': [50, 100, 300],
            'criterion': ['gini', 'entropy'],
            'max_depth': [6, 8, None],
            'random_state': grid_seed
            }],

            # GradientBoostingClassifier
            [{
            'learning_rate': [.1, .3], 
            'n_estimators': [100, 300, 500], 
            'max_depth': [2, 4, 6],
            'random_state': grid_seed
            }],

            # RandomForestClassifier
            [{
            'n_estimators': [100, 300, 500],
            'criterion': ['gini', 'entropy'],
            'max_depth': [6, 8, None],
            'oob_score': [True, False],
            'random_state': grid_seed
            }],
    
            # GaussianProcessClassifier
            [{    
            'max_iter_predict': [50, 100, 300],
            'random_state': grid_seed
            }],
    
            # LogisticRegressionCV
            [{
            'fit_intercept': [True, False],
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
            'random_state': grid_seed
            }],
    
            # SGDClassifier
            [{
            'fit_intercept': [True, False],
            'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge'],
            'random_state': grid_seed
            }],
    
            # Perceptron
            #[{
            #'fit_intercept': [True, False],
            #'random_state': grid_seed
            #}],
            
            # BernoulliNB
            [{
            'alpha': [.5, .75, 1.0],
            }],
    
            # GaussianNB
            [{}],
    
            # KNeighborsClassifier
            [{
            'n_neighbors': [5,6,7,8],
            'weights': ['uniform', 'distance'],
            'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
            'metric': ['minkowski'],
            'leaf_size': [16, 22, 26, 30]
            }],
    
            # MLPClassifier
            [{
            'activation': ['identity', 'logistic', 'tanh', 'relu'],
            'solver': ['lbfgs', 'sgd', 'adam'],
            'learning_rate': ['constant', 'invscaling', 'adaptive'],
            'random_state': grid_seed
            }],
            
            # SVC
            [{
            'C': [3,4,5,6],
            'gamma': [.1, .25, .5],
            'decision_function_shape': ['ovo', 'ovr'],
            'probability': [True],
            'random_state': grid_seed
            }],
    
            # NuSVC
            [{
            'gamma': [.1, .25, .5],
            'decision_function_shape': ['ovo', 'ovr'],
            'probability': [True],
            'random_state': grid_seed
            }],
    
            # LinearSVC
            #[{
            #'C': [1, 2],
            #'random_state': grid_seed
            #}],
    
            # QuadraticDiscriminantAnalysis
            [{}],
    
            #XGBClassifier
            [{
            'learning_rate': [.02, .03, .05], #default: .3
            'max_depth': [8, 10, 12], #default 2
            'n_estimators': [300, 400], 
            'colsample_bytree': [0.6, 0.8],
            'colsample_bylevel': [0.6, 0.8],
            'seed': grid_seed  
            }]   
        ]start_total = time.perf_counter() #https://docs.python.org/3/library/time.html#time.perf_counterfor clf, param in zip (vote_est, grid_param):
    start = time.perf_counter()        
    best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = "roc_auc")
    best_search.fit(train_x, train_y)
    run = time.perf_counter() - start

    best_param = best_search.best_params_
    print("The best parameter for {} is {} with a runtime of {:.2f} seconds.".format(clf[1].__class__.__name__, best_param, run))
    clf[1].set_params(**best_param) run_total = time.perf_counter() - start_totalprint("Total optimization time was {:.2f} minutes.".format(run_total/60))print("-"*10)

面对一众模型，我们在提交最终预测结果的时候还可以选择使用ensemble，这里以voting为例。

ensembleVote = ensemble.VotingClassifier(estimators = vote_est , voting = "hard")ensembleVote.fit(train_x, train_y)prediction = ensembleVote.predict(test_x).astype(int)submission = pd.DataFrame(data={
    "PassengerId": passengerId,
    "Survived": prediction})#submission.info()submission.to_csv(data_dir+"submission.csv", index=False)

提交到Kaggle，最终得分0.78947。

原文链接https://wenlongshen.github.io/2018/12/21/Walks-Titanic/

转载本文请联系原作者获取授权，同时请注明本文来自沈文龙科学网博客。
链接地址：https://wap.sciencenet.cn/blog-543513-1153930.html

上一篇：Feature Importance
下一篇：搭建基于Docker的Tensorflow+Cuda环境

收藏 IP: 223.71.139.*| 热度|

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

数据加载中...

返回顶部

沈文龙

扫一扫，分享此博文

shenwl的个人博客分享 http://blog.sciencenet.cn/u/shenwl

博文

Kaggle之路：Titanic

Overview

Step 1: 问题分析

Step 2: 数据整理

Step 3: 特征工程

Step 4: 模型建立

Step 5: 模型优化

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

沈文龙

全部作者的其他最新博文

全部精选博文导读

相关博文

shenwl的个人博客分享 http://blog.sciencenet.cn/u/shenwl

博文

Kaggle之路：Titanic

Overview

Step 1: 问题分析

Step 2: 数据整理

Step 3: 特征工程

Step 4: 模型建立

Step 5: 模型优化

当前推荐数：0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

沈文龙

全部作者的其他最新博文

全部精选博文导读

相关博文

该博文允许注册用户评论请点击登录评论 (0 个评论)