The dataset was obtained from https://www.kaggle.com/c/titanic

For data preprocessing, I first defined three transformers:

  • DataFrameSelector: Selects the features to handle.
  • CombinedAttributesAdder: Adds a categorical feature Age_cat, which divides all passengers into three categories according to their ages.
  • ImputeMostFrequent: Since the mean and median strategies of SimpleImputer() are only suitable for numerical variables, I wrote a transformer to impute missing string values with the mode (the most frequent value). Here I was inspired by https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn.
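
As a point of comparison, SimpleImputer in scikit-learn 0.20+ can also fill string columns when strategy="most_frequent" is used. The sketch below runs on a small toy DataFrame whose values are invented for illustration (it is not the Titanic data) and shows the same mode-imputation idea as ImputeMostFrequent.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with missing entries in string columns
toy = pd.DataFrame({
    "Sex": ["male", "female", np.nan, "female"],
    "Embarked": ["S", np.nan, "S", "C"],
})

# strategy="most_frequent" fills each column with its own mode
mode_imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(mode_imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

The custom transformer keeps the result as a DataFrame, while SimpleImputer returns a NumPy array, which is why the sketch wraps the output again.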

Then I wrote separate pipelines for the different kinds of features:

  • For numerical features, I applied DataFrameSelector, SimpleImputer and StandardScaler.
  • For categorical features, I applied DataFrameSelector, ImputeMostFrequent and OneHotEncoder.
  • For the newly created feature Age_cat, which is categorical but derived from a numerical feature, I wrote a separate pipeline to create it, impute the missing values and encode the categories.
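
To make the Age_cat step concrete, here is a minimal sketch of the binning on a handful of made-up ages (the values are assumptions for illustration). It uses the same bin edges and labels as CombinedAttributesAdder and then fills the missing category with the mode.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value
ages = pd.Series([4, 18, 35, 45, 60, 71, np.nan], name="Age")

# Bins are right-closed: (0, 18] -> child, (18, 60] -> adult, (60, 100] -> old
age_cat = pd.cut(ages, [0, 18, 60, 100], labels=["child", "adult", "old"])

# A missing age gives a missing category, so impute it with the most frequent one
age_cat_filled = age_cat.fillna(age_cat.mode()[0])
print(age_cat_filled.tolist())
```

Note that an age of exactly 18 falls in the child bin and an age of exactly 60 in the adult bin, because pd.cut closes each interval on the right by default.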

Finally, I combined them into a full pipeline with FeatureUnion. Here is the code:

# Read data
import pandas as pd
import numpy as np
import os

titanic_train = pd.read_csv('Dataset/Titanic/train.csv')
titanic_test = pd.read_csv('Dataset/Titanic/test.csv')
submission = pd.read_csv('Dataset/Titanic/gender_submission.csv')

# Divide attributes and labels
titanic_labels = titanic_train['Survived'].copy()
titanic = titanic_train.drop(['Survived'], axis=1)

# Feature Selection
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified
        X = X[self.attribute_name].copy()
        if 'Pclass' in self.attribute_name:
            # Treat passenger class as a category, not a number
            X['Pclass'] = X['Pclass'].astype(str)
        return X

# Feature Creation
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        # Bins are right-closed: (0, 18], (18, 60], (60, 100]
        Age_cat = pd.cut(X['Age'], [0, 18, 60, 100], labels=['child', 'adult', 'old'])
        Age_cat = np.array(Age_cat)
        return pd.DataFrame(Age_cat, columns=['Age_cat'])

# Impute Categorical variables
class ImputeMostFrequent(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0] for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

# Pipeline
from sklearn.impute import SimpleImputer  # Scikit-Learn 0.20+
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import FeatureUnion

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age', 'SibSp', 'Parch', 'Fare'])),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Pclass', 'Sex', 'Embarked'])),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

new_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age'])),
    #('imputer', SimpleImputer(strategy="median")),
    ('attr_adder', CombinedAttributesAdder()),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

full_pipeline = FeatureUnion([
    ("num", num_pipeline),
    ("cat", cat_pipeline),
    ("new", new_pipeline),
])

titanic_prepared = full_pipeline.fit_transform(titanic)
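
For comparison, scikit-learn 0.20+ also provides ColumnTransformer, which selects columns by name and so can stand in for the DataFrameSelector plus FeatureUnion combination. The following is only a sketch of that alternative, not the approach used above. It runs on a toy DataFrame with invented values that reuses a few of the Titanic column names, and it leaves out the Age_cat branch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (hypothetical values) reusing some Titanic column names
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
cat_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder()),
])

# Each entry is (name, transformer, columns it applies to)
preprocess = ColumnTransformer([
    ("num", num_steps, ["Age", "Fare"]),
    ("cat", cat_steps, ["Sex", "Embarked"]),
])

toy_prepared = preprocess.fit_transform(toy)
print(toy_prepared.shape)
```

The trade-off is that the column selection moves out of the individual pipelines and into one place, so no custom selector class is needed.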

Another thing I want to mention is that the data passed through a pipeline should be a 2D array rather than a 1D array. So if you choose only one feature, don't forget to convert the 1D array with the reshape() method. Otherwise, you will receive an error like

ValueError: Expected 2D array, got 1D array instead

Specifically, apply reshape(-1, 1) if the data is a single feature (one column) and reshape(1, -1) if it is a single sample (one row). More about the issue can be found at https://stackoverflow.com/questions/51150153/valueerror-expected-2d-array-got-1d-array-instead.
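
A small sketch of the problem and the fix, using StandardScaler and a few made-up age values (both chosen only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([22.0, 38.0, 26.0, 35.0])  # 1D array, shape (4,)

# Passing the 1D array directly is rejected by the scaler
try:
    StandardScaler().fit_transform(ages)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

column = ages.reshape(-1, 1)  # one feature: shape (4, 1)
row = ages.reshape(1, -1)     # one sample: shape (1, 4)

# The column form is what a single selected feature should look like
scaled = StandardScaler().fit_transform(column)
print(column.shape, row.shape, scaled.shape)
```

With pandas, selecting with a list of names, such as X[['Age']], already returns a 2D DataFrame, which is why DataFrameSelector(['Age']) works without an explicit reshape.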


												
