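- When working with pandas in practice, you often run into categorical data. It usually shows up as strings: colors (red, green, blue), sizes (large, medium, small), or geographic information (country, province). Handling such data tends to cause all sorts of problems. Both pandas and scikit-learn can convert categorical data into a suitable numeric format, and this post walks through how to do that conversion, i.e. the encoding process, with these two packages.
- The data comes from the UCI Machine Learning Repository. This data set contains many categorical columns; the meaning of each column is described on the data set's page there.
- Start by importing the packages we need: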
import numpy as np
import pandas as pd
# Define the names of the data columns
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
"num_doors", "body_style", "drive_wheels", "engine_location",
"wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system",
"bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
"city_mpg", "highway_mpg", "price"]
# Read the CSV file with pandas, treating "?" as a missing value (NaN)
df = pd.read_csv(
    "http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
    header=None, names=headers, na_values="?")
df.head()
| | symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |

5 rows × 26 columns
df.dtypes
symboling int64
normalized_losses float64
make object
fuel_type object
aspiration object
num_doors object
body_style object
drive_wheels object
engine_location object
wheel_base float64
length float64
width float64
height float64
curb_weight int64
engine_type object
num_cylinders object
engine_size int64
fuel_system object
bore float64
stroke float64
compression_ratio float64
horsepower float64
peak_rpm float64
city_mpg int64
highway_mpg int64
price float64
dtype: object
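# If we only care about the categorical data, there is no need to carry every column along.
# Select just the object-typed columns and continue the analysis on that copy.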
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
| 1 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
| 2 | alfa-romero | gas | std | two | hatchback | rwd | front | ohcv | six | mpfi |
| 3 | audi | gas | std | four | sedan | fwd | front | ohc | four | mpfi |
| 4 | audi | gas | std | four | sedan | 4wd | front | ohc | five | mpfi |
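# Before going further, the missing values have to be dealt with.
# any(axis=1) looks across the columns of each row, so this selects every row that contains at least one NaN.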
obj_df[obj_df.isnull().any(axis=1)]
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| 27 | dodge | gas | turbo | NaN | sedan | fwd | front | ohc | four | mpfi |
| 63 | mazda | diesel | std | NaN | sedan | fwd | front | ohc | four | idi |
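The `axis` argument is easy to misread, so here is a minimal sketch on a small hand-made frame. The `toy` DataFrame and its values are made up for illustration and are not taken from the data set. It shows that `any(axis=1)` yields one flag per row, while `isnull().sum()` yields one count per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({
    "make": ["dodge", "mazda", "audi"],
    "num_doors": [np.nan, "four", "two"],
})

# One boolean per row: True if any column in that row is missing
rows_with_nan = toy.isnull().any(axis=1)

# One count per column: how many entries in that column are missing
nan_per_column = toy.isnull().sum()

print(toy[rows_with_nan])
print(nan_per_column)
```

The per-column count is a quick way to see which columns need attention before deciding how to fill them.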
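# There are many ways to handle missing values: depending on the project you can either fill them in or drop the affected rows.
# Here the missing values are filled with the column's mode (its most frequent value).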
obj_df.num_doors.value_counts()
four 114
two 89
Name: num_doors, dtype: int64
obj_df=obj_df.fillna({"num_doors":"four"})
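The fill value "four" above is typed in by hand after reading the counts. As a hedged alternative, the mode can be looked up programmatically. The sketch below uses a made-up `toy` column standing in for `obj_df["num_doors"]`; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for obj_df["num_doors"]
toy = pd.DataFrame({"num_doors": ["four", "two", "four", np.nan, "four", np.nan]})

# mode() ignores NaN and returns a Series (there can be ties); take the first entry
most_common = toy["num_doors"].mode()[0]

# Fill only this column, leaving any other columns untouched
filled = toy.fillna({"num_doors": most_common})

print(most_common)
print(filled["num_doors"].isnull().sum())
```

Looking the value up this way avoids a stale hard-coded constant if the data changes.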
With the missing values handled, there are several ways to encode categorical data:
- Find and replace
- Label encoding
- One-hot encoding
- Custom binary encoding
- scikit-learn
- Advanced approaches
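# The pandas documentation for replace is extensive; the method takes many parameters and is correspondingly powerful.
# Here we pass replace a nested dict (column name -> {old value: new value}),
# which makes it convenient to clean up just the columns that need it: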
cleanup_nums= {
"num_doors":{"four":4,"two":2},
"num_cylinders":{
"four":4,"six":6,"five":5,"eight":8,"two":2,"twelve":12,"three":3
}
}
obj_df.replace(cleanup_nums,inplace=True)
obj_df.head()
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi |
| 1 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi |
| 2 | alfa-romero | gas | std | 2 | hatchback | rwd | front | ohcv | 6 | mpfi |
| 3 | audi | gas | std | 4 | sedan | fwd | front | ohc | 4 | mpfi |
| 4 | audi | gas | std | 4 | sedan | 4wd | front | ohc | 5 | mpfi |
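For a single column, `Series.map` with a plain dict is a possible alternative to `replace`. The sketch below is an illustration on a made-up `toy` column, not the post's own method. One behavioral difference worth knowing: `replace` leaves unmatched values unchanged, whereas `map` turns them into NaN, which can help surface unexpected categories.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only
toy = pd.DataFrame({"num_cylinders": ["four", "six", "five", "four"]})
word_to_int = {"four": 4, "six": 6, "five": 5}

# Every value has an entry in the dict, so the result is an integer column
toy["num_cylinders"] = toy["num_cylinders"].map(word_to_int)

# A value missing from the dict ("ten") comes back as NaN instead of passing through
unmapped = pd.Series(["four", "ten"]).map(word_to_int)

print(toy["num_cylinders"].tolist())
print(unmapped.isnull().tolist())
```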
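Label encoding converts a set of unordered values, which have no inherent ranking, into numbers.
- For example, the body_style column contains several distinct values that can be converted this way:
  - convertible -> 0
  - hardtop -> 1
  - hatchback -> 2
  - sedan -> 3
  - wagon -> 4
This works much like a cipher, where each value is swapped for a code. I find the analogy amusing; it reminds me of a line from a film I once saw: "those two are as thick as thieves."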
# With the pandas category dtype it is easy to obtain this encoding
obj_df["body_style"]=obj_df["body_style"].astype("category")
obj_df.dtypes
make object
fuel_type object
aspiration object
num_doors int64
body_style category
drive_wheels object
engine_location object
engine_type object
num_cylinders int64
fuel_system object
dtype: object
# We can store the corresponding codes in a new column.
# This keeps the data tidy and convenient for later analysis.
obj_df["body_style_code"] = obj_df["body_style"].cat.codes
obj_df.head()
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system | body_style_code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi | 0 |
| 1 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi | 0 |
| 2 | alfa-romero | gas | std | 2 | hatchback | rwd | front | ohcv | 6 | mpfi | 2 |
| 3 | audi | gas | std | 4 | sedan | fwd | front | ohc | 4 | mpfi | 3 |
| 4 | audi | gas | std | 4 | sedan | 4wd | front | ohc | 5 | mpfi | 3 |
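Once a column holds integer codes, it helps to be able to translate them back. The sketch below, on a made-up `styles` Series, shows one way to recover the code-to-label mapping from a categorical column; the variable names are assumptions. By default pandas sorts the categories, so the codes follow alphabetical order here.

```python
import pandas as pd

# Hypothetical toy values standing in for obj_df["body_style"]
styles = pd.Series(
    ["convertible", "hatchback", "sedan", "sedan", "wagon", "hardtop"]
).astype("category")

# cat.categories lists the labels in code order, so enumerate gives code -> label
code_to_label = dict(enumerate(styles.cat.categories))

print(styles.cat.codes.tolist())
print(code_to_label)
```

Keeping this dict next to the encoded column makes it easy to decode results later. Note that pandas assigns the code -1 to missing values.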
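One-hot encoding
- Label encoding turns wagon into 4 and convertible into 0, which can be misread as implying an order or magnitude between the values. One-hot encoding avoids this by turning each category value into its own 0/1 column. This increases the number of columns, but removes the misleading size comparison that label encoding introduces.
- pandas provides the get_dummies function, which converts the values of the chosen columns into 0/1 indicator columns.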
# The resulting DataFrame contains three newly generated columns:
# drive_wheels_4wd
# drive_wheels_fwd
# drive_wheels_rwd
pd.get_dummies(obj_df,columns=["drive_wheels"]).head()
| | make | fuel_type | aspiration | num_doors | body_style | engine_location | engine_type | num_cylinders | fuel_system | body_style_code | drive_wheels_4wd | drive_wheels_fwd | drive_wheels_rwd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | convertible | front | dohc | 4 | mpfi | 0 | 0 | 0 | 1 |
| 1 | alfa-romero | gas | std | 2 | convertible | front | dohc | 4 | mpfi | 0 | 0 | 0 | 1 |
| 2 | alfa-romero | gas | std | 2 | hatchback | front | ohcv | 6 | mpfi | 2 | 0 | 0 | 1 |
| 3 | audi | gas | std | 4 | sedan | front | ohc | 4 | mpfi | 3 | 0 | 1 | 0 |
| 4 | audi | gas | std | 4 | sedan | front | ohc | 5 | mpfi | 3 | 1 | 0 | 0 |
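# What makes get_dummies powerful is that it can encode several categorical columns at once,
# with a matching prefix chosen for each of them.
# The resulting DataFrame still contains all of the other columns.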
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()
| | make | fuel_type | aspiration | num_doors | engine_location | engine_type | num_cylinders | fuel_system | body_style_code | body_convertible | body_hardtop | body_hatchback | body_sedan | body_wagon | drive_4wd | drive_fwd | drive_rwd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | front | dohc | 4 | mpfi | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | alfa-romero | gas | std | 2 | front | dohc | 4 | mpfi | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | alfa-romero | gas | std | 2 | front | ohcv | 6 | mpfi | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | audi | gas | std | 4 | front | ohc | 4 | mpfi | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | audi | gas | std | 4 | front | ohc | 5 | mpfi | 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
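Two `get_dummies` options are worth a quick illustration. Recent pandas releases (2.0 and later) return True/False indicator columns by default, so passing `dtype=int` is one way to get 0/1 values like those shown in the tables. `drop_first=True` drops one indicator per column, which some linear models prefer because the full set of indicators is redundant. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({"drive_wheels": ["rwd", "fwd", "4wd", "fwd"]})

# dtype=int forces 0/1 integers instead of booleans
dummies = pd.get_dummies(toy, columns=["drive_wheels"], prefix=["drive"], dtype=int)

# drop_first=True removes the first category (here "4wd"), leaving k-1 columns
reduced = pd.get_dummies(
    toy, columns=["drive_wheels"], prefix=["drive"], dtype=int, drop_first=True
)

print(dummies)
print(list(reduced.columns))
```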
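Custom binary (0/1) encoding
- Sometimes, depending on business needs, you may want to combine the ideas of label encoding and one-hot encoding and reduce a column to a single binary flag.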
obj_df["engine_type"].value_counts()
ohc 148
ohcf 15
ohcv 13
dohc 12
l 12
rotor 4
dohcv 1
Name: engine_type, dtype: int64
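# Suppose we only want to know whether the engine uses OHC (overhead cam) technology.
# We can binarize the column: 1 if engine_type contains "ohc", 0 otherwise.
# This also shows how domain knowledge can lead to the most effective way of solving the problem.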
obj_df["engine_type_code"] = np.where(obj_df["engine_type"].str.contains("ohc"),1,0)
obj_df[["make","engine_type","engine_type_code"]].head()
| | make | engine_type | engine_type_code |
|---|---|---|---|
| 0 | alfa-romero | dohc | 1 |
| 1 | alfa-romero | dohc | 1 |
| 2 | alfa-romero | ohcv | 1 |
| 3 | audi | ohc | 1 |
| 4 | audi | ohc | 1 |
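The first five rows all happen to be OHC engines, so the 0 case never appears in the preview. The sketch below uses a made-up `toy` column that includes non-OHC values and a missing value. It is a variation under stated assumptions, not the post's exact code: `na=False` treats missing entries as "no match", and `regex=False` asks for a plain substring test.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values are illustrative only
toy = pd.DataFrame({"engine_type": ["dohc", "ohcv", "rotor", "l", np.nan]})

# Plain substring match; a missing engine_type counts as "not OHC"
is_ohc = toy["engine_type"].str.contains("ohc", na=False, regex=False)

# Convert the boolean flags to 1/0
toy["engine_type_code"] = is_ohc.astype(int)

print(toy)
```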
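Data transformation in scikit-learn
- The sklearn.preprocessing module provides many convenient tools for transforming data, and older releases also handled missing values there (Imputer; in current scikit-learn releases this has moved to sklearn.impute.SimpleImputer). You can import LabelEncoder and LabelBinarizer directly from the module, along with min-max scaling to the [0, 1] range (MinMaxScaler), Normalizer (L1/L2 normalization, not used very often), standardization (StandardScaler), non-linear transformations, and polynomial feature generation (PolynomialFeatures). Together these let you bring every feature onto the same range or distribution.
- Official documentation for the sklearn.preprocessing module
- Official documentation for the category_encoders package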
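As a rough illustration of the two encoders named above, the sketch below applies LabelEncoder and LabelBinarizer to a short made-up list of body styles. The list and variable names are assumptions for illustration. Note that the scikit-learn documentation describes both classes as intended for target labels; for input features, OrdinalEncoder and OneHotEncoder from the same module are the usual choice.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Hypothetical toy values; illustrative only
body_styles = ["convertible", "hatchback", "sedan", "sedan", "wagon"]

# LabelEncoder: one integer per distinct label, assigned in sorted order
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(body_styles)

# LabelBinarizer: one 0/1 column per distinct label (for more than two classes)
binarizer = LabelBinarizer()
onehot = binarizer.fit_transform(body_styles)

print(list(label_encoder.classes_))
print(codes)
print(onehot)
```

Both fitted objects keep the learned labels in `classes_`, and `inverse_transform` maps encoded values back to the original strings.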
That wraps up this overview of data preprocessing and categorical encoding.