As in the previous article, start by importing the modules that will be used frequently later:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rc('font', family='Arial Unicode MS')
plt.rc('axes', unicode_minus='False')

pd.__version__
Out[1]:
'1.1.3'
 

Note: I'm on a Mac, so when plotting with matplotlib I set the font to Arial Unicode MS to display Chinese characters. On a deepin system you can set the font to WenQuanYi Micro Hei instead, i.e.:

In [2]:
# import numpy as np
# import pandas as pd
# import matplotlib.pyplot as plt

# plt.rc('font', family='WenQuanYi Micro Hei')
# plt.rc('axes', unicode_minus='False')
 

If Chinese characters come out garbled in plots on another system, the following lines list the fonts available on your system so you can find one that supports Chinese:

In [3]:
# from matplotlib.font_manager import FontManager
# fonts = set([x.name for x in FontManager().ttflist])
# print(fonts)
 

Without further ado, let's continue learning pandas.

 
 

Merging Data

 

In real business scenarios, multiple datasets or files often need to be merged before analysis.

 

concat

 
Signature:
pd.concat(
objs: Union[Iterable[~FrameOrSeries], Mapping[Union[Hashable, NoneType], ~FrameOrSeries]],
axis=0,
join='outer',
ignore_index: bool = False,
keys=None,
levels=None,
names=None,
verify_integrity: bool = False,
sort: bool = False,
copy: bool = True,
) -> Union[ForwardRef('DataFrame'), ForwardRef('Series')]
Docstring:
Concatenate pandas objects along a particular axis with optional set logic
along the other axes. Can also add a layer of hierarchical indexing on the concatenation axis,
which may be useful if the labels are the same (or overlapping) on
the passed axis number.
 

Prepare some data:

In [4]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df1
Out[4]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
In [5]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df2
Out[5]:
 


  A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
In [6]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
df3
Out[6]:
 


  A B C D
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
 

Basic concatenation:

 

In [7]:
# Merge three tables that share the same columns
pd.concat([df1, df2, df3])
Out[7]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
 

You can also give each table a top-level key, forming a hierarchical (multi-level) index:

 

In [8]:
pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
Out[8]:
 


    A B C D
x 0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
y 4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
z 8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
 

This is equivalent to the following:

In [9]:
pd.concat({'x': df1, 'y': df2, 'z': df3})
Out[9]:
 


    A B C D
x 0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
y 4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
z 8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
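
Relatedly, the names parameter from the signature above can label the levels of the resulting hierarchical index. A minimal sketch (the level names 'group' and 'row' are made up for illustration):

# Name both levels of the resulting MultiIndex
pd.concat({'x': df1, 'y': df2, 'z': df3}, names=['group', 'row'])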
 

To discard the original indexes when merging and use a fresh default index instead:

 

In [10]:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
'D': ['D2', 'D3', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])
df4
Out[10]:
 


  B D F
2 B2 D2 F2
3 B3 D3 F3
6 B6 D6 F6
7 B7 D7 F7
In [11]:
pd.concat([df1, df4], ignore_index=True, sort=False)
Out[11]:
 


  A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
4 NaN B2 NaN D2 F2
5 NaN B3 NaN D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7
 

Is there anything like the outer join in databases?

 

In [12]:
pd.concat([df1, df4], axis=1, sort=False)
Out[12]:
 


  A B C D B D F
0 A0 B0 C0 D0 NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
6 NaN NaN NaN NaN B6 D6 F6
7 NaN NaN NaN NaN B7 D7 F7
 

That naturally brings the inner join to mind:

 

In [13]:
pd.concat([df1, df4], axis=1, join='inner')
Out[13]:
 


  A B C D B D F
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
 
join : {'inner', 'outer'}, default 'outer'
How to handle indexes on other axis (or axes).
 

There's no left join or right join to be seen here, so how do we get the effect of a left join?

 

In [14]:
pd.concat([df1, df4.reindex(df1.index)], axis=1)
Out[14]:
 


  A B C D B D F
0 A0 B0 C0 D0 NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
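
Symmetrically, reindexing the other operand gives the effect of a right join. A sketch with the same df1 and df4:

# Keep only df4's rows: the concat analogue of a right join
pd.concat([df1.reindex(df4.index), df4], axis=1)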
 

Merging with a Series:

 

In [15]:
s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')
pd.concat([df1, s1], axis=1)
Out[15]:
 


  A B C D X
0 A0 B0 C0 D0 X0
1 A1 B1 C1 D1 X1
2 A2 B2 C2 D2 X2
3 A3 B3 C3 D3 X3
 

Of course, df.assign() can also be used to define a new column.
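
For comparison, a minimal sketch of that df.assign() alternative, producing the same result as the concat call above (assuming df1 and s1 as defined earlier):

# Add s1 as a new column named 'X' via assign instead of concat
df1.assign(X=s1)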

 

If the Series has no name, the new column is automatically given a positional name, as follows:

In [16]:
s2 = pd.Series(['_A', '_B', '_C', '_D'])
s3 = pd.Series(['_a', '_b', '_c', '_d'])
pd.concat([df1, s2, s3], axis=1)
Out[16]:
 


  A B C D 0 1
0 A0 B0 C0 D0 _A _a
1 A1 B1 C1 D1 _B _b
2 A2 B2 C2 D2 _C _c
3 A3 B3 C3 D3 _D _d
 

ignore_index=True discards the original column names:

In [17]:
pd.concat([df1, s1], axis=1, ignore_index=True)
Out[17]:
 


  0 1 2 3 4
0 A0 B0 C0 D0 X0
1 A1 B1 C1 D1 X1
2 A2 B2 C2 D2 X2
3 A3 B3 C3 D3 X3
 

Likewise, multiple Series can be merged:

In [18]:
s3 = pd.Series(['李寻欢', '令狐冲', '张无忌', '花无缺'])
s4 = pd.Series(['多情剑客无情剑', '笑傲江湖', '倚天屠龙记', '绝代双骄'])
s5 = pd.Series(['小李飞刀', '独孤九剑', '九阳神功', '移花接玉'])
pd.concat([s3, s4, s5], axis=1)
Out[18]:
 


  0 1 2
0 李寻欢 多情剑客无情剑 小李飞刀
1 令狐冲 笑傲江湖 独孤九剑
2 张无忌 倚天屠龙记 九阳神功
3 花无缺 绝代双骄 移花接玉
 

You can also specify keys to use new column names:

In [19]:
pd.concat([s3, s4, s5], axis=1, keys=['name', 'book', 'skill'])
Out[19]:
 


  name book skill
0 李寻欢 多情剑客无情剑 小李飞刀
1 令狐冲 笑傲江湖 独孤九剑
2 张无忌 倚天屠龙记 九阳神功
3 花无缺 绝代双骄 移花接玉
 

merge

 
Signature:
pd.merge(
left,
right,
how: str = 'inner',
on=None,
left_on=None,
right_on=None,
left_index: bool = False,
right_index: bool = False,
sort: bool = False,
suffixes=('_x', '_y'),
copy: bool = True,
indicator: bool = False,
validate=None,
) -> 'DataFrame'
Docstring:
Merge DataFrame or named Series objects with a database-style join.

The join is done on columns or indexes. If joining columns on columns,
the DataFrame indexes *will be ignored*. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.

Parameters
----------
left : DataFrame
right : DataFrame or named Series
    Object to merge with.
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
    Type of merge to be performed.

    * left: use only keys from left frame, similar to a SQL left outer join;
      preserve key order.
    * right: use only keys from right frame, similar to a SQL right outer join;
      preserve key order.
    * outer: use union of keys from both frames, similar to a SQL full outer
      join; sort keys lexicographically.
    * inner: use intersection of keys from both frames, similar to a SQL inner
      join; preserve the order of the left keys.
on : label or list
    Column or index level names to join on. These must be found in both
    DataFrames. If `on` is None and not merging on indexes then this defaults
    to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
    Column or index level names to join on in the left DataFrame. Can also
    be an array or list of arrays of the length of the left DataFrame.
    These arrays are treated as if they are columns.
right_on : label or list, or array-like
    Column or index level names to join on in the right DataFrame. Can also
    be an array or list of arrays of the length of the right DataFrame.
    These arrays are treated as if they are columns.
left_index : bool, default False
    Use the index from the left DataFrame as the join key(s). If it is a
    MultiIndex, the number of keys in the other DataFrame (either the index
    or a number of columns) must match the number of levels.
right_index : bool, default False
    Use the index from the right DataFrame as the join key. Same caveats as
    left_index.
sort : bool, default False
    Sort the join keys lexicographically in the result DataFrame. If False,
    the order of the join keys depends on the join type (how keyword).
suffixes : list-like, default is ("_x", "_y")
    A length-2 sequence where each element is optionally a string
    indicating the suffix to add to overlapping column names in
    `left` and `right` respectively. Pass a value of `None` instead
    of a string to indicate that the column name from `left` or
    `right` should be left as-is, with no suffix. At least one of the
    values must not be None.
copy : bool, default True
    If False, avoid copy if possible.
indicator : bool or str, default False
    If True, adds a column to the output DataFrame called "_merge" with
    information on the source of each row. The column can be given a different
    name by providing a string argument. The column will have a Categorical
    type with the value of "left_only" for observations whose merge key only
    appears in the left DataFrame, "right_only" for observations
    whose merge key only appears in the right DataFrame, and "both"
    if the observation's merge key is found in both DataFrames.
validate : str, optional
    If specified, checks if merge is of specified type.

    * "one_to_one" or "1:1": check if merge keys are unique in both
      left and right datasets.
    * "one_to_many" or "1:m": check if merge keys are unique in left
      dataset.
    * "many_to_one" or "m:1": check if merge keys are unique in right
      dataset.
    * "many_to_many" or "m:m": allowed, but does not result in checks.
 

merge here is very similar to a join in a relational database. Let's see how to use it through examples.

 
  • on: the field to join on, which must exist in both DataFrames (if it doesn't exist in both, set left_on and right_on separately)
 

In [20]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on='key')
Out[20]:
 


  key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
 

There can also be multiple join keys:

 

In [21]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
Out[21]:
 


  key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
 
  • how: specifies how the data is merged; entries without a match become NaN. The default is inner.
 

We didn't specify how above, so the default inner was used. Now let's look at left, right, and outer.

 

Left outer join:

 

In [22]:
pd.merge(left, right, how='left', on=['key1', 'key2'])
Out[22]:
 


  key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
 

Right outer join:

 

In [23]:
pd.merge(left, right, how='right', on=['key1', 'key2'])
Out[23]:
 


  key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
 

Full outer join:

 

In [24]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])
Out[24]:
 


  key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
 

If indicator is set to True, a column named _merge is added showing where each row came from: left_only means the key appears only in the left table, right_only means it appears only in the right table, and both means it appears in both:

In [25]:
pd.merge(left, right, how='outer', on=['key1', 'key2'], indicator=True)
Out[25]:
 


  key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
1 K0 K1 A1 B1 NaN NaN left_only
2 K1 K0 A2 B2 C1 D1 both
3 K1 K0 A2 B2 C2 D2 both
4 K2 K1 A3 B3 NaN NaN left_only
5 K2 K0 NaN NaN C3 D3 right_only
 

When the join fields on the left and right have different names, set left_on and right_on separately:

In [26]:
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, left_on='key1', right_on='key2')
Out[26]:
 


  key1 A B key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 K3 A3 B3 K3 C3 D3
 

What happens when non-key columns have the same name?

In [27]:
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'B': [30, 50, 70, 90]})
pd.merge(left, right, left_on='key1', right_on='key2')
Out[27]:
 


  key1 A B_x key2 C B_y
0 K0 A0 B0 K0 C0 30
1 K1 A1 B1 K1 C1 50
2 K2 A2 B2 K2 C2 70
3 K3 A3 B3 K3 C3 90
 

The default is suffixes=('_x', '_y'), which you can change:

In [28]:
pd.merge(left, right, left_on='key1', right_on='key2',
suffixes=('_left', '_right'))
Out[28]:
 


  key1 A B_left key2 C B_right
0 K0 A0 B0 K0 C0 30
1 K1 A1 B1 K1 C1 50
2 K2 A2 B2 K2 C2 70
3 K3 A3 B3 K3 C3 90
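
The validate parameter listed in the signature above can guard against accidental duplicate keys. A sketch using the current left and right (their keys are unique, so the check passes; otherwise pandas raises a MergeError):

# Assert a one-to-one merge on the given keys
pd.merge(left, right, left_on='key1', right_on='key2', validate='1:1')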
 

append

 

append adds rows and returns a new object; it's another simple, commonly used way to merge data.

 
Signature:
df.append(other, ignore_index=False, verify_integrity=False, sort=False) -> 'DataFrame'
Docstring:
Append rows of `other` to the end of caller, returning a new object.
 

Parameters:

  • other: the DataFrame or Series to append
  • ignore_index: if True, re-number with a fresh default index
  • verify_integrity: if True, raise an error on duplicate index values
  • sort: whether to sort the columns
 

Append data with the same structure:

 

In [29]:
df1.append(df2)
Out[29]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
 

Append data with a different structure; missing columns are added, and entries without a value become NaN:

 

In [30]:
df1.append(df4, sort=False)
Out[30]:
 


  A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
2 NaN B2 NaN D2 F2
3 NaN B3 NaN D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7
 

Append multiple DataFrames:

In [31]:
df1.append([df2, df3])
Out[31]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
 

Ignore the original indexes:

 

In [32]:
df1.append(df4, ignore_index=True, sort=False)
Out[32]:
 


  A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
4 NaN B2 NaN D2 F2
5 NaN B3 NaN D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7
 

Append a Series:

 

In [33]:
s2 = pd.Series(['X0', 'X1', 'X2', 'X3'],
               index=['A', 'B', 'C', 'D'])
df1.append(s2, ignore_index=True)
Out[33]:
 


  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 X0 X1 X2 X3
 

Append a list of dicts:

 

In [34]:
d = [{'A': 1, 'B': 2, 'C': 3, 'X': 4},
     {'A': 5, 'B': 6, 'C': 7, 'Y': 8}]
df1.append(d, ignore_index=True, sort=False)
Out[34]:
 


  A B C D X Y
0 A0 B0 C0 D0 NaN NaN
1 A1 B1 C1 D1 NaN NaN
2 A2 B2 C2 D2 NaN NaN
3 A3 B3 C3 D3 NaN NaN
4 1 2 3 NaN 4.0 NaN
5 5 6 7 NaN NaN 8.0
 

Now a practical case. When working in Excel, you often add a summary row at the end of the data, e.g. a sum or an average. How do we do that with pandas?

In [35]:
df = pd.DataFrame(np.random.randint(1, 10, size=(3, 4)),
columns=['a', 'b', 'c', 'd'])
df
Out[35]:
 


  a b c d
0 8 5 6 1
1 6 2 8 5
2 7 2 2 3
In [36]:
df.append(pd.Series(df.sum(), name='total'))
Out[36]:
 


  a b c d
0 8 5 6 1
1 6 2 8 5
2 7 2 2 3
total 21 9 16 9
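
Note: DataFrame.append was deprecated in pandas 1.4 and later removed; pd.concat covers the same use cases. A sketch of the summary-row trick written with concat (the name total is just illustrative):

# Same summary row via concat instead of append
total = df.sum().rename('total')
pd.concat([df, total.to_frame().T])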
 

Data Cleaning

 

Data cleaning is the process of detecting and correcting identifiable errors in a dataset, including checking data consistency and handling invalid and missing values. Its purpose is to maximize the accuracy of the dataset.

 

Missing Values

 

Because data sources are complex and uncertain, incomplete or missing field values are inevitable. Let's first see how to find these missing values.

 

We'll use movie data as the example here. The dataset comes from GitHub; to make testing easier it was zipped and uploaded to cnblogs. Original data link:

https://github.com/LearnDataSci/articles/blob/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners/IMDB-Movie-Data.csv

In [37]:
movies = pd.read_csv('https://files.cnblogs.com/files/blogs/478024/IMDB-Movie-Data.csv.zip')
movies.tail(2)
Out[37]:
 


  Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
998 999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0
999 1000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0
 

You can see a NaN in the Revenue (Millions) column above; NaN is the marker for a missing value.

 

NaN (not a number) is the standard missing data marker used in pandas.

 

Check whether the dataset has any missing values:

In [38]:
np.any(movies.isna())
Out[38]:
True
 

Or check whether the dataset is free of missing values:

In [39]:
np.all(movies.notna())
Out[39]:
False
 

Count the missing values in each column:

In [40]:
movies.isna().sum(axis=0)
Out[40]:
Rank                    0
Title 0
Genre 0
Description 0
Director 0
Actors 0
Year 0
Runtime (Minutes) 0
Rating 0
Votes 0
Revenue (Millions) 128
Metascore 64
dtype: int64
 

Count the missing values in each row:

In [41]:
movies.isna().sum(axis=1)
Out[41]:
0      0
1 0
2 0
3 0
4 0
..
995 1
996 0
997 0
998 1
999 0
Length: 1000, dtype: int64
 

Count the total number of missing values:

In [42]:
movies.isna().sum().sum()
Out[42]:
192
 

Find the columns that contain missing values:

In [43]:
movies.isnull().any(axis=0)
Out[43]:
Rank                  False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) True
Metascore True
dtype: bool
 

Count the columns that contain missing values:

In [44]:
movies.isnull().any(axis=0).sum()
Out[44]:
2
 

Find the rows that contain missing values:

In [45]:
movies.isnull().any(axis=1)
Out[45]:
0      False
1 False
2 False
3 False
4 False
...
995 True
996 False
997 False
998 True
999 False
Length: 1000, dtype: bool
 

Count the rows that contain missing values (fewer than the 192 missing cells above, since some rows are missing both Revenue (Millions) and Metascore):

In [46]:
movies.isnull().any(axis=1).sum()
Out[46]:
162
 

View the rows where Metascore is missing:

In [47]:
movies[movies['Metascore'].isna()][['Rank', 'Title', 'Votes', 'Metascore']]
Out[47]:
 


  Rank Title Votes Metascore
25 26 Paris pieds nus 222 NaN
26 27 Bahubali: The Beginning 76193 NaN
27 28 Dead Awake 523 NaN
39 40 5-25-77 241 NaN
42 43 Don't Fuck in the Woods 496 NaN
... ... ... ... ...
967 968 The Walk 92378 NaN
969 970 The Lone Ranger 190855 NaN
971 972 Disturbia 193491 NaN
989 990 Selma 67637 NaN
992 993 Take Me Home Tonight 45419 NaN

64 rows × 4 columns

In [48]:
movies.shape
Out[48]:
(1000, 12)
In [49]:
movies.count()
Out[49]:
Rank                  1000
Title 1000
Genre 1000
Description 1000
Director 1000
Actors 1000
Year 1000
Runtime (Minutes) 1000
Rating 1000
Votes 1000
Revenue (Millions) 872
Metascore 936
dtype: int64
 

As you can see, each column should have 1000 values, but count() doesn't include missing values.

 

How to handle missing values should be decided based on the specific business scenario and the required data completeness. Common approaches include dropping data with missing values (dropna) and replacing missing values (fillna).

 

Dropping

 

In some scenarios, a sample with missing values is considered invalid, so the whole row or column needs to be dropped (dropna).

 

Count the rows without missing values:

In [50]:
movies.notna().all(axis=1).sum()
Out[50]:
838
 

Drop all rows that contain missing values:

In [51]:
data = movies.dropna()
data.shape
Out[51]:
(838, 12)
In [52]:
movies.shape
Out[52]:
(1000, 12)
 

All 838 rows in data are free of missing values. Note, however, that dropna() does not drop from the original dataset unless you specify dropna(inplace=True). A demonstration:

In [53]:
# Copy the full dataset to demonstrate in-place dropping
movies_copy = movies.copy()
print(movies_copy.shape)

# Drop in place
movies_copy.dropna(inplace=True)

# Check whether the in-place drop took effect
movies_copy.shape
 
(1000, 12)
Out[53]:
(838, 12)
 

Count the columns without missing values:

In [54]:
movies.notna().all(axis=0).sum()
Out[54]:
10
 

Drop the columns that contain missing values:

In [55]:
data = movies.dropna(axis=1)
data.shape
Out[55]:
(1000, 10)
 

how : {'any', 'all'}, default 'any'. Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

  • 'any' : If any NA values are present, drop that row or column.
  • 'all' : If all values are NA, drop that row or column.
 

Drop the rows where all values are missing:

In [56]:
# Drop the rows where all elements are missing
data = movies.dropna(how='all')
data.shape
Out[56]:
(1000, 12)
 

This dataset has no rows where every value is missing, so dropna() with how='all' has no effect here.
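
To see how='all' actually take effect, here's a toy sketch (the frame toy is made up and not part of the movie data):

# One normal row, one row that is entirely NaN
toy = pd.DataFrame({'a': [1, np.nan], 'b': [2, np.nan]})
toy.dropna(how='all')  # drops only the all-NaN second row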

 

subset: Define in which columns to look for missing values.

 

The subset parameter specifies which columns to look in for missing values:

In [57]:
# Look for missing values in the Title and Metascore columns only
data = movies.dropna(subset=['Title', 'Metascore'])
data.shape
Out[57]:
(936, 12)
 

Count the rows where both Title and Metascore are non-missing. Logically, it must equal the row count after the drop above.

In [58]:
movies[['Title', 'Metascore']].notna().all(axis=1).sum()
Out[58]:
936
 

thresh : int, optional. Require that many non-NA values.

In [59]:
# Keep only the rows with at least 2 non-NA values.
data = movies[['Title', 'Metascore', 'Revenue (Millions)']].dropna(thresh=2)
data.shape
Out[59]:
(970, 3)
 

Since the Title column has no missing values, this is equivalent to dropping the rows where both Metascore and Revenue (Millions) are missing, as below:

In [60]:
data = movies[['Metascore', 'Revenue (Millions)']].dropna(how='all')
data.shape
Out[60]:
(970, 2)
 

Filling

 

The other common way to handle missing values is filling (fillna). Filled values aren't exactly accurate, but when their impact on the final result is small, they can be worth a try.

 

Let's look at a simple example first and then apply it to the movie data above, since the movie dataset is too large to demonstrate things intuitively.

In [61]:
hero = pd.DataFrame(data={'score': [97, np.nan, 96, np.nan, 95],
                          'wins': [np.nan, 9, np.nan, 11, 10],
                          'author': ['古龙', '金庸', np.nan, np.nan, np.nan],
                          'book': ['多情剑客无情剑', '笑傲江湖', '倚天屠龙记', '射雕英雄传', '绝代双骄'],
                          'skill': ['小李飞刀', '独孤九剑', '九阳神功', '降龙十八掌', '移花接玉'],
                          'wife': [np.nan, '任盈盈', np.nan, '黄蓉', np.nan],
                          'child': [np.nan, np.nan, np.nan, '郭襄', np.nan]},
                    index=['李寻欢', '令狐冲', '张无忌', '郭靖', '花无缺'])
hero
Out[61]:
 


  score wins author book skill wife child
李寻欢 97.0 NaN 古龙 多情剑客无情剑 小李飞刀 NaN NaN
令狐冲 NaN 9.0 金庸 笑傲江湖 独孤九剑 任盈盈 NaN
张无忌 96.0 NaN NaN 倚天屠龙记 九阳神功 NaN NaN
郭靖 NaN 11.0 NaN 射雕英雄传 降龙十八掌 黄蓉 郭襄
花无缺 95.0 10.0 NaN 绝代双骄 移花接玉 NaN NaN
 

Fill everything with unknown:

In [62]:
hero.fillna('unknown')
Out[62]:
 


  score wins author book skill wife child
李寻欢 97 unknown 古龙 多情剑客无情剑 小李飞刀 unknown unknown
令狐冲 unknown 9 金庸 笑傲江湖 独孤九剑 任盈盈 unknown
张无忌 96 unknown unknown 倚天屠龙记 九阳神功 unknown unknown
郭靖 unknown 11 unknown 射雕英雄传 降龙十八掌 黄蓉 郭襄
花无缺 95 10 unknown 绝代双骄 移花接玉 unknown unknown
 

Fill only the first missing value in each column (limit=1):

In [63]:
hero.fillna('unknown', limit=1)
Out[63]:
 


  score wins author book skill wife child
李寻欢 97 unknown 古龙 多情剑客无情剑 小李飞刀 unknown unknown
令狐冲 unknown 9 金庸 笑傲江湖 独孤九剑 任盈盈 NaN
张无忌 96 NaN unknown 倚天屠龙记 九阳神功 NaN NaN
郭靖 NaN 11 NaN 射雕英雄传 降龙十八掌 黄蓉 郭襄
花无缺 95 10 NaN 绝代双骄 移花接玉 NaN NaN
 

Fill different columns with different values:

In [64]:
hero.fillna(value={'score': 100, 'author': '匿名', 'wife': '保密'})
Out[64]:
 


  score wins author book skill wife child
李寻欢 97.0 NaN 古龙 多情剑客无情剑 小李飞刀 保密 NaN
令狐冲 100.0 9.0 金庸 笑傲江湖 独孤九剑 任盈盈 NaN
张无忌 96.0 NaN 匿名 倚天屠龙记 九阳神功 保密 NaN
郭靖 100.0 11.0 匿名 射雕英雄传 降龙十八掌 黄蓉 郭襄
花无缺 95.0 10.0 匿名 绝代双骄 移花接玉 保密 NaN
 

method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None

Above we filled with fixed values; you can also specify a fill method.

 

pad / ffill: propagate last valid observation forward to next valid

In [65]:
# Fill with the previous valid value (forward fill)
hero.fillna(method='ffill')
Out[65]:
 


  score wins author book skill wife child
李寻欢 97.0 NaN 古龙 多情剑客无情剑 小李飞刀 NaN NaN
令狐冲 97.0 9.0 金庸 笑傲江湖 独孤九剑 任盈盈 NaN
张无忌 96.0 9.0 金庸 倚天屠龙记 九阳神功 任盈盈 NaN
郭靖 96.0 11.0 金庸 射雕英雄传 降龙十八掌 黄蓉 郭襄
花无缺 95.0 10.0 金庸 绝代双骄 移花接玉 黄蓉 郭襄
 

backfill / bfill: use next valid observation to fill gap.

In [66]:
# Fill with the next valid value (backward fill)
hero.fillna(method='bfill')
Out[66]:
 


  score wins author book skill wife child
李寻欢 97.0 9.0 古龙 多情剑客无情剑 小李飞刀 任盈盈 郭襄
令狐冲 96.0 9.0 金庸 笑傲江湖 独孤九剑 任盈盈 郭襄
张无忌 96.0 11.0 NaN 倚天屠龙记 九阳神功 黄蓉 郭襄
郭靖 95.0 11.0 NaN 射雕英雄传 降龙十八掌 黄蓉 郭襄
花无缺 95.0 10.0 NaN 绝代双骄 移花接玉 NaN NaN
 

Fill the missing values in the score and wins columns with their means:

In [67]:
hero.fillna(hero[['score', 'wins']].mean())
Out[67]:
 


  score wins author book skill wife child
李寻欢 97.0 10.0 古龙 多情剑客无情剑 小李飞刀 NaN NaN
令狐冲 96.0 9.0 金庸 笑傲江湖 独孤九剑 任盈盈 NaN
张无忌 96.0 10.0 NaN 倚天屠龙记 九阳神功 NaN NaN
郭靖 96.0 11.0 NaN 射雕英雄传 降龙十八掌 黄蓉 郭襄
花无缺 95.0 10.0 NaN 绝代双骄 移花接玉 NaN NaN
 

Fill a specific column with unknown, replacing in place:

In [68]:
hero.child.fillna('unknown', inplace=True)
hero
Out[68]:
 


  score wins author book skill wife child
李寻欢 97.0 NaN 古龙 多情剑客无情剑 小李飞刀 NaN unknown
令狐冲 NaN 9.0 金庸 笑傲江湖 独孤九剑 任盈盈 unknown
张无忌 96.0 NaN NaN 倚天屠龙记 九阳神功 NaN unknown
郭靖 NaN 11.0 NaN 射雕英雄传 降龙十八掌 黄蓉 郭襄
花无缺 95.0 10.0 NaN 绝代双骄 移花接玉 NaN unknown
 

Back to the real movie data from above. Now replace the missing values with the column means:

In [69]:
filled_movies = movies.fillna(
    movies[['Revenue (Millions)', 'Metascore']].mean())

np.any(filled_movies.isna())
Out[69]:
False
 

As you can see, the filled movie data no longer contains any missing values.

 

Replacing Values

 

Replacement is commonly used in data cleaning, enum conversion, data correction, and similar scenarios.

 

First, an introduction to the replace() method:

 
Signature:
replace(
to_replace=None,
value=None,
inplace=False,
limit=None,
regex=False,
method='pad',
)
Docstring:
Replace values given in `to_replace` with `value`.
 

Now a few examples:

In [70]:
s = pd.Series(['a', 'b', 'c', 'd', 'e'])
s
Out[70]:
0    a
1 b
2 c
3 d
4 e
dtype: object
In [71]:
s.replace('a', 'aa')
Out[71]:
0    aa
1 b
2 c
3 d
4 e
dtype: object
In [72]:
s.replace({'d': 'dd', 'e': 'ee'})
Out[72]:
0     a
1 b
2 c
3 dd
4 ee
dtype: object
In [73]:
s.replace(['a', 'b', 'c'], ['aa', 'bb', 'cc'])
Out[73]:
0    aa
1 bb
2 cc
3 d
4 e
dtype: object
In [74]:
# Replace 'c' with the value before it
s.replace('c', method='ffill')
Out[74]:
0    a
1 b
2 b
3 d
4 e
dtype: object
In [75]:
# Replace 'c' with the value after it
s.replace('c', method='bfill')
Out[75]:
0    a
1 b
2 d
3 d
4 e
dtype: object
In [76]:
df = pd.DataFrame({'A': [0, -11, 2, 3, 35],
'B': [0, -20, 2, 5, 16]})
df
Out[76]:
 


  A B
0 0 0
1 -11 -20
2 2 2
3 3 5
4 35 16
In [77]:
df.replace(0, 5)
Out[77]:
 


  A B
0 5 5
1 -11 -20
2 2 2
3 3 5
4 35 16
In [78]:
df.replace([0, 2, 3, 5], 10)
Out[78]:
 


  A B
0 10 10
1 -11 -20
2 10 10
3 10 10
4 35 16
In [79]:
df.replace([0, 2], [100, 200])
Out[79]:
 


  A B
0 100 100
1 -11 -20
2 200 200
3 3 5
4 35 16
In [80]:
df.replace({0: 10, 2: 22})
Out[80]:
 


  A B
0 10 10
1 -11 -20
2 22 22
3 3 5
4 35 16
In [81]:
df.replace({'A': 0, 'B': 2}, 100)
Out[81]:
 


  A B
0 100 0
1 -11 -20
2 2 100
3 3 5
4 35 16
In [82]:
df.replace({'A': {2: 200, 3: 300}})
Out[82]:
 


  A B
0 0 0
1 -11 -20
2 200 2
3 300 5
4 35 16
 

For extreme values, whether too large or too small, df.clip(lower, upper) can trim them: values above upper are replaced with upper and values below lower with lower, similar to numpy.clip.

 

Before clipping, one more look at the original data:

In [83]:
df
Out[83]:
 


  A B
0 0 0
1 -11 -20
2 2 2
3 3 5
4 35 16
In [84]:
# Clip to a minimum of 2 and a maximum of 10
df.clip(2, 10)
Out[84]:
 


  A B
0 2 2
1 2 2
2 2 2
3 3 5
4 10 10
In [85]:
# Apply different lower/upper bounds to each column

# Clip column A to [-3, 3]
# Clip column B to [-5, 5]
df.clip([-3, -5], [3, 5], axis=1)
Out[85]:
 


  A B
0 0 0
1 -3 -5
2 2 2
3 3 5
4 3 5
In [86]:
# Apply different lower/upper bounds to each row

# Clip row 1 to [5, 10]
# Clip row 2 to [-15, -12]
# Clip row 3 to [6, 10]
# Clip row 4 to [4, 10]
# Clip row 5 to [20, 30]
df.clip([5, -15, 6, 4, 20],
        [10, -12, 10, 10, 30],
        axis=0)
Out[86]:
 


  A B
0 5 5
1 -12 -15
2 6 6
3 4 5
4 30 20
 

Alternatively, you can first replace invalid values with NaN and then handle them as missing values, which lets you apply everything from the missing-values section above.

 

For example, if we consider all values below 0 in this df invalid, we can do:

In [87]:
df.replace([-11, -20], np.nan)
Out[87]:
 


  A B
0 0.0 0.0
1 NaN NaN
2 2.0 2.0
3 3.0 5.0
4 35.0 16.0
 

Of course, invalid data can also be turned into NaN like this:

In [88]:
df[df >= 0]
Out[88]:
 


  A B
0 0.0 0.0
1 NaN NaN
2 2.0 2.0
3 3.0 5.0
4 35.0 16.0
 

At this point, the missing-value handling described above comes in handy.
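
For instance, the two steps can be chained (a sketch; filling with 0 is an arbitrary choice here):

# Mask negative values as NaN, then fill the gaps
df[df >= 0].fillna(0)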

 

When text content is more complex, regular expressions can be used for matching and replacement. A few examples:

In [89]:
df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
'B': ['abc', 'bar', 'xyz']})
df
Out[89]:
 


  A B
0 bat abc
1 foo bar
2 bait xyz
In [90]:
# Use a regex to replace strings that start with 'ba' and are 3 characters long with 'new'
df.replace(to_replace=r'^ba.$', value='new', regex=True)
Out[90]:
 


  A B
0 new abc
1 foo new
2 bait xyz
In [91]:
# When columns need different regexes, pass them per column like this
df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
Out[91]:
 


  A B
0 new abc
1 foo bar
2 bait xyz
In [92]:
df.replace(regex=r'^ba.$', value='new')
Out[92]:
 


  A B
0 new abc
1 foo new
2 bait xyz
In [93]:
# Replace different regex matches with different values
df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
Out[93]:
 


  A B
0 new abc
1 xyz new
2 bait xyz
In [94]:
# Replace multiple regex matches with the same value
df.replace(regex=[r'^ba.$', 'foo'], value='new')
Out[94]:
 


  A B
0 new abc
1 new new
2 bait xyz
 

Duplicates

 

Duplicate values may need to be removed during data cleaning. Here's how pandas identifies and removes them.

 
Signature:
df.duplicated(
subset: Union[Hashable, Sequence[Hashable], NoneType] = None,
keep: Union[str, bool] = 'first',
) -> 'Series'
Docstring:
Return boolean Series denoting duplicate rows.

Considering certain columns is optional.

Parameters
----------
subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates, by
    default use all of the columns.
keep : {'first', 'last', False}, default 'first'
    Determines which duplicates (if any) to mark.

    - ``first`` : Mark duplicates as ``True`` except for the first occurrence.
    - ``last`` : Mark duplicates as ``True`` except for the last occurrence.
    - False : Mark all duplicates as ``True``.
 

See the official examples:

In [95]:
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
df
Out[95]:
 


  brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
In [96]:
# By default, the first occurrence of each duplicated row is marked False, the rest True
df.duplicated()
Out[96]:
0    False
1 True
2 False
3 False
4 False
dtype: bool
In [97]:
# Mark the last occurrence of each duplicated row False, the rest True
df.duplicated(keep='last')
Out[97]:
0     True
1 False
2 False
3 False
4 False
dtype: bool
In [98]:
# Mark all duplicated rows True
df.duplicated(keep=False)
Out[98]:
0     True
1 True
2 False
3 False
4 False
dtype: bool
In [99]:
# The subset parameter looks for duplicates in the given column(s) only
df.duplicated(subset=['brand'])
Out[99]:
0    False
1 True
2 False
3 True
4 True
dtype: bool
 

Now, how to remove duplicates:

 
Signature:
df.drop_duplicates(
subset: Union[Hashable, Sequence[Hashable], NoneType] = None,
keep: Union[str, bool] = 'first',
inplace: bool = False,
ignore_index: bool = False,
) -> Union[ForwardRef('DataFrame'), NoneType]
Docstring:
Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes
are ignored.

Parameters
----------
subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by
default use all of the columns.
keep : {'first', 'last', False}, default 'first'
Determines which duplicates (if any) to keep.
- ``first`` : Drop duplicates except for the first occurrence.
- ``last`` : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
inplace : bool, default False
Whether to drop duplicates in place or to return a copy.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
 

Continuing with the official examples:

In [100]:
# By default, it removes duplicate rows based on all columns
df.drop_duplicates()
Out[100]:
 


  brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
In [101]:
# To remove duplicates on specific column(s), use `subset`
df.drop_duplicates(subset=['brand'])
Out[101]:
 


  brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
In [102]:
# To remove duplicates and keep last occurrences, use `keep`
df.drop_duplicates(subset=['brand', 'style'], keep='last')
Out[102]:
 


  brand style rating
1 Yum Yum cup 4.0
2 Indomie cup 3.5
4 Indomie pack 5.0
 

Grouping and Aggregation

 

 

Grouping and aggregation are extremely common in data statistics and analysis. In SQL terms, they correspond to GROUP BY and aggregation functions. Let's see how pandas handles them.

 
Signature:
df.groupby(
by=None,
axis=0,
level=None,
as_index: bool = True,
sort: bool = True,
group_keys: bool = True,
squeeze: bool = <object object at 0x7f3df810e750>,
observed: bool = False,
dropna: bool = True,
) -> 'DataFrameGroupBy'
Docstring:
Group DataFrame using a mapper or by a Series of columns.
 

groupby() groups a DataFrame by the given field(s) into a grouper object, whose fields can then be output using some aggregation method.

 

by is the grouping field; since it's the first parameter, its name can be omitted, and a list of several fields can be passed. It returns a DataFrameGroupBy object; without an aggregation method, no DataFrame is returned.
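
A quick illustration of that last point (a sketch with made-up data, not from the original notebook):

tmp = pd.DataFrame({'team': ['A', 'B', 'A'], 'score': [1, 2, 3]})
g = tmp.groupby('team')    # a DataFrameGroupBy object, not yet a DataFrame
print(type(g).__name__)    # 'DataFrameGroupBy'
g.sum()                    # applying an aggregation yields a DataFrame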

 

Prepare the demo data:

In [103]:
df = pd.read_csv('https://files.cnblogs.com/files/blogs/478024/team.csv.zip')
df
Out[103]:
 


  name team Q1 Q2 Q3 Q4
0 Liver E 89 21 24 64
1 Arry C 36 37 37 57
2 Ack A 57 60 18 84
3 Eorge C 93 96 71 78
4 Oah D 65 49 61 86
... ... ... ... ... ... ...
95 Gabriel C 48 59 87 74
96 Austin7 C 21 31 30 43
97 Lincoln4 C 98 93 1 20
98 Eli E 11 74 58 91
99 Ben E 21 43 41 74

100 rows × 6 columns

In [104]:
# Group by team, then sum the corresponding columns
df.groupby('team').sum()
Out[104]:
 


  Q1 Q2 Q3 Q4
team        
A 1066 639 875 783
B 975 1218 1202 1136
C 1056 1194 1068 1127
D 860 1191 1241 1199
E 963 1013 881 1033
In [105]:
# Group by team, then average the corresponding columns
df.groupby('team').mean()
Out[105]:
 


  Q1 Q2 Q3 Q4
team        
A 62.705882 37.588235 51.470588 46.058824
B 44.318182 55.363636 54.636364 51.636364
C 48.000000 54.272727 48.545455 51.227273
D 45.263158 62.684211 65.315789 63.105263
E 48.150000 50.650000 44.050000 51.650000
In [106]:
# Group by team, then aggregate each column differently
df.groupby('team').agg({'Q1': sum,       # sum
                        'Q2': 'count',   # count
                        'Q3': 'mean',    # mean
                        'Q4': max})      # max
Out[106]:
 


  Q1 Q2 Q3 Q4
team        
A 1066 17 51.470588 97
B 975 22 54.636364 99
C 1056 22 48.545455 98
D 860 19 65.315789 99
E 963 20 44.050000 98
 

If by is a function, it's called on each value of the object's index.

In [107]:
# Teams up to and including C form one group, the rest form another
df.set_index('team').groupby(lambda team: 'team1' if team <= 'C' else 'team2')['name'].count()
Out[107]:
team1    61
team2 39
Name: name, dtype: int64
 

The following way of writing it works too:

In [108]:
df.groupby(lambda idx: 'team1' if df.loc[idx]['team'] <= 'C' else 'team2')['name'].count()
Out[108]:
team1    61
team2 39
Name: name, dtype: int64
In [109]:
# Group by the length of name, and take the first and last name in each group
df.groupby(df['name'].apply(lambda x: len(x))).agg({'name': ['first', 'last']})
Out[109]:
 


  name
  first last
name    
3 Ack Ben
4 Arry Leon
5 Liver Aiden
6 Harlie Jamie0
7 William Austin7
8 Harrison Lincoln4
9 Alexander Theodore3
In [110]:
# Group only part of the data (index values missing from the mapping are excluded)
df.set_index('team').groupby({'A': 'A组', 'B': 'B组'})['name'].count()
Out[110]:
A组    17
B组 22
Name: name, dtype: int64
 

The approaches above can be mixed in a list for grouping:

In [111]:
# Group by team and name length, taking the last row of each group
df.groupby(['team', df['name'].apply(lambda x: len(x))]).last()
Out[111]:
 


    name Q1 Q2 Q3 Q4
team name          
A 3 Ack 57 60 18 84
4 Toby 52 27 17 68
5 Aaron 96 75 55 8
6 Nathan 87 77 62 13
7 Stanley 69 71 39 97
B 3 Kai 66 45 13 48
4 Liam 2 80 24 25
5 Lewis 4 34 77 28
6 Jamie0 39 97 84 55
7 Albert0 85 38 41 17
8 Grayson7 59 84 74 33
C 4 Adam 90 32 47 39
5 Calum 14 91 16 82
6 Connor 62 38 63 46
7 Austin7 21 31 30 43
8 Lincoln4 98 93 1 20
9 Sebastian 1 14 68 48
D 3 Oah 65 49 61 86
4 Ezra 16 56 86 61
5 Aiden 20 31 62 68
6 Reuben 70 72 76 56
7 Hunter3 38 80 82 40
8 Benjamin 15 88 52 25
9 Theodore3 43 7 68 80
E 3 Ben 21 43 41 74
4 Leon 38 60 31 7
5 Roman 73 1 25 44
6 Dexter 73 94 53 20
7 Zachary 12 71 85 93
8 Jackson5 6 10 15 33
 

We can groupby different levels of a hierarchical index using the level parameter.

In [112]:
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
          ['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
                  index=index)
df
Out[112]:
 


    Max Speed
Animal Type  
Falcon Captive 390.0
Wild 350.0
Parrot Captive 30.0
Wild 20.0
In [113]:
# df.groupby(level=0).mean()
df.groupby(level="Animal").mean()
Out[113]:
 


  Max Speed
Animal  
Falcon 370.0
Parrot 25.0
In [114]:
# df.groupby(level=1).mean()
df.groupby(level="Type").mean()
Out[114]:
 


  Max Speed
Type  
Captive 210.0
Wild 185.0
 

We can also choose to include NA in group keys or not by setting dropna parameter, the default setting is True.

In [115]:
l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
df
Out[115]:
 


  a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
In [116]:
df.groupby(by=["b"]).sum()
Out[116]:
 


  a c
b    
1.0 2 3
2.0 2 5
In [117]:
df.groupby(by=["b"], dropna=False).sum()
Out[117]:
 


  a c
b    
1.0 2 3
2.0 2 5
NaN 1 4
 

Now that we've had a taste of the basics of pandas grouping and aggregation, let's look at some details of the process.

 

Grouping

 

Given the following data on animals' maximum speeds:

In [118]:
df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
                   ('bird', 'Psittaciformes', 24.0),
                   ('mammal', 'Carnivora', 80.2),
                   ('mammal', 'Primates', np.nan),
                   ('mammal', 'Carnivora', 58)],
                  index=['falcon', 'parrot', 'lion',
                         'monkey', 'leopard'],
                  columns=('class', 'order', 'max_speed'))
df
Out[118]:
 


  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
In [119]:
# Number of groups
df.groupby('class').ngroups
Out[119]:
2
In [120]:
# Inspect the groups
df.groupby('class').groups
Out[120]:
{'bird': ['falcon', 'parrot'], 'mammal': ['lion', 'monkey', 'leopard']}
In [121]:
df.groupby('class').size()
Out[121]:
class
bird 2
mammal 3
dtype: int64
In [122]:
# View the contents of the bird group
df.groupby('class').get_group('bird')
Out[122]:
 


  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
 

Get the nth row of each group:

In [123]:
# The row at position 1 (i.e. the second row) of each group
df.groupby('class').nth(1)
Out[123]:
 


  order max_speed
class    
bird Psittaciformes 24.0
mammal Primates NaN
In [124]:
# The last row of each group
df.groupby('class').nth(-1)
Out[124]:
 


  order max_speed
class    
bird Psittaciformes 24.0
mammal Carnivora 58.0
In [125]:
# The rows at positions 1 and 2 of each group
df.groupby('class').nth([1, 2])
Out[125]:
 


  order max_speed
class    
bird Psittaciformes 24.0
mammal Primates NaN
mammal Carnivora 58.0
In [126]:
# The first 2 rows of each group
df.groupby('class').head(2)
Out[126]:
 


  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
In [127]:
# The last 2 rows of each group
df.groupby('class').tail(2)
Out[127]:
 


  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
In [128]:
# Group number of each row
df.groupby('class').ngroup()
Out[128]:
falcon     0
parrot 0
lion 1
monkey 1
leopard 1
dtype: int64
In [129]:
# Each row's position within its group (counting from the end here)
df.groupby('class').cumcount(ascending=False)
Out[129]:
falcon     1
parrot 0
lion 2
monkey 1
leopard 0
dtype: int64
In [130]:
# Group by the first letter of class
df.groupby(df['class'].str[0]).groups
Out[130]:
{'b': ['falcon', 'parrot'], 'm': ['lion', 'monkey', 'leopard']}
In [131]:
# Group by the first and second letters of class
df.groupby([df['class'].str[0], df['class'].str[1]]).groups
Out[131]:
{('b', 'i'): ['falcon', 'parrot'], ('m', 'a'): ['lion', 'monkey', 'leopard']}
In [132]:
# Rank within each group
df.groupby('class').rank()
Out[132]:
 


  max_speed
falcon 2.0
parrot 1.0
lion 2.0
monkey NaN
leopard 1.0
 

Aggregation

 

After grouping the data, it's time to reap the harvest: apply a statistical method to the groups and you get the aggregated result. Besides common statistical methods, functions such as agg() and transform() are available.

In [133]:
# Descriptive statistics
df.groupby('class').describe()
Out[133]:
 


  max_speed
  count mean std min 25% 50% 75% max
class                
bird 2.0 206.5 258.093975 24.0 115.25 206.5 297.75 389.0
mammal 2.0 69.1 15.697771 58.0 63.55 69.1 74.65 80.2
In [134]:
# Multiple aggregation methods on a single column
df.groupby('class').agg({'max_speed': ['min', 'max', 'sum']})
Out[134]:
 


  max_speed
  min max sum
class      
bird 24.0 389.0 413.0
mammal 58.0 80.2 138.2
In [135]:
df.groupby('class')['max_speed'].agg(
Max='max', Min='min', Diff=lambda x: x.max() - x.min())
Out[135]:
 


  Max Min Diff
class      
bird 389.0 24.0 365.0
mammal 80.2 58.0 22.2
In [136]:
df.groupby('class').agg(max_speed=('max_speed', 'max'),
count_order=('order', 'count'))
Out[136]:
 


  max_speed count_order
class    
bird 389.0 2
mammal 80.2 3
In [137]:
df.groupby('class').agg(
max_speed=pd.NamedAgg(column='max_speed', aggfunc='max'),
count_order=pd.NamedAgg(column='order', aggfunc='count')
)
Out[137]:
 


  max_speed count_order
class    
bird 389.0 2
mammal 80.2 3
 

transform is similar to agg, except that it returns a DataFrame in which every original value is replaced by the group statistic; for example, when computing per-group means, each value in the returned DataFrame is the mean of the group it belongs to.

In [138]:
df.groupby('class').agg(np.mean)
Out[138]:
 


  max_speed
class  
bird 206.5
mammal 69.1
In [139]:
df.groupby('class').transform(np.mean)
Out[139]:
 


  max_speed
falcon 206.5
parrot 206.5
lion 69.1
monkey 69.1
leopard 69.1
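
A common practical use of this behavior (a sketch, not from the original notebook): fill each missing max_speed with the mean of its class group:

# Per-group mean imputation via transform
df['max_speed'].fillna(df.groupby('class')['max_speed'].transform('mean'))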
 

Filtering the original data after grouping:

 
Signature:
DataFrameGroupBy.filter(func, dropna=True, *args, **kwargs)
Docstring:
Return a copy of a DataFrame excluding filtered elements.

Elements from groups are filtered if they do not satisfy the
boolean criterion specified by func.
In [140]:
# Keep the rows whose class group has a mean max_speed greater than 100
df.groupby(['class']).filter(lambda x: x['max_speed'].mean() > 100)
Out[140]:
 


  class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
In [141]:
# Get the index values of each group
df.groupby('class').apply(lambda x: x.index.to_list())
Out[141]:
class
bird [falcon, parrot]
mammal [lion, monkey, leopard]
dtype: object
In [142]:
# The top rows by max_speed within each group (top 1 here)
df.groupby('class').apply(lambda x: x.sort_values(
by='max_speed', ascending=False).head(1))
Out[142]:
 


    class order max_speed
class        
bird falcon bird Falconiformes 389.0
mammal lion mammal Carnivora 80.2
In [143]:
df.groupby('class').apply(lambda x: pd.Series({
'speed_max': x['max_speed'].max(),
'speed_min': x['max_speed'].min(),
'speed_mean': x['max_speed'].mean(),
}))
Out[143]:
 


  speed_max speed_min speed_mean
class      
bird 389.0 24.0 206.5
mammal 80.2 58.0 69.1
 

Export an Excel file for each group:

In [144]:
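# Note: this assumes a local directory named data/ already exists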
for group, data in df.groupby('class'):
    data.to_excel(f'data/{group}.xlsx')
In [145]:
# Number of distinct values in each group
df.groupby('class').order.nunique()
Out[145]:
class
bird 2
mammal 2
Name: order, dtype: int64
In [146]:
# The distinct values in each group
df.groupby("class")['order'].unique()
Out[146]:
class
bird [Falconiformes, Psittaciformes]
mammal [Carnivora, Primates]
Name: order, dtype: object
In [147]:
# Count the values within each group
df.groupby("class")['order'].value_counts()
Out[147]:
class   order
bird Falconiformes 1
Psittaciformes 1
mammal Carnivora 2
Primates 1
Name: order, dtype: int64
In [148]:
# The largest 1 value in each group
df.groupby("class")['max_speed'].nlargest(1)
Out[148]:
class
bird falcon 389.0
mammal lion 80.2
Name: max_speed, dtype: float64
In [149]:
# The smallest 2 values in each group
df.groupby("class")['max_speed'].nsmallest(2)
Out[149]:
class
bird parrot 24.0
falcon 389.0
mammal leopard 58.0
lion 80.2
Name: max_speed, dtype: float64
In [150]:
# Whether each group's values are monotonically increasing
df.groupby("class")['max_speed'].is_monotonic_increasing
Out[150]:
class
bird False
mammal False
Name: max_speed, dtype: bool
In [151]:
# Whether each group's values are monotonically decreasing
df.groupby("class")['max_speed'].is_monotonic_decreasing
Out[151]:
class
bird True
mammal False
Name: max_speed, dtype: bool
 

Stacking and Pivoting

 

In practice, the raw data we get often isn't in the shape the task requires: wrong dimensions, not intuitive enough, not expressive enough, and so on. In such cases, the raw data can be reshaped appropriately, e.g. by stacking, pivoting, or transposing rows and columns.

 

Stacking

 

A simple example makes it clear what this is about:

In [152]:
df = pd.DataFrame([[19, 136, 180, 98], [21, 122, 178, 96]], index=['令狐冲', '李寻欢'],
columns=['age', 'weight', 'height', 'score'])
df
Out[152]:
 


  age weight height score
令狐冲 19 136 180 98
李寻欢 21 122 178 96
In [153]:
# A bit like turning a wide table into a tall one, the way I see it
df.stack()
Out[153]:
令狐冲  age        19
weight 136
height 180
score 98
李寻欢 age 21
weight 122
height 178
score 96
dtype: int64
In [154]:
# A bit like turning a tall table back into a wide one
df.stack().unstack()
Out[154]:
 


  age weight height score
令狐冲 19 136 180 98
李寻欢 21 122 178 96
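
stack()/unstack() also accept a level argument; unstacking level 0 instead of the innermost level effectively transposes the table (a sketch):

# Pivot the outer index level (the names) into columns
df.stack().unstack(0)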
 

Pivot Tables

 

 
Signature:
df.pivot(index=None, columns=None, values=None) -> 'DataFrame'
Docstring:
Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a "pivot" table) based on column values. Uses
unique values from specified `index` / `columns` to form axes of the
resulting DataFrame. This function does not support data
aggregation, multiple values will result in a MultiIndex in the
columns.
In [155]:
df = pd.DataFrame({'name': ['江小鱼', '江小鱼', '江小鱼', '花无缺', '花无缺',
'花无缺'],
'bug_level': ['A', 'B', 'C', 'A', 'B', 'C'],
'bug_count': [2, 3, 5, 1, 5, 6]})
df
Out[155]:
 


  name bug_level bug_count
0 江小鱼 A 2
1 江小鱼 B 3
2 江小鱼 C 5
3 花无缺 A 1
4 花无缺 B 5
5 花无缺 C 6
 

Reshape the bug level/count table above as follows: same data, but isn't it more intuitive?

In [156]:
df.pivot(index='name', columns='bug_level', values='bug_count')
Out[156]:
 


bug_level A B C
name      
江小鱼 2 3 5
花无缺 1 5 6
 

What if the raw data contains duplicate entries? For instance, bug statistics coming from different product lines could contain two rows like ['江小鱼','B',3] and ['江小鱼','B',4]. Let's first try pivot and see:

In [157]:
df = pd.DataFrame({'name': ['江小鱼', '江小鱼', '江小鱼', '江小鱼', '江小鱼', '花无缺', '花无缺',
'花无缺', '花无缺', '花无缺', ],
'bug_level': ['A', 'B', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
'bug_count': [2, 3, 5, 4, 6, 1, 5, 6, 3, 1],
'score': [70, 80, 90, 76, 86, 72, 82, 88, 68, 92]})
df
Out[157]:
 


  name bug_level bug_count score
0 江小鱼 A 2 70
1 江小鱼 B 3 80
2 江小鱼 C 5 90
3 江小鱼 B 4 76
4 江小鱼 C 6 86
5 花无缺 A 1 72
6 花无缺 B 5 82
7 花无缺 C 6 88
8 花无缺 A 3 68
9 花无缺 B 1 92
In [158]:
try:
    df.pivot(index='name', columns='bug_level', values='bug_count')
except ValueError as e:
    print(e)
 
Index contains duplicate entries, cannot reshape
 

It turns out pivot() only reshapes the data and doesn't support aggregation. When duplicates require aggregation, use pivot_table(), which provides Excel-style pivot-table features.

In [159]:
# Total bugs per employee at each level, across product lines
df.pivot_table(index=['name'], columns=['bug_level'],
values='bug_count', aggfunc=np.sum)
Out[159]:
 


bug_level A B C
name      
江小鱼 2 7 11
花无缺 4 6 6
 

Of course, the aggregation here can be very flexible:

 
In [161]:
df.pivot_table(index=['name'], columns=['bug_level'], aggfunc={
'bug_count': np.sum, 'score': [max, np.mean]})
Out[161]:
 


  bug_count score
  sum max mean
bug_level A B C A B C A B C
name                  
江小鱼 2 7 11 70 80 90 70 78 88
花无缺 4 6 6 72 92 88 70 87 88
 

You can also add totals for each row and column, as shown below:

In [162]:
df.pivot_table(index=['name'], columns=['bug_level'],
values='bug_count', aggfunc=np.sum, margins=True, margins_name='汇总')
Out[162]:
 


bug_level A B C 汇总
name        
江小鱼 2 7 11 20
花无缺 4 6 6 16
汇总 6 13 17 36
 

Crosstab

 

A crosstab is a special pivot table for counting group frequencies. Simply put, the unique elements of two or more columns form the rows and columns of a new DataFrame, and each cell at a row/column intersection holds the count of that combination in the original data.

 

Again, an example is more intuitive. Given the following data on students' choice of major:

In [163]:
df = pd.DataFrame({'name': ['杨过', '小龙女', '郭靖', '黄蓉', '李寻欢', '孙小红', '张无忌',
'赵敏', '令狐冲', '任盈盈'],
'gender': ['男', '女', '男', '女', '男', '女', '男', '女', '男', '女'],
'major': ['机械工程', '软件工程', '金融工程', '工商管理', '机械工程', '金融工程', '软件工程', '工商管理', '软件工程', '工商管理']})
df
Out[163]:
 


  name gender major
0 杨过 男 机械工程
1 小龙女 女 软件工程
2 郭靖 男 金融工程
3 黄蓉 女 工商管理
4 李寻欢 男 机械工程
5 孙小红 女 金融工程
6 张无忌 男 软件工程
7 赵敏 女 工商管理
8 令狐冲 男 软件工程
9 任盈盈 女 工商管理
 

To find out whether the choice of major is related to gender, tabulate as follows:

In [164]:
pd.crosstab(df['gender'], df['major'])
Out[164]:
 


major 工商管理 机械工程 软件工程 金融工程
gender        
女 3 0 1 1
男 0 2 2 1
 

Meanwhile, recall the plotting covered in the previous article: https://www.cnblogs.com/bytesfly/p/pandas-1.html#画图

In [165]:
# Pie charts of the share of majors chosen by male and female students
pd.crosstab(df['gender'], df['major']).T.plot(
kind='pie', subplots=True, figsize=(12, 8), autopct="%.0f%%")
plt.show()
 
 

Looking at it from another angle:

In [166]:
# Stacked bar chart of male/female applicants per major
pd.crosstab(df['gender'], df['major']).T.plot(
kind='bar', stacked=True, rot=0, title='各专业男女生填报人数柱状图', xlabel='', figsize=(10, 6))
plt.show()
 
 

Back to the crosstab topic above.

In [167]:
# Normalize the crosstab over all values
pd.crosstab(df['gender'], df['major'], normalize=True)
Out[167]:
 


major 工商管理 机械工程 软件工程 金融工程
gender        
女 0.3 0.0 0.1 0.1
男 0.0 0.2 0.2 0.1
In [168]:
# Normalize the crosstab across each row
pd.crosstab(df['gender'], df['major'], normalize='index')
Out[168]:
 


major 工商管理 机械工程 软件工程 金融工程
gender        
女 0.6 0.0 0.2 0.2
男 0.0 0.4 0.4 0.2
In [169]:
# Normalize the crosstab down each column
pd.crosstab(df['gender'], df['major'], normalize='columns')
Out[169]:
 


major 工商管理 机械工程 软件工程 金融工程
gender        
女 1.0 0.0 0.333333 0.5
男 0.0 1.0 0.666667 0.5
 

Likewise, totals can be added for each row and column:

In [170]:
pd.crosstab(df['gender'], df['major'], margins=True, margins_name='汇总')
Out[170]:
 


major 工商管理 机械工程 软件工程 金融工程 汇总
gender          
女 3 0 1 1 5
男 0 2 2 1 5
汇总 3 2 3 2 10
 
