任何分组(groupby)操作都涉及原始对象的以下操作之一。它们是 -

  • 分割对象
  • 应用一个函数
  • 结合的结果

在许多情况下,我们将数据分成多个集合,并在每个子集上应用一些函数。在应用函数中,可以执行以下操作 -

  • 聚合 - 计算汇总统计
  • 转换 - 执行一些特定于组的操作
  • 过滤 - 在某些情况下丢弃数据

下面来看看创建一个DataFrame对象并对其执行所有操作 -

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data) print (df)
Python

执行上面示例代码,得到以下结果 -

    Points  Rank    Team  Year
0 876 1 Riders 2014
1 789 2 Riders 2015
2 863 2 Devils 2014
3 673 3 Devils 2015
4 741 3 Kings 2014
5 812 4 kings 2015
6 756 1 Kings 2016
7 788 1 Kings 2017
8 694 2 Riders 2016
9 701 4 Royals 2014
10 804 1 Royals 2015
11 690 2 Riders 2017
Shell

将数据拆分成组

Pandas对象可以分成任何对象。有多种方式来拆分对象,如 -

  • obj.groupby(‘key’)
  • obj.groupby([‘key1’,’key2’])
  • obj.groupby(key,axis=1)

现在来看看如何将分组对象应用于DataFrame对象

示例

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data) print (df.groupby('Team'))
Python

执行上面示例代码,得到以下结果 -

<pandas.core.groupby.DataFrameGroupBy object at 0x00000245D60AD518>
Shell

查看分组

import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data) print (df.groupby('Team').groups)
Python

执行上面示例代码,得到以下结果 -

{
'Devils': Int64Index([2, 3], dtype='int64'),
'Kings': Int64Index([4, 6, 7], dtype='int64'),
'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
'Royals': Int64Index([9, 10], dtype='int64'),
'kings': Int64Index([5], dtype='int64')
}
Shell

示例

按多列分组 -

import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print (df.groupby(['Team','Year']).groups)
Python

执行上面示例代码,得到以下结果 -

{
('Devils', 2014): Int64Index([2], dtype='int64'),
('Devils', 2015): Int64Index([3], dtype='int64'),
('Kings', 2014): Int64Index([4], dtype='int64'),
('Kings', 2016): Int64Index([6], dtype='int64'),
('Kings', 2017): Int64Index([7], dtype='int64'),
('Riders', 2014): Int64Index([0], dtype='int64'),
('Riders', 2015): Int64Index([1], dtype='int64'),
('Riders', 2016): Int64Index([8], dtype='int64'),
('Riders', 2017): Int64Index([11], dtype='int64'),
('Royals', 2014): Int64Index([9], dtype='int64'),
('Royals', 2015): Int64Index([10], dtype='int64'),
('kings', 2015): Int64Index([5], dtype='int64')
}
Shell

迭代遍历分组

使用groupby对象,可以遍历类似itertools.obj的对象。

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') for name,group in grouped:
print (name)
print (group)
Python

执行上面示例代码,得到以下结果 -

2014
Points Rank Team Year
0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014
2015
Points Rank Team Year
1 789 2 Riders 2015
3 673 3 Devils 2015
5 812 4 kings 2015
10 804 1 Royals 2015
2016
Points Rank Team Year
6 756 1 Kings 2016
8 694 2 Riders 2016
2017
Points Rank Team Year
7 788 1 Kings 2017
11 690 2 Riders 2017
Shell

默认情况下,groupby对象具有与分组名相同的标签名称。

选择一个分组

使用get_group()方法,可以选择一个组。参考以下示例代码 -

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data) grouped = df.groupby('Year')
print (grouped.get_group(2014))
Python

执行上面示例代码,得到以下结果 -

   Points  Rank    Team  Year
0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014
Shell

聚合

聚合函数为每个组返回单个聚合值。当创建了分组(group by)对象,就可以对分组数据执行多个聚合操作。

一个比较常用的是通过聚合或等效的agg方法聚合 -

import pandas as pd
import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data) grouped = df.groupby('Year')
print (grouped['Points'].agg(np.mean))
Python

执行上面示例代码,得到以下结果 -

Year
2014 795.25
2015 769.50
2016 725.00
2017 739.00
Name: Points, dtype: float64
Shell

另一种查看每个分组的大小的方法是应用size()函数 -

import pandas as pd
import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print (grouped.agg(np.size))
Python

执行上面示例代码,得到以下结果 -

Team
Devils 2 2 2
Kings 3 3 3
Riders 4 4 4
Royals 2 2 2
kings 1 1 1
Shell

一次应用多个聚合函数

通过分组系列,还可以传递函数的列表或字典来进行聚合,并生成DataFrame作为输出 -

import pandas as pd
import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data) grouped = df.groupby('Team')
agg = grouped['Points'].agg([np.sum, np.mean, np.std])
print (agg)
Python

执行上面示例代码,得到以下结果 -

         sum        mean         std
Team
Devils 1536 768.000000 134.350288
Kings 2285 761.666667 24.006943
Riders 3049 762.250000 88.567771
Royals 1505 752.500000 72.831998
kings 812 812.000000 NaN
Shell

转换

分组或列上的转换返回索引大小与被分组的索引相同的对象。因此,转换应该返回与组块大小相同的结果。

import pandas as pd
import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data) grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std()*10
print (grouped.transform(score))
Python

执行上面示例代码,得到以下结果 -

       Points       Rank       Year
0 12.843272 -15.000000 -11.618950
1 3.020286 5.000000 -3.872983
2 7.071068 -7.071068 -7.071068
3 -7.071068 7.071068 7.071068
4 -8.608621 11.547005 -10.910895
5 NaN NaN NaN
6 -2.360428 -5.773503 2.182179
7 10.969049 -5.773503 8.728716
8 -7.705963 5.000000 3.872983
9 -7.071068 7.071068 -7.071068
10 7.071068 -7.071068 7.071068
11 -8.157595 5.000000 11.618950
Shell

过滤

过滤根据定义的标准过滤数据并返回数据的子集。filter()函数用于过滤数据。

import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
filter = df.groupby('Team').filter(lambda x: len(x) >= 3) print (filter)
Python

执行上面示例代码,得到以下结果 -

    Points  Rank    Team  Year
0 876 1 Riders 2014
1 789 2 Riders 2015
4 741 3 Kings 2014
6 756 1 Kings 2016
7 788 1 Kings 2017
8 694 2 Riders 2016
11 690 2 Riders 2017
Shell

在上述过滤条件下,要求返回三次以上参加IPL的队伍。

最新文章

  1. 【三石jQuery视频教程】01.图片循环展示
  2. controller_name classify constantize model_name
  3. BZOJ1565 [NOI2009]植物大战僵尸(拓扑排序 + 最大权闭合子图)
  4. 图解JVM的Class文件格式(详细版)
  5. iOS 短视频开发总结二 AVCaptureSession
  6. Python Iterable Iterator Yield
  7. js操作Dom的一些方法简化
  8. No compatible targets were found.Do you wish to...的解决方案。
  9. JQuery EasyUI内Combobox的onChange事件
  10. poj 1190 DFS 不等式放缩进行剪枝
  11. SpringMVC(二)——流程控制
  12. Play-With-Docker在chrome上的插件
  13. es6学习日记3
  14. BZOJ4517[Scoi2016]美味——主席树
  15. webpack 相关插件及作用(表格)
  16. 解题:NOI 2014 随机数生成器
  17. 绝命毒师第五季/全集Breaking Bad迅雷下载
  18. Cent OS 常用配置命令
  19. matlab imresize 改变图像大小
  20. maven拓展——使用tomcat插件运行maven项目

热门文章

  1. 超哥带你学网络编程部分blog
  2. HTTP 筛选器 DLL C:\Windows\Microsoft.Net\Framework\v4.0.30319\aspnet_filter.dll 加载失败。数据是错误。
  3. java工程编写manifest文件
  4. Apache配置HTTPS的过程小记
  5. [NOIP2018TG]保卫王国
  6. 6.Insert Documents-官方文档摘录
  7. git 设置 .gitignore 为全局global + 配置.gitignore为全局后不生效解决办法
  8. Vue(5)- axios、vuex
  9. Python中的不同进制的语法和转换
  10. N多条短信,用什么算法从中找出相似内容的来?