Pandas实现groupby分组统计
类似SQL: select city.max(temperature) from city_weather groupby by city
groupby: 先对数据分组,然后在每个分组上应用聚合函数、转换函数
import pandas
as pd
import numpy
as np
%matplotlib inline
df
= pd
.DataFrame
({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : np
.random
.randn
(8),
'D' : np
.random
.randn
(8)})
df
ABCD
0fooone-0.1907411.6129481barone-0.7812091.0021172footwo0.858491-0.1784293barthree-0.2805590.0838164footwo0.214302-0.1077175bartwo-0.134638-1.6043656fooone-2.108731-0.6538197foothree0.414685-1.717435
1、分组使用聚合函数做数据统计
1、单个列groupby,查询所有数据列的统计
df
.groupby
('A').sum()
CDA
bar-1.196406-0.518431foo-0.811994-1.044452
1、groupby中的A变成了数据的索引列 2、B列不是数字,所以被自动忽略
2、多个列groupby,查询所有数据列的统计
df
.groupby
(['A','B']).mean
()
CDAB
barone-0.7812091.002117three-0.2805590.083816two-0.134638-1.604365fooone-1.1497360.479564three0.414685-1.717435two0.536397-0.143073
(A,B)成对变成了二级索引,不想改变原来索引。加一个as_index=False
df
.groupby
(['A','B'],as_index
=False).mean
()
ABCD
0barone-0.7812091.0021171barthree-0.2805590.0838162bartwo-0.134638-1.6043653fooone-1.1497360.4795644foothree0.414685-1.7174355footwo0.536397-0.143073
3、同时查看多种数据统计
df
.groupby
('A').agg
([np
.sum,np
.mean
,np
.std
])
CDsummeanstdsummeanstdA
bar-1.196406-0.3988020.339116-0.518431-0.172811.322055foo-0.811994-0.1623991.151755-1.044452-0.208891.204567
列变成了多级索引
4、查看单列的结果数据统计
df
.groupby
('A')['C'].agg
([np
.sum,np
.mean
,np
.std
])
summeanstdA
bar-1.196406-0.3988020.339116foo-0.811994-0.1623991.151755
df
.groupby
('A').agg
([np
.sum,np
.mean
,np
.std
])['C']
summeanstdA
bar-1.196406-0.3988020.339116foo-0.811994-0.1623991.151755
5、不同列使用不同的聚合函数
df
.groupby
('A').agg
({'C':np
.sum,'D':np
.std
})
CDA
bar-1.1964061.322055foo-0.8119941.204567
欢迎关注个人公众号