df
= DataFrame
({'key1':['a','a','b','b','a'],
'key2':['one','two','one','two','one'],
'data1':np
.arange
(0,5),
'data2':np
.arange
(-5,0)})
print(df
)
key1 key2 data1 data2
0 a one
0 -5
1 a two
1 -4
2 b one
2 -3
3 b two
3 -2
4 a one
4 -1
1.将‘data2’这一列按照‘key1’这一列进行分组
grouped
= df
['data2'].groupby
(df
['key1'])
print(grouped
.min())
key1
a
-5
b
-3
Name
: data2
, dtype
: int32
2.将分组的列做成列表传入:将‘data1’这一列按照‘key1’和‘key2’列进行分组
grouped1
= df
['data1'].groupby
([df
['key1'],df
['key2']])
print(grouped1
.min())
key1 key2
a one
0
two
1
b one
2
two
3
Name
: data1
, dtype
: int32
3.分组键传的值也可以是正确长度的数组
list1
= [2001,2002,2003,2004,2005]
grouped2
= df
['data1'].groupby
(list1
)
print(grouped2
.mean
())
2001 0
2002 1
2003 2
2004 3
2005 4
Name
: data1
, dtype
: int32
4.将整个DataFrame按照其中的某一列进行划分,这时候可以传入该列的列名
grouped3
= df
.groupby
('key1')
print(grouped3
.mean
())
data1 data2
key1
a
1.666667 -3.333333
b
2.500000 -2.500000
grouped4
= df
.groupby
(['key1','key2'])
print(grouped4
.mean
())
data1 data2
key1 key2
a one
2 -3
two
1 -4
b one
2 -3
two
3 -2
5.size:返回一个包含组大小信息的Series
grouped5
= df
.groupby
(['key1','key2'])
print(grouped5
.size
())
key1 key2
a one
2
two
1
b one
1
two
1
dtype
: int64
6.GroupBy对象支持迭代,会生成一个包含组名和数据块的
2维元组序列
for name
,value
in df
.groupby
('key1'):
print(name
,end
="---分组名")
print(value
)
a
---分组名
key1 key2 data1 data2
0 a one
0 -5
1 a two
1 -4
4 a one
4 -1
b
---分组名
key1 key2 data1 data2
2 b one
2 -3
3 b two
3 -2
7.当迭代时有多个分组键的情况下,元组的第一个元素是键值的元组
('a', 'one')
key1 key2 data1 data2
0 a one
0 -5
4 a one
4 -1
('a', 'two')
key1 key2 data1 data2
1 a two
1 -4
('b', 'one')
key1 key2 data1 data2
2 b one
2 -3
('b', 'two')
key1 key2 data1 data2
3 b two
3 -2
8.将分组的数据保存在字典中,通过分组的值取出相应的值
dicta
= dict(list(df
.groupby
('key1')))
print(dicta
)
{'a': key1 key2 data1 data2
0 a one
0 -5
1 a two
1 -4
4 a one
4 -1,
'b': key1 key2 data1 data2
2 b one
2 -3
3 b two
3 -2}
print(dicta
['a'])
key1 key2 data1 data2
0 a one
0 -5
1 a two
1 -4
4 a one
4 -1
9.将从DataFrame创建的GroupBy对象用列名称或列名称数组进行索引时,会产生用于聚合列子集的效果
print(df
.groupby
('key1')['data1'].min())
key1
a
0
b
2
Name
: data1
, dtype
: int32
print(df
.groupby
('key1')[['data1']].min())
data1
key1
a
0
b
2
10.在字典或Series中写入各个列的对应关系,然后将字典传入grouped数组,实现各列相加
people
= DataFrame
(np
.arange
(25).reshape
((5,5)),
columns
=['a','b','c','d','e'],
index
=['aa','bb','cc','dd','ee'])
dictb
= {'a':'now','b':'now','c':'now','d':'last','e':'last'}
print(people
.groupby
(dictb
,axis
=1).sum())
last now
aa
7 3
bb
17 18
cc
27 33
dd
37 48
ee
47 63
mapSeries
= Series
(dictb
)
print(people
.groupby
(mapSeries
,axis
=1).sum())
last now
aa
7 3
bb
17 18
cc
27 33
dd
37 48
ee
47 63
11.使用Python函数来定义分组关系,函数会将每个索引值调用一次,会将返回值作为分组名称
people2
= DataFrame
(np
.arange
(25).reshape
((5,5)),
columns
=['a','b','c','d','e'],
index
=['aaa','bbbb','cc','dd','ee'])
print(people2
.groupby
(len).size
())
2 3
3 1
4 1
dtype
: int64
12.将函数与其他混合使用
print(people2
.groupby
([len,'a']).size
())
a
2 10 1
15 1
20 1
3 0 1
4 5 1
dtype
: int64
13. 按索引层级分组
column
= [['A','A','A','B','B'],[1,3,5,1,3]]
name
= ['city','tenor']
people3
= DataFrame
(np
.arange
(25).reshape
((5,5)),
columns
=column
,
index
=['aaa','bbbb','cc','dd','ee'])
people3
.columns
.names
=name
print(people3
.groupby
(level
='city',axis
=1).count
())
city A B
aaa
3 2
bbbb
3 2
cc
3 2
dd
3 2
ee
3 2
14.使用自己的聚合函数:将自己的函数传入agg或aggregate
hah
= lambda x
:x
.max()-x
.min()
print(df
.groupby
('key1').aggregate
(hah
))
data1 data2
key1
a
4 4
b
1 1
15.describe
()
print(df
.groupby
('key1').describe
())
data1
... data2
count mean std
min 25% ... min 25% 50% 75% max
key1
...
a
3.0 1.666667 2.081666 0.0 0.50 ... -5.0 -4.50 -4.0 -2.50 -1.0
b
2.0 2.500000 0.707107 2.0 2.25 ... -3.0 -2.75 -2.5 -2.25 -2.0
[2 rows x
16 columns
]
16.各列同时使用多个函数进行聚合,向agg传入函数列表
hah
= lambda x
:x
.min()
print(df
.groupby
('key1').agg
([hah
,'mean','max']))
data1 data2
<lambda_0
> mean
max <lambda_0
> mean
max
key1
a
0 1.666667 4 -5 -3.333333 -1
b
2 2.500000 3 -3 -2.500000 -2
17.从上面我们可以看出
lambda函数的显示名称为
<lambda_0
>,我们可以通过传入元组来自定义函数名称
print(df
.groupby
('key1').agg
([('hah',hah
),'mean','max']))
data1 data2
hah mean
max hah mean
max
key1
a
0 1.666667 4 -5 -3.333333 -1
b
2 2.500000 3 -3 -2.500000 -2
18.将不同的函数应用到不同的列上,需要把列名和函数对应的字典传入agg
print(df
.groupby
('key1').agg
({'data1':[('hah',hah
),'mean','max'],'data2':['mean']}))
data1 data2
hah mean
max mean
key1
a
0 1.666667 4 -3.333333
b
2 2.500000 3 -2.500000
19.分位数
qq
= pd
.cut
(np
.random
.randn
(100),4)
dff
= Series
(np
.random
.randn
(100))
print(dff
.groupby
(qq
).agg
(['count','mean','max']))
count mean
max
(-2.296, -1.064] 16 0.358343 2.006945
(-1.064, 0.164] 41 0.190866 2.265793
(0.164, 1.391] 34 0.021378 2.545082
(1.391, 2.619] 9 0.410388 1.696888
qq
= pd
.qcut
(np
.random
.randn
(100),4)
dff
= Series
(np
.random
.randn
(100))
print(dff
.groupby
(qq
).agg
(['count','mean','max']))
count mean
max
(-3.247, -0.861] 25 -0.102378 1.743070
(-0.861, -0.0778] 25 0.470662 2.484437
(-0.0778, 0.511] 25 0.214561 1.822943
(0.511, 1.827] 25 -0.432044 2.611959