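I have been working on machine learning competitions lately, and while studying the top competitors' source code I noticed that these two functions come up constantly. It took me a while to understand them, so here are some quick notes on what I have used in the competition so far; I will tidy them up properly once it is over.
1. groupby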
In [35]: df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
...: 'key2':['one', 'two', 'one', 'two', 'one'],
...: 'data1':np.random.randn(5),
...: 'data2':np.random.randn(5)})
In [36]: df
Out[36]:
  key1 key2     data1     data2
0    a  one -1.400763  0.494059
1    a  two  1.303229 -2.396705
2    b  one -0.482499 -1.590093
3    b  two -0.902582 -0.909068
4    a  one -0.628412  1.724196
In [100]: df.groupby(['key1','key2'],as_index=False)['key1'].agg({'TotalNumber':'count'})
Out[100]:
  key1 key2  TotalNumber
0    a  one            2
1    a  two            1
2    b  one            1
3    b  two            1
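Here key1 and key2 are used together as the grouping keys, and the rows in each (key1, key2) group are then counted via the key1 column (I used something similar in the competition).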
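On recent pandas versions (1.0 and later, as far as I know), passing a renaming dict such as {'TotalNumber': 'count'} to agg on a single selected column raises an error instead of renaming the result. A minimal sketch of one way to get the same table is below; it swaps in size() plus reset_index(name=...) for the dict-style agg, and the variable name counts is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5),
})

# size() counts the rows in each (key1, key2) group and returns a Series
# indexed by the group keys; reset_index(name=...) turns the keys back into
# ordinary columns and names the count column.
counts = df.groupby(['key1', 'key2']).size().reset_index(name='TotalNumber')
print(counts)
```

Note that size() counts every row in the group, while count() skips missing values in the selected column; for key columns without NaN the two agree.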
The agg function is also used a lot, usually together with groupby.
2. merge
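As a rough illustration of agg combined with groupby, the sketch below computes two different statistics per group in one call using named aggregation (available in pandas 0.25 and later). The data values and the output column names data1_mean and data2_max are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10, 20, 30, 40, 50],
})

# Each keyword is an output column name; each value is a
# (source column, aggregation function) pair.
stats = df.groupby('key1', as_index=False).agg(
    data1_mean=('data1', 'mean'),
    data2_max=('data2', 'max'),
)
print(stats)
```

With as_index=False the grouping key stays an ordinary column, which makes the result easy to merge back onto another DataFrame.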
In [89]: left = pd.DataFrame({'key1':['foo','foo','bar'],'key2':['one','one','two'],'lval':[1,2,3]})
In [90]: right = pd.DataFrame({'key1':['foo','foo','bar','bar'],'key2':['one','one','one','two'],'rval':[4,5,6,7]})
In [91]: left
Out[91]:
  key1 key2  lval
0  foo  one     1
1  foo  one     2
2  bar  two     3
In [92]: right
Out[92]:
  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7
In [93]: left.merge(right,on=['key1','key2'],how='left')
Out[93]:
  key1 key2  lval  rval
0  foo  one     1     4
1  foo  one     1     5
2  foo  one     2     4
3  foo  one     2     5
4  bar  two     3     7
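Here key1 and key2 are used together as the merge keys, and the merge type is left: every row of the left DataFrame is kept, while only the rows of the right DataFrame whose keys match are brought in. That is why the (bar, one) row from right is missing from the result. Also note that the (foo, one) key appears twice on each side, so every left row is paired with every matching right row, giving 2 x 2 = 4 rows for that key.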
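To make the effect of how more concrete, here is a small sketch that reuses the left and right frames from above and compares how='left' with how='outer'. The indicator=True argument adds a _merge column that records which side each row came from; the variable names left_join and outer_join are only for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'one', 'two'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Left merge: keep every row of `left`; unmatched rows of `right` are dropped.
left_join = left.merge(right, on=['key1', 'key2'], how='left')

# Outer merge: keep rows from both sides; missing values become NaN.
outer_join = left.merge(right, on=['key1', 'key2'], how='outer', indicator=True)

print(left_join)
print(outer_join)
```

In the outer result the (bar, one) row from right shows up with a missing lval and is marked right_only in the _merge column, which is a convenient way to check what a left merge silently discards.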