Aggregate multiple string columns by id - Python

Time: 2018-05-07 17:30:43

Tags: python pandas

I have the following DataFrame in Python:

import pandas as pd

d = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
              'col1': ['normal', 'well', 'normal', 'normal', 'well', 'normal'],
              'col2': ['bad', 'normal','normal', 'normal', 'normal', 'bad']})

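I want to aggregate by id so that each column keeps any string other than 'normal' ('well' or 'bad') and falls back to 'normal' only when the group contains nothing else. Like this: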

result = pd.DataFrame({'id': [1, 2, 3],
                'col1': ['well', 'well', 'normal'],
                'col2': ['bad', 'normal', 'bad']})

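I was thinking of sorting and then using groupby and .first, but I am not sure how to get the desired level to the top of each column.

3 Answers:

Answer 0: (score: 5)

Use a categorical to define the order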

cats = ['well', 'bad', 'normal']
d = d.assign(
    col1=pd.Categorical(d.col1, cats, ordered=True),
    col2=pd.Categorical(d.col2, cats, ordered=True)
)

d.groupby('id', as_index=False).min()

   id    col1    col2
0   1    well     bad
1   2    well  normal
2   3  normal     bad
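To illustrate why min() returns the preferred value here, the following minimal sketch inspects a single ordered categorical. The sample values are taken from id 1 of col1; the variable names `col` and `smallest` are only illustrative.

```python
import pandas as pd

# Category order defines the sort order: 'well' < 'bad' < 'normal'.
cats = ['well', 'bad', 'normal']

# col1 values for id == 1 in the question's data.
col = pd.Categorical(['normal', 'well', 'normal'], categories=cats, ordered=True)

# Each value is stored as the position of its category in `cats`.
codes = list(col.codes)

# min() compares by category position, not alphabetically.
smallest = col.min()
print(codes, smallest)
```

Because 'normal' is listed last in cats, it is only returned when a group contains no other value.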

Answer 1: (score: 4)

If the data contains no NaN values beforehand, first replace 'normal' with NaN and then use GroupBy.first, which skips NaN:

import numpy as np

# turn 'normal' into NaN so first() skips it, then restore 'normal'
d = d.replace('normal', np.nan).groupby('id').first().fillna('normal')
# alternative solution
d = d.mask(d == 'normal').groupby('id').first().fillna('normal')
print(d)
      col1    col2
id                
1     well     bad
2     well  normal
3   normal     bad
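The output above has id as the index, while the desired result in the question has it as a regular column. A minimal sketch of the same idea that ends with reset_index, assuming the question's data and masking only the value columns (the names `values`, `masked` and `out` are illustrative):

```python
import pandas as pd

d = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                  'col1': ['normal', 'well', 'normal', 'normal', 'well', 'normal'],
                  'col2': ['bad', 'normal', 'normal', 'normal', 'normal', 'bad']})

values = d[['col1', 'col2']]

# Replace every 'normal' with NaN so that GroupBy.first skips it.
masked = values.mask(values == 'normal')

# first() returns the first non-null value per group and column;
# groups that held only 'normal' come back as NaN and are refilled.
out = masked.groupby(d['id']).first().fillna('normal').reset_index()
print(out)
```

Note that first() keeps whichever non-'normal' value appears first in row order. If a group could contain both 'well' and 'bad' in the same column, this approach does not impose a priority between them, unlike the categorical approach.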

Answer 2: (score: 2)

Create a helper key to assist with sorting, then use groupby.
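One way the helper-key idea could be sketched is shown below. This is an assumption about the approach, not the answerer's code: the `priority` mapping, the `_key` column and the per-column loop are all hypothetical names introduced for illustration.

```python
import pandas as pd

d = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                  'col1': ['normal', 'well', 'normal', 'normal', 'well', 'normal'],
                  'col2': ['bad', 'normal', 'normal', 'normal', 'normal', 'bad']})

# Hypothetical priority map: a lower number means a preferred value.
priority = {'well': 0, 'bad': 1, 'normal': 2}

parts = []
for col in ['col1', 'col2']:
    # Helper key for this column, used only for sorting.
    ordered = d.assign(_key=d[col].map(priority)).sort_values('_key', kind='mergesort')
    # After sorting, the preferred value is the first row of each group.
    parts.append(ordered.groupby('id')[col].first())

result = pd.concat(parts, axis=1).reset_index()
print(result)
```

Each column is sorted separately because the row that holds the preferred value for col1 is not necessarily the row that holds it for col2.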