I have a DataFrame like this:
df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'vote':[5,4,5,1,10,1,9],
'doggo': [None,"doggo",None,None,"doggo",None,None],
'floofer': ["floofer",None,None,"floofer",None,None,None],
'pupper': [None,None,"pupper",None,None,None,None],
'puppo':[None,None,None,None,None,None,"puppo"]})
I want to combine the last four columns and generate:
df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'vote':[5,4,5,1,10,1,9],
'categories': ["floofer","doggo","pupper","floofer","doggo",None,"puppo"]})
Any guidance is appreciated.
Answer 0 (score: 1)
If each row has at most one non-None value across the category columns, this solution works:
cols = ['doggo','floofer','pupper','puppo']
cols1 = df.columns.difference(cols)
df2 = df[cols1].join(df[cols].ffill(axis=1).iloc[:, -1].rename('Categories'))
print (df2)
id vote Categories
0 1 5 floofer
1 2 4 doggo
2 3 5 pupper
3 4 1 floofer
4 5 10 doggo
5 6 1 None
6 7 9 puppo
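Explanation:
First select only the category columns and forward-fill the missing values along each row, so the expected value ends up in the last column: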
print (df[cols].ffill(axis=1))
doggo floofer pupper puppo
0 None floofer floofer floofer
1 doggo doggo doggo doggo
2 None None pupper pupper
3 None floofer floofer floofer
4 doggo doggo doggo doggo
5 None None None None
6 None None None puppo
Then select the last column by position:
print (df[cols].ffill(axis=1).iloc[:, -1])
0 floofer
1 doggo
2 pupper
3 floofer
4 doggo
5 None
6 puppo
Name: puppo, dtype: object
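Since the forward-fill approach depends on each row holding at most one non-null category, it may be worth checking that first. A minimal sketch using the question's data (the names `non_null_per_row` and `single_valued` are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'vote': [5, 4, 5, 1, 10, 1, 9],
                   'doggo': [None, "doggo", None, None, "doggo", None, None],
                   'floofer': ["floofer", None, None, "floofer", None, None, None],
                   'pupper': [None, None, "pupper", None, None, None, None],
                   'puppo': [None, None, None, None, None, None, "puppo"]})
cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Count the non-null category values in every row
non_null_per_row = df[cols].notna().sum(axis=1)

# True only if no row has more than one category filled in
single_valued = bool((non_null_per_row <= 1).all())
print(single_valued)
```

If this prints False, the forward-fill result keeps only the right-most value of each row and silently drops the others.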
If a row can contain multiple values, here is a solution that builds the result from the names of the category columns:
df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'vote':[5,4,5,1,10,1,9],
'doggo': [None,"doggo1",None,"doggo2","doggo3",None,None],
'floofer': ["floofer1",None,None,"floofer2",None,None,None],
'pupper': [None,None,"pupper1",None,None,None,None],
'puppo':["puppo1",None,None,None,None,None,"puppo2"]})
print (df)
id vote doggo floofer pupper puppo
0 1 5 None floofer1 None puppo1
1 2 4 doggo1 None None None
2 3 5 None None pupper1 None
3 4 1 doggo2 floofer2 None None
4 5 10 doggo3 None None None
5 6 1 None None None None
6 7 9 None None None puppo2
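import numpy as np

# Boolean mask dotted with "name, " strings joins the names of the non-null columns
s = (df[cols].notnull()
       .dot(pd.Index(cols) + ', ')
       .str.strip(', ')
       .rename('Categories')
       .replace('', np.nan)
    )
# Assign to a new name so df keeps its category columns for the next solution
df2 = df[cols1].join(s)
print (df2)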
id vote Categories
0 1 5 floofer, puppo
1 2 4 doggo
2 3 5 pupper
3 4 1 doggo, floofer
4 5 10 doggo
5 6 1 NaN
6 7 9 puppo
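The reason the `dot` approach produces joined names is that a string multiplied by True is the string itself, and multiplied by False it is the empty string. Here is a plain-Python sketch of that mechanism for a single row (`names_of_true` is a hypothetical helper, not part of pandas):

```python
def names_of_true(flags, names, sep=', '):
    """Join the names whose flag is True (illustrative helper only)."""
    # bool is an int subclass: True * 'x, ' == 'x, ' and False * 'x, ' == ''
    joined = ''.join(flag * (name + sep) for flag, name in zip(flags, names))
    # Drop the trailing separator characters
    return joined.strip(sep)

cols = ['doggo', 'floofer', 'pupper', 'puppo']
row0 = names_of_true([False, True, False, True], cols)
row5 = names_of_true([False, False, False, False], cols)
print(repr(row0), repr(row5))
```

A row with no flags set gives an empty string, which is why the pandas version ends with `.replace('', np.nan)`.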
Another solution, where the output is built from the cell values rather than the column names:
# Append ', ' to every value, turn missing cells into '', then concatenate per row
s = pd.Series(df[cols].add(', ').fillna('').values.sum(axis=1),
              index=df.index, name='Categories').str.strip(', ')
df2 = df[cols1].join(s)
print (df2)
id vote Categories
0 1 5 floofer1, puppo1
1 2 4 doggo1
2 3 5 pupper1
3 4 1 doggo2, floofer2
4 5 10 doggo3
5 6 1
6 7 9 puppo2
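If readability matters more than speed, the same value-based join can probably be written with a row-wise `apply`. This is a sketch under the assumption that all category values are strings, and it will be slower on large frames:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'vote': [5, 4, 5, 1, 10, 1, 9],
                   'doggo': [None, "doggo1", None, "doggo2", "doggo3", None, None],
                   'floofer': ["floofer1", None, None, "floofer2", None, None, None],
                   'pupper': [None, None, "pupper1", None, None, None, None],
                   'puppo': ["puppo1", None, None, None, None, None, "puppo2"]})
cols = ['doggo', 'floofer', 'pupper', 'puppo']

# For each row, drop the missing cells and join what is left in column order
joined = (df[cols]
          .apply(lambda row: ', '.join(row.dropna()), axis=1)
          .rename('Categories'))
result = df[['id', 'vote']].join(joined)
print(result)
```

As with the vectorized version, a row with no categories ends up as an empty string rather than NaN.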
Answer 1 (score: 1)
bfill + iloc
You can bfill (backfill) along each row and select the first column:
(df.set_index(['id', 'vote'])
.bfill(axis=1)
.iloc[:, 0]
.reset_index(name='Categories'))
id vote Categories
0 1 5 floofer
1 2 4 doggo
2 3 5 pupper
3 4 1 floofer
4 5 10 doggo
5 6 1 None
6 7 9 puppo
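stack + reindex
# drop() no longer accepts axis positionally in recent pandas; dropna() keeps only the filled cells
cats = (df.drop(columns=['id', 'vote']).stack().dropna()
          .reset_index(level=1, drop=True).reindex(df.index))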
pd.DataFrame(dict(id=df.id, vote=df.vote, Categories=cats))
id vote Categories
0 1 5 floofer
1 2 4 doggo
2 3 5 pupper
3 4 1 floofer
4 5 10 doggo
5 6 1 NaN
6 7 9 puppo
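last_valid_index
Slow but concise. Note that this returns the column label of the last non-null cell, which matches the cell value here only because each value equals its column name.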
(df.set_index(['id', 'vote'])
.agg(lambda x: x.last_valid_index(), axis=1)
.reset_index(name='Categories'))
id vote Categories
0 1 5 floofer
1 2 4 doggo
2 3 5 pupper
3 4 1 floofer
4 5 10 doggo
5 6 1 None
6 7 9 puppo
This assumes 'id' and 'vote' are the only non-category columns.
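To see how label-based and value-based selection differ, here is a small sketch on a made-up two-row frame (`demo` is illustrative data, not from the question) where the cell values differ from the column names:

```python
import pandas as pd

# Illustrative frame: values carry a suffix so they differ from the column names
demo = pd.DataFrame({'doggo': [None, 'doggo1'],
                     'puppo': ['puppo1', None]})

# last_valid_index gives the column label of the last non-null cell in each row
labels = demo.agg(lambda row: row.last_valid_index(), axis=1)

# bfill + iloc gives the cell value itself
values = demo.bfill(axis=1).iloc[:, 0]

print(labels.tolist())
print(values.tolist())
```

So `last_valid_index` is only a drop-in replacement when each cell simply repeats its column name, as in the original question.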
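Answer 2 (score: 0)
We can take advantage of the fact that x or None evaluates to x (for any truthy x), and use NumPy's logical_or reduction across each row to pick out the category.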
import numpy as np
cols = ['doggo','floofer','pupper','puppo']
categories = np.logical_or.reduce(df[cols], axis=1)
df = df.assign(categories=categories).drop(cols, axis=1)
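For intuition, the reduction does per row roughly what the following plain-Python sketch does (`first_truthy` is a hypothetical helper, shown only to illustrate the `or` chaining):

```python
from functools import reduce

def first_truthy(values):
    """Chain `or` across the values; return the first truthy one, else None."""
    return reduce(lambda a, b: a or b, values, None)

row_with_category = first_truthy([None, 'doggo', None, None])
row_without_category = first_truthy([None, None, None, None])
print(row_with_category, row_without_category)
```

One caveat of relying on `or`: falsy values such as an empty string or 0 are skipped just like None, so this only suits columns whose real values are always truthy.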