I have a DataFrame with several "action" columns. How can I find the last action that matches a pattern and return its column index or label?
My data:
name   action_1    action_2  action_3
bill   referred    referred
bob    introduced  referred  referred
mary   introduced
june   introduced  referred
dale   referred
donna  introduced
What I want:
name   action_1    action_2  action_3  last_referred
bill   referred    referred            action_2
bob    introduced  referred  referred  action_3
mary   introduced                      NA
june   introduced  referred            action_2
dale   referred                        action_1
donna  introduced                      NA
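For anyone who wants to reproduce this, here is a minimal sketch of how the sample data could be built. It assumes the empty cells are missing values (NaN); the question does not say whether they are NaN, None, or empty strings.

```python
import numpy as np
import pandas as pd

# Sample data from the question. Blank cells are assumed to be NaN.
df = pd.DataFrame({
    "name": ["bill", "bob", "mary", "june", "dale", "donna"],
    "action_1": ["referred", "introduced", "introduced",
                 "introduced", "referred", "introduced"],
    "action_2": ["referred", "referred", np.nan, "referred", np.nan, np.nan],
    "action_3": [np.nan, "referred", np.nan, np.nan, np.nan, np.nan],
})

print(df)
```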
Answer 0 (score: 2)
Just use apply with axis=1, and pass pattern to the function as an additional argument.
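In [3]: import numpy as np
   ...:
   ...: def func(row, pattern):
   ...:     # Scan the row left to right; the last matching label wins
   ...:     referrer = np.nan
   ...:     for key in row.index:
   ...:         if row[key] == pattern:
   ...:             referrer = key
   ...:     return referrer
   ...:
   ...: df['last_referred'] = df.apply(func, pattern='referred', axis=1)
   ...: df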
Out[3]:
    name    action_1  action_2  action_3 last_referred
0   bill    referred  referred      None      action_2
1    bob  introduced  referred  referred      action_3
2   mary  introduced                               NaN
3   june  introduced  referred                action_2
4   dale    referred                          action_1
5  donna  introduced                               NaN
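Note that the function above scans every column, including name. As a rough sketch, the same row-wise idea can be limited to the action columns and written without the explicit loop. The helper name last_match and the filter(like="action_") column selection are my own assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

def last_match(row, pattern):
    """Return the label of the last cell in `row` equal to `pattern`, else NaN."""
    matches = row.index[(row == pattern).to_numpy()]
    return matches[-1] if len(matches) else np.nan

# Sample data from the question; blank cells assumed to be NaN.
df = pd.DataFrame({
    "name": ["bill", "bob", "mary", "june", "dale", "donna"],
    "action_1": ["referred", "introduced", "introduced",
                 "introduced", "referred", "introduced"],
    "action_2": ["referred", "referred", np.nan, "referred", np.nan, np.nan],
    "action_3": [np.nan, "referred", np.nan, np.nan, np.nan, np.nan],
})

# Only look at the action columns, so a person named "referred" cannot match.
action_cols = df.filter(like="action_").columns
df["last_referred"] = df[action_cols].apply(last_match, pattern="referred", axis=1)
print(df)
```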
Answer 1 (score: 2)
A vectorized approach that uses arange and max to find the last index, followed by concatenation:
df['last_referred'] = np.r_[[np.nan], df.columns][
    ((df == 'referred') * (np.arange(df.shape[1]) + 1)).max(axis=1).values]
Explanation:
We want to find the rightmost cell in each row whose value is 'referred':
>>> df == 'referred'
    name  action_1  action_2  action_3
0  False      True      True     False
1  False     False      True      True
2  False     False     False     False
3  False     False      True     False
4  False      True     False     False
5  False     False     False     False
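One option is DataFrame.idxmax, but that yields the first (i.e. leftmost) occurrence. However, if we could replace the True values with their column index, we could use an ordinary max. Since True is 1 and False is 0, we can do this by multiplying, which broadcasts the integer range [0, 1, 2, ...] down the rows: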
>>> np.arange(df.shape[1])
array([0, 1, 2, 3])
>>> (df == 'referred') * np.arange(df.shape[1])
   name  action_1  action_2  action_3
0     0         1         2         0
1     0         0         2         3
2     0         0         0         0
3     0         0         2         0
4     0         1         0         0
5     0         0         0         0
>>> ((df == 'referred') * np.arange(df.shape[1])).max(axis=1)
0 2
1 3
2 0
3 2
4 1
5 0
dtype: int32
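But there is a problem: we can't tell 'referred' in the "name" column apart from no match at all, because both give 0. Easy fix; just start the integer range from 1: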
>>> ((df == 'referred') * (np.arange(df.shape[1]) + 1)).max(axis=1)
0 3
1 4
2 0
3 3
4 2
5 0
dtype: int32
Now just index the column names with this array:
>>> df.columns[((df == 'referred') * (np.arange(df.shape[1]) + 1)).max(axis=1).values]
IndexError: index 4 is out of bounds for size 4
Oops! We need 0 to map to NaN, with the column labels shifted up by one position. We can use np.r_ to concatenate the arrays:
>>> np.r_[[np.nan], df.columns]
array([nan, 'name', 'action_1', 'action_2', 'action_3'], dtype=object)
>>> np.r_[[np.nan], df.columns][
...     ((df == 'referred') * (np.arange(df.shape[1]) + 1)).max(axis=1).values]
array(['action_2', 'action_3', nan, 'action_2', 'action_1', nan], dtype=object)
And there you have it.
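For reference, the steps above can be put together as one self-contained script. This is only a sketch: it builds the lookup table with np.array([np.nan, *df.columns], dtype=object) as a stand-in for np.r_, and the variable names (mask, positions, last_pos, labels) are mine.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blank cells assumed to be NaN.
df = pd.DataFrame({
    "name": ["bill", "bob", "mary", "june", "dale", "donna"],
    "action_1": ["referred", "introduced", "introduced",
                 "introduced", "referred", "introduced"],
    "action_2": ["referred", "referred", np.nan, "referred", np.nan, np.nan],
    "action_3": [np.nan, "referred", np.nan, np.nan, np.nan, np.nan],
})

mask = (df == "referred").to_numpy()        # True where a cell matches
positions = np.arange(1, df.shape[1] + 1)   # 1-based column positions
last_pos = (mask * positions).max(axis=1)   # 0 means "no match in this row"

# Lookup table: slot 0 is NaN, slots 1..n are the column labels.
labels = np.array([np.nan, *df.columns], dtype=object)
df["last_referred"] = labels[last_pos]
print(df)
```

Because the positions start at 1, a row without any match keeps the value 0 and picks up the NaN slot.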
Answer 2 (score: 1)
You can use pandas.melt and groupby:
In [123]: molten = pd.melt(df, id_vars='name', var_name='last_referred')
In [124]: molten
Out[124]:
     name  last_referred       value
0    bill       action_1    referred
1     bob       action_1  introduced
2    mary       action_1  introduced
3    june       action_1  introduced
4    dale       action_1    referred
5   donna       action_1  introduced
6    bill       action_2    referred
7     bob       action_2    referred
8    mary       action_2         NaN
9    june       action_2    referred
10   dale       action_2         NaN
11  donna       action_2         NaN
12   bill       action_3         NaN
13    bob       action_3    referred
14   mary       action_3         NaN
15   june       action_3         NaN
16   dale       action_3         NaN
17  donna       action_3         NaN
In [125]: gb = molten.groupby('name')
In [126]: col = gb.apply(lambda x: x[x.value == 'referred'].tail(1)).last_referred
In [127]: col.index = col.index.droplevel(1)
In [128]: col
Out[128]:
name
bill    action_2
bob     action_3
dale    action_1
june    action_2
Name: last_referred, dtype: object
In [129]: newdf = df.join(col, on='name')
In [130]: newdf
Out[130]:
    name    action_1  action_2  action_3 last_referred
0   bill    referred  referred       NaN      action_2
1    bob  introduced  referred  referred      action_3
2   mary  introduced       NaN       NaN           NaN
3   june  introduced  referred       NaN      action_2
4   dale    referred       NaN       NaN      action_1
5  donna  introduced       NaN       NaN           NaN
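A slightly shorter variant of the same melt idea, as a sketch: filter the matching rows first and then take GroupBy.last instead of apply with tail(1). This is not what the answer does verbatim. It assumes each name appears only once and relies on melt stacking the columns in their original left-to-right order.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blank cells assumed to be NaN.
df = pd.DataFrame({
    "name": ["bill", "bob", "mary", "june", "dale", "donna"],
    "action_1": ["referred", "introduced", "introduced",
                 "introduced", "referred", "introduced"],
    "action_2": ["referred", "referred", np.nan, "referred", np.nan, np.nan],
    "action_3": [np.nan, "referred", np.nan, np.nan, np.nan, np.nan],
})

# Long format: one row per (name, action column) pair.
molten = df.melt(id_vars="name", var_name="last_referred")

# Keep only the matching cells, then take the last column label per name.
hits = molten[molten["value"] == "referred"]
last = hits.groupby("name")["last_referred"].last()

# Names without a match are absent from `last`, so the join leaves NaN.
result = df.join(last, on="name")
print(result)
```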
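Answer 3 (score: 0)
You can also use idxmax, which returns the first index of the maximum value; when a row has no match (all False), it simply returns the first column. This does require adding an extra "NA" column to act as that fallback, so it's a bit cumbersome.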
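# Reverse the column order so idxmax finds the rightmost match first
revcols = df.columns.values.tolist()
revcols.reverse()
tmpdf = df == 'referred'
# All-False fallback column; idxmax lands on it when a row has no match
tmpdf['NA'] = False
lastrefer = tmpdf[['NA'] + revcols].idxmax(axis=1)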
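If you would rather end up with a real NaN than the string 'NA', one possible variation (a sketch, not the approach above verbatim) is to drop the fallback column and mask the rows that have no match. The filter(like="action_") selection is an assumption about how the action columns are named.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blank cells assumed to be NaN.
df = pd.DataFrame({
    "name": ["bill", "bob", "mary", "june", "dale", "donna"],
    "action_1": ["referred", "introduced", "introduced",
                 "introduced", "referred", "introduced"],
    "action_2": ["referred", "referred", np.nan, "referred", np.nan, np.nan],
    "action_3": [np.nan, "referred", np.nan, np.nan, np.nan, np.nan],
})

# Boolean frame over the action columns only.
hits = df.filter(like="action_") == "referred"

# Reverse the column order so idxmax returns the rightmost match,
# then blank out rows where nothing matched at all.
last = hits.iloc[:, ::-1].idxmax(axis=1).where(hits.any(axis=1))

df["last_referred"] = last
print(df)
```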