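I want to conditionally replace values row by row in a pandas DataFrame, so that max(row) is kept and every other value in that row is set to None.
My first instinct was apply(), but I'm not sure whether that is the right choice, or how to go about it.
Example (the real data may have more than two columns):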
import pandas as pd

tmp = pd.DataFrame({
    'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
    'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
})
tmp
A B
0 1 3
1 2 4
2 3 1
3 4 33
4 5 10
5 6 9
6 7 7
7 8 3
8 9 10
9 10 10
Desired output:
somemagic(tmp)
A B
0 None 3
1 None 4
2 3 None
3 None 33
4 None 10
5 None 9
6 7 None # on tie I don't really care which one is set to None
7 8 None
8 None 10
9 10 None # on tie I don't really care which one is set to None
Any suggestions on how to achieve this?
Answer 0 (score: 2)
print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
mask = (tmp.eq(tmp.max(axis=1), axis=0))
print (mask)
A B
0 False True
1 False True
2 True False
3 False True
4 False True
5 False True
6 True True
7 True False
8 False True
9 True True
df = (tmp[mask])
print (df)
A B
0 NaN 3.0
1 NaN 4.0
2 3.0 NaN
3 NaN 33.0
4 NaN 10.0
5 NaN 9.0
6 7.0 7.0
7 8.0 NaN
8 NaN 10.0
9 10.0 10.0
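The masking steps above can be wrapped into a small helper. This is only a sketch: the name keep_row_max is made up for illustration, and it uses DataFrame.where, which for a boolean DataFrame mask gives the same result as tmp[mask]. Note that ties keep every tied value, and the non-maximum cells come back as NaN rather than None.

```python
import pandas as pd

def keep_row_max(df):
    """Keep each row's maximum; every other cell becomes NaN (ties are all kept)."""
    row_max = df.max(axis=1)          # one maximum per row
    mask = df.eq(row_max, axis=0)     # True where a cell equals its row maximum
    return df.where(mask)             # cells where mask is False become NaN

tmp = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [3, 4, 1, 33, 10, 9, 7, 3, 10, 10],
})
result = keep_row_max(tmp)
print(result)
```

Because NaN is a float, the integer columns are upcast to float64 in the result.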
Then, if the values in both columns are equal, you can set one of them to NaN:
mask = (tmp.eq(tmp.max(axis=1), axis=0))
mask['B'] = mask.B & (tmp.A != tmp.B)
print (mask)
A B
0 False True
1 False True
2 True False
3 False True
4 False True
5 False True
6 True False
7 True False
8 False True
9 True False
df = (tmp[mask])
print (df)
A B
0 NaN 3.0
1 NaN 4.0
2 3.0 NaN
3 NaN 33.0
4 NaN 10.0
5 NaN 9.0
6 7.0 NaN
7 8.0 NaN
8 NaN 10.0
9 10.0 NaN
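The tie-break above is written for exactly two columns, A and B. For more columns, one possible generalization (a sketch, not part of the original answer) is to keep only the first maximum in each row using NumPy's argmax. The helper name keep_first_row_max is hypothetical, and the sketch assumes the frame is numeric and contains no NaN values.

```python
import numpy as np
import pandas as pd

def keep_first_row_max(df):
    """Keep only the first maximum in each row; all other cells become NaN."""
    values = df.to_numpy()
    first_max = values.argmax(axis=1)            # column position of the first max per row
    mask = np.zeros(values.shape, dtype=bool)
    mask[np.arange(len(df)), first_max] = True   # exactly one True per row
    mask = pd.DataFrame(mask, index=df.index, columns=df.columns)
    return df.where(mask)

tmp = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [3, 4, 1, 33, 10, 9, 7, 3, 10, 10],
})
first_only = keep_first_row_max(tmp)
print(first_only)
```

As in the two-column version, a tie is resolved in favour of the leftmost column.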
Timings (len(df)=10):
In [234]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
1000 loops, best of 3: 974 µs per loop
In [235]: %timeit (gh(tmp))
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.64 ms per loop
Timings (len(df)=100k):
In [244]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
100 loops, best of 3: 7.42 ms per loop
In [245]: %timeit (gh(t1))
1 loop, best of 3: 8.81 s per loop
Code for the timings:
import pandas as pd
tmp= pd.DataFrame({
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
} )
tmp = pd.concat([tmp]*10000).reset_index(drop=True)
t1 = tmp.copy()
print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
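def top(row):
    # keep the row maximum, replace every other value with None
    data = row.tolist()
    return [d if d == max(data) else None for d in data]

def gh(tmp1):
    # row-wise apply, used for comparison with the vectorized version
    return tmp1.apply(top, axis=1)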
print (gh(t1))
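Outside IPython, a similar comparison can be made with the standard timeit module. The sketch below is an assumption-laden variant of the timing code above: it uses only 1,000 rows so that it finishes quickly, the function names vectorized and rowwise are made up, and the row-wise function returns a Series instead of a list. Absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [3, 4, 1, 33, 10, 9, 7, 3, 10, 10],
})
small = pd.concat([base] * 100).reset_index(drop=True)  # 1,000 rows

def vectorized(df):
    # boolean mask built for the whole frame at once
    return df[df.eq(df.max(axis=1), axis=0)]

def top(row):
    # row-wise variant: keep the maximum, NaN elsewhere
    return row.where(row == row.max())

def rowwise(df):
    return df.apply(top, axis=1)

# best of 3 single runs, in seconds
t_vec = min(timeit.repeat(lambda: vectorized(small), number=1, repeat=3))
t_row = min(timeit.repeat(lambda: rowwise(small), number=1, repeat=3))
print('vectorized: %.6f s, row-wise apply: %.6f s' % (t_vec, t_row))
```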
Answer 1 (score: 2)
I suggest you use apply(). You can use it as follows:
In [1]: import pandas as pd
In [2]: tmp= pd.DataFrame({
...: 'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
...: 'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
...: } )
In [3]: tmp
Out[3]:
A B
0 1 3
1 2 4
2 3 1
3 4 33
4 5 10
5 6 9
6 7 7
7 8 3
8 9 10
9 10 10
In [4]: def top(row):
   ...:     data = row.tolist()
   ...:     return [d if d == max(data) else None for d in data]
   ...:
In [5]: df2 = tmp.apply(top, axis=1)
In [6]: df2
Out[6]:
A B
0 NaN 3
1 NaN 4
2 3 NaN
3 NaN 33
4 NaN 10
5 NaN 9
6 7 7
7 8 NaN
8 NaN 10
9 10 10
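One caveat, as far as I know: in newer pandas releases (0.23 and later), a function that returns a plain list from apply(..., axis=1) produces a Series of lists rather than a DataFrame. A variant that returns a Series per row keeps the column labels. The sketch below is an adaptation of the idea above, not the original answer's code, and it yields NaN rather than None.

```python
import pandas as pd

def top(row):
    # row is a Series indexed by column name; keep the max, NaN elsewhere
    return row.where(row == row.max())

tmp = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [3, 4, 1, 33, 10, 9, 7, 3, 10, 10],
})
df2 = tmp.apply(top, axis=1)
print(df2)
```

As in the output above, tied rows (index 6 and 9) keep both values.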