Question

我想有条件地在pandas数据帧中逐行替换值，以便保留max（行），而行中的所有其他值将设置为None。我的直觉转向apply()，但我不确定这是否是正确的选择，或者如何做到这一点。

示例（但可能有多列）：

tmp= pd.DataFrame({
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
} )

tmp
    A   B
0   1   3
1   2   4
2   3   1
3   4   33
4   5   10
5   6   9
6   7   7
7   8   3
8   9   10
9   10  10

通缉输出：

somemagic(tmp)
    A       B
0   None    3
1   None    4
2   3       None
3   None    33
4   None    10
5   None    9
6   7       None    # on tie I don't really care which one is set to None
7   8       None
8   None    10
9   10      None    # on tie I don't really care which one is set to None

有关如何实现这一目标的任何建议吗？

Answer 1

您可以将DataFrame值eq与max进行比较：

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])

mask = (tmp.eq(tmp.max(axis=1), axis=0))
print (mask)
       A      B
0  False   True
1  False   True
2   True  False
3  False   True
4  False   True
5  False   True
6   True   True
7   True  False
8  False   True
9   True   True

df = (tmp[mask])
print (df)
      A     B
0   NaN   3.0
1   NaN   4.0
2   3.0   NaN
3   NaN  33.0
4   NaN  10.0
5   NaN   9.0
6   7.0   7.0
7   8.0   NaN
8   NaN  10.0
9  10.0  10.0

然后，如果列中的值相等，则可以添加NaN：

mask = (tmp.eq(tmp.max(axis=1), axis=0))
mask['B'] = mask.B & (tmp.A != tmp.B)
print (mask)
       A      B
0  False   True
1  False   True
2   True  False
3  False   True
4  False   True
5  False   True
6   True  False
7   True  False
8  False   True
9   True  False

df = (tmp[mask])
print (df)
      A     B
0   NaN   3.0
1   NaN   4.0
2   3.0   NaN
3   NaN  33.0
4   NaN  10.0
5   NaN   9.0
6   7.0   NaN
7   8.0   NaN
8   NaN  10.0
9  10.0   NaN

计时（len(df)=10）：

In [234]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
1000 loops, best of 3: 974 µs per loop

In [235]: %timeit (gh(tmp))
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.64 ms per loop

（len(df)=100k）：

In [244]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
100 loops, best of 3: 7.42 ms per loop

In [245]: %timeit (gh(t1))
1 loop, best of 3: 8.81 s per loop

时间安排的代码：

import pandas as pd

tmp= pd.DataFrame({
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
} )


tmp = pd.concat([tmp]*10000).reset_index(drop=True)
t1 = tmp.copy()

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])


def top(row):
    data = row.tolist()
    return [d if d == max(data) else None for d in data]

def gh(tmp1):
    return tmp1.apply(top, axis=1)

print (gh(t1))

Answer 2

我建议您使用apply()。您可以按如下方式使用它：

In [1]: import pandas as pd

In [2]: tmp= pd.DataFrame({
   ...: 'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
   ...: 'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
   ...: } )

In [3]: tmp
Out[3]: 
    A   B
0   1   3
1   2   4
2   3   1
3   4  33
4   5  10
5   6   9
6   7   7
7   8   3
8   9  10
9  10  10

In [4]: def top(row):
   ...:         data = row.tolist()
   ...:         return [d if d == max(data) else None for d in data]
   ...: 

In [5]: df2 = tmp.apply(top, axis=1)

In [6]: df2
Out[6]: 
    A   B
0 NaN   3
1 NaN   4
2   3 NaN
3 NaN  33
4 NaN  10
5 NaN   9
6   7   7
7   8 NaN
8 NaN  10
9  10  10

Python：逐行替换数据帧值

2 个答案: