Question

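I have floating-point data like the rows below, which are the outputs of a neural network with 3 output neurons. I want to convert them to mutually exclusive binary class labels based on the maximum value in each row.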

0.423201  0.368718  0.338091
0.246899  0.437535  0.000262
0.978685  0.136219  0.027693

The output should be:

1 0 0
0 1 0
1 0 0

That is, each row should contain the value 1 exactly once (where the row maximum is) and zeros everywhere else.

How can I do this in pandas or Python? I know get_dummies in pandas is supposed to be the way to go, but it does not work for me.

Please help if you can.

Answer 1

I think you can use rank and then compare the result with the max value of df1. Finally, convert the boolean DataFrame to int with astype:

print(df)
          0         1         2
0  0.423201  0.368718  0.338091
1  0.246899  0.437535  0.000262
2  0.978685  0.136219  0.027693

# rank the values within each row; the row maximum gets the highest rank
df1 = df.rank(method='max', axis=1)
print(df1)
   0  1  2
0  3  2  1
1  2  3  1
2  3  2  1

# get the max value of df1
ma = df1.max().max()
print(ma)
3.0

print(df1 == ma)
       0      1      2
0   True  False  False
1  False   True  False
2   True  False  False

print((df1 == ma).astype(int))
   0  1  2
0  1  0  0
1  0  1  0
2  1  0  0
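For reference, the same rank approach can be written as one self-contained Python 3 script. This is only a minimal sketch: it rebuilds df from the three rows in the question, and the names ranks, top_rank and labels are my own.

```python
import pandas as pd

# The three rows from the question
df = pd.DataFrame([[0.423201, 0.368718, 0.338091],
                   [0.246899, 0.437535, 0.000262],
                   [0.978685, 0.136219, 0.027693]])

# Rank the values within each row (axis=1); the row maximum gets the top rank
ranks = df.rank(method='max', axis=1)

# The highest rank found anywhere in the frame
top_rank = ranks.max().max()

# True where a cell holds the top rank, then cast the booleans to 0/1
labels = (ranks == top_rank).astype(int)
print(labels)
```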

Edit:

I think you can use eq to compare each row against its own maximum (df.max(axis=1)), then convert to int with astype:

print(df.max(axis=1))
0    10
1     8
2     9
dtype: int64

print(df.eq(df.max(axis=1), axis=0).astype(int))
   0  1  2
0  1  0  0
1  0  1  0
2  1  0  0
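One thing to be aware of: the eq comparison marks every cell that equals the row maximum, so a row with tied maxima gets more than one 1. If you need exactly one 1 per row, a possible variation (not part of the answer above) is to use NumPy's argmax, which returns the first maximum. The sketch below assumes a reasonably recent pandas (for to_numpy); the helper name one_hot_argmax and the frame tied are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.423201, 0.368718, 0.338091],
                   [0.246899, 0.437535, 0.000262],
                   [0.978685, 0.136219, 0.027693]])

# eq approach: compare every row with its own maximum
labels = df.eq(df.max(axis=1), axis=0).astype(int)


def one_hot_argmax(frame):
    """Hypothetical helper: exactly one 1 per row, at the first row maximum."""
    values = frame.to_numpy()
    one_hot = np.zeros_like(values, dtype=int)
    # Set column argmax(row) to 1 for every row
    one_hot[np.arange(len(values)), values.argmax(axis=1)] = 1
    return pd.DataFrame(one_hot, index=frame.index, columns=frame.columns)


labels_argmax = one_hot_argmax(df)

# A row with a tie shows the difference between the two approaches
tied = pd.DataFrame([[0.5, 0.5, 0.1]])
print(tied.eq(tied.max(axis=1), axis=0).astype(int).values.tolist())  # [[1, 1, 0]]
print(one_hot_argmax(tied).values.tolist())                           # [[1, 0, 0]]
```

On the question's data both versions give the same labels; they only differ when a row contains ties.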

Timings

len(df) = 3:

In [418]: %timeit df.eq(df.max(axis=1), axis=0).astype(int)
The slowest run took 5.44 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 334 µs per loop

In [419]: %timeit df.apply(lambda x: x == x.max(), axis='columns').astype(int)
The slowest run took 4.49 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.44 ms per loop

In [420]: %timeit (df.rank(method='max', axis=1) == df.rank(method='max', axis=1).max().max()).astype(int)
The slowest run took 4.83 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 656 µs per loop

len(df) = 3000:

In [426]: %timeit df.eq(df.max(axis=1), axis=0).astype(int)
The slowest run took 5.44 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 456 µs per loop

In [427]: %timeit df.apply(lambda x: x == x.max(), axis='columns').astype(int)
1 loops, best of 3: 496 ms per loop

In [428]: %timeit (df.rank(method='max', axis=1) == df.rank(method='max', axis=1).max().max()).astype(int)
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.32 ms per loop
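If you want to rerun the comparison outside IPython, something like the following should work. It is only a rough sketch: the 3000 x 3 frame of random floats, the seed, the repeat count and the function names with_eq, with_apply and with_rank are my assumptions, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 3000 rows of random floats in [0, 1), 3 columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((3000, 3)))


def with_eq(frame):
    # Vectorized comparison against the row maximum
    return frame.eq(frame.max(axis=1), axis=0).astype(int)


def with_apply(frame):
    # Python-level lambda called once per row
    return frame.apply(lambda row: row == row.max(), axis='columns').astype(int)


def with_rank(frame):
    # Row-wise ranks compared with the highest rank in the frame
    ranks = frame.rank(method='max', axis=1)
    return (ranks == ranks.max().max()).astype(int)


# Sanity check: all three produce the same labels on this data
expected = with_eq(df).to_numpy()
assert np.array_equal(expected, with_apply(df).to_numpy())
assert np.array_equal(expected, with_rank(df).to_numpy())

for func in (with_eq, with_apply, with_rank):
    seconds = timeit.timeit(lambda: func(df), number=5) / 5
    print(f"{func.__name__}: {seconds * 1e3:.3f} ms per call")
```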

Answer 2

I think this would be simpler and faster.

df.apply(lambda x: x == x.max(), axis='columns').astype(int)

How to binarize float values in pandas?
