Question

If I have a dataframe df with a column x and want to create a column y based on the values of x, following this pseudocode:

 if df['x'] < -2 then df['y'] = 1
 else if df['x'] > 2 then df['y'] = -1
 else df['y'] = 0

How would I achieve this? I assume np.where is the best way to do it, but I'm not sure how to code it correctly.

Answer 1

One simple method is to assign the default value first and then perform two loc calls:

In [66]:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
Out[66]:
   x
0  0
1 -3
2  5
3 -1
4  1

In [69]:

df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0

If you want to use np.where, you can nest one np.where inside another:

In [77]:

df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0

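So here the first condition tests where x is less than -2 and returns 1. Where that fails, the inner np.where tests the other condition, where x is greater than 2, and returns -1. Otherwise it returns 0.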
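If the nesting gets hard to read, np.select is a flat alternative that expresses the same logic. This is not part of the original answer, just a minimal sketch on the same sample frame; the variable names conditions and choices are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, -3, 5, -1, 1]})

# Conditions are checked in order; the first one that matches wins.
conditions = [df['x'] < -2, df['x'] > 2]
choices = [1, -1]

# Rows matching neither condition get the default value 0.
df['y'] = np.select(conditions, choices, default=0)
print(df)
```

With more than two or three branches this tends to stay readable where nested np.where calls do not.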

Timings

In [79]:

%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))

1000 loops, best of 3: 1.79 ms per loop

In [81]:

%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1

100 loops, best of 3: 3.27 ms per loop

So for this sample dataset the np.where method is roughly twice as fast (1.79 ms vs 3.27 ms per loop).
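A five-row frame is tiny, so the ratio may look different on real data. Here is a rough sketch for repeating the comparison outside IPython with the standard timeit module. The frame big, its size, and the helper names with_where and with_loc are all made up for illustration, and with_loc copies the frame first so repeated runs start from the same state, which adds some overhead to that side.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame; the size and value range are arbitrary choices.
rng = np.random.default_rng(0)
big = pd.DataFrame({'x': rng.integers(-5, 6, size=100_000)})


def with_where(frame):
    # Nested np.where, as in the answer above.
    return np.where(frame['x'] < -2, 1, np.where(frame['x'] > 2, -1, 0))


def with_loc(frame):
    # Default value first, then two loc assignments on a copy.
    out = frame.copy()
    out['y'] = 0
    out.loc[out['x'] < -2, 'y'] = 1
    out.loc[out['x'] > 2, 'y'] = -1
    return out['y'].to_numpy()


# Both approaches should agree before their speed is compared.
assert (with_where(big) == with_loc(big)).all()

for fn in (with_where, with_loc):
    seconds = timeit.timeit(lambda: fn(big), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```

The actual numbers depend on the machine and on the pandas and NumPy versions, so treat them as relative rather than absolute.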

Answer 2

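This is a good use case for pd.cut, where you define ranges and assign labels based on those ranges. Note that with right=False the value x == 2 falls into the last bin and gets -1, whereas the pseudocode in the question gives 0, and the resulting column has a categorical dtype: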

df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)
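To see how the bin edges compare with the np.where version, here is a small check on a frame that includes the boundary values -2 and 2. The column names y_where and y_cut are only for this illustration.

```python
import numpy as np
import pandas as pd

# Include the boundary values -2 and 2 on purpose.
df = pd.DataFrame({'x': [-3, -2, 0, 2, 5]})

df['y_where'] = np.where(df['x'] < -2, 1, np.where(df['x'] > 2, -1, 0))

# right=False gives the bins [-inf, -2), [-2, 2), [2, inf).
cut = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)

# pd.cut returns a categorical Series; cast it to get plain integers.
df['y_cut'] = cut.astype(int)
print(df)
```

The two columns agree everywhere except at x == 2, so whether pd.cut is a drop-in replacement depends on whether that edge can occur in your data.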

