Question

I have a Pandas DataFrame "table" that contains a column named "OPINION", filled with string values. I would like to create a new column named "cond5" that is True for every row where "OPINION" is either "buy" or "neutral".

I have tried

table["cond5"]= table.OPINION == "buy" or table.OPINION == "neutral"

which gives me an error, and

table["cond5"]= table.OPINION.all() in ("buy", "neutral")

which returns False for all rows.

Answer 1

As Ed Chum points out, you can use the isin method:

table['cond5'] = table['OPINION'].isin(['buy', 'neutral'])

isin checks for exact equality. It is probably the simplest and most readable option.
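To make the exact-equality behaviour concrete, here is a minimal sketch. The sample values (including "Buy" and "buyer") are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data, chosen to include near-misses.
table = pd.DataFrame({"OPINION": ["buy", "sell", "neutral", "Buy", "buyer"]})

# isin compares each value for exact equality with the listed strings,
# so "Buy" (different case) and "buyer" (longer string) do not match.
table["cond5"] = table["OPINION"].isin(["buy", "neutral"])
print(table)
```

Only the first and third rows should come out True here.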

To fix

table["cond5"] = table.OPINION == "buy" or table.OPINION == "neutral"

use

table["cond5"] = (table['OPINION'] == "buy") | (table['OPINION'] == "neutral")

The parentheses are necessary because | has higher precedence (binding power) than ==.
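The precedence rule can be checked with plain integers, without involving Pandas at all. This is only an illustrative sketch; the variable names are arbitrary.

```python
# Without parentheses, | is applied first, so the expression is parsed as
# the chained comparison 1 == (1 | 2) == 2, which is 1 == 3 == 2.
unparenthesized = 1 == 1 | 2 == 2

# With parentheses, each comparison is evaluated first and the two
# boolean results are then combined with |.
parenthesized = (1 == 1) | (2 == 2)

print(unparenthesized, parenthesized)  # False True
```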

x or y needs to evaluate x (and, if x is falsy, y) as a single boolean value, so

(table['OPINION'] == "buy") or (table['OPINION'] == "neutral")

raises an error, since a Series cannot be reduced to a single boolean value.

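So use the elementwise logical-or operator | instead, which takes the or of the corresponding values of the two Series, element by element.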
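The difference between or and | can be reproduced with a small sketch. The three-row frame is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data.
table = pd.DataFrame({"OPINION": ["buy", "sell", "neutral"]})
is_buy = table["OPINION"] == "buy"
is_neutral = table["OPINION"] == "neutral"

# `or` asks for the truth value of the whole Series, which is ambiguous,
# so Pandas raises a ValueError.
or_failed = False
try:
    is_buy or is_neutral
except ValueError as exc:
    or_failed = True
    print("or failed:", exc)

# `|` combines the two boolean Series element by element.
combined = is_buy | is_neutral
print(combined.tolist())
```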

Another alternative is

import numpy as np
table["cond5"] = np.logical_or.reduce([(table['OPINION'] == val) for val in ('buy', 'neutral')])

which might be useful if ('buy', 'neutral') were a longer tuple.
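As a sketch of the longer-tuple case, the same pattern works for any number of values. The tuple wanted and the extra value "hold" are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and a hypothetical, longer tuple of accepted values.
table = pd.DataFrame({"OPINION": ["buy", "sell", "neutral", "hold", ""]})
wanted = ("buy", "neutral", "hold")

# Build one boolean mask per accepted value, then OR them all together.
masks = [table["OPINION"] == val for val in wanted]
table["cond5"] = np.logical_or.reduce(masks)

print(table["cond5"].tolist())
```

For this data the result should agree with table["OPINION"].isin(wanted).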

Yet another option is to use Pandas' vectorized string method, str.contains:

table["cond5"] = table['OPINION'].str.contains(r'buy|neutral')

str.contains performs a regex search for the pattern r'buy|neutral' on each item in table['OPINION'], in a Cythonized loop.
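One caveat worth checking: because this is a regex search, it matches substrings, so it is not equivalent to isin in general. The sketch below uses made-up data (the value "buyer" is hypothetical) and shows one way to anchor the pattern if exact matches are wanted.

```python
import pandas as pd

# Hypothetical sample data with a value that merely contains "buy".
table = pd.DataFrame({"OPINION": ["buy", "buyer", "neutral", "sell"]})

# Unanchored search: "buyer" matches because it contains "buy".
contains = table["OPINION"].str.contains(r"buy|neutral")

# Anchored pattern with a non-capturing group: only whole-string matches.
exact = table["OPINION"].str.contains(r"^(?:buy|neutral)$")

print(contains.tolist())
print(exact.tolist())
```

The non-capturing group (?:...) is used so that the anchors apply to both alternatives.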

Now how do you decide which one to use? Here is a timeit benchmark using IPython:

In [10]: table = pd.DataFrame({'OPINION':np.random.choice(['buy','neutral','sell',''], size=10**6)})

In [11]: %timeit (table['OPINION'] == "buy") | (table['OPINION'] == "neutral")
10 loops, best of 3: 121 ms per loop

In [12]: %timeit np.logical_or.reduce([(table['OPINION'] == val) for val in ('buy', 'neutral')])
1 loops, best of 3: 204 ms per loop

In [13]: %timeit table['OPINION'].str.contains(r'buy|neutral')
1 loops, best of 3: 474 ms per loop

In [14]: %timeit table['OPINION'].isin(['buy', 'neutral'])
10 loops, best of 3: 40 ms per loop

So it looks like isin is the fastest.
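Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. This is a rough sketch under assumptions: the row count, seed, and repeat count are arbitrary choices, and absolute timings will differ from the figures above depending on machine and library versions. It also checks that all four approaches agree on this particular data, where every value is exactly one of the four strings.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 10**6 rows used above so the script finishes quickly.
rng = np.random.default_rng(0)
table = pd.DataFrame(
    {"OPINION": rng.choice(["buy", "neutral", "sell", ""], size=10**5)}
)
col = table["OPINION"]

candidates = {
    "==, |": lambda: (col == "buy") | (col == "neutral"),
    "logical_or.reduce": lambda: np.logical_or.reduce(
        [col == val for val in ("buy", "neutral")]
    ),
    "str.contains": lambda: col.str.contains(r"buy|neutral"),
    "isin": lambda: col.isin(["buy", "neutral"]),
}

# All approaches should give the same mask for this data.
reference = np.asarray(candidates["isin"](), dtype=bool)
agree = all(
    np.array_equal(np.asarray(fn(), dtype=bool), reference)
    for fn in candidates.values()
)

# Average seconds per call over a few runs.
timings = {
    name: timeit.timeit(fn, number=3) / 3 for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name:>18}: {seconds * 1000:.1f} ms per call")
```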

Evaluate a Pandas DataFrame against two string values
