Question

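Say, for the sake of example, that I have several columns encoding different types of rates ("annual rate", "1/2 annual rate", etc.). I'd like to use query on my dataframe to find entries where any of these rates is above 1.

First, I find the columns I want to use in my query: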

cols = [x for ix, x in enumerate(df.columns) if 'rate' in x]

where, for example, cols contains:

["annual rate", "1/2 annual rate", "monthly rate"]

Then I'd like to do something like:

df.query('any of my cols > 1')

How would I format this for query?

Answer 1

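query performs a full parse of a Python expression (with some limits: for example, you can't use lambda expressions or ternary if/else expressions). This means that any column you refer to in your query string must be a valid Python identifier (a more formal term for "variable name"). One way to check this is to use the Name pattern lurking in the tokenize module: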

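In [155]: import re, tokenize

In [156]: tokenize.Name
Out[156]: '[a-zA-Z_]\\w*'

In [157]: def isidentifier(x):
   .....:     # anchor the end too, so a name like 'annual rate' is rejected as a whole
   .....:     return re.match(tokenize.Name + r'\Z', x) is not None
   .....: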

In [158]: isidentifier('adsf')
Out[158]: True

In [159]: isidentifier('1adsf')
Out[159]: False
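
On Python 3 there is also a built-in check, so the regex is not strictly needed. The session above looks like it comes from an older Python, and the value of tokenize.Name may differ in your version. A minimal sketch, where is_valid_name is a hypothetical helper name and rejecting keywords is my own addition:

```python
import keyword


def is_valid_name(name):
    """Return True if `name` could be used as a bare variable name.

    Hypothetical helper for illustration only.
    """
    # str.isidentifier() checks the whole string, whereas re.match only
    # anchors at the start of the string.
    # Keywords such as 'class' pass isidentifier() but still cannot be used
    # as variable names, so they are rejected separately.
    return name.isidentifier() and not keyword.iskeyword(name)


for name in ['adsf', '1adsf', 'annual rate', 'annual_rate', 'class']:
    print(name, is_valid_name(name))
```

Only 'adsf' and 'annual_rate' pass this check.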

Now, since your column names contain spaces, each space-separated word is evaluated as a separate identifier, so you end up with something like

df.query("annual rate > 1")

which is invalid Python syntax. Try typing annual rate into a Python interpreter and you'll get a SyntaxError exception.

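Take-home message: rename your columns so that they are valid variable names. You won't be able to do this programmatically (at least not easily) unless your column names follow some kind of structure. In your case you could do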

In [166]: cols
Out[166]: ['annual rate', '1/2 annual rate', 'monthly rate']

In [167]: list(map(lambda x: '_'.join(x.split()).replace('1/2', 'half'), cols))
Out[167]: ['annual_rate', 'half_annual_rate', 'monthly_rate']

Then you can format the query string, similar to @acushner's example

In [173]: newcols
Out[173]: ['annual_rate', 'half_annual_rate', 'monthly_rate']

In [174]: ' or '.join('%s > 1' % c for c in newcols)
Out[174]: 'annual_rate > 1 or half_annual_rate > 1 or monthly_rate > 1'
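
The renaming and string-building steps above can be put together roughly like this. It is only a sketch: to_identifier is a hypothetical helper, the '1/2' to 'half' replacement is specific to these column names, and it does not handle names that collide after cleaning or that turn into Python keywords.

```python
import re


def to_identifier(name, replacements=None):
    """Turn a column label into something usable as a Python identifier.

    Hypothetical helper; `replacements` maps awkward substrings to words.
    """
    for old, new in (replacements or {}).items():
        name = name.replace(old, new)
    # Collapse every run of non-word characters into a single underscore.
    name = re.sub(r'\W+', '_', name).strip('_')
    # Identifiers cannot start with a digit, so prefix an underscore.
    if name and name[0].isdigit():
        name = '_' + name
    return name


cols = ['annual rate', '1/2 annual rate', 'monthly rate']
mapping = {c: to_identifier(c, {'1/2': 'half'}) for c in cols}

# Build the query string from the cleaned names.
query = ' or '.join('%s > 1' % new for new in mapping.values())
print(mapping)
print(query)
```

The mapping can then be passed to df.rename(columns=mapping) before calling query on the renamed frame.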

Note: you don't actually need `query` here:

In [179]: import numpy as np, pandas as pd

In [180]: df = pd.DataFrame(np.random.randn(10, 3), columns=cols)

In [181]: df
Out[181]:
   annual rate  1/2 annual rate  monthly rate
0      -0.6980           0.6322        2.5695
1      -0.1413          -0.3285       -0.9856
2       0.8189           0.7166       -1.4302
3       1.3300          -0.9596       -0.8934
4      -1.7545          -0.9635        2.8515
5      -1.1389           0.1055        0.5423
6       0.2788          -1.3973       -0.9073
7      -1.8570           1.3781        0.0501
8      -0.6842          -0.2012       -0.5083
9      -0.3270          -1.5280        0.2251

[10 rows x 3 columns]

In [182]: df.gt(1).any(axis=1)
Out[182]:
0     True
1    False
2    False
3     True
4     True
5    False
6    False
7     True
8    False
9    False
dtype: bool

In [183]: df[df.gt(1).any(axis=1)]
Out[183]:
   annual rate  1/2 annual rate  monthly rate
0      -0.6980           0.6322        2.5695
3       1.3300          -0.9596       -0.8934
4      -1.7545          -0.9635        2.8515
7      -1.8570           1.3781        0.0501

[4 rows x 3 columns]
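
If the frame also has columns that are not rates, the same idea works by selecting the dynamic list first. Here is a small sketch, assuming cols is the list from the question; the random seed and the hard-coded rename mapping are mine, for illustration. It also checks that the mask and the query on renamed columns select the same rows.

```python
import numpy as np
import pandas as pd

cols = ['annual rate', '1/2 annual rate', 'monthly rate']
rng = np.random.default_rng(0)  # arbitrary seed, just to make the run repeatable
df = pd.DataFrame(rng.standard_normal((10, 3)), columns=cols)
df['label'] = list('abcdefghij')  # a non-rate column that must be ignored

# Restrict the comparison to the rate columns, then test each row.
mask = df[cols].gt(1).any(axis=1)
via_mask = df[mask]

# The same filter through query, after renaming to valid identifiers.
renamed = df.rename(columns={
    'annual rate': 'annual_rate',
    '1/2 annual rate': 'half_annual_rate',
    'monthly rate': 'monthly_rate',
})
via_query = renamed.query(
    'annual_rate > 1 or half_annual_rate > 1 or monthly_rate > 1'
)

# Both approaches keep exactly the same rows.
print(via_mask.index.equals(via_query.index))
```

The mask version needs no renaming at all, which is why it is the simpler choice when the column list is built dynamically.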

As @Jeff pointed out in the comments, it is possible to refer to non-identifier column names, albeit in a clunky way:

pd.eval('df[df["annual rate"]>0]')

If you value the lives of kittens, I don't recommend writing code like this.
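
For what it's worth, newer pandas releases let you quote awkward column names with backticks inside query. As far as I know, names with spaces are supported from 0.25 and names that start with a digit or contain single-character operators such as / from 1.0, so treat this as a sketch that depends on your pandas version. The sample data is made up.

```python
import pandas as pd

# Made-up data; only the column names mirror the question.
df = pd.DataFrame({
    'annual rate': [0.5, 1.5, 0.2],
    '1/2 annual rate': [0.1, 0.3, 2.0],
    'monthly rate': [0.0, 0.9, 0.4],
    'name': ['a', 'b', 'c'],
})

cols = [c for c in df.columns if 'rate' in c]

# Wrap each column name in backticks so query accepts it unchanged.
expr = ' or '.join('`%s` > 1' % c for c in cols)
result = df.query(expr)
print(expr)
print(result)
```

This avoids the renaming step, at the cost of requiring a recent enough pandas.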

Answer 2

Something like this should do the trick

df.query('|'.join('(%s > 1)' % col for col in cols))

I'm not sure how to deal with the spaces in the column names, so you may have to rename them.
