Question

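I'm running a fairly complex filter on a pandas DataFrame (I'm filtering for test results that pass any of 67 different thresholds, which are supplied via a dictionary). To do this I have the following: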

query_string = ' | '.join([f'{k} > {v}' for k, v in dictionary.items()])
test_passes = df.query(query_string, engine='python')

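where k is the test name and v is the threshold value.

This works fine, and I can export the rows with test passes to CSV.

I'd like to know whether there is a way to append a column that counts the number of tests passed, so that each row records how many of the 67 tests it passed (anywhere from 1 to 67).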

Answer 1

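So, I finally got around to solving this after originally posting the pandas query. The original question asked about test passes, whereas my use case is actually test failures, so...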

test_failures = data.query(query_string, engine='python').copy()

The copy is there to prevent unintended data manipulation and chained-assignment warning messages.

for k, row in test_failures.iterrows():
    failure_count = 0
    test_count = 0
    # Compare this row against every threshold in the dictionary
    for key, val in threshold_dict.items():
        test_count += 1
        if row[key] > val:
            failure_count += 1
    # Write both counters back to the row by its index label
    test_failures.at[k, 'Test Count'] = test_count
    test_failures.at[k, 'Failure Count'] = failure_count

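As I understand it, iterrows() isn't the fastest way to iterate, but it does provide the index (k) and the row data (row, a Series that can be indexed by column name like a dictionary) separately, which I find more useful than the tuples returned by itertuples().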
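To illustrate the difference between the two iterators, here is a minimal sketch. The frame, its column names, and the index labels are made up for the example; the column "test b" contains a space on purpose.

```python
import pandas as pd

# Hypothetical frame; "test b" is deliberately not a valid Python identifier.
frame = pd.DataFrame(
    {"test_a": [1.5, 0.2], "test b": [3.0, 4.0]},
    index=["r1", "r2"],
)

# iterrows() yields (index label, Series), so columns are looked up by name.
for label, row in frame.iterrows():
    by_name = row["test b"]

# itertuples() yields namedtuples; column names that are not valid
# identifiers are replaced with positional names such as "_2".
first = next(frame.itertuples())
fields = first._fields
```

With test names that are not valid identifiers, lookups by the original name only work on the rows from iterrows(), which may be why the dictionary-style access feels more convenient here.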

sorted_test_failures = test_failures.sort_values('Failure Count', ascending=False)  

sorted_test_failures.to_csv('failures.csv', encoding='utf8')

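That does some sorting and saving.

I've tested this on a dummy dataset of (8000 x 66). It doesn't offer groundbreaking speed, but it gets the job done. Any improvements would be great!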
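One possible improvement is to replace the row loop with a single vectorized comparison. This is only a sketch: the threshold values and the data below are invented stand-ins for the real 67-test case, and it assumes every key of threshold_dict is a column of data.

```python
import pandas as pd

# Hypothetical thresholds and data standing in for the real 67-test case.
threshold_dict = {"test_a": 10, "test_b": 5, "test_c": 0.5}
data = pd.DataFrame(
    {
        "test_a": [12, 3, 11, 1],
        "test_b": [6, 1, 2, 0],
        "test_c": [0.9, 0.1, 0.2, 0.3],
    }
)

# Align the thresholds with the columns, then compare every cell at once.
thresholds = pd.Series(threshold_dict)
exceeded = data[thresholds.index].gt(thresholds, axis="columns")

# Count the exceeded thresholds per row, mirroring the two loop counters.
result = data.assign(**{
    "Test Count": len(thresholds),
    "Failure Count": exceeded.sum(axis=1),
})

# Keep rows with at least one failure, like the ' | ' query does.
test_failures = result[result["Failure Count"] > 0]
sorted_test_failures = test_failures.sort_values("Failure Count", ascending=False)
```

Because the boolean frame exceeded already holds every comparison, the query string is no longer needed for filtering, and the same frame shows which individual tests failed.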

Answer 2

This is answered here:

https://stackoverflow.com/a/24516612/6815750

But as an example, you could do the following:

import pandas as pd

new_df = df.apply(pd.Series.value_counts, axis=1)  # where df is your current dataframe holding the pass/fails

df[new_df.columns] = new_df
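As a small worked sketch of this idea, assume the pass/fail frame stores the strings "pass" and "fail" (the labels and column names here are hypothetical; the real frame may encode results differently). Values that never occur in a row come back as NaN, so the sketch fills them with 0.

```python
import pandas as pd

# Hypothetical pass/fail frame with one column per test.
df = pd.DataFrame(
    {
        "test_a": ["pass", "fail", "pass"],
        "test_b": ["pass", "fail", "fail"],
    }
)

# Count how often each label occurs in every row; labels missing from a
# row produce NaN, which is replaced by 0 here.
counts = df.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)

# Attach the per-row counts as new columns next to the original results.
with_counts = df.join(counts)
```

Here join is used instead of assigning to df[new_df.columns] only to leave the original frame untouched; either way the result carries one count column per distinct label.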

Answer 3

You could use the following approach instead:

import pandas as pd

# Maps each column to the column it is compared against
dictionary = {'a': 'b', 'b': 'c'}
data = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 1, 2], 'c': [2, 1, 1]})
# One boolean column per condition
test_components = pd.DataFrame([data.loc[:, k] > data.loc[:, v] for k, v in dictionary.items()]).T
# now can inspect what conditions were met in `test_components` variable
condition = test_components.any(axis=1)
data_filtered = data.loc[condition, :]
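Since the question asks for a count, the same boolean frame can be summed per row. The following is a sketch under the same toy data; the column name conditions_met is a hypothetical choice, and the boolean frame is built from a dict so that each condition gets a readable label.

```python
import pandas as pd

dictionary = {"a": "b", "b": "c"}
data = pd.DataFrame({"a": [1, 2, 3], "b": [2, 1, 2], "c": [2, 1, 1]})

# One labelled boolean column per condition, e.g. "a > b".
test_components = pd.DataFrame(
    {f"{k} > {v}": data[k] > data[v] for k, v in dictionary.items()}
)

# True counts as 1, so a row-wise sum is the number of conditions met.
data_counted = data.assign(conditions_met=test_components.sum(axis=1))

# Rows that met at least one condition, equivalent to any(axis=1).
data_filtered = data_counted[data_counted["conditions_met"] > 0]
```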

Python Pandas: is it possible to run the dataframe.query method with a counter?
