Pandas DataFrame groupby filter based on summary statistics

Date: 2017-03-14 23:48:47

Tags: python pandas

I have a large DataFrame like this:

import numpy as np
import pandas as pd

runs = 100
rows = runs * 20
# 20 rows per run: 4 colors x 5 sizes
run_index = sum([[n] * 20 for n in range(runs)], [])
color = sum([sum([[n] * 5 for n in range(4)], []) for each in range(runs)], [])
size = sum([sum([list(range(5)) for each in range(4)], []) for each in range(runs)], [])
value1 = [np.random.random() for each in range(rows)]
value2 = [np.random.random() for each in range(rows)]
value3 = [np.random.random() for each in range(rows)]
df = pd.DataFrame(list(zip(run_index, color, size, value1, value2, value3)),
                  columns=['run', 'color', 'size', 'value1', 'value2', 'value3'])

How can I implement the following filter: basically, exclude every run in which at least one of value{1,2,3} falls below the 5th percentile (or any other statistic) of that value within its (color, size) category, computed across all runs.

runs_to_exclude = set()
value_names = ['value1', 'value2', 'value3']
aggregate_stats = df.groupby(['color', 'size'])[value_names].describe(percentiles=[0.05])
for run, run_group in df.groupby('run'):
    for index, row in run_group.iterrows():
        row_color = row['color']
        row_size = row['size']
        for value_name in value_names:
            # aggregate_stats has (color, size) rows and (value, statistic) columns.
            if row[value_name] < aggregate_stats.loc[(row_color, row_size), (value_name, '5%')]:
                runs_to_exclude.add(run)
                break

Running for loops over a large DataFrame is very slow and clearly not the pandas way. How can I do something like this with efficient pandas? The groupby().filter(filter_function) approach seems to only work with constants.
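
Something along the lines of the sketch below is what I am hoping for, assuming transform() is the right tool (thresholds and below are names I made up):

thresholds = df.groupby(['color', 'size'])[value_names].transform(lambda s: s.quantile(0.05))
# Boolean frame, same shape as df[value_names]: True wherever a value
# falls below its (color, size) category's 5th percentile.
below = df[value_names].lt(thresholds)
runs_to_exclude = set(df.loc[below.any(axis=1), 'run'])
filtered = df[~df['run'].isin(runs_to_exclude)]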

1 Answer:

Answer 0 (score: 0):

Change your statistics aggregation slightly so that it contains only the values you want to check against:

aggregate_stats = df.groupby(['color', 'size'])[value_names].\
                     describe(percentiles=[0.05])
# In current pandas the statistic sits in the second column level of the
# describe() output; take the '5%' cross-section for every value column.
fifth_p = aggregate_stats.xs('5%', axis=1, level=1)
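
Alternatively, the same (color, size)-indexed frame should come straight out of groupby().quantile(), if I am not mistaken:

# Equivalent shortcut: per-category 5th percentiles in one call.
fifth_p = df.groupby(['color', 'size'])[value_names].quantile(0.05)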

(This reshaping step may be unnecessary?)

Assuming you keep the indexes matched, you can now compare the DataFrames directly; the indexes should keep them aligned.

run_to_check = 0
checker = df.loc[df.run == run_to_check,
                 ['color', 'size',
                  'value1', 'value2', 'value3']].\
             set_index(['color', 'size']) < fifth_p
if checker.sum().sum() > 0:
    # At least one value in this run fell below its fifth_p
    # threshold, so the run is bad.
    runs_to_exclude.add(run_to_check)
else:
    # Run is fine.
    pass
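
To vet every run in one pass rather than looping over run_to_check, the same comparison can be broadcast by reindexing fifth_p onto the rows. A rough sketch under the same index-matching assumption (row_thresholds and bad_runs are made-up names):

# Repeat each (color, size) category's thresholds once per matching row of df.
row_thresholds = fifth_p.reindex(pd.MultiIndex.from_frame(df[['color', 'size']]))
# Flag rows where any value falls below its category's 5th percentile.
below = df[value_names].values < row_thresholds.values
bad_runs = df.loc[below.any(axis=1), 'run'].unique()
df_clean = df[~df['run'].isin(bad_runs)]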