Given a large DataFrame like this:
import numpy as np
import pandas as pd

runs = 100
rows = runs * 20
# Each run has 20 rows: 4 colors x 5 sizes
run_index = sum([[n] * 20 for n in range(runs)], [])
color = sum([sum([[n] * 5 for n in range(4)], []) for each in range(runs)], [])
# list(range(5)) is needed in Python 3, where a range cannot be concatenated to a list
size = sum([sum([list(range(5)) for each in range(4)], []) for each in range(runs)], [])
value1 = [np.random.random() for each in range(rows)]
value2 = [np.random.random() for each in range(rows)]
value3 = [np.random.random() for each in range(rows)]
df = pd.DataFrame(list(zip(run_index, color, size, value1, value2, value3)), columns=['run', 'color', 'size', 'value1', 'value2', 'value3'])
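How can I implement the following filter? Basically, exclude every run in which at least one of the values {1, 2, 3} is below the 10th percentile (or any other statistic) of that value within its (color, size) category across all runs. The code below uses the 5th percentile: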
runs_to_exclude = set()
value_names = ['value1', 'value2', 'value3']
# In recent pandas, describe() on a groupby returns columns keyed by (value name, statistic)
aggregate_stats = df.groupby(['color', 'size'])[value_names].describe(percentiles=[0.05])
for run, run_group in df.groupby('run'):
    for index, row in run_group.iterrows():
        # iterrows() upcasts each row to float, so cast the category keys back to int
        row_color = int(row['color'])
        row_size = int(row['size'])
        for value_name in value_names:
            if row[value_name] < aggregate_stats.loc[(row_color, row_size), (value_name, '5%')]:
                runs_to_exclude.add(run)
                break
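Looping over a large DataFrame with for loops is very slow and is clearly not the pandas way. How can I do something like this efficiently in pandas? The groupby().filter(filter_function) approach seems to work only with constants.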
Answer 0 (score: 0)
Change your statistics aggregation slightly so that it contains only the values you want to check:
aggregate_stats = df.groupby(['color', 'size'])[value_names].\
    describe(percentiles=[0.05])
# Keep only the '5%' statistic of each value column.
# Recent pandas stores the statistic names in the second column level,
# so the result is indexed by (color, size) with columns value1, value2, value3.
fifth_p = aggregate_stats.xs('5%', axis=1, level=1)
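(Perhaps unnecessary?)
Assuming you keep the indexes matching, you can now compare the two DataFrames directly; the shared (color, size) index keeps the rows aligned.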
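run_to_check = 0
checker = df.loc[df.run == run_to_check,
                 ['color', 'size',
                  'value1', 'value2', 'value3']].\
    set_index(['color', 'size']) < fifth_p
if checker.sum().sum() > 0:
    # Some fifth_p value was greater than a df value, so the run is bad
    print('run', run_to_check, 'is bad')
else:
    # No value fell below its 5th percentile, so the run is fine
    print('run', run_to_check, 'is fine')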
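The check above handles one run at a time. As a rough sketch of how the same comparison could be applied to every run at once without a Python loop, the per-category percentiles can be joined back onto the rows and compared in a single step. The synthetic data, the `_p5` column suffix, and the names `with_p5`, `row_is_low` and `filtered` are assumptions made for illustration, and `quantile(0.05)` is used here in place of `describe()`:

```python
import numpy as np
import pandas as pd

# Synthetic data with the question's layout: 100 runs x 4 colors x 5 sizes.
rng = np.random.default_rng(0)
value_names = ['value1', 'value2', 'value3']
keys = pd.MultiIndex.from_product(
    [range(100), range(4), range(5)], names=['run', 'color', 'size'])
df = pd.DataFrame(rng.random((len(keys), 3)), index=keys,
                  columns=value_names).reset_index()

# 5th percentile of each value within every (color, size) category, over all runs.
fifth_p = df.groupby(['color', 'size'])[value_names].quantile(0.05)

# Attach each row's category thresholds as extra columns (value1_p5, ...).
# The left join keeps df's row order and index.
with_p5 = df.join(fifth_p.add_suffix('_p5'), on=['color', 'size'])

# Flag rows where at least one value is below its category threshold.
p5_names = [name + '_p5' for name in value_names]
row_is_low = (with_p5[value_names].to_numpy()
              < with_p5[p5_names].to_numpy()).any(axis=1)

# A run is excluded as soon as one of its rows is flagged.
runs_to_exclude = set(df.loc[row_is_low, 'run'])
filtered = df[~df['run'].isin(list(runs_to_exclude))]
```

Swapping `quantile(0.05)` for another aggregation (for example `mean()`) should give the "any other statistic" variant, as long as the result is still indexed by (color, size).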