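I have a demo application that uses crossfilter.js dimensions and groups to drive charts through interactive filters (very much like the airline live demo at http://square.github.io/crossfilter/). My real dataset is too large for crossfilter.js. However, I have successfully used pandas to filter the data in a similar way.
What I'm struggling with is how to model/represent crossfilter's group() behavior in pandas:
A grouping intersects the crossfilter's current filters, except for the associated dimension's filter. https://github.com/square/crossfilter/wiki/API-Reference#group-map-reduce
For example, using vehicle data: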
Make Year Color
-------------------
Ford 2000 Red
Honda 2001 Blue
Ford 2001 Green
If I apply the filter Make: Ford and get the counts for each dimension/group, I would expect:
Make:
Ford: 2
Honda: 1
Year:
2000: 1
2001: 1
Color:
Red: 1
Blue: 0
Green: 1
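So for the Make dimension, the Make: Ford filter is removed when computing the counts. For the Year and Color dimensions, the filter is applied, so the 2001 Blue Honda does not contribute to the counts.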
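For the simple case where every filter is a single equality test on one column, the behavior described above can be sketched directly in pandas: for each dimension, combine the masks of every other dimension's filter, then count. The `filters` dict format and the function name `group_counts` are assumptions made for illustration; they are not part of crossfilter or pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["Ford", "Honda", "Ford"],
    "Year": [2000, 2001, 2001],
    "Color": ["Red", "Blue", "Green"],
})

def group_counts(df, filters, dimensions):
    """Count values per dimension, ignoring that dimension's own filter.

    `filters` is a hypothetical {column: required_value} mapping.
    """
    results = {}
    for dim in dimensions:
        # Start from an all-True mask, then apply every filter except dim's own.
        mask = pd.Series(True, index=df.index)
        for column, value in filters.items():
            if column != dim:
                mask &= (df[column] == value)
        counts = df.loc[mask, dim].value_counts()
        # Reindex against every value seen in the unfiltered data so that
        # values excluded by the filters appear with a count of 0.
        counts = counts.reindex(df[dim].unique(), fill_value=0)
        results[dim] = counts.to_dict()
    return results

counts = group_counts(df, {"Make": "Ford"}, ["Make", "Year", "Color"])
```

On the vehicle data this reproduces the counts listed above, including Blue: 0. It does not handle range filters or and/or/not combinations.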
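Answer 0 (score: 0)
In the absence of a good answer, here is what I cobbled together. It requires encoding the filters as a tree so that they can be traversed, and on each pass the filter for the appropriate series is nullified. I'm still interested in a better solution.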
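To make the tree encoding concrete, here is a small sketch of a nested filter and a traversal that lists the series it touches. The tuple layout mirrors the formats noted in the answer's code comments; the helper name `series_in_filter` and the sample filter are hypothetical.

```python
def series_in_filter(payload):
    """Return the set of series names referenced anywhere in a filter tree."""
    if not payload:
        return set()
    op = payload[0]
    if op == 'not':
        # format: ('not', nested_filter)
        return series_in_filter(payload[1])
    if op in ('and', 'or'):
        # format: (operator, nested_filter_1, nested_filter_2)
        return series_in_filter(payload[1]) | series_in_filter(payload[2])
    # leaf format: (operator, series, value)
    return {payload[1]}

# Fords that are not from 2001, or anything red
filters = ('or',
           ('and', ('eq', 'Make', 'Ford'), ('not', ('eq', 'Year', 2001))),
           ('eq', 'Color', 'Red'))

touched = series_in_filter(filters)
```

A traversal like this could be used to decide which series actually need a pass with their filter nullified; series that never appear in the tree can share one filtered frame.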
An example call based on the question above, where df is a pandas DataFrame:
crossfilter(df, ('eq', 'Make', 'Ford'), ['Make', 'Year', 'Color'])
Code:
import operator

# leaf comparison operators of the form (operator, series, value)
# NOTE: value_ops is used below but was not defined in the original post;
# this mapping is an assumed reconstruction.
value_ops = {
    'eq': operator.eq,
    'ne': operator.ne,
    'lt': operator.lt,
    'le': operator.le,
    'gt': operator.gt,
    'ge': operator.ge,
}

# group operators of the form (operator, filter1, filter2)
group_ops = {
    'and': operator.and_,
    'or': operator.or_,
}

# hokey way of forcing all-pass or all-fail filters: NaN is unequal to
# everything, so `series != nan` is all True and `series == nan` is all False
nan = float('nan')

# recursive function that turns a tree of nested-tuple encoded filters into
# a boolean mask for pandas
def build_filter(df, payload, nullify_series=None, nullify_value=True):
    if not payload:
        # no filters, but we have to return something,
        # so grab the first series and build an all-True mask from it
        return operator.ne(df.iloc[:, 0], nan)
    op = payload[0]
    if op in value_ops:
        # format: (operator, series, val)
        series = payload[1]
        value = payload[2]
        if series == nullify_series:
            # nullify filter
            if nullify_value:
                # push toward True (e.g. nested in an 'and' operator)
                return operator.ne(df[series], nan)
            else:
                # push toward False (used directly under 'not', which
                # inverts it back to all True)
                return operator.eq(df[series], nan)
        return value_ops[op](df[series], value)
    elif op == 'not':
        # format: ('not', nested_filter)
        value = payload[1]
        return operator.inv(build_filter(df, value, nullify_series, False))
    else:
        # format: (operator, nested_filter_1, nested_filter_2)
        group1 = payload[1]
        group2 = payload[2]
        return group_ops[op](build_filter(df, group1, nullify_series, True),
                             build_filter(df, group2, nullify_series, True))

# returns value counts for all series in `gather`, applying the filters in
# `filters` on every series except the one being counted.
# Filters may only reference series listed in `gather`, and values with no
# matching rows are omitted by value_counts() rather than reported as 0.
def crossfilter(df, filters, gather):
    df_scoped = df[gather]
    results = {}
    for series in gather:
        if filters:
            df_filtered = df_scoped[build_filter(df_scoped, filters, series)]
        else:
            df_filtered = df_scoped
        results[series] = df_filtered[series].value_counts().to_dict()
    return results
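The all-pass and all-fail masks above rely on NaN comparing unequal to every value, including itself. The short check below illustrates that behavior and, as a suggested alternative rather than what the answer uses, shows explicit masks that do not depend on it. The sample series are made up for illustration.

```python
import pandas as pd

nan = float('nan')
make = pd.Series(['Ford', 'Honda', 'Ford'])
year = pd.Series([2000.0, nan, 2001.0])  # includes a missing value

# Comparing against NaN: "not equal" is True everywhere,
# "equal" is False everywhere
all_pass = make != nan
all_fail = make == nan
year_pass = year != nan  # the NaN row passes too, since NaN != NaN

# Explicit masks that do not depend on NaN comparison semantics
explicit_pass = pd.Series(True, index=make.index)
explicit_fail = pd.Series(False, index=make.index)
```

The explicit form states the intent more directly and avoids borrowing an arbitrary column just to get a mask of the right length.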