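I have a Pandas DataFrame containing a large number of categories. Each category has features, and each feature has its own subfeatures, which are grouped in pairs. A simplified version looks like this: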
                                         0         1  ...
categories features subfeatures
cat1       feature1 subfeature1  -0.224487 -0.227524
                    subfeature2  -0.591399 -0.799228
           feature2 subfeature1   1.190110 -1.365895  ...
                    subfeature2   0.720956 -1.325562
cat2       feature1 subfeature1   1.856932       NaN
                    subfeature2  -1.354258 -0.740473
           feature2 subfeature1   0.234075 -1.362235  ...
                    subfeature2   0.013875  1.309564
cat3       feature1 subfeature1        NaN       NaN
                    subfeature2  -1.260408  1.559721  ...
           feature2 subfeature1   0.419246  0.084386
                    subfeature2   0.969270  1.493417
...                                    ...       ...
It can be generated with the following code:
import pandas as pd
import numpy as np
np.random.seed(seed=90)
results = np.random.randn(3,2,2,2)
results[2,0,0,:] = np.nan
results[1,0,0,1] = np.nan
results = results.reshape((-1,2))
index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                    ["feature1", "feature2"],
                                    ["subfeature1", "subfeature2"]],
                                   names=["categories", "features", "subfeatures"])
df = pd.DataFrame(results, index=index)
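Now I want to retrieve the top-level categories (cat1, etc.) for which the difference between subfeature1 and subfeature2 within the same column (0 or 1) exceeds a particular threshold.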
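For example, if the threshold is 1, I want cat2 and cat3 to be returned. For feature1 of cat2, the difference between subfeature1 and subfeature2 in column 0 is 1.856932 - (-1.354258) = 3.21119, which is greater than the threshold of 1. Similarly, for feature2 of cat3, the difference between subfeature1 and subfeature2 in column 1 is 1.493417 - 0.084386 = 1.409031 > 1. On the other hand, cat1 is not returned, because none of the differences between its subfeature pairs is greater than 1. A NaN value invalidates a pair, which is then ignored.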
I have managed to implement an iterative approach, but I feel I am not using the full power of Pandas, and the performance is lacking:
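threshold = 1
results = []  # categories that exceed the threshold (reuses the name of the array above)
for cat in df.index.levels[0]:
    for feature in df.index.levels[1]:
        # Two rows (subfeature1, subfeature2) for this category/feature pair
        df2 = df.xs((cat, feature))
        diffs = abs(df2.loc['subfeature1'] - df2.loc['subfeature2'])
        # Series.max() skips NaN, so an invalid pair is ignored
        if diffs.max() > threshold and cat not in results:
            results.append(cat)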
This gives the following result:
['cat2', 'cat3']
How can I achieve this using Pandas' built-in vectorized functionality?
EDIT: Using Jeff's answer below, I found something funky:
def f(x):
    a = max(abs(x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')))
    print(a)
    return a > 1
result = df.groupby(level=['categories','features']).filter(f)
print(result)
which gives:
0.366912262765
0.571703714569
1
0.469153603312
0.0403331129905
3.2111900125 <------------------------------------------------
nan
0.220200012413
2.67179897269 <---------------------------------------------------
nan
nan
0.550023734074
1.40903094796 <-----------------------------------------------------!!!!!!!!!!!
                                         0         1
categories features subfeatures
cat2       feature1 subfeature1   1.856932       NaN
                    subfeature2  -1.354258 -0.740473
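I have highlighted all the places where the algorithm should include the category based on the score. However, it does not work for cat3. Could the NaNs have something to do with it?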
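One thing that may be worth checking when NaNs are involved: Python's built-in max and the pandas .max() method treat NaN differently. The short sketch below only illustrates that difference on made-up values (the list contents are illustrative, not taken from the DataFrame); it does not by itself explain the output above.

```python
import math
import pandas as pd

nan = float("nan")

# Built-in max compares element by element; any comparison with NaN is
# False, so the result depends on where the NaN sits in the sequence.
nan_first = max([nan, 1.5])   # NaN is kept, because 1.5 > nan is False
nan_last = max([1.5, nan])    # 1.5 is kept, because nan > 1.5 is False

# pandas skips NaN by default (skipna=True).
pandas_max = pd.Series([nan, 1.5]).max()

# A NaN never exceeds a threshold.
nan_exceeds = nan > 1

print(nan_first, nan_last, pandas_max, nan_exceeds)
```

Using the .max() method of a Series or DataFrame rather than the built-in max avoids the order dependence.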
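Answer 0 (score: 1)
Group by the top two levels, then use filter to keep only the groups whose maximum difference between the desired subfeatures exceeds the threshold (here the threshold is 0).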
In [41]: df.groupby(level=['categories','features']).filter(lambda x: (x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')).max()>0)
Out[41]:
                                         0         1
categories features subfeatures
cat1       feature1 subfeature1  -0.224487 -0.227524
                    subfeature2  -0.591399 -0.799228
           feature2 subfeature1   1.190110 -1.365895
                    subfeature2   0.720956 -1.325562
cat2       feature1 subfeature1   1.856932       NaN
                    subfeature2  -1.354258 -0.740473
           feature2 subfeature1   0.234075 -1.362235
                    subfeature2   0.013875  1.309564
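Note that the filter above uses the signed difference, so a group where subfeature2 is larger than subfeature1 never passes. If the absolute difference and a threshold of 1 are what is wanted, one possible fully vectorized sketch is shown here. It is not part of the original answer: the names sub1, sub2, diffs, exceeds and cats are mine, and the data is typed in from the table in the question rather than regenerated from the seed.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Values copied from the table in the question.
values = [
    [-0.224487, -0.227524], [-0.591399, -0.799228],
    [1.190110, -1.365895], [0.720956, -1.325562],
    [1.856932, nan], [-1.354258, -0.740473],
    [0.234075, -1.362235], [0.013875, 1.309564],
    [nan, nan], [-1.260408, 1.559721],
    [0.419246, 0.084386], [0.969270, 1.493417],
]
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"])
df = pd.DataFrame(values, index=index)

threshold = 1

# One row per (category, feature) for each member of the pair.
sub1 = df.xs("subfeature1", level="subfeatures")
sub2 = df.xs("subfeature2", level="subfeatures")

# Absolute difference per column; a NaN in either member gives NaN.
diffs = (sub1 - sub2).abs()

# NaN > threshold is False, so invalid pairs are ignored automatically.
exceeds = (diffs > threshold).any(axis=1)

# A category qualifies if any of its features qualifies.
per_category = exceeds.groupby(level="categories").any()
cats = per_category[per_category].index.tolist()
print(cats)  # ['cat2', 'cat3']
```

This avoids calling a Python function once per group, which is usually where the iterative and filter-based versions spend their time.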
A useful debugging aid is to do something like this:
def f(x):
    print(x)  # show the group that is being evaluated
    return (x.xs(......)) # e.g. the filter from above
df.groupby(.....).filter(f)
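For a concrete, runnable variant of this debugging pattern, here is a minimal sketch under a few assumptions of mine: the data is typed in from the table in the question, the function name debug_filter and the variable threshold are hypothetical, and the comparison uses the absolute difference with a threshold of 1 as the question describes. The function returns a single bool per group, which is what filter expects.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Values copied from the table in the question.
values = [
    [-0.224487, -0.227524], [-0.591399, -0.799228],
    [1.190110, -1.365895], [0.720956, -1.325562],
    [1.856932, nan], [-1.354258, -0.740473],
    [0.234075, -1.362235], [0.013875, 1.309564],
    [nan, nan], [-1.260408, 1.559721],
    [0.419246, 0.084386], [0.969270, 1.493417],
]
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"])
df = pd.DataFrame(values, index=index)

threshold = 1  # assumed threshold, as in the question's example


def debug_filter(x):
    """Print the group key and its largest absolute difference."""
    diff = (x.xs("subfeature1", level="subfeatures")
            - x.xs("subfeature2", level="subfeatures")).abs()
    # First .max() reduces each column (skipping NaN), second reduces
    # across columns; the result is NaN only if every value is NaN.
    largest = diff.max().max()
    print(x.index[0][:2], largest)
    # NaN > threshold is False, so all-NaN groups are dropped.
    return bool(largest > threshold)


kept = df.groupby(level=["categories", "features"]).filter(debug_filter)
kept_categories = kept.index.get_level_values("categories").unique().tolist()
print(kept_categories)
```

Depending on the pandas version, the function may be called more than once for the first group, so a repeated first line in the printed output is not necessarily a bug.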