Below is a simpler example of what I am trying to do. Basically, I want to go through this large pd.DataFrame object, group the rows by bin_id, and then, within each group, separate the outliers from the rest so that each group is split into 2 subsets. I figure that if I can learn how to do this with GroupBy, I can extend it to my actual dataset, which has additional columns. For example, in my real problem I also need to incorporate bin_type, and some of the attr_j columns are lists, but that would make the question here too complicated. Subsetting the pd.DataFrame the way I am doing now is not a problem, but this seems like a good opportunity to learn GroupBy. I tried following the documentation at http://pandas.pydata.org/pandas-docs/stable/groupby.html, but I am not entirely sure what levels refers to, and many of the examples use aggregate, which I don't think applies to this problem.
from collections import defaultdict
import numpy as np
import pandas as pd

np.random.seed(0)
# Shape
n, m = 10, 4
# Synthesize data
DF_data = pd.DataFrame(np.random.normal(size=(n,m)), columns=["attr_%d"%(_) for _ in range(m)])
DF_data.insert(loc=0, column="bin_id", value=["bin_0","bin_0","bin_0","bin_1","bin_1","bin_2","bin_2","bin_2","bin_2", "bin_3"])
DF_data.insert(loc=1, column="bin_type", value=["core", "modular","redundant", "core","redundant", "core","core","modular", "modular", "core"])
DF_data.insert(loc=5, column="outlier", value=[False, False, False, True, False, False, False, True, False, False])
DF_data.index = pd.Index(["".join(np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), size=n)) for _ in range(n)], name="label_id")
# print(DF_data)
# bin_id bin_type attr_0 attr_1 attr_2 outlier attr_3
# label_id
# odvmzkuleg bin_0 core 1.764052 0.400157 0.978738 False 2.240893
# epudmeuiop bin_0 modular 1.867558 -0.977278 0.950088 False -0.151357
# udxpnvvqrf bin_0 redundant -0.103219 0.410599 0.144044 False 1.454274
# jdafarsecq bin_1 core 0.761038 0.121675 0.443863 True 0.333674
# dcknqhvjak bin_1 redundant 1.494079 -0.205158 0.313068 False -0.854096
# slxczcddso bin_2 core -2.552990 0.653619 0.864436 False -0.742165
# dursojbekw bin_2 core 2.269755 -1.454366 0.045759 False -0.187184
# lilctqawag bin_2 modular 1.532779 1.469359 0.154947 True 0.378163
# toktyinycd bin_2 modular -0.887786 -1.980796 -0.347912 False 0.156349
# clnqiiticy bin_3 core 1.230291 1.202380 -0.387327 False -0.302303
D_bin_data = defaultdict(lambda: defaultdict(list))
for bin_id in DF_data["bin_id"].unique():
    # Group by `bin_id`
    mask_idx = DF_data["bin_id"] == bin_id
    DF_tmp = DF_data.loc[mask_idx, :]
    # Split groups into outliers and non-outliers
    not_outlier_mask = ~DF_tmp["outlier"]
    # Send them to a 2 level dict (2 or 3?)
    D_bin_data[bin_id]["non-outliers"] = DF_tmp.index[not_outlier_mask]
    D_bin_data[bin_id]["outliers"] = DF_tmp.index[DF_tmp["outlier"]]
# print(D_bin_data)
# defaultdict(<function __main__.<lambda>>,
# {'bin_0': defaultdict(list,
# {'non-outliers': Index(['odvmzkuleg', 'epudmeuiop', 'udxpnvvqrf'], dtype='object', name='label_id'),
# 'outliers': Index([], dtype='object', name='label_id')}),
# 'bin_1': defaultdict(list,
# {'non-outliers': Index(['dcknqhvjak'], dtype='object', name='label_id'),
# 'outliers': Index(['jdafarsecq'], dtype='object', name='label_id')}),
# 'bin_2': defaultdict(list,
# {'non-outliers': Index(['slxczcddso', 'dursojbekw', 'toktyinycd'], dtype='object', name='label_id'),
# 'outliers': Index(['lilctqawag'], dtype='object', name='label_id')}),
# 'bin_3': defaultdict(list,
# {'non-outliers': Index(['clnqiiticy'], dtype='object', name='label_id'),
# 'outliers': Index([], dtype='object', name='label_id')})})
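Looking up one of the subsets afterwards is then just dictionary access, e.g. for the outliers of bin_1 (taken straight from the output above):

D_bin_data["bin_1"]["outliers"]
# Index(['jdafarsecq'], dtype='object', name='label_id')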
Is it possible to achieve the above using GroupBy? I have never used SQL, but apparently the syntax is quite similar.
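To clarify the direction I am imagining, here is a minimal sketch that iterates over the (name, group) pairs yielded by DF_data.groupby("bin_id"); the name D_bin_data_gb is just for this sketch, and I don't know whether this is the idiomatic way, which is part of what I am asking:

# Sketch of what I imagine a GroupBy version might look like
# (not sure this is idiomatic -- that is part of the question):
D_bin_data_gb = {
    bin_id: {
        "outliers": DF_group.index[DF_group["outlier"]],
        "non-outliers": DF_group.index[~DF_group["outlier"]],
    }
    for bin_id, DF_group in DF_data.groupby("bin_id")
}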