How do I use GroupBy in Pandas to organize the index into groups?

Date: 2016-12-27 19:25:02

Tags: python pandas indexing dataframe group-by

Below is a simplified example of what I'm trying to do. Basically, I want to take this large pd.DataFrame object, group the rows by bin_id, and then, within each group, split the rows into two subsets: the outliers and the non-outliers.

I feel like if I can figure out how to do this with GroupBy, I can extend it to my actual dataset, which has additional columns. For example, in my real problem I also need to incorporate bin_type, and some of the attr_j columns contain lists, but including that here would make the question too complicated. Doing the pd.DataFrame subsetting the way I do now works fine, but this seems like a good opportunity to learn GroupBy. I tried following the documentation at http://pandas.pydata.org/pandas-docs/stable/groupby.html, but I'm not entirely sure what levels refers to, and many of the examples use aggregate, which I don't think applies to this problem.

from collections import defaultdict

import numpy as np
import pandas as pd

np.random.seed(0)
# Shape
n, m = 10, 4
# Synthesize data
DF_data = pd.DataFrame(np.random.normal(size=(n,m)), columns=["attr_%d"%(_) for _ in range(m)])
DF_data.insert(loc=0, column="bin_id", value=["bin_0","bin_0","bin_0","bin_1","bin_1","bin_2","bin_2","bin_2","bin_2", "bin_3"])
DF_data.insert(loc=1, column="bin_type", value=["core", "modular","redundant", "core","redundant", "core","core","modular", "modular", "core"])
DF_data.insert(loc=5, column="outlier", value=[False, False, False, True, False, False, False, True, False, False])
DF_data.index = pd.Index(["".join(np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), size=n)) for _ in range(n)], name="label_id")
# print(DF_data)
#            bin_id   bin_type    attr_0    attr_1    attr_2 outlier    attr_3
# label_id                                                                    
# odvmzkuleg  bin_0       core  1.764052  0.400157  0.978738   False  2.240893
# epudmeuiop  bin_0    modular  1.867558 -0.977278  0.950088   False -0.151357
# udxpnvvqrf  bin_0  redundant -0.103219  0.410599  0.144044   False  1.454274
# jdafarsecq  bin_1       core  0.761038  0.121675  0.443863    True  0.333674
# dcknqhvjak  bin_1  redundant  1.494079 -0.205158  0.313068   False -0.854096
# slxczcddso  bin_2       core -2.552990  0.653619  0.864436   False -0.742165
# dursojbekw  bin_2       core  2.269755 -1.454366  0.045759   False -0.187184
# lilctqawag  bin_2    modular  1.532779  1.469359  0.154947    True  0.378163
# toktyinycd  bin_2    modular -0.887786 -1.980796 -0.347912   False  0.156349
# clnqiiticy  bin_3       core  1.230291  1.202380 -0.387327   False -0.302303
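
From reading the docs, this is as far as I got with GroupBy itself. My (possibly mistaken) understanding is that .groups maps each group key to the row labels in that group, which gives me the per-bin label_id values but not the outlier/non-outlier split. The gb name below is just mine for illustration:

# My rough understanding so far: .groups maps each group key to an
# Index of row labels, e.g. grouping on `bin_id` alone:
gb = DF_data.groupby("bin_id")
print(gb.groups["bin_0"])
# Index(['odvmzkuleg', 'epudmeuiop', 'udxpnvvqrf'], dtype='object', name='label_id')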

D_bin_data = defaultdict(lambda: defaultdict(list))
for bin_id in DF_data["bin_id"].unique():
    # Group by `bin_id`
    mask_idx = DF_data["bin_id"] == bin_id
    DF_tmp = DF_data.loc[mask_idx,:]
    # Split groups into outliers and non-outliers
    not_outlier_mask = DF_tmp["outlier"] == False
    # Send them to a 2 level dict (2 or 3?)
    D_bin_data[bin_id]["non-outliers"] = DF_tmp.index[not_outlier_mask]
    D_bin_data[bin_id]["outliers"] = DF_tmp.index[DF_tmp["outlier"]]

# print(D_bin_data)
# defaultdict(<function __main__.<lambda>>,
#             {'bin_0': defaultdict(list,
#                          {'non-outliers': Index(['odvmzkuleg', 'epudmeuiop', 'udxpnvvqrf'], dtype='object', name='label_id'),
#                           'outliers': Index([], dtype='object', name='label_id')}),
#              'bin_1': defaultdict(list,
#                          {'non-outliers': Index(['dcknqhvjak'], dtype='object', name='label_id'),
#                           'outliers': Index(['jdafarsecq'], dtype='object', name='label_id')}),
#              'bin_2': defaultdict(list,
#                          {'non-outliers': Index(['slxczcddso', 'dursojbekw', 'toktyinycd'], dtype='object', name='label_id'),
#                           'outliers': Index(['lilctqawag'], dtype='object', name='label_id')}),
#              'bin_3': defaultdict(list,
#                          {'non-outliers': Index(['clnqiiticy'], dtype='object', name='label_id'),
#                           'outliers': Index([], dtype='object', name='label_id')})})

Is it possible to achieve the above with GroupBy? I've never used SQL, but apparently its syntax is quite similar.
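
In case it helps clarify what I'm after, here is my untested guess at what a GroupBy version might look like: group on both bin_id and outlier at once and read the key-to-labels mapping from .groups (the D_bin_data2 name is just for illustration, and unlike my loop above this would simply skip empty combinations such as the bin_0 outliers):

# Untested sketch: group on both columns; the keys of .groups are then
# (bin_id, outlier) tuples, each mapping to an Index of row labels.
D_bin_data2 = defaultdict(lambda: defaultdict(list))
for (bin_id, is_outlier), labels in DF_data.groupby(["bin_id", "outlier"]).groups.items():
    D_bin_data2[bin_id]["outliers" if is_outlier else "non-outliers"] = labels

If this is roughly right, is it the idiomatic way to do it, or is there a more direct GroupBy construct for this?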

0 Answers:

No answers