Question

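I have a series of values that I want to group, and another series containing the starting positional index of each group after the first (the first group is understood to begin at positional index 0). The series of values can have an arbitrary index. Is there a way to use this to produce a groupby aggregate? Ideally, empty groups would be preserved. Example: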

values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])

Now, values.groupby(group_indices) should group so that the first group is values.iloc[:3], the second is values.iloc[3:3] (an empty group), the third is values.iloc[3:8], and the fourth is values.iloc[8:], and values.groupby(group_indices).mean() would be pd.Series([11.0, NaN, 15.0, 18.5]).
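
For reference, the expected result can be spelled out by slicing positionally between consecutive boundaries. This is only a sketch of the target output, not a proposed solution; the names `bounds` and `expected` are made up for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])

# Boundaries: position 0, each group start, then the total length
bounds = [0, *group_indices.tolist(), len(values)]

# One mean per consecutive pair of boundaries; an empty slice yields NaN
expected = pd.Series(
    [values.iloc[lo:hi].mean() for lo, hi in zip(bounds[:-1], bounds[1:])]
)
```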

Answer 1

Just use the numpy.split routine:

In [1286]: values = pd.Series(np.arange(10, 20))

In [1287]: group_indices = pd.Series([0, 3, 8])

In [1288]: pd.Series([s.mean() for s in np.split(values, group_indices) if s.size])
Out[1288]: 
0    11.0
1    15.0
2    18.5
dtype: float64

To account for "empty" groups, just remove the if s.size check:

In [1304]: group_indices = pd.Series([3, 3, 8])

In [1305]: pd.Series([s.mean() for s in np.split(values, group_indices)])
Out[1305]: 
0    11.0
1     NaN
2    15.0
3    18.5
dtype: float64
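
Passing a Series straight to np.split may not behave the same across pandas and NumPy versions, so a more defensive variant (an assumption on my part, not part of the answer above) converts to plain arrays first. It also uses the question's arbitrary index, since the split is purely positional. The names `chunks` and `means` are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])

# Split the underlying array at the given positions; a repeated
# position produces an empty chunk
chunks = np.split(values.to_numpy(), group_indices.to_numpy())

# Guard empty chunks explicitly to avoid NumPy's empty-slice warning
means = pd.Series([c.mean() if c.size else np.nan for c in chunks])
```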

Answer 2

Here is one simple way:

values.groupby(values.index.isin(group_indices).cumsum()).mean()
Out[454]: 
1    11.0
2    15.0
3    18.5
dtype: float64
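
Note that this output assumes values has the default 0..n-1 index, and the empty group is dropped. A possible positional variant (a sketch using np.searchsorted, which is not what the answer above uses) labels each row with the number of group starts at or before its position, then reindexes to bring back empty groups as NaN.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])

# For each row position, count how many group starts are <= that position
positions = np.arange(len(values))
labels = np.searchsorted(group_indices.to_numpy(), positions, side="right")

# Group by the positional labels, then restore empty groups as NaN
means = values.groupby(labels).mean().reindex(range(len(group_indices) + 1))
```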

Answer 3

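Given your update, here is an odd way to do this with pd.merge_asof. Some care is needed to handle the first group, which runs from position 0 to the first index in the series.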

import pandas as pd
import numpy as np

(pd.merge_asof(values.to_frame('val'), 
               values.iloc[np.r_[group_indices]].reset_index().reset_index().drop(columns=0), 
               left_index=True, right_on='index',
               direction='backward')
   .fillna({'level_0': -1})          # Because your first group is 0: first index
   .groupby('level_0').val.mean()
   .reindex([-1]+[*range(len(group_indices))])  # Get 0 size groups in output
)

level_0
-1    11.0
 0     NaN
 1    15.0
 2    18.5
Name: val, dtype: float64

Answer 4

Let's change group_indices a little so that the group names (1, 2, 3) are visible:

group_indices = pd.Series([1,2,3],index=[0, 3, 8])

Then

values.groupby(group_indices.reindex(values.index,method='ffill')).mean()

will give you what you want.

Note that group_indices.reindex(values.index,method='ffill') assigns a group number to each row of values.
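
As a small illustration of that forward fill, here is a sketch that assumes values has the default 0..9 index, as in Answer 1. With the question's 110..119 index every row would fall after the last label and land in group 3.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10, 20))  # default 0..9 index assumed
group_indices = pd.Series([1, 2, 3], index=[0, 3, 8])

# Each row takes the group name of the most recent start at or before it
labels = group_indices.reindex(values.index, method="ffill")

means = values.groupby(labels).mean()
```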

Answer 5

My solution involves keeping the input as it is and making some ugly adjustments:

Output

pd.DataFrame(values).assign(group=pd.cut(pd.DataFrame(values).index,
                     [-1,2,7,np.inf], labels=[0,1,2])).groupby('group').mean()

Answer 6

Thanks for all the answers, especially WeNYoBen's. The following produces the correct groups and skips over the empty ones.

# First, add the final index to `group_indices` so that
# we have a series of right endpoints, or interval upper bounds
# (Series.append was removed in pandas 2.0, so pd.concat is used here)
upper_bounds = pd.concat([group_indices, pd.Series([values.shape[0]])], ignore_index=True)

# Compute indices of nonempty groups
lower_bounds = upper_bounds.shift(fill_value=0)
nonempty_group_idxs = upper_bounds != lower_bounds

# Get means indexed from 0 to n_nonempty_groups-1
means = values.groupby(pd.RangeIndex(values.shape[0]).isin(upper_bounds).cumsum()).mean()

# Reassign index for the correct groups
means.index = nonempty_group_idxs.index[nonempty_group_idxs]

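This will have a non-contiguous index, where the skipped elements correspond to the empty groups in the original groupby. If you want NaN in those places, you can do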

# There are len(group_indices) + 1 groups in total, one per upper bound
means = means.reindex(index=pd.RangeIndex(upper_bounds.shape[0]))

Groupby given the starting positional index of each group
