Question

My data is unordered.

store_id  period_id  sales_volume
0        4186684        226       1004.60
1        5219836        226        989.00
2        4185865        226        827.45
3        4186186        226        708.40
4        4523929        226        690.75
5        4186441        226        592.55    
...          ...        ...           ...
846960  11710234        195          0.60
846961  11693671        236          0.60
846962  27105667        212          0.60
846963  11693725        201          0.60
846964  27078031        234          0.60
846965  11663800        231          0.60

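In the period_id column, the values indicate how long a process lasted only while they run consecutively; once the sequence breaks, a new period has begun. This representation of periods applies to each store_id separately. Since I can't get the data sorted into order, here is an example of what I mean: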

          store_id    period_id    sales_volume
0          4168621        208        1004.60
1          4168621        209        989.00   #end of period
2          4168621        211        827.45
3          4168621        212        708.40
4          4168621        213        690.75
5          4168621        214        592.55   #end of period
6          41685          208        4634
7          41685          209        3356563  #end of period

I grouped the values by store_id:

df.groupby('store_id').agg(lambda x: x.tolist())

and got

store_id  period_id                           sales_volume

4168621   [226, 202, 199, 204, 224, 193  ...  [27.45,10.0,8.15,7.6, ...
4168624   [226, 216, 215, 225, 214, 217  ...  [429.8, 131.35,92.0   ...
4168636   [226, 217, 238, 223, 234, 240, ...  [33.30, 9.3, 6.4,     ...
4168639   [226, 204, 211, 208, 232, 207, ...  [19.3,8.05, 6.5, 6.4, ...
...       ...                                 ...

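It turns out I need to somehow sort the values in period_id so that I can count how many sequences each store_id produces, as in the second listing above (the example), which shows 3 sequences.

I don't know how to do this...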

Answer 1

If you only need to sort by period_id within each store_id, you can use df.sort_values. Using the example data frame as input:

df = df.sort_values(['store_id', 'period_id']).reset_index(drop=True)

df
   store_id  period_id  sales_volume
0     41685        208       4634.00
1     41685        209    3356563.00
2   4168621        208       1004.60
3   4168621        209        989.00
4   4168621        211        827.45
5   4168621        212        708.40
6   4168621        213        690.75
7   4168621        214        592.55
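
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the example rows from the question, shuffles them to mimic the unordered source data, and sorts them back. The names `shuffled` and `sorted_df` and the `random_state` value are only illustrative assumptions.

```python
import pandas as pd

# Example rows copied from the question.
df = pd.DataFrame({
    "store_id": [4168621] * 6 + [41685] * 2,
    "period_id": [208, 209, 211, 212, 213, 214, 208, 209],
    "sales_volume": [1004.60, 989.00, 827.45, 708.40,
                     690.75, 592.55, 4634.0, 3356563.0],
})

# Shuffle the rows to imitate the unordered input (seed chosen arbitrarily).
shuffled = df.sample(frac=1, random_state=0)

# sort_values returns a new frame, so the result must be assigned.
# Rows are ordered by store first, then by period within each store.
sorted_df = shuffled.sort_values(["store_id", "period_id"]).reset_index(drop=True)
print(sorted_df)
```

Note that sort_values does not modify the frame in place unless inplace=True is passed, so the sorted result has to be kept in a variable.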

If you want to detect each period (for example, so you can group by period afterwards), here is one way:

df['period_group'] = df['period_id'].diff().fillna(1).ne(1).astype(int).cumsum()

df
   store_id  period_id  sales_volume  period_group
0   4168621        208       1004.60             0
1   4168621        209        989.00             0
2   4168621        211        827.45             1
3   4168621        212        708.40             1
4   4168621        213        690.75             1
5   4168621        214        592.55             1
6     41685        208       4634.00             2
7     41685        209    3356563.00             2
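
One caveat: a plain diff() over the whole column ignores store boundaries, so if one store's last period_id happened to be exactly one less than the next store's first period_id, the two runs would be merged. A possible variant, sketched here on the example data and assuming the frame is first sorted by store_id and period_id, takes the difference within each store instead:

```python
import pandas as pd

# Example rows copied from the question.
df = pd.DataFrame({
    "store_id": [4168621] * 6 + [41685] * 2,
    "period_id": [208, 209, 211, 212, 213, 214, 208, 209],
    "sales_volume": [1004.60, 989.00, 827.45, 708.40,
                     690.75, 592.55, 4634.0, 3356563.0],
})
df = df.sort_values(["store_id", "period_id"]).reset_index(drop=True)

# Step to the previous period_id within the same store.
# The first row of every store gets NaN.
step = df.groupby("store_id")["period_id"].diff()

# A new run starts wherever the step is not exactly 1.
# NaN != 1 evaluates to True, so each store's first row also starts a run.
df["period_group"] = step.ne(1).cumsum() - 1  # subtract 1 to start labels at 0
print(df)
```

Because the frame is sorted here, store 41685 comes first and gets group 0, while store 4168621 gets groups 1 and 2.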

You can then group by this new period_group column to analyze the "runs" of consecutive period IDs.
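
To get the number of sequences per store_id that the question asks for, one option is to count the run starts within each store. This is a rough sketch on the example data; the names `new_run`, `runs_per_store` and `summary` are illustrative, not part of the original answer.

```python
import pandas as pd

# Example rows copied from the question.
df = pd.DataFrame({
    "store_id": [4168621] * 6 + [41685] * 2,
    "period_id": [208, 209, 211, 212, 213, 214, 208, 209],
    "sales_volume": [1004.60, 989.00, 827.45, 708.40,
                     690.75, 592.55, 4634.0, 3356563.0],
})
df = df.sort_values(["store_id", "period_id"]).reset_index(drop=True)

# True on every row that starts a new run of consecutive period_ids
# (including the first row of each store, where the diff is NaN).
new_run = df.groupby("store_id")["period_id"].diff().ne(1)

# Number of runs per store = number of run starts in that store.
runs_per_store = new_run.groupby(df["store_id"]).sum()
print(runs_per_store)

# Per-run summary: first and last period_id and the run length.
df["period_group"] = new_run.cumsum()
summary = df.groupby(["store_id", "period_group"]).agg(
    start=("period_id", "min"),
    end=("period_id", "max"),
    length=("period_id", "count"),
)
print(summary)
```

On the example this gives one run for store 41685 and two for store 4168621, which adds up to the 3 sequences mentioned in the question.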
