Split a pandas DataFrame into different groups

Time: 2017-11-13 20:59:29

Tags: python pandas dataframe

Given the following dataset DF:

uuid,eventTime,Op.progress,Op.progressPercentage, AnotherAttribute
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:39,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:49,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:18,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:49,P,5.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:54:27,P,5.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:07,P,6.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:27,P,6.0,01:53:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:33:46,W,40.0,01:13:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:10,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:16,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:18,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:55,P,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:15,P,1.0,01:59:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:31,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:51,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:22,P,4.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:51,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:22,S,98.0,00:04:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:27,S,98.0,00:03:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:30:27,S,99.0,00:02:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:31:27,S,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00

I want to split it into two parts:

DF1:

uuid,eventTime,Op.progress,Op.progressPercentage, AnotherAttribute
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:39,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:52:49,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:18,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:53:49,P,5.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:54:27,P,5.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:07,P,6.0,01:54:00
C0972765-8436-0000-0000-000000000000,2017-08-19T12:55:27,P,6.0,01:53:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:33:46,W,40.0,01:13:00

DF2:

uuid,eventTime,Op.progress,Op.progressPercentage, AnotherAttribute
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:10,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:16,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:18,N,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:40:55,P,1.0,02:00:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:15,P,1.0,01:59:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:31,P,3.0,01:57:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:41:51,P,3.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:22,P,4.0,01:56:00
C0972765-8436-0000-0000-000000000000,2017-08-19T13:42:51,P,4.0,01:55:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:22,S,98.0,00:04:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:29:27,S,98.0,00:03:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:30:27,S,99.0,00:02:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:31:27,S,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00
C0972765-8436-0000-0000-000000000000,2017-08-19T15:33:01,F,100.0,00:01:00

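The split should be based on the Op.progressPercentage attribute, which can take values between 1 and 100.

When I try to apply the solution from "splitting a pandas Dataframe", as shown below, I do not get the expected result.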

import pandas as pd

df_dataset = pd.read_csv(filepath)  # your input data saved here
wash_list = []
shifted = df_dataset['Op.progressPercentage'].shift()
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()
aa = df_dataset.groupby([df_dataset.uuid, a])
for k, gp in aa:
    wash_list.append(gp.sort_values(['uuid', 'eventTime'], ascending=[True, True]))

for wash in wash_list:
    print("")
    print(wash.to_string())
    print("")

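Any help would be greatly appreciated. Thank you very much in advance. Best regards, Carlo

2 answers:

Answer 0: (score: 3)

IIUC (and leaving edge cases aside), you can use diff + cumsum to label the different groups, then groupby on those labels: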

for _, g in df.groupby((~df['Op.progressPercentage']
                          .diff().fillna(0).ge(0)).cumsum()):
    print(g, '\n')
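
To keep the pieces instead of printing them, the groups can be collected into a list. The following is only a minimal sketch: it assumes a frame holding just the percentage column, with values copied from the question, and the names `key`, `parts`, `df1` and `df2` are illustrative. It uses `diff().lt(0)`, which flags the same rows as the negated `ge(0)` expression above.

```python
import pandas as pd

# Percentage values copied from the question's sample data.
pct = [3, 3, 4, 5, 5, 6, 6, 40,
       1, 1, 1, 1, 1, 3, 3, 4, 4, 98, 98, 99, 100, 100, 100]
df = pd.DataFrame({"Op.progressPercentage": [float(p) for p in pct]})

# True where the value drops relative to the previous row; the running
# sum of those flags labels each non-decreasing run 0, 1, 2, ...
key = df["Op.progressPercentage"].diff().lt(0).cumsum()

# One DataFrame per run, in order of appearance.
parts = [g for _, g in df.groupby(key)]
df1, df2 = parts  # this sample contains exactly two runs

print(len(df1), len(df2))  # 8 15
```

Unpacking into `df1, df2` only works when there are exactly two runs; for an unknown number of runs, keeping the list is safer.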

Details

The group labels look like this:

(~df['Op.progressPercentage'].diff().fillna(0).ge(0)).cumsum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
20    1
21    1
22    1
Name: Op.progressPercentage, dtype: int64
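
The attempt in the question also grouped by uuid. If the data holds several uuids, one possible extension is to compute the difference within each uuid, so that a change of device is not mistaken for a drop. This is a sketch on hypothetical toy data (the uuids "A" and "B" and their values are made up for illustration), not something the answer itself covers.

```python
import pandas as pd

# Hypothetical toy data with two devices.
df = pd.DataFrame({
    "uuid": ["A"] * 5 + ["B"] * 4,
    "Op.progressPercentage": [3.0, 6.0, 40.0, 1.0, 100.0,
                              50.0, 60.0, 2.0, 5.0],
})

# diff() is taken per uuid, so the first row of each uuid is NaN
# and never counts as a drop.
drops = df.groupby("uuid")["Op.progressPercentage"].diff().lt(0)
run_id = drops.cumsum().rename("run")

# Group by uuid and run label; sort=False keeps order of appearance.
runs = [g for _, g in df.groupby(["uuid", run_id], sort=False)]

for r in runs:
    print(r, "\n")
```

Here `runs` holds four frames: two runs for "A" (rows 0 to 2 and 3 to 4) and two for "B" (rows 5 to 6 and 7 to 8).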

Answer 1: (score: 2)

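  • Use np.diff to compute the difference between each value and the one before it.
  • d < 0 shows where the value drops.
  • np.flatnonzero finds the positions of the non-zero values; in our case, the True ones.
  • Since np.diff removes one element from the source array, I add 1 to get the correct positions.
  • np.split divides df into the pieces separated by the negative diffs.
  • I print the pieces with a blank line between them.

import numpy as np

d = np.diff(df['Op.progressPercentage'].values)
results = np.split(df, np.flatnonzero(d < 0) + 1)

print(*results, sep='\n' * 2)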

                                   uuid            eventTime Op.progress  Op.progressPercentage  AnotherAttribute
0  C0972765-8436-0000-0000-000000000000  2017-08-19T12:52:39           P                    3.0          01:57:00
1  C0972765-8436-0000-0000-000000000000  2017-08-19T12:52:49           P                    3.0          01:56:00
2  C0972765-8436-0000-0000-000000000000  2017-08-19T12:53:18           P                    4.0          01:55:00
3  C0972765-8436-0000-0000-000000000000  2017-08-19T12:53:49           P                    5.0          01:55:00
4  C0972765-8436-0000-0000-000000000000  2017-08-19T12:54:27           P                    5.0          01:54:00
5  C0972765-8436-0000-0000-000000000000  2017-08-19T12:55:07           P                    6.0          01:54:00
6  C0972765-8436-0000-0000-000000000000  2017-08-19T12:55:27           P                    6.0          01:53:00
7  C0972765-8436-0000-0000-000000000000  2017-08-19T13:33:46           W                   40.0          01:13:00

                                    uuid            eventTime Op.progress  Op.progressPercentage  AnotherAttribute
8   C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:10           N                    1.0          02:00:00
9   C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:16           N                    1.0          02:00:00
10  C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:18           N                    1.0          02:00:00
11  C0972765-8436-0000-0000-000000000000  2017-08-19T13:40:55           P                    1.0          02:00:00
12  C0972765-8436-0000-0000-000000000000  2017-08-19T13:41:15           P                    1.0          01:59:00
13  C0972765-8436-0000-0000-000000000000  2017-08-19T13:41:31           P                    3.0          01:57:00
14  C0972765-8436-0000-0000-000000000000  2017-08-19T13:41:51           P                    3.0          01:56:00
15  C0972765-8436-0000-0000-000000000000  2017-08-19T13:42:22           P                    4.0          01:56:00
16  C0972765-8436-0000-0000-000000000000  2017-08-19T13:42:51           P                    4.0          01:55:00
17  C0972765-8436-0000-0000-000000000000  2017-08-19T15:29:22           S                   98.0          00:04:00
18  C0972765-8436-0000-0000-000000000000  2017-08-19T15:29:27           S                   98.0          00:03:00
19  C0972765-8436-0000-0000-000000000000  2017-08-19T15:30:27           S                   99.0          00:02:00
20  C0972765-8436-0000-0000-000000000000  2017-08-19T15:31:27           S                  100.0          00:01:00
21  C0972765-8436-0000-0000-000000000000  2017-08-19T15:33:01           F                  100.0          00:01:00
22  C0972765-8436-0000-0000-000000000000  2017-08-19T15:33:01           F                  100.0          00:01:00
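
To make the intermediate values of the steps above easier to follow, here is a small sketch on a hypothetical seven-row column (the values are made up). It finds the cut positions the same way, but does the final split with plain positional slicing (`iloc`) rather than passing the DataFrame to `np.split`; that substitution is mine, not the answer's. The names `cuts`, `bounds` and `pieces` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical values: two drops, so three pieces are expected.
df = pd.DataFrame(
    {"Op.progressPercentage": [3.0, 6.0, 40.0, 1.0, 4.0, 100.0, 2.0]}
)

# d[i] = value[i + 1] - value[i], one element shorter than the column.
d = np.diff(df["Op.progressPercentage"].to_numpy())

# Positions in d where the value drops, shifted by 1 so that each
# position points at the first row of a new piece.
cuts = np.flatnonzero(d < 0) + 1

# Pair each start with the next start (or the end of the frame).
bounds = [0, *cuts.tolist(), len(df)]
pieces = [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

print(*pieces, sep="\n" * 2)
```

With these values `d` is `[3, 34, -39, 3, 96, -98]`, so `cuts` is `[3, 6]` and the pieces cover rows 0 to 2, 3 to 5, and 6.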