Pandas - groupby a column with a condition from another column

Asked: 2018-02-19 23:41:45

Tags: python pandas dataframe group-by pandas-groupby

I'm struggling to figure out how to group multiple column values with a condition:

Here's what my data looks like as a pandas DataFrame:

{"bun_tag": "punctuation"},{"bun_tag": "quotation marks"},{"bun_tag": "document heading"},{"bun_tag": "document structure"}{"bun_tag": "multiple inversion"},{"bun_tag": "overloaded compound"},{"bun_tag": "syntactic ambiguity"},{"bun_tag": "excessive syntactic distance"}{"bun_tag": "omission"},{"bun_tag": "referential distortion"}{"bun_tag": "tense"},{"bun_tag": "modality"},{"bun_tag": "nominalisation"},{"bun_tag": "directive infinitive"}{"bun_tag": "good"}

Array
(
    [bun_tag] => good
)

{"bun_tag": "garden path"},{"bun_tag": "overloaded compound"}{"bun_tag": "missing determiner"},{"bun_tag": "referential ambiguity"}{"bun_tag": "garden path"}

Array
(
    [bun_tag] => garden path
)

{"bun_tag": "overloaded compound"}

Array
(
    [bun_tag] => overloaded compound
)

{"bun_tag": "capitalisation"},{"bun_tag": "title of document section"}{"bun_tag": "syntactic ambiguity"},{"bun_tag": "excessive syntactic distance"}{"bun_tag": "selectional restriction"}

Array
(
    [bun_tag] => selectional restriction
)

{"bun_tag": "garden path"},{"bun_tag": "overloaded compound"},{"bun_tag": "syntactic ambiguity"}{"bun_tag": "relational ambiguity"},{"bun_tag": "containment relationship"}{"bun_tag": "punctuation"},{"bun_tag": "weak interruption"}{"bun_tag": "meaning unclear"},{"bun_tag": "domain terminology"}{"bun_tag": "homonymy"},{"bun_tag": "nominalisation"},{"bun_tag": "meaning unclear"},{"bun_tag": "domain terminology"},{"bun_tag": "referential ambiguity"}{"bun_tag": "document structure"}

Array
(
    [bun_tag] => document structure
)

{"bun_tag": "polysemy"},{"bun_tag": "agent / receiver"},{"bun_tag": "relational distortion"}

My goal is to work out the difference in days, hours or minutes between the timestamps, grouped by id.

My output should look more like this (the diff is in hours):

id      trigger     timestamp           trigger     timestamp               diff
1       started     2017-10-01 14:00:1  ended       2017-10-04 12:00:1      70
1       started     2017-10-03 11:00:1  ended       2017-10-05 16:00:1      53
2       started     2017-10-02 10:00:1  ended       2017-10-04 12:00:1      26
2       started     2017-10-05 15:00:1  ended       2017-10-05 17:00:1      2

I've tried a number of options, but I can't come up with an efficient solution.

Here's my code:

First, I tried to split the data into 'started' and 'ended' columns:

df['started'] = df.groupby(['id', 'timestamp'])['trigger'] == 'started'

df['ended'] = df.groupby(['id', 'timestamp'])['trigger'] == 'ended'

But it didn't work. Or

df.groupby(['id', 'started', 'ended'], as_index=True).sum()

didn't give an intuitive result either.

Can someone point me in the right direction on how to do this with pandas? I will also have empty matches in the data; how can I add NaN or missing values to the new dataframe, with something like df['started'] = df.groupby(['trigger'])['timestamp'].np.where(df['trigger']=='started')?
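
(Side note: df.groupby(['id', 'timestamp'])['trigger'] is a SeriesGroupBy object, so comparing it to 'started' never produces a row-wise mask, which is why the attempts above don't behave as hoped. A per-row version of the np.where idea would look more like the sketch below; the column name started_ts is made up for illustration.)

    # Hypothetical illustration: keep the timestamp only on 'started' rows, NaT elsewhere.
    df['started_ts'] = df['timestamp'].where(df['trigger'] == 'started')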

1 Answer:

Answer 0 (score: 9):

  1. Set id and trigger as the index.
  2. Since the index contains duplicate entries, append another index column built with a groupwise cumcount. In all, df must have a MultiIndex with three levels.
  3. Select timestamp and unstack the trigger level into columns.
  4. Take the difference between the ended and started columns in hours and assign the result.

    df['timestamp'] = pd.to_datetime(df['timestamp'])  # if necessary

    i = df.groupby(['id', 'trigger']).cumcount()
    v = df.set_index(['id', i, 'trigger']).timestamp.unstack().assign(
        diff=lambda d: d.ended.sub(d.started).dt.total_seconds() / 3600
    )
    

    Thanks to piRSquared for the improvement.

    v
    
                      timestamp                      diff
    trigger               ended             started      
    id                                                   
    1  0    2017-10-04 12:00:01 2017-10-01 14:00:01  70.0
       1    2017-10-05 16:00:01 2017-10-03 11:00:01  53.0
    2  0    2017-10-04 12:00:01 2017-10-02 10:00:01  50.0
       1    2017-10-05 17:00:01 2017-10-05 15:00:01   2.0
    

    The result is not exactly what you described in your question, but I believe MultiIndex columns give a cleaner representation of your output than two duplicate trigger columns.
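
On the asker's follow-up about empty matches: because unstack aligns the started/ended pairs on the (id, cumcount) index, an id that has a started event with no matching ended one simply ends up with NaT in the ended column and NaN in diff. Below is a self-contained sketch of the approach above run on the question's data plus one invented, unmatched row for id 3 to show that behaviour (the extra row and the zero-padded seconds are assumptions; everything else follows the answer):

    import pandas as pd

    # Question's sample data, plus one unmatched 'started' row for id 3 (invented for illustration).
    df = pd.DataFrame({
        'id': [1, 1, 2, 1, 2, 2, 1, 2, 3],
        'trigger': ['started', 'ended', 'started', 'started',
                    'ended', 'started', 'ended', 'ended', 'started'],
        'timestamp': pd.to_datetime([
            '2017-10-01 14:00:01', '2017-10-04 12:00:01',
            '2017-10-02 10:00:01', '2017-10-03 11:00:01',
            '2017-10-04 12:00:01', '2017-10-05 15:00:01',
            '2017-10-05 16:00:01', '2017-10-05 17:00:01',
            '2017-10-06 09:00:01',
        ]),
    })

    # Number each id/trigger occurrence so repeated started/ended pairs stay distinct.
    i = df.groupby(['id', 'trigger']).cumcount()

    v = df.set_index(['id', i, 'trigger']).timestamp.unstack().assign(
        diff=lambda d: d.ended.sub(d.started).dt.total_seconds() / 3600
    )
    print(v)  # the row for id 3 shows NaT under 'ended' and NaN under 'diff'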