Question

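Below are the first few rows of a larger dataframe I am working with. I have code (thanks to user harvpan) that joins consecutive words for as long as the speaker stays the same, keeping the 'start' value of the first word and the 'stop' value of the last word in each combination. This code: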

df.groupby([(df['speaker'] != df['speaker'].shift()).cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
})

turns this dataframe:

      word    start  stop      speaker
0       but   2.72  2.85        2
1    that's   2.85  3.09        2
2   alright   3.09  3.47        2
3     we'll   8.43  8.69        1
4      have   8.69  8.97        1
5        to   8.97  9.07        1
6      okay   9.19 10.01        2
7      sure  10.02 11.01        2
8     what?  11.02 12.00        1

into this:

       word        start  stop speaker
0  but that's alright  2.72  3.47  2
1       we'll have to  8.43  9.07  1
2           okay sure  9.19 11.01  2
3               what? 11.02 12.00  1

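Which is great. However, I would like to limit the total number of words combined into each new 'word' value. Specifically, I want each new word combination to average about 4 words.

For example:

If the number of words before the speaker changes is <= 4, combine all the words into 1 value.
If the number of words before the speaker changes is > 4 AND # words % 4 == 0, combine the words into groups of 4 (e.g., number of words before the speaker changes = 16 results in 4 groups).
If the number of words before the speaker changes is > 4 AND # words % 4 != 0, combine the words into as many groups of 4 as possible, while allowing the remainder to be greater than 1. (E.g., number of words before the speaker changes = 101. I want 25 groups of 4 and 1 group of 1, not 25 groups of 4 and 1 group of 5.)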
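One possible reading of these rules, sketched in plain Python. The function name chunk_sizes and the flag merge_single are hypothetical, introduced only for illustration. The third rule leaves open what happens to a single leftover word: the 101-word example keeps it as its own group, while the expected output further down joins five words into one row. The flag covers both readings rather than picking one.

```python
def chunk_sizes(n_words, size=4, merge_single=True):
    """Return the group sizes for one uninterrupted run of n_words words.

    merge_single is a hypothetical flag (not part of the question):
    when True, a lone leftover word is folded into the previous group
    instead of forming a group of 1.
    """
    if n_words <= 0:
        return []
    if n_words <= size:
        # Rule 1: short runs stay together as a single group.
        return [n_words]
    full, rest = divmod(n_words, size)
    sizes = [size] * full
    if rest == 0:
        # Rule 2: the run divides evenly into groups of `size`.
        return sizes
    if rest == 1 and merge_single:
        # Assumed reading: avoid a trailing one-word group.
        sizes[-1] += 1
    else:
        # Rule 3: keep the remainder as its own, smaller group.
        sizes.append(rest)
    return sizes


print(chunk_sizes(16))
print(chunk_sizes(5))
print(chunk_sizes(101, merge_single=False))
```

With merge_single=True a five-word run stays in one group, which matches the five-word row in the expected output; with merge_single=False the 101-word case ends in a group of 1, as the example in the third rule states.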

So, if I have this:

      word    start  stop      speaker
0       but   2.72  2.85        2
1    that's   2.85  3.09        2
2   alright   3.09  3.47        2
3     we'll   8.43  8.69        1
4      have   8.69  8.97        1
5        to   8.97  9.07        1
6      okay   9.19 10.01        2
7      sure  10.02 11.01        2
8     what?  11.02 12.00        1
9         i  12.01 13.00        2
10     want  13.01 14.00        2
11       to  14.01 15.00        2
12       go  15.01 16.00        2
13     there 16.01 17.00        2
14    where  17.01 18.00        1
15       is  18.01 19.00        1 
16       it  19.01 20.00        1         
17      you  20.01 21.00        1
18    would  21.01 22.00        1
19     like  22.01 23.00        1
20       to  23.01 24.00        1
21       go  24.01 25.00        1

I would get this:

       word        start  stop speaker
0  but that's alright  2.72  3.47  2
1       we'll have to  8.43  9.07  1
2           okay sure  9.19 11.01  2
3               what? 11.02 12.00  1
4  i want to go there 12.01 17.00  2
5     where is it you 17.01 21.00  1
6    would like to go 21.01 25.00  1

Thanks!

Answer 1

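Given the code you ended up with, I think this will work. Just split each 'speaker' run into several partitions and group by those.

Note that my example uses 2 words per group rather than 4, because that is easier to show with the sample data.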

import pandas as pd

# Load the sample table copied from the question.
z = pd.read_clipboard()

# Position of each word inside its speaker run, cut into chunks of 2 words.
y = z.groupby((z['speaker'] != z['speaker'].shift()).cumsum()).cumcount() // 2

# Tag each row with its speaker plus the chunk number within the run.
z['speaker2'] = z['speaker'].apply(str) + y.apply(str)

z.groupby([(z['speaker2'] != z['speaker2'].shift()).cumsum(), z['speaker2']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
})

Result:

         word  start   stop
0  but that's   2.72   3.09
1     alright   3.09   3.47
2  we'll have   8.43   8.97
3          to   8.97   9.07
4   okay sure   9.19  11.01
5       what?  11.02  12.00
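The same partitioning idea can be stretched to groups of 4. What follows is only a sketch, not the code from the answer: the function name combine_words, its size parameter, and the helper columns turn and chunk are all hypothetical. It also assumes that a lone trailing word should join the previous group, which is what the five-word row in the question's expected output suggests. The sample frame is rebuilt from the question so the block runs without the clipboard.

```python
import pandas as pd


def combine_words(df, size=4):
    """Join consecutive words per speaker run into chunks of about `size` words."""
    out = df.copy()
    # Number each uninterrupted run of the same speaker.
    out["turn"] = (out["speaker"] != out["speaker"].shift()).cumsum()
    pos = out.groupby("turn").cumcount()                # position inside the run
    n = out.groupby("turn")["word"].transform("size")   # words in the run
    chunk = pos // size
    # Assumption: a single leftover word joins the previous chunk.
    lone_tail = (n > size) & (n % size == 1) & (pos == n - 1)
    out["chunk"] = chunk.where(~lone_tail, chunk - 1)
    return (
        out.groupby(["turn", "chunk"], as_index=False)
        .agg(
            word=("word", " ".join),
            start=("start", "min"),
            stop=("stop", "max"),
            speaker=("speaker", "first"),
        )
        .drop(columns=["turn", "chunk"])
    )


# Sample data from the question.
words = ("but that's alright we'll have to okay sure what? i want to go "
         "there where is it you would like to go").split()
df = pd.DataFrame({
    "word": words,
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02]
             + [12.01 + i for i in range(13)],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00]
            + [13.00 + i for i in range(13)],
    "speaker": [2] * 3 + [1] * 3 + [2] * 2 + [1] + [2] * 5 + [1] * 8,
})

result = combine_words(df)
print(result)
```

Grouping on named helper columns, rather than on the concatenated speaker2 string, keeps the speaker number and the chunk number from running together (speaker 1 with chunk 10 and speaker 11 with chunk 0 would otherwise produce the same label). If a trailing one-word group is preferred, dropping the lone_tail adjustment leaves plain chunks of `size`.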

How to limit the number of rows in a groupby by a condition in pandas