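Below are the first few rows of a larger data frame I'm working with. I have code (thanks to user harvpan) that combines all consecutive words for which the speaker stays the same, keeping the "start" value of the first word and the "stop" value of the last word in each combination. This code: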
df.groupby([(df['speaker'] != df['speaker'].shift()).cumsum(), df['speaker']], as_index=False).agg({
'word': ' '.join,
'start': 'min',
'stop': 'max'
})
turns this data frame:
word start stop speaker
0 but 2.72 2.85 2
1 that's 2.85 3.09 2
2 alright 3.09 3.47 2
3 we'll 8.43 8.69 1
4 have 8.69 8.97 1
5 to 8.97 9.07 1
6 okay 9.19 10.01 2
7 sure 10.02 11.01 2
8 what? 11.02 12.00 1
into this:
word start stop speaker
0 but that's alright 2.72 3.47 2
1 we'll have to 8.43 9.07 1
2 okay sure 9.19 11.01 2
3 what? 11.02 12.00 1
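For readers unfamiliar with the grouping key used above, here is a minimal sketch of how the `shift`/`cumsum` idiom labels runs of consecutive rows by the same speaker. It uses only the `speaker` column from the table above; the names `changed` and `run_id` are introduced here for illustration.

```python
import pandas as pd

# The speaker column from the example data frame
speaker = pd.Series([2, 2, 2, 1, 1, 1, 2, 2, 1], name="speaker")

# True wherever the speaker differs from the previous row.
# The first row is always True because shift() puts NaN there.
changed = speaker != speaker.shift()

# A running sum of the True flags gives every run its own integer label
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 4]
```

Grouping by `run_id` rather than by `speaker` alone is what keeps the two separate runs of speaker 2 from being merged into one row.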
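Which is great. However, I'd like to limit the total number of words combined into each new "word" value. Specifically, I'd like each new combination to contain about 4 words on average.
For example:
If the number of words before a speaker change is <= 4, combine all of the words into 1 value.
If the number of words before a speaker change is > 4 AND # of words % 4 == 0, combine the words into groups of 4 (e.g. number of words before the speaker change = 16, resulting in 4 groups).
If the number of words before a speaker change is > 4 AND # of words % 4 != 0, combine the words into as many groups of 4 as possible, while allowing for a remainder greater than 1. (e.g. number of words before the speaker change = 101. I'd want 25 groups of 4 and 1 group of 1, rather than 25 groups of 4 and 1 group of 5).
So, if I have this: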
word start stop speaker
0 but 2.72 2.85 2
1 that's 2.85 3.09 2
2 alright 3.09 3.47 2
3 we'll 8.43 8.69 1
4 have 8.69 8.97 1
5 to 8.97 9.07 1
6 okay 9.19 10.01 2
7 sure 10.02 11.01 2
8 what? 11.02 12.00 1
9 i 12.01 13.00 2
10 want 13.01 14.00 2
11 to 14.01 15.00 2
12 go 15.01 16.00 2
13 there 16.01 17.00 2
14 where 17.01 18.00 1
15 is 18.01 19.00 1
16 it 19.01 20.00 1
17 you 20.01 21.00 1
18 would 21.01 22.00 1
19 like 22.01 23.00 1
20 to 23.01 24.00 1
21 go 24.01 25.00 1
I would get this:
word start stop speaker
0 but that's alright 2.72 3.47 2
1 we'll have to 8.43 9.07 1
2 okay sure 9.19 11.01 2
3 what? 11.02 12.00 1
4 i want to go there 12.01 17.00 2
5 where is it you 17.01 21.00 1
6 would like to go 21.01 25.00 1
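Thanks!
Answer 0 (score: 0)
Building on the code you already have, I think this would work: split each speaker's run into smaller partitions and group by those instead.
Note that my example uses 2 words per group rather than 4, because that is easier to show with the sample data.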
import pandas as pd
z = pd.read_clipboard()
# 0-based position of each word within its run of consecutive same-speaker rows
pos = z.groupby((z['speaker'] != z['speaker'].shift()).cumsum()).cumcount()
# Partition label: the speaker followed by the chunk number (2 words per chunk)
z['speaker2'] = z['speaker'].astype(str) + (pos // 2).astype(str)
z.groupby([(z['speaker2'] != z['speaker2'].shift()).cumsum(), z['speaker2']], as_index=False).agg({
'word': ' '.join,
'start': 'min',
'stop': 'max'
})
Output:
word start stop
0 but that's 2.72 3.09
1 alright 3.09 3.47
2 we'll have 8.43 8.97
3 to 8.97 9.07
4 okay sure 9.19 11.01
5 what? 11.02 12.00
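The same idea can be wrapped in a function with a configurable group size. This is only a sketch under some assumptions: `combine_words`, `size`, `run` and `chunk` are names made up here, the sample frame is rebuilt by hand from the first table in the question instead of being read from the clipboard, and helper columns with named aggregation are a swapped-in alternative to the concatenated `speaker2` string key.

```python
import pandas as pd

def combine_words(df, size=4):
    """Join words per speaker run, with at most `size` words per output row."""
    # Label each run of consecutive rows spoken by the same speaker
    run = (df["speaker"] != df["speaker"].shift()).cumsum()
    # 0-based position of each word inside its run, cut into chunks of `size`
    chunk = df.groupby(run).cumcount() // size
    return (
        df.assign(run=run, chunk=chunk)
        .groupby(["run", "chunk"])
        .agg(
            word=("word", " ".join),
            start=("start", "min"),
            stop=("stop", "max"),
            speaker=("speaker", "first"),
        )
        .reset_index(drop=True)
    )

# First nine rows of the question's data
df = pd.DataFrame({
    "word": ["but", "that's", "alright", "we'll", "have", "to", "okay", "sure", "what?"],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00],
    "speaker": [2, 2, 2, 1, 1, 1, 2, 2, 1],
})

result = combine_words(df, size=2)
print(result)
```

With `size=4`, a run of 5 words comes out as one group of 4 followed by one group of 1. That matches the 101-word example in the question, but not the expected table, where the 5-word run stays together, so merging a short trailing group into the previous one would need an extra step that this sketch does not attempt.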