Pandas DataFrame, regrouping

Time: 2019-11-11 21:15:07

Tags: python-3.x pandas

I have the following sample dataset:

import pandas as pd

data = {'Sentences': ['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4',
                      'Sentences5', 'Sentences6', 'Sentences7', 'Sentences8'],
        'Start_Time': [10, 15, 77, 120, 150, 160, 176, 188],
        'End_Time': [12, 17, 88, 128, 158, 168, 182, 190],
        'cps': [3, 4, 5, 6, 2, 4, 5, 6]}
df = pd.DataFrame(data)
print(df)

Basically: sentences, their start and end times, and their characters per second (cps).

Now, I also have a list:

time_list = [9,80,161,200]

Based on this list, I want to regroup the sentences. The list gives the start time and end time of each group, i.e.

  • 9 to 90: Sentences 1-3 (Sentence 3 because most of its time falls within this group)
  • 90 to 161: Sentences 4-5 (Sentence 6 does not belong to this group, because most of its time falls outside it)
  • 161 to 200: Sentence 6 (most of it falls within this group) and Sentences 7-8

This is what I have done so far:

As you can see, the result is not what it should be. I also feel that my approach is a bit messy at the moment.

2 Answers:

Answer 0: (score: 2)

Use:

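# Group boundaries; these values reproduce the first output shown below
time_list = [9, 90, 161, 200]

# The midpoint of each sentence decides which interval it is assigned to
mean_time = df[['Start_Time', 'End_Time']].mean(axis=1).rename('Interval Time')
labels = ["{0}-{1}".format(time_list[i], time_list[i+1]) for i in range(len(time_list)-1)]

new_df = (df.groupby(pd.cut(mean_time, bins=time_list, labels=labels, include_lowest=True))
            .Sentences
            .agg(','.join)
            .reset_index())
print(new_df)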

  Interval Time                         Sentences
0          9-90     Sentence1,Sentence2,Sentence3
1        90-161             Sentences4,Sentences5
2       161-200  Sentences6,Sentences7,Sentences8

With time_list = [9,80,161,200]:

  Interval Time                         Sentences
0          9-80               Sentence1,Sentence2
1        80-161   Sentence3,Sentences4,Sentences5
2       161-200  Sentences6,Sentences7,Sentences8
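The midpoint serves here as a stand-in for "most of the sentence's time". As a rough check, and not part of the original answer, the sketch below computes the actual overlap of every sentence with every interval using NumPy broadcasting and compares the largest-overlap interval with the one chosen by the midpoint. The names `overlap`, `by_overlap` and `by_midpoint` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sentences': ['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4',
                  'Sentences5', 'Sentences6', 'Sentences7', 'Sentences8'],
    'Start_Time': [10, 15, 77, 120, 150, 160, 176, 188],
    'End_Time': [12, 17, 88, 128, 158, 168, 182, 190],
})
time_list = [9, 80, 161, 200]

lo = np.array(time_list[:-1])  # left edge of each interval
hi = np.array(time_list[1:])   # right edge of each interval
start = df['Start_Time'].to_numpy()[:, None]  # column vectors so that
end = df['End_Time'].to_numpy()[:, None]      # broadcasting yields an 8x3 grid

# overlap[i, j] = time sentence i shares with interval j (0 if disjoint)
overlap = np.clip(np.minimum(end, hi) - np.maximum(start, lo), 0, None)
by_overlap = overlap.argmax(axis=1)

# Interval index chosen by the midpoint, as in the answer above
mid = df[['Start_Time', 'End_Time']].mean(axis=1)
by_midpoint = pd.cut(mid, bins=time_list, labels=False, include_lowest=True).to_numpy()

print(by_overlap.tolist())
print(by_midpoint.tolist())
```

On this sample the two assignments agree. They can differ when a sentence spans three or more intervals, because the midpoint may then land in a narrow middle interval that holds only a small share of the sentence.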

If you prefer to collect the sentences into lists:

new_df = (df.groupby(pd.cut(mean_time, time_list, right=False, labels=labels, include_lowest=True))
            .Sentences
            .agg(list)
            .reset_index())
print(new_df)

Output:

  Interval Time                             Sentences
0          9-80                [Sentence1, Sentence2]
1        80-161   [Sentence3, Sentences4, Sentences5]
2       161-200  [Sentences6, Sentences7, Sentences8]
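The list variant also passes right=False, which changes which edge of each interval is closed. A small sketch of the difference, using a made-up value that sits exactly on the boundary 80 (the value is hypothetical and chosen only to show the edge behaviour; none of the sample midpoints falls on a boundary):

```python
import pandas as pd

time_list = [9, 80, 161, 200]
values = pd.Series([80])  # hypothetical midpoint exactly on a boundary

# right=True (the default): each interval includes its right edge,
# so 80 is assigned to the first interval (9-80)
closed_right = pd.cut(values, bins=time_list, labels=False, include_lowest=True)

# right=False: each interval includes its left edge instead,
# so 80 is assigned to the second interval (80-161)
closed_left = pd.cut(values, bins=time_list, labels=False, right=False)

print(closed_right.iloc[0], closed_left.iloc[0])
```

For the sample data both settings give the same grouping, so the choice only matters when a midpoint coincides with an entry of time_list.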

Answer 1: (score: 1)

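time_list = [9,90,161,200]
li = {}    # group number -> range of seconds covered by that group
li1 = []   # rows [group number, start, end] for the lookup DataFrame
counter = 0
for i, j in zip(time_list, time_list[1:]):
    li[counter] = range(i, j)
    li1.append([counter, i, j])
    counter += 1
df1 = pd.DataFrame(li1, columns=['Group', 'Start', 'End'])
print(df1)

   Group  Start  End
0      0      9   90
1      1     90  161
2      2    161  200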

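The code above creates a DataFrame from the time list, together with a dictionary that maps each group number to its range of values.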

data = {'Sentences': ['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4',
                      'Sentences5', 'Sentences6', 'Sentences7', 'Sentences8'],
        'Start_Time': [10, 15, 77, 120, 150, 160, 176, 188],
        'End_Time': [12, 17, 88, 128, 158, 168, 182, 190],
        'cps': [3, 4, 5, 6, 2, 4, 5, 6]}
df = pd.DataFrame(data)

def f(row):
    # Seconds covered by this sentence (end exclusive)
    val = range(row['Start_Time'], row['End_Time'])
    # Count how many of those seconds fall in each group's range
    len_list = []
    for k, v in li.items():
        len_list.append(len([i for i in val if i in v]))
    if max(len_list) == 0:
        return None  # the sentence overlaps no group
    return len_list.index(max(len_list))  # first maximum wins when groups tie

df['Group'] = df.apply(f, axis=1)
# .sum() on strings concatenates them without a separator
df.merge(df1, on='Group').groupby(['Start', 'End'], as_index=False)['Sentences'].sum()

   Start  End                       Sentences
0      9   90     Sentence1Sentence2Sentence3
1     90  161            Sentences4Sentences5
2    161  200  Sentences6Sentences7Sentences8
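The function `f` counts shared seconds by enumerating every integer in both ranges. The same count can be obtained arithmetically from the interval edges. The following is a sketch under that assumption, not the answer author's code; the helper name `best_group` and the variable `bounds` are hypothetical, and the sentences are joined with a comma here purely for readability.

```python
import pandas as pd

time_list = [9, 90, 161, 200]
bounds = list(zip(time_list, time_list[1:]))  # [(9, 90), (90, 161), (161, 200)]

def best_group(start, end):
    """Index of the interval sharing the most time with [start, end), or None."""
    # Overlap of [start, end) with [lo, hi) is min(end, hi) - max(start, lo), floored at 0
    overlaps = [max(0, min(end, hi) - max(start, lo)) for lo, hi in bounds]
    best = max(overlaps)
    # list.index returns the first maximum, matching len_list.index(max(len_list))
    return overlaps.index(best) if best > 0 else None

df = pd.DataFrame({
    'Sentences': ['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4',
                  'Sentences5', 'Sentences6', 'Sentences7', 'Sentences8'],
    'Start_Time': [10, 15, 77, 120, 150, 160, 176, 188],
    'End_Time': [12, 17, 88, 128, 158, 168, 182, 190],
})
df['Group'] = [best_group(s, e) for s, e in zip(df['Start_Time'], df['End_Time'])]
result = df.groupby('Group')['Sentences'].agg(','.join)
print(result)
```

This avoids building a list of every second in each sentence, which may matter for long durations, and it also works when the times are not whole numbers.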