I've tried other posts on this topic, but can't seem to find the right solution.
I have a dataframe that describes a conversation, split by speaker:
import pandas as pd
data = [[1, 'hello'], [2, 'hi there'], [1, 'how are you?'],[2, 'i am well'], [2, 'how are you?']]
df = pd.DataFrame(data, columns = ['speaker', 'turn'])
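What I want to do is merge adjacent rows that have the same speaker label. In other words, I'd like to be able to merge the last two rows, since they should both count as the same conversational turn.
data = [[1, 'hello'], [2, 'hi there'], [1, 'how are you?'], [2, 'i am well', 'how are you?']]
I suspect the answer has something to do with the groupby function, but so far my attempts to make it work have not succeeded.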
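Answer 0 (score: 3)
Strings are handled poorly in pandas; the operations may look vectorized, but they really aren't. In any case, at this stage you only want to aggregate lists, and that format doesn't suit a df either, since a df is expected to hold scalar values. Use itertools.groupby: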
import itertools
from operator import itemgetter
data = [[1, 'hello'], [2, 'hi there'], [1, 'how are you?'], [2, 'i am well'],
        [2, 'how are you?']]
rebuilt_list = []
# itertools.groupby only groups consecutive items that share a key
for speaker, comment_group in itertools.groupby(data, itemgetter(0)):
    comments = [speaker]  # To make sure you have the speaker id as first value
    for comment in comment_group:
        comments.extend(comment[1:])
    rebuilt_list.append(comments)
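The loop above can also be sketched as a small reusable helper. This is only an illustration under the same assumptions as the answer (each row is a `[speaker, turn]` pair); the name `merge_adjacent_turns` is hypothetical and not part of any library.

```python
from itertools import groupby
from operator import itemgetter


def merge_adjacent_turns(rows):
    """Merge consecutive [speaker, turn] rows that share a speaker.

    Hypothetical helper: returns one list per run, with the speaker id
    first and that run's turns after it.
    """
    merged = []
    # groupby starts a new group each time the speaker id changes.
    for speaker, group in groupby(rows, key=itemgetter(0)):
        merged.append([speaker] + [turn for _, turn in group])
    return merged


data = [[1, 'hello'], [2, 'hi there'], [1, 'how are you?'],
        [2, 'i am well'], [2, 'how are you?']]
result = merge_adjacent_turns(data)
print(result)
# [[1, 'hello'], [2, 'hi there'], [1, 'how are you?'], [2, 'i am well', 'how are you?']]
```

Because `itertools.groupby` never sorts its input, speaker 2's first turn ('hi there') stays separate from the later run, which is the behaviour the question asks for.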
Answer 1 (score: 2)
Another way to do this is with pandas itself.
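A minimal sketch of one pandas-only approach, not necessarily the one this answer had in mind. It assumes the `df` from the question and labels each run of consecutive rows with the same speaker using `shift` and `cumsum`; the names `run_id` and `merged` are illustrative.

```python
import pandas as pd

data = [[1, 'hello'], [2, 'hi there'], [1, 'how are you?'],
        [2, 'i am well'], [2, 'how are you?']]
df = pd.DataFrame(data, columns=['speaker', 'turn'])

# A new run starts wherever the speaker differs from the previous row;
# the cumulative sum of those change points gives each run its own id.
run_id = df['speaker'].ne(df['speaker'].shift()).cumsum().rename('run')

grouped = df.groupby(run_id)
merged = pd.DataFrame({
    'speaker': grouped['speaker'].first(),  # every row in a run has the same speaker
    'turn': grouped['turn'].agg(list),      # collect the run's turns into one list
}).reset_index(drop=True)

print(merged)
```

Grouping on the run id rather than on `speaker` keeps separate runs by the same speaker apart, so only adjacent rows are merged. To get a single string per turn instead of a list, `', '.join` could be passed to `agg` in place of `list`.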
Answer 2 (score: 1)
IIUC,
# Flag every row whose speaker matches the row directly above or below it.
s = df['speaker'].eq(df['speaker'].shift()) | df['speaker'].eq(df['speaker'].shift(-1))
# Keep the unflagged rows as they are, and append the flagged rows joined per speaker.
print(
    pd.concat(
        [
            df.loc[~s],
            df.loc[s]
            .groupby("speaker")["turn"]
            .apply(lambda x: ",".join(x))
            .reset_index(),
        ]
    ).reset_index(drop=True)  # drop=True avoids adding the old index as a column
)
Result
   speaker                    turn
0        1                   hello
1        2                hi there
2        1            how are you?
3        2  i am well,how are you?
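You can edit the apply to match your desired result (for example, if you want a space after the comma).
Because this uses apply, it is not well suited to large datasets.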