I have a list that I extract from an instant-messaging file, basically using the following code:
with open(input('address here pls: '), 'r') as f:
    lines = f.readlines()
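Note that readlines() keeps the trailing newline on every line, so the lines need stripping before they compare equal to markers such as '=Start='. A minimal sketch of one way to do that, where read_lines is a hypothetical helper name and the UTF-8 encoding is an assumption about the file:

```python
from pathlib import Path


def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    # read_lines is a hypothetical helper, not part of the original code.
    # encoding="utf-8" is an assumption; adjust it to match the real file.
    return Path(path).read_text(encoding="utf-8").splitlines()
```

Calling read_lines(input('address here pls: ')) would then give a list of clean strings.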
What I get back is a list of elements such as:
['=Start=','From: Me','To: You','Hey there','Howre u doing?','=End',
 '=Start=','From: You','To: Me','Good!','How bout you?','=End',
]
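I am trying to take everything between each start and end marker, use From and To as the table headers, and collect the lines in between as the message body.
The end goal is to push this into a pandas DataFrame.
Here is the result I would like to get: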
======================================
From|To |Message                     |
======================================
Me  |You|'Hey there Howre you doing?'|
You |Me |'Good! How bout you?'       |
Answer 0 (score: 4)
You can use:
import pandas as pd

L = ['=Start=','From: Me','To: You','Hey there','Howre u doing?','=End',
     '=Start=','From: You','To: Me','Good!','How bout you?','=End',
]
# create a one-column DataFrame from L
df = pd.DataFrame({'Message': L})
# label each message block: the cumulative sum increases at every '=Start='
b = (df.Message == '=Start=').cumsum()
# extract the text after 'From: ' and 'To: ', then forward-fill it down each block
df['From'] = df.Message.str.extract('From: (.*)', expand=False).ffill()
df['To'] = df.Message.str.extract('To: (.*)', expand=False).ffill()
# remove the marker and header rows, which are no longer needed
out = ['=Start=','=End','From:','To:']
df = df[~df.Message.str.contains('|'.join(out))]
# group by the block labels in b and aggregate
df = df.groupby(b).agg({'Message': ' '.join, 'To': 'last', 'From': 'last'})
df = df.reset_index(drop=True)
print(df)
                    Message   To From
0  Hey there Howre u doing?  You   Me
1       Good! How bout you?   Me  You
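One caveat with the approach above: str.contains drops any row whose text contains 'From:' or 'To:' anywhere, so a message line such as 'Note To: self' would be lost. As an alternative, here is a minimal sketch of a plain-Python parser that builds one record per block and puts the columns in the From, To, Message order the question asks for. The name parse_messages is hypothetical, and the sketch assumes every block opens with '=Start=' and closes with a line beginning with '=End'.

```python
import pandas as pd


def parse_messages(lines):
    """Turn '=Start=' ... '=End' blocks into a list of dict records."""
    # parse_messages is a hypothetical helper, not part of the answer above.
    records = []
    current = None  # the record being built, or None when outside a block
    for line in lines:
        if line == '=Start=':
            current = {'From': None, 'To': None, 'Message': []}
        elif current is None:
            continue  # ignore anything outside a block
        elif line.startswith('=End'):
            # close the block: join the collected body lines with spaces
            current['Message'] = ' '.join(current['Message'])
            records.append(current)
            current = None
        elif line.startswith('From: ') and current['From'] is None:
            current['From'] = line[len('From: '):]
        elif line.startswith('To: ') and current['To'] is None:
            current['To'] = line[len('To: '):]
        else:
            current['Message'].append(line)
    return records


L = ['=Start=', 'From: Me', 'To: You', 'Hey there', 'Howre u doing?', '=End',
     '=Start=', 'From: You', 'To: Me', 'Good!', 'How bout you?', '=End']

# columns= fixes the column order to match the table in the question
df = pd.DataFrame(parse_messages(L), columns=['From', 'To', 'Message'])
print(df)
```

Only the first 'From: ' and 'To: ' line in each block is treated as a header, so later lines that happen to start with those prefixes stay in the message body.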