I have the following data:
Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)
I am trying to break it up into a question-and-answer format like this:
Rep: hi !
Customer: i was wondering if you have a delivery option? If so what are the options available ?
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options.
Customer: ok! thank you
Rep: Is there anything else that I can help you with?
(Chat ended)
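This is one of a set of conversations, each with a unique ID. After the split, I would like each question and each answer in separate columns, with every question matched to its answer.
I tried the following: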
for i in d.split(':'):
    if i:
        print(i.strip().split('.'))
The output is:
['Rep']
['hi ! Customer']
['i was wondering if you have a delivery option? If so what are the options available ? Rep']
["i'd be happy to answer that for you! There is a 2 and 5 day delivery options", ' Customer']
['ok! thank you Rep']
['Is there anything else that I can help you with? (Chat ended)']
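Answer 0 (score: 0)
So you basically want to work out where to insert line breaks. If the speakers are always "Customer" and "Rep", there are a couple of different patterns you could try: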
(?<!^)(Customer:|Rep:|\(Chat ended)
Here we simply check that we are not at the start of the string, then match the constant tokens by OR-ing them together. Or, more generally:
(?<=\s)([A-Z]\w+:|\(Chat ended)
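Here we look behind for a whitespace character (which also rules out the start of the string), then match either a capitalized word followed by a colon or the closing sequence, and insert a newline before each match.
In both cases, replace with: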
\n$0
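As a rough sketch, the two patterns above could be applied in Python with re.sub. Python's re module writes the whole-match reference as \g<0> rather than $0, and the names FIXED, GENERIC and split_turns are illustrative only, not part of the original answer.

```python
import re

chat = ("Rep: hi ! Customer: i was wondering if you have a delivery option? "
        "If so what are the options available ? Rep: i'd be happy to answer "
        "that for you! There is a 2 and 5 day delivery options. Customer: ok! "
        "thank you Rep: Is there anything else that I can help you with? "
        "(Chat ended)")

# Pattern 1: known speaker names, skipped when they sit at the very start.
FIXED = re.compile(r"(?<!^)(Customer:|Rep:|\(Chat ended)")
# Pattern 2: any capitalized word followed by a colon, preceded by whitespace.
GENERIC = re.compile(r"(?<=\s)([A-Z]\w+:|\(Chat ended)")


def split_turns(text, pattern):
    """Insert a newline before each match, then return the stripped lines."""
    broken = pattern.sub(r"\n\g<0>", text)
    return [line.strip() for line in broken.split("\n")]


for turn in split_turns(chat, FIXED):
    print(turn)
```

The generic pattern is more flexible, but it will also break a line at any capitalized word followed by a colon inside a message, so the fixed pattern is the safer choice when the speaker labels are known.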
Answer 1 (score: 0)
Splitting on ':' is dangerous, because the conversation itself may contain ':'.
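To make the risk concrete, here is a small illustration with a made-up chat line (not from the original data) in which a message contains times written with colons:

```python
# Hypothetical chat line: the message text itself contains colons.
sample = "Rep: we are open 9:00 to 5:00 Customer: ok"

# Naive split on every colon, as in the question's attempt.
naive = [part.strip() for part in sample.split(":") if part]
print(naive)
```

The times are cut apart along with the speaker labels, so the pieces no longer line up with the turns of the conversation.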
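You should know the names of the rep and the customer beforehand, so that you can search for those names followed by ':' in a regex pattern. With such a pattern, re.findall can parse the sample chat into: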
[('Rep', 'hi !'), ('Customer', 'i was wondering if you have a delivery option? If so what are the options available ?'), ('Rep', "i'd be happy to answer that for you! There is a 2 and 5 day delivery options."), ('Customer', 'ok! thank you')]
Then use a loop to map the items into whatever dict data structure you prefer:
import re
from pprint import pprint

def parse_chat(chat, rep, customer):
    # Maps each rep message to the customer message that follows it.
    conversation = {}
    rep_message = ''
    # Escape the names so regex metacharacters in them are matched literally.
    names = (re.escape(rep), re.escape(customer))
    # Capture (speaker, message); a message ends where the next "<speaker>:" begins.
    pattern = r'({0}|{1}): (.*?)\s*(?=(?:{0}|{1}):)'.format(*names)
    for person, message in re.findall(pattern, chat):
        if person == rep:
            rep_message = message
        else:
            conversation[rep_message] = message
    return conversation

chat = '''Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)'''
pprint(parse_chat(chat, 'Rep', 'Customer'))
This outputs:
{'hi !': 'i was wondering if you have a delivery option? If so what are the options available ?',
"i'd be happy to answer that for you! There is a 2 and 5 day delivery options.": 'ok! thank you'}
Answer 2 (score: 0)
You can use a simpler regex!
import re

# Match a word (with optional trailing whitespace) followed by a colon
p = re.compile(r'(\w*\s*:)')
input_string = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"
# Insert a newline before every match
new_string = p.sub(r'\n\g<1>', input_string)
# Skip the first element, which is empty because the string starts with a match
for line in new_string.split('\n')[1:]:
    print(line)
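Answer 3 (score: 0)
Assuming that each colon is preceded by a single word with no whitespace in it, the best approach is to use a regex to match the Customer and Rep strings before the colon, and then insert a newline so that you get the proper formatting.
Here is a working example: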
import re
# The data has been stored into a string by this point
data = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"
# First insert the newlines before the first word before a colon
newlines = re.sub(r'(\S+)\s*:', r'\n\g<1>:', data)
# Remove the first newline and fix the (Chat ended) on the end
solution = re.sub(r'\(Chat ended\)', '\n(Chat ended)', newlines[1:])
print(solution)
> "Rep: hi !
Customer: i was wondering if you have a delivery option? If so what are the options available ?
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options.
Customer: ok! thank you
Rep: Is there anything else that I can help you with?
(Chat ended)"
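The newlines = re.sub... line first searches the data string for any whitespace-free word followed by a colon. It then replaces each match with a \n character, followed by whatever sequence the non-whitespace pattern \S+ matched (this could be Customer, Rep, Bill, and so on), and finally puts the : back at the end.
Finally, assuming that every conversation ends with (Chat ended), the next line of code simply matches that text and moves it onto a new line, in the same way as the newlines = re.sub... line.
The output is a string, but if you need it in any other form you can split it on '\n' and work from there.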
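As a sketch of that last step, the newline-separated string could be turned into (speaker, message) rows, which is closer to the column layout asked for in the question. The function name to_rows is hypothetical, and lines without a speaker label, such as (Chat ended), are simply skipped.

```python
import re


def to_rows(solution):
    """Turn newline-separated 'Speaker: text' lines into (speaker, text) tuples."""
    rows = []
    for line in solution.split("\n"):
        line = line.strip()
        # Split on the first colon only, so colons inside a message survive.
        speaker, sep, text = line.partition(":")
        # Keep the line only if the label is a single whitespace-free word.
        if sep and re.fullmatch(r"\S+", speaker):
            rows.append((speaker, text.strip()))
    return rows


solution = (
    "Rep: hi ! \n"
    "Customer: i was wondering if you have a delivery option? "
    "If so what are the options available ? \n"
    "Rep: i'd be happy to answer that for you! "
    "There is a 2 and 5 day delivery options. \n"
    "Customer: ok! thank you \n"
    "Rep: Is there anything else that I can help you with? \n"
    "(Chat ended)"
)

for speaker, text in to_rows(solution):
    print(speaker, "|", text)
```

From here the rows can be written out with the csv module or loaded into a table, with one column for the speaker and one for the message.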