Question

I have the following data:

Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)

I am trying to break it up into a question-and-answer format like this:

Rep: hi !
Customer: i was wondering if you have a delivery option? If so what are the options available ? 
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options.
Customer: ok! thank you
Rep: Is there anything else that I can help you with?
(Chat ended)

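This is one of a set of conversations, each with a unique ID. After splitting, I would like each question and each answer to go into separate columns, with every question matched to its answer.

I tried the following: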

for i in d.split(':'):
    if i:
        print(i.strip().split('.'))

The output is:

['Rep']
['hi ! Customer']
['i was wondering if you have a delivery option? If so what are the options available ? Rep']
["i'd be happy to answer that for you! There is a 2 and 5 day delivery options", ' Customer']
['ok! thank you Rep']
['Is there anything else that I can help you with? (Chat ended)']

Answer 1

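So you basically want to identify where to insert line breaks. If the speakers are always "Customer" and "Rep", there are a couple of patterns you could try:

(?<!^)(Customer:|Rep:|\(Chat ended)

Here we just check that we are not at the start of the string, then match the constant markers by OR-ing them together. Or, more generally:

(?<=\s)([A-Z]\w+:|\(Chat ended)

Here we look behind for a whitespace character (so we are not at the start of the string), then match a capitalized word followed by a colon, or the closing marker. A newline is then inserted before each match.

In both cases, replace with: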

\n$0
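
For reference, here is a minimal sketch of how the more general pattern might be applied in Python. It assumes the standard re module, where the whole match is written \g<0> in the replacement rather than $0. The variable names are illustrative only, and any capitalized word followed by a colon inside a message (for example "Note:") would also be treated as a speaker label.

```python
import re

chat = ("Rep: hi ! Customer: i was wondering if you have a delivery option? "
        "If so what are the options available ? Rep: i'd be happy to answer "
        "that for you! There is a 2 and 5 day delivery options. Customer: ok! "
        "thank you Rep: Is there anything else that I can help you with? "
        "(Chat ended)")

# A capitalized word plus colon, or the closing marker, preceded by whitespace
pattern = re.compile(r"(?<=\s)([A-Z]\w+:|\(Chat ended)")

# Python spells "the whole match" as \g<0> where other engines use $0
formatted = pattern.sub(r"\n\g<0>", chat)

# One entry per turn, with the trailing spaces trimmed
lines = [line.strip() for line in formatted.split("\n")]
for line in lines:
    print(line)
```

The first "Rep:" is left alone because nothing precedes it, so no leading empty line needs to be removed.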

Answer 2

Splitting on ':' is dangerous because the conversation itself may contain ':'.
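
A small illustration of that risk, using a made-up message that contains a time (the chat text here is hypothetical, not taken from the question):

```python
# Hypothetical chat in which one message has a colon of its own ("5:30")
chat = "Rep: we close at 5:30 today Customer: ok! thank you"

pieces = [piece.strip() for piece in chat.split(":")]
print(pieces)
# ['Rep', 'we close at 5', '30 today Customer', 'ok! thank you']
```

The rep's message is cut in two at the time, and the pieces no longer line up with speakers.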

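You should know the names of the rep and the customer up front, so that you can search for those names followed by ':' in a regex pattern. You can then use re.findall to parse the sample chat into: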

[('Rep', 'hi !'), ('Customer', 'i was wondering if you have a delivery option? If so what are the options available ?'), ('Rep', "i'd be happy to answer that for you! There is a 2 and 5 day delivery options."), ('Customer', 'ok! thank you')]

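The final Rep message is not captured, because the lookahead in the pattern requires another speaker label after each message. Then use a loop to map the items into whatever dict data structure you prefer: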

import re
from pprint import pprint
def parse_chat(chat, rep, customer):
    conversation = {}
    rep_message = ''
    for person, message in re.findall(r'({0}|{1}): (.*?)\s*(?=(?:{0}|{1}):)'.format(rep, customer), chat):
        if person == rep:
            rep_message = message
        else:
            conversation[rep_message] = message
    return conversation
chat = '''Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)'''
pprint(parse_chat(chat, 'Rep', 'Customer'))

This outputs:

{'hi !': 'i was wondering if you have a delivery option? If so what are the options available ?',
 "i'd be happy to answer that for you! There is a 2 and 5 day delivery options.": 'ok! thank you'}

Answer 3

You can use a much simpler regex!

import re

p = re.compile(r'(\w*\s*:)')

input_string = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"

new_string = p.sub(r'\n\g<1>',input_string)

for line in new_string.split('\n')[1:]:
    print(line)

Answer 4

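Solution

Assuming that each speaker label is a single word, with no whitespace inside it, in front of the colon, the best approach is to use a regex to match the Customer and Rep strings before the colon and then insert a newline in front of them to get the desired format.

Here is a working example: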

import re

# The data has been stored into a string by this point
data = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"

# First insert the newlines before the first word before a colon
newlines = re.sub(r'(\S+)\s*:', r'\n\g<1>:', data)

# Remove the first newline and fix the (Chat ended) on the end
solution = re.sub(r'\(Chat ended\)', '\n(Chat ended)', newlines[1:])

print(solution)

> "Rep: hi ! 
  Customer: i was wondering if you have a delivery option? If so what are the options available ? 
  Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. 
  Customer: ok! thank you 
  Rep: Is there anything else that I can help you with? 
  (Chat ended)"

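Explanation

The newlines = re.sub... line first searches the data string for any run of non-whitespace characters followed by a colon. It replaces each match with a \n character, followed by whatever the \S+ group matched (which could be Customer, Rep, Bill, and so on), followed by a :.

Finally, assuming that every conversation ends with (Chat ended), the next line matches just that text and moves it onto a new line in the same way as the newlines = re.sub... line.

The output is a single string, but if you need it in another form you can split it on '\n' and work with the pieces.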
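
As a rough sketch of that last step, the formatted string could be turned into one row per turn, tagged with the conversation ID mentioned in the question. The ID 101, the conversations dict, and the row keys are all made up for illustration, and the text is assumed to already be in the newline-separated shape produced above.

```python
# Formatted chat text keyed by a hypothetical conversation ID
conversations = {
    101: ("Rep: hi ! \n"
          "Customer: i was wondering if you have a delivery option? "
          "If so what are the options available ? \n"
          "Rep: i'd be happy to answer that for you! "
          "There is a 2 and 5 day delivery options. \n"
          "Customer: ok! thank you \n"
          "Rep: Is there anything else that I can help you with? \n"
          "(Chat ended)"),
}

rows = []
for conv_id, text in conversations.items():
    turn = 0
    for line in text.split("\n"):
        line = line.strip()
        if not line or line == "(Chat ended)":
            continue  # skip blank lines and the closing marker
        # Split on the first colon only, so later colons stay in the message
        speaker, _, message = line.partition(":")
        turn += 1
        rows.append({"id": conv_id, "turn": turn,
                     "speaker": speaker.strip(), "message": message.strip()})

for row in rows:
    print(row)
```

Rows in this shape can be written out with the csv module or loaded into a table, and consecutive Customer and Rep rows can then be paired into question and answer columns.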

Splitting a chat conversation into sentences and mapping the responses
