Split a chat conversation into sentences and map the responses

Time: 2018-06-29 17:21:14

Tags: regex python-3.x text-mining

I have the following data:

Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)

I am trying to break it down into a question-and-answer format like this:

Rep: hi !
Customer: i was wondering if you have a delivery option? If so what are the options available ? 
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options.
Customer: ok! thank you
Rep: Is there anything else that I can help you with?
(Chat ended)

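This is one of a set of conversations, each with a unique ID. After splitting, I would like the questions and answers in separate columns, with each question matched to its answer.

I tried the following: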

for i in d.split(':'):
    if i:
        print(i.strip().split('.'))

The output is:

['Rep']
['hi ! Customer']
['i was wondering if you have a delivery option? If so what are the options available ? Rep']
["i'd be happy to answer that for you! There is a 2 and 5 day delivery options", ' Customer']
['ok! thank you Rep']
['Is there anything else that I can help you with? (Chat ended)']

4 answers:

Answer 0: (score: 0)

So you basically want to work out where to insert newlines. If the speakers are always "Customer" and "Rep", there are a couple of patterns you can try:

(?<!^)(Customer:|Rep:|\(Chat ended)

Here we just check that we are not at the start of the string, then match the constant markers by OR-ing them together. Or, more generally:

(?<=\s)([A-Z]\w+:|\(Chat ended)

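Here we look behind for a whitespace character (which also rules out the start of the string), then match either a capitalized word followed by a colon or the closing sequence, and insert a newline before each match.

In either case, the replacement is: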

\n$0
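
The patterns above are written in generic regex notation. As a minimal sketch of how the first one might be applied in Python: the `re` module has no `$0`, so the whole match is referenced as `\g<0>` in the replacement. The variable names `chat`, `marker`, and `lines` are illustrative assumptions, not part of the original answer.

```python
import re

chat = ("Rep: hi ! Customer: i was wondering if you have a delivery option? "
        "If so what are the options available ? Rep: i'd be happy to answer "
        "that for you! There is a 2 and 5 day delivery options. Customer: ok! "
        "thank you Rep: Is there anything else that I can help you with? "
        "(Chat ended)")

# Match a speaker marker or the closing marker, but not at the very start.
marker = re.compile(r'(?<!^)(Customer:|Rep:|\(Chat ended)')

# Put a newline in front of every match, then split into one turn per line.
lines = [line.strip() for line in marker.sub(r'\n\g<0>', chat).split('\n')]

for line in lines:
    print(line)
```

Stripping each line removes the space that is left in front of every inserted newline.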

Answer 1: (score: 0)

Splitting on ':' is dangerous, because the conversation itself may contain ':'.
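
To illustrate the risk, here is a small sketch using a made-up chat line (not from the question) in which a message contains a time. A plain `split(':')` cuts the time in half, while a pattern anchored on the speaker names keeps it intact. The speaker names `Rep` and `Customer` are assumed to be known in advance.

```python
import re

# Hypothetical chat line whose message contains a colon.
line = "Rep: we open at 9:00 Customer: ok"

# Naive approach: every colon is treated as a separator.
naive = [part.strip() for part in line.split(':')]
print(naive)   # ['Rep', 'we open at 9', '00 Customer', 'ok']

# Anchoring on the speaker names only splits at real turn boundaries.
parsed = re.findall(r'(Rep|Customer): (.*?)\s*(?=(?:Rep|Customer):|$)', line)
print(parsed)  # [('Rep', 'we open at 9:00'), ('Customer', 'ok')]
```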

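You should first know the names of the rep and the customer, so that you can search for those names followed by : in a regex pattern. You can then use re.findall to parse the sample chat into: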

[('Rep', 'hi !'), ('Customer', 'i was wondering if you have a delivery option? If so what are the options available ?'), ('Rep', "i'd be happy to answer that for you! There is a 2 and 5 day delivery options."), ('Customer', 'ok! thank you')]

Then use a loop to map the items into whatever dict data structure you prefer:

import re
from pprint import pprint
def parse_chat(chat, rep, customer):
    conversation = {}
    rep_message = ''
    for person, message in re.findall(r'({0}|{1}): (.*?)\s*(?=(?:{0}|{1}):)'.format(rep, customer), chat):
        if person == rep:
            rep_message = message
        else:
            conversation[rep_message] = message
    return conversation
chat = '''Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)'''
pprint(parse_chat(chat, 'Rep', 'Customer'))

This outputs:

{'hi !': 'i was wondering if you have a delivery option? If so what are the options available ?',
 "i'd be happy to answer that for you! There is a 2 and 5 day delivery options.": 'ok! thank you'}

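The question asks for questions and answers in separate columns. The following is only a sketch of one way that could look, under two assumptions that are not in the answer above: that a Customer turn counts as the "question" and the next Rep turn as its "answer", and that the chat ends with `(Chat ended)`. The helper names `split_turns` and `pair_turns` are hypothetical. Allowing the end marker in the lookahead also keeps the final Rep turn.

```python
import re

def split_turns(chat, speakers=("Rep", "Customer")):
    """Return a list of (speaker, message) tuples, one per turn."""
    names = "|".join(re.escape(s) for s in speakers)
    pattern = r'({0}):\s*(.*?)\s*(?=(?:{0}):|\(Chat ended\)|$)'.format(names)
    return re.findall(pattern, chat)

def pair_turns(turns, asker="Customer", answerer="Rep"):
    """Pair each asker turn with the next answerer turn (an assumption)."""
    rows, question = [], None
    for speaker, message in turns:
        if speaker == asker:
            question = message
        elif speaker == answerer and question is not None:
            rows.append({"question": question, "answer": message})
            question = None
    return rows

chat = ("Rep: hi ! Customer: i was wondering if you have a delivery option? "
        "If so what are the options available ? Rep: i'd be happy to answer "
        "that for you! There is a 2 and 5 day delivery options. Customer: ok! "
        "thank you Rep: Is there anything else that I can help you with? "
        "(Chat ended)")

turns = split_turns(chat)
rows = pair_turns(turns)
for row in rows:
    print(row)
```

Swapping the `asker` and `answerer` arguments gives the opposite pairing (Rep message first, Customer reply second), which is the direction used by parse_chat above. The list of dicts can be passed to a table library if real columns are needed.
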
Answer 2: (score: 0)

You can use a simpler regex:

import re

p = re.compile(r'(\w*\s*:)')

input_string = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"

new_string = p.sub(r'\n\g<1>',input_string)

for line in new_string.split('\n')[1:]:
    print(line)

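Answer 3: (score: 0)

Solution

Assuming that there is only ever a single word (a run of non-whitespace characters) before each colon, the best approach is to use a regular expression to match the Customer or Rep string before the colon and then insert a newline in front of it, so that you get the proper format.

Here is a working example: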

import re

# The data has been stored into a string by this point
data = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"

# First insert the newlines before the first word before a colon
newlines = re.sub(r'(\S+)\s*:', r'\n\g<1>:', data)

# Remove the first newline and fix the (Chat ended) on the end
solution = re.sub(r'\(Chat ended\)', '\n(Chat ended)', newlines[1:])

print(solution)

> "Rep: hi ! 
  Customer: i was wondering if you have a delivery option? If so what are the options available ? 
  Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. 
  Customer: ok! thank you 
  Rep: Is there anything else that I can help you with? 
  (Chat ended)"

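Explanation

The newlines = re.sub... line first searches the data string for any run of non-whitespace characters followed by a colon. It replaces each match with a \n character, followed by whatever the \S+ group matched (this could be Customer, Rep, Bill, etc.), and then puts the : back at the end.

Finally, assuming every conversation ends with (Chat ended), the next line of code simply matches that text and moves it onto a new line, in the same way as the newlines = re.sub... line.

The output is a string, but if you need it in some other form, you can split it on '\n' and go from there.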
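
As a rough sketch of that last step, the newline-separated string can be turned into (speaker, message) pairs. The short `solution` string below is an abbreviated stand-in for the real output, and `turns` is an illustrative name. `str.partition` splits on the first colon only, so colons inside a message are left alone.

```python
# Abbreviated stand-in for the newline-separated string built above.
solution = "Rep: hi ! \nCustomer: ok! thank you \n(Chat ended)"

turns = []
for line in solution.split('\n'):
    line = line.strip()
    # Skip blank lines and the closing marker, which has no speaker.
    if not line or line == '(Chat ended)':
        continue
    speaker, _, message = line.partition(':')
    turns.append((speaker.strip(), message.strip()))

print(turns)  # [('Rep', 'hi !'), ('Customer', 'ok! thank you')]
```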