Finite-state machine

Question

I have four speakers like this:

Team_A=[Fred,Bob]

Team_B=[John,Jake]

They are having a conversation, and all of it is represented as a single string, i.e. convo =

Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

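How can I take the string apart and reassemble it so that it is split into 2 strings: 1 string with what Team_A said, and 1 string with what Team_B said?

Output: team_A_said="hello how is it going?", team_B_said="hi we are doing fine"

The line breaks don't matter.

I have this awful find ... then slice code that doesn't scale. Can someone suggest something else? Are there any libraries that can help?

I couldn't find anything in the nltk library.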

Answer 1

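This code assumes that the contents of convo conform strictly to the
name\nstuff they said\n\n
pattern. The only tricky code it uses is zip(*[iter(lines)]*3), which creates a list of triples of strings from the lines list. For a discussion of this technique and of alternatives, see How do you split a list into evenly sized chunks in Python?.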
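As a small illustration of that idiom on its own (a sketch only; the sample `lines` list is made up for the demonstration): the same iterator object appears three times in the argument list, so every tuple that zip builds consumes three consecutive items.

```python
# Sketch of the grouping idiom; `lines` is made-up sample data.
lines = ['Fred', 'hello', '', 'John', 'hi', '']

# One iterator, referenced three times: each tuple zip builds
# pulls three consecutive items from it.
it = iter(lines)
triples = list(zip(it, it, it))  # equivalent to zip(*[iter(lines)]*3)

# If the last chunk is incomplete, zip silently drops it.
short = list(zip(*[iter(lines[:5])] * 3))

print(triples)
print(short)
```

The second result shows why the final blank line must not be omitted: with only five items, the incomplete John chunk is dropped and only the Fred triple survives.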

#!/usr/bin/env python

team_ids = ('A', 'B')

team_names = (
    ('Fred', 'Bob'),
    ('John', 'Jake'),
)

#Build a dict to get team name from person name
teams = {}
for team_id, names in zip(team_ids, team_names):
    for name in names:
        teams[name] = team_id


#Each block in convo MUST consist of <name>\n<one line of text>\n\n
#Do NOT omit the final blank line at the end
convo = '''Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

'''

lines = convo.splitlines()

#Group lines into <name><text><empty> chunks
#and append the text into the appropriate list in `said`
said = {'A': [], 'B': []}
for name, text, _ in zip(*[iter(lines)]*3):
    team_id = teams[name]
    said[team_id].append(text)

for team_id in team_ids:
    # Parenthesized single-argument form runs on both Python 2 and Python 3
    print('Team %s said: %r' % (team_id, ' '.join(said[team_id])))

Output

Team A said: 'hello how is it going?'
Team B said: 'hi we are doing fine'

Answer 2

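You can use a regular expression to split out each entry. itertools.ifilter can then be used to extract the required entries for each team. Note that itertools.ifilter and the print statement exist only in Python 2, so the code here is Python 2; in Python 3 the built-in filter plays the same role.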

import itertools
import re

def get_team_conversation(entries, team):
    return [e for e in itertools.ifilter(lambda x: x.split('\n')[0] in team, entries)]

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']

convo = """
Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine"""

find_teams = '^(' + '|'.join(Team_A + Team_B) + r')$'
entries = [e[0].strip() for e in re.findall('(' + find_teams + '.*?)' + '(?=' + find_teams + r'|\Z)', convo, re.S+re.M)]

print 'Team-A', get_team_conversation(entries, Team_A)
print 'Team-B', get_team_conversation(entries, Team_B)

This gives the following output:

Team-A ['Fred\nhello', 'Bob\nhow is it going?']
Team-B ['John\nhi', 'Jake\nwe are doing fine']
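The lists above still contain the speaker names, while the question asks for one joined string per team. A possible Python 3 variant of the same regex idea is sketched below. It is not the answer's original code: `team_said` is a hypothetical helper, and the pattern captures the name and the text as separate groups so the name can be dropped.

```python
import re

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']

convo = """
Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine"""

# A name alone on its line, then everything up to the next name line or end of text.
names = '|'.join(map(re.escape, Team_A + Team_B))
pattern = re.compile(r'^(%s)$(.*?)(?=^(?:%s)$|\Z)' % (names, names), re.S | re.M)

def team_said(team):
    # Hypothetical helper: join the text of every entry spoken by a member of `team`.
    # text.split() also collapses the line breaks, which the question says do not matter.
    return ' '.join(' '.join(text.split())
                    for name, text in pattern.findall(convo)
                    if name in team)

team_A_said = team_said(Team_A)
team_B_said = team_said(Team_B)
print(team_A_said)  # hello how is it going?
print(team_B_said)  # hi we are doing fine
```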

Answer 3

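This is a language-parsing problem.

This answer is a work in progress.

Finite-state machine

A conversation transcript can be understood by describing it as being parsed by an automaton with the following states: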

[start]  ---> [Name]----> [Text]-+----->[end]
               ^                 |
               |                 | (whitespaces)
               +-----------------+

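You can parse the conversation by following that state machine. If your parse succeeds (i.e. it follows the states through to the end of the text), you can walk the "conversation tree" to extract meaning.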
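One way the automaton could be written down in Python is sketched below. This is an illustration rather than a finished parser: the function name `parse_transcript`, the extra 'gap' state for the whitespace edge, and the choice to let Text span several lines until a blank line are all assumptions of mine.

```python
def parse_transcript(text, speakers):
    """Return a list of (name, text) turns, or raise ValueError on a bad transcript."""
    turns = []
    state = 'start'
    for line in text.splitlines():
        line = line.strip()
        if state in ('start', 'gap'):
            if not line:
                continue  # whitespace between turns
            if line not in speakers:
                raise ValueError('expected a speaker name, got %r' % line)
            turns.append((line, []))
            state = 'name'
        elif state == 'name':
            if not line:
                raise ValueError('no text after %r' % turns[-1][0])
            turns[-1][1].append(line)
            state = 'text'
        elif state == 'text':
            if line:
                turns[-1][1].append(line)  # assumption: text may span several lines
            else:
                state = 'gap'  # the (whitespaces) edge back towards Name
    if state == 'name':
        raise ValueError('transcript ends right after a name')
    return [(name, ' '.join(parts)) for name, parts in turns]


convo = 'Fred\nhello\n\nJohn\nhi\n\nBob\nhow is it going?\n\nJake\nwe are doing fine\n'
turns = parse_transcript(convo, {'Fred', 'Bob', 'John', 'Jake'})
print(turns)
```

A successful parse yields the list of turns in order; from there, collecting what each team said is a matter of filtering the turns by name.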

Tokenizing your conversation (lexer)

You need a function that recognizes the name state. This is straightforward:

name = (Team_A | Team_B) + '\n'
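That rule could be turned into a recognizer roughly as follows. This is a sketch under my own assumptions: `match_name` is a hypothetical helper, and the team lists are taken from the question.

```python
import re

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']

# name = (Team_A | Team_B) + '\n'
NAME = re.compile(r'(%s)\n' % '|'.join(map(re.escape, Team_A + Team_B)))

def match_name(text, pos=0):
    """Hypothetical helper: return (name, next_pos) if a name token starts at pos, else None."""
    m = NAME.match(text, pos)
    return (m.group(1), m.end()) if m else None

print(match_name('Fred\nhello\n'))
print(match_name('hello\n'))
```

Requiring the trailing newline means that a longer word which merely starts with a known name, such as Freddy, is not mistaken for a name token.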

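Alternating speakers

In this answer I have not assumed that a conversation strictly alternates between speakers. A transcript like the following, where the same author appears twice in a row, would break such an assumption: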

Fred     # author 1
hello

John     # author 2
hi

Bob      # author 3
how is it going ?

Bob      # ERROR : author 3 again !
are we still on for saturday, Fred ?

This could be a problem if your transcript concatenates replies from the same author.
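If consecutive turns by the same author should simply be treated as one turn, one option (a sketch, assuming the transcript has already been parsed into hypothetical (name, text) pairs) is itertools.groupby, which groups only adjacent items that share a key:

```python
from itertools import groupby

# Assumed input: already-parsed (name, text) turns, in transcript order.
turns = [
    ('Fred', 'hello'),
    ('John', 'hi'),
    ('Bob', 'how is it going ?'),
    ('Bob', 'are we still on for saturday, Fred ?'),
]

# Merge runs of adjacent turns that share the same author.
merged = [(name, ' '.join(text for _, text in group))
          for name, group in groupby(turns, key=lambda turn: turn[0])]

print(merged)
```

Here the two Bob turns collapse into a single entry, while turns by the same author that are separated by another speaker stay separate.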

