I have four speakers like this:
Team_A=[Fred,Bob]
Team_B=[John,Jake]
They are having a conversation, and the whole thing is represented as a single string, i.e. convo =
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine
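How can I take the string apart and reassemble it so that it is split into 2 strings: 1 string of what Team_A said, and 1 string of what Team_B said?
Output: team_A_said="hello how is it going?", team_B_said="hi we are doing fine"
The line breaks don't matter.
I have this awful find ... then slice code that does not scale. Can someone suggest something else? Are there any libraries that could help?
I looked in the nltk library.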
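Answer 0 (score: 2)
This code assumes that the contents of convo strictly follow the name\nstuff they said\n\n pattern. The only tricky code it uses is zip(*[iter(lines)]*3), which creates a list of triplets of strings from the lines list. For a discussion of this technique and alternatives, see How do you split a list into evenly sized chunks in Python?.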
#!/usr/bin/env python

team_ids = ('A', 'B')

team_names = (
    ('Fred', 'Bob'),
    ('John', 'Jake'),
)

# Build a dict to get the team id from a person's name
teams = {}
for team_id, names in zip(team_ids, team_names):
    for name in names:
        teams[name] = team_id

# Each block in convo MUST consist of <name>\n<one line of text>\n\n
# Do NOT omit the final blank line at the end
convo = '''Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

'''

lines = convo.splitlines()

# Group lines into <name><text><empty> chunks
# and append the text into the appropriate list in `said`
said = {'A': [], 'B': []}
for name, text, _ in zip(*[iter(lines)]*3):
    team_id = teams[name]
    said[team_id].append(text)

for team_id in team_ids:
    print('Team %s said: %r' % (team_id, ' '.join(said[team_id])))

Output
Team A said: 'hello how is it going?'
Team B said: 'hi we are doing fine'
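To see what zip(*[iter(lines)]*3) does in isolation, here is a minimal sketch in Python 3 syntax. The sample list is made up for illustration and is not part of the answer above. The key point is that the three arguments passed to zip are the same iterator object, so each tuple consumes three consecutive items.

```python
# Hypothetical sample data in the same <name>, <text>, <blank> layout.
lines = ['Fred', 'hello', '', 'John', 'hi', '']

# [iter(lines)] * 3 is a list holding the SAME iterator three times,
# so zip pulls three consecutive items from it for every tuple.
chunks = list(zip(*[iter(lines)] * 3))
print(chunks)

# zip stops as soon as the shared iterator runs out, so an incomplete
# trailing chunk is dropped without raising an error.
short = list(zip(*[iter(lines[:5])] * 3))
print(short)
```

This is why the final blank line matters for the code above: without it, the last speaker's block would be an incomplete chunk and would be dropped without any warning.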
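Answer 1 (score: 1)
You can use a regular expression to split out each entry. itertools.ifilter can then be used to extract the required entries for each conversation. (itertools.ifilter exists only in Python 2; in Python 3 the built-in filter behaves the same way.)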
import itertools
import re
def get_team_conversation(entries, team):
    # Keep only the entries whose first line (the speaker's name) is in `team`
    return [e for e in itertools.ifilter(lambda x: x.split('\n')[0] in team, entries)]
Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']
convo = """
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine"""
find_teams = '^(' + '|'.join(Team_A + Team_B) + r')$'
entries = [e[0].strip() for e in re.findall('(' + find_teams + '.*?)' + '(?=' + find_teams + r'|\Z)', convo, re.S+re.M)]
print 'Team-A', get_team_conversation(entries, Team_A)
print 'Team-B', get_team_conversation(entries, Team_B)
This gives the following output:
Team-A ['Fred\nhello', 'Bob\nhow is it going?']
Team-B ['John\nhi', 'Jake\nwe are doing fine']
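For readers on Python 3, the same idea might look like the sketch below. It is an adaptation, not the answerer's code: the helper name split_entries, the use of re.escape, the non-capturing groups, and the final join into one string per team are all assumptions added for illustration.

```python
import re

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']

convo = """
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine"""

def split_entries(text, names):
    """Split text into entries, one per speaker turn (hypothetical helper)."""
    # re.escape guards against names that contain regex metacharacters.
    name = '^(?:' + '|'.join(re.escape(n) for n in names) + ')$'
    # An entry starts at a name line and lazily runs up to the next
    # name line or the end of the string.
    pattern = '(' + name + '.*?)(?=' + name + r'|\Z)'
    return [e.strip() for e in re.findall(pattern, text, re.S | re.M)]

def get_team_conversation(entries, team):
    # Keep entries whose first line (the speaker) belongs to the team.
    return [e for e in entries if e.split('\n')[0] in team]

entries = split_entries(convo, Team_A + Team_B)
team_a_entries = get_team_conversation(entries, Team_A)
team_b_entries = get_team_conversation(entries, Team_B)

# Drop the name line and join the remaining text into one string per team.
team_A_said = ' '.join(e.partition('\n')[2] for e in team_a_entries)
team_B_said = ' '.join(e.partition('\n')[2] for e in team_b_entries)
print(team_A_said)
print(team_B_said)
```

Because the pattern has only one capturing group, re.findall returns plain strings here rather than tuples, so no e[0] indexing is needed.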
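Answer 2 (score: 0)
This is a language parsing problem.
This answer is a work in progress.
The conversation transcript can be understood by describing it as being parsed by an automaton with the following states: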
[start] ---> [Name]----> [Text]-+----->[end]
              ^                 |
              |                 | (whitespaces)
              +-----------------+
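You can parse the conversation by following that state machine. If the parse succeeds (i.e. following the states takes you to the end of the text), you can walk the "conversation tree" to extract meaning.
You will need a function to recognize the name state. That is simple: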
name = (Team_A | Team_B) + '\n'
In this answer, I did not assume that a conversation involves alternation between the people speaking, as this conversation would:
Fred # author 1
hello
John # author 2
hi
Bob # author 3
how is it going ?
Bob # ERROR : author 3 again !
are we still on for saturday, Fred ?
This could be a problem if your transcript concatenates consecutive replies from the same author.
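The automaton described in this answer could be sketched in Python 3 roughly as follows. This is one possible reading, not the answerer's implementation: the function name parse, the teams mapping, and the decision to recognize a name by exact match against a whole line are assumptions. The sketch is deliberately lenient: it accepts the same speaker twice in a row and several text lines per turn.

```python
Team_A = {'Fred', 'Bob'}
Team_B = {'John', 'Jake'}

def parse(convo, teams):
    """Walk the transcript line by line as a small state machine.

    `teams` maps a team label to the set of its members' names.
    Returns {label: [text lines said by that team]}.
    """
    speaker_team = {name: label
                    for label, names in teams.items()
                    for name in names}
    said = {label: [] for label in teams}
    current = None  # None means no name seen yet (the [start] state)
    for raw in convo.splitlines():
        line = raw.strip()
        if not line:
            continue  # whitespace between blocks
        if line in speaker_team:
            current = speaker_team[line]  # entering the [Name] state
        elif current is None:
            raise ValueError('text before any speaker name: %r' % line)
        else:
            said[current].append(line)  # the [Text] state
    return said

convo = """Fred
hello

John
hi

Bob
how is it going ?
Bob
are we still on for saturday, Fred ?
"""

said = parse(convo, {'A': Team_A, 'B': Team_B})
team_A_said = ' '.join(said['A'])
team_B_said = ' '.join(said['B'])
print(team_A_said)
print(team_B_said)
```

One limitation of matching names by whole line: a line of text that consists of nothing but a speaker's name would be taken as the start of a new turn.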