Split a string according to a list

Time: 2015-10-15 09:40:51

Tags: python string nltk

I have four speakers like this:

Team_A = ['Fred', 'Bob']

Team_B = ['John', 'Jake']

They are having a conversation that is stored entirely in one string, i.e. convo =

Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

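How can I take the string apart and reassemble it so that it is split into 2 strings: one with everything Team_A said and one with everything Team_B said?

Output: team_A_said = "hello how is it going?", team_B_said = "hi we are doing fine"

The line breaks are not important.

I have this awful find ... then slice code that does not scale. Can anyone suggest something else? Are there any libraries that could help?

I could not find anything in nltk.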

3 answers:

Answer 0 (score: 2)

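This code assumes that the contents of convo strictly follow the
name\nstuff they said\n\n
pattern. The only tricky code it uses is zip(*[iter(lines)]*3), which builds a list of triples of strings from the lines list. For a discussion of this technique and alternatives, see How do you split a list into evenly sized chunks in Python?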

#!/usr/bin/env python

team_ids = ('A', 'B')

team_names = (
    ('Fred', 'Bob'),
    ('John', 'Jake'),
)

#Build a dict to get team id from person name
teams = {}
for team_id, names in zip(team_ids, team_names):
    for name in names:
        teams[name] = team_id


#Each block in convo MUST consist of <name>\n<one line of text>\n\n
#Do NOT omit the final blank line at the end
convo = '''Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

'''

lines = convo.splitlines()

#Group lines into <name><text><empty> chunks
#and append the text into the appropriate list in `said`
said = {'A': [], 'B': []}
for name, text, _ in zip(*[iter(lines)]*3):
    team_id = teams[name]
    said[team_id].append(text)

for team_id in team_ids:
    print('Team %s said: %r' % (team_id, ' '.join(said[team_id])))

Output

Team A said: 'hello how is it going?'
Team B said: 'hi we are doing fine'
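
The chunking idiom is easier to see in isolation. The following is a minimal Python 3 sketch, not part of the original answer: `chunks_of_three` is a hypothetical helper name, and using `itertools.zip_longest` instead of `zip` is an assumption made here so that a transcript missing its final blank line still yields its last block.

```python
from itertools import zip_longest

def chunks_of_three(lines):
    """Group a flat list of lines into (name, text, blank) triples."""
    it = iter(lines)
    # The same iterator is passed three times, so every triple
    # consumes three consecutive lines from the list.
    return list(zip_longest(it, it, it, fillvalue=''))

# Note: this transcript has no trailing blank line.
lines = 'Fred\nhello\n\nJohn\nhi'.splitlines()

# Plain zip silently drops the incomplete last block.
strict = list(zip(*[iter(lines)] * 3))
# zip_longest pads the last block instead of dropping it.
padded = chunks_of_three(lines)

print(strict)
print(padded)
```

With plain `zip`, John's line is lost because his block has only two lines, which is why the answer insists on the final blank line.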

Answer 1 (score: 1)

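You can use a regular expression to split out each entry. The built-in filter (itertools.ifilter on Python 2) can then be used to extract the required entries for each team.

import re

def get_team_conversation(entries, team):
    # Keep only the entries whose first line (the speaker name) belongs to `team`
    return list(filter(lambda x: x.split('\n')[0] in team, entries))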

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']

convo = """
Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine"""

find_teams = '^(' + '|'.join(Team_A + Team_B) + r')$'
entries = [e[0].strip() for e in re.findall('(' + find_teams + '.*?)' + '(?=' + find_teams + r'|\Z)', convo, re.S+re.M)]

print('Team-A %s' % get_team_conversation(entries, Team_A))
print('Team-B %s' % get_team_conversation(entries, Team_B))

This gives the following output:

Team-A ['Fred\nhello', 'Bob\nhow is it going?']
Team-B ['John\nhi', 'Jake\nwe are doing fine']
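
The lists printed above still contain the speaker names, so they are not yet the two joined strings the question asks for. As a sketch, assuming entries shaped like the ones printed above (name on the first line, text after it), something like the following could finish the job. `team_said` is a hypothetical helper, not part of the original answer.

```python
def team_said(entries, team):
    """Join the text of every entry whose first line is a member of `team`."""
    parts = []
    for entry in entries:
        name, _, text = entry.partition('\n')
        if name in team:
            # Collapse internal line breaks; the question says they do not matter.
            parts.append(' '.join(text.split()))
    return ' '.join(parts)

entries = ['Fred\nhello', 'John\nhi', 'Bob\nhow is it going?', 'Jake\nwe are doing fine']
team_A_said = team_said(entries, ['Fred', 'Bob'])
team_B_said = team_said(entries, ['John', 'Jake'])

print(team_A_said)  # hello how is it going?
print(team_B_said)  # hi we are doing fine
```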

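Answer 2 (score: 0)

This is a language parsing problem.

This answer is a work in progress.

Finite state machine

A conversation transcript can be understood by describing it as being parsed by an automaton with the following states: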

[start]  ---> [Name]----> [Text]-+----->[end]
               ^                 |
               |                 | (whitespaces)
               +-----------------+  

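You can parse your conversation by following that state machine. If the parse succeeds (i.e. it follows the states all the way to the end of the text), you can walk the "conversation tree" to derive meaning.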
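
The automaton above can be followed almost literally in code. The sketch below is one possible Python 3 reading of it, not the answer author's implementation: `parse_transcript` and its state labels are hypothetical names, and it assumes each turn has exactly one line of text and that blank lines act only as separators between turns.

```python
def parse_transcript(convo, names):
    """Walk the transcript through the NAME -> TEXT -> NAME ... states.

    Returns a list of (name, text) turns and raises ValueError if the
    transcript does not follow the state machine.
    """
    turns = []
    state = 'NAME'          # what the next non-blank line is expected to be
    current_name = None
    for lineno, line in enumerate(convo.splitlines(), 1):
        line = line.strip()
        if not line:
            # A blank line is only legal between turns, never between a name and its text.
            if state == 'TEXT':
                raise ValueError('line %d: expected text after %r' % (lineno, current_name))
            continue
        if state == 'NAME':
            if line not in names:
                raise ValueError('line %d: unknown speaker %r' % (lineno, line))
            current_name = line
            state = 'TEXT'
        else:
            turns.append((current_name, line))
            state = 'NAME'
    if state == 'TEXT':
        raise ValueError('transcript ends after a name with no text')
    return turns

convo = 'Fred\nhello\n\nJohn\nhi\n\nBob\nhow is it going?\n\nJake\nwe are doing fine\n'
turns = parse_transcript(convo, {'Fred', 'Bob', 'John', 'Jake'})
print(turns)
```

The resulting list of (name, text) pairs is a flat stand-in for the "conversation tree"; grouping it by team is then a simple loop over the turns.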

Tokenizing your conversation (lexer)

You need a function to recognize the name state. This is simple:

name = (Team_A | Team_B) + '\n'
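
One way to turn the rule above into runnable code is a regular expression built from the team lists. This is a Python 3 sketch under that assumption: `Team_A` and `Team_B` are taken from the question, while `NAME_RE` and `is_name` are hypothetical names introduced for illustration.

```python
import re

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']

# name = (Team_A | Team_B) + '\n', written as an alternation of all known names.
# re.escape guards against names that contain regex metacharacters.
NAME_RE = re.compile(r'(?:%s)\n' % '|'.join(re.escape(n) for n in Team_A + Team_B))

def is_name(line):
    """Return True if `line` (including its trailing newline) is exactly a known speaker name."""
    return NAME_RE.fullmatch(line) is not None

print(is_name('Fred\n'))    # True
print(is_name('hello\n'))   # False
```

Using `fullmatch` rather than `match` means a line such as 'Freddy\n' is not mistaken for 'Fred'.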

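Conversation alternation

In this answer I have not assumed that a conversation alternates between the people speaking, as this conversation would: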

Fred     # author 1
hello

John     # author 2
hi

Bob      # author 3
how is it going ?

Bob      # ERROR : author 3 again !
are we still on for saturday, Fred ?

This might be problematic if your transcript concatenates answers from the same author.
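
If strict alternation is wanted, one option (an assumption here, not something this answer prescribes) is to merge adjacent turns by the same speaker before any further processing. The Python 3 sketch below uses `itertools.groupby` for that; `merge_consecutive` is a hypothetical helper and the turns are assumed to be (name, text) pairs.

```python
from itertools import groupby

def merge_consecutive(turns):
    """Merge adjacent (name, text) turns by the same speaker into a single turn."""
    merged = []
    # groupby only groups neighbouring items, which is exactly what is needed here.
    for name, group in groupby(turns, key=lambda turn: turn[0]):
        merged.append((name, ' '.join(text for _, text in group)))
    return merged

turns = [
    ('Fred', 'hello'),
    ('John', 'hi'),
    ('Bob', 'how is it going ?'),
    ('Bob', 'are we still on for saturday, Fred ?'),
]
merged = merge_consecutive(turns)
print(merged)
```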