I want to design a custom tokenizer module in Python that lets users specify which tokenizer(s) to use for the input. For instance, consider the following input:
Q: What is a good way to achieve this? A: I am not so sure. I think I will use Python.
I want to be able to provide NLTK's sentence tokenization, sent_tokenize(), as an option, because it works well in many situations and I don't want to reinvent the wheel. Besides that, I also want to provide a finer-grained tokenization builder (something like a rule engine). Let me explain:
Suppose I provide a couple of tokenizers (a rough sketch of wiring them up follows this list):
SENTENCE # Tokenizes the given input by using sent_tokenize()
WORD # Tokenizes the given input by using word_tokenize()
QA # Tokenizes using a custom regular expression. E.g., Q: (.*?) A: (.*?)
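As a rough sketch of what I have in mind (hypothetical names, assuming NLTK and its punkt data are installed), the tokenizers could be registered under their control strings and looked up by name:

import re
from nltk.tokenize import sent_tokenize, word_tokenize  # requires nltk and its punkt data

# Hypothetical registry: control string -> tokenizer callable
TOKENIZERS = {
    'SENTENCE': sent_tokenize,
    'WORD':     word_tokenize,
    'QA':       lambda txt: re.findall(r'(?s)Q: (.*?) A: (.*)', txt),  # (question, answer) pairs
}

def tokenize(name, text):
    # Apply the tokenizer registered under `name` to `text`.
    return TOKENIZERS[name](text)

tokenize('SENTENCE', text) would then return NLTK's sentence list, while tokenize('QA', text) would yield (question, answer) tuples.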
I'd like to support rules such as the following (a minimal way to parse such rule strings is sketched after the examples):
QA -> SENTENCE  # First apply the QA tokenizer, then the sentence tokenizer on each of its results
QA              # Apply only the QA tokenizer
Therefore, the expected output is as follows:
1. QA -> SENTENCE
[
    ('QUESTION',
        ('SENTENCE', 'What is a good way to achieve this?'),
    ),
    ('ANSWER',
        ('SENTENCE', 'I am not so sure', 'I think I will use Python')
    )
]
2. QA
[
    ('QUESTION', 'What is a good way to achieve this?'),
    ('ANSWER', 'I am not so sure. I think I will use Python')
]
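As an aside, a rule string like QA -> SENTENCE could simply be split into an ordered list of tokenizer names and applied stage by stage; a minimal, hypothetical helper:

def parse_rule(rule):
    # Split a rule like 'QA -> SENTENCE' into an ordered list of tokenizer names.
    return [name.strip() for name in rule.split('->')]

print(parse_rule('QA -> SENTENCE'))  # ['QA', 'SENTENCE']
print(parse_rule('QA'))              # ['QA']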
What would be a good design to implement this efficiently?
Answer 0 (score: 10)
Since tokenization is easy in Python, I'm wondering what your module is planning to provide. I mean, when starting a piece of software, a good design comes rather from thinking about the usage scenarios than from considering data structures first.
Your example for the expected output is a bit confusing. I assume you want the tokenizers to return a name on the left side and a list of tokens on the right side. I played around a bit to achieve similar results, but using lists for easier handling:
import re

# some tokenizers
def tokzr_WORD(txt): return ('WORD', re.findall(r'(?ms)\W*(\w+)', txt))  # split words
def tokzr_SENT(txt): return ('SENTENCE', re.findall(r'(?ms)\s*(.*?(?:\.|\?|!))', txt))  # split sentences

def tokzr_QA(txt):
    l_qa = []
    for m in re.finditer(r'(?ms)^[\s#\-\*]*(?:Q|Question)\s*:\s*(?P<QUESTION>\S.*?\?)[\s#\-\*]+(?:A|Answer)\s*:\s*(?P<ANSWER>\S.*?)$', txt):  # split (Q, A) sequences
        for k in ['QUESTION', 'ANSWER']:
            l_qa.append(m.groupdict()[k])
    return ('QA', l_qa)

def tokzr_QA_non_canonical(txt):  # Note: not supported by tokenize_recursively() as not canonical.
    l_qa = []
    for m in re.finditer(r'(?ms)^[\s#\-\*]*(?:Q|Question)\s*:\s*(?P<QUESTION>\S.*?\?)[\s#\-\*]+(?:A|Answer)\s*:\s*(?P<ANSWER>\S.*?)$', txt):  # split (Q, A) sequences
        for k in ['QUESTION', 'ANSWER']:
            l_qa.append((k, m.groupdict()[k]))
    return l_qa

dict_tokzr = {  # control string: tokenizer function
    'WORD'    : tokzr_WORD,
    'SENTENCE': tokzr_SENT,
    'QA'      : tokzr_QA,
}
# the core function
def tokenize_recursively(l_tokzr, work_on, lev=0):
    if isinstance(work_on, str):  # Python 3; the original used basestring (Python 2)
        ctrl, work_on = dict_tokzr[l_tokzr[0]](work_on)  # tokenize
    else:
        ctrl, work_on = work_on[0], work_on[1:]  # get right part
    ret = [ctrl]
    if len(l_tokzr) == 1:
        ret.append(work_on)  # add right part
    else:
        for wo in work_on:  # dive into tree
            t = tokenize_recursively(l_tokzr[1:], wo, lev + 1)
            ret.append(t)
    return ret
# just for printing
def nestedListLines(aList, ind='    ', d=0):
    """Return a multi-line string representation of aList; use ind to indent per level."""
    sRet = '\n' + d * ind + '['
    nested = 0
    for i, e in enumerate(aList):
        if i:
            sRet += ', '
        if type(e) == type(aList):
            sRet += nestedListLines(e, ind, d + 1)
            nested = 1
        else:
            sRet += '\n' + (d + 1) * ind + repr(e) if nested else repr(e)
    sRet += '\n' + d * ind + ']' if nested else ']'
    return sRet
# main()
inp1 = """
    * Question: I want try something.  Should I?
    * Answer  : I'd assume so.  Give it a try.
"""
inp2 = inp1 + 'Q: What is a good way to achieve this? A: I am not so sure. I think I will use Python.'

print(repr(tokzr_WORD(inp1)))
print(repr(tokzr_SENT(inp1)))
print(repr(tokzr_QA(inp1)))
print(repr(tokzr_QA_non_canonical(inp1)))  # Really this way?
print()

for ctrl, inp in [  # example control sequences
        ('SENTENCE-WORD', inp1),
        ('QA-SENTENCE', inp2),
        ]:
    res = tokenize_recursively(ctrl.split('-'), inp)
    print(nestedListLines(res))
By the way, Python's own Lib/tokenize.py (the tokenizer for Python source code itself) may be worth a look at how it handles things.
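For instance, a minimal illustration of what the stdlib tokenize module yields for a line of Python source (illustrative snippet):

import io
import tokenize

# generate_tokens() expects a readline callable and yields TokenInfo tuples.
src = 'answer = 6 * 7\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tok.type, repr(tok.string))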
Answer 1 (score: 4)
If I understood the question correctly, then I do think you should reinvent the wheel. I would implement state machines for the different kinds of tokenization you want, and use Python dictionaries for saving the tokens.
http://en.wikipedia.org/wiki/Finite-state_machine
The example state machine below takes a sentence with spaces and prints out the words; of course, you could do this specific example in simpler ways! But in general, with state machines you get linear-time performance and can customize them easily!
# Setup (assumed; the original snippet omitted the initialization):
text = "I think I will use Python"
state = "start"
word = []   # characters of the current word
i = 0       # index into text

while 1:
    if state == "start":
        if i == len(text):
            state = "end"
        elif text[i] == " ":
            state = "new word"
            i = i - 1
        else:
            word.append(text[i])
    elif state == "new word":
        print(''.join(word))
        del word[:]
        state = "start"
    elif state == "end":
        print(''.join(word))
        break
    i = i + 1
http://docs.python.org/2/library/collections.html#collections.Counter
Then you could use this Python data structure (collections.Counter) for saving your tokens. I think it is perfectly suitable for your needs!
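For instance, a tiny illustration with made-up tokens:

from collections import Counter

tokens = ['I', 'think', 'I', 'will', 'use', 'Python']
counts = Counter(tokens)
print(counts['I'])            # 2
print(counts.most_common(2))  # e.g. [('I', 2), ('think', 1)]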
Hope this is of some help.