I am using NLTK to tokenize Wikipedia articles into sentences, but the punkt tokenizer is not giving very good results: sentences are sometimes split at abbreviations such as "etc.", and double quotes in the text produce output like ['as they say "harry is a good boy.', '" He thinks'].
I would like to stick with NLTK itself, because this step is sandwiched between several other processes.
Is there another tokenizer I could use?
I also don't mind using any other Python library.
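For context, here is a minimal sketch of the setup being described, using the standard nltk.sent_tokenize entry point (the sample string below is only illustrative; the exact splits depend on the bundled punkt model):

import nltk

# Download the punkt model once, if it is not already available.
nltk.download("punkt")

# Illustrative sample containing the two problematic cases reported above:
# an abbreviation ("etc.") and a quoted sentence.
sample = 'They sell apples, pears, etc. as they say "harry is a good boy." He thinks so.'

# punkt-based sentence splitting; this is where the unwanted breaks appear.
print(nltk.sent_tokenize(sample))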
Answer 0: (Score: 3)
Try splitting the text with a regular expression; you can use a negative lookbehind assertion:
import re

# This is the Lorem ipsum text, modified a little in order to match your requirements.
# Note the following:
# 1. - "et dolore magna"            --> the presence of `"`
# 2. - Sunt, in culpa, etc. qui ... --> the presence of `etc.`
text = """Lorem ipsum dolor sit amet. Consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore "et dolore magna" aliqua. Ut enim ad minim veniam. Cillum dolore
proident. Sunt, in culpa, etc. qui officia deserunt mollit anim id est laborum."""

# A negative lookbehind assertion is used to split the text on any period `.`
# that is not preceded by `etc`.
sentences = re.split(r"(?<!etc)\.", text)

# Then each sentence is reduced to its words, which also drops the quotes
# and any other punctuation.
sentences = [" ".join(re.findall(r"\w+", sentence)) for sentence in sentences]

# Finally, print the result.
print(sentences)
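For the sample text above, this should print something close to the following (note that the quotes are stripped along with the rest of the punctuation, and the trailing period yields an empty final element):

['Lorem ipsum dolor sit amet',
 'Consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua',
 'Ut enim ad minim veniam',
 'Cillum dolore proident',
 'Sunt in culpa etc qui officia deserunt mollit anim id est laborum',
 '']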
Of course, all of this works better if we wrap it in a function that we can reuse at any time.
def get_sentences(text):
    sentences = re.split(r"(?<!etc)\.", text)
    return [" ".join(re.findall(r"\w+", sentence)) for sentence in sentences]
# Example of use.
print(get_sentences(text))
Important:
If you come across another exception besides etc., say NLTK., you can add it to the splitter pattern as well. Note that Python's re module requires lookbehind patterns to be fixed-width, so each abbreviation gets its own chained lookbehind:
...
sentences = re.split(r"(?<!etc)(?<!NLTK)\.", text)
...
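Not part of the original answer, but if the list of exceptions keeps growing, the chained lookbehinds can be generated from a plain Python list (a sketch; the abbreviation list here is just an example):

import re

# Hypothetical list of abbreviations that should not end a sentence.
abbreviations = ["etc", "NLTK", "Dr", "Mr"]

# Build one fixed-width negative lookbehind per abbreviation, then require
# a literal period after them.
pattern = "".join(f"(?<!{re.escape(abbr)})" for abbr in abbreviations) + r"\."

sentences = re.split(pattern, text)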