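(This question is about string checking in general rather than natural language processing as such. If you do view it as an NLP problem, imagine that the language is one that current analysers cannot handle; for simplicity I will use English strings as examples.)
Suppose a word can only be realised in a fixed set of possible forms: the bare word, the word plus "s" or "es", and the capitalised version of each.
Suppose I want to find the index of the first place in a sentence where any form of the word coach appears.
Is there a simpler way to do this than the following two methods?
Long if condition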
sentence = "this is a sentence with the Coaches"
target = "coach"
print(target.capitalize())
for j, i in enumerate(sentence.split(" ")):
    if i == target.capitalize() or i == target.capitalize()+"es" or \
       i == target.capitalize()+"s" or i == target+"es" or i==target+"s" or \
       i == target:
        print(j)
Iterative try-except
variations = [target, target+"es", target+"s", target.capitalize()+"es",
              target.capitalize()+"s", target.capitalize()]
for i in variations:
    try:
        # list.index raises ValueError when the form is not in the sentence
        j = sentence.split(" ").index(i)
        print(j)
    except ValueError:
        continue
Answer 0: (score: 2)
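I recommend having a look at NLTK's stem package: http://nltk.org/api/nltk.stem.html
With it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for eg. grammatical role, tense, derivational morphology leaving only the stem of the word."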
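As a rough illustration of the stemming approach, the sketch below compares the stem of each word with the stem of the target. It assumes NLTK's `PorterStemmer` is a suitable stemmer for the language; the function name `first_index_by_stem` and the small suffix-stripping fallback (used only when NLTK is not installed) are hypothetical and not part of NLTK.

```python
try:
    # Third-party dependency: pip install nltk
    from nltk.stem import PorterStemmer
    _stem = PorterStemmer().stem
except ImportError:
    # Hypothetical fallback so the sketch still runs without NLTK:
    # strip a trailing "es" or "s" and nothing else.
    def _stem(word):
        for suffix in ("es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix):
                return word[:-len(suffix)]
        return word


def first_index_by_stem(sentence, target):
    """Return the index of the first word sharing a stem with target, or None."""
    target_stem = _stem(target.lower())
    for index, word in enumerate(sentence.split()):
        if _stem(word.lower()) == target_stem:
            return index
    return None


print(first_index_by_stem("this is a sentence with the Coaches", "coach"))
```

Comparing stems instead of enumerating forms means the list of suffixes lives in the stemmer, not in the calling code.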
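If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't want to bother with NLTK, you should still write your code as a collection of small, easily composable utility functions, for example: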
import string

def variation(stem, word):
    # True if word is the stem or one of its plural forms, ignoring case.
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    # Lazily yield (index, word) for every form of stem in the sentence.
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    # Return the first (index, word) match, or None if there is none.
    for i, w in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."
print(firstVariation(sentence, 'coach'))
# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i,w in variations(sentence, 'coach')]))
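The helpers above hard-code the "s" and "es" suffixes. A minimal sketch of how the same composition could take the suffixes as a parameter, so other inflection sets can be plugged in, might look like this. The names `SUFFIXES`, `strip_punctuation`, `is_variation` and `first_variation` are illustrative and not part of the answer's code.

```python
import string

# Assumed inflection suffixes; the empty string stands for the bare stem.
SUFFIXES = ("", "s", "es")


def strip_punctuation(text):
    """Drop ASCII punctuation so 'coach,' compares equal to 'coach'."""
    return "".join(ch for ch in text if ch not in string.punctuation)


def is_variation(stem, word, suffixes=SUFFIXES):
    """True if word, lower-cased, is stem plus one of the suffixes."""
    return word.lower() in {stem + suffix for suffix in suffixes}


def first_variation(sentence, stem, suffixes=SUFFIXES):
    """Return the first (index, word) match, or None when nothing matches."""
    words = strip_punctuation(sentence).split()
    matches = ((i, w) for i, w in enumerate(words)
               if is_variation(stem, w, suffixes))
    # next() with a default makes the "no match" case explicit.
    return next(matches, None)


print(first_variation("First coach, here another two coaches. Coaches are nice.", "coach"))
```

Passing a different `suffixes` tuple is enough to cover another fixed set of forms without touching the search logic.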
Answer 1: (score: 1)
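Morphology is typically a finite-state phenomenon, so regular expressions are the perfect tool to handle it. Build an RE that matches all the cases with a function like: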
import re

def inflect(stem):
    """Returns an RE that matches stem and all its inflected forms."""
    # The trailing "?" makes the suffix optional, so the bare stem matches too.
    pat = "^[%s%s]%s(?:e?s)?$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)
Usage:
>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]
If the inflection rules get more complex than this, consider using Python's verbose REs.
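As a sketch of what a verbose RE could look like for the same "s"/"es" rule, the pattern below is compiled with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. The function name `inflect_verbose` is illustrative, and the suffix set is an assumption carried over from the question.

```python
import re


def inflect_verbose(stem):
    """Return a commented RE matching stem and its "s"/"es" forms."""
    # Escape each piece so any special characters in the stem stay literal.
    first = "[%s%s]" % (re.escape(stem[0].lower()), re.escape(stem[0].upper()))
    rest = re.escape(stem[1:])
    pattern = r"""
        ^
        %s          # first letter, lower or upper case
        %s          # remainder of the stem
        (?:         # optional plural suffix:
            e?s     #   "s" or "es"
        )?
        $
    """ % (first, rest)
    return re.compile(pattern, re.VERBOSE)


sentence = "this is a sentence with the Coaches"
target = inflect_verbose("coach")
print([(i, w) for i, w in enumerate(sentence.split()) if target.match(w)])
```

Each alternative in a more complex rule set can then sit on its own line with a comment next to it.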