Question

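(This question is about string checking in general rather than natural language processing per se. If you do view it as an NLP problem, imagine that the language is one that current analyzers cannot handle. For simplicity, I'll use English strings as examples.)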

Suppose a word can be realized in only six possible forms:

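the initial letter capitalized
its plural form with an "s"
its plural form with an "es"
capitalized + "es"
capitalized + "s"
the basic form without plural or capitalization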

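Suppose I want to find the index of the first instance where any form of the word coach occurs in a sentence. Is there a simpler way than these two methods?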

Long if condition

sentence = "this is a sentence with the Coaches"
target = "coach"

print(target.capitalize())

# Compare every word against all six forms explicitly.
for j, i in enumerate(sentence.split(" ")):
  if i == target.capitalize() or i == target.capitalize()+"es" or \
     i == target.capitalize()+"s" or i == target+"es" or i==target+"s" or \
     i == target:
    print(j)

Iterating with try-except

variations = [target, target+"es", target+"s", target.capitalize()+"es",
              target.capitalize()+"s", target.capitalize()]

for i in variations:
  try:
    # list.index raises ValueError when the form is not in the sentence
    j = sentence.split(" ").index(i)
    print(j)
  except ValueError:
    continue

Answer 1

I recommend taking a look at NLTK's stem package: http://nltk.org/api/nltk.stem.html

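Using it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for eg. grammatical role, tense, derivational morphology leaving only the stem of the word."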

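If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't want to bother with NLTK, you should still write your code as a collection of small, easy-to-combine utility functions, for example: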

import string 

def variation(stem, word):
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    for i, w  in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."

print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i,w in variations(sentence, 'coach')]))

Answer 2

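Morphology is typically a finite-state phenomenon, so regular expressions are the perfect tool to handle it. Build an RE that matches all of the cases with a function like this: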

import re

def inflect(stem):
    """Returns an RE that matches the stem and all its inflected forms."""
    # The plural suffix ("s" or "es") is optional so the bare stem matches too.
    pat = "^[%s%s]%s(?:e?s)?$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)

Usage:

>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]

If the inflection rules get more complicated than this, consider using Python's verbose REs (the re.VERBOSE flag).
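As a rough illustration of the verbose-RE suggestion, here is a minimal sketch that spells out the same six forms with re.VERBOSE and returns only the first hit, which is what the question asks for. The names inflect_verbose and first_match are hypothetical and not part of either answer; stripping punctuation from each word is an added assumption borrowed from the idea in Answer 1.

```python
import re
import string

def inflect_verbose(stem):
    """Return a compiled verbose RE matching the six forms of `stem`."""
    pattern = r"""
        ^
        [%s%s]        # first letter, lowercase or uppercase
        %s            # rest of the stem, escaped
        (?: e? s )?   # optional plural suffix: "s" or "es"
        $
    """ % (re.escape(stem[0].lower()),
           re.escape(stem[0].upper()),
           re.escape(stem[1:]))
    return re.compile(pattern, re.VERBOSE)

def first_match(sentence, stem):
    """Return (index, word) for the first form of `stem`, or None."""
    pattern = inflect_verbose(stem)
    # Assumption: punctuation attached to a word should be ignored.
    words = (w.strip(string.punctuation) for w in sentence.split())
    return next(((i, w) for i, w in enumerate(words) if pattern.match(w)),
                None)

print(first_match("this is a sentence with the Coaches", "coach"))
```

Because next() stops at the first match, the rest of the sentence is never scanned, and the None default avoids the try-except from the question.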

Finding different realizations of a word in a sentence string - Python
