Finding different realizations of a word in a sentence string - Python

Asked: 2012-11-05 17:48:30

Tags: python string nlp stemming

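(This question is about string checking in general rather than natural language processing per se. If you do view it as an NLP problem, imagine that the language is one that current analyzers cannot handle; for simplicity, I will use English strings as examples.)

Suppose a word can be realized in only the following forms:
  1. the first letter capitalized
  2. the plural form with “s”
  3. the plural form with “es”
  4. capitalized + “es”
  5. capitalized + “s”
  6. the basic form, without plural or capitalization

Suppose I want to find the index of the first instance of any form of the word coach in a sentence. Is there a simpler way of doing it than these two methods?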

Long if condition:

    sentence = "this is a sentence with the Coaches"
    target = "coach"

    print(target.capitalize())

    # Compare every token against each allowed form of the target.
    for j, i in enumerate(sentence.split(" ")):
        if i == target.capitalize() or i == target.capitalize()+"es" or \
           i == target.capitalize()+"s" or i == target+"es" or i == target+"s" or \
           i == target:
            print(j)
    

Iterative try-except:

    variations = [target, target+"es", target+"s", target.capitalize()+"es",
                  target.capitalize()+"s", target.capitalize()]

    for i in variations:
        try:
            # list.index raises ValueError when this form is not in the sentence.
            j = sentence.split(" ").index(i)
            print(j)
        except ValueError:
            continue
    

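2 Answers:

Answer 0 (score: 2)

I suggest having a look at NLTK's stem package: http://nltk.org/api/nltk.stem.html

Using it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for e.g. grammatical role, tense, derivational morphology leaving only the stem of the word."

If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and do not want to bother with NLTK, you should still write your code as a collection of small, easily composable utility functions, for example: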

import string

def variation(stem, word):
    """Return True if word is stem, stem + 's' or stem + 'es', ignoring case."""
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    """Return a generator of (index, word) pairs for each form of stem."""
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    """Remove all ASCII punctuation characters from sentence."""
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    """Return the first (index, word) match, or None if there is none."""
    for i, w in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."

print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i, w in variations(sentence, 'coach')]))
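
The same composable style can be taken one step further by making the list of suffixes a parameter instead of hard-coding it. The following is a minimal sketch and not part of the original answer: the names `make_forms`, `find_forms` and the `suffixes` parameter are hypothetical, and it assumes that lowercasing a token is an acceptable way to ignore capitalization (so all-caps tokens would match too).

```python
import string

def make_forms(stem, suffixes=("", "s", "es")):
    """Return the set of lowercase forms allowed for stem (hypothetical helper)."""
    return {stem + suffix for suffix in suffixes}

def find_forms(sentence, stem, suffixes=("", "s", "es")):
    """Yield (index, word) for every token that is an allowed form of stem."""
    forms = make_forms(stem, suffixes)
    for i, token in enumerate(sentence.split()):
        # Strip punctuation from both ends of the token only.
        word = token.strip(string.punctuation)
        if word.lower() in forms:
            yield i, word

sentence = "First coach, here another two coaches. Coaches are nice."

# next() stops at the first match; None is returned if nothing matches.
first = next(find_forms(sentence, "coach"), None)
print(first)
print(list(find_forms(sentence, "coach")))
```

Because `find_forms` is a generator, asking only for the first match does not scan the rest of the sentence.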

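Answer 1 (score: 1)

Morphology is typically a finite-state phenomenon, so regular expressions are the perfect tool to handle it. Build an RE that matches all of the cases with a function like: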

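import re

def inflect(stem):
    """Returns an RE that matches the stem and all of its inflected forms."""
    # The trailing "?" makes the plural suffix optional, so the bare stem matches too.
    pat = "^[%s%s]%s(?:e?s)?$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)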

Usage:

>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]
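
Since the question asks only for the first occurrence, the list comprehension could be swapped for a generator passed to `next`, which stops at the first hit. This is a sketch under assumptions: `inflect_optional` and `first_form_index` are hypothetical names, and the pattern treats the plural suffix as optional so that the bare stem also counts as a form.

```python
import re

def inflect_optional(stem):
    """Compile an RE for stem, optionally capitalized, with optional 's' or 'es'."""
    lower = re.escape(stem[0])
    upper = re.escape(stem[0].upper())
    pat = r"^[%s%s]%s(?:e?s)?$" % (lower, upper, re.escape(stem[1:]))
    return re.compile(pat)

def first_form_index(sentence, stem):
    """Return the index of the first token that is a form of stem, or None."""
    pattern = inflect_optional(stem)
    hits = (i for i, w in enumerate(sentence.split()) if pattern.match(w))
    return next(hits, None)

print(first_form_index("this is a sentence with the Coaches", "coach"))
print(first_form_index("the coach left", "coach"))
print(first_form_index("nothing to see", "coach"))
```

Returning `None` for "not found" avoids the try-except around `list.index` used in the question.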

If the inflection rules get more complicated than this, consider using Python's verbose REs.
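
As an illustration of what a verbose RE looks like, the six forms from the question could be written with the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows inline comments. This is only a sketch for the fixed stem `coach`; the name `COACH_FORMS` is hypothetical.

```python
import re

# Verbose pattern for the forms of "coach" listed in the question.
COACH_FORMS = re.compile(r"""
    ^
    [cC]          # first letter, lowercase or capitalized
    oach          # rest of the stem
    (?: e? s )?   # optional plural suffix: s or es
    $
""", re.VERBOSE)

words = "this is a sentence with the Coaches".split()
print([(i, w) for i, w in enumerate(words) if COACH_FORMS.match(w)])
```

Each rule sits on its own commented line, so further suffixes or alternations can be added without the pattern becoming unreadable.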