Why does "cowardly" become "cowardli" after stemming?

Asked: 2014-04-27 02:38:59

Tags: nlp nltk stemming

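I noticed that after applying the Porter stemmer (from the NLTK library) I get some strange stems, such as "cowardli" and "contrari". To me they do not look like stems at all.

Is this expected? Or could I have made a mistake somewhere?

Here is my code: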

import nltk

# `string` holds the raw input text
string = string.lower()
tokenized = nltk.tokenize.regexp_tokenize(string, "[a-z]+")
# Build the stopword set once instead of reloading the list for every token
stopwords = set(nltk.corpus.stopwords.words("english"))
filtered = [w for w in tokenized if w not in stopwords]

stemmer = nltk.stem.porter.PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]

Here is the text I am processing: http://pastebin.com/XUMNCYAU (the beginning of Dostoevsky's "Crime and Punishment").

1 answer:

Answer 0 (score: 2)

First, let's look at the different stemmers and lemmatizers that NLTK provides:

>>> from nltk import stem
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> porter = stem.porter.PorterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> wnl = stem.wordnet.WordNetLemmatizer()
>>> word = "cowardly"
>>> lancaster.stem(word)
'coward'
>>> porter.stem(word)
u'cowardli'
>>> snowball.stem(word)
u'coward'
>>> wnl.stem(word)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'WordNetLemmatizer' object has no attribute 'stem'
>>> wnl.lemmatize(word)
'cowardly'

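Note: WordNetLemmatizer is not a stemmer, so it outputs the lemma of cowardly, which in this case is the same word.

The Porter stemmer seems to be the only one that changes cowardly -> cowardli. Let's look at the code to see why this happens; see http://www.nltk.org/_modules/nltk/stem/porter.html#PorterStemmer

This appears to be the part that performs ly -> li: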
def _step1c(self, word):
    """step1c() turns terminal y to i when there is another vowel in the stem.
    --NEW--: This has been modified from the original Porter algorithm so that y->i
    is only done when y is preceded by a consonant, but not if the stem
    is only a single consonant, i.e.

       (*c and not c) Y -> I

    So 'happy' -> 'happi', but
      'enjoy' -> 'enjoy'  etc

    This is a much better rule. Formerly 'enjoy'->'enjoi' and 'enjoyment'->
    'enjoy'. Step 1c is perhaps done too soon; but with this modification that
    no longer really matters.

    Also, the removal of the vowelinstem(z) condition means that 'spy', 'fly',
    'try' ... stem to 'spi', 'fli', 'tri' and conflate with 'spied', 'tried',
    'flies' ...
    """
    if word[-1] == 'y' and len(word) > 2 and self._cons(word, len(word) - 2):
        return word[:-1] + 'i'
    else:
        return word
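
To see the rule in isolation, here is a minimal standalone sketch of the same check. It is not NLTK's code: the names `is_consonant` and `step1c_sketch` are made up for illustration, and `is_consonant` only approximates what the `_cons` helper is assumed to do (a, e, i, o, u are vowels, and y counts as a consonant at the start of a word or after a vowel).

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Hypothetical stand-in for the stemmer's _cons helper."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or right after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def step1c_sketch(word):
    """Turn a final 'y' into 'i' when the letter before it is a consonant."""
    if len(word) > 2 and word[-1] == "y" and is_consonant(word, len(word) - 2):
        return word[:-1] + "i"
    return word


for w in ["cowardly", "contrary", "happy", "enjoy"]:
    print(w, "->", step1c_sketch(w))
```

In "cowardly" and "contrary" the final y follows a consonant (l and r), so it becomes i. In "enjoy" it follows the vowel o, so the word is left alone. As far as I can tell, the later Porter steps only strip specific "-li" endings (for example "entli" or "ousli"), and "cowardli" matches none of them, which would explain why the i survives to the final stem.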