Question

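Although I know I could use something like NLTK to do this for me, I'd like to understand how to slice several stems off the words in a list efficiently.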

Say my list of words is

list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]

and the common stems I want to remove might be

stems = ["s", "es", "ed", "est", "ing", "ly"]  # etc.

The words I don't want stemmed are specified as:

noStem = ["walrus", "rest", "wing", "feed"]

I've worked out how to handle one particular stem, such as "s". For example, my code would be:

stemmedList = []
for eachWord in list:
    if eachWord not in noStem:
        if eachWord[-1] == "s":
            eachWord = eachWord[:-1]
    # append inside the loop so every word ends up in the result
    stemmedList = stemmedList + [eachWord]

I'm not sure how to apply this to all of my stems in a more efficient way.

Thanks for your help and advice!

Answer 1

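I'd suggest converting noStem to a set so that the check "if eachWord not in noStem" is fast. Then you can check whether the word ends with any of the stems. If it does, take the longest stem that matches and remove it from the word: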

if eachWord not in noStem
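
A minimal sketch of that idea, assuming the word list, stems and noStem from the question. The helper name strip_longest_stem is made up for illustration, and this is one way to do it rather than the only one:

```python
word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest",
             "normal", "hopping", "classes", "wing", "feed"]
stems = ["s", "es", "ed", "est", "ing", "ly"]
# A set gives average constant-time membership tests, unlike a list.
noStem = {"walrus", "rest", "wing", "feed"}


def strip_longest_stem(word):
    """Hypothetical helper: remove the longest matching ending, if any."""
    if word in noStem:
        return word
    matches = [stem for stem in stems if word.endswith(stem)]
    if not matches:
        return word
    longest = max(matches, key=len)
    return word[:-len(longest)]


stemmed_list = [strip_longest_stem(word) for word in word_list]
print(stemmed_list)
```

Picking the longest match means "classes" loses "es" rather than just "s", whatever order the stems are listed in.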

Answer 2

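I don't think that's a bad start. You just need to add a second loop to handle multiple endings. You could try something like the following (you'll notice I've renamed the variable list, because shadowing a built-in name is dangerous):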

stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word.endswith(ending):
                word = word[:-len(ending)]
                break   # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)

Or, going by your comment, if you don't want to use endswith:

stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word[-len(ending):] == ending:
                word = word[:-len(ending)]
                break   # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)
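
One thing to watch with either loop: because of the break, the first ending that matches wins. With the stems in the order given in the question ("s" before "es"), "classes" becomes "classe". A small sketch of the difference, assuming you want longer endings to take priority; strip_first_match is a made-up helper that mirrors the inner loop:

```python
def strip_first_match(word, endings):
    """Hypothetical helper: strip the first ending in `endings` that matches."""
    for ending in endings:
        if word.endswith(ending):
            return word[:-len(ending)]
    return word


stems = ["s", "es", "ed", "est", "ing", "ly"]
# Sort so that longer endings are tried before shorter ones.
longest_first = sorted(stems, key=len, reverse=True)

print(strip_first_match("classes", stems))          # classe
print(strip_first_match("classes", longest_first))  # class
```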

Answer 3

This is a good deal more complicated, but here is some starter code that uses the faster pandas module. Here goes:

import pandas as pd
import re

word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]

stems = ["es",  "ed", "est", "ing", "ly", "s"]

# a set for quick lookup 
noStem = set(["walrus", "rest", "wing", "feed"])

# build series
words = pd.Series(word_list)

# filter out words in noStem
words = words[words.apply(lambda x: x not in noStem)]

# compile the regular expression once for performance; join all stems into one alternation
term_matching = '|'.join(stems)
expr = re.compile(r'(.+?)({})$'.format(term_matching))

df = words.str.extract(expr, expand=True)
df.dropna(how='any', inplace=True)
df.columns = ['words', 'stems']

stemmed_list = df.words.tolist()
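
Note that the snippet above returns only the words that actually had an ending removed: the noStem words are filtered out first, and dropna discards the words that did not match. The same regular expression also works without pandas. A minimal sketch using only the standard re module, assuming you would rather keep every word in the result; the function name stem is made up for illustration:

```python
import re

word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest",
             "normal", "hopping", "classes", "wing", "feed"]
stems = ["es", "ed", "est", "ing", "ly", "s"]
noStem = {"walrus", "rest", "wing", "feed"}

# re.escape guards against endings that contain regex metacharacters.
# The lazy (.+?) takes the shortest prefix, so the longest ending is removed.
pattern = re.compile(r"(.+?)(?:{})$".format("|".join(map(re.escape, stems))))


def stem(word):
    """Hypothetical helper: return the word with a matching ending removed."""
    if word in noStem:
        return word
    match = pattern.match(word)
    return match.group(1) if match else word


stemmed_list = [stem(word) for word in word_list]
print(stemmed_list)
```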

I hope this helps...

How to use slicing to remove several different stems from the ends of words