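How to use slicing to remove several different suffixes from the end of words

Time: 2018-11-21 19:32:28

Tags: python

Although I know I could use a tool like NLTK to do this for me, I would like to understand how to efficiently slice several different suffixes off the words in a list.

Say my list of words is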

list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]

The common suffixes I want to remove might be

stems = ["s", "es", "ed", "est", "ing", "ly"]  # etc.

The words I don't want stemmed are:

noStem = ["walrus", "rest", "wing", "feed"]

I have worked out how to do this for one specific suffix, such as "s". For example, my code would be:

stemmedList = []
for eachWord in list:
    if eachWord not in noStem:
        if eachWord[-1] == "s":
            eachWord = eachWord[:-1]
    stemmedList = stemmedList + [eachWord]

I'm not sure how to apply this to all of my suffixes in a more efficient way.

Thanks for your help and advice!

3 answers:

Answer 0 (score: 0)

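I suggest converting noStem to a set so that the membership check (if eachWord not in noStem) is fast. Then you can check whether the word ends with any of the suffixes in stems. If it does, take the longest suffix that matches and remove it from the word.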
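
A minimal sketch of that idea, assuming the word list and suffixes from the question. The helper name strip_longest_suffix and the variable names word_list and no_stem are hypothetical and do not come from the original answer:

```python
# Sketch only: the data mirrors the question; strip_longest_suffix is a
# hypothetical helper introduced for illustration.
word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest",
             "normal", "hopping", "classes", "wing", "feed"]
stems = ["s", "es", "ed", "est", "ing", "ly"]
no_stem = {"walrus", "rest", "wing", "feed"}  # set: fast membership test


def strip_longest_suffix(word, suffixes):
    """Return word with its longest matching suffix sliced off, if any."""
    matches = [s for s in suffixes if word.endswith(s)]
    if not matches:
        return word
    longest = max(matches, key=len)
    return word[:-len(longest)]


stemmed_list = [w if w in no_stem else strip_longest_suffix(w, stems)
                for w in word_list]
print(stemmed_list)
# ['another', 'cat', 'walrus', 'relax', 'annoying', 'rest', 'normal',
#  'hopp', 'class', 'wing', 'feed']
```

Picking the longest match means "classes" loses "es" rather than just "s", whatever order the suffixes are listed in.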

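Answer 1 (score: 0)

I think this is not a bad start. You just need to add a second loop to handle multiple endings. You could try something like the following (you will notice that I have renamed the variable list, because shadowing a built-in name with a variable is dangerous):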

stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word.endswith(ending):
                word = word[:-len(ending)]
                break   # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)

Or, going by your comment that you don't want to use endswith:

stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word[-len(ending):] == ending:
                word = word[:-len(ending)]
                break   # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)
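
One thing to watch in both loops: break stops at the first ending that matches, so the order of stems matters. With the order given in the question, "s" comes before "es", so "classes" would become "classe". A hedged sketch of one way around this is to try the longest endings first. The function name stem_words and the names no_stem and by_length are hypothetical, introduced only for this illustration:

```python
def stem_words(words, endings, skip):
    """Strip the first matching ending from each word that is not in skip."""
    result = []
    for word in words:
        if word not in skip:
            for ending in endings:
                if word[-len(ending):] == ending:
                    word = word[:-len(ending)]
                    break  # first match wins, so the order of endings matters
        result.append(word)
    return result


stems = ["s", "es", "ed", "est", "ing", "ly"]   # order from the question
no_stem = {"walrus", "rest", "wing", "feed"}

# Question's order: "s" matches first, so only one letter is removed.
print(stem_words(["classes"], stems, no_stem))       # ['classe']

# Longest endings first, so "es" is tried before "s".
by_length = sorted(stems, key=len, reverse=True)
print(stem_words(["classes"], by_length, no_stem))   # ['class']
```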

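Answer 2 (score: 0)

Real stemming is a lot more complicated than this, but here is some starter code that uses the pandas module, which should be faster. Here it is: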

import pandas as pd
import re

word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]

stems = ["es",  "ed", "est", "ing", "ly", "s"]

# a set for quick lookup 
noStem = set(["walrus", "rest", "wing", "feed"])

# build series
words = pd.Series(word_list)

# filter out words in noStem
words = words[words.apply(lambda x: x not in noStem)]

# compile the regular expression once for performance; join all stems into one alternation
term_matching = '|'.join(stems)
expr = re.compile(r'(.+?)({})$'.format(term_matching))

df = words.str.extract(expr, expand=True)
df.dropna(how='any', inplace=True)
df.columns = ['words', 'stems']

stemmed_list = df.words.tolist()
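
Note that, because of the noStem filter and dropna, the snippet above returns only the words that actually lost an ending; words in noStem and words with no matching ending are dropped. If the goal is to keep every word in its original position, a plain re sketch along these lines might do. It is an alternative under stated assumptions, not the answerer's method, and strip_suffix and no_stem are hypothetical names:

```python
import re

word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest",
             "normal", "hopping", "classes", "wing", "feed"]
stems = ["es", "ed", "est", "ing", "ly", "s"]
no_stem = {"walrus", "rest", "wing", "feed"}

# The lazy (.+?) takes the shortest possible prefix (at least one character),
# so the longest matching ending is removed regardless of the order in stems.
alternation = "|".join(re.escape(s) for s in stems)
pattern = re.compile(r"^(.+?)(?:{})$".format(alternation))


def strip_suffix(word):
    """Return the part before a matching ending, or the word unchanged."""
    if word in no_stem:
        return word
    match = pattern.match(word)
    return match.group(1) if match else word


stemmed_list = [strip_suffix(w) for w in word_list]
print(stemmed_list)
# ['another', 'cat', 'walrus', 'relax', 'annoying', 'rest', 'normal',
#  'hopp', 'class', 'wing', 'feed']
```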

I hope this helps...