Although I know I could use a tool like NLTK to do this for me, I would like to understand how to strip multiple stems from the words in a list efficiently.
Say my word list is:
list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
and the common stems I want to remove might be:
stems = ["s", "es", "ed", "est", "ing", "ly"] etc
The words I do not want stemmed are specified as:
noStem = ["walrus", "rest", "wing", "feed"]
I have worked out how to target one specific stem, e.g. "s". For example, my code would be:
stemmedList = []
for eachWord in list:
    if eachWord not in noStem:
        if eachWord[-1] == "s":
            eachWord = eachWord[:-1]
    stemmedList = stemmedList + [eachWord]
I am not sure how to apply this to all of my stems in a more efficient way.
Thanks for your help and advice!
Answer 0 (score: 0)
I suggest you convert noStem to a set, so that the check "if eachWord not in noStem" is fast. Then check whether the word ends with one of the stems; if it does, take the longest matching stem and remove it from the word.
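The answer's code snippet appears to have been lost from the page; a minimal sketch of the approach it describes (a set for noStem, stripping the longest matching stem), using only the question's own lists, might look like this:

```python
words = ["another", "cats", "walrus", "relaxed", "annoyingly",
         "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["s", "es", "ed", "est", "ing", "ly"]
noStem = {"walrus", "rest", "wing", "feed"}  # a set makes membership checks fast

stemmedList = []
for word in words:
    if word not in noStem:
        # collect every stem the word ends with, then keep the longest one
        matches = [s for s in stems if word.endswith(s)]
        if matches:
            longest = max(matches, key=len)
            word = word[:-len(longest)]
    stemmedList.append(word)
```

Taking the longest match means "classes" loses "es" (giving "class") rather than just "s".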
Answer 1 (score: 0)
I don't think that's a bad start. You just need to add a second loop to handle multiple endings. You could try something like the following (you'll notice I renamed the variable "list", since shadowing a built-in name is dangerous):
stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word.endswith(ending):
                word = word[:-len(ending)]
                break  # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)
Or, based on your comment, if you do not want to use endswith:
stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word[-len(ending):] == ending:
                word = word[:-len(ending)]
                break  # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)
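For reference, here is the first variant run end to end on the question's data (a sketch, using nothing beyond the question's own lists):

```python
word_list = ["another", "cats", "walrus", "relaxed", "annoyingly",
             "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["s", "es", "ed", "est", "ing", "ly"]
noStem = {"walrus", "rest", "wing", "feed"}

stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word.endswith(ending):
                word = word[:-len(ending)]
                break  # stop at the first matching ending
    stemmed_list.append(word)
```

Note that because the loop stops at the first match, "classes" becomes "classe" (the "s" ending is tried before "es"); ordering the stems longest-first avoids that.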
Answer 2 (score: 0)
This is rather more involved than it needs to be, but here is some starter code using the faster pandas module. Here goes:
import pandas as pd
import re
word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["es", "ed", "est", "ing", "ly", "s"]
# a set for quick lookup
noStem = set(["walrus", "rest", "wing", "feed"])
# build series
words = pd.Series(word_list)
# filter out words in noStem
words = words[words.apply(lambda x: x not in noStem)]
# compile regular expression - for performance, join all stems into one alternation
term_matching = '|'.join(stems)
expr = re.compile(r'(.+?)({})$'.format(term_matching))
df = words.str.extract(expr, expand=True)
df.dropna(how='any', inplace=True)
df.columns = ['words', 'stems']
stemmed_list = df.words.tolist()
I hope this helps...
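If pandas is not available, the same regex idea works with the standard library alone. A sketch using only re (note that, like the dropna() step above, it discards words that match no stem, such as "another" and "normal"):

```python
import re

word_list = ["another", "cats", "walrus", "relaxed", "annoyingly",
             "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["es", "ed", "est", "ing", "ly", "s"]
noStem = {"walrus", "rest", "wing", "feed"}

# one alternation over all stems, anchored at the end of the word
expr = re.compile(r'(.+?)({})$'.format('|'.join(stems)))

stemmed_list = []
for word in word_list:
    if word in noStem:
        continue
    m = expr.match(word)
    if m:  # words with no matching stem are dropped, mirroring dropna()
        stemmed_list.append(m.group(1))
```

Because the first group is non-greedy, the regex strips the longest suffix that is a single stem, so "classes" yields "class".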