Question

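I have a list of strings, and I want to remove the stopwords from each string. The problem is that the stopword list is much longer than the strings, and I don't want to repeatedly compare each string against the stopword list. Is there a way in Python to process these multiple strings at once?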

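lis = ['aka', 'this is a good day', 'a pretty dog']
stopwords = []  # pretty long list of words
result = []
for phrase in lis:
    words = phrase.split(' ')  # get list of words
    kept = []
    for word in words:
        if word not in stopwords:  # lists have no .contain(); use `in`
            kept.append(word)
    result.append(' '.join(kept))  # rebuild the phrase without stopwords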

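This is my current approach. But it means I have to go through every phrase in the list. Is there a way to handle these phrases with only one comparison?

Thanks.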

Answer 1

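This is the same idea, but with a few improvements. Convert the list of stopwords to a set for faster lookups. Then iterate over your list of phrases in a list comprehension: split each phrase into words, keep the words that are not in the stop set, and join the remaining words back into a phrase.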

>>> lis = ['aka', 'this is a good day', 'a pretty dog']
>>> stopwords = ['a', 'dog']
>>> stop = set(stopwords)
>>> [' '.join(j for j in i.split(' ') if j not in stop) for i in lis]
['aka', 'this is good day', 'pretty']
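
The same idea can be wrapped in a reusable function. This is only a minimal sketch: the name `remove_stopwords` is hypothetical and not part of the original answer, and it uses `split()` with no argument, which (unlike `split(' ')` above) also collapses runs of whitespace.

```python
def remove_stopwords(phrases, stopwords):
    """Return a new list of phrases with every stopword removed."""
    # Build the set once; membership tests on a set are O(1) on average,
    # whereas `word in some_list` scans the whole list each time.
    stop = set(stopwords)
    return [
        " ".join(word for word in phrase.split() if word not in stop)
        for phrase in phrases
    ]


lis = ['aka', 'this is a good day', 'a pretty dog']
print(remove_stopwords(lis, ['a', 'dog']))
# ['aka', 'this is good day', 'pretty']
```

Every phrase still has to be visited once, but each word now costs a single set lookup instead of a scan of the long stopword list.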

Answer 2

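You can compute the difference between the set of words in each phrase and the set of stopwords. Note that a set difference does not preserve word order and drops repeated words.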

>>> lis = ['aka', 'this is a good day', 'a pretty dog']
>>> stopwords = ['a', 'dog']

>>> stop = set(stopwords)
>>> result = list(map(lambda phrase: " ".join(set(phrase.split(' ')) - stop), lis))
>>> print(result)  # sets are unordered, so the word order within each phrase may vary
['aka', 'this is good day', 'pretty']
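
To see how a set difference differs from filtering word by word, here is a small sketch. The sample phrase is made up for illustration and is not from the original thread.

```python
stop = {'a', 'dog'}
phrase = 'a dog is a dog and a cat is a cat'

# Filtering keeps the original order and any repeated words.
filtered = [w for w in phrase.split(' ') if w not in stop]
print(' '.join(filtered))  # is and cat is cat

# A set difference keeps each remaining word once, in arbitrary order,
# so it is sorted here only to get a stable printout.
difference = set(phrase.split(' ')) - stop
print(sorted(difference))  # ['and', 'cat', 'is']
```

If the cleaned phrases need to read naturally, the filtering approach is the safer choice. The set difference fits when only the collection of remaining words matters.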
