Spam filtering: removing stopwords

Date: 2019-03-27 21:50:21

Tags: python

I have created two lists: l1 is my main list and l2 is a list containing some stopwords. I want to remove the stopwords in l2 from the second element of each nested list in l1. However, the code does not seem to work well: it removes only one stopword and the rest remain in l1. This is what l1 looks like:

[['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection.....'], ...]

This is what l2 looks like:

['a', ' able', ' about', ' across', ' after', ' all', ' almost', ' also', ' am', ' among', ' an', ' and', ' any',....]

This is what I tried:

for i in l1:
   i[1] = i[1].lower()
   i[1] = i[1].split()
   for j in i[1]:
      if j in l2:
         i[1].remove(j)

3 Answers:

Answer 0: (score: 3)

If you don't want to reinvent the wheel, you can use nltk to tokenize the text and remove the stopwords:

import nltk
# Requires the NLTK data packages: nltk.download('punkt') and nltk.download('stopwords')
data = [['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection']]

for text in (label_text[1] for label_text in data): 
    filtered_tokens = [token for token in nltk.word_tokenize(text) if token.lower() not in nltk.corpus.stopwords.words('english')]
    print(filtered_tokens)

The output should be:

>>> [',', 'princess', '?', ':', ')']
>>> ['Urgent', '!', 'Please', 'call', '09061213237', 'landline', '.', '£5000', 'cash', 'luxury', '4*', 'Canary', 'Islands', 'Holiday', 'await', 'collection']

If you still want to use your own stopword list, the following should do the trick for you:

import nltk

data = [['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection']]
stopwords = ['a', 'able', 'about', 'across', 'after', 'all', 'almost', 'also', 'am', 'among', 'an', 'and', 'any' ]

for text in (label_text[1] for label_text in data): 
    filtered_tokens = [token for token in nltk.word_tokenize(text) if token.lower() not in stopwords]
    print(filtered_tokens)

>>> ['how', 'you', 'will', 'do', 'that', ',', 'princess', '?', ':', ')']
>>> ['Urgent', '!', 'Please', 'call', '09061213237', 'from', 'landline', '.', '£5000', 'cash', 'or', 'luxury', '4*', 'Canary', 'Islands', 'Holiday', 'await', 'collection']

Answer 1: (score: 0)

You should probably turn l2 into a regex and use re.sub on each string in l1. Like this:

import re

l1 = [['ham', 'And how you will do that, princess? :)'],
      ['spam', 'Urgent! Please call 09061213237 from landline. \xc2\xa35000 cash or a luxury 4* Canary Islands Holiday await collection.....']]
l2 = ['a', ' able', ' about', ' across', ' after', ' all', ' almost', ' also', ' am', ' among', ' an', ' and', ' any']

# Build one case-insensitive pattern that matches any stopword as a whole word
stop_re = re.compile(
    r'(\s+|\b)({})\b'.format(r'|'.join(word.strip() for word in l2)),
    re.IGNORECASE)

cleaned = [[stop_re.sub('', part).strip() for part in sublist] for sublist in l1]

# cleaned ==>
# [['ham', 'how you will do that, princess? :)'],
#  ['spam',
#   'Urgent! Please call 09061213237 from landline. \xc2\xa35000 cash or luxury 4* Canary Islands Holiday await collection.....']]

Answer 2: (score: 0)

One of the problems here is that you check if j in l2 for every word in l1 while l2 is a list, which makes each lookup O(n) and therefore slow. Since you only care about membership in l2, you can convert it to a set, where looking up an item takes O(1). It also looks like the words in l2 have leading spaces, which makes them harder to match.

There is also a bug (a common one when removing items from a list while iterating over it): when you remove an item while iterating forward, the list shifts and you skip the next item. This is easily fixed by iterating over the list you are removing from in reverse.
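
As a quick illustration of that skipping behaviour (a minimal sketch with made-up words, not part of the fix itself):

# Removing while iterating forward skips the element that slides into the freed slot
words = ['and', 'and', 'cash']
for w in words:
    if w == 'and':
        words.remove(w)
print(words)  # ['and', 'cash'] -- the second 'and' was never visited

The fixed version of your loop looks like this: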

# Strip the spaces in l2 by using strip() on each element, and convert it to a set
l2 = set(map(lambda x: x.strip(), l2))

for i in l1:
    i[1] = i[1].lower()
    i[1] = i[1].split()
    # Reverse so it won't skip words on iteration
    for j in reversed(i[1]):
        if j in l2:
            i[1].remove(j)
    # Put back the strings again
    i[1] = ' '.join(i[1])

The previous solutions have time complexity O(m*n), where m is the total number of words being checked and n is the number of stopwords. This solution should be only O(m).
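
To make the difference concrete, here is a rough timing sketch (the sizes and words are hypothetical, just for illustration) comparing list membership with set membership:

import timeit

stopword_list = ['word{}'.format(i) for i in range(500)]  # made-up stopwords
stopword_set = set(stopword_list)

# Membership in a list scans elements one by one: O(n) per check
print(timeit.timeit("'word499' in stopword_list", globals=globals(), number=10000))
# Membership in a set is an average O(1) hash lookup
print(timeit.timeit("'word499' in stopword_set", globals=globals(), number=10000))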