Spam filtering: removing stopwords

Date: 2019-03-27 21:50:21

Tags: python

I have created two lists: l1 is my main list and l2 is a list containing some stopwords. I want to remove the stopwords in l2 from the second element of each nested list in l1. However, the code does not seem to work well: it removes only one stopword and the rest remain in l1. This is what l1 looks like:

[['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection.....'], ...]

This is what l2 looks like:

['a', ' able', ' about', ' across', ' after', ' all', ' almost', ' also', ' am', ' among', ' an', ' and', ' any',....]

This is what I tried:

for i in l1:
   i[1] = i[1].lower()
   i[1] = i[1].split()
   for j in i[1]:
      if j in l2:
         i[1].remove(j)

3 Answers:

Answer 0: (score: 3)

If you don't want to reinvent the wheel, you can use nltk to tokenize the text and remove the stopwords:

import nltk
# Requires the NLTK data packages: nltk.download('punkt') and nltk.download('stopwords')
data = [['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection']]

for text in (label_text[1] for label_text in data): 
    filtered_tokens = [token for token in nltk.word_tokenize(text) if token.lower() not in nltk.corpus.stopwords.words('english')]
    print(filtered_tokens)

The output should be:

>>> [',', 'princess', '?', ':', ')']
>>> ['Urgent', '!', 'Please', 'call', '09061213237', 'landline', '.', '£5000', 'cash', 'luxury', '4*', 'Canary', 'Islands', 'Holiday', 'await', 'collection']

If you still want to use your own stopword list, the following should do the trick for you:

import nltk

data = [['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection']]
stopwords = ['a', 'able', 'about', 'across', 'after', 'all', 'almost', 'also', 'am', 'among', 'an', 'and', 'any' ]

for text in (label_text[1] for label_text in data): 
    filtered_tokens = [token for token in nltk.word_tokenize(text) if token.lower() not in stopwords]
    print(filtered_tokens)

>>> ['how', 'you', 'will', 'do', 'that', ',', 'princess', '?', ':', ')']
>>> ['Urgent', '!', 'Please', 'call', '09061213237', 'from', 'landline', '.', '£5000', 'cash', 'or', 'luxury', '4*', 'Canary', 'Islands', 'Holiday', 'await', 'collection']

Answer 1: (score: 0)

You should probably turn l2 into a regex and use re.sub on each string in l1. Like this:

import re

l1 = [['ham', 'And how you will do that, princess? :)'],
      ['spam', 'Urgent! Please call 09061213237 from landline. \xc2\xa35000 cash or a luxury 4* Canary Islands Holiday await collection.....']]
l2 = ['a', ' able', ' about', ' across', ' after', ' all', ' almost', ' also', ' am', ' among', ' an', ' and', ' any']

# Build one case-insensitive pattern that matches any stopword as a whole word
stop_re = re.compile(
    r'(\s+|\b)({})\b'.format(r'|'.join(word.strip() for word in l2)),
    re.IGNORECASE)

cleaned = [[stop_re.sub('', part).strip() for part in sublist] for sublist in l1]

# cleaned ==>
# [['ham', 'how you will do that, princess? :)'],
#  ['spam',
#   'Urgent! Please call 09061213237 from landline. \xc2\xa35000 cash or luxury 4* Canary Islands Holiday await collection.....']]

Answer 2: (score: 0)

One of the problems here is that you check if j in l2 for every word in l1 while l2 is a list, which makes each lookup O(n) and therefore slow. Since you only care about membership in l2, you can convert it to a set, where looking up an item takes O(1). It also looks like the words in l2 have leading spaces, which makes them harder to match.

There is also a bug (a common one when removing items from a list while iterating over it): when you remove an item while iterating forward, the list shifts and you skip the next item. This is easily fixed by iterating over the list you are removing from in reverse.
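
As a quick illustration of that skipping behaviour (a minimal sketch with made-up words, not part of the fix itself):

# Removing while iterating forward skips the element that slides into the freed slot
words = ['and', 'and', 'cash']
for w in words:
    if w == 'and':
        words.remove(w)
print(words)  # ['and', 'cash'] -- the second 'and' was never visited

The fixed version of your loop looks like this: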

# Strip the spaces in l2 by using strip() on each element, and convert it to a set
l2 = set(map(lambda x: x.strip(), l2))

for i in l1:
    i[1] = i[1].lower()
    i[1] = i[1].split()
    # Reverse so it won't skip words on iteration
    for j in reversed(i[1]):
        if j in l2:
            i[1].remove(j)
    # Put back the strings again
    i[1] = ' '.join(i[1])

The previous solutions have time complexity O(m*n), where m is the total number of words being checked and n is the number of stopwords. This solution should be only O(m).
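
To make the difference concrete, here is a rough timing sketch (the sizes and words are hypothetical, just for illustration) comparing list membership with set membership:

import timeit

stopword_list = ['word{}'.format(i) for i in range(500)]  # made-up stopwords
stopword_set = set(stopword_list)

# Membership in a list scans elements one by one: O(n) per check
print(timeit.timeit("'word499' in stopword_list", globals=globals(), number=10000))
# Membership in a set is an average O(1) hash lookup
print(timeit.timeit("'word499' in stopword_set", globals=globals(), number=10000))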