How to make nltk.NaiveBayesClassifier.train() work with my dictionary

Date: 2015-05-23 19:16:13

Tags: nltk spam-prevention naivebayes

I am currently making a simple spam/ham email filter using Naive Bayes.

To explain the logic of my algorithm: I have a folder containing lots of files, which are examples of spam/ham emails. In this folder I also have two other files, one containing the titles of all my ham examples and the other containing the titles of all my spam examples. I organized things this way so I can open and read the emails properly.

I put every word I consider relevant into a dictionary structure, with the label "spam" or "ham" depending on which kind of file I extracted it from.

Then I use nltk.NaiveBayesClassifier.train() to train my classifier, but I get this error:

    for featureset, label in labeled_featuresets:
ValueError: too many values to unpack

I have no idea why this happens. When I looked for a solution, I found that strings are not hashable, and I was using a list to do this, so I turned it into a dictionary, which as far as I know is hashable, but I keep getting this error. Does anyone know how to solve it? Thanks!
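To illustrate, here is a minimal reproduction of the same error outside NLTK (my assumption being that train() unpacks each training element as a (featureset, label) pair, which the traceback above suggests):

import traceback

word = {'viagra': 'spam', 'meeting': 'ham'}   # same shape as my dictionary

try:
    # Iterating over a dict yields only its keys, so each element is a single
    # string; unpacking 'viagra' into two names raises the same ValueError.
    for featureset, label in word:
        pass
except ValueError:
    traceback.print_exc()   # ValueError: too many values to unpack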

All of my code is below:

import nltk
import re 
import random

stopwords   = nltk.corpus.stopwords.words('english')    #Words I should avoid since they have weak value for classification
my_file     = open("spam_files.txt", "r")   #my_file now has the name of each file that contains a spam email example
word        = {}    #a dictionary where I will store all the words and the value each one has (spam or ham)

for lines in my_file:   #for each file name (represented by LINES) in my_file
    with open(lines.rsplit('\n')[0]) as email: #I will open the file pointed to by LINES, and then read the email example inside it
        for phrase in email:    #after that, I will take every phrase of this email example I just opened
            try:    #and I'll try to tokenize it
                tokens = nltk.word_tokenize(phrase)
            except:
                continue    #I will ignore non-ascii elements
            for c in tokens:    #for each token
                regex = re.compile('[^a-zA-Z]') #I will also exclude numbers
                c = regex.sub('', c)
                if (c): #if there is any element left
                    if (c not in stopwords): #and if this element is not a stopword
                        c.lower()
                        word.update({c: 'spam'})    #I put this element in my dictionary. Since I'm analysing spam examples, variable C is labeled "spam".

my_file.close() 
email.close()

#The same logic is used for the ham emails. Since my ham emails contain only ascii elements, I don't wrap this in TRY
my_file = open("ham_files.txt", "r")
for lines in my_file:
    with open(lines.rsplit('\n')[0]) as email:
        for phrase in email:
            tokens = nltk.word_tokenize(phrase)
            for c in tokens:
                regex = re.compile('[^a-zA-Z]')
                c = regex.sub('', c)
                if (c):
                    if (c not in stopwords):
                        c.lower()
                        word.update({c: 'ham'})

my_file.close() 
email.close()

#And here I train my classifier
classifier = nltk.NaiveBayesClassifier.train(word)
classifier.show_most_informative_features(5)

1 Answer:

Answer 0 (score: 1)

nltk.NaiveBayesClassifier.train() expects "a list of tuples (featureset, label)" (see the documentation of the train() method). What is not mentioned there is that each featureset should be a dict mapping feature names to feature values.

So, in typical spam/ham classification with a bag-of-words model, the labels are 'spam'/'ham' (or 1/0, or True/False); the feature names are the words that occur, and the values are the number of times each word occurs. For example, the argument to the train() method might look like this:

[({'greetings': 1, 'loan': 2, 'offer': 1}, 'spam'),
 ({'money': 3}, 'spam'),
 ...
 ({'dear': 1, 'meeting': 2}, 'ham'),
 ...
]
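To make this concrete with the question's setup, here is a minimal sketch of how the code above could be restructured to build such a list, assuming the same spam_files.txt / ham_files.txt layout (the helper name extract_features is just illustrative):

import re
import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))
regex = re.compile('[^a-zA-Z]')

def extract_features(path):
    """Return a bag-of-words feature dict {word: count} for one email file."""
    features = {}
    with open(path) as email:
        for phrase in email:
            try:
                tokens = nltk.word_tokenize(phrase)
            except Exception:
                continue    # skip phrases that fail to tokenize, as in the question
            for c in tokens:
                c = regex.sub('', c).lower()
                if c and c not in stopwords:
                    features[c] = features.get(c, 0) + 1
    return features

# One (featureset, label) tuple per email, instead of one global dictionary.
labeled_featuresets = []
for list_file, label in [("spam_files.txt", "spam"), ("ham_files.txt", "ham")]:
    with open(list_file) as my_file:
        for line in my_file:
            path = line.strip()
            if path:
                labeled_featuresets.append((extract_features(path), label))

classifier = nltk.NaiveBayesClassifier.train(labeled_featuresets)
classifier.show_most_informative_features(5)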

If your dataset is rather small, you might want to replace the actual word counts with 1, to reduce data sparsity.
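For example, a sketch of that change applied to the labeled_featuresets built above (every word is mapped to 1 instead of its count):

# Binary presence features: each word maps to 1 regardless of how often it occurs.
binary_featuresets = [({w: 1 for w in feats}, label)
                      for feats, label in labeled_featuresets]

classifier = nltk.NaiveBayesClassifier.train(binary_featuresets)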