How to merge NaiveBayesClassifier objects in NLTK

Asked: 2016-04-28 04:39:16

Tags: nlp nltk

I am working on a project using the NLTK toolkit. With my hardware I can only run the classifier on a small dataset, so I split the data into smaller chunks, trained a classifier object on each chunk, and stored all of these individual objects in a pickle file.

Now, for testing, I need the whole thing as a single object to get better results. So my question is: how can I merge these objects into one?

import pickle

objs = []

# All chunk classifiers were dumped into one pickle file; keep loading
# until the end of the file is reached.
with open(picklename, "rb") as f:
    while True:
        try:
            objs.extend(pickle.load(f))
        except EOFError:
            break

This does not work; it raises `TypeError: 'NaiveBayesClassifier' object is not iterable`.
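For context, `pickle.load` returns a single `NaiveBayesClassifier` per call, which is not iterable, so `extend` fails; `append` collects the objects (though that still leaves a list of separate classifiers rather than one merged classifier). A minimal sketch, using plain strings as stand-ins for the pickled classifiers:

```python
import os
import pickle
import tempfile

# Stand-ins for the pickled classifiers (any picklable object works for
# demonstration); in the question these are NaiveBayesClassifier objects,
# one per data chunk. File names here are illustrative.
tmpdir = tempfile.mkdtemp()
filenames = []
for i, obj in enumerate(["clf_a", "clf_b"]):
    name = os.path.join(tmpdir, "chunk%d.pickle" % i)
    with open(name, "wb") as f:
        pickle.dump(obj, f)
    filenames.append(name)

# pickle.load returns ONE object per file, so use append, not extend:
# extend tries to iterate over the loaded object, raising the TypeError.
objs = []
for picklename in filenames:
    with open(picklename, "rb") as f:
        objs.append(pickle.load(f))
```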

The NaiveBayesClassifier code:

    classifier = nltk.NaiveBayesClassifier.train(training_set)

1 Answer:

Answer 0 (score: 0)

I'm not sure of the exact format of your data, but you cannot simply merge different classifiers. A Naive Bayes classifier stores probability distributions derived from the data it was trained on, and you cannot merge probability distributions without access to the original data.

If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html an instance of the classifier stores:

    self._label_probdist = label_probdist
    self._feature_probdist = feature_probdist

These are computed in the `train` method using relative-frequency counts, e.g. P(L1) = (# of L1 labels in the training set) / (# of labels in the training set). To merge two training sets you would need (# of L1 in T1 + # of L1 in T2) / (# of labels in T1 + T2).
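To illustrate why pooled counts are needed (the numbers below are made up; NLTK's `FreqDist` is a `Counter` subclass, so it adds the same way):

```python
from collections import Counter

# Label counts from two training chunks (illustrative numbers).
t1 = Counter({"pos": 30, "neg": 10})   # 40 samples, P(pos) = 0.75
t2 = Counter({"pos": 10, "neg": 40})   # 50 samples, P(pos) = 0.20

combined = t1 + t2                     # counts over T1 + T2
p_pos = combined["pos"] / sum(combined.values())
# 40 / 90, i.e. pooled counts -- not the average of the two
# per-chunk probabilities (which would be 0.475 here)
```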

However, the Naive Bayes procedure is not hard to implement from scratch, especially if you follow the `train` method in the source code linked above. Here is an outline, based on that source:

  1. Store `FreqDist` objects for the labels and features of each subset of your data.

    from collections import defaultdict
    from nltk import FreqDist

    label_freqdist = FreqDist()
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    fnames = set()
    
    # Count up how many times each feature value occurred, given
    # the label and featurename.
    for featureset, label in labeled_featuresets:
        label_freqdist[label] += 1
        for fname, fval in featureset.items():
            # Increment freq(fval|label, fname)
            feature_freqdist[label, fname][fval] += 1
            # Record that fname can take the value fval.
            feature_values[fname].add(fval)
            # Keep a list of all feature names.
            fnames.add(fname)
    
    # If a feature didn't have a value given for an instance, then
    # we assume that it gets the implicit value 'None.'  This loop
    # counts up the number of 'missing' feature values for each
    # (label,fname) pair, and increments the count of the fval
    # 'None' by that amount.
    for label in label_freqdist:
        num_samples = label_freqdist[label]
        for fname in fnames:
            count = feature_freqdist[label, fname].N()
            # Only add a None key when necessary, i.e. if there are
            # any samples with feature 'fname' missing.
            if num_samples - count > 0:
                feature_freqdist[label, fname][None] += num_samples - count
                feature_values[fname].add(None)
    # Use pickle to store label_freqdist, feature_freqdist, feature_values
    
  2. Combine those using `FreqDist`'s built-in addition. This gives you the relative frequencies over all of the data.

    import pickle
    from collections import defaultdict
    from nltk import FreqDist

    all_label_freqdist = FreqDist()
    all_feature_freqdist = defaultdict(FreqDist)
    all_feature_values = defaultdict(set)

    # train_labels holds the names of the pickle files written in step 1.
    for filename in train_labels:
        with open(filename, "rb") as f:
            all_label_freqdist += pickle.load(f)

    # Combine the default dicts for features similarly
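The feature-side merge that the comment above leaves out might look like this (a sketch: `Counter` stands in for NLTK's `FreqDist`, which subclasses it, and the chunk dicts would really be unpickled from the step-1 files):

```python
from collections import Counter, defaultdict

def merge_feature_counts(all_ffd, all_fv, chunk_ffd, chunk_fv):
    """Fold one chunk's feature counts and value sets into running totals."""
    for (label, fname), freqdist in chunk_ffd.items():
        all_ffd[label, fname] += freqdist   # pool counts per (label, fname)
    for fname, values in chunk_fv.items():
        all_fv[fname] |= values             # union of observed feature values

all_feature_freqdist = defaultdict(Counter)
all_feature_values = defaultdict(set)

# Two illustrative chunks: (feature_freqdist, feature_values) pairs.
chunk1 = ({("pos", "word"): Counter({"great": 2})}, {"word": {"great"}})
chunk2 = ({("pos", "word"): Counter({"great": 1, "bad": 1})},
          {"word": {"great", "bad"}})

for ffd, fv in (chunk1, chunk2):
    merge_feature_counts(all_feature_freqdist, all_feature_values, ffd, fv)
```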
    
  3. Use an estimator to create the probability distributions.

    from nltk.classify import NaiveBayesClassifier
    from nltk.probability import ELEProbDist

    # Pass the class itself (as in NaiveBayesClassifier.train's default);
    # it is called below with a FreqDist to build each distribution.
    estimator = ELEProbDist
    
    label_probdist = estimator(all_label_freqdist)
    
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    for ((label, fname), freqdist) in all_feature_freqdist.items():
        probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
        feature_probdist[label, fname] = probdist
    
    classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
    
  4. A classifier created this way combines the counts across all of your data and produces what you need.
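For intuition about step 3: `ELEProbDist` implements the expected-likelihood estimate (Lidstone smoothing with gamma = 0.5). A stdlib sketch of the probability it assigns to a sample from the merged counts (numbers illustrative):

```python
from collections import Counter

def ele_prob(freqdist, sample, bins):
    """Expected-likelihood estimate: add 0.5 to every bin's count."""
    n = sum(freqdist.values())
    return (freqdist[sample] + 0.5) / (n + 0.5 * bins)

merged = Counter({"great": 3, "bad": 1})   # combined counts from all chunks
p = ele_prob(merged, "great", bins=2)      # (3 + 0.5) / (4 + 1) = 0.7
```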