How to merge NaiveBayesClassifier objects in NLTK

Asked: 2016-04-28 04:39:16

Tags: nlp nltk

I am working on a project using the NLTK toolkit. With my hardware I can only run the classifier on a small dataset, so I split the data into smaller chunks, trained a classifier object on each chunk, and stored all of these individual objects in a pickle file.

Now, for testing, I need the whole thing as a single object to get better results. So my question is: how can I merge these objects into one?

import pickle

objs = []

# All chunk classifiers were dumped into one pickle file; keep loading
# until the end of the file is reached.
with open(picklename, "rb") as f:
    while True:
        try:
            objs.extend(pickle.load(f))
        except EOFError:
            break

This does not work; it raises `TypeError: 'NaiveBayesClassifier' object is not iterable`.
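For context, `pickle.load` returns a single `NaiveBayesClassifier` per call, which is not iterable, so `extend` fails; `append` collects the objects (though that still leaves a list of separate classifiers rather than one merged classifier). A minimal sketch, using plain strings as stand-ins for the pickled classifiers:

```python
import os
import pickle
import tempfile

# Stand-ins for the pickled classifiers (any picklable object works for
# demonstration); in the question these are NaiveBayesClassifier objects,
# one per data chunk. File names here are illustrative.
tmpdir = tempfile.mkdtemp()
filenames = []
for i, obj in enumerate(["clf_a", "clf_b"]):
    name = os.path.join(tmpdir, "chunk%d.pickle" % i)
    with open(name, "wb") as f:
        pickle.dump(obj, f)
    filenames.append(name)

# pickle.load returns ONE object per file, so use append, not extend:
# extend tries to iterate over the loaded object, raising the TypeError.
objs = []
for picklename in filenames:
    with open(picklename, "rb") as f:
        objs.append(pickle.load(f))
```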

The NaiveBayesClassifier code:

    classifier = nltk.NaiveBayesClassifier.train(training_set)

1 Answer:

Answer 0 (score: 0)

I'm not sure of the exact format of your data, but you cannot simply merge different classifiers. A Naive Bayes classifier stores probability distributions derived from the data it was trained on, and you cannot merge probability distributions without access to the original data.

If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html an instance of the classifier stores:

    self._label_probdist = label_probdist
    self._feature_probdist = feature_probdist

These are computed in the `train` method using relative-frequency counts, e.g. P(L1) = (# of L1 labels in the training set) / (# of labels in the training set). To merge two training sets you would need (# of L1 in T1 + # of L1 in T2) / (# of labels in T1 + T2).
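To illustrate why pooled counts are needed (the numbers below are made up; NLTK's `FreqDist` is a `Counter` subclass, so it adds the same way):

```python
from collections import Counter

# Label counts from two training chunks (illustrative numbers).
t1 = Counter({"pos": 30, "neg": 10})   # 40 samples, P(pos) = 0.75
t2 = Counter({"pos": 10, "neg": 40})   # 50 samples, P(pos) = 0.20

combined = t1 + t2                     # counts over T1 + T2
p_pos = combined["pos"] / sum(combined.values())
# 40 / 90, i.e. pooled counts -- not the average of the two
# per-chunk probabilities (which would be 0.475 here)
```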

However, the Naive Bayes procedure is not hard to implement from scratch, especially if you follow the `train` method in the source code linked above. Here is an outline, based on that source:

  1. Store `FreqDist` objects for the labels and features of each subset of your data.

    from collections import defaultdict
    from nltk import FreqDist

    label_freqdist = FreqDist()
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    fnames = set()
    
    # Count up how many times each feature value occurred, given
    # the label and featurename.
    for featureset, label in labeled_featuresets:
        label_freqdist[label] += 1
        for fname, fval in featureset.items():
            # Increment freq(fval|label, fname)
            feature_freqdist[label, fname][fval] += 1
            # Record that fname can take the value fval.
            feature_values[fname].add(fval)
            # Keep a list of all feature names.
            fnames.add(fname)
    
    # If a feature didn't have a value given for an instance, then
    # we assume that it gets the implicit value 'None.'  This loop
    # counts up the number of 'missing' feature values for each
    # (label,fname) pair, and increments the count of the fval
    # 'None' by that amount.
    for label in label_freqdist:
        num_samples = label_freqdist[label]
        for fname in fnames:
            count = feature_freqdist[label, fname].N()
            # Only add a None key when necessary, i.e. if there are
            # any samples with feature 'fname' missing.
            if num_samples - count > 0:
                feature_freqdist[label, fname][None] += num_samples - count
                feature_values[fname].add(None)
    # Use pickle to store label_freqdist, feature_freqdist, feature_values
    
  2. Combine those using `FreqDist`'s built-in addition. This gives you the relative frequencies over all of the data.

    import pickle
    from collections import defaultdict
    from nltk import FreqDist

    all_label_freqdist = FreqDist()
    all_feature_freqdist = defaultdict(FreqDist)
    all_feature_values = defaultdict(set)

    # train_labels holds the names of the pickle files written in step 1.
    for filename in train_labels:
        with open(filename, "rb") as f:
            all_label_freqdist += pickle.load(f)

    # Combine the default dicts for features similarly
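The feature-side merge that the comment above leaves out might look like this (a sketch: `Counter` stands in for NLTK's `FreqDist`, which subclasses it, and the chunk dicts would really be unpickled from the step-1 files):

```python
from collections import Counter, defaultdict

def merge_feature_counts(all_ffd, all_fv, chunk_ffd, chunk_fv):
    """Fold one chunk's feature counts and value sets into running totals."""
    for (label, fname), freqdist in chunk_ffd.items():
        all_ffd[label, fname] += freqdist   # pool counts per (label, fname)
    for fname, values in chunk_fv.items():
        all_fv[fname] |= values             # union of observed feature values

all_feature_freqdist = defaultdict(Counter)
all_feature_values = defaultdict(set)

# Two illustrative chunks: (feature_freqdist, feature_values) pairs.
chunk1 = ({("pos", "word"): Counter({"great": 2})}, {"word": {"great"}})
chunk2 = ({("pos", "word"): Counter({"great": 1, "bad": 1})},
          {"word": {"great", "bad"}})

for ffd, fv in (chunk1, chunk2):
    merge_feature_counts(all_feature_freqdist, all_feature_values, ffd, fv)
```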
    
  3. Use an estimator to create the probability distributions.

    from nltk.classify import NaiveBayesClassifier
    from nltk.probability import ELEProbDist

    # Pass the class itself (as in NaiveBayesClassifier.train's default);
    # it is called below with a FreqDist to build each distribution.
    estimator = ELEProbDist
    
    label_probdist = estimator(all_label_freqdist)
    
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    for ((label, fname), freqdist) in all_feature_freqdist.items():
        probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
        feature_probdist[label, fname] = probdist
    
    classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
    
  4. A classifier created this way combines the counts across all of your data and produces what you need.
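For intuition about step 3: `ELEProbDist` implements the expected-likelihood estimate (Lidstone smoothing with gamma = 0.5). A stdlib sketch of the probability it assigns to a sample from the merged counts (numbers illustrative):

```python
from collections import Counter

def ele_prob(freqdist, sample, bins):
    """Expected-likelihood estimate: add 0.5 to every bin's count."""
    n = sum(freqdist.values())
    return (freqdist[sample] + 0.5) / (n + 0.5 * bins)

merged = Counter({"great": 3, "bad": 1})   # combined counts from all chunks
p = ele_prob(merged, "great", bins=2)      # (3 + 0.5) / (4 + 1) = 0.7
```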