I am working on a project using the NLTK toolkit. With my hardware I can only run the classifier on a small dataset, so I split the data into smaller chunks, trained a classifier object on each chunk, and stored all of these separate objects in pickle files.
Now, for testing, I need the whole thing as a single object to get better results. So my question is: how can I merge these objects into one?
objs = []
while True:
    try:
        f = open(picklename, "rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break
Doing it this way does not work. It raises TypeError: 'NaiveBayesClassifier' object is not iterable.
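As an aside, the immediate TypeError comes from extend(), which tries to iterate over the single classifier that pickle.load returns; append() at least collects the objects, though it still does not merge them. A minimal self-contained sketch (the dict objects and in-memory "files" are hypothetical stand-ins for the pickled classifiers):

```python
import io
import pickle

# Hypothetical stand-ins: each "chunk file" holds one pickled classifier-like
# object. pickle.load returns that single object; extend() would try to
# iterate over it and fail, append() collects it instead.
chunks = [io.BytesIO(pickle.dumps({"classifier": i})) for i in range(3)]

objs = []
for f in chunks:
    objs.append(pickle.load(f))  # append, not extend

print(len(objs))  # 3 separate objects -- still not one merged classifier
```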
The NaiveBayesClassifier code:
classifier = nltk.NaiveBayesClassifier.train(training_set)
Answer 0 (score: 0)
I am not sure about the exact format of your data, but you cannot simply merge different classifiers. A Naive Bayes classifier stores a probability distribution based on the data it was trained on, and you cannot merge probability distributions without access to the original counts.
If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html an instance of the classifier stores:
self._label_probdist = label_probdist
self._feature_probdist = feature_probdist
These are computed inside the train method using relative frequency counts, e.g. P(L1) = (# of L1 in training set) / (# of labels in training set). To merge two training sets T1 and T2, you would need (# of L1 in T1 + # of L1 in T2) / (# of labels in T1 + T2).
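A quick worked example of why the raw counts are needed, with made-up numbers (Counter is used here so the snippet has no dependencies; nltk's FreqDist subclasses Counter and supports '+' the same way). Averaging the per-chunk probabilities gives the wrong answer when the chunks differ in size; pooling the counts gives the right one:

```python
from collections import Counter

# Chunk T1: 100 samples, P(pos) = 30/100 = 0.30
t1 = Counter({"pos": 30, "neg": 70})
# Chunk T2: 300 samples, P(pos) = 30/300 = 0.10
t2 = Counter({"pos": 30, "neg": 270})

# Naive averaging of the two probabilities would give (0.30 + 0.10)/2 = 0.20,
# but the correct combined estimate sums the raw counts first:
combined = t1 + t2
pooled = combined["pos"] / sum(combined.values())
print(pooled)  # 60 / 400 = 0.15
```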
However, the naive Bayes procedure is not too hard to implement from scratch, especially if you follow the 'train' source code in the link above. Here is an outline, using the NaiveBayes source code.
Store 'FreqDist' objects for each subset of the data, for the labels and the features.
label_freqdist = FreqDist()
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
fnames = set()

# Count up how many times each feature value occurred, given
# the label and featurename.
for featureset, label in labeled_featuresets:
    label_freqdist[label] += 1
    for fname, fval in featureset.items():
        # Increment freq(fval|label, fname)
        feature_freqdist[label, fname][fval] += 1
        # Record that fname can take the value fval.
        feature_values[fname].add(fval)
        # Keep a list of all feature names.
        fnames.add(fname)

# If a feature didn't have a value given for an instance, then
# we assume that it gets the implicit value 'None.' This loop
# counts up the number of 'missing' feature values for each
# (label,fname) pair, and increments the count of the fval
# 'None' by that amount.
for label in label_freqdist:
    num_samples = label_freqdist[label]
    for fname in fnames:
        count = feature_freqdist[label, fname].N()
        # Only add a None key when necessary, i.e. if there are
        # any samples with feature 'fname' missing.
        if num_samples - count > 0:
            feature_freqdist[label, fname][None] += num_samples - count
            feature_values[fname].add(None)

# Use pickle to store label_freqdist, feature_freqdist, feature_values
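The pickling step left as a comment above can be sketched as follows. This is a minimal illustration, not the answer's exact code: the filename is made up, and plain Counters stand in for nltk's FreqDist so the snippet is self-contained; bundling all three objects in one tuple means a single load restores them together.

```python
import pickle
from collections import Counter, defaultdict

# Per-chunk counts (Counter stands in for nltk.FreqDist here).
label_freqdist = Counter({"pos": 3, "neg": 1})
feature_freqdist = defaultdict(Counter)
feature_freqdist["pos", "contains(good)"][True] = 2
feature_values = {"contains(good)": {True, None}}

# 'chunk0.freqdists.pickle' is a hypothetical per-chunk filename.
with open("chunk0.freqdists.pickle", "wb") as f:
    pickle.dump((label_freqdist, feature_freqdist, feature_values), f)

# Later, one load gives all three objects back:
with open("chunk0.freqdists.pickle", "rb") as f:
    labels, features, values = pickle.load(f)
```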
Combine those using their built-in 'add' method. That way you get the relative frequencies over all of the data.
all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)

for file in train_labels:
    f = open(file, "rb")
    all_label_freqdist += pickle.load(f)
    f.close()
# Combine the default dicts for features similarly
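The "combine the default dicts for features similarly" step can be sketched like this (again with Counter standing in for FreqDist so the snippet is self-contained; the `merge_feature_counts` helper and the chunk data are made up for illustration). Counts under each (label, fname) key are summed, and the sets of observed feature values are unioned:

```python
from collections import Counter, defaultdict

def merge_feature_counts(chunks):
    """Merge per-chunk (feature_freqdist, feature_values) pairs."""
    all_feature_freqdist = defaultdict(Counter)
    all_feature_values = defaultdict(set)
    for feature_freqdist, feature_values in chunks:
        for key, freqdist in feature_freqdist.items():
            all_feature_freqdist[key] += freqdist   # sum counts per (label, fname)
        for fname, vals in feature_values.items():
            all_feature_values[fname] |= vals       # union of observed values
    return all_feature_freqdist, all_feature_values

# Two tiny made-up chunks:
chunk1 = (
    {("pos", "contains(good)"): Counter({True: 2, None: 1})},
    {"contains(good)": {True, None}},
)
chunk2 = (
    {("pos", "contains(good)"): Counter({True: 1})},
    {"contains(good)": {True}},
)

counts, values = merge_feature_counts([chunk1, chunk2])
print(counts["pos", "contains(good)"][True])  # 2 + 1 = 3
```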
Create the probability distributions using an 'estimator'.
estimator = ELEProbDist  # the class itself, applied to a FreqDist below

label_probdist = estimator(all_label_freqdist)

# Create the P(fval|label, fname) distribution
feature_probdist = {}
for ((label, fname), freqdist) in all_feature_freqdist.items():
    probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
    feature_probdist[label, fname] = probdist

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
This classifier now combines the counts over all of your data and should give you what you need.
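Putting the pieces together, here is a small end-to-end sketch of the whole recipe on two tiny made-up chunks: count per chunk, pool the counts, then construct one NaiveBayesClassifier from the pooled counts (the chunk data and feature names are invented for illustration; the nltk calls follow the train source linked above):

```python
from collections import defaultdict
from nltk import FreqDist, NaiveBayesClassifier
from nltk.probability import ELEProbDist

# Two hypothetical data chunks of (featureset, label) pairs.
chunk1 = [({"contains(good)": True}, "pos"), ({"contains(good)": False}, "neg")]
chunk2 = [({"contains(good)": True}, "pos")]

# Pool the label and feature counts over all chunks.
all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)
for chunk in (chunk1, chunk2):
    for featureset, label in chunk:
        all_label_freqdist[label] += 1
        for fname, fval in featureset.items():
            all_feature_freqdist[label, fname][fval] += 1
            all_feature_values[fname].add(fval)

# Build the probability distributions and the combined classifier.
estimator = ELEProbDist  # the class, not an instance
label_probdist = estimator(all_label_freqdist)
feature_probdist = {
    (label, fname): estimator(freqdist, bins=len(all_feature_values[fname]))
    for (label, fname), freqdist in all_feature_freqdist.items()
}
classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

print(classifier.classify({"contains(good)": True}))  # expected "pos"
```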