Question

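I am trying to do two-class (binary) text classification. Usually I create pickle files of the trained model and load those pickles later, so that the model does not have to be retrained.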

When I had 12000 reviews plus more than 50000 tweets for each class, the trained model grew to 1.4 GB.

Storing model data this large in a pickle and loading it again is not really feasible or advisable.

Is there a better alternative?

Here is the sample code. I tried several ways of pickling; here I have used the dill package:

    # Assumes `import os` and `import dill` at module level.
    def train(self):
        global pos, neg, totals
        retrain = False

        # Load counts if they already exist.
        if not retrain and os.path.isfile(CDATA_FILE):
            # pos, neg, totals = cPickle.load(open(CDATA_FILE))
            # Pickled data is binary, so the file is opened in 'rb' mode.
            with open(CDATA_FILE, 'rb') as f:
                pos, neg, totals = dill.load(f)
            return

        # Negative examples: list the same directory the files are read from.
        for file in os.listdir("./unsuspected/"):
            for word in set(self.negate_sequence(open("./unsuspected/" + file).read())):
                neg[word] += 1
                pos['not_' + word] += 1
        # Positive examples.
        for file in os.listdir("./suspected/"):
            for word in set(self.negate_sequence(open("./suspected/" + file).read())):
                pos[word] += 1
                neg['not_' + word] += 1

        self.prune_features()

        totals[0] = sum(pos.values())
        totals[1] = sum(neg.values())

        countdata = (pos, neg, totals)
        with open(CDATA_FILE, 'wb') as f:
            dill.dump(countdata, f)

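UPDATE: The reason behind the big pickle is that the classification data is very large, and I have used 1- to 4-grams for feature selection. The classification dataset itself is around 300 MB, so taking a multigram approach to feature selection produces a large trained model.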

Answer 1

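Pickle is a heavy format. It stores all the details of the objects. It would be much better to store your data in an efficient format such as hdf5. If you are not familiar with hdf5, you can look into storing your data in simple flat text files. You could use csv or json, depending on the structure of your data. You will find that either is more efficient than pickle.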

You can also look at gzip to create and load compressed archives.
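As a rough sketch of the flat-file idea, the count data from the question could be written as gzip-compressed JSON instead of a pickle. The helper names `save_counts` and `load_counts` and the sample counts are made up for illustration, and the code assumes `pos` and `neg` are plain word-to-count mappings and `totals` is a two-element list, as the question's code suggests. How much space this saves depends on the data, so it is worth measuring on the real counts.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def save_counts(path, pos, neg, totals):
    """Write the count data as gzip-compressed JSON (hypothetical helper)."""
    payload = {"pos": dict(pos), "neg": dict(neg), "totals": list(totals)}
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(payload, fh)


def load_counts(path):
    """Read the count data back, restoring the defaultdict behaviour."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        payload = json.load(fh)
    pos = defaultdict(int, payload["pos"])
    neg = defaultdict(int, payload["neg"])
    return pos, neg, payload["totals"]


# Small made-up counts, only to show the round trip.
pos = defaultdict(int, {"good": 3, "not_bad": 1})
neg = defaultdict(int, {"bad": 2, "not_good": 1})
totals = [sum(pos.values()), sum(neg.values())]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "counts.json.gz")
    save_counts(path, pos, neg, totals)
    pos2, neg2, totals2 = load_counts(path)
```

JSON only handles basic types (strings, numbers, lists, dicts), which is enough for word counts but not for arbitrary Python objects.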

Answer 2

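The problem and the solution are explained here. In short, the problem comes from the fact that with CountVectorizer, even though you may ask for a small number of features, e.g. max_features=1000, the transformer still keeps a copy of all possible features under the hood for debugging purposes. For example, CountVectorizer has the following attribute: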

stop_words_ : set
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.

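This is what causes the model size to become too large. To solve the problem, you can set stop_words_ to None before pickling the model. The example below is taken from the link above; please check it for details: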

    import pickle

    model_name = 'clickbait-model-sm.pkl'
    # Drop the set of discarded terms so it is not written into the pickle.
    cfr_pipeline.named_steps.vectorizer.stop_words_ = None
    with open(model_name, 'wb') as f:
        pickle.dump(cfr_pipeline, f, protocol=2)
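To see why this helps without needing a trained pipeline, here is a small illustration using a toy class. `ToyVectorizer` is a hypothetical stand-in, not scikit-learn's CountVectorizer, and the term lists are invented. It only mimics the situation described above: a small kept vocabulary next to a large set of discarded terms.

```python
import pickle


class ToyVectorizer:
    """Hypothetical stand-in for a fitted vectorizer (not scikit-learn)."""

    def __init__(self, kept, discarded):
        # Small vocabulary that the model actually uses.
        self.vocabulary_ = {term: i for i, term in enumerate(kept)}
        # Large set of dropped terms, kept only for inspection.
        self.stop_words_ = set(discarded)


kept = [f"term{i}" for i in range(1000)]
discarded = [f"rare{i}" for i in range(200_000)]
vec = ToyVectorizer(kept, discarded)

size_before = len(pickle.dumps(vec, protocol=2))
vec.stop_words_ = None  # same trick as above, applied to the toy object
size_after = len(pickle.dumps(vec, protocol=2))

print(size_before, size_after)
```

The vocabulary is untouched, so in this toy setup only the pickle size changes.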

Training a classifier with large data
