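I built a model to predict the vendor name from a receipt.
For this I used a bag-of-words approach and a random forest classifier. The code is shown below.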
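from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# df is a pandas DataFrame; filenames is defined elsewhere, one entry per row of df
X = df['taggunResponse']
y = df['label']
# Binary unigram bag of words; ignore terms found in more than 3% of documents or in fewer than 10 documents
count_vect = CountVectorizer(ngram_range=(1, 1), binary=True, max_df=0.03, min_df=10, lowercase=False)
X_counts = count_vect.fit_transform(X)
X_train_counts, X_test_counts, y_train, y_test, filenames_train, filenames_test = train_test_split(X_counts, y, filenames, test_size=0.23, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_train_counts, y_train)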
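If I pickle the objects, these are the sizes I see on disk:
y: ~200 kB
X_train_counts: ~2 MB
clf: ~2 GB
How can my classifier be so much larger than the data used to train it? Am I doing something wrong here?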
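
For reference, sizes like the ones above can be measured without writing anything to disk. The following is only a minimal sketch: `pickled_size` and `human_readable` are hypothetical helper names that are not part of the original code, and the in-memory byte count from `pickle.dumps` is assumed to be close to the size of the file that `pickle.dump` would write.

```python
import pickle


def pickled_size(obj):
    """Return the number of bytes obj occupies when serialized with pickle."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def human_readable(num_bytes):
    """Format a byte count using decimal units (B, kB, MB, GB, TB)."""
    for unit in ("B", "kB", "MB", "GB"):
        if num_bytes < 1000:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.1f} TB"


if __name__ == "__main__":
    # Stand-in object for illustration only; in the setup above this would be
    # y, X_train_counts, or clf.
    sample = {"labels": list(range(1000))}
    print(human_readable(pickled_size(sample)))
```

Calling `human_readable(pickled_size(clf))` on the fitted classifier should give a figure comparable to the one reported above. To see where the bytes go, it may also help to sum `est.tree_.node_count` over `clf.estimators_`, since each fitted tree stores its own array of nodes.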