I am currently training a LinearSVC classifier with a single feature vectorizer. I am working with news articles, which are stored in separate files. Those files originally had a title, a text body, a date, an author and sometimes images, but I ended up removing everything except the text body. I do it like this:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn import metrics, preprocessing

# Loading the files (plain files with just the news content; no date, author or other features)
data_train = load_files(self.TRAIN_FOLDER, encoding=self.ENCODING) # data_train
data_test = load_files(self.TEST_FOLDER, encoding=self.ENCODING)
unlabeled = load_files(self.UNLABELED_FOLDER, encoding=self.ENCODING)
categories = data_train.target_names
# Get the sparse matrix of each dataset
y_train = data_train.target
y_test = data_test.target
# Vectorizing
vectorizer = TfidfVectorizer(encoding=self.ENCODING, use_idf=True, norm='l2', binary=False, sublinear_tf=True, min_df=0.001, max_df=1.0, ngram_range=(1, 2), analyzer='word')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
X_unlabeled = vectorizer.transform(unlabeled.data)
# Instantiating the classifier
clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3)
# Fitting the model according to the training set and predicting
scaler = preprocessing.StandardScaler(with_mean=False)
scaler = scaler.fit(X_train)
normalized_X_train = scaler.transform(X_train)
clf.fit(normalized_X_train, y_train)
normalized_X_test = scaler.transform(X_test)
pred = clf.predict(normalized_X_test)
accuracy_score = metrics.accuracy_score(y_test, pred)
recall_score = metrics.recall_score(y_test, pred)
precision_score = metrics.precision_score(y_test, pred)
But now I want to include other features, such as the date or the author, and all the simpler examples I have found use a single feature, so I am not sure how to proceed. Should I put all the information in a single file? How do I distinguish the author from the content? Should I use a vectorizer for each feature? If so, should I fit a model with the differently vectorized features? Or should I use a different classifier for each feature? Can you suggest something to read (aimed at a newcomer)?
Thanks in advance.
Answer 0 (score: 2)
The output of TfidfVectorizer is a scipy.sparse.csr.csr_matrix object. You can use hstack to add more features (e.g. here). Alternatively, you can convert the feature space you already have into a numpy array or a pandas DataFrame and then append the new features (which might come from another vectorizer) as new columns. Either way, the final X_train and X_test should have all the features in one place. You may also want to standardize them before training (here); it does not look like you are doing that.
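For the sparse route, a minimal, self-contained sketch of the hstack idea might look like this (the documents and the has_image column are made up purely for illustration):

import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer produces a sparse csr_matrix
docs = ['first news text', 'second news text', 'third news text']  # made-up documents
X_tfidf = TfidfVectorizer().fit_transform(docs)

# One made-up numeric feature per document (e.g. a 0/1 "has_image" flag)
extra = csr_matrix(np.array([[1], [0], [1]]))  # shape (n_docs, 1)

# hstack keeps everything sparse; the combined matrix can go straight into LinearSVC
X_combined = hstack([X_tfidf, extra]).tocsr()
print(X_combined.shape)  # (3, n_tfidf_features + 1)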
I don't have your data, so here is an example with some dummy data:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
X_train = pd.DataFrame(X_train.todense())
X_train['has_image'] = [1, 0, 0, 1] # just adding a dummy feature for demonstration
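From there you would standardize the combined matrix and fit the classifier just as before; a rough continuation of the dummy example (the labels are made up, only to show the shape of the pipeline):

from sklearn import preprocessing
from sklearn.svm import LinearSVC

y_train = [0, 1, 1, 0]  # made-up labels for the four dummy documents

# .values avoids issues with the mixed (int/str) column names in the DataFrame
scaler = preprocessing.StandardScaler().fit(X_train.values)
clf = LinearSVC()
clf.fit(scaler.transform(X_train.values), y_train)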