Question

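I asked the same question here:

I have a question about using cross-validation for text classification in sklearn. Vectorizing all of the data before cross-validation is problematic, because the classifier will then have "seen" the vocabulary that occurs in the test data. Weka has FilteredClassifier to solve this problem. What is the sklearn equivalent of this feature? I mean that for each fold the feature set would be different, because the training data is different.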

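However, because I do a lot of processing on the data between the vectorization step and the classification step, I cannot use a pipeline, so I am trying to implement cross-validation myself as an outer loop around the whole process. Any guidance on this would be appreciated, as I am new to both Python and scikit-learn.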

Answer 1

I think using a cross-validation iterator as the outer loop is a good idea, and a starting point that keeps your steps clear and readable:

import numpy as np
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

X = np.array(["Science today", "Data science", "Titanic", "Batman"])  # raw text
y = np.array([1, 1, 2, 2])  # categories, e.g. Science, Movies
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    # Now continue with your pre-processing steps...
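
To make the outer-loop idea concrete, here is a minimal sketch of what "continue with your pre-processing steps" could look like. It assumes a TfidfVectorizer and a LinearSVC; the toy documents and the `extra_processing` helper are hypothetical placeholders for whatever custom steps sit between vectorization and classification. The key point is that the vectorizer is fitted on the training fold only, so each fold gets its own feature set.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Toy data, assumed for illustration only
X = np.array(["Science today", "Data science", "Titanic film",
              "Batman film", "Science news", "Batman returns"])
y = np.array([1, 1, 2, 2, 1, 2])


def extra_processing(matrix):
    # Hypothetical stand-in for the custom steps between
    # vectorization and classification; here it does nothing.
    return matrix


kf = KFold(n_splits=3)
scores = []
for train_index, test_index in kf.split(X):
    # A fresh vectorizer per fold: the vocabulary comes from the training fold only
    vectorizer = TfidfVectorizer()
    X_train = extra_processing(vectorizer.fit_transform(X[train_index]))
    # transform (not fit_transform) reuses the training vocabulary on the test fold
    X_test = extra_processing(vectorizer.transform(X[test_index]))

    clf = LinearSVC()
    clf.fit(X_train, y[train_index])
    scores.append(clf.score(X_test, y[test_index]))

print("accuracy per fold:", scores)
print("mean accuracy:", np.mean(scores))
```

Words that appear only in a test fold are simply ignored by `transform`, which is the behaviour the question asks for.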

Answer 2

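I may be missing the point of your question, and I am not familiar with Weka, but you can pass the vocabulary as a dictionary to the vectorizer you use in sklearn. With a vocabulary taken from the training set, a word such as "second" that occurs only in the test set is skipped, so only the features from the training set are used.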
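
A minimal sketch of this vocabulary approach, assuming a CountVectorizer and two made-up document lists (`train_docs` and `test_docs` are illustrative names, not from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, assumed for illustration only
train_docs = ["this is the first", "set of documents"]
test_docs = ["this is the second", "set of documents"]

# Learn the vocabulary from the training documents only
train_vectorizer = CountVectorizer()
train_matrix = train_vectorizer.fit_transform(train_docs)
train_vocab = train_vectorizer.vocabulary_  # dict mapping term -> column index

# Reuse that vocabulary for the test documents; "second" is not in it
test_vectorizer = CountVectorizer(vocabulary=train_vocab)
test_matrix = test_vectorizer.transform(test_docs)

print(sorted(train_vocab))
print(test_matrix.toarray())
```

Calling `train_vectorizer.transform(test_docs)` directly should give the same matrix; passing `vocabulary=` is mainly useful when the vocabulary has to be stored or handed to a separate vectorizer.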

Also note that you can use your own preprocessing and/or tokenization in the vectorizer, like:

def preprocessor(string):
    # do your own preprocessing here; lowercasing is only a placeholder
    return string.lower()

def tokenizer(string):
    # do your own tokenization here; whitespace splitting is only a placeholder
    return string.split()

from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
clf = Pipeline([('vect', TfidfVectorizer(preprocessor=preprocessor, tokenizer=tokenizer)), ('svm', LinearSVC())])
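
When the intermediate processing can be expressed inside the vectorizer, a pipeline like the one above can be passed straight to `cross_val_score`, which clones and re-fits the whole pipeline (vectorizer included) on each training fold. The following is a rough sketch with made-up documents and labels, using the default tokenization to keep it short:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, assumed for illustration only
docs = ["science today", "data science", "science news", "physics science",
        "titanic movie", "batman movie", "batman returns", "movie night"]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

clf = Pipeline([("vect", TfidfVectorizer()), ("svm", LinearSVC())])

# Each fold fits the vectorizer on its own training split before scoring
scores = cross_val_score(clf, docs, labels, cv=2)
print(scores, scores.mean())
```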

Cross-validation and text classification
