Cross-validation of a paragraph vector model

Date: 2019-01-02 10:54:49

Tags: scikit-learn transform cross-validation gensim

When I try to apply cross-validation to a paragraph vector model, I get an error:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from gensim.sklearn_api import D2VTransformer
from gensim.utils import simple_preprocess

data = pd.read_csv('https://pastebin.com/raw/bSGWiBfs')
np.random.seed(0)

X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
y_train = data.label

model = D2VTransformer(size=10, min_count=1, iter=5, seed=1)
clf = LogisticRegression(random_state=0)

pipeline = Pipeline([
        ('vec', model),
        ('clf', clf)
    ])

pipeline.fit(X_train, y_train)

score = pipeline.score(X_train, y_train)
print("Score:", score) # This works
cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=3)
print("Cross-Validation:", cval) # This doesn't work
  

KeyError: 0

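I tried replacing X_train in cross_val_score with model.transform(X_train) or model.fit_transform(X_train). I also tried the same with the raw input data (data.text) instead of the preprocessed text. I suspect something is wrong with the format of X_train for cross-validation, since the Pipeline's .score function works fine on the same data. I also noticed that cross_val_score works with CountVectorizer().

Does anybody spot the mistake?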

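1 Answer:

Answer 0: (score: 1)

No, this has nothing to do with the transformation done by model. It is related to cross_val_score.

cross_val_score splits the supplied data according to the cv parameter. To do that, it performs something like this: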

for train, test in splitter.split(X_train, y_train):
    new_X_train, new_y_train = X_train[train], y_train[train]
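
To make the split concrete, here is a small sketch. It uses KFold as a stand-in for whichever splitter cross_val_score actually selects (for a classifier with an integer cv that would be a stratified variant), and the toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: six "documents" and their labels
X = np.array(["doc0", "doc1", "doc2", "doc3", "doc4", "doc5"])
y = np.array([0, 1, 0, 1, 0, 1])

# Stand-in splitter; without shuffling the test folds are contiguous blocks
splitter = KFold(n_splits=3)

folds = []
for train, test in splitter.split(X, y):
    # train and test are arrays of integer positions, not the data itself
    folds.append((train.tolist(), test.tolist()))
    print("train positions:", train, "test positions:", test)
```

Note that the first training fold does not contain position 0, which is worth keeping in mind when reading the KeyError: 0 above.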

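But your X_train is a pandas.Series object, on which integer indexing is label-based rather than positional, so index-based selection does not work like this. See: https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position

Change this line: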
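
A minimal sketch of that pandas behaviour, assuming a Series with the default integer index (the token lists are invented for illustration). A positional subset keeps its original labels, so any code that then asks for element 0 fails once label 0 has been split off into the test fold, whereas an array or a list is always indexed by position.

```python
import pandas as pd

# Hypothetical stand-in for the preprocessed documents
s = pd.Series([["a", "b"], ["c"], ["d", "e"], ["f"]])

# Select rows by position, as a training fold without row 0 would be
subset = s.iloc[[1, 2, 3]]
print(list(subset.index))  # the labels 1, 2, 3 are kept; there is no label 0

# Label-based lookup of 0 fails on the subset
try:
    subset[0]
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)

# A plain array or list is positional, so element 0 is simply the first item
first_from_values = subset.values[0]
first_from_list = subset.tolist()[0]
```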

X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)

to:

# Access the internal numpy array
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).values

OR

# Convert series to list
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).tolist()