Question

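I am trying to build a recommender system for a local newspaper (as a school project), but I am running into trouble when I try to use the cross_validate function from the model_selection library.

I am trying to use SVD and get an F1 score, but I am a bit confused. This is unsupervised learning and I do not have a test set, so I want to use k-fold cross-validation. I believe the number of folds is given by the "cv" parameter of the cross_validate function. Is that correct?

The problem appears when I try to run the code, because I get the following stack trace: https://hastebin.com/kidoqaquci.tex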

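I am not passing anything to the "y" parameter of the cross_validate function, but is that wrong? Isn't that where the test set is supposed to go? As I said, I do not have a test set, since this is unsupervised. I looked at the example in section 3.1.1.1 here: http://scikit-learn.org/stable/modules/cross_validation.html

It looks like they pass a "target" for the data set to the cross_validate function. But why do they pass both a target set and the cv parameter? Does a cv value above 1 mean that k-folding should be used and that the left-out fold is used as the target (test) set?

Or am I completely misunderstanding something? Why do I get the "missing argument" error in the stack trace?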

Here is the code that fails:

from sklearn.model_selection import cross_val_score as cv
from sklearn.decomposition import TruncatedSVD
import pandas as pd

# keywords_data_filename = 'keywords_data.txt'
active_data_filename = 'active_time_data.txt'

header = ['user_id', 'item_id', 'rating']
# keywords_data = pd.read_csv(keywords_data_filename, sep='*', names=header, engine='python')
active_time_data = pd.read_csv(active_data_filename, sep='*', names=header, engine='python')


# Number of users in current set
print('Number of unique users in current data-set', active_time_data.user_id.unique().shape[0])
print('Number of unique articles in current data-set', active_time_data.item_id.unique().shape[0])

# SVD allows us to look at our input matrix as a product of three smaller matrices; U, Z and V.
# In short this will help us discover concepts from the original input matrix,
# (subsets of users that like subsets of items)
# Note that use of SVD is not strictly restricted to user-item matrices
# https://www.youtube.com/watch?v=P5mlg91as1c

algorithm = TruncatedSVD()

# Finally we run our cross validation in n folds, where n is denoted by the cv parameter.
# Verbose can be adjusted by an integer to determine level of verbosity.
# We pass in our SVD algorithm as the estimator used to fit the data.
# X is our data set that we want to fit.
# Since our estimator (the SVD algorithm) does not define its own score method, we must either define our own
# estimator or simply specify how the fitting is scored.
# Since we currently rate the enjoyment of our users per article as strictly binary (please see the rate_article fn in
# the filter script), we can easily decide our precision and recall based on whether or not our prediction exactly
# matches the binary rating field in the test set.
# Thus, the F1 scoring metric seems an intuitive choice for measuring our success, as it provides a balanced score
# based on the two.

cv(estimator=algorithm, X=active_time_data, scoring='f1', cv=5, verbose=True)

Answer 1

There are several problems here:

1) TruncatedSVD is a dimensionality reduction algorithm, so I do not see how you intend to compute an f1_score with it.

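2) f1_score is traditionally used for classification tasks and is defined by the formula:

f1 = 2 * precision * recall / (precision + recall)

where precision and recall are defined in terms of true positives, false positives and false negatives, which in turn requires both the true classes and the predicted classes.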
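To make the dependency on true labels concrete, here is a minimal pure-Python sketch of the formula above. The helper name precision_recall_f1 and the sample ratings are made up for illustration; this is not how scikit-learn implements f1_score internally, only the same arithmetic for a single positive class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical binary ratings (1 = liked, 0 = not liked), invented for this example.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Nothing in this computation can be carried out without y_true, which is consistent with the scorer complaining about a missing y_true argument when no y is passed.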

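3) cv = 1 makes no sense. In cross_val_score, cv is the number of folds. So cv = 5 means that in each fold 80% of the data is used for training and 20% for testing. How, then, do you intend to test on that data without some kind of ground-truth labels?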
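The 80%/20% split can be illustrated with a small hand-written sketch. The function kfold_indices and the toy data are assumptions for illustration only; they mimic the idea of consecutive, unshuffled folds rather than reproduce scikit-learn's own splitter.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for consecutive, unshuffled folds."""
    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical toy data: 10 samples, each with a binary label.
X = [[i] for i in range(10)]
y = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1]

splits = list(kfold_indices(len(X), n_splits=5))
for train_idx, test_idx in splits:
    # The held-out labels are the ground truth a scorer would compare predictions to.
    y_test = [y[i] for i in test_idx]
    print(len(train_idx), len(test_idx), y_test)
```

With 10 samples and 5 folds, each iteration trains on 8 samples and tests on 2. The held-out fold is a slice of both X and y, so cv does not replace the target: it only decides which rows of X and y are held out in each round.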

Recommender system using scikit-learn's cross_validate, missing 1 required positional argument: 'y_true'
