My dataset consists of user ID, item ID and rating, as shown below:
user ID item ID rating
1233 1011 4
1220 0999 3
2011 0702 1
...
When I split them into train and test sets:
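# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide train_test_split in sklearn.model_selection.
from sklearn import cross_validation
train, test = cross_validation.train_test_split(df, test_size = 0.2)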
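Are the users in the test set guaranteed to have already appeared in the training set, and likewise the items? If not, what should I do? I couldn't find the answer in the documentation. Can you tell me?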
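For context, train_test_split shuffles and splits rows without looking at the user or item columns, so it gives no such guarantee. A minimal sketch of how one might check this on a given split is shown here; the two toy frames and the column names `user_id` and `item_id` are assumptions made for illustration, not part of the original data.

```python
import pandas as pd

# Toy stand-ins for an existing train/test split; column names are assumed.
train = pd.DataFrame({"user_id": [1233, 1220, 2011, 1233],
                      "item_id": [1011, 999, 702, 702],
                      "rating":  [4, 3, 1, 5]})
test = pd.DataFrame({"user_id": [1233, 3000, 1220],
                     "item_id": [999, 1011, 4242],
                     "rating":  [2, 5, 3]})

# True where the test row's user (or item) also occurs in the training set.
known_user = test["user_id"].isin(train["user_id"])
known_item = test["item_id"].isin(train["item_id"])

# Test rows whose user or item never occurs in the training set.
cold_start = test[~(known_user & known_item)]

# One possible remedy: evaluate only on rows whose user and item are both known.
test_known = test[known_user & known_item]
```

Whether to drop such rows, move them into the training set, or keep them to measure cold-start behaviour depends on what the evaluation is meant to show.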
Answer 0 (score: 0)
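If you want to ensure that your training and test partitions do not contain the same user and item pairings, you can replace each unique (user, item) combination with an integer label and then pass these labels to LabelKFold. To assign an integer label to each unique pairing, you can use this trick: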
import numpy as np
import pandas as pd
# LabelKFold lives in the legacy sklearn.cross_validation module; from
# scikit-learn 0.18 on it is called GroupKFold in sklearn.model_selection.
from sklearn.cross_validation import LabelKFold

df = pd.DataFrame({'users':   [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings': [2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

# View each (user, item) row as one opaque value so that np.unique
# assigns a single integer label to every unique pairing.
users_items = df[['users', 'items']].values
d = np.dtype((np.void, users_items.dtype.itemsize * users_items.shape[1]))
_, uidx = np.unique(np.ascontiguousarray(users_items).view(d), return_inverse=True)

for train, test in LabelKFold(uidx):
    # train your classifier using df.loc[train, ['users', 'items']] and
    # df.loc[train, 'ratings']...
    # cross-validate on df.loc[test, ['users', 'items']] and
    # df.loc[test, 'ratings']...
    pass
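On newer scikit-learn releases the same idea can probably be written with GroupKFold from sklearn.model_selection. The following is only a sketch under that assumption; `pair_id` and `folds` are names introduced here for illustration, and the pair labels are built with pandas' `groupby(...).ngroup()` rather than the NumPy view trick.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

df = pd.DataFrame({'users':   [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings': [2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

# One integer label per unique (user, item) pair (hypothetical helper name).
pair_id = df.groupby(['users', 'items']).ngroup().to_numpy()

folds = []
for train_pos, test_pos in GroupKFold(n_splits=3).split(df, groups=pair_id):
    # train_pos and test_pos are positional indices, so use df.iloc[...] here.
    train_pairs = set(pair_id[train_pos])
    test_pairs = set(pair_id[test_pos])
    folds.append((train_pairs, test_pairs))
```

Each fold keeps all rows of a given (user, item) pair on the same side of the split, which is the property the labels are meant to enforce.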
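I still have a hard time understanding your question. If you want to guarantee that your training and test sets do contain examples of the same user, then you can use StratifiedKFold: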
from sklearn.cross_validation import StratifiedKFold
for train, test in StratifiedKFold(df['users']):
    # ...
    pass
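For a single train/test split rather than folds, a similar effect can be sketched with the `stratify` argument of train_test_split in sklearn.model_selection. This assumes a newer scikit-learn and that every user has at least two rows; it stratifies on users only and says nothing about items.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'users':   [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings': [2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

# Stratifying on the user column keeps each user's share of rows roughly
# equal in both parts, so every user shows up in train and in test.
train, test = train_test_split(df, test_size=0.25,
                               stratify=df['users'], random_state=0)
```

A user with only one rating cannot be placed on both sides, and scikit-learn raises an error in that case, so such users would need to be handled separately.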
Answer 1 (score: 0)
import numpy as np

def train_test_split(self, ratings, train_rate=0.8):
    """
    Split ratings into a training set and a test set, user by user.

    Assumes `ratings` has 'user_id' and 'movie_id' columns and a default
    RangeIndex, so that index labels can be used with .iloc.
    """
    grps = ratings.groupby('user_id').groups
    test_df_index = list()
    train_df_index = list()
    test_iid = set()  # movie IDs already placed in the test set
    for key in grps:
        count = 0
        local_index = list()
        grp = np.array(list(grps[key]))
        n_test = int(len(grp) * (1 - train_rate))
        # First pass: prefer movies that are not in the test set yet.
        for i, index in enumerate(grp):
            if count >= n_test:
                break
            if ratings.iloc[index]['movie_id'] in test_iid:
                continue
            test_iid.add(ratings.iloc[index]['movie_id'])
            test_df_index.append(index)
            local_index.append(i)
            count += 1
        grp = np.delete(grp, local_index)
        # Second pass: if the user's quota is not met, take any remaining rows.
        if count < n_test:
            local_index = list()
            for i, index in enumerate(grp):
                if count >= n_test:
                    break
                test_iid.add(ratings.iloc[index]['movie_id'])
                test_df_index.append(index)
                local_index.append(i)
                count += 1
            grp = np.delete(grp, local_index)
        # Whatever is left for this user goes to the training set.
        train_df_index.append(grp)
    # The per-user arrays have different lengths, so concatenate them
    # instead of building a ragged array.
    test_df_index = np.array(test_df_index, dtype=int)
    train_df_index = np.concatenate(train_df_index)
    np.random.shuffle(test_df_index)
    np.random.shuffle(train_df_index)
    return ratings.iloc[train_df_index], ratings.iloc[test_df_index]
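You can use this method to do the split. I have tried my best to ensure that the training set and the test set share the same user IDs and movie IDs.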
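As a shorter alternative to the loop above (a different technique, not the same algorithm), a per-user holdout can be sketched with pandas' `groupby(...).sample`, available from pandas 1.1. The toy `ratings` frame is made up for illustration, and the step that moves unseen movies back into the training set is one possible policy among several.

```python
import pandas as pd

# Made-up ratings, using the same column names as the function above.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "movie_id": [10, 11, 12, 13, 14, 10, 11, 12, 13, 15],
    "rating":   [4, 3, 5, 2, 1, 5, 4, 3, 2, 1],
})

# Hold out roughly 20% of each user's rows as the test set.
test = ratings.groupby("user_id").sample(frac=0.2, random_state=0)
train = ratings.drop(test.index)

# Move test rows whose movie never occurs in train back into train,
# so that every movie left in test is also present in train.
unseen = ~test["movie_id"].isin(train["movie_id"])
train = pd.concat([train, test[unseen]])
test = test[~unseen]
```

With this policy a user whose held-out rows all get moved back ends up with no test rows, so the test set can be smaller than the requested fraction.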