Question

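I am using the MovieLens dataset (ratings.dat) and a pandas DataFrame to read and process the data. I have to split this data into training and test sets. The pandas DataFrame.sample function can produce a random split, for example: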

train = df.sample(frac=0.8, random_state=200)

test = df.drop(train.index)

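Now I am trying to sort the data by user_id and then by timestamp, and I need to split each user's data 80%-20% between the training and test sets.

So, for example, if user1 rated 10 movies, that user's entries should be sorted by timestamp from oldest to newest: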

import pandas as pd

ratings = pd.read_csv('filename', sep='\t', engine='python', header=0)

# DataFrame.sort has been removed from pandas; sort_values is its replacement
sorted_df = ratings.sort_values(['user_id', 'timestamp'], ascending=[True, True])

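The split should put the 8 entries with the earliest timestamps into the training set and the 2 most recent entries into the test set.

I don't know how to do this. Any suggestions?

Thanks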

Data:

           user_id   item_id   rating   Timestamp 
15              1      539        5  838984068
16              1      586        5  838984068
5               1      355        5  838984474
9               1      370        5  838984596
12              1      466        5  838984679
14              1      520        5  838984679
19              1      594        5  838984679
7               1      362        5  838984885
20              1      616        5  838984941
23              2      260        5  868244562
29              2      733        3  868244562
32              2      786        3  868244562
36              2     1073        3  868244562
33              2      802        2  868244603
38              2     1356        3  868244603
30              2      736        3  868244698
31              2      780        3  868244698
27              2      648        2  868244699

Answer 1

This takes several steps, but it can be done as follows.

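The idea is to rank each user's ratings by timestamp and scale that rank to lie between 0 and 1. Everything at or below 0.8 then goes into your training set, and the rest goes into your test set.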

How do we do that? Creating the ranks is straightforward:

df.groupby('user_id')['Timestamp'].rank(method='first')
Out[51]: 
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9     1.0
10    2.0
11    3.0
12    4.0
13    5.0
14    6.0
15    7.0
16    8.0
17    9.0
Name: Timestamp, dtype: float64

Next, you need a mapping that gives, for every row, the number of values in its group. You can find more information here: Inplace transformation pandas with groupby.

df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
Out[52]: 
0     9
1     9
2     9
3     9
4     9
5     9
6     9
7     9
8     9
9     9
10    9
11    9
12    9
13    9
14    9
15    9
16    9
17    9
Name: user_id, dtype: int64
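
As a side note, the per-row group size can probably also be obtained without the map step, by using groupby(...).transform('size'). The following is a minimal sketch on a small made-up frame (the values are illustrative and are not the MovieLens data); it compares both ways of building the counts:

```python
import pandas as pd

# Tiny illustrative frame (made-up values, not the MovieLens data)
df = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "Timestamp": [10, 20, 30, 5, 7],
})

# Approach from the answer: map each user_id to the length of its group
counts_map = df["user_id"].map(df.groupby("user_id")["Timestamp"].apply(len))

# Alternative: broadcast the group size back onto every row
counts_transform = df.groupby("user_id")["Timestamp"].transform("size")

print(counts_map.tolist())
print(counts_transform.tolist())
```

Both series are aligned with the original index, so either one can be used as the denominator when the ranks are normalized.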

Now you can put everything together:

ranks = df.groupby('user_id')['Timestamp'].rank(method='first')
counts = df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
(ranks / counts) > 0.8
Out[55]: 
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8      True
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16     True
17     True
dtype: bool
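
The boolean mask above still has to be applied to the frame to get the two sets. A minimal sketch of that last step, using the sample rows from the question (the names is_test, train and test are chosen here for illustration and do not come from the answer):

```python
import pandas as pd

# The two users from the question's sample data, already in timestamp order
df = pd.DataFrame({
    "user_id": [1] * 9 + [2] * 9,
    "item_id": [539, 586, 355, 370, 466, 520, 594, 362, 616,
                260, 733, 786, 1073, 802, 1356, 736, 780, 648],
    "rating": [5] * 9 + [5, 3, 3, 3, 2, 3, 3, 3, 2],
    "Timestamp": [838984068, 838984068, 838984474, 838984596, 838984679,
                  838984679, 838984679, 838984885, 838984941,
                  868244562, 868244562, 868244562, 868244562, 868244603,
                  868244603, 868244698, 868244698, 868244699],
})

# Per-user rank by timestamp; method='first' breaks ties by row order
ranks = df.groupby("user_id")["Timestamp"].rank(method="first")

# Number of ratings of the user that each row belongs to
counts = df["user_id"].map(df.groupby("user_id")["Timestamp"].apply(len))

# Rows in the most recent part of each user's history go to the test set
is_test = (ranks / counts) > 0.8

train = df[~is_test]
test = df[is_test]

print(len(train), len(test))
```

With 9 ratings per user, the two most recent ratings of each user end up in test and the other seven in train. Note that under this rule a user with a single rating has rank / count equal to 1.0, so that rating lands in the test set and the user has no training rows; whether that is acceptable depends on the use case.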

Split a dataset per user into training and test sets based on timestamp in Python
