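I'm building a recommendation system, and this question is about preparing the data used to train it.
Take Netflix as an example: a user has interacted with some movies on Netflix; when we (Netflix) recommend movie-X, will he be interested in it?
I'm given a list of each user's interaction history with items. The interaction type (rating_type) includes "view", "share", "bookmark", etc.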
user_id, item_id, rating_type, timestamp
From the data above, I'm creating training data that looks like this:
user_id, prior_item_ids, item_id, target
That is: user_id has interacted with prior_item_ids; when we recommend item_id, will he like it? (target=1 if yes, otherwise target=0)
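To make the target format concrete, here is a minimal sketch of how `prior_item_ids` could be derived for a single user in one pass over a time-ordered history. The function name `rows_with_priors`, the `num_prior` parameter, and the choice to keep only distinct items ordered by most recent interaction are assumptions made for illustration; the source data does not prescribe them.

```python
from collections import OrderedDict


def rows_with_priors(item_ids, num_prior=10):
    """Yield (prior_item_ids, item_id) for each interaction of one user.

    `item_ids` must be ordered oldest first. The priors are the most
    recent `num_prior` distinct items seen strictly before the current one.
    """
    recent = OrderedDict()  # item_id -> None, ordered by last time seen
    for item_id in item_ids:
        yield list(recent), item_id
        # Move the item to the "most recent" end of the window.
        recent.pop(item_id, None)
        recent[item_id] = None
        # Keep the window bounded so each step costs O(num_prior).
        while len(recent) > num_prior:
            recent.popitem(last=False)


# Hypothetical history for one user, oldest first.
history = ["a", "b", "a", "c"]
rows = list(rows_with_priors(history, num_prior=2))
```

Each yielded pair corresponds to one candidate row (prior_item_ids, item_id); user_id and target would be attached by the caller.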
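I'm creating the data as follows; the code is also given below. It takes a very long time, so I'd like to know whether there is a better strategy or a better implementation of my strategy.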
For each positive rating:
  I make one positive training example by finding the prior ratings:
    * user_id, item_id (of the positive rating), prior_ids, target=1
  I make 4 negative training examples as well:
    I randomly select 4 negative ratings which happened before the positive rating.
    I make sure each one is truly negative by ensuring the user didn't give it a positive rating (share/bookmark) afterwards (the item is not included in the next 10 positive ratings).
    For each negative rating, I find its prior ratings, so I have 4 of the following:
    * user_id, item_id (of the negative rating), prior_ids, target=0
If a user has no positive rating, I build one negative training example.
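The steps above can be sketched without pandas for a single user. This is a minimal sketch under stated assumptions, not the implementation in this post: `build_examples`, the `(item_id, is_positive)` tuple layout, the `lookahead` parameter, and using the most recent event for a user with no positives are all hypothetical choices. It also swaps the "draw and retry" idea for sampling without replacement from an explicit list of valid candidates, so a positive item can never be emitted with target=0.

```python
import random


def build_examples(events, num_negative=4, lookahead=10, rng=random):
    """Return (item_id, event_index, target) tuples for one user.

    `events` is that user's history as (item_id, is_positive) tuples,
    ordered oldest first. `event_index` lets the caller look up priors.
    """
    if not events:
        return []
    positive_idx = [i for i, (_, is_pos) in enumerate(events) if is_pos]
    if not positive_idx:
        # Assumption: for a user with no positive rating, use the most
        # recent event as the single negative example.
        last = len(events) - 1
        return [(events[last][0], last, 0)]

    examples = []
    for k, i in enumerate(positive_idx):
        examples.append((events[i][0], i, 1))
        # Items rated positively here or in the following positives
        # (`lookahead` positives in total, counting this one).
        blocked = {events[j][0] for j in positive_idx[k:k + lookahead]}
        # Negative ratings before the positive whose item is not blocked.
        candidates = [j for j in range(i)
                      if not events[j][1] and events[j][0] not in blocked]
        picked = rng.sample(candidates, min(num_negative, len(candidates)))
        for j in picked:
            examples.append((events[j][0], j, 0))
    return examples


# Hypothetical history: "b" is viewed first and shared later.
events = [("a", False), ("b", False), ("c", True), ("b", True), ("d", False)]
examples = build_examples(events)
```

Here "b" is never used as a negative for "c", because the user rates it positively right afterwards. Passing `rng=random.Random(0)` makes the sampling reproducible.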
Here is my implementation; it takes a very long time.
import random

import pandas as pd

NEGATIVE_TYPES = [20, 90]  # rating_type codes that count as negative


class Ranking(object):

    def __init__(self):
        self.num_prior = 10

    def prepare_rating_data(self, file_path):
        self.data = pd.read_csv(
            file_path, dtype={'review_meta_id': object, 'user_id': object}
        ).sort_values('timestamp')
        df = self.data
        df.dropna(subset=['review_meta_id', 'user_id'], inplace=True)
        num_prior = self.num_prior
        results = []

        # Oldest first within each user, so rows before `index` are the
        # ratings that happened earlier (the original sorted newest first).
        for user_id, group in df.sort_values(
                ['user_id', 'timestamp'], ascending=[True, True]
        ).groupby('user_id'):
            group = group.reset_index(drop=True)
            positive = None
            for index, row in group.iterrows():
                if row.rating_type in NEGATIVE_TYPES:
                    continue
                positive = row
                results.append({
                    'user_id': user_id,
                    'review_meta_id': positive.review_meta_id,
                    'prior_ids': ','.join(self.get_priors(group, index)),
                    'target': 1
                })

                # Items in this and the following positive ratings (10 in
                # total) must not be used as negatives.
                positives = group[
                    (group.index >= index)
                    & (~group.rating_type.isin(NEGATIVE_TYPES))
                ].head(10)
                # Use a set: `x in series` tests the index, not the values.
                positive_ids = set(positives.review_meta_id)

                num_negative = 4
                try_count = 5
                for i in range(num_negative):
                    negative = None
                    # One initial draw plus up to `try_count` retries.
                    for _ in range(1 + try_count):
                        index_sample = random.randrange(index + 1)
                        sample = group.iloc[index_sample]
                        if (sample.rating_type in NEGATIVE_TYPES
                                and sample.review_meta_id not in positive_ids):
                            negative = sample
                            break
                    if negative is None:
                        # No valid negative found; skip rather than emit
                        # a positive item with target=0.
                        continue
                    results.append({
                        'user_id': user_id,
                        'review_meta_id': negative.review_meta_id,
                        'prior_ids': ','.join(self.get_priors(group, index_sample)),
                        'target': 0
                    })

            if positive is None:
                # User has no positive rating: build one negative example.
                group = group.drop_duplicates(
                    subset=['user_id', 'review_meta_id'])
                n = min(len(group), num_prior + 1)
                group = group.sample(n)
                results.append({
                    'user_id': user_id,
                    'review_meta_id': group['review_meta_id'].iloc[-1],
                    'prior_ids': ','.join(group.review_meta_id.iloc[:-1]),
                    'target': 0
                })

        df_result = pd.DataFrame(results, columns=['review_meta_id', 'prior_ids', 'target', 'user_id'])
        return self.apply_prior_ids_pad(df_result)

    def get_priors(self, group, index):
        # The last `num_prior` distinct items rated strictly before `index`.
        priors = group.review_meta_id.iloc[:index]
        return priors.drop_duplicates(keep='last').tail(self.num_prior)

    def apply_prior_ids_pad(self, df):
        def pad(x):
            # ''.split(',') returns [''], so drop empty entries first.
            result = [p for p in x.strip().split(',') if p]
            return result + ['0'] * (self.num_prior - len(result))
        df['prior_ids'] = df['prior_ids'].apply(pad)
        return df
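The padding step has one edge case worth spelling out: in Python, `''.split(',')` returns `['']`, so a row with no priors would be padded to `['', '0', ...]` unless empty entries are dropped first. A minimal standalone sketch of the padding follows; the name `pad_prior_ids` and the `fill` parameter are assumptions for illustration.

```python
def pad_prior_ids(joined, num_prior=10, fill="0"):
    """Split a comma-joined id string and right-pad it to `num_prior`."""
    # Drop empty parts so an empty input yields no ids at all.
    ids = [part for part in joined.strip().split(",") if part]
    # A negative repeat count gives an empty list, so longer inputs
    # are returned as-is rather than truncated.
    return ids + [fill] * (num_prior - len(ids))


padded = pad_prior_ids("12,7", num_prior=4)
empty = pad_prior_ids("", num_prior=3)
```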
I've made a git repo with the code/data: https://github.com/littlehome-eugene/data