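pandas: preparing data with negative sampling

Time: 2019-04-06 16:52:43

Tags: python pandas

I am building a recommender system, and this question is about preparing the training data for it.

Take Netflix as an example: a user has interacted with movies on Netflix; when we (Netflix) recommend movie-X, will he be interested?

I am given a list of each user's interaction history with items. The interaction types (rating_type) include "view", "share", "bookmark", etc.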

user_id, item_id, rating_type, timestamp

From the data above, I am creating training data that looks like this:

user_id, prior_item_ids, item_id, target

user_id has interacted with prior_item_ids; when we recommend item_id, will he like it? (target=1 if yes, otherwise target=0)
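To make the target format concrete, here is a minimal sketch that turns one row of an interaction log into one training row. The column names follow the question; the toy data, the `POSITIVE_TYPES` set, and the helper `make_example` are assumptions made up for illustration.

```python
import pandas as pd

# Toy interaction log; the values are made up for illustration.
log = pd.DataFrame({
    "user_id": ["u1"] * 4,
    "item_id": ["a", "b", "c", "d"],
    "rating_type": ["view", "view", "share", "view"],
    "timestamp": [1, 2, 3, 4],
})

# Assumption: which interaction types count as positive.
POSITIVE_TYPES = {"share", "bookmark"}


def make_example(history, position, num_prior=10):
    """Build one training row for the interaction at `position` (oldest first)."""
    row = history.iloc[position]
    # The items the user touched just before this interaction.
    priors = history["item_id"].iloc[max(0, position - num_prior):position]
    return {
        "user_id": row["user_id"],
        "prior_item_ids": list(priors),
        "item_id": row["item_id"],
        "target": int(row["rating_type"] in POSITIVE_TYPES),
    }


history = log.sort_values("timestamp").reset_index(drop=True)
example = make_example(history, 2)
```

Here `example` describes the "share" of item c, with items a and b as its prior items.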

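I am creating the data as described below; the code is given below as well. It takes a very long time, and I wonder whether there is a better strategy, or a better implementation of my strategy.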

for each positive rating

  I make one positive training example
    by finding the prior ratings:
    * user_id, item_id (of the positive rating), prior_ids, target=1

  I make 4 negative training examples as well
    I randomly select 4 negative ratings which happened before the positive rating
    I make sure each one is truly negative by ensuring the user didn't give the item a positive rating (share/bookmark) afterwards (the given item is not included in the next 10 positive ratings)
    for each negative rating, find the prior ratings

    I have 4 of the following
    * user_id, item_id (of the negative rating), prior_ids, target=0

if the user has no positive rating at all, we build one negative training example
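The negative-selection step above can be sketched as a filter over one user's history rather than a retry loop. This is only a sketch under stated assumptions: the helper `sample_negative_positions`, the `item_id` column name, and the toy data are hypothetical; the negative codes 20 and 90 are taken from the code in this question. Unlike the implementation in this question, it samples without replacement and never returns a pick that fails the check.

```python
import random

import pandas as pd

NEGATIVE_TYPES = [20, 90]  # the codes treated as negative in the question's code


def sample_negative_positions(history, pos, num_negative=4, lookahead=10, rng=random):
    """Return positions of up to `num_negative` negative ratings before `pos`.

    `history` is one user's ratings, oldest first, with a default 0..n-1 index.
    """
    idx = history.index.to_series()
    is_neg = history["rating_type"].isin(NEGATIVE_TYPES)
    # Items in the next `lookahead` positive ratings, starting at `pos` itself.
    upcoming = history.loc[(idx >= pos) & ~is_neg, "item_id"].head(lookahead)
    # Earlier negative ratings whose item is not rated positively soon after.
    ok = (idx < pos) & is_neg & ~history["item_id"].isin(upcoming)
    candidates = idx[ok].tolist()
    return sorted(rng.sample(candidates, min(num_negative, len(candidates))))


# Made-up ratings for a single user.
history = pd.DataFrame({
    "item_id": ["a", "b", "c", "b", "d", "e"],
    "rating_type": [20, 20, 90, 30, 20, 30],
})
negatives = sample_negative_positions(history, pos=3)
```

In this toy history the positive rating at position 3 is for item b, so the earlier negative rating of b at position 1 is excluded and only positions 0 and 2 remain as candidates.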

Here is my implementation, which takes a long time.

import random

import pandas as pd


class Ranking(object):
    def __init__(self):
        self.num_prior = 10

    def prepare_rating_data(self, file_path):

        self.data = pd.read_csv(file_path, dtype={'review_meta_id': object, 'user_id': object}).sort_values('timestamp')
        df = self.data
        df.dropna(subset=['review_meta_id', 'user_id'], inplace=True)

        num_prior = self.num_prior
        results = []
        # Oldest first within each user, so positions before `index` are prior ratings.
        for user_id, group in df.sort_values(
            ['user_id', 'timestamp'], ascending=[True, True]
        ).groupby('user_id'):
          group = group.reset_index()
          positive = None

          for index, row in group.iterrows():
              # Rating types 20 and 90 are negative; every other type is positive.
              if row.rating_type not in [20, 90]:
                positive = row

                low = max(0, index - num_prior)
                priors = group.drop_duplicates(subset=['user_id', 'review_meta_id'])[low:index]

                result_positive_dict = {
                  'user_id': user_id,
                  'review_meta_id': positive.review_meta_id,
                  'prior_ids': ','.join(priors.review_meta_id),
                  'target': 1
                }
                results.append(result_positive_dict)
                # 20, 90 = negative
                positives = group[(group.index>=index) & (~group.rating_type.isin([20, 90]))][:10]
                num_negative = 4

                for i in range(num_negative):
                  index_sample = random.sample(range(index+1), 1)[0]
                  sample = group.iloc[index_sample]

                  low = max(0, index_sample - num_prior)

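                  # Resample up to try_count times while the pick is not a negative
                  # rating or its item is among the upcoming positive ratings.
                  try_count = 5
                  for _ in range(try_count):
                      # `.values` tests the item ids; `in` on a Series tests its index.
                      if sample.rating_type not in [20, 90] or sample.review_meta_id in positives.review_meta_id.values:
                        index_sample = random.sample(range(index+1), 1)[0]
                        sample = group.iloc[index_sample]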

                  low = max(0, index_sample - num_prior)
                  priors = group.drop_duplicates(subset=['user_id', 'review_meta_id'])[low:index_sample]

                  negative = sample
                  result_negative_dict = {
                    'user_id': user_id,
                    'review_meta_id': negative.review_meta_id,
                    'prior_ids': ','.join(priors.review_meta_id),
                    'target': 0
                  }

                  results.append(result_negative_dict)

          if positive is None:
            group = group.drop_duplicates(
              subset=['user_id', 'review_meta_id'])
            n = min(len(group), num_prior + 1)
            group = group.sample(n)

            result_negative_dict = {
                'user_id': user_id,
                'review_meta_id': group.tail(1)['review_meta_id'].iloc[0],
                'prior_ids': ','.join(group.review_meta_id[:-1]),
                'target': 0
              }

            results.append(result_negative_dict)

        df_result = pd.DataFrame(results, columns=['review_meta_id', 'prior_ids', 'target', 'user_id'])

        # apply_prior_ids_pad modifies df_result in place and returns it.
        return self.apply_prior_ids_pad(df_result)

    def apply_prior_ids_pad(self, df):
      def pad(x):
        x = x.strip()
        # ''.split(',') returns [''], so handle the empty string explicitly.
        result = x.split(',') if x else []
        result = result + ['0'] * (self.num_prior - len(result))

        return result
      df['prior_ids'] = df['prior_ids'].apply(pad)

      return df
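
The padding step in apply_prior_ids_pad can be tried in isolation. The following is a minimal standalone sketch, not part of the class above; the function name `pad_prior_ids` and its `fill` parameter are hypothetical. It treats the empty string explicitly, because `''.split(',')` returns `['']` rather than an empty list.

```python
def pad_prior_ids(prior_ids, num_prior=10, fill="0"):
    """Split a comma-separated id string and right-pad it to `num_prior` entries."""
    prior_ids = prior_ids.strip()
    ids = prior_ids.split(",") if prior_ids else []
    # A non-positive repeat count yields no padding, so long lists are kept as-is.
    return ids + [fill] * (num_prior - len(ids))


padded = pad_prior_ids("12,34", num_prior=4)
empty = pad_prior_ids("", num_prior=3)
```

With these inputs, `padded` holds the two ids followed by two fill values, and `empty` holds three fill values.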

I have made a git repo with the code/data: https://github.com/littlehome-eugene/data

0 answers:

No answers yet