How to sample negative items from data

Time: 2019-03-26 09:42:24

Tags: python pandas numpy sampling

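I have spent a whole day trying to understand the code below, and it still makes little sense to me.

Can someone explain what it is doing?

Some context: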

  

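We are given rating records of the form (user-id, item-id, rating). For each record (a positive record), we are trying to obtain a few (four) negative items, negative in the sense that the user has not rated them.

The code below is supposed to be picking the negative items, but it is hard to follow, and the comments do not help much.

The most confusing parts are self._total_negatives and left_index = self.index_bounds[negative_users]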
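To make the goal concrete, here is a minimal sketch of the naive way to pick negatives for one user by rejection sampling. It is not the code from the repository; the function name `sample_negatives` and its parameters are hypothetical, and it assumes item ids are the integers 0 to num_items - 1.

```python
import random


def sample_negatives(rated_items, num_items, k=4, rng=None):
    """Draw k item ids in [0, num_items) that the user has not rated.

    Hypothetical helper for illustration only. Each draw is independent,
    so the same negative item may be returned more than once.
    """
    rng = rng or random.Random()
    rated = set(rated_items)
    if len(rated) >= num_items:
        raise ValueError("the user has rated every item")

    negatives = []
    while len(negatives) < k:
        candidate = rng.randrange(num_items)
        # Reject candidates the user has already rated and draw again.
        if candidate not in rated:
            negatives.append(candidate)
    return negatives


# One positive record for a user who rated items 1, 3 and 4 out of 10 items.
print(sample_negatives([1, 3, 4], num_items=10, k=4, rng=random.Random(0)))
```

The retry loop gets slow when a user has rated most of the items, which is presumably why the class below avoids rejection and computes the chosen negative directly.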

 class BisectionDataConstructor(BaseDataConstructor):
   """Use bisection to index within positive examples.

   This class tallies the number of negative items which appear before each
   positive item for a user. This means that in order to select the ith negative
   item for a user, it only needs to determine which two positive items bound
   it at which point the item id for the ith negative is a simply algebraic
   expression.
   """

   def _index_segment(self, user):

     lower, upper = self.index_bounds[user:user+2]
     items = self._sorted_train_pos_items[lower:upper]

     negatives_since_last_positive = np.concatenate(
       [items[0][np.newaxis], items[1:] - items[:-1] - 1])

     return np.cumsum(negatives_since_last_positive)

   def construct_lookup_variables(self):

     inner_bounds = np.argwhere(self._train_pos_users[1:] -
                                self._train_pos_users[:-1])[:, 0] + 1
     (upper_bound,) = self._train_pos_users.shape
     self.index_bounds = np.array([0] + inner_bounds.tolist() + [upper_bound])

     # Later logic will assume that the users are in sequential ascending order.
     assert np.array_equal(self._train_pos_users[self.index_bounds[:-1]],
                           np.arange(self._num_users))

     self._sorted_train_pos_items = self._train_pos_items.copy()

     for i in range(self._num_users):
       lower, upper = self.index_bounds[i:i+2]
       self._sorted_train_pos_items[lower:upper].sort()

     self._total_negatives = np.concatenate([
         self._index_segment(i) for i in range(self._num_users)])

   def lookup_negative_items(self, negative_users, **kwargs):

     output = np.zeros(shape=negative_users.shape, dtype=rconst.ITEM_DTYPE) - 1

     left_index = self.index_bounds[negative_users]
     right_index = self.index_bounds[negative_users + 1] - 1

     num_positives = right_index - left_index + 1
     num_negatives = self._num_items - num_positives
     neg_item_choice = stat_utils.very_slightly_biased_randint(num_negatives)

     use_shortcut = neg_item_choice >= self._total_negatives[right_index]
     output[use_shortcut] = (
         self._sorted_train_pos_items[right_index] + 1 +
         (neg_item_choice - self._total_negatives[right_index])
     )[use_shortcut]

     if np.all(use_shortcut):
       # The bisection code is ill-posed when there are no elements.
       return output

From https://github.com/tensorflow/models/blob/master/official/recommendation/data_pipeline.py

When train_pos_users = np.array(
     [0,0,  1,1,1,   2,2,2,  3,3,3,3,3,3,  4,4])

we get self.index_bounds = array([ 0,  2,  5,  8, 14, 16])
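The index_bounds values can be reproduced in isolation. The sketch below copies the two lines from construct_lookup_variables into a standalone script; the helper `user_slice` is a hypothetical name added only to show how the bounds are used, and the user array is assumed to be sorted by user id.

```python
import numpy as np

train_pos_users = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4])

# Positions where the user id changes, i.e. the first row of each new user.
inner_bounds = np.argwhere(train_pos_users[1:] - train_pos_users[:-1])[:, 0] + 1
index_bounds = np.array([0] + inner_bounds.tolist() + [train_pos_users.shape[0]])


def user_slice(user):
    """Return (lower, upper) so that rows lower:upper belong to `user`."""
    lower, upper = index_bounds[user:user + 2]
    return int(lower), int(upper)


print(index_bounds.tolist())  # [0, 2, 5, 8, 14, 16]
print(user_slice(3))          # (8, 14): user 3 owns rows 8 through 13
```

So index_bounds[u] is the first row of user u and index_bounds[u + 1] is one past the last row, which is how left_index and right_index are derived in lookup_negative_items.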

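If you are familiar with this code, or know of a description online of what it does, I could really use it to understand what is going on. I tried Googling "bisection negative sampling" but found nothing.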

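  • Edit

So bisection means halving, which is similar to binary search.

I do not think the code does what it is intended to do, and I have opened a GitHub issue.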
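Here is a minimal sketch of how I read the docstring, for a single user. It is not the repository's implementation: the function names are hypothetical, it uses the standard-library bisect module instead of the vectorized NumPy code, and it assumes item ids are the integers 0 to num_items - 1.

```python
from bisect import bisect_right


def negatives_before_each_positive(sorted_pos_items):
    """cum[j] = number of unrated item ids smaller than the j-th positive.

    The j-th positive has id p_j and exactly j positives precede it, so
    p_j - j negatives precede it. This equals the cumulative sum of the
    gaps between consecutive positives that _index_segment computes.
    """
    return [item - j for j, item in enumerate(sorted_pos_items)]


def ith_negative(sorted_pos_items, i):
    """Return the id of the i-th (0-based) item the user has not rated."""
    cum = negatives_before_each_positive(sorted_pos_items)
    # Binary search: j is the number of positives that come before the
    # i-th negative, so the negative's id is shifted up by j.
    j = bisect_right(cum, i)
    return i + j


pos_items = [2, 3, 7]   # items the user rated, sorted
num_items = 10          # negatives are therefore 0, 1, 4, 5, 6, 8, 9
num_negatives = num_items - len(pos_items)
print([ith_negative(pos_items, i) for i in range(num_negatives)])
# [0, 1, 4, 5, 6, 8, 9]
```

Under this reading, drawing a uniform random i in [0, num_negatives) and mapping it through ith_negative gives a uniform negative without any rejection loop. The use_shortcut branch in the quoted code looks like the special case where i lands after the last positive, so no search is needed.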

https://github.com/tensorflow/models/issues/6441

0 answers:

No answers yet