Question

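I have a dataset that I need to filter down to "unique" events. Basically, I want to remove every row where the same user buys the same product more than once on the same day, regardless of the device variable. When there are multiple occurrences, I keep only the first row.

Data: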

datetime, device, product, user

  [
  ['2013-07-08 15:00:00', 'pc',       'X',        'A'],
  ['2013-07-09 17:00:00', 'pc',       'X',        'A'],
  ['2013-07-09 10:00:00', 'andr',     'Y',        'B'],
  ['2013-07-10 18:00:00', 'pc',       'Y',        'B'],
  ['2013-07-10 21:00:00', 'ipho',     'Y',        'B'],       <- second occurrence of B getting Y that day
  ['2013-07-10 22:00:00', 'andr',     'Y',        'B'],       <- third occurrence of B getting Y that day
  ['2013-07-10 02:00:00', 'ipho',     'Z',        'C'],
  ['2013-07-10 11:00:00', 'pc',       'Z',        'C']        <- second occurrence of C getting Z that day
  ]

It should be filtered to:

  ['2013-07-08 15:00:00', 'pc',       'X',        'A'],
  ['2013-07-09 17:00:00', 'pc',       'X',        'A'],
  ['2013-07-09 10:00:00', 'andr',     'Y',        'B'],
  ['2013-07-10 18:00:00', 'pc',       'Y',        'B'],
  ['2013-07-10 02:00:00', 'ipho',     'Z',        'C']

How would I do this?

Answer 1

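Strip the time portion from the datetime, then store each item in a dictionary if it is not already there. Use a tuple of (date, product, user) as the dictionary key.

E.g.: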

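 d = {}  # maps (date, product, user) -> first row seen for that key
 for datetime, device, product, user in table:
     date = datetime[:10]  # 'YYYY-MM-DD' prefix, i.e. the time part is dropped
     if (date, product, user) not in d:
         d[(date, product, user)] = [datetime, device, product, user]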
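
If you want the result back as a list in the original row order, a variation on the same idea is to track the keys in a set and append the kept rows to a list. This is only a sketch: the function name `filter_first_per_day` is made up for illustration, and it assumes every timestamp is a string starting with `YYYY-MM-DD`, as in the question's data.

```python
def filter_first_per_day(rows):
    """Keep the first row seen for each (date, product, user); device is ignored."""
    seen = set()   # keys already encountered
    kept = []      # surviving rows, in their original order
    for timestamp, device, product, user in rows:
        key = (timestamp[:10], product, user)  # first 10 chars are the date
        if key not in seen:
            seen.add(key)
            kept.append([timestamp, device, product, user])
    return kept


table = [
    ['2013-07-08 15:00:00', 'pc',   'X', 'A'],
    ['2013-07-09 17:00:00', 'pc',   'X', 'A'],
    ['2013-07-09 10:00:00', 'andr', 'Y', 'B'],
    ['2013-07-10 18:00:00', 'pc',   'Y', 'B'],
    ['2013-07-10 21:00:00', 'ipho', 'Y', 'B'],
    ['2013-07-10 22:00:00', 'andr', 'Y', 'B'],
    ['2013-07-10 02:00:00', 'ipho', 'Z', 'C'],
    ['2013-07-10 11:00:00', 'pc',   'Z', 'C'],
]

filtered = filter_first_per_day(table)
for row in filtered:
    print(row)
```

Note that "first" here means first in list order, not earliest in time. If the rows may be out of order and you want the earliest purchase of the day, sort the list by timestamp before filtering (for example `filter_first_per_day(sorted(table))`, which works because this timestamp format sorts chronologically as plain text).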
