Question

I have a dataset of the following form:

user_id, date, other_columns
1, 2017-03-10, ...
2, 2017-03-10, ...
3, 2017-03-10, ...
...

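I need to do the following: for each row in the dataset, I want to generate a new row that contains the current row plus a random subset of N rows from the same date that belong to different users, like this: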

row, other_rows
{'user_id': 1, 'date': '2017-03-10', ...}, [{'user_id': 2,...},...]
{'user_id': 2, 'date': '2017-03-10', ...}, [{'user_id': 1,...},...]
...

I have implemented it as follows, but it is very slow for large datasets when run in the cloud.

(dataset
 | 'map-to-date' >> beam.Map(lambda x: (x['date'], x))
 | 'group-by-date' >> beam.GroupByKey()
 | 'generate-output' >> beam.ParDo(GenerateOutputRows()))

where GenerateOutputRows is defined as:

class GenerateOutputRows(beam.DoFn):
    def process(self, element):
        (date, rows) = element
        for r in rows:
            other_users_rows = list(filter(lambda x: x['user_id'] != r['user_id'],
                                           rows))
            yield (r, random.sample(other_users_rows, N))

Can you think of another, more efficient way to get the desired result?

Answer 1

I don't immediately see an algorithmically more efficient way to do this unless you can make some simplifying assumptions, for example:

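Is user_id unique within one date, or can there be multiple rows for the same user?
If there can be multiple rows per user, do the samples of other_users_rows for the same user have to be statistically independent? If not, you can use some caching and reuse the same sample for all rows with the same user_id.
Do the samples of other_users_rows for different users have to be statistically independent, or is it fine to reuse exactly the same sample for user B if the sample drawn for user A does not include user B?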

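In general, this is an algorithmic question rather than a Cloud Dataflow / Apache Beam question: the bottleneck in your code is the O(rows.size()^2) loop in GenerateOutputRows, which Beam cannot speed up automatically. I suggest heading to http://cstheory.stackexchange.com for advice on a more efficient algorithm.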
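As one possible direction, here is a minimal sketch (not a tested Beam solution) of how the per-row filter could be avoided by drawing random indices and rejecting rows of the same user. The function name `sample_other_users`, the `rng` parameter, and the fallback thresholds are hypothetical choices made for illustration. It also assumes that returning fewer than N rows is acceptable when a date does not have enough rows from other users, whereas the code in the question would raise a ValueError in that case.

```python
import random
from collections import Counter


def sample_other_users(rows, n, rng=random):
    """Yield (row, sample) pairs; sample holds up to n rows of other users."""
    rows = list(rows)  # materialize once so rows can be indexed
    per_user = Counter(r['user_id'] for r in rows)
    total = len(rows)
    for r in rows:
        uid = r['user_id']
        eligible = total - per_user[uid]  # rows belonging to other users
        k = min(n, eligible)  # assumption: shrink the sample if needed
        if eligible < 2 * k or 2 * eligible < total:
            # Few eligible rows: rejection would retry often, so filter.
            others = [x for x in rows if x['user_id'] != uid]
            yield r, rng.sample(others, k)
            continue
        # Rejection sampling: draw indices until k distinct valid ones
        # are found. Each draw succeeds with probability >= 1/4 here.
        chosen = set()
        while len(chosen) < k:
            i = rng.randrange(total)
            if rows[i]['user_id'] != uid:
                chosen.add(i)
        yield r, [rows[i] for i in chosen]
```

When most rows of a date belong to other users and N is small relative to the group, the expected work per row is proportional to N instead of the group size. Inside `GenerateOutputRows.process`, the loop could then presumably be replaced by `yield from sample_other_users(rows, N)`. The samples for different rows remain independent, but the order of rows within a sample is not randomized.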
