Multiple cartesian joins in pySpark

Asked: 2014-12-11 23:35:10

Tags: hadoop apache-spark

I'm running into memory errors when doing multiple cartesian joins, even though the data sets involved are really small. Can anyone explain why this happens?

In [1]: foo = sc.records([{'foo': 123}, {'foo': 321}])
In [2]: bar = sc.records([{'bar': 123}, {'bar': 321}])
In [3]: baz = sc.records([{'baz': 123}, {'baz': 321}])
In [4]: qux = foo.cartesian(bar)\
   ...:          .map(lambda (x,y): x.merge(y))\
   ...:          .cartesian(baz)\
   ...:          .map(lambda (x,y): x.merge(y))
In [5]: qux.collect()

java.lang.OutOfMemoryError: GC overhead limit exceeded

1 Answer:

Answer 0 (score: 0)

I ended up defining my own cartesianJoin function:

def cartesianJoin(self, other):
    # Tag every record on both sides with the same constant key (0), join on
    # that key so each left record pairs with each right record, then merge
    # each matched pair into a single record.
    return (self.map(lambda rec: (0, rec))
                .join(other.map(lambda rec: (0, rec)))
                .map(lambda (key, (x, y)): x.merge(y)))
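
For reference, here is a minimal, self-contained sketch of the same constant-key join trick using only the standard PySpark API. The `records()` and `merge()` helpers in the original post appear to be custom; this sketch assumes plain Python dicts stand in for them, with `sc.parallelize` building the RDDs and dict unpacking doing the merge:

from pyspark import SparkContext

sc = SparkContext("local", "cartesian-join-demo")

def cartesian_join(left, right):
    # Key every record on both sides with the same dummy value (0), so the
    # shuffle-based join pairs every left record with every right record,
    # then merge each (left, right) pair of dicts into one dict.
    return (left.map(lambda rec: (0, rec))
                .join(right.map(lambda rec: (0, rec)))
                .map(lambda kv: {**kv[1][0], **kv[1][1]}))

foo = sc.parallelize([{'foo': 123}, {'foo': 321}])
bar = sc.parallelize([{'bar': 123}, {'bar': 321}])

print(cartesian_join(foo, bar).collect())
# 2 x 2 = 4 merged records, e.g. {'foo': 123, 'bar': 123}, ...

Note the trade-off: because every record carries the same key, the entire product flows through a single join key (and hence a single partition), so this workaround is fine for small data sets like the one above but will not scale to large inputs.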