Question

I am trying to join two DataFrames.

data: DataFrame[_1: bigint, _2: vector]

cluster: DataFrame[cluster: bigint]

result = data.join(broadcast(cluster))

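Strangely, all executors fail at the join step.

I don't know what else I can do.

The data file on HDFS is 2.8 GB, and the cluster data is only 5 MB. Both are read as Parquet files.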

Answer 1

This is what works:

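# Named monotonicallyIncreasingId before Spark 1.6.
from pyspark.sql.functions import monotonically_increasing_id

# Tag every row with a generated ID, then join the two DataFrames on it.
data = sqlContext.read.parquet(data_path)
data = data.withColumn("id", monotonically_increasing_id())

cluster = sqlContext.read.parquet(cluster_path)
cluster = cluster.withColumn("id", monotonically_increasing_id())

result = data.join(cluster, on="id")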
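
The join on `id` only pairs the right rows if both DataFrames receive identical IDs for rows in the same position. According to the Spark documentation, the generated ID places the partition index in the upper 31 bits and the row's position within its partition in the lower 33 bits, so the IDs depend on how each DataFrame is partitioned. The sketch below is a pure-Python model of that scheme, not Spark code; `model_ids` and the partition sizes are hypothetical and chosen only for illustration.

```python
# Pure-Python model of the documented ID layout:
# (partition index << 33) + row position within the partition.
# This only illustrates the scheme; it does not call Spark.

def model_ids(partition_sizes):
    """Return the IDs that rows would get for the given partition sizes."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for row_in_partition in range(size):
            ids.append((partition_index << 33) + row_in_partition)
    return ids

# Hypothetical layouts: each DataFrame holds 6 rows.
data_ids = model_ids([3, 3])          # 2 partitions of 3 rows
cluster_same = model_ids([3, 3])      # identical partition layout
cluster_other = model_ids([2, 2, 2])  # same row count, different layout

# Number of IDs an inner join on "id" would be able to match.
matched_same = len(set(data_ids) & set(cluster_same))
matched_other = len(set(data_ids) & set(cluster_other))
```

With identical layouts all six IDs match. With a different layout only four do, and some of those pair up rows from different positions. If the layouts cannot be guaranteed to agree, building an index with `rdd.zipWithIndex()` is a commonly used alternative.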

Adding the cluster DataFrame directly to the data DataFrame:

data.withColumn("cluster", cluster.cluster)

does not work.

data.join(cluster)

does not work either; the executors fail even though they have enough memory.

I don't know why it doesn't work...
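
A possible explanation for the failure, offered as an assumption rather than something confirmed in this thread: in Spark 1.x, calling `join` without a join condition produces a Cartesian product, so every row of `data` is combined with every row of `cluster`. The following pure-Python sketch, using made-up rows, contrasts that with the row-by-row pairing the question seems to want.

```python
from itertools import product

# Made-up stand-ins for the two DataFrames (4 rows each).
data_rows = ["d0", "d1", "d2", "d3"]
cluster_rows = [0, 1, 0, 2]

# A join with no condition behaves like a Cartesian product:
# the output has len(data_rows) * len(cluster_rows) rows.
cross = list(product(data_rows, cluster_rows))

# Row-by-row pairing keeps one output row per input row.
paired = list(zip(data_rows, cluster_rows))

cross_count = len(cross)
paired_count = len(paired)
```

The product grows with the row counts of both inputs multiplied together, which can exhaust executors even when each input fits in memory on its own.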
