DataFrame / Dataset join does not produce correct results in Spark 2.0 on YARN

Time: 2016-10-06 14:53:52

Tags: apache-spark apache-spark-sql apache-spark-dataset

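We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, CentOS 7.2. We wrote some new code using the Spark DataFrame / Dataset API, but noticed incorrect join results after writing the data to Windows Azure Storage Blob (the default HDFS location) and reading it back. I have been able to reproduce the issue with the following snippet, run on the cluster.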

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show

Output

+-----+---------+-----+                                                         
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|        0|      1|  1.0|
+-----+---------+-----+---------+-------+-----+

This is correct. However, after writing the data out and reading it back, we see this:

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show

Output

+-----+---------+-----+                                                         
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|     null|   null| null|
+-----+---------+-----+---------+-------+-----+

However, using the RDD API produces the correct result:

dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => (row.dimension, row) ) ).take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0))))

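We have tried changing the output format from Parquet to ORC, but we see the same result. Running Spark 2.0 locally, rather than on the cluster, does not show the problem. Running Spark in local mode on the master node of the Hadoop cluster also works. We only see the issue when running on YARN.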

This looks very similar to this issue: https://issues.apache.org/jira/browse/SPARK-10896
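
One way to narrow down a discrepancy like this is to look at the physical plan Spark picks for the join and to count rows that an inner join should never produce. The sketch below reuses the case classes and the /tmp paths from the snippets above; the object name JoinCheck and the app name are made up for illustration. Disabling broadcast joins via spark.sql.autoBroadcastJoinThreshold is only a guess at what might matter here; nothing in the post confirms that the join strategy is the cause, so treat the second half as an experiment rather than a fix.

```scala
import org.apache.spark.sql.SparkSession

// Same shapes as in the question.
case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

// Hypothetical driver object; in spark-shell the body of main can be pasted directly.
object JoinCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-check").getOrCreate()
    import spark.implicits._

    // Read back the files written by the snippet in the question.
    val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
    val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

    val joined = dims2.join(cent2, dims2("dimension") === cent2("dimension"))

    // Print the logical and physical plans to see which join operator is used.
    joined.explain(true)

    // An inner join on "dimension" should never yield a null right-hand key.
    val suspicious = joined.filter(cent2("dimension").isNull).count()
    println(s"rows with a null right-hand join key: $suspicious")

    // Reference count computed through the RDD API, as in the question.
    val expected = dims2.rdd.map(r => (r.dimension, r))
      .join(cent2.rdd.map(r => (r.dimension, r)))
      .count()
    println(s"Dataset join rows: ${joined.count()}, RDD join rows: $expected")

    // Experiment (assumption, not a confirmed cause): turn off broadcast joins
    // and check whether the plan and the result change.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    val rejoined = dims2.join(cent2, dims2("dimension") === cent2("dimension"))
    rejoined.explain()
    rejoined.show()

    spark.stop()
  }
}
```

If the plan or the output differs between local mode and YARN, or between the two settings, that difference is useful detail to attach to a bug report.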

1 Answer:

Answer 0: (score: 0)

This issue has been fixed by the pull request submitted under https://issues.apache.org/jira/browse/SPARK-17806