How to order columns for the dataframe.except method

Asked: 2017-03-27 08:53:59

Tags: apache-spark apache-spark-sql spark-streaming spark-dataframe

I am trying to implement SQL MINUS behaviour in Spark. I have two JSON files, people1.json and people2.json, with the same data:

{"name":"abc","age":22}
{"name":"xyz","age":20}

Spark code:

val dfpeople1 = spark.read.json("/tmp/people1.json")
val dfpeople2 = spark.read.json("/tmp/people2.json")

val dfDifference = dfpeople1.except(dfpeople2)

dfDifference.show()

##########Output dataframe (as expected)##########
+---+----+
|age|name|
+---+----+
+---+----+

This time the age column in people2.json has a different name (xage instead of age):

{"name":"abc","xage":22}
{"name":"xyz","xage":20}

In the Spark code I rename the xage column back to age before calling except.

I expect the final DataFrame to match the previous result (0 rows), but that does not happen because of the column order!

Is there any other way to achieve this?

1 answer:

Answer 0 (score: 0)

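except compares columns by position rather than by name. After the rename, dfpeople2 has its columns in the order (name, age), while dfpeople1 has (age, name), so the rows never match. Just select the columns of dfpeople2 in the same order as the columns of dfpeople1:
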
val dfpeople1 = spark.read.json("/tmp/people1.json")
val xdfpeople2 = spark.read.json("/tmp/people2.json")
val dfpeople2 = xdfpeople2.withColumnRenamed("xage","age")

val columns = dfpeople1.columns
val dfpeople2Ordered = dfpeople2.select(columns.head, columns.tail:_*)

val dfDifference = dfpeople1.except(dfpeople2Ordered)

dfpeople1.show()
dfpeople2.show()
dfpeople2Ordered.show()
dfDifference.show()


##########Output##########
*******people1 dataframe******    
+---+----+
|age|name|
+---+----+
| 22| abc|
| 20| xyz|
+---+----+

*******people2 dataframe******    
+----+---+
|name|age|
+----+---+
| abc| 22|
| xyz| 20|
+----+---+

*******dfpeople2Ordered dataframe******    
+---+----+
|age|name|
+---+----+
| 22| abc|
| 20| xyz|
+---+----+

*******dfDifference -> ******       
+---+----+
|age|name|
+---+----+
+---+----+
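
If this comes up in more than one place, the reordering step can be wrapped in a small helper. The following is only a sketch under a few assumptions: `exceptByName` is a hypothetical name (it is not part of the Spark API), the session is a local one created just for the example, and the two DataFrames are built in memory instead of being read from the JSON files. It also assumes plain column names without dots, since `col` would treat a dotted name as a nested field.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ExceptByNameSketch {

  // Hypothetical helper (not a Spark API): reorders `right` to the column
  // order of `left`, then delegates to the positional `except`.
  def exceptByName(left: DataFrame, right: DataFrame): DataFrame = {
    val leftNames = left.columns.mkString(", ")
    val rightNames = right.columns.mkString(", ")
    // Fail early if the two sides do not have the same set of column names.
    require(
      left.columns.toSet == right.columns.toSet,
      s"Column names differ: [$leftNames] vs [$rightNames]"
    )
    left.except(right.select(left.columns.map(col): _*))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("except-by-name-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same rows on both sides, but with the columns in a different order.
    val people1 = Seq((22L, "abc"), (20L, "xyz")).toDF("age", "name")
    val people2 = Seq(("abc", 22L), ("xyz", 20L)).toDF("name", "age")

    val diff = exceptByName(people1, people2)
    diff.show()
    assert(diff.count() == 0) // every row of people1 also exists in people2

    spark.stop()
  }
}
```

The `require` check is a design choice: a missing or misspelled column then fails with a clear message rather than being compared against the wrong data.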