I am trying to implement SQL MINUS behavior in Spark. I have two JSON files, people1.json and people2.json, containing the same data:
{"name":"abc","age":22}
{"name":"xyz","age":20}
Spark code:
val dfpeople1 = spark.read.json("/tmp/people1.json")
val dfpeople2 = spark.read.json("/tmp/people2.json")
val dfDifference = dfpeople1.except(dfpeople2)
dfDifference.show()
##########dataframe (as expected)##########
+---+----+
|age|name|
+---+----+
+---+----+
This time, the column name in people2.json does not match: age is named xage.
{"name":"abc","xage":22}
{"name":"xyz","xage":20}
I then rename the column in Spark code and run except again.
I expect the final DataFrame to match the previous result (0 rows). Because of the column order, it does not!
Is there another way to achieve this?
Answer 0 (score: 0)
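Just reorder the columns of dfpeople2 to match dfpeople1 before calling except. except compares columns by position, not by name, so both DataFrames need the same column order: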
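val dfpeople1 = spark.read.json("/tmp/people1.json")
val xdfpeople2 = spark.read.json("/tmp/people2.json")
// Rename xage to age; the column order of dfpeople2 is still (name, age)
val dfpeople2 = xdfpeople2.withColumnRenamed("xage","age")
// Reorder dfpeople2 to the column order of dfpeople1: (age, name)
val columns = dfpeople1.columns
val dfpeople2Ordered = dfpeople2.select(columns.head, columns.tail:_*)
// With matching column order, except removes the identical rows
val dfDifference = dfpeople1.except(dfpeople2Ordered)
dfpeople1.show()
dfpeople2.show()
dfpeople2Ordered.show()
dfDifference.show()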
##########Output##########
*******people1 dataframe******
+---+----+
|age|name|
+---+----+
| 22| abc|
| 20| xyz|
+---+----+
*******people2 dataframe******
+----+---+
|name|age|
+----+---+
| abc| 22|
| xyz| 20|
+----+---+
*******dfpeople2Ordered dataframe******
+---+----+
|age|name|
+---+----+
| 22| abc|
| 20| xyz|
+---+----+
*******dfDifference dataframe******
+---+----+
|age|name|
+---+----+
+---+----+
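The reordering step above can be wrapped in a small helper so that it does not have to be repeated at every call site. The following is only a minimal sketch, not part of the original answer: the names ExceptByNameSketch and exceptByName are hypothetical, and the sample rows are built in memory instead of being read from /tmp so that the example is self-contained. It assumes both DataFrames already share the same column names (after any renaming); select will fail if a column is missing on the right side.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ExceptByNameSketch {
  // Hypothetical helper (not a Spark API): reorder `right` to the column
  // order of `left`, then apply the positional except.
  def exceptByName(left: DataFrame, right: DataFrame): DataFrame =
    left.except(right.select(left.columns.map(col): _*))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local run for illustration
      .appName("except-by-name")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-ins for people1.json and people2.json (assumed data)
    val people1 = Seq((22L, "abc"), (20L, "xyz")).toDF("age", "name")
    val people2 = Seq(("abc", 22L), ("xyz", 20L))
      .toDF("name", "xage")
      .withColumnRenamed("xage", "age") // column order is still (name, age)

    val diff = exceptByName(people1, people2)
    diff.show()                      // empty: every row of people1 is in people2

    spark.stop()
  }
}
```

The helper takes the column order from the left DataFrame, so the result keeps the schema of dfpeople1, the same as in the answer above.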