There are two tables: an ID table (Table 1) and an attribute table (Table 2).

Table 1

    id_x  id_y
    id1   id2
    id1   id3
    id2   id3

Table 2

    id    color  size
    id1   blue   m
    id2   red    s
    id3   blue   s

If the two IDs in a row of Table 1 have the same value for an attribute, we record 1 for that attribute, otherwise 0. The final result is Table 3.

Table 3

    id_x  id_y  color  size
    id1   id2   0      0
    id1   id3   1      0
    id2   id3   0      1
For example, id1 and id2 have different colors and different sizes, so the id1/id2 row (row 2 of Table 3) reads "id1 id2 0 0";
id1 and id3 have the same color but different sizes, so the id1/id3 row (row 3 of Table 3) reads "id1 id3 1 0".
Same attribute → 1; different attribute → 0.
How can I produce the result Table 3 using Scala DataFrames?
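To make the rule concrete before bringing Spark in, here is a minimal plain-Scala sketch (no Spark; the object name `PairCompare` is introduced here for illustration) that computes Table 3 from the example data:

```scala
object PairCompare {
  // Attribute table (Table 2): id -> (color, size)
  val attrs: Map[String, (String, String)] = Map(
    "id1" -> ("blue", "m"),
    "id2" -> ("red", "s"),
    "id3" -> ("blue", "s")
  )

  // ID pairs to compare (Table 1)
  val pairs = List(("id1", "id2"), ("id1", "id3"), ("id2", "id3"))

  // 1 when the two attribute values match, 0 otherwise
  def same(a: String, b: String): Int = if (a == b) 1 else 0

  // Result (Table 3): one row per pair with a 1/0 flag per attribute
  val table3: List[(String, String, Int, Int)] = pairs.map { case (x, y) =>
    val (cx, sx) = attrs(x)
    val (cy, sy) = attrs(y)
    (x, y, same(cx, cy), same(sx, sy))
  }

  def main(args: Array[String]): Unit = table3.foreach(println)
}
```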
Answer 0 (score: 3)
This should do the trick:

    import org.apache.spark.sql.functions.when
    import org.apache.spark.sql.types.IntegerType
    import spark.implicits._

    val t1 = List(
      ("id1", "id2"),
      ("id1", "id3"),
      ("id2", "id3")
    ).toDF("id_x", "id_y")

    val t2 = List(
      ("id1", "blue", "m"),
      ("id2", "red", "s"),
      ("id3", "blue", "s")
    ).toDF("id", "color", "size")

    t1
      // join the attribute table once per side of the pair, aliased
      // as "x" and "y" so both sets of attributes stay addressable
      .join(t2.as("x"), $"id_x" === $"x.id", "inner")
      .join(t2.as("y"), $"id_y" === $"y.id", "inner")
      .select(
        $"id_x",
        $"id_y",
        when($"x.color" === $"y.color", 1).otherwise(0).cast(IntegerType).alias("color"),
        when($"x.size" === $"y.size", 1).otherwise(0).cast(IntegerType).alias("size")
      )
      .show()
which results in:
+----+----+-----+----+
|id_x|id_y|color|size|
+----+----+-----+----+
| id1| id2| 0| 0|
| id1| id3| 1| 0|
| id2| id3| 0| 1|
+----+----+-----+----+
Answer 1 (score: 2)
Here is how to do it using a UDF. Naming the intermediate steps also makes it easier to see how often code is repeated and to minimize that for performance.
    import org.apache.spark.sql.functions.{col, udf}
    import spark.implicits._

    val df1 = spark.sparkContext.parallelize(Seq(
      ("id1", "id2"),
      ("id1", "id3"),
      ("id2", "id3")
    )).toDF("idA", "idB")

    val df2 = spark.sparkContext.parallelize(Seq(
      ("id1", "blue", "m"),
      ("id2", "red", "s"),
      ("id3", "blue", "s")
    )).toDF("id", "color", "size")

    // First join: attach idA's attributes, renaming them so they
    // do not clash with the columns added by the second join
    val firstJoin = df1.join(df2, df1("idA") === df2("id"), "inner")
      .withColumnRenamed("color", "colorA")
      .withColumnRenamed("size", "sizeA")
      .withColumnRenamed("id", "idx")

    // Second join: attach idB's attributes
    val secondJoin = firstJoin.join(df2, firstJoin("idB") === df2("id"), "inner")

    // UDF returning 1 when the two attribute values match, 0 otherwise
    val check = udf((v1: String, v2: String) => if (v1.equalsIgnoreCase(v2)) 1 else 0)

    val result = secondJoin
      .withColumn("color", check(col("colorA"), col("color")))
      .withColumn("size", check(col("sizeA"), col("size")))

    val finalResult = result.select("idA", "idB", "color", "size")
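On the point about repeated code: one `withColumnRenamed`/`withColumn` pair per attribute does not scale well. A hedged sketch of a generalization (the helper name `matchFlags` is introduced here; it assumes the aliased-join style from the first answer, with the attribute table joined as "x" and "y"):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when}

// Build one 1/0 comparison column per attribute name, so adding a new
// attribute only means extending the list passed in.
def matchFlags(attrNames: Seq[String]): Seq[Column] =
  attrNames.map(a =>
    when(col(s"x.$a") === col(s"y.$a"), 1).otherwise(0).alias(a))

// Usage with the aliased joins from the first answer (t1, t2 as defined there):
// t1.join(t2.as("x"), col("id_x") === col("x.id"))
//   .join(t2.as("y"), col("id_y") === col("y.id"))
//   .select(col("id_x") +: col("id_y") +: matchFlags(Seq("color", "size")): _*)
//   .show()
```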
Hope this helps!