I have two DataFrames.
dataDF
+---+
| tt|
+---+
|  a|
|  b|
|  c|
| ab|
+---+
alterDF
+----+-----+------+
|name|alter|profit|
+----+-----+------+
|   a|   aa|     1|
|   b|    a|     5|
|   c|   ab|     8|
+----+-----+------+
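The task is to look up col("tt") in col("name") of the alterDF DataFrame and join the rows when a match is found. If no match is found, col("tt") should be looked up in col("alter") instead. col("name") has higher priority than col("alter"): if a tt value matches some row on col("name"), I do not want it joined to other rows that match only on col("alter"). How can I accomplish this?
I tried writing a single join, but it does not give the result I want.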
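To make the intended rule concrete, here is a minimal sketch using plain Scala collections rather than Spark. The case class `Alter` and the helper `matchWithPriority` are hypothetical names introduced only for this illustration.

```scala
// Hypothetical in-memory model of the matching rule; this is not Spark code.
final case class Alter(name: String, alter: String, profit: Int)

// Return every row whose name equals tt; only when there is no such row,
// fall back to the rows whose alter equals tt.
def matchWithPriority(tt: String, rows: Seq[Alter]): Seq[Alter] = {
  val byName = rows.filter(_.name == tt)
  if (byName.nonEmpty) byName else rows.filter(_.alter == tt)
}

val rows = Seq(Alter("a", "aa", 1), Alter("b", "a", 5), Alter("c", "ab", 8))

// Pair each tt value with the rows it should be joined to.
val matches = Seq("a", "b", "c", "ab").map(tt => tt -> matchWithPriority(tt, rows))
matches.foreach(println)
```

Under this rule "a" pairs only with the row whose name is "a", even though another row has "a" in alter, and "ab" falls back to the row whose alter is "ab".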
dataDF = dataDF.select("*")
.join(broadcast(alterDF),
col("tt") === col("Name") || col("tt") === col("alter"),
"left")
The result is:
+---+----+-----+------+
| tt|name|alter|profit|
+---+----+-----+------+
|  a|   a|   aa|     1|
|  a|   b|    a|     5| // this row is not expected.
|  b|   b|    a|     5|
|  c|   c|   ab|     8|
| ab|   c|   ab|     8|
+---+----+-----+------+
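Answer 0 (score: 0)
You can try joining twice. First, left join on the name column. Then take the tt values that found no match and join them against the alter column. Finally, union the two results. The code below does this. Hope it helps.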
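import org.apache.spark.sql.functions.col
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell

// Create test data
val dataDF = Seq("a", "b", "c", "ab").toDF("tt")
val alter = Seq(("a", "aa", 1), ("b", "a", 5), ("c", "ab", 8))
  .toDF("name", "alter", "profit")
// First pass: left join on name
val join1 = dataDF.join(alter, col("tt") === col("name"), "left")
// Second pass: retry the tt values that found no name match, this time on alter
val join2 = join1.filter(col("name").isNull).select("tt")
  .join(alter, col("tt") === col("alter"), "left")
// Keep the name matches and append the second-pass rows (union is positional)
val joinDF = join1.filter(col("name").isNotNull).union(join2)
joinDF.show(false)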
+---+----+-----+------+
|tt |name|alter|profit|
+---+----+-----+------+
|a  |a   |aa   |1     |
|b  |b   |a    |5     |
|c  |c   |ab   |8     |
|ab |c   |ab   |8     |
+---+----+-----+------+
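As a possible alternative to joining twice, the sketch below keeps the single OR join from the question and then drops alter-only matches for any tt value that also has a name match. It is an illustration under stated assumptions, not a tuned solution: the local SparkSession setup and the helper column `hasNameMatch` are made up for this example, and the window step adds a shuffle.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{broadcast, col, max, when}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("priority-join").getOrCreate()
import spark.implicits._

val dataDF = Seq("a", "b", "c", "ab").toDF("tt")
val alterDF = Seq(("a", "aa", 1), ("b", "a", 5), ("c", "ab", 8))
  .toDF("name", "alter", "profit")

// Same OR condition as in the question; it still yields the unwanted (a, b, a, 5) row.
val candidates = dataDF.join(
  broadcast(alterDF),
  col("tt") === col("name") || col("tt") === col("alter"),
  "left")

// 1 for a name match, 0 otherwise (including unmatched rows, where name is null).
val isNameMatch = when(col("tt") === col("name"), 1).otherwise(0)

// hasNameMatch (hypothetical helper column) is 1 when any candidate row
// for the same tt value matched on name.
val prioritized = candidates
  .withColumn("hasNameMatch", max(isNameMatch).over(Window.partitionBy("tt")))
  // Keep name matches; keep alter-only or unmatched rows only when no name match exists.
  .filter(col("hasNameMatch") === 0 || col("tt") === col("name"))
  .drop("hasNameMatch")

prioritized.show(false)
```

The row order of the output is not guaranteed because of the window shuffle. Whether this beats the two-join version likely depends on data size, so it is worth comparing both on real data.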