I have two DataFrames: 1) account and 2) customer. The schema of account is:
Name Id Telephone Mob email
AR 1 123 1234 test1@gmail.com
BR 2 213 4123 test2@gmail.com
CR 3 231 3214 test3@gmail.com
KR 4 132 1324 test4@gmail.com
The second table, customer, is:
Id Phone Email
2 2344 testq@gmail.com
6 132 testf@gmail.com
7 64562 test1@gmail.com
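I need to join these two DataFrames so that Id matches Id, OR Phone matches Telephone or Mob, OR Email matches email. In the example above, the first customer row matches on Id, the second on phone, and the third on email. It should be a left join that keeps every account record.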
Answer 0 (score: 3)
Check the code below.
scala> accountDF.show(false)
+----+---+---------+----+---------------+
|name|id |telephone|mob |email |
+----+---+---------+----+---------------+
|AR |1 |123 |1234|test1@gmail.com|
|BR |2 |213 |4123|test2@gmail.com|
|CR |3 |231 |3214|test3@gmail.com|
|KR |4 |132 |1324|test4@gmail.com|
+----+---+---------+----+---------------+
scala> customerDF.show(false)
+---+-----+---------------+
|id |phone|email |
+---+-----+---------------+
|2 |2344 |testq@gmail.com|
|6 |132 |testf@gmail.com|
|7 |64562|test1@gmail.com|
+---+-----+---------------+
scala> accountDF.printSchema
root
|-- name: string (nullable = true)
|-- id: string (nullable = true)
|-- telephone: string (nullable = true)
|-- mob: string (nullable = true)
|-- email: string (nullable = true)
scala> customerDF.printSchema
root
|-- id: string (nullable = true)
|-- phone: string (nullable = true)
|-- email: string (nullable = true)
scala> accountDF.join(customerDF, (accountDF("id") === customerDF("id") || (accountDF("telephone") === customerDF("phone") || accountDF("mob") === customerDF("phone")) || accountDF("email") === customerDF("email")), "left").show(false)
+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email |id |phone|email |
+----+---+---------+----+---------------+----+-----+---------------+
|AR |1 |123 |1234|test1@gmail.com|7 |64562|test1@gmail.com|
|BR |2 |213 |4123|test2@gmail.com|2 |2344 |testq@gmail.com|
|CR |3 |231 |3214|test3@gmail.com|null|null |null |
|KR |4 |132 |1324|test4@gmail.com|6 |132 |testf@gmail.com|
+----+---+---------+----+---------------+----+-----+---------------+
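The transcript above relies on spark-shell, and its result carries two columns named id and two named email. Below is a minimal sketch of the same left join as a standalone program, assuming a local SparkSession and string columns as in the printSchema output. The object name OrJoinSketch, the aliases a and c, and the customer_* output names are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object, used only for this sketch.
object OrJoinSketch {

  // Builds the sample tables from the question and left-joins them on
  // "id OR telephone/mob OR email". All columns are strings.
  def joined(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val accountDF = Seq(
      ("AR", "1", "123", "1234", "test1@gmail.com"),
      ("BR", "2", "213", "4123", "test2@gmail.com"),
      ("CR", "3", "231", "3214", "test3@gmail.com"),
      ("KR", "4", "132", "1324", "test4@gmail.com")
    ).toDF("name", "id", "telephone", "mob", "email")

    val customerDF = Seq(
      ("2", "2344", "testq@gmail.com"),
      ("6", "132", "testf@gmail.com"),
      ("7", "64562", "test1@gmail.com")
    ).toDF("id", "phone", "email")

    // Aliases let the shared column names be referenced unambiguously.
    val a = accountDF.as("a")
    val c = customerDF.as("c")

    // A customer row matches when any one of the comparisons holds.
    val matches =
      $"a.id" === $"c.id" ||
      $"a.telephone" === $"c.phone" ||
      $"a.mob" === $"c.phone" ||
      $"a.email" === $"c.email"

    // "left" keeps every account row; unmatched rows get nulls on the customer side.
    a.join(c, matches, "left")
      .select(
        $"a.name", $"a.id", $"a.telephone", $"a.mob", $"a.email",
        $"c.id".as("customer_id"),
        $"c.phone".as("customer_phone"),
        $"c.email".as("customer_email")
      )
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch on one machine.
    val spark = SparkSession.builder()
      .appName("or-join-sketch")
      .master("local[*]")
      .getOrCreate()
    joined(spark).show(false)
    spark.stop()
  }
}
```

Renaming the customer columns avoids ambiguous references when the result is used further downstream. Note that an account matching more than one customer row would appear once per match, as with any join.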
Answer 1 (score: 1)
You can easily meet this requirement with Spark SQL.
Code for reference:
// Assumes spark-shell, where `spark`, `sc`, and spark.implicits._ are already in scope.
val accountdf = sc.parallelize(Seq(
  ("AR", 1, 123, 1234, "test1@gmail.com"), ("BR", 2, 213, 4123, "test2@gmail.com"),
  ("CR", 3, 231, 3214, "test3@gmail.com"), ("KR", 4, 132, 1324, "test4@gmail.com")
)).toDF("name", "id", "telephone", "mob", "email")
accountdf.createOrReplaceTempView("account")
val customerdf = sc.parallelize(Seq(
  (2, 2344, "testq@gmail.com"), (6, 132, "testf@gmail.com"), (7, 64562, "test1@gmail.com")
)).toDF("id", "phone", "email")
customerdf.createOrReplaceTempView("customer")
// The left join keeps every account row; the ON clause ORs the id, phone, and email matches.
spark.sql("select * from account a left join customer c on a.id = c.id or (a.telephone = c.phone or a.mob = c.phone) or a.email = c.email").show(false)
+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email |id |phone|email |
+----+---+---------+----+---------------+----+-----+---------------+
|BR |2 |213 |4123|test2@gmail.com|2 |2344 |testq@gmail.com|
|KR |4 |132 |1324|test4@gmail.com|6 |132 |testf@gmail.com|
|AR |1 |123 |1234|test1@gmail.com|7 |64562|test1@gmail.com|
|CR |3 |231 |3214|test3@gmail.com|null|null |null |
+----+---+---------+----+---------------+----+-----+---------------+
Answer 2 (score: 0)
// Assumes spark-shell (or import spark.implicits._) so that toDF is available.
val sourceDF = Seq(("AR",1,123,1234,"test1@gmail.com"),
  ("BR",2,213,4123,"test2@gmail.com"),
  ("CR",3,231,3214,"test3@gmail.com"),
  ("KR",4,132,1324,"test4@gmail.com")
).toDF("Name","Id","Telephone","Mob","email")
val sourceDF2 = Seq((2,2344,"testq@gmail.com"),
  (6,132,"testf@gmail.com"),
  (7,64562,"test1@gmail.com")
).toDF("Id","Phone","Email")
// A customer row matches when Id, either phone number, or email is equal.
val joinDF = sourceDF.join(sourceDF2,
  sourceDF.col("Id") === sourceDF2.col("Id") ||
  (sourceDF.col("Telephone") === sourceDF2.col("Phone") ||
  sourceDF.col("Mob") === sourceDF2.col("Phone")) ||
  sourceDF.col("email") === sourceDF2.col("Email"),
  "left")
// "left" keeps every account row, as the question asks; use "inner" to keep only matched rows