How do I give more column conditions when joining two dataframes? For example, I want to run the following:
val Lead_all = Leads.join(Utm_Master,
  Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
  "left")
I want to join only when these columns match. But the syntax above is invalid, since cols only takes one string. So how do I get what I want?
Answer 0 (Score: 73)
There is a Spark column/expression API for this kind of join:
Leaddetails.join(
  Utm_Master,
  Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
    && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
    && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
    && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
  "left"
)
The <=> operator in the example means "equality test that is safe for null values".
The main difference from the simple equality test (===) is that the first one is safe to use when one of the columns may have null values.
Answer 1 (Score: 14)
As of Spark version 1.5.0 (which is currently unreleased), you can join on multiple DataFrame columns. See SPARK-7990: Add methods to facilitate equi-join on multiple join keys.
Python
Leads.join(
    Utm_Master,
    ["LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"],
    "left_outer"
)
Scala
The question asked for a Scala answer, but I don't use Scala. Here is my best guess...
Leads.join(
  Utm_Master,
  Seq("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"),
  "left_outer"
)
Answer 2 (Score: 6)
One thing you can do is use raw SQL:
case class Bar(x1: Int, y1: Int, z1: Int, v1: String)
case class Foo(x2: Int, y2: Int, z2: Int, v2: String)

val bar = sqlContext.createDataFrame(sc.parallelize(
  Bar(1, 1, 2, "bar") :: Bar(2, 3, 2, "bar") ::
  Bar(3, 1, 2, "bar") :: Nil))

val foo = sqlContext.createDataFrame(sc.parallelize(
  Foo(1, 1, 2, "foo") :: Foo(2, 1, 2, "foo") ::
  Foo(3, 1, 2, "foo") :: Foo(4, 4, 4, "foo") :: Nil))

foo.registerTempTable("foo")
bar.registerTempTable("bar")

sqlContext.sql(
  "SELECT * FROM foo LEFT JOIN bar ON x1 = x2 AND y1 = y2 AND z1 = z2")
Answer 3 (Score: 6)
In Pyspark, you can specify each condition separately:
Lead_all = Leads.join(Utm_Master,
    (Leaddetails.LeadSource == Utm_Master.LeadSource) &
    (Leaddetails.Utm_Source == Utm_Master.Utm_Source) &
    (Leaddetails.Utm_Medium == Utm_Master.Utm_Medium) &
    (Leaddetails.Utm_Campaign == Utm_Master.Utm_Campaign))
Be sure to use the operators and parentheses correctly: in Python, & binds more tightly than ==, so each comparison needs its own parentheses.
Answer 4 (Score: 5)
Scala:
Leaddetails.join(
  Utm_Master,
  Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
    && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
    && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
    && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
  "left"
)
To make it case-insensitive,
import org.apache.spark.sql.functions.{lower, upper}
and then use lower(value) in the condition of the join method.
For example: dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))
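The same idea can be applied inside a join condition. A hedged sketch, reusing the DataFrame and column names from the answer above and assuming LeadSource is a string column on both sides:

import org.apache.spark.sql.functions.lower

val joinedIgnoreCase = Leaddetails.join(
  Utm_Master,
  // lower-case both sides so the key comparison ignores case
  lower(Leaddetails("LeadSource")) <=> lower(Utm_Master("LeadSource")),
  "left"
)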
Answer 5 (Score: 1)
The === option gave me duplicated columns, so I switched to Seq instead.
val Lead_all = Leads.join(Utm_Master,
  Seq("Utm_Source", "Utm_Medium", "Utm_Campaign"), "left")
Of course, this only works when the join columns have the same names on both sides (see the sketch below for when they do not).
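If the key columns are named differently on the two sides, a possible workaround (a sketch with hypothetical column names) is to rename one side first so the Seq-based join still applies:

// hypothetical: Leads carries the key as "Source" while Utm_Master calls it "Utm_Source"
val leadsRenamed = Leads.withColumnRenamed("Source", "Utm_Source")
val joinedRenamed = leadsRenamed.join(Utm_Master,
  Seq("Utm_Source", "Utm_Medium", "Utm_Campaign"), "left")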
Answer 6 (Score: 1)
In Pyspark, putting parentheses around each condition is the key to using multiple column names in the join condition.
joined_df = df1.join(df2,
    (df1['name'] == df2['name']) &
    (df1['phone'] == df2['phone'])
)
Answer 7 (Score: 0)
Spark SQL supports joining on a tuple of columns enclosed in parentheses, e.g.
... WHERE (list_of_columns1) = (list_of_columns2)
which is shorter than writing one equality expression (=) per pair of columns and combining them with a chain of "AND"s.
For example:
SELECT a, b, c
FROM tab1 t1
WHERE NOT EXISTS (
  SELECT 1
  FROM t1_except_t2_df e
  WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
)
instead of
SELECT a, b, c
FROM tab1 t1
WHERE NOT EXISTS (
  SELECT 1
  FROM t1_except_t2_df e
  WHERE t1.a = e.a AND t1.b = e.b AND t1.c = e.c
)
which is less readable, especially when the list of columns is large and you want to handle NULLs easily.
Answer 8 (Score: 0)
Try this:
val rccJoin = dfRccDeuda.as("dfdeuda")
  .join(dfRccCliente.as("dfcliente"),
    col("dfdeuda.etarcid") === col("dfcliente.etarcid")
      && col("dfdeuda.etarcid") === col("dfcliente.etarcid"), "inner")