Specifying multiple column conditions for a DataFrame join in Spark

Time: 2015-07-06 07:35:54

Tags: apache-spark apache-spark-sql rdd

How can I specify multiple column conditions when joining two DataFrames? For example, I want to run the following:

val Lead_all = Leads.join(Utm_Master,  
    Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

I want to join only when these columns match. However, the syntax above is not valid, because cols only takes one string. So how do I get what I want?

9 Answers:

Answer 0 (score: 73):

There is a Spark column/expression API for this kind of join:

Leaddetails.join(
    Utm_Master, 
    Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
        && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)

The <=> operator in the example means "Equality test that is safe for null values".

The main difference from the simple equality test (===) is that the first one is safe to use when one of the columns may have null values.
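
To see the difference, here is a minimal, illustrative sketch (the tiny left/right DataFrames and their contents are invented for the example, and import sqlContext.implicits._ is assumed to be in scope):

import sqlContext.implicits._

// Two toy DataFrames whose key column contains a null on each side.
val left  = Seq(("a", "google"), ("b", null)).toDF("id", "Utm_Source")
val right = Seq(("x", "google"), ("y", null)).toDF("id", "Utm_Source")

// === evaluates null = null to null (unknown), so the null rows do not match: 1 row.
left.join(right, left("Utm_Source") === right("Utm_Source")).count()

// <=> evaluates null <=> null to true, so the null rows also pair up: 2 rows.
left.join(right, left("Utm_Source") <=> right("Utm_Source")).count()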

Answer 1 (score: 14):

As of Spark version 1.5.0 (not yet released at the time of writing), you can join on multiple DataFrame columns. See SPARK-7990: Add methods to facilitate equi-join on multiple join keys.

Python

Leads.join(
    Utm_Master, 
    ["LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"],
    "left_outer"
)

Scala

The question asked for a Scala answer, but I don't use Scala. Here is my best guess...

Leads.join(
    Utm_Master,
    Seq("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
    "left_outer"
)

Answer 2 (score: 6):

One thing you can do is use raw SQL:

case class Bar(x1: Int, y1: Int, z1: Int, v1: String)
case class Foo(x2: Int, y2: Int, z2: Int, v2: String)

val bar = sqlContext.createDataFrame(sc.parallelize(
    Bar(1, 1, 2, "bar") :: Bar(2, 3, 2, "bar") ::
    Bar(3, 1, 2, "bar") :: Nil))

val foo = sqlContext.createDataFrame(sc.parallelize(
    Foo(1, 1, 2, "foo") :: Foo(2, 1, 2, "foo") ::
    Foo(3, 1, 2, "foo") :: Foo(4, 4, 4, "foo") :: Nil))

foo.registerTempTable("foo")
bar.registerTempTable("bar")

sqlContext.sql(
    "SELECT * FROM foo LEFT JOIN bar ON x1 = x2 AND y1 = y2 AND z1 = z2")

Answer 3 (score: 6):

In Pyspark, you can specify each condition separately:

Lead_all = Leads.join(Utm_Master,
    (Leaddetails.LeadSource == Utm_Master.LeadSource) &
    (Leaddetails.Utm_Source == Utm_Master.Utm_Source) &
    (Leaddetails.Utm_Medium == Utm_Master.Utm_Medium) &
    (Leaddetails.Utm_Campaign == Utm_Master.Utm_Campaign))

Be sure to use the operators and parentheses correctly: in Pyspark you need & (not and), and each comparison must be wrapped in parentheses because & binds more tightly than ==.

Answer 4 (score: 5):

Scala:

Leaddetails.join(
    Utm_Master, 
    Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
        && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)

To make it case insensitive,

import org.apache.spark.sql.functions.{lower, upper}

and then use lower(value) in the condition of the join method.

For example: dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))
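
For instance, a sketch of applying lower() inside the join condition itself (it simply reuses the DataFrame and column names from the snippet above for illustration):

import org.apache.spark.sql.functions.lower

// Compare lower-cased values so that "Google" and "google" are treated as equal.
Leaddetails.join(
    Utm_Master,
    lower(Leaddetails("Utm_Source")) <=> lower(Utm_Master("Utm_Source"))
        && lower(Leaddetails("Utm_Campaign")) <=> lower(Utm_Master("Utm_Campaign")),
    "left"
)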

Answer 5 (score: 1):

The === option gave me duplicated columns, so I used Seq instead:

val Lead_all = Leads.join(Utm_Master,
    Seq("Utm_Source","Utm_Medium","Utm_Campaign"),"left")

Of course, this only works when the names of the join columns are the same.
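
For illustration, a quick sketch of the difference (assuming the same Leads/Utm_Master DataFrames as in the question; the byExpr/bySeq names are made up):

// With an expression join, both sides keep their own copy of the key column,
// so the result has two columns named Utm_Source that must be disambiguated or dropped.
val byExpr = Leads.join(Utm_Master, Leads("Utm_Source") === Utm_Master("Utm_Source"), "left")

// With a Seq of column names, each join key appears only once in the result.
val bySeq = Leads.join(Utm_Master, Seq("Utm_Source", "Utm_Medium", "Utm_Campaign"), "left")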

Answer 6 (score: 1):

In Pyspark, putting parentheses around each condition is the key to using multiple column names in the join condition:

joined_df = df1.join(df2, 
    (df1['name'] == df2['name']) &
    (df1['phone'] == df2['phone'])
)

Answer 7 (score: 0):

Spark SQL supports joining on a tuple of columns when they are in parentheses, e.g.

... WHERE (list_of_columns1) = (list_of_columns2)

which is shorter than writing an equality expression (=) for each pair of columns combined by a set of "AND"s.

For example:

SELECT a,b,c
FROM    tab1 t1
WHERE 
   NOT EXISTS
   (    SELECT 1
        FROM    t1_except_t2_df e
        WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
   )

instead of

SELECT a,b,c
FROM    tab1 t1
WHERE 
   NOT EXISTS
   (    SELECT 1
        FROM    t1_except_t2_df e
        WHERE t1.a=e.a AND t1.b=e.b AND t1.c=e.c
   )

which is less readable, especially when the list of columns is large and you want to deal with NULLs easily.

Answer 8 (score: 0):

Try this:

val rccJoin = dfRccDeuda.as("dfdeuda")
  .join(dfRccCliente.as("dfcliente"),
    col("dfdeuda.etarcid") === col("dfcliente.etarcid")
      && col("dfdeuda.etarcid") === col("dfcliente.etarcid"),
    "inner")