Find all rows in one dataframe that satisfy criteria specified in another dataframe

Date: 2020-06-27 06:07:46

Tags: sql scala apache-spark apache-spark-sql

I have two dataframes, as shown below:

df1 (reference data)

Tempe, AZ, USA
San Jose, CA, USA
Mountain View, CA, USA
New York, NY, USA

df2 (user-entered data)

Tempe, AZ
Tempe, Arizona
San Jose, USA
San Jose, CA
Mountain View, CA

I want to obtain a dataframe (df3) like the following:

-------------------------------------------
|Tempe, AZ, USA        | Tempe, Arizona   |
|Tempe, AZ, USA        | Tempe, AZ        |
|San Jose, CA, USA     | San Jose, CA     |
|San Jose, CA, USA     | San Jose, USA    |
|Mountain View, CA, USA| Mountain View, CA|
-------------------------------------------

I already have a user-defined function:

def isSameAs(str1: String, str2: String): Boolean = {
    ......
}

It takes two strings (the user-entered data and the reference data) and tells me whether they match.
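For illustration only (the question elides the actual matching logic), a naive placeholder might compare the city name before the first comma:

    // Hypothetical placeholder, not the asker's real implementation:
    // treat two addresses as the same when the city name before the first
    // comma matches, ignoring case and surrounding whitespace.
    def isSameAs(str1: String, str2: String): Boolean = {
      def city(s: String): String = s.split(",").head.trim.toLowerCase
      city(str1) == city(str2)
    }

This naive version happens to pair up all of the sample rows above; a real implementation would likely need additional normalization, e.g. of state names versus abbreviations.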

I just need to find the right way to implement this mapping in Scala Spark SQL so that I get a dataframe like df3.

2 answers:

Answer 0: (score: 2)

Option 1: You can use the UDF as a join expression:

    import org.apache.spark.sql.functions._

    // Wrap the existing matching function in a Spark UDF
    val isSameAsUdf = udf(isSameAs(_, _))

    // Join the two dataframes using the UDF as the join condition
    val result = df1.join(df2, isSameAsUdf(df1.col("address"), df2.col("address")))

The downside of this approach is that Spark performs a Cartesian product of the two dataframes df1 and df2 and then filters out the rows that do not match the join condition (more details here). Running

    result.explain

prints

    == Physical Plan ==
    CartesianProduct UDF(address#4, address#10)
    :- LocalTableScan [address#4]
    +- LocalTableScan [address#10]

Option 2: To avoid the Cartesian product, it can be faster to broadcast the reference data as a standard Scala sequence and then do the mapping of the addresses in another UDF:

    // Broadcast the reference data (df1) as a plain Scala sequence
    val normalizedAddress: Seq[String] = ... // content of df1 as a Scala sequence
    val broadcastSeq = spark.sparkContext.broadcast(normalizedAddress)

    // Map each user-entered address to the first matching reference address
    def toNormalizedAddress(str: String): String =
      broadcastSeq.value.find(isSameAs(_, str)).getOrElse("")

    val toNormalizedAddressUdf = udf(toNormalizedAddress(_))

    val result2 = df2.withColumn("NormalizedAddress", toNormalizedAddressUdf('address))

The result is the same as for option 1, but the plan printed by

    result2.explain

no longer contains a Cartesian product.

The second option works if the reference data is small enough to broadcast. Depending on the cluster's hardware, reference data in the range of 10,000s of rows can still be considered small.
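If the reference data only exists as df1, one way to materialize it as a local sequence for broadcasting (a minimal sketch, assuming df1 has a single string column named address) is:

    // Sketch: collect the reference addresses to the driver for broadcasting.
    // Assumes df1 has a single string column named "address".
    val normalizedAddress: Seq[String] =
      df1.select("address").collect().map(_.getString(0)).toSeq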

Answer 1: (score: 1)

Assuming the schema below (address: string), try the following -

Load the data

    // Requires a SparkSession `spark` in scope, plus these imports
    // for toDF, $ and substring_index:
    import org.apache.spark.sql.functions._
    import spark.implicits._

    val data1 =
      """Tempe, AZ, USA
        |San Jose, CA, USA
        |Mountain View, CA, USA""".stripMargin
    val df1 = data1.split(System.lineSeparator()).toSeq.toDF("address")
    df1.show(false)
    /**
      * +----------------------+
      * |address               |
      * +----------------------+
      * |Tempe, AZ, USA        |
      * |San Jose, CA, USA     |
      * |Mountain View, CA, USA|
      * +----------------------+
      */

    val data2 =
      """Tempe, AZ
        |Tempe, Arizona
        |San Jose, USA
        |San Jose, CA
        |Mountain View, CA""".stripMargin

    val df2 = data2.split(System.lineSeparator()).toSeq.toDF("address")
    df2.show(false)

    /**
      * +-----------------+
      * |address          |
      * +-----------------+
      * |Tempe, AZ        |
      * |Tempe, Arizona   |
      * |San Jose, USA    |
      * |San Jose, CA     |
      * |Mountain View, CA|
      * +-----------------+
      */

Extract the join key and join based on it


    df1.withColumn("joiningKey", substring_index($"address", ",", 1))
      .join(
        df2.withColumn("joiningKey", substring_index($"address", ",", 1)),
        "joiningKey"
      )
      .select(df1("address"), df2("address"))
      .show(false)

    /**
      * +----------------------+-----------------+
      * |address               |address          |
      * +----------------------+-----------------+
      * |Tempe, AZ, USA        |Tempe, AZ        |
      * |Tempe, AZ, USA        |Tempe, Arizona   |
      * |San Jose, CA, USA     |San Jose, USA    |
      * |San Jose, CA, USA     |San Jose, CA     |
      * |Mountain View, CA, USA|Mountain View, CA|
      * +----------------------+-----------------+
      */
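Note that this equi-join on the city prefix avoids the Cartesian product of answer 0's option 1, but it matches purely on the text before the first comma, so two different addresses sharing a city name would pair up incorrectly. If that matters, the pre-join can be combined with the isSameAs UDF from answer 0 as an extra filter (a sketch, not part of either original answer; assumes isSameAsUdf is defined as above):

    // Sketch: cheap equi-join on the city key, then keep only the pairs
    // that the real matching function accepts.
    val refined = df1.withColumn("joiningKey", substring_index($"address", ",", 1))
      .join(
        df2.withColumn("joiningKey", substring_index($"address", ",", 1)),
        "joiningKey"
      )
      .filter(isSameAsUdf(df1("address"), df2("address")))
      .select(df1("address"), df2("address"))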