How does filtering on a struct attribute work in a Spark DataFrame?

Date: 2020-06-01 04:50:40

Tags: scala apache-spark aws-glue

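I want to filter some records out of a DataFrame using the filter method. I have an array of address structs that I am comparing against a column value. I am using the following code:

entityJoinB_df.filter(col("addressstructm.streetName").cast(StringType) =!= (col("streetName")))

But it does not work. What could be the problem? Can anyone help?

I want to remove the matching element from the address array based on that comparison. The sample schema is as follows: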

root
 |-- apartmentnumber: string (nullable = true)
 |-- streetName: string (nullable = true)
 |-- streetName2: string (nullable = true)
 |-- fullName: string (nullable = false)
 |-- address: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- streetName: string (nullable = true)
 |    |    |-- streetName2: string (nullable = true)
 |    |    |-- buildingName: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- city: string (nullable = true)
 |-- isActive: boolean (nullable = false)

Sample input:

[
{
"apartmentnumber": 122,
"streetName": "ABC ABC",
"streetName2": "CBD",
"fullName": "MR. X",
"address": [{
            "streetName": "ABC ABC",
            "streetName2": "CBD",
            "buildingName": "ONE",
            "city":"NY"
           },
           {
            "streetName": "XYZ ABC",
            "streetName2": "XCB",
            "buildingName": "ONE",
            "city":"NY"
           }]
}
]

Thanks, Upen

2 answers:

Answer 0 (score: 0)

I think you can solve your problem by changing the filter expression to:

import org.apache.spark.sql.functions._

entityJoinB_df.withColumn("address",
  expr("filter(addressstructm.address, x -> (x.streetName != streetName AND x.streetName != 'Secondary'))"))

This assumes addressstructm is an alias of your DataFrame.
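
The reason the expression has to change is that DataFrame.filter decides whether a whole row stays or goes, whereas the SQL higher-order function filter(array, lambda) works on the elements inside one array column. A minimal sketch of that difference, using plain Scala collections instead of Spark; Person and Address are hypothetical names used only for illustration:

```scala
// Plain Scala collections standing in for DataFrame rows (no Spark needed).
// Address and Person are hypothetical names used only for this illustration.
case class Address(streetName: String, city: String)
case class Person(streetName: String, address: Seq[Address])

object RowVsElementFilter {
  val people: Seq[Person] = Seq(
    Person("ABC ABC", Seq(Address("ABC ABC", "NY"), Address("XYZ ABC", "NY")))
  )

  // Row-level filter: each Person is kept or dropped as a whole;
  // the nested address list is never modified.
  val rowLevel: Seq[Person] =
    people.filter(p => p.address.exists(_.streetName != p.streetName))

  // Element-level filter: every Person is kept, but entries whose
  // streetName equals the top-level streetName are removed.
  val elementLevel: Seq[Person] =
    people.map(p => p.copy(address = p.address.filter(_.streetName != p.streetName)))

  def main(args: Array[String]): Unit = {
    println(rowLevel.head.address.size)     // 2: row kept, array untouched
    println(elementLevel.head.address.size) // 1: only "XYZ ABC" remains
  }
}
```

The second form is the analogue of what the expr("filter(...)") call does to the address column.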

Below is a sample with a structure similar to yours:

import org.apache.spark.sql.functions._

object StructParsin {

  def main(args: Array[String]): Unit = {
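    // Constant.getSparkSess is a helper from the answerer's own project (not shown) that returns a SparkSession
    val spark = Constant.getSparkSess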
    import spark.implicits._

    val df = List(
      Apartment(Array(Element("ABC ABC","123"),Element("XYZ ABC","123")),"ABC ABC"),
      Apartment(Array(Element("DEF","123"),Element("DEF1","123")),"XYZ")
    )
      .toDF

    df.printSchema()
    df.withColumn("newAddress",
      expr("filter(address, x -> ( x.streetName != streetName AND x.streetName != 'Secondary' ))"))
      .show()
  }
}

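// The second field is an assumption: the sample rows pass two arguments, so Element needs two fields
case class Element(streetName: String, id: String)

case class Apartment(address: Array[Element], streetName: String)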
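
One thing to watch with the != comparison in the lambda: in Spark SQL it evaluates to NULL when either side is NULL, and the filter function keeps only elements whose predicate is true, so an address with a null streetName would be dropped as well. If that is not what you want, a null-safe variant may help. The following is only a sketch, assuming Spark 2.4 or later (where the higher-order filter function is available) and a local SparkSession; Addr, Rec and NullSafeFilter are hypothetical names that are not part of the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Hypothetical case classes used only for this sketch.
case class Addr(streetName: String, city: String)
case class Rec(streetName: String, address: Seq[Addr])

object NullSafeFilter {
  // Returns the street names left in the address array after filtering.
  def remainingStreets(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val df = Seq(
      Rec("ABC ABC", Seq(Addr("ABC ABC", "NY"), Addr(null, "NY"), Addr("XYZ ABC", "NY")))
    ).toDF()

    df.withColumn("address",
        // <=> is null-safe equality, so NOT (a <=> b) is never NULL
        expr("filter(address, x -> NOT (x.streetName <=> streetName))"))
      .selectExpr("address.streetName AS streets")
      .collect()
      .head
      .getSeq[String](0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-safe-filter")
      .getOrCreate()
    println(remainingStreets(spark))
    spark.stop()
  }
}
```

With this predicate only the element whose streetName equals the row's streetName is removed; the element with a null streetName stays in the array.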

Answer 1 (score: 0)

Try the code below.

scala> entityJoinB_df
.withColumn("address",
    array_except($"address",
        array($"address"(array_position($"address.streetName",$"streetName")-1))
    )
)
.show(false)


+-------------------------+---------------+--------+----------+-----------+
|address                  |apartmentnumber|fullName|streetName|streetName2|
+-------------------------+---------------+--------+----------+-----------+
|[[ONE, NY, XYZ ABC, XCB]]|122            |MR. X   |ABC ABC   |CBD        |
+-------------------------+---------------+--------+----------+-----------+