How to "locate" records by the last StructType in a list

Time: 2019-06-07 05:54:00

Tags: apache-spark pyspark pyspark-sql

Suppose I have a DataFrame with a column named 'arr' that is a list of StructType, described by the following JSON:

{
  "otherAttribute": "blabla...",
  "arr": [
     {
        "domain": "books",
        "others": "blabla..."
     },
     {
        "domain": "music",
        "others": "blabla..."
     }
  ]
}
{
  "otherAttribute": "blabla...",
  "arr": [
     {
        "domain": "music",
        "others": "blabla..."
     },
     {
        "domain": "furniture",
        "others": "blabla..."
     }
  ]
}
... ...

We want to filter the records so that the last StructType in "arr" has its "domain" attribute equal to "music". In the example above, we need to keep the first record but discard the second. I need help writing such a "where" clause.

1 answer:

Answer 0: (score: 1)

The answer is based on the following data:

+---------------+----------------------------------------------+
|other_attribute|arr                                           |
+---------------+----------------------------------------------+
|first          |[[books, ...], [music, ...]]                  |
|second         |[[books, ...], [music, ...], [furniture, ...]]|
|third          |[[football, ...], [soccer, ...]]              |
+---------------+----------------------------------------------+

arr is an array of structs. Each element of arr has the attributes domain and others (filled with ... here).

The DataFrame API way (F is pyspark.sql.functions):

df.filter(
    F.col("arr")[F.size(F.col("arr")) - 1]["domain"] == "music"
)

The SQL way:

SELECT 
  other_attribute,
  arr
FROM df
WHERE arr[size(arr) - 1]['domain'] = 'music'

The output table will look like this:

+---------------+----------------------------+
|other_attribute|arr                         |
+---------------+----------------------------+
|first          |[[books, ...], [music, ...]]|
+---------------+----------------------------+

Full code (running it in the PySpark console is recommended):

import pyspark.sql.types as T
import pyspark.sql.functions as F

schema = T.StructType()\
    .add("other_attribute", T.StringType())\
    .add("arr", T.ArrayType(
        T.StructType()
            .add("domain", T.StringType())
            .add("others", T.StringType())
        )
    )

df = spark.createDataFrame([
    ["first", [["books", "..."], ["music", "..."]]],
    ["second", [["books", "..."], ["music", "..."], ["furniture", "..."]]],
    ["third", [["football", "..."], ["soccer", "..."]]]
], schema)

filtered = df.filter(
    F.col("arr")[F.size(F.col("arr")) - 1]["domain"] == "music"
)

filtered.show(100, False)

df.createOrReplaceTempView("df")

filtered_with_sql = spark.sql("""
    SELECT 
      other_attribute,
      arr
    FROM df
    WHERE arr[size(arr) - 1]['domain'] = 'music'
""")

filtered_with_sql.show(100, False)