Suppose I have a DataFrame with a column named 'arr' that is a list of StructType, described by the following JSON:
{
    "otherAttribute": "blabla...",
    "arr": [
        {
            "domain": "books",
            "others": "blabla..."
        },
        {
            "domain": "music",
            "others": "blabla..."
        }
    ]
}
{
    "otherAttribute": "blabla...",
    "arr": [
        {
            "domain": "music",
            "others": "blabla..."
        },
        {
            "domain": "furniture",
            "others": "blabla..."
        }
    ]
}
... ...
We want to filter the records so that only those whose last StructType in "arr" has a "domain" attribute of "music" are kept. In the example above, we need to keep the first record but discard the second. I need help writing such a "where" clause.
Answer 0 (score: 1):
The answer is based on the following data:
+---------------+----------------------------------------------+
|other_attribute|arr |
+---------------+----------------------------------------------+
|first |[[books, ...], [music, ...]] |
|second |[[books, ...], [music, ...], [furniture, ...]]|
|third |[[football, ...], [soccer, ...]] |
+---------------+----------------------------------------------+
arr is an array of structs. Each element of arr has the attributes domain and others (filled here with ...).
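For reference, calling df.printSchema() on a DataFrame built with the schema from the full code below would print the nesting like this:

root
 |-- other_attribute: string (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- others: string (nullable = true)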
The DataFrame API way (F is pyspark.sql.functions):
df.filter(
    F.col("arr")[F.size(F.col("arr")) - 1]["domain"] == "music"
)
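On Spark 2.4+, the same predicate can also be written with element_at, which accepts negative indices counting back from the end of the array; this is a minor variation on the answer, not part of the original:

df.filter(
    # element_at(arr, -1) returns the last struct in the array
    F.element_at(F.col("arr"), -1)["domain"] == "music"
)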
The SQL way:
SELECT
    other_attribute,
    arr
FROM df
WHERE arr[size(arr) - 1]['domain'] = 'music'
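The equivalent SQL predicate using element_at (again assuming Spark 2.4+) would be:

WHERE element_at(arr, -1)['domain'] = 'music'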
The output table will look like this:
+---------------+----------------------------+
|other_attribute|arr |
+---------------+----------------------------+
|first |[[books, ...], [music, ...]]|
+---------------+----------------------------+
Full code (running it in the PySpark console is recommended):
import pyspark.sql.types as T
import pyspark.sql.functions as F

# Schema: a string column plus an array of structs, each with "domain" and "others"
schema = T.StructType()\
    .add("other_attribute", T.StringType())\
    .add("arr", T.ArrayType(
        T.StructType()
        .add("domain", T.StringType())
        .add("others", T.StringType())
    ))

# "spark" is the SparkSession that the PySpark console provides
df = spark.createDataFrame([
    ["first", [["books", "..."], ["music", "..."]]],
    ["second", [["books", "..."], ["music", "..."], ["furniture", "..."]]],
    ["third", [["football", "..."], ["soccer", "..."]]]
], schema)

# DataFrame API: keep rows whose last struct in "arr" has domain == "music"
filtered = df.filter(
    F.col("arr")[F.size(F.col("arr")) - 1]["domain"] == "music"
)
filtered.show(100, False)

# The same filter expressed in SQL against a temporary view
df.createOrReplaceTempView("df")
filtered_with_sql = spark.sql("""
    SELECT
        other_attribute,
        arr
    FROM df
    WHERE arr[size(arr) - 1]['domain'] = 'music'
""")
filtered_with_sql.show(100, False)
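As a final note, the SQL predicate can be reused verbatim inside the DataFrame API through F.expr; a minimal sketch, assuming the same df as above:

# Pass the WHERE-clause expression as a string instead of building Column objects
filtered_with_expr = df.filter(F.expr("arr[size(arr) - 1]['domain'] = 'music'"))
filtered_with_expr.show(100, False)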