How to convert a variable logical expression into a pyspark filter

Asked: 2019-11-12 14:52:00

Tags: python pyspark logical-operators

I am looking for a way to turn a user-supplied logical expression into a filter that can be applied to a dataset in pyspark.

For example:

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

cSchema = StructType(
    [
        StructField("object_id", StringType()),
        StructField("object_parts", ArrayType(StringType()))
    ]
)

testdata = [
    ['object_1', ["p1", "p2", "p3"]], 
    ['object_2', ["p2", "p4", "p6"]], 
    ['object_3', ["p1", "p3", "p6"]]
]


df = spark.createDataFrame(testdata, schema=cSchema) 
display(df)  # Databricks-specific; outside Databricks use df.show(truncate=False)

object_id | object_parts
----------|-------------------
object_1  | ["p1", "p2", "p3"]
object_2  | ["p2", "p4", "p6"]
object_3  | ["p1", "p3", "p6"]

I am now stuck on how to filter based on a logical expression supplied by the user, e.g.

filter_expr = "(p1 & p2) | p3"  # renamed to avoid shadowing the built-in filter()

This filter should return the rows for object_1 and object_3.
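
One possible direction (a minimal sketch, not a tested answer): pyspark Column objects already overload &, | and ~, so the expression string can be evaluated in a namespace that maps each part name to an array_contains check on the array column. The helper name expression_to_filter is hypothetical, and the token extraction assumes part names consist of simple word characters:

import re
from pyspark.sql.functions import array_contains, col

def expression_to_filter(expr, parts_col):
    # Collect the part tokens (p1, p2, ...) appearing in the expression.
    tokens = set(re.findall(r"\b\w+\b", expr))
    # Map each token to a boolean Column: True when the array contains it.
    namespace = {t: array_contains(col(parts_col), t) for t in tokens}
    # Columns overload & / | / ~, so evaluating the expression against the
    # restricted namespace yields a single boolean Column. Validate expr
    # before doing this with untrusted input.
    return eval(expr, {"__builtins__": {}}, namespace)

df.filter(expression_to_filter(filter_expr, "object_parts")).show(truncate=False)

On the sample data this would keep object_1 (contains p1 and p2) and object_3 (contains p3), matching the expected result above.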

0 Answers:

No answers