I am looking for a way to convert a user's input of a logical expression into a filter that can be applied to a dataset in PySpark.
For example:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# Each row holds an object id and the list of parts that object contains.
cSchema = StructType(
    [
        StructField("object_id", StringType()),
        StructField("object_parts", ArrayType(StringType())),
    ]
)
testdata = [
    ['object_1', ["p1", "p2", "p3"]],
    ['object_2', ["p2", "p4", "p6"]],
    ['object_3', ["p1", "p3", "p6"]],
]
df = spark.createDataFrame(testdata, schema=cSchema)
display(df)
object_id | object_parts
----------+-------------------
object_1  | ["p1", "p2", "p3"]
object_2  | ["p2", "p4", "p6"]
object_3  | ["p1", "p3", "p6"]
I am now stuck on how to filter based on a logical expression given as user input, for example:
filter = "(p1 & p2) | p3"
This filter should return the rows for object_1 and object_3, i.e. the rows whose object_parts array contains both p1 and p2, or contains p3.
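A minimal sketch of the kind of translation I have in mind (parts_filter is a hypothetical helper, not an existing API), assuming part names are plain word tokens and the expression only uses &, | and parentheses: rewrite each token as an array_contains check, map & and | onto SQL AND and OR, and hand the resulting string to pyspark.sql.functions.expr:

import re

from pyspark.sql.functions import expr

def parts_filter(expression, column="object_parts"):
    # Hypothetical helper: turn each token such as "p1" into
    # array_contains(object_parts, 'p1').
    sql = re.sub(r"\b(\w+)\b", rf"array_contains({column}, '\1')", expression)
    # Spark SQL expects AND/OR rather than & and |.
    sql = sql.replace("&", " AND ").replace("|", " OR ")
    return expr(sql)

df.filter(parts_filter("(p1 & p2) | p3")).show()

For the example above this produces the condition (array_contains(object_parts, 'p1') AND array_contains(object_parts, 'p2')) OR array_contains(object_parts, 'p3'), which keeps object_1 and object_3, but I am not sure such naive string rewriting is robust for arbitrary user input.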