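I have a Spark DataFrame and need to apply a filter string to it. The filter selects only some of the rows, but I would like to know why each of the other rows was not selected. Example:
DataFrame columns: customer_id|col_a|col_b|col_c|col_d
Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0
etc...
reason_for_exclusion
can be any string or letter, as long as it explains why a particular row was excluded.
I could split the filter string and apply each filter separately, but my filter string is large and that would be inefficient, so I am checking whether there is a better way to do this.
Thanks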
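Answer 0 (score: 2)
To report the exact reason, you would have to check every condition in the filter expression, which can be very expensive for what is otherwise a simple filter operation. I would suggest showing the same reason for every filtered-out row, namely the whole filter expression, since such a row fails at least one condition in it. It is not pretty, but I prefer it because it is efficient, especially when you have to handle very large DataFrames.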
from pyspark.sql.functions import expr, lit, when

# assumes an existing SparkSession named spark
data = [(1, 1, 5, -3, 0), (2, 0, 10, -1, 0), (3, 0, 10, -4, 1)]
df = spark.createDataFrame(data, ["customer_id", "col_a", "col_b", "col_c", "col_d"])
filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"
# rows that fail the filter get the whole expression as their reason
filtered_df = df.withColumn(
    "reason_for_exclusion",
    when(~expr(filter_expr), lit(filter_expr)).otherwise(lit(None)),
)
filtered_df.show(truncate=False)
Output:
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion                             |
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|1          |1    |5    |-3   |0    |null                                             |
|2          |0    |10   |-1   |0    |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
|3          |0    |10   |-4   |1    |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
+-----------+-----+-----+-----+-----+-------------------------------------------------+
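One caveat with the snippet above: Spark SQL uses three-valued logic, so when the filter evaluates to NULL for a row (for example col_a is null and no other condition is false), ~expr(filter_expr) is NULL as well and the row gets a null reason even though df.filter(filter_expr) would drop it. A minimal sketch of a null-safe variant follows; the helper name exclusion_predicate is hypothetical and not part of the original answer.

```python
def exclusion_predicate(filter_expr):
    """Build a SQL predicate that is true for every row the filter would drop.

    coalesce(..., false) turns a NULL filter result into false, so the
    negation is true for rows where the filter is false OR NULL.
    """
    return f"NOT coalesce(({filter_expr}), false)"


# Intended use with the DataFrame from the answer (assumes pyspark is available):
#   when(expr(exclusion_predicate(filter_expr)), lit(filter_expr))
print(exclusion_predicate("col_a > 0 AND col_d=0"))
```

Whether this matters depends on the data: if none of the filtered columns can be null, the original when(~expr(...)) form behaves the same.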
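Edit:
Now, if you really want to show only the conditions that failed, you can turn each condition into its own column and compute them all with a DataFrame select. You then check which of those columns evaluate to False to find out which conditions failed.
You can name these columns <PREFIX>_<condition> so that they are easy to identify later. Here is a complete example: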
from pyspark import StorageLevel
from pyspark.sql.functions import array, array_except, col, expr, lit, when

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"
COLUMN_FILTER_PREFIX = "filter_validation_"
original_columns = [col(c) for c in df.columns]
# create one column per condition in the filter expression
# (split("AND") keeps the surrounding spaces, so they end up in the column names)
condition_columns = [expr(f).alias(COLUMN_FILTER_PREFIX + f) for f in filter_expr.split("AND")]
# evaluate each condition to True/False and persist the DF with the calculated columns
filtered_df = df.select(original_columns + condition_columns)
filtered_df = filtered_df.persist(StorageLevel.MEMORY_AND_DISK)
# get back the columns we calculated for the filter
filter_col_names = [c for c in filtered_df.columns if COLUMN_FILTER_PREFIX in c]
filter_columns = list()
for c in filter_col_names:
    # when() without otherwise() yields null for conditions that passed
    filter_columns.append(
        when(~col(f"`{c}`"), lit(f"{c.replace(COLUMN_FILTER_PREFIX, '')}"))
    )
# drop the nulls so only the failed conditions remain in the array
array_reason_filter = array_except(array(*filter_columns), array(lit(None)))
df_with_filter_reason = filtered_df.withColumn("reason_for_exclusion", array_reason_filter)
df_with_filter_reason.select(*original_columns, col("reason_for_exclusion")).show(truncate=False)
# output
+-----------+-----+-----+-----+-----+----------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion  |
+-----------+-----+-----+-----+-----+----------------------+
|1          |1    |5    |-3   |0    |[]                    |
|2          |0    |10   |-1   |0    |[col_a > 0 ]          |
|3          |0    |10   |-4   |1    |[col_a > 0 ,  col_d=0]|
+-----------+-----+-----+-----+-----+----------------------+
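As a possible alternative to the extra columns and the persist step, the failed conditions could be collected in a single SQL expression. The sketch below is not part of the original answer: split_conditions and build_reason_expr are hypothetical helper names, and the naive split only handles a flat chain of AND conditions (no BETWEEN ... AND ..., no OR groups, no AND inside string literals). It relies on concat_ws skipping NULL arguments, and it also trims the whitespace that split("AND") leaves around each condition.

```python
import re


def split_conditions(filter_expr):
    """Split a flat 'a AND b AND c' filter string into trimmed conditions."""
    parts = re.split(r"\s+AND\s+", filter_expr.strip(), flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def build_reason_expr(filter_expr):
    """Build one Spark SQL expression that lists every condition a row fails."""
    cases = []
    for cond in split_conditions(filter_expr):
        # escape backslashes and single quotes for the SQL string literal
        label = cond.replace("\\", "\\\\").replace("'", "\\'")
        # NULL when the condition holds; the condition text when it is false or NULL
        cases.append(f"CASE WHEN ({cond}) THEN NULL ELSE '{label}' END")
    # concat_ws ignores NULL arguments, so only failed conditions are joined
    return "concat_ws(', ', " + ", ".join(cases) + ")"


# Intended use with the DataFrame from the answer (assumes pyspark is available):
#   df.withColumn("reason_for_exclusion", expr(build_reason_expr(filter_expr)))
print(build_reason_expr("col_a > 0 AND col_d=0"))
```

With this form, rows that pass every condition get an empty string instead of an empty array, and a condition that evaluates to NULL is reported as failed, which matches how filter treats such rows.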