Question

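I have a PySpark DataFrame in which one column contains strings. I want to select only the rows where the string length in that column is greater than 5. I tried the size function, but it only works on arrays.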

from pyspark.sql.functions import col, explode, regexp_replace, size

new_df = df.select("col_1", explode(col("col_2")).alias("col_2")) \
    .select("col_1", "col_2") \
    .where(col("col_1").isNotNull()) \
    .where(size(col("col_2")) <= 5) \
    .distinct()

Is it possible to filter on the length of a column's contents without using a UDF?

Answer 1

As mentioned here, you can use length. Your example should then look like this:

from pyspark.sql.functions import col, explode, regexp_replace, length

new_df = df.select("col_1", explode(col("col_2")).alias("col_2")) \
    .select("col_1", "col_2") \
    .where(col("col_1").isNotNull()) \
    .where(length(col("col_2")) <= 5) \
    .distinct()
