Question

I want to filter a column in a pyspark DataFrame using a regular expression. I want to do something like this, but with a regex:

newdf = df.filter("only return rows with 8 to 10 characters in column called category")

This is my regex:

regex_string = "(\d{8}$|\d{9}$|\d{10}$)"

The category column is of string type in Python.

Answer 1

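Try the length() function.

Example:

# Assumes an active SparkSession named `spark`.
from pyspark.sql.functions import col, length

df = spark.createDataFrame([('abcdefghij',), ('abcdefghi',), ('abcdefgh',), ('abcdefghijk',)], ['str_col'])

# Keep only rows whose str_col is 8 to 10 characters long.
df.filter((length(col("str_col")) >= 8) & (length(col("str_col")) <= 10)).show()
#+----------+
#|   str_col|
#+----------+
#|abcdefghij|
#| abcdefghi|
#|  abcdefgh|
#+----------+

To filter with a regular expression instead, use the column's .rlike method.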
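A minimal sketch of the regex route, since it comes with no example here. The helper name `filter_digits` and the constant names are hypothetical. It assumes the column is called `category`, as in the question. As far as I know, `rlike` uses Java regex with partial-match semantics, so the pattern is anchored at both ends. The pattern from the question is anchored only at the end, so it would also accept longer digit strings. The local check uses Python's `re` module, which behaves the same way for this simple pattern.

```python
import re

# Pattern from the question: anchored only at the end of the string.
QUESTION_PATTERN = r"(\d{8}$|\d{9}$|\d{10}$)"

# Anchored at both ends: the whole value must be 8 to 10 digits.
DIGITS_8_TO_10 = r"^\d{8,10}$"


def filter_digits(df, column="category", pattern=DIGITS_8_TO_10):
    """Keep rows of a pyspark DataFrame whose `column` matches `pattern`."""
    # Imported here so the pattern check below runs without a Spark install.
    from pyspark.sql.functions import col
    return df.filter(col(column).rlike(pattern))


# Local sanity check of the patterns with Python's re module.
samples = ["12345678", "1234567890", "1234567", "12345678901", "abcdefgh"]
matched = [s for s in samples if re.search(DIGITS_8_TO_10, s)]
print(matched)

# The end-anchored pattern still matches an 11-digit value (its last 8 digits).
print(bool(re.search(QUESTION_PATTERN, "12345678901")))
```

With a DataFrame in hand, the call would be `newdf = filter_digits(df)`. If any 8 to 10 characters should pass rather than digits only, a pattern such as `^.{8,10}$` is one option, although the length() approach above is probably simpler for that case.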
