Question

PySpark 2.4.3 on AWS Glue

I have a list of words (patterns) and a DF like this:

INPUT:
+------+-----------+
| col0 | col1      |
+------+-----------+
| row1 | one_test1"|
| row2 | one_test2 |
+------+-----------+

I want to check col1 against my word list and, if a word (pattern) is present, remove it, giving me output like this:

OUTPUT:
+------+-----------+
| col0 | col1      |
+------+-----------+
| row1 | one       |
| row2 | one       |
+------+-----------+

The best I could do is the following, but it only works for a single word:

from pyspark.sql.functions import regexp_replace
new_df = old_df.withColumn('clean_text', regexp_replace('col1', '_test1"', ''))

Answer 1

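Join the words with |, which means "or" in a regular expression, so that a single regexp_replace call removes any of them: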

from pyspark.sql import functions as F

# Words (patterns) to strip from col1
to_remove = ['_test1"', '_test2', '_test3']

# '|'.join(...) builds the alternation '_test1"|_test2|_test3'
new_df = old_df.withColumn(
    'clean_text',
    F.regexp_replace('col1', '|'.join(to_remove), '')
)

new_df.show()
+----+----------+----------+
|col0|      col1|clean_text|
+----+----------+----------+
|row1|one_test1"|       one|
|row2| one_test2|       one|
+----+----------+----------+
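
One caveat with joining the words directly: each word is treated as a regular expression, so a word containing a metacharacter such as . or + would match more than intended, and a shorter word listed before a longer one that starts with it (say '_test1' before '_test10') would win the match. The sketch below is one possible way to guard against both. The helper name build_removal_pattern is made up for illustration, and it assumes that the output of Python's re.escape is also acceptable to the Java regex engine Spark uses, which should hold for ordinary punctuation. The check at the end runs locally with Python's re module, not on a DataFrame.

```python
import re


def build_removal_pattern(words):
    """Join literal words into one regex alternation, longest first.

    Hypothetical helper, not part of PySpark.
    """
    # Longest first, so '_test10' is tried before '_test1'.
    ordered = sorted(words, key=len, reverse=True)
    # re.escape makes metacharacters such as '.' or '+' match literally.
    return '|'.join(re.escape(w) for w in ordered)


to_remove = ['_test1"', '_test2', '_test3']
pattern = build_removal_pattern(to_remove)

# Local check with Python's re module, mirroring the sample rows.
cleaned = [re.sub(pattern, '', s) for s in ['one_test1"', 'one_test2']]
print(cleaned)  # ['one', 'one']
```

The resulting pattern string could then be passed in place of '|'.join(to_remove) in the withColumn call above.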

Pyspark - check a string column against a list of words and remove the matches
