I want to remove some duplicate words in a PySpark dataframe column.
Based on Remove duplicates from PySpark array column
My Spark: 2.4.5
Py3 code:
import pyspark.sql.functions as F

test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
t3 = test_df.withColumn("text", F.array("text"))  # wrap in an array, because the column in the original large df is array type
t4 = t3.withColumn("text", F.expr("transform(text, x -> lower(x))"))
t5 = t4.withColumn("text", F.array_distinct("text"))
t5.show(1, 120)
But I got:
+--------------------------------------------------------+
| text|
+--------------------------------------------------------+
| [i like this book and this book be downloaded on line]|
+--------------------------------------------------------+
I need the duplicated words
book and this
removed. It seems "array_distinct" cannot filter them out?
Thanks
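For reference, here is a plain-Python sketch (not Spark code) of why `array_distinct` appears to do nothing: it removes duplicate *elements* of an array, keeping first occurrences, and the array built with `F.array("text")` contains only one element, the whole sentence, so there are no duplicate elements to remove until the string is split into words.

```python
# Mimics Spark's array_distinct: drop duplicate elements, keep first occurrences.
def distinct(arr):
    seen, out = set(), []
    for x in arr:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

sentence = "i like this book and this book be downloaded on line"
print(distinct([sentence]))           # one element -> nothing is removed
print(distinct(sentence.split(" ")))  # word-level array -> duplicates removed
```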
Answer 0: (score: 0)
You can use the lcase, split, array_distinct, and array_join functions from pyspark.sql.functions.
For example, F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")
Here is the working code:
import pyspark.sql.functions as F

test_df \
    .withColumn("text_new",
        F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")) \
    .show(truncate=False)
Explanation:
Here you first convert everything to lowercase with lcase(text), then split the string on whitespace with split(text, ' '), which produces
[i, like, this, book, and, this, book, be, downloaded, on, line]
This is then passed to array_distinct, which produces
[i, like, this, book, and, be, downloaded, on, line]
Finally, array_join joins the words back together:
i like this book and be downloaded on line
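The pipeline above (lcase, split, array_distinct, array_join) can be mimicked in plain Python to see each step, assuming whitespace-only tokenization as in the Spark expression:

```python
# Plain-Python sketch of the Spark expression
# array_join(array_distinct(split(lcase(text), ' ')), ' ')
def dedupe_words(text):
    seen, out = set(), []
    for w in text.lower().split(" "):  # lcase + split on single spaces
        if w not in seen:              # array_distinct keeps first occurrences
            seen.add(w)
            out.append(w)
    return " ".join(out)               # array_join with a space separator

print(dedupe_words("I like this Book and this book be DOWNLOADED on line"))
# i like this book and be downloaded on line
```

Note that, like the Spark expression, this splits only on single spaces; for tabs or multiple spaces you would split on a whitespace regex instead.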