I have a Seq and a DataFrame. The DataFrame contains a column of array type, and I am trying to remove the elements of the Seq from that column.
For example:
val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")
+---------------------------------------------------+
|sorted_items                                       |
+---------------------------------------------------+
|[flannel, and, for, s, shirts, sleeve, warm]       |
|[3, 5, kitchenaid, s]                              |
|[5, 6, case, flip, inch, iphone, on, xs]           |
|[almonds, chocolate, covered, dark, joe, s, the]   |
|null                                               |
|[]                                                 |
|[animation, book]                                  |
+---------------------------------------------------+
Expected output:
+---------------------------------------------------+
|sorted_items                                       |
+---------------------------------------------------+
|[flannel, shirts, sleeve, warm]                    |
|[3, 5, kitchenaid]                                 |
|[5, 6, case, flip, inch, iphone, xs]               |
|[almonds, chocolate, covered, dark, joe]           |
|null                                               |
|[]                                                 |
|[animation, book]                                  |
+---------------------------------------------------+
How can this be done in an efficient and optimized way?
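The per-row operation is just filtering one sequence against another. As a minimal plain-Scala illustration of the logic (not Spark code; `Option` models the null cell, and the names are made up for this sketch):

```scala
object StopWordsDemo extends App {
  val stopWords = Set("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

  // Remove stop words from one row; None models a null array cell
  def removeStopWords(items: Option[Seq[String]]): Option[Seq[String]] =
    items.map(_.filterNot(stopWords.contains))

  println(removeStopWords(Some(Seq("flannel", "and", "for", "s", "shirts", "sleeve", "warm"))))
  // → Some(List(flannel, shirts, sleeve, warm))
  println(removeStopWords(None))
  // → None
}
```

The question is really about doing this efficiently over a DataFrame column, which the answers below address.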
Answer 0 (score: 2)
Use StopWordsRemover from the MLlib package. Custom stop words can be set with the setStopWords function. StopWordsRemover does not handle null values, so those need to be dealt with before use. This can be done as follows:
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{array, coalesce}
import spark.implicits._ // assumes a SparkSession named spark

// StopWordsRemover fails on nulls, so replace them with empty arrays first
val df2 = df.withColumn("sorted_items", coalesce($"sorted_items", array()))

val remover = new StopWordsRemover()
  .setStopWords(stop_words.toArray) // custom stop words
  .setInputCol("sorted_items")
  .setOutputCol("filtered")

val df3 = remover.transform(df2)
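Note that because the nulls are coalesced away before the transform, rows that were originally null come out as empty arrays rather than null. If the original nulls must be preserved, as in the expected output above, one option is to coalesce into a helper column and null the result back out afterwards. A sketch, assuming the `df` and `stop_words` from the question and a SparkSession named `spark` (the column names `tmp` and `filtered` are arbitrary):

```scala
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{array, coalesce, when}
import spark.implicits._

// Coalesce into a helper column so the original nulls stay observable
val withTmp = df.withColumn("tmp", coalesce($"sorted_items", array()))

val removed = new StopWordsRemover()
  .setStopWords(stop_words.toArray)
  .setInputCol("tmp")
  .setOutputCol("filtered")
  .transform(withTmp)

// Restore null where the input array was null, then drop the helper column
val result = removed
  .withColumn("filtered", when($"sorted_items".isNotNull, $"filtered"))
  .drop("tmp")
```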
Answer 1 (score: 1)
Use array_except from spark.sql.functions:
import org.apache.spark.sql.{functions => F}

val stopWords = Array("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

// array_except (Spark 2.4+) removes from the first array every element of the second,
// and passes nulls through; typedLit builds the array literal (plain lit rejects an
// Array on older Spark versions)
val newDF = df.withColumn("sorted_items", F.array_except(df("sorted_items"), F.typedLit(stopWords)))
newDF.show(false)
Output:
+----------------------------------------+
|sorted_items                            |
+----------------------------------------+
|[flannel, shirts, sleeve, warm]         |
|[3, 5, kitchenaid]                      |
|[5, 6, case, flip, inch, iphone, xs]    |
|[almonds, chocolate, covered, dark, joe]|
|null                                    |
|[]                                      |
|[animation, book]                       |
+----------------------------------------+