Scala: How to remove strings from multiple DataFrame columns whose values are Array[String]

Asked: 2018-11-15 20:39:34

Tags: scala apache-spark dataframe

I have a DataFrame in which multiple columns (up to X columns) contain values of type Array[String].

| col1          | col2          | col3          |
| ----------------------------------------------|
| Array[String] | Array[String] | Array[String] |
| ...                                           |

I also have a separate list of strings (not in the DataFrame). These are words that I absolutely hate and do not want anywhere in the DataFrame.

val bad_words = Array("doctor","saint")

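I want to search every column whose type is Array[String] and remove from each array the individual strings whose content matches one of the words in the bad_words list, i.e.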

Before:

| col1: Array[String]                      | col2: Array[String]                        |
| -----------------------------------------|--------------------------------------------|
| ["donut","Frisbee","phone","doctor"]     | ["I don't like the doctor","Bob Swagga"]   |
| ["Dorothy M. is a saint","I'm a banana"] | ["eenie","meenie","miney","Moe"]           |

After:

| col1: Array[String]                      | col2: Array[String]                        |
| -----------------------------------------|--------------------------------------------|
| ["donut","Frisbee","phone"]              | ["Bob Swagga"]                             |
| ["I'm a banana"]                         | ["eenie","meenie","miney","Moe"]           |

As shown, I also want to check whether any of the bad_words is a substring of any string in the array.

1 Answer:

Answer 0: (score: 0)

One way to do this is to define a UDF.

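def removeBadWords(input: Seq[String]): Seq[String] = {

   // The bad words from the question; substitute your own list here.
   val badWords: Seq[String] = Seq("doctor", "saint")

   // Keep only the strings that contain none of the bad words as a substring.
   input.filter(s => !badWords.exists(w => s.contains(w)))
}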

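import org.apache.spark.sql.functions.udf
import spark.implicits._ // enables $"col"; assumes a SparkSession named spark

val badWordsUdf = udf(removeBadWords(_: Seq[String]))

df.select(badWordsUdf($"col1"), badWordsUdf($"col2"))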
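
Since the question mentions up to X array columns, listing each one by hand in select may not scale. The following is a minimal sketch, not part of the original answer, of one way to apply the same filter to every Array[String] column found in the schema. The names BadWordFilter and cleanArrayColumns are hypothetical, matching is assumed to be case-sensitive, and null arrays and null elements are assumed to pass through unchanged.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField}

// Hypothetical helper object; the names are illustrative only.
object BadWordFilter {

  // Drop every string that contains any bad word as a substring.
  // Case-sensitive; a null array or null element is left as-is (assumption).
  def removeBadWords(badWords: Seq[String])(input: Seq[String]): Seq[String] =
    if (input == null) null
    else input.filter(s => s == null || !badWords.exists(w => s.contains(w)))

  // Apply the filter to every column whose type is array<string>.
  def cleanArrayColumns(df: DataFrame, badWords: Seq[String]): DataFrame = {
    val cleanUdf = udf((xs: Seq[String]) => removeBadWords(badWords)(xs))

    // Collect the names of all Array[String] columns from the schema.
    val arrayCols = df.schema.fields.collect {
      case StructField(name, ArrayType(StringType, _), _, _) => name
    }

    // Overwrite each such column with its filtered version, keeping its name.
    arrayCols.foldLeft(df)((acc, name) => acc.withColumn(name, cleanUdf(col(name))))
  }
}
```

With the question's data this could be called as BadWordFilter.cleanArrayColumns(df, bad_words.toSeq). Columns of other types are left untouched, and because withColumn reuses the existing name, the cleaned columns keep their original names. Note that col(name) treats dots in a column name as nested-field access, so such names would need backtick quoting.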