I have a DataFrame in which several columns (up to X of them) hold values of type Array[String].
| col1          | col2          | col3          |
| ------------- | ------------- | ------------- |
| Array[String] | Array[String] | Array[String] |
| ...           | ...           | ...           |
I also have a separate list of strings (not in the DataFrame): words that I absolutely hate and do not want to appear anywhere in the DataFrame.
val bad_words = Array("doctor","saint")
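I want to go through every column whose type is Array[String] and remove the individual strings in each array whose content matches one of the words in the bad_words list, i.e.
Before: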
| col1: Array[String] | col2: Array[String] |
| -----------------------------------------|--------------------------------------------|
| ["donut","Frisbee","phone","doctor"] | ["I don't like the doctor","Bob Swagga"] |
| ["Dorothy M. is a saint","I'm a banana"] | ["eenie","meenie","miney","Moe"] |
After:
| col1: Array[String] | col2: Array[String] |
| -----------------------------------------|--------------------------------------------|
| ["donut","Frisbee","phone"] | ["Bob Swagga"] |
| ["I'm a banana"] | ["eenie","meenie","miney","Moe"] |
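As the example shows, a string should also be removed when any of the bad_words appears in it as a substring, not only when the whole string equals a bad word.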
Answer 0 (score: 0)
One way to do this is to define a UDF.
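import org.apache.spark.sql.functions.udf
// The $"..." column syntax assumes spark.implicits._ is in scope.

// The bad_words list from the question
val badWords: Seq[String] = Seq("doctor", "saint")

// Keep only the strings that contain none of the bad words (substring match)
def removeBadWords(input: Seq[String]): Seq[String] =
  input.filterNot(s => badWords.exists(w => s.contains(w)))

val badWordsUdf = udf(removeBadWords(_: Seq[String]))
df.select(badWordsUdf($"col1"), badWordsUdf($"col2"))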
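Since the question asks for every Array[String] column rather than two named ones, the same UDF idea can be driven by the DataFrame's schema. The following is only a sketch under a few assumptions: the object name `RemoveBadWords`, the helper `cleanAllArrayColumns`, and the local `SparkSession` setup are hypothetical names introduced for illustration, the match is case-sensitive, and null arrays or null elements are passed through unchanged.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField}

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object RemoveBadWords {
  // The bad_words list from the question
  val badWords: Seq[String] = Seq("doctor", "saint")

  // Drop every string that contains any bad word as a substring (case-sensitive).
  // Null arrays are returned as-is and null elements are kept.
  def removeBadWords(input: Seq[String]): Seq[String] =
    if (input == null) null
    else input.filterNot(s => s != null && badWords.exists(w => s.contains(w)))

  // Apply the UDF to every column of type Array[String]; other columns are untouched.
  def cleanAllArrayColumns(df: DataFrame): DataFrame = {
    val badWordsUdf = udf(removeBadWords _)
    // Pick out the names of the Array[String] columns from the schema
    val arrayCols = df.schema.fields.collect {
      case StructField(name, ArrayType(StringType, _), _, _) => name
    }
    // Overwrite each such column with its filtered version
    arrayCols.foldLeft(df)((acc, name) => acc.withColumn(name, badWordsUdf(col(name))))
  }

  def main(args: Array[String]): Unit = {
    // Local session for trying the sketch out; an assumption, not part of the answer
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("remove-bad-words")
      .getOrCreate()
    import spark.implicits._

    // The "before" rows from the question
    val df = Seq(
      (Seq("donut", "Frisbee", "phone", "doctor"), Seq("I don't like the doctor", "Bob Swagga")),
      (Seq("Dorothy M. is a saint", "I'm a banana"), Seq("eenie", "meenie", "miney", "Moe"))
    ).toDF("col1", "col2")

    cleanAllArrayColumns(df).show(truncate = false)
    spark.stop()
  }
}
```

Using `withColumn` with the existing column name keeps the original column names and order, whereas `select(badWordsUdf(...))` would produce generated names unless each result is aliased.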