This is my df, which has 2 columns:
utid|description
12342|my name is 123 amrud and nitesh
2345|my name is anil
2122|my name is 1234 mohan
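and a list such as {"mohan","nitesh"}.
I need to check whether any element of this list appears in the description. If it does, put "found", otherwise "not found", in a new column of the dataframe. The real list is much larger, around 20k elements.
The output dataframe should look like this: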
utid|description|foundornot
12342|my name is 123 amrud and nitesh|found
2345|my name is anil|not found
2122|my name is 1234 mohan|found
Any help is welcome.
Answer 0 (score: 1)
You just need to define a udf function that checks the condition and returns the string "found" or "not found":
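import org.apache.spark.sql.functions._

// Names to look for in the description column
val list = List("mohan", "nitesh")
// Returns "found" if any name occurs as a substring; a null description counts as "not found"
val checkUdf = udf((strCol: String) => if (strCol != null && list.exists(name => strCol.contains(name))) "found" else "not found")
df.withColumn("foundornot", checkUdf(col("description"))).show(false)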
That's it; you should get:
+-----+-------------------------------+----------+
|utid |description |foundornot|
+-----+-------------------------------+----------+
|12342|my name is 123 amrud and nitesh|found |
|2345 |my name is anil |not found |
|2122 |my name is 1234 mohan |found |
+-----+-------------------------------+----------+
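Since the real list has around 20k names, scanning the whole list for every row may become slow. A possible variation, sketched here under the assumption that whole-word matches are acceptable (unlike the substring check above, "mohanlal" would not match "mohan"): split the description into words and look each word up in a Set. The names NameLookup, names, tokenize and foundOrNot are hypothetical and not part of the original code; the sketch is plain Scala so it can be tried without Spark.

```scala
// Hypothetical helper, plain Scala (no Spark dependency needed to try it).
object NameLookup {
  // Assumed stand-in for the real ~20k-name list; a Set gives constant-time lookups.
  val names: Set[String] = Set("mohan", "nitesh")

  // Split on any run of characters that are not letters or digits; drop empty tokens.
  def tokenize(text: String): Array[String] =
    text.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty)

  // "found" if any whole word of the description is in the set (case-sensitive);
  // a null description counts as "not found".
  def foundOrNot(description: String): String =
    if (description != null && tokenize(description).exists(names.contains)) "found"
    else "not found"
}
```

In Spark, this function could presumably be wrapped the same way as before, for example udf(NameLookup.foundOrNot _) in place of checkUdf. The cost per row then depends on the number of words in the description rather than on the size of the list.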
I hope the answer is helpful.