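I have a df, and I need to check whether each row contains any of the elements from a list of keywords. If it does, I need to put all of the matching keywords, separated by @, into a new column called foundornot.
My df looks like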
utid | description
123 | my name is harry and I live in newyork
234 | my neighbour is daniel and he plays hockey
The list looks like list = {harry,daniel,hockey,newyork}
The output should look like
utid | description | foundornot
123 | my name is harry and I live in newyork | harry@newyork
234 | my neighbour is daniel and he plays hockey | daniel@hockey
The real list is large, about 20k keywords. If no keyword is found, print NF.
Answer 0 (score: 0)
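You can use a udf to check, for each row, which elements of list occur in the description column, and return the matching elements as a single string separated by @, or the string NF if there are none:
import org.apache.spark.sql.functions._
val list = List("harry", "daniel", "hockey", "newyork")
// Collect every keyword that occurs as a substring of the description; return "NF" when none match.
val checkUdf = udf { (strCol: String) =>
  val found = list.filter(k => strCol.contains(k))
  if (found.isEmpty) "NF" else found.mkString("@")
}
df.withColumn("foundornot", checkUdf(col("description"))).show(false)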
This should give you
+----+------------------------------------------+-------------+
|utid|description |foundornot |
+----+------------------------------------------+-------------+
|123 |my name is harry and i live in newyork |harry@newyork|
|234 |my neighbour is daniel and he plays hockey|daniel@hockey|
+----+------------------------------------------+-------------+
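Since the question mentions about 20k keywords, scanning the whole list for every row may get slow. Below is a rough sketch of an alternative that splits the description into words and looks each one up in a Set. It is not the method from the answer above: it assumes the keywords are single lowercase words and that whole-word, case-insensitive matching is acceptable (a substring check would also match "harry" inside "harrys"). The names KeywordMatch and findKeywords are made up for illustration.

```scala
object KeywordMatch {
  // Hypothetical helper: returns the matched keywords joined by "@", or "NF" if none match.
  // Assumption: keywords are single lowercase words and whole-word matching is wanted.
  def findKeywords(description: String, keywords: Set[String]): String = {
    if (description == null) "NF"
    else {
      // Split on runs of non-word characters; keep tokens in order of first appearance.
      val tokens = description.toLowerCase.split("\\W+").filter(_.nonEmpty).distinct
      // Each lookup is a hash-set membership test instead of a scan over all keywords.
      val found = tokens.filter(t => keywords.contains(t))
      if (found.isEmpty) "NF" else found.mkString("@")
    }
  }

  def main(args: Array[String]): Unit = {
    val keywords = Set("harry", "daniel", "hockey", "newyork")
    println(findKeywords("my name is harry and I live in newyork", keywords))     // harry@newyork
    println(findKeywords("my neighbour is daniel and he plays hockey", keywords)) // daniel@hockey
    println(findKeywords("nothing to see here", keywords))                        // NF
  }
}
```

In Spark this could be wrapped the same way as above, for example udf((s: String) => KeywordMatch.findKeywords(s, keywords)). Note that the matches come out in the order they appear in the description rather than in the order of the list, so the result can differ from the contains-based udf when that order matters.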