I have a blacklist like this:
val blacklist: Array[String] = Array("one of a kind", "one of the", "industry leading", "industry's", "industry leader", "lifetime", "#1 ", "number 1", "number one", "Guarantee", "guaranteed", "guarantees", "Compete", "Competes", "competing", "Competed", "competitor", "competitors", "competition", "competitions", "competitive", "competitor's")
I have a DataFrame like the one below, and I want to keep only the rows whose value contains any entry in the blacklist. For this sample data, every row should be returned.
+------+---------------------------------+
|name | value |
+------+---------------------------------+
|atr1 | this is one of a kind product |
|atr2 | this product is industry leader |
|atr3 | it is competitor's nightmare |
+------+---------------------------------+
Answer 0 (score: 1)
Consider a DataFrame such as
+----+-------------------------------+
|name|value |
+----+-------------------------------+
|atr1|this is one of a kind product |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare |
|atr4|testing for filter |
+----+-------------------------------+
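You can define a udf function as
import org.apache.spark.sql.functions._
// True when value contains at least one blacklist entry (case-sensitive substring match); null never matches.
val blackListFilter = udf((value: String) => value != null && blacklist.exists(entry => value.contains(entry)))
and call it in a filter (the $"value" syntax needs import spark.implicits._ in scope):
df.filter(blackListFilter($"value"))
You should get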
+----+-------------------------------+
|name|value |
+----+-------------------------------+
|atr1|this is one of a kind product |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare |
+----+-------------------------------+
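The predicate inside the udf can be checked without a Spark session. The following is a minimal sketch in plain Scala, assuming the four sample rows above and a shortened copy of the blacklist. The names `BlacklistCheck`, `matchesBlacklist`, and `matchesIgnoreCase` are hypothetical and do not appear in the original answer. The case-insensitive variant is an assumption about what may be wanted, since the blacklist mixes entries such as "Guarantee" and "guaranteed".

```scala
object BlacklistCheck {
  // Shortened copy of the blacklist from the question, for illustration only.
  val blacklist: Array[String] =
    Array("one of a kind", "industry leader", "competitor's", "Guarantee")

  // Same logic as the udf body: true when value contains any blacklist entry.
  // Null-safe, and case-sensitive because String.contains is case-sensitive.
  def matchesBlacklist(value: String): Boolean =
    value != null && blacklist.exists(entry => value.contains(entry))

  // Hypothetical variant: lower-case both sides so that "Guarantee"
  // also matches "guarantee".
  def matchesIgnoreCase(value: String): Boolean =
    value != null &&
      blacklist.exists(entry => value.toLowerCase.contains(entry.toLowerCase))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      "atr1" -> "this is one of a kind product",
      "atr2" -> "this product is industry leader",
      "atr3" -> "it is competitor's nightmare",
      "atr4" -> "testing for filter"
    )
    // Keep only the rows whose value hits the blacklist.
    rows
      .filter { case (_, value) => matchesBlacklist(value) }
      .foreach { case (name, value) => println(s"$name | $value") }
  }
}
```

Running `BlacklistCheck.main` prints the atr1, atr2, and atr3 rows and drops atr4, which matches the filtered table above. Using `exists` stops at the first matching entry instead of building an intermediate array of booleans.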