Filtering a Spark dataframe for string values that contain part of a blacklist array

Time: 2017-08-02 15:41:16

Tags: scala apache-spark

I have a blacklist like this:

val blacklist: Array[String] = Array("one of a kind", "one of the", "industry leading", "industry's", "industry leader", "lifetime", "#1 ", "number 1", "number one", "Guarantee", "guaranteed", "guarantees", "Compete", "Competes", "competing", "Competed", "competitor", "competitors", "competition", "competitions", "competitive", "competitor's")

I have a dataframe like this:

+------+---------------------------------+
|name  | value                           |
+------+---------------------------------+
|atr1  | this is one of a kind product   |
|atr2  | this product is industry leader |
|atr3  | it is competitor's nightmare    |
+------+---------------------------------+

I want to keep every row whose value contains any entry from the blacklist.

In the case above, all three rows should be returned.

1 answer:

Answer 0 (score: 1)

Given the dataframe

+----+-------------------------------+
|name|value                          |
+----+-------------------------------+
|atr1|this is one of a kind product  |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare   |
|atr4|testing for filter             |
+----+-------------------------------+

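You can define a udf function as

import org.apache.spark.sql.functions._
// True when the value contains at least one blacklist entry as a substring; null values are rejected.
val blackListFilter = udf((value: String) => value != null && blacklist.exists(entry => value.contains(entry)))

and call it to keep only the matching rows

// The $"value" syntax needs import spark.implicits._ (spark being your SparkSession).
df.filter(blackListFilter($"value"))

You should get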

+----+-------------------------------+
|name|value                          |
+----+-------------------------------+
|atr1|this is one of a kind product  |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare   |
+----+-------------------------------+
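
The matching rule behind the filter can be checked on plain strings without starting Spark. The following is a minimal sketch, not part of the original answer: the object name BlacklistMatch, its method names, and the shortened blacklist are hypothetical and chosen only for illustration. It shows the case-sensitive substring test, a case-insensitive variant (an assumption about what may be wanted, since the blacklist mixes "Guarantee" and "guaranteed"), and a single regular expression built from the quoted entries.

```scala
import java.util.regex.Pattern

object BlacklistMatch {
  // Shortened copy of the blacklist from the question (illustration only).
  val blacklist: Array[String] =
    Array("one of a kind", "industry leader", "Guarantee", "competitor's")

  // Case-sensitive substring test: true when any entry occurs in the value.
  def containsBlacklisted(value: String): Boolean =
    value != null && blacklist.exists(entry => value.contains(entry))

  // Case-insensitive variant (assumption: not requested in the question).
  def containsBlacklistedIgnoreCase(value: String): Boolean =
    value != null && {
      val lower = value.toLowerCase
      blacklist.exists(entry => lower.contains(entry.toLowerCase))
    }

  // One alternation of all entries; Pattern.quote makes every entry a literal,
  // so characters such as '#' or '.' are not treated as regex syntax.
  val blacklistRegex: String =
    blacklist.map(entry => Pattern.quote(entry)).mkString("|")

  // Same result as containsBlacklisted, expressed as a regex search.
  def matchesRegex(value: String): Boolean =
    value != null && Pattern.compile(blacklistRegex).matcher(value).find()
}
```

Because Spark's Column.rlike takes a Java regular expression, a string built like blacklistRegex could probably replace the udf, for example df.filter($"value".rlike(BlacklistMatch.blacklistRegex)). Treat that as an untested suggestion rather than the answer's method.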