How to match multiple regex patterns against a column value in Spark?

Time: 2018-10-05 20:23:48

Tags: regex scala apache-spark dataframe pattern-matching

I have the following column and code:

val originalSqlLikePatternMap = Map(
  "item (%) is blacklisted%" -> "BLACK_LIST",
  "%Testing%" -> "TESTING",
  "%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")

val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)

val df = Seq(
  "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
  "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
  "item (1) is blacklisted, #!@"
).toDF("raw_type")

val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)

val result = df.withColumn("updatedType", converterUDF($"raw_type"))

But it gives:

+---------------------------------------------------------+----------------------+
|raw_type                                                 |updatedType           |
+---------------------------------------------------------+----------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING               |
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT|
|#!@                                                      |Unknown               |
|item (mejwnw) is blacklisted                             |BLACK_LIST            |
|item (1) is blacklisted, #!@                             |BLACK_LIST            |
+---------------------------------------------------------+----------------------+

But I want "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low" to give two values, "TESTING, TOO_LOW_PURCHASE_COUNT":

+---------------------------------------------------------+--------------------------------+
|raw_type                                                 |updatedType                     |
+---------------------------------------------------------+--------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT |
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT          |
|#!@                                                      |Unknown                         |
|item (mejwnw) is blacklisted                             |BLACK_LIST                      |
|item (1) is blacklisted, #!@                             |BLACK_LIST, Unknown             |
+---------------------------------------------------------+--------------------------------+

Can someone tell me what I am doing wrong?

1 answer:

Answer 0: (score: 2)

OK. So, there are a couple of things here:

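  1. Regarding find: to get the desired output you need to check each value against every regex, so find is not the right choice. Its documentation says:

    Finds the first value produced by the iterator satisfying a predicate, if any.

  2. Also look at your regex "%purchase count % is too low %": there is a space after "low", which is why the first row does not match it. Maybe you should also reconsider replacing % with .*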
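To make the first point concrete, here is a minimal plain-Scala sketch (no Spark needed) that contrasts find with collect on the patterns from the question. The object name FindVsCollect and the helpers firstLabel and allLabels are made up for illustration, and the patterns are written with the trailing space after "low" already removed.

```scala
import scala.util.matching.Regex

object FindVsCollect {
  // Patterns from the question after replacing % with .* (hypothetical fixed version:
  // the trailing space after "low" has been dropped).
  val patterns: Seq[(Regex, String)] = Seq(
    "item (.*) is blacklisted.*".r -> "BLACK_LIST",
    ".*Testing.*".r -> "TESTING",
    ".*purchase count .* is too low.*".r -> "TOO_LOW_PURCHASE_COUNT"
  )

  // find stops at the first pattern that matches, so at most one label comes back.
  def firstLabel(value: String): Option[String] =
    patterns.find { case (regex, _) => regex.findFirstIn(value).isDefined }.map(_._2)

  // collect visits every pattern and keeps the label of each one that matches.
  def allLabels(value: String): Seq[String] =
    patterns.collect { case (regex, label) if regex.findFirstIn(value).isDefined => label }
}
```

For the first sample row, firstLabel yields only TESTING, while allLabels yields both TESTING and TOO_LOW_PURCHASE_COUNT.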

So, with these changes, your code would look like this:

    import org.apache.spark.sql.functions.udf
    import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

    val originalSqlLikePatternMap = Map(
      "item (%) is blacklisted%" -> "BLACK_LIST",
      "%Testing%" -> "TESTING",
      "%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT")

    // .r compiles each converted pattern into a scala.util.matching.Regex
    val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*").r -> v._2)

    val df = Seq(
      "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
      "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
      "item (1) is blacklisted, #!@"
    ).toDF("raw_type")

    val converter = (value: String) => {
      // Test the value against every pattern and keep the label of each one that matches
      val res = javaPatternMap.map(v => {
        v._1.findFirstIn(value) match {
          case Some(_) => v._2
          case None => ""
        }
      })
        .filter(_.nonEmpty).mkString(", ")

      // Fall back to "Unknown" when no pattern matched
      if (res.isEmpty) "Unknown" else res
    }

    val converterUDF = udf(converter)

    val result = df.withColumn("updatedType", converterUDF($"raw_type"))

    result.show(false)
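One caveat about the replaceAll("%", ".*") conversion used above: it leaves regex metacharacters such as ( and ) unescaped, so "item (%) is blacklisted%" ends up with a capture group instead of literal parentheses. A stricter conversion could look roughly like the sketch below. likeToRegex is a hypothetical helper, not part of Spark, and it assumes that only the % wildcard needs to be supported (no _ wildcard and no escape character).

```scala
import java.util.regex.Pattern
import scala.util.matching.Regex

object LikePatterns {
  // Hypothetical helper (not a Spark API): converts a SQL LIKE pattern into a Regex.
  // Assumption: only the % wildcard is supported; _ and escape characters are not handled.
  def likeToRegex(like: String): Regex =
    like
      .split("%", -1)                       // limit -1 keeps empty pieces for a leading or trailing %
      .map(piece => Pattern.quote(piece))   // every literal piece is matched as plain text
      .mkString("^", ".*", "$")             // % becomes .*, anchored so the whole value must match
      .r
}
```

With this version, "item (mejwnw) is blacklisted" still matches the BLACK_LIST pattern, but "item mejwnw is blacklisted" no longer does, whereas the plain replaceAll conversion accepts both because the parentheses act as a group.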

Hope this helps!