How do I extract strings from a column in Scala?

Date: 2018-10-09 21:50:58

Tags: regex scala apache-spark

I have a list of data whose values look like List[INTERESTED_FIELD:details]. I am trying to pull out the field of interest from each value. How can I drop the parts I am not interested in?

Example:

val df = Seq(
  "TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
  "PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ",
  "UNKOWN:#!@",
  "BLACKLIST_ITEM:item (mejwnw) is blacklisted",
  "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
).toDF("raw_type")

df.show(false)

+-----------------------------------------------------------------+
|raw_type                                                         |
+-----------------------------------------------------------------+
|TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|
|PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low   |
|UNKOWN:#!@                                                       |
|BLACKLIST_ITEM:item (mejwnw) is blacklisted                      |
|BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@               |
+-----------------------------------------------------------------+

I am trying to get:

+-----------------------------------------------------------------+
|raw_type                                                         |
+-----------------------------------------------------------------+
|TESTING                                                          |
|PURCHASE,BLACKLIST_ITEM                                          |
|UNKOWN                                                           |
|BLACKLIST_ITEM                                                   |
|BLACKLIST_ITEM, UNKOWN                                           |
+-----------------------------------------------------------------+

2 answers:

Answer 0 (score: 1)

Check this UDF solution:

scala> val df = Seq(
     |   "TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
     |   "PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ",
     |    "UNKOWN:#!@",
     |    "BLACKLIST_ITEM:item (mejwnw) is blacklisted",
     |    "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
     | ).toDF("raw_type")
df: org.apache.spark.sql.DataFrame = [raw_type: string]

scala> def matchlist(a:String):String=
     | {
     | import scala.collection.mutable.ArrayBuffer
     | val x = ArrayBuffer[String]()
     | val pt = "([A-Z_]+):".r
     | pt.findAllIn(a).matchData.foreach { m => x.append(m.group(1)) }
     | return x.mkString(",")
     | }
matchlist: (a: String)String

scala> val myudfmatchlist = udf( matchlist(_:String):String )
myudfmatchlist: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.select(myudfmatchlist($"raw_type")).show(false)
+-----------------------+
|UDF(raw_type)          |
+-----------------------+
|TESTING                |
|PURCHASE,BLACKLIST_ITEM|
|UNKOWN                 |
|BLACKLIST_ITEM         |
|BLACKLIST_ITEM,UNKOWN  |
+-----------------------+


scala>
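For reference, the same UDF can be written without the mutable ArrayBuffer. A minimal sketch along the same lines (same "([A-Z_]+):" pattern, same output; matchlistUdf is an illustrative name, and $"raw_type" assumes the implicits available in spark-shell):

import org.apache.spark.sql.functions.udf

// Collect capture group 1 of every "([A-Z_]+):" match and join with commas.
val pt = "([A-Z_]+):".r
val matchlistUdf = udf { (a: String) =>
  pt.findAllMatchIn(a).map(_.group(1)).mkString(",")
}

df.select(matchlistUdf($"raw_type").as("raw_type")).show(false)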

Answer 1 (score: 1)

In spark-shell:

// Positive lookahead: match runs of uppercase letters/underscores followed by ":"
val p = "[A-Z_]+(?=:)".r
// Flatten each Row to a String, collect all matches, and rebuild the DataFrame
df.rdd.map(x => p.findAllIn(x.mkString).mkString(",")).toDF(df.columns: _*).show(false)
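
If you are on Spark 3.1 or later, the RDD round-trip can be avoided entirely: the built-in SQL function regexp_extract_all collects every match of a capture group into an array. A sketch using expr (assumes Spark 3.1+; column name raw_type as above):

import org.apache.spark.sql.functions.expr

// regexp_extract_all pulls capture group 1 of every match into an array;
// array_join flattens the array back into a comma-separated string.
df.select(
  expr("array_join(regexp_extract_all(raw_type, '([A-Z_]+):', 1), ',')").as("raw_type")
).show(false)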