I have a list of data whose values look like List[INTERSTED_FIELD:details]. I am trying to extract only the fields of interest from it. How can I remove the parts I am not interested in?
Example:
val df = Seq(
"TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
"PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ",
"UNKOWN:#!@",
"BLACKLIST_ITEM:item (mejwnw) is blacklisted",
"BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
).toDF("raw_type")
df.show(false)
+-----------------------------------------------------------------+
|raw_type |
+-----------------------------------------------------------------+
|TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|
|PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low |
|UNKOWN:#!@ |
|BLACKLIST_ITEM:item (mejwnw) is blacklisted |
|BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@ |
+-----------------------------------------------------------------+
I am trying to get:
+-----------------------------------------------------------------+
|raw_type |
+-----------------------------------------------------------------+
|TESTING |
|PURCHASE,BLACKLIST_ITEM |
|UNKOWN |
|BLACKLIST_ITEM |
|BLACKLIST_ITEM, UNKNOWN |
+-----------------------------------------------------------------+
Answer 0 (score: 1)
Check this UDF solution:
scala> val df = Seq(
| "TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
| "PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ",
| "UNKOWN:#!@",
| "BLACKLIST_ITEM:item (mejwnw) is blacklisted",
| "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
| ).toDF("raw_type")
df: org.apache.spark.sql.DataFrame = [raw_type: string]
scala> def matchlist(a:String):String=
| {
| import scala.collection.mutable.ArrayBuffer
| val x = ArrayBuffer[String]()
| val pt = "([A-Z_]+):".r
| pt.findAllIn(a).matchData.foreach { m => x.append(m.group(1)) }
| return x.mkString(",")
| }
matchlist: (a: String)String
scala> val myudfmatchlist = udf( matchlist(_:String):String )
myudfmatchlist: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select(myudfmatchlist($"raw_type")).show(false)
+-----------------------+
|UDF(raw_type) |
+-----------------------+
|TESTING |
|PURCHASE,BLACKLIST_ITEM|
|UNKOWN |
|BLACKLIST_ITEM |
|BLACKLIST_ITEM,UNKOWN |
+-----------------------+
scala>
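If avoiding a UDF is preferable, the same extraction can be done with built-in functions. A minimal sketch, assuming Spark 3.1+ (where the SQL function regexp_extract_all is available) and the same df as above:

import org.apache.spark.sql.functions.{expr, array_join}

df.select(
  // Extract capture group 1 of every match (the token before a colon),
  // then join the matches back into a comma-separated string.
  array_join(expr("regexp_extract_all(raw_type, '([A-Z_]+):', 1)"), ",").as("raw_type")
).show(false)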
Answer 1 (score: 1)
In spark-shell:
val p = "[A-Z_]+(?=:)".r
df.rdd.map(x => p.findAllIn(x.mkString).mkString(",")).toDF(df.columns: _*).show(false)
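For reference, a minimal sketch (plain Scala, no Spark needed) of how the lookahead pattern behaves on one of the sample strings:

val p = "[A-Z_]+(?=:)".r

// The lookahead (?=:) requires a colon to follow the token but does not consume it,
// so only the field names themselves are returned.
val sample = "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
println(p.findAllIn(sample).mkString(","))  // BLACKLIST_ITEM,UNKOWN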