I wrote the following regular expression:
val reg = ".+([A-Z_].+).(\\d{4})_(\\d{2})_(\\d{2})_(\\d{2})\\.orc".r
It should parse the following string: "S3//bucket//TS11_YREDED.2018_09_28_02.orc". The parsing method is:
val dataExtraction: String => Map[String, String] = {
  string: String => {
    string match {
      case reg(filename, year, month, day) =>
        Map(FILE_NAME -> filename, YEAR -> year, MONTH -> month, DAY -> day)
      case _ => Map(FILE_NAME -> filename, YEAR -> "", MONTH -> "", DAY -> "")
    }
  }
}
val YEAR = "YEAR"
val MONTH = "MONTH"
val DAY = "DAY"
val FILE_NAME = "FILE_NAME"
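But it does not work. It should drop the bucket name and parse out the file name and the date.
So the expected output is: Map(FILE_NAME -> TS11_YREDED, YEAR -> 2018, MONTH -> 09, DAY -> 28). Any idea how to fix it?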
Answer 0: (score: 0)
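The leading .+ part of the pattern is greedy: it grabs the whole string first and then backtracks only as far as it must, so ([A-Z_].+) captures only the little that has to be left over for the subsequent patterns to match.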
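The effect can be checked directly against the pattern from the question. The sketch below is only an illustration: the object name GreedyDemo and its helper methods are made up for this example. It also shows a second reason the question's match falls through, namely that the pattern has five capturing groups while the case clause binds only four.

```scala
object GreedyDemo {
  // The pattern from the question, unchanged.
  val original = ".+([A-Z_].+).(\\d{4})_(\\d{2})_(\\d{2})_(\\d{2})\\.orc".r
  val text = "S3//bucket//TS11_YREDED.2018_09_28_02.orc"

  // Group 1 as the regex engine actually captures it.
  def firstGroup(s: String): Option[String] =
    original.findFirstMatchIn(s).map(_.group(1))

  // Five capturing groups but four binders: the extractor pattern cannot match.
  def matchesWithFourBinders(s: String): Boolean = s match {
    case original(_, _, _, _) => true
    case _                    => false
  }

  def main(args: Array[String]): Unit = {
    println(firstGroup(text))             // Some(ED)
    println(matchesWithFourBinders(text)) // false
  }
}
```

The greedy .+ leaves only "ED" for the first group, which is why the file name never comes out whole.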
You can use
"""(?:.*/)?(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r
Note that a dot must be escaped to match a literal dot.
Details
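(?:.*/)? - an optional non-capturing group: any 0+ chars other than line break chars, as many as possible, up to and including the last /
(.*) - Capturing group 1: any 0+ chars other than line break chars, as many as possible
\. - a dot
(\d{4}) - Capturing group 2: four digits
_ - an underscore
(\d{2}) - Capturing group 3: two digits
_ - an underscore
(\d{2}) - Capturing group 4: two digits
_\d{2}\.orc - _, two digits, . and orc at the end of the string.
val text = "S3//bucket//TS11_YREDED.2018_09_28_02.orc"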
val reg = """(?:.*/)?(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r
val YEAR = "YEAR"
val MONTH = "MONTH"
val DAY = "DAY"
val FILE_NAME = "FILE_NAME"
val dataExtraction: String => Map[String, String] = {
  string: String => {
    string match {
      case reg(filename, year, month, day) =>
        Map(FILE_NAME -> filename, YEAR -> year, MONTH -> month, DAY -> day)
      case _ => Map(FILE_NAME -> FILE_NAME, YEAR -> YEAR, MONTH -> MONTH, DAY -> DAY)
    }
  }
}
println(dataExtraction(text))
// => Map(FILE_NAME -> TS11_YREDED, YEAR -> 2018, MONTH -> 09, DAY -> 28)
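Since you do not use the last capturing group, its parentheses can be dropped from the pattern; the two digits are still matched, just not captured.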
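A minimal sketch of what dropping the parentheses means in practice: the trailing two digits remain mandatory for a match even though they no longer produce a group. The object name UncapturedHourDemo, the helper fields, and the second file name (without the trailing two digits) are hypothetical and exist only for this illustration.

```scala
object UncapturedHourDemo {
  // Pattern from the answer: the trailing two digits are matched but not captured.
  val reg = """(?:.*/)?(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r

  // Returns the four captured fields, or None when the input does not match.
  def fields(s: String): Option[(String, String, String, String)] = s match {
    case reg(name, year, month, day) => Some((name, year, month, day))
    case _                           => None
  }

  def main(args: Array[String]): Unit = {
    // Four groups, four binders: the match succeeds.
    println(fields("S3//bucket//TS11_YREDED.2018_09_28_02.orc")) // Some((TS11_YREDED,2018,09,28))
    // Hypothetical name lacking the final "_02": the uncaptured part is still required.
    println(fields("S3//bucket//TS11_YREDED.2018_09_28.orc"))    // None
  }
}
```

Returning an Option here is just one way to make the no-match case explicit; the answer's Map-based fallback works the same way.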
Answer 1: (score: 0)
Check this out:
val file_name = "TS11_YREDED.2018_09_28_02.orc"
val reg = """(.*?)\.(\d{4})_(\d{2})_(\d{2})_(\d{2})\.orc""".r
val file_details = scala.collection.mutable.ArrayBuffer[String]()
reg.findAllIn(file_name).matchData.foreach(m => file_details.appendAll(m.subgroups))
val names = Array("FILE_NAME", "YEAR", "MONTH", "DAY", "DUMMY")
for ((x, y) <- names.zip(file_details).toMap)
  println(x + "->" + y)
//DUMMY->02
//DAY->28
//FILE_NAME->TS11_YREDED
//MONTH->09
//YEAR->2018
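This answer starts from the bare file name. If the full path from the question were passed in directly, the lazy (.*?) would begin at the start of the string and include the bucket prefix in the first group. One possible way around that, sketched here under the assumption that the file name is whatever follows the last slash, is to cut the path first. The object name FullPathDemo and the helper details are hypothetical.

```scala
object FullPathDemo {
  // Same pattern as in the answer above.
  val reg = """(.*?)\.(\d{4})_(\d{2})_(\d{2})_(\d{2})\.orc""".r
  val names = Seq("FILE_NAME", "YEAR", "MONTH", "DAY", "DUMMY")

  // Keep only what follows the last '/', then pair the names with the captured groups.
  // Assumption: the file name never contains a '/'.
  def details(path: String): Map[String, String] = {
    val fileName = path.substring(path.lastIndexOf('/') + 1)
    reg.findFirstMatchIn(fileName)
      .map(m => names.zip(m.subgroups).toMap)
      .getOrElse(Map.empty[String, String])
  }

  def main(args: Array[String]): Unit = {
    val parsed = details("S3//bucket//TS11_YREDED.2018_09_28_02.orc")
    println(parsed("FILE_NAME")) // TS11_YREDED
    println(parsed("YEAR"))      // 2018
  }
}
```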