I am trying to parse a custom log file in Spark using regular expression patterns.
My log file:
2018-04-11 06:27:36 localhost debug: localhost received discover from 0.0.0.0
2018-04-11 06:27:36 localhost debug: sec = 0.4
2018-04-11 06:27:36 localhost debug: Msg-Type = text
2018-04-11 06:27:36 localhost debug: Content = XXXXXXXXXX
2018-04-11 06:27:34 localhost debug: localhost sending response to 0.0.0.0
2018-04-11 06:27:34 localhost debug: sec = 0.3
2018-04-11 06:27:34 localhost debug: Msg-Type = text
2018-04-11 06:27:34 localhost debug: Content = XXXXXXXXXX
...
Here is my code snippet:
case class Rlog(dateTime: String, server_name: String, log_type: String, server_addr:String, action: String, target_addr:String, cost:String, msg_type:String, content:String)
case class Slog(dateTime: String, server_name: String, log_type: String, server_addr:String, action: String, target_addr:String, msg_type:String, content:String)
val pattern_1 = """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s(\w+)\sreceived\s(\w+)\sfrom\s([\.|\w]+)"""
val pattern_2 = """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s{5}([\w|-]+)\s=\s([\.|\w]+)"""
val pattern_3 = """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s(\w+)\ssending\s(\w+)\sto\s([\.|\w]+)"""
sc.textFile("/directory/logfile").map(?????)
Is there a way to do this?
Answer 0 (score: 2)
You can use pattern.unapplySeq(string) inside map to get a List of all the group matches for the regular expression.
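For example, if you have the string:
val str = "2018-04-11 06:27:36 localhost debug: localhost received discover from 0.0.0.0"
and you then run (pattern_1 is a plain String in your snippet, so .r compiles it into a Regex first):
pattern_1.r.unapplySeq(str)
you will get:
Option[List[String]] = Some(List(2018-04-11 06:27:36, localhost, debug, localhost, discover, 0.0.0.0))
I used your example for this solution. This answer assumes that a given log type and its associated msg type, content, and sec are all printed with the same timestamp.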
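To show how the same idea might be applied per line, here is a minimal sketch in plain Scala (no Spark dependency). The names `LogLineParser`, `Parsed`, `Event`, `Field`, and `parseLine` are hypothetical and are not from the question. The key/value pattern uses `\s+` instead of the question's `\s{5}`, on the assumption that the indentation shown in the sample is a single space; adjust it if the real file has five.

```scala
import scala.util.matching.Regex

object LogLineParser {
  // Patterns taken from the question and compiled with .r so they can be
  // used as extractors (a Regex extractor must match the whole line).
  val received: Regex =
    """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s(\w+)\sreceived\s(\w+)\sfrom\s([\.|\w]+)""".r
  val sending: Regex =
    """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s(\w+)\ssending\s(\w+)\sto\s([\.|\w]+)""".r
  // Assumption: \s+ replaces the question's \s{5}, see the note above.
  val keyValue: Regex =
    """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s+([\w|-]+)\s=\s([\.|\w]+)""".r

  // Hypothetical result types for illustration only.
  sealed trait Parsed
  final case class Event(dateTime: String, serverName: String, logType: String,
                         serverAddr: String, direction: String, action: String,
                         targetAddr: String) extends Parsed
  final case class Field(dateTime: String, serverName: String, logType: String,
                         key: String, value: String) extends Parsed

  // Returns None for lines that match none of the three patterns.
  def parseLine(line: String): Option[Parsed] = line match {
    case received(dt, srv, lt, addr, action, target) =>
      Some(Event(dt, srv, lt, addr, "received", action, target))
    case sending(dt, srv, lt, addr, action, target) =>
      Some(Event(dt, srv, lt, addr, "sending", action, target))
    case keyValue(dt, srv, lt, key, value) =>
      Some(Field(dt, srv, lt, key, value))
    case _ => None
  }
}
```

In Spark, something like `sc.textFile("/directory/logfile").flatMap(line => LogLineParser.parseLine(line))` should drop the unmatched lines and keep the parsed ones. Combining an `Event` with the `Field` lines that follow it into one `Rlog` or `Slog` would still need a grouping step, for instance by timestamp, which this sketch does not attempt.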