How to use Spark to parse a custom log file with regex

Posted: 2018-05-18 08:13:00

Tags: regex scala apache-spark

I am trying to parse a custom log file with Spark using regex patterns:

My log file:

2018-04-11 06:27:36 localhost debug: localhost received discover from 0.0.0.0
2018-04-11 06:27:36 localhost debug:     sec = 0.4
2018-04-11 06:27:36 localhost debug:     Msg-Type = text
2018-04-11 06:27:36 localhost debug:     Content = XXXXXXXXXX
2018-04-11 06:27:34 localhost debug: localhost sending response to 0.0.0.0
2018-04-11 06:27:34 localhost debug:     sec = 0.3
2018-04-11 06:27:34 localhost debug:     Msg-Type = text
2018-04-11 06:27:34 localhost debug:     Content = XXXXXXXXXX
...

Here is my code snippet:

case class Rlog(dateTime: String, server_name: String, log_type: String, server_addr:String, action: String, target_addr:String, cost:String, msg_type:String, content:String)
case class Slog(dateTime: String, server_name: String, log_type: String, server_addr:String, action: String, target_addr:String, msg_type:String, content:String)

val pattern_1 = """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s(\w+)\sreceived\s(\w+)\sfrom\s([\.|\w]+)"""
val pattern_2 = """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s{5}([\w|-]+)\s=\s([\.|\w]+)"""
val pattern_3 = """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s(\w+)\ssending\s(\w+)\sto\s([\.|\w]+)"""

sc.textFile("/directory/logfile").map(?????)

Is there a way to do this?

1 Answer:

Answer 0 (score: 2):

You can use pattern.unapplySeq(string) inside the map to get all of the regex's group matches as a List. Note that unapplySeq is defined on scala.util.matching.Regex, so the pattern strings above must first be compiled with .r.

For example, if you have the string:

val str = "2018-04-11 06:27:36 localhost debug: localhost received discover from 0.0.0.0"

and then run:

pattern_1.r.unapplySeq(str)

you will get:

Option[List[String]] = Some(List(2018-04-11 06:27:36, localhost, debug, localhost, discover, 0.0.0.0))

I have used your example for this solution. This answer assumes that a given log entry and the sec, Msg-Type, and Content lines associated with it are all printed with the same timestamp.
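To tie this back to the sc.textFile(...).map(?????) step in the question, the three patterns can be compiled with .r and used as extractors in a pattern match. This is only a sketch under the assumptions above: the parseLine helper and its tag strings ("received", "sent", "field") are hypothetical names I introduce here, and Spark itself is left out so the parsing logic can run standalone.

```scala
// Sketch: classify each raw log line with the question's three regexes.
// parseLine is a hypothetical helper; the patterns are copied verbatim
// from the question and compiled to scala.util.matching.Regex via .r.
object LogParse {
  val pattern_1 = """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s(\w+)\sreceived\s(\w+)\sfrom\s([\.|\w]+)""".r
  val pattern_2 = """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s{5}([\w|-]+)\s=\s([\.|\w]+)""".r
  val pattern_3 = """([\w|\s|\:|-]{19})\s([a-z]+)\s(\w+):\s(\w+)\ssending\s(\w+)\sto\s([\.|\w]+)""".r

  // Tag each line so downstream code can group a header line
  // ("received"/"sent") with its indented "field" lines by timestamp.
  def parseLine(line: String): Option[(String, List[String])] = line match {
    case pattern_1(ts, host, lvl, src, act, target) =>
      Some(("received", List(ts, host, lvl, src, act, target)))
    case pattern_3(ts, host, lvl, src, act, target) =>
      Some(("sent", List(ts, host, lvl, src, act, target)))
    case pattern_2(ts, host, lvl, key, value) =>
      Some(("field", List(ts, host, lvl, key, value)))
    case _ => None // unrecognized lines are dropped
  }
}
```

In Spark this would plug in as sc.textFile("/directory/logfile").flatMap(LogParse.parseLine), after which the "field" records can be joined with their "received"/"sent" header record on the shared timestamp to build the Rlog and Slog case classes.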