Question

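I am writing a generic Scala class that can parse Apache web log files. So far, my solution uses a regular expression with capture groups to match the different parts of a log string. To illustrate, each line of the incoming log looks like the following string: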

25.198.250.35 - [2014-07-19T16:05:33Z] "GET / HTTP/1.1" 404 1081 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

class HttpLogStringParser(logLine: String) {
  // Regex Pattern matching the logLine
  val pattern = """^([\d.]+) (\S+) (\S+) \[(.*)\] \"(.+?)\" (\d{3}) (\d+) \"(\S+)\" \"([^\"]+)\"$""".r
  val matched = pattern.findFirstMatchIn(logLine)

  def getIP: String = {
    val IP = matched match {
      case Some(m) => m.group(1)
      case _ => None
    }
    IP.toString
  }

  def getTimeStamp: String = {
    val timeStamp = matched match {
      case Some(m) => m.group(4)
      case _ => None
    }
    timeStamp.toString
  }

  def getRequestPage: String = {
    val requestPage = matched match {
      case Some(m) => m.group(5)
      case _ => None
    }
    requestPage.toString
  }

  def getStatusCode: String = {
    val statusCode = matched match {
      case Some(m) => m.group(6)
      case _ => None
    }
    statusCode.toString
  }
}

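Calling these methods should give me the IP, the date, the timestamp, or the status code. Is this the best way to do it? I also tried pattern matching with a case class, but that only gave me a Boolean saying whether the line matched, so I clearly got that wrong. What is the best way to get the required values out of the input log string?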

Answer 1

See the documentation:

http://www.scala-lang.org/api/current/#scala.util.matching.Regex

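You can remove the anchors from your pattern (a Regex used as an extractor must already match the entire input) and do: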

val (a, b, c) = line match { case p(x, _, y, z) => (x, y, z) case _ => ??? }

where p is your regex.
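
To make the one-liner above concrete, here is a minimal sketch. It assumes a simplified seven-group pattern rather than the full one from the question, and a sample line with two "-" identity fields, since that is what the question's pattern expects. The names ExtractorSketch, ip, timestamp and status are made up for illustration.

```scala
object ExtractorSketch {
  // Hypothetical simplified pattern: seven capture groups, no ^ or $ anchors.
  // The trailing " .*" swallows the referrer and user agent, which are ignored here.
  val p = """([\d.]+) (\S+) (\S+) \[(.*?)\] "(.+?)" (\d{3}) (\d+) .*""".r

  // Assumed sample line (two "-" fields before the timestamp).
  val line = "25.198.250.35 - - [2014-07-19T16:05:33Z] \"GET / HTTP/1.1\" 404 1081 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0)\""

  // The extractor needs one pattern variable (or _) per capture group,
  // and it only succeeds if the whole line matches.
  val (ip, timestamp, status) = line match {
    case p(i, _, _, t, _, s, _) => (i, t, s)
    case _                      => sys.error(s"Unparseable log line: $line")
  }
}
```

Note that the number of arguments in `p(...)` has to equal the number of capture groups in the regex, otherwise the case never matches.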

Or use non-capturing groups, (?:foo), or otherwise drop the groups you are not interested in.
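
As a rough sketch of the non-capturing variant: the pattern below is an assumed, shortened version of the question's regex in which the two identity fields use (?:...) so that only four groups are captured. The object name and the sample line are hypothetical.

```scala
object NonCapturingSketch {
  // Only IP, timestamp, request line and status code are captured.
  // (?:\S+) matches a field without creating a group; plain \S+ would do the same.
  val p = """([\d.]+) (?:\S+) (?:\S+) \[(.*?)\] "(.+?)" (\d{3}) .*""".r

  // Assumed sample line (two "-" fields before the timestamp).
  val line = "25.198.250.35 - - [2014-07-19T16:05:33Z] \"GET / HTTP/1.1\" 404 1081 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0)\""

  // With fewer groups there is no need for _ placeholders in the extractor.
  val parsed: Option[(String, String, String, String)] = line match {
    case p(ip, ts, req, status) => Some((ip, ts, req, status))
    case _                      => None
  }
}
```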

You can also use Regex.Groups to extract the groups from a Match.
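
A small sketch of the Regex.Groups approach, which fits the findFirstMatchIn call the question already uses. The shortened pattern, the object name GroupsSketch and the parse method are assumptions for illustration only.

```scala
import scala.util.matching.Regex

object GroupsSketch {
  // Assumed shortened pattern with four capture groups.
  // findFirstMatchIn is unanchored, so the rest of the line can be left unmatched.
  val p: Regex = """([\d.]+) \S+ \S+ \[(.*?)\] "(.+?)" (\d{3})""".r

  // Regex.Groups unpacks all capture groups of a Match in one step.
  def parse(line: String): Option[(String, String, String, String)] =
    p.findFirstMatchIn(line) match {
      case Some(Regex.Groups(ip, ts, req, status)) => Some((ip, ts, req, status))
      case _                                       => None
    }
}
```

This avoids calling m.group(n) with a hand-maintained index in every getter.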

Beyond that, it doesn't matter much which of these you pick.

Move the Regex to a companion object so that it is compiled only once.
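
One possible shape for that, assuming a case class is acceptable as the result type. LogEntry, its fields and parse are hypothetical names, and the pattern is a shortened stand-in for the one in the question.

```scala
// Hypothetical result type holding the fields the question asks for.
case class LogEntry(ip: String, timestamp: String, request: String, status: Int)

object LogEntry {
  // Compiled once when the companion object is initialised,
  // instead of once per parsed line as in the class-based version.
  private val Pattern = """([\d.]+) \S+ \S+ \[(.*?)\] "(.+?)" (\d{3}) .*""".r

  // Returns None for lines that do not match, rather than the string "None".
  def parse(line: String): Option[LogEntry] = line match {
    case Pattern(ip, ts, req, status) => Some(LogEntry(ip, ts, req, status.toInt))
    case _                            => None
  }
}
```

Returning an Option makes the unparseable case explicit to the caller, for example `LogEntry.parse(line).map(_.status)`.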

Parsing an Apache web log with a regular expression: which is better, a case class or a Java class?
