Extracting the path from server log strings with a regular expression in Scala

Time: 2019-07-11 18:38:59

Tags: java regex scala

I have some logs like the following:

endeavor.fujitsu.co.jp - - [10/Jul/1995:00:00:15 -0400] "GET /images/ HTTP/1.0" 200 17688                                   
ad13-022.compuserve.com - - [10/Jul/1995:00:00:15 -0400] "GET /history/gemini/gemini-spacecraft.txt HTTP/1.0" 200 651       
pm2-15.magicnet.net - - [10/Jul/1995:00:00:15 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713                        
204.239.199.40 - - [10/Jul/1995:00:00:16 -0400] "GET /shuttle/missions/sts-71/images/KSC-95EC-0613.gif HTTP/1.0" 200 45970  
pm1-4.tricon.net - - [10/Jul/1995:00:00:17 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669                        
scorpio.digex.net - - [10/Jul/1995:00:00:19 -0400] "GET /history/mercury/mr-3/mr-3.html HTTP/1.0" 200 1124

I need to extract the path from each of the log lines above. This is the code I tried:

val pattern = "\\s+([^\\s]+)\\s+HTTP".r
val matched = pattern.findFirstIn(log)

This is the output I get:

/images/ HTTP
/history/gemini/gemini-spacecraft.txt HTTP
/images/launch-logo.gif HTTP
/shuttle/missions/sts-71/images/KSC-95EC-0613.gif HTTP
/images/WORLD-logosmall.gif HTTP
/history/mercury/mr-3/mr-3.html HTTP

How can I remove the trailing HTTP from the paths above?

3 answers:

Answer 0: (score: 1)

The value you want is in the first capturing group.

Alternatively, you can use a positive lookahead:

\\s+[^\\s]+(?=\\s+HTTP)
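
A minimal sketch of both options, assuming one log line per string. The object and method names are made up for illustration. Note that the lookahead variant below drops the leading \\s+ so that the whole match is the bare path; with the leading \\s+ kept, the match starts with the whitespace before the path and needs a trim.

```scala
import scala.util.matching.Regex

object PathExtraction {
  // Pattern from the question: the path is captured by group 1.
  val withGroup: Regex = "\\s+([^\\s]+)\\s+HTTP".r

  // Lookahead variant without the leading \s+ (an adjustment for this sketch),
  // so the whole match is the path itself.
  val withLookahead: Regex = "[^\\s]+(?=\\s+HTTP)".r

  // Read group 1 instead of the whole match.
  def pathFromGroup(log: String): Option[String] =
    withGroup.findFirstMatchIn(log).map(_.group(1))

  // The lookahead is zero-width, so " HTTP" is not part of the match.
  def pathFromLookahead(log: String): Option[String] =
    withLookahead.findFirstIn(log)

  def main(args: Array[String]): Unit = {
    val log = """endeavor.fujitsu.co.jp - - [10/Jul/1995:00:00:15 -0400] "GET /images/ HTTP/1.0" 200 17688"""
    println(pathFromGroup(log))     // Some(/images/)
    println(pathFromLookahead(log)) // Some(/images/)
  }
}
```

Both helpers return an Option, so a line without a request simply yields None.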


Answer 1: (score: 0)

Your match is in the first capturing group (). You can shorten the pattern to:

\s(\S+)\s+HTTP

In Scala:

val pattern = "\\s(\\S+)\\s+HTTP".r

You could use findAllIn to read the group from each log line:

val pattern = "\\s(\\S+)\\s+HTTP".r
val strings = List(
  """endeavor.fujitsu.co.jp - - [10/Jul/1995:00:00:15 -0400] "GET /images/ HTTP/1.0" 200 17688""",
  """ad13-022.compuserve.com - - [10/Jul/1995:00:00:15 -0400] "GET /history/gemini/gemini-spacecraft.txt HTTP/1.0" 200 651""",
  """pm2-15.magicnet.net - - [10/Jul/1995:00:00:15 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713""",
  """204.239.199.40 - - [10/Jul/1995:00:00:16 -0400] "GET /shuttle/missions/sts-71/images/KSC-95EC-0613.gif HTTP/1.0" 200 45970""",
  """pm1-4.tricon.net - - [10/Jul/1995:00:00:17 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669""",
  """scorpio.digex.net - - [10/Jul/1995:00:00:19 -0400] "GET /history/mercury/mr-3/mr-3.html HTTP/1.0" 200 1124"""
)

strings.foreach { log =>
  val m = pattern.findAllIn(log).group(1)
  println(m)
}

Result:

/images/
/history/gemini/gemini-spacecraft.txt
/images/launch-logo.gif
/shuttle/missions/sts-71/images/KSC-95EC-0613.gif
/images/WORLD-logosmall.gif
/history/mercury/mr-3/mr-3.html

The following line from the comments has no HTTP after the path, so the pattern above does not match it:

columbia.acc.brad.ac.uk - - [10/Jul/1995:00:52:36 -0400] "GET /ksc.html" 200 7067
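
One possible way to cover request lines both with and without the HTTP version is to anchor on the opening quote and the method instead of on HTTP. This is a sketch under the assumption that every request starts with a quote followed by an upper-case method name; the pattern and the names RequestPath and path are illustrative and not taken from the original answer.

```scala
import scala.util.matching.Regex

object RequestPath {
  // Hypothetical pattern: opening quote, method, whitespace, then the path
  // captured as a run of characters that are neither whitespace nor a quote.
  val request: Regex = "\"[A-Z]+\\s+([^\\s\"]+)".r

  // Returns None for a line with no quoted request.
  def path(log: String): Option[String] =
    request.findFirstMatchIn(log).map(_.group(1))

  def main(args: Array[String]): Unit = {
    val lines = List(
      """scorpio.digex.net - - [10/Jul/1995:00:00:19 -0400] "GET /history/mercury/mr-3/mr-3.html HTTP/1.0" 200 1124""",
      """columbia.acc.brad.ac.uk - - [10/Jul/1995:00:52:36 -0400] "GET /ksc.html" 200 7067"""
    )
    // Prints one path per line that contains a request.
    lines.flatMap(path).foreach(println)
  }
}
```

Because the capture stops at whitespace or at the closing quote, it does not matter whether HTTP/1.0 follows the path.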

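Answer 2: (score: 0)

You can extract the capturing group (see the other answers), or you can simplify the regex pattern so that it matches only the part you are interested in.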

val pattern = "\\s+/[^\\s]+".r
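
A short sketch of how this pattern might be used, assuming the path is the first token in the line that follows whitespace and starts with a slash. The object and method names are hypothetical. The match itself begins with the whitespace in front of the path, so the sketch trims it.

```scala
import scala.util.matching.Regex

object SimplifiedPattern {
  // Pattern from the answer: whitespace followed by a token starting with '/'.
  val pattern: Regex = "\\s+/[^\\s]+".r

  // The match includes the leading whitespace, so trim it off.
  def path(log: String): Option[String] =
    pattern.findFirstIn(log).map(_.trim)

  def main(args: Array[String]): Unit = {
    val log = """pm2-15.magicnet.net - - [10/Jul/1995:00:00:15 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713"""
    println(path(log)) // Some(/images/launch-logo.gif)
  }
}
```

Since the pattern stops only at whitespace, a request without an HTTP version, such as "GET /ksc.html", would leave the closing quote attached to the path.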