I have to parse a request log with the following structure:
07/Dec/2017:18:15:58 +0100 [293920] -> GET URL HTTP/1.1
07/Dec/2017:18:15:58 +0100 [293920] <- 200 text/html 5ms
07/Dec/2017:18:15:58 +0100 [293921] -> GET URL HTTP/1.1
07/Dec/2017:18:15:58 +0100 [293921] <- 200 image/png 39ms
07/Dec/2017:18:15:59 +0100 [293922] -> HEAD URL HTTP/1.0
07/Dec/2017:18:15:59 +0100 [293922] <- 401 - 1ms
07/Dec/2017:18:15:59 +0100 [293923] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293923] <- 200 text/html 178ms
07/Dec/2017:18:15:59 +0100 [293924] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293924] <- 200 text/html 11ms
07/Dec/2017:18:15:59 +0100 [293925] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293925] <- 200 text/html 7ms
07/Dec/2017:18:15:59 +0100 [293926] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293926] <- 200 text/html 16ms
07/Dec/2017:18:15:59 +0100 [293927] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293927] <- 200 text/html 8ms
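The output should link the two lines of this log that share the number between the square brackets. The goal is to extract information from this log file for use in another data-processing package. I want to extract the useful information into a CSV file, which should be structured as follows: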
startTimestamp,endTimestamp,requestType/responseCode,URL/typ,responsetime
07/Dec/2017:18:15:58,07/Dec/2017:18:15:58,GET,200,URL,text/html,5ms
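I wrote a Groovy script that does the trick, but it is very slow.
I know there are some improvements I could make, but I would like your ideas. Some of you may have solved this problem before.
A response does not always directly follow its request. Not every request gets a response (or the response was not logged because of a server restart).
The log files can range from 70 MB to 300 MB. My Groovy script takes a very long time.
I know there are nice, fast solutions in the Unix terminal using awk and sort, but I have no experience with them.
Thanks in advance for your help.
Here is the code I already have. Possible improvements:
1) Use a map keyed by the number, for faster lookups and less parsing
2) Don't scan the backlog list for every line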
def logFile = new File("../request.log")
def outputfile = new File(logFile.parent, logFile.name + ".csv")
def backlog = new ArrayList<String>()
StringBuilder output = new StringBuilder()
outputfile.withPrintWriter { writer ->
    logFile.withReader { Reader reader ->
        reader.eachLine { String line ->
            Iterator<String> it = backlog.iterator()
            while (it.hasNext()) {
                String bLine = it.next()
                String[] lineSplit = line.split(" ")
                if (bLine.contains(lineSplit[2])) {
                    String[] bLineSplit = bLine.split(" ")
                    output.append(bLineSplit[0] + "," + lineSplit[0] + "," + bLineSplit[4] + "," + lineSplit[4] + "," + bLineSplit[5] + "," + lineSplit[5] + "," + lineSplit[6] + "\r\n")
                    //writer.println(outputline)
                    it.remove()
                }
            }
            backlog.add(line)
        }
    }
    writer.println(output)
    if (!backlog.isEmpty()) {
    }
    backlog.each { String line ->
        writer.println(line)
    }
}
Answer 0 (score: 0)
As a one-liner:
sort -s -k 3,3 request.log | awk 'BEGIN { print "startTimestamp;endTimestamp;requestType;responseCode;URL;typ;responsetime"; split("", request); split("", response) } $4 == "->" { printLine(); split($0, request); split("", response) } $4 == "<-" { split($0, response) } END { printLine() } function printLine() { if (length(request)) { print request[1] ";" response[1] ";" request[5] ";" response[5] ";" request[6] ";" response[6] ";" response[7] } }'
As a multi-liner:
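# -s (stable sort) keeps the input order of lines that share an id,
# so each request stays ahead of its response.
sort -s -k 3,3 request.log | awk '
BEGIN {
    print "startTimestamp;endTimestamp;requestType;responseCode;URL;typ;responsetime"
    # Start with empty arrays.
    split("", request)
    split("", response)
}
# Request line: flush the previous pair, then start a new one.
$4 == "->" {
    printLine()
    split($0, request)
    split("", response)
}
# Response line: attach it to the current request.
$4 == "<-" {
    split($0, response)
}
END {
    # Flush the last pair.
    printLine()
}
function printLine() {
    # Response fields stay empty when a request never got a response.
    if (length(request)) {
        print request[1] ";" response[1] ";" request[5] ";" response[5] ";" request[6] ";" response[6] ";" response[7]
    }
}'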
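If sorting a 300 MB file is a concern, the map idea from the question (improvement 1) can also be done in awk without sort. The following is only a sketch under a few assumptions: the function name pair_requests is made up for illustration, every line has the whitespace-separated layout shown in the question, responses whose request was never seen are silently skipped, and requests that never got a response are written at the end with empty response fields (in no particular order).

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): pair request and
# response lines by their bracketed id in a single pass over the log.
pair_requests() {
    awk '
    BEGIN {
        OFS = ";"
        print "startTimestamp;endTimestamp;requestType;responseCode;URL;typ;responsetime"
    }
    # Request line: remember timestamp, method and URL under the id ($3).
    $4 == "->" {
        start[$3] = $1
        method[$3] = $5
        url[$3] = $6
        next
    }
    # Response line with a pending request: emit one row, then free the entry
    # so that memory only holds requests that are still waiting.
    $4 == "<-" && ($3 in start) {
        print start[$3], $1, method[$3], $5, url[$3], $6, $7
        delete start[$3]
        delete method[$3]
        delete url[$3]
    }
    # Requests that never received a response: empty response fields.
    END {
        for (id in start) {
            print start[id], "", method[id], "", url[id], "", ""
        }
    }' "$@"
}
```

It could be called as pair_requests request.log > request.log.csv. Rows come out in the order the responses appear in the log rather than sorted by id, and setting OFS to "," (and adjusting the header) would give the comma-separated layout from the question.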