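I have a log file and need to create a hash key for each URL in its records. Each line of the file is put into an array, and I loop over that array assigning the hash keys.
I need to get from this: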
"2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"
to this:
"/logschecks/scripts/setup1.php"
I have tried using match, scan, and split, but none of them got me where I need to go.
My method currently looks like this:
def pathHistogram (rowsInFile)
  i = 0
  urlHash = Hash.new
  while i <= rowsInFile.length - 1
    urlKey = rowsInFile[i].scan(/<"GET ">/).last.first
    if urlHash.has_key?(urlKey) == true
      #get the number of stars already in there and add one.
      urlHash[urlKey] = urlHash[urlKey] + '*'
      i = i + 1
    else
      urlHash[urlKey] = '*'
      i = i + 1
    end
  end
end
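I know that scanning for just "GET " does not finish the job, but I am trying to build this up step by step. The match and split versions I tried failed pretty epically, though I was probably using them incorrectly, and they are long gone.
Running this script gives me an undefined method error on "first", and when I change how that is handled I get other errors.
I should also say that I am not tied to using scan. If another method would work better, I would be more than happy to switch.
Any help is greatly appreciated.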
Answer 0: (score: 2)
You said in a comment on another answer that the pattern is basically "GET ... HTTP and that you are interested in the ... part. That is easy to extract:
line = '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'
line[/"GET (.*?) HTTP/, 1]
# => "/logschecks/scripts/setup1.php"
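To tie this back to the histogram in the question, the same extraction can be dropped into a loop. This is only a minimal sketch: it assumes the rows are plain strings and that lines without a GET request should simply be skipped. The name path_histogram and the shortened sample rows are made up for illustration.

```ruby
# Hypothetical helper: builds a star histogram keyed by request path.
# rows is assumed to be an array of log-line strings.
def path_histogram(rows)
  histogram = Hash.new('')             # missing keys read as an empty string
  rows.each do |row|
    path = row[/"GET (.*?) HTTP/, 1]   # nil when the line has no GET request
    next if path.nil?
    histogram[path] += '*'             # one more star for this path
  end
  histogram
end

# Shortened, made-up sample rows.
rows = [
  'request: "GET /a.php HTTP/1.1", host: "www.example.com"',
  'request: "GET /b.php HTTP/1.1", host: "www.example.com"',
  'request: "GET /a.php HTTP/1.1", host: "www.example.com"',
  'a line without any request'
]
result = path_histogram(rows)
# result => {"/a.php"=>"**", "/b.php"=>"*"}
```

Iterating with each avoids the manual index counter, and the nil check keeps a non-matching line from becoming a nil key.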
Answer 1: (score: 1)
Assuming each of your input lines contains /logschecks/...:
x = "2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: \"GET /logschecks/scripts/setup1.php HTTP/1.1\", host: \"www.example.com\""
x[%r(/logscheck[/\w\.]+)] # => "/logschecks/scripts/setup1.php"
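Answer 2: (score: 1)
Scanning HTTP logs is not hard, but how you go about it will vary with the format. In your sample you have it easier than with a standard log, because there are some landmarks you can look for: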
Search for request: "
Use something like:
/request: "\S+ (\S+)/i
That pattern skips over GET, POST, HEAD, or whatever method was used for the request.
log_line[/request: "\S+ (\S+)/i, 1] # => "/logschecks/scripts/setup1.php"
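As a quick check that the method really does not matter, here is a small sketch that runs the same pattern over a few lines with different request methods. The sample lines are invented and shortened to the part the pattern looks at.

```ruby
# Hypothetical sample lines, one per request method.
lines = [
  'server: localhost, request: "GET /index.html HTTP/1.1"',
  'server: localhost, request: "POST /login HTTP/1.1"',
  'server: localhost, request: "HEAD /status HTTP/1.0"'
]

# \S+ consumes the method; the capture group grabs the token after it.
urls = lines.map { |line| line[/request: "\S+ (\S+)/i, 1] }
# urls => ["/index.html", "/login", "/status"]
```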
You might want to know the method as well if you are mining the logs. In that case...
Search for request: "[GET|POST|HEAD|...]
Use something like:
/request: "(\S+) (\S+)/i
You can use it like this:
method, url = log_line.match(/request: "(\S+) (\S+)/i).captures # => ["GET", "/logschecks/scripts/setup1.php"]
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
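One thing to watch for: match returns nil when a line does not contain the pattern, so calling captures directly on the result raises NoMethodError for such lines. A guarded version might look like the following sketch; method_and_url is a made-up helper name, and returning nil for non-matching lines is just one possible choice.

```ruby
# Hypothetical helper: returns [method, url], or nil if the line has no request.
def method_and_url(log_line)
  m = log_line.match(/request: "(\S+) (\S+)/i)
  m ? m.captures : nil
end

hit  = method_and_url('server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1"')
miss = method_and_url('a line without a request field')
# hit  => ["GET", "/logschecks/scripts/setup1.php"]
# miss => nil
```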
You could also grab whatever is inside the double-quotes, then split it to get at the parts:
/request: "([^"]+)"/i
For example:
log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
method, url, http_ver = log_line[/request: "([^"]+)"/i, 1].split # => ["GET", "/logschecks/scripts/setup1.php", "HTTP/1.1"]
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
http_ver # => "HTTP/1.1"
Or use a slightly more complex pattern that takes advantage of some of the modern extensions and reduces the code:
log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
/request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i =~ log_line
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
http_ver # => "HTTP/1.1"
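Note that the automatic local variables are only created when the regex literal sits on the left-hand side of =~. If the pattern is stored in a variable, the named groups are still available through the MatchData object. The sketch below uses that to count requests per method; count_by_method and the sample lines are invented for illustration.

```ruby
# Hypothetical helper: tallies how many requests used each HTTP method.
def count_by_method(lines)
  pattern = /request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i
  counts = Hash.new(0)
  lines.each do |line|
    m = pattern.match(line)
    next unless m                # skip lines without a request field
    counts[m[:method]] += 1      # named group read from the MatchData
  end
  counts
end

# Shortened, made-up sample lines.
sample = [
  'request: "GET /a.php HTTP/1.1", host: "www.example.com"',
  'request: "POST /b.php HTTP/1.1", host: "www.example.com"',
  'request: "GET /c.php HTTP/1.1", host: "www.example.com"',
  'a line without any request'
]
method_counts = count_by_method(sample)
# method_counts => {"GET"=>2, "POST"=>1}
```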