Question

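I have a log file and need to create a hash key for each URL in its records. Each line of the log is put into an array, and I loop over that array assigning the hash keys.

I need to get from this: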

"2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"

to this:

"/logschecks/scripts/setup1.php"

I've tried match, scan, and split, but none of them has gotten me where I need to go.

My method currently looks like this:

def pathHistogram (rowsInFile)
  i = 0
  urlHash = Hash.new

  while i <= rowsInFile.length - 1

    urlKey = rowsInFile[i].scan(/<"GET ">/).last.first

    if urlHash.has_key?(urlKey) == true
      #get the number of stars already in there and add one. 
      urlHash[urlKey] = urlHash[urlKey] + '*'
      i = i + 1

    else 

      urlHash[urlKey] = '*'

      i = i + 1

    end
  end
end

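I know that just scanning for "GET" doesn't finish the job, but I'm trying to build it up step by step. The match and split versions I tried failed pretty epically, but I was probably using them wrong, and they're long gone.

Running this script gives me an undefined method error on "first", but when I change how that is handled I run into other errors.

I should also say that I'm not married to scan. If another method would work better, I'd be more than happy to switch.

Any help is much appreciated.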

Answer 1

You said in a comment on another answer that the pattern is basically "GET ... HTTP, and that you're interested in the ... part. That is easy to extract:

line = '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'

line[/"GET (.*?) HTTP/, 1]
# => "/logschecks/scripts/setup1.php"
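
As a rough sketch of how this extraction could slot into the histogram method from the question: the name `path_histogram` and the sample rows below are made up for illustration, and skipping lines without a GET request is an assumption about what you want. Two things in the original method are worth noting. The pattern `/<"GET ">/` looks for literal angle brackets, so `scan` returns an empty array, `.last` is `nil`, and calling `first` on it raises the undefined method error. The method also ends on the `while` loop, so it returns `nil` rather than the hash.

```ruby
# Hypothetical rewrite of the question's method; names are illustrative.
def path_histogram(rows)
  histogram = Hash.new('')   # missing keys start out as an empty string
  rows.each do |row|
    path = row[/"GET (.*?) HTTP/, 1]
    next unless path         # assumption: ignore lines with no GET request
    histogram[path] += '*'   # append one star per occurrence
  end
  histogram                  # return the hash explicitly
end

rows = [
  'request: "GET /a.php HTTP/1.1", host: "www.example.com"',
  'request: "GET /b.php HTTP/1.1", host: "www.example.com"',
  'request: "GET /a.php HTTP/1.1", host: "www.example.com"',
  'a line with no request in it'
]

histogram = path_histogram(rows)
histogram['/a.php'] # => "**"
histogram['/b.php'] # => "*"
```

Iterating with `each` also removes the manual `i` counter, which is easy to forget to increment in one of the branches.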

Answer 2

Assuming each of your input lines contains /logschecks/...:

x = "2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: \"GET /logschecks/scripts/setup1.php HTTP/1.1\", host: \"www.example.com\""


x[%r(/logscheck[/\w\.]+)] # => "/logschecks/scripts/setup1.php"
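
If the prefix might change, the same idea could be wrapped in a small helper. This is only a sketch: `path_under` is a hypothetical name, and the character class still assumes paths are made of word characters, slashes, and dots. A line that does not contain the prefix gives `nil`.

```ruby
# Hypothetical helper: returns the first path in the line that starts with prefix.
def path_under(line, prefix)
  # Regexp.escape keeps characters such as "." in the prefix literal.
  line[/#{Regexp.escape(prefix)}[\/\w.]*/]
end

x = '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'

path_under(x, '/logschecks')  # => "/logschecks/scripts/setup1.php"
path_under('no path here', '/logschecks') # => nil
```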

Answer 3

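Scanning HTTP logs isn't hard, but how you go about it will vary with the format. In your sample you have it easier than with a standard log, because you have some landmarks to look for:

Search for request: " using something like: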
```
/request: "\S+ (\S+)/i
```
That pattern skips over GET, POST, HEAD, or whatever method was used for the request.
```
log_line[/request: "\S+ (\S+)/i, 1] # => "/logschecks/scripts/setup1.php"
```
You might want to know which method was used if you're mining logs. In that case...

Search for request: "[GET|POST|HEAD|...] using something like:

/request: "(\S+) (\S+)/i

You can use it like this:

method, url = log_line.match(/request: "(\S+) (\S+)/i).captures # => ["GET", "/logschecks/scripts/setup1.php"]
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
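
One thing to watch when running this over a whole log: `match` returns `nil` for a line that doesn't fit the pattern, and calling `captures` on `nil` raises a NoMethodError. A minimal guard might look like the following, where `method_and_url` and `REQUEST_PATTERN` are hypothetical names chosen for the example:

```ruby
REQUEST_PATTERN = /request: "(\S+) (\S+)/i

# Returns [method, url] for a matching line, or nil when there is no request.
def method_and_url(log_line)
  match = log_line.match(REQUEST_PATTERN)
  match ? match.captures : nil
end

method, url = method_and_url('server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1"')
method # => "GET"
url    # => "/logschecks/scripts/setup1.php"

method_and_url('a line without a request') # => nil
```

Destructuring a `nil` result simply leaves both variables `nil`, so the caller can check either one before using it.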

You can also grab whatever is inside the double-quotes, then split it to get the parts:

/request: "([^"]+)"/i

For example:

log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
method, url, http_ver = log_line[/request: "([^"]+)"/i, 1].split # => ["GET", "/logschecks/scripts/setup1.php", "HTTP/1.1"]
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
http_ver # => "HTTP/1.1"

Or use a slightly more complex pattern that relies on some of the modern extensions and reduces the code:

log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
/request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i =~ log_line
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
http_ver # => "HTTP/1.1"
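
The automatic local-variable assignment only happens when the regex literal is on the left-hand side of `=~`. If the pattern lives in a constant or variable, the named groups can be read from the MatchData instead. A small sketch, with `LOG_PATTERN` and `parse_request` as hypothetical names:

```ruby
LOG_PATTERN = /request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i

# Returns a hash of the request parts, or nil if the line has no request.
def parse_request(log_line)
  m = LOG_PATTERN.match(log_line)
  return nil unless m
  { method: m[:method], url: m[:url], http_ver: m[:http_ver] }
end

request = parse_request('server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"')
request[:method]   # => "GET"
request[:url]      # => "/logschecks/scripts/setup1.php"
request[:http_ver] # => "HTTP/1.1"
```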

Substring hash key question?
