Substring hash key problem?

Time: 2014-03-17 20:41:32

Tags: ruby hash substring

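I have a log file and need to create a hash key for each URL in the records. Each line of the log is put into an array, and I loop over the array assigning the hash keys.

I need to get from this: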

"2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com" 

to this:

"/logschecks/scripts/setup1.php"

I have tried using match, scan, and split, but none of them got me where I need to go.

My method currently looks like:

def pathHistogram (rowsInFile)
  i = 0
  urlHash = Hash.new

  while i <= rowsInFile.length - 1

    urlKey = rowsInFile[i].scan(/<"GET ">/).last.first

    if urlHash.has_key?(urlKey) == true
      #get the number of stars already in there and add one. 
      urlHash[urlKey] = urlHash[urlKey] + '*'
      i = i + 1

    else 

      urlHash[urlKey] = '*'

      i = i + 1

    end
  end
end

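I know that just scanning for "GET " doesn't get the job done, but I'm trying to work through it step by step. The match and split versions I tried failed pretty epically, but I was probably using them incorrectly, and they're long gone.

Running this script gives me an undefined method error on "first", but when I change how that is handled I run into other errors.

I should also say that I'm not tied to using scan. If another method would work better, I'd be more than happy to switch.

Any help is greatly appreciated.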

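3 Answers:

Answer 0: (score: 2)

You said in a comment on another answer that the pattern is basically "GET ... HTTP and that you are interested in the ... part. That is easy to extract: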

line = '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'

line[/"GET (.*?) HTTP/, 1]
# => "/logschecks/scripts/setup1.php"
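To tie this back to the histogram in the question, here is a minimal sketch, not the asker's final code. Two things in the original method seem worth noting: scan(/<"GET ">/) looks for the literal text <"GET ">, so it returns an empty array, .last is then nil, and nil.first raises the undefined method error; and the method ends with the while loop, so it returns nil rather than the hash. The name path_histogram and the sample rows other than the log line from the question are made up for illustration.

```ruby
# Hypothetical helper: builds a hash of URL path => one '*' per occurrence.
def path_histogram(rows)
  histogram = Hash.new('')            # unseen paths start as an empty string
  rows.each do |row|
    path = row[/"GET (.*?) HTTP/, 1]  # first capture group, or nil if no match
    next if path.nil?                 # skip rows without a GET request
    histogram[path] += '*'
  end
  histogram                           # return the hash explicitly
end

# The first row is the line from the question; the others are invented samples.
rows = [
  '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"',
  '2010/08/23 15:25:36 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /index.html HTTP/1.1", host: "www.example.com"',
  '2010/08/23 15:25:37 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"',
  'a row that contains no request at all'
]

histogram = path_histogram(rows)
histogram['/logschecks/scripts/setup1.php'] # => "**"
histogram['/index.html']                    # => "*"
```

Iterating with each avoids the manual index counter, and the nil check keeps a malformed row from raising.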

Answer 1: (score: 1)

Assuming each of your input lines contains /logschecks/...

x = "2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: \"GET /logschecks/scripts/setup1.php HTTP/1.1\", host: \"www.example.com\""


x[%r(/logscheck[/\w\.]+)] # => "/logschecks/scripts/setup1.php"
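
A short sketch of what that assumption means in practice: String#[] with a regexp returns nil when nothing matches, so a line whose path does not start with /logscheck yields no key. The second sample string below is invented for illustration.

```ruby
pattern = %r(/logscheck[/\w\.]+)

# Shortened form of the request from the question.
matching = 'request: "GET /logschecks/scripts/setup1.php HTTP/1.1"'
# Invented sample whose path lies outside /logschecks.
other = 'request: "GET /admin/index.php HTTP/1.1"'

found = matching[pattern] # => "/logschecks/scripts/setup1.php"
missing = other[pattern]  # => nil
```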

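Answer 2: (score: 1)

Scanning HTTP logs isn't hard, but how you go about it varies with the format. Your example is easier than a standard log because it has some landmarks you can look for:

  • Search for request: " using something like: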

    /request: "\S+ (\S+)/i
    

    That pattern will skip over GET, POST, HEAD, or whatever method was used for the request.

    log_line[/request: "\S+ (\S+)/i, 1] # => "/logschecks/scripts/setup1.php"
    

    You might want to know the method if you're mining the logs. In that case...

  • Search for request: "[GET|POST|HEAD|...] using something like:

    /request: "(\S+) (\S+)/i
    

    You can use it like this:

    method, url = log_line.match(/request: "(\S+) (\S+)/i).captures # => ["GET", "/logschecks/scripts/setup1.php"]
    method # => "GET"
    url # => "/logschecks/scripts/setup1.php"
    
  • You can also grab whatever is inside the double-quotes, then split it to get the parts:

    /request: "([^"]+)"/i
    

    For example:

    log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
    method, url, http_ver = log_line[/request: "([^"]+)"/i, 1].split # => ["GET", "/logschecks/scripts/setup1.php", "HTTP/1.1"]
    method # => "GET"
    url # => "/logschecks/scripts/setup1.php"
    http_ver # => "HTTP/1.1"
    
  • Use a slightly more complex pattern that takes advantage of some of the modern extensions and reduces the code:

    log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
    /request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i =~ log_line
    method # => "GET"
    url # => "/logschecks/scripts/setup1.php"
    http_ver # => "HTTP/1.1"
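
If you are processing many lines, the named-capture pattern can also be used through Regexp#match, which returns nil for lines that don't fit instead of raising. The following is only a sketch under that assumption: REQUEST_PATTERN, count_requests, and every sample line except the one from the question are hypothetical names and data made up for illustration.

```ruby
REQUEST_PATTERN = /request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i

# Hypothetical helper: counts how often each [method, url] pair occurs.
def count_requests(lines)
  counts = Hash.new(0)                # unseen pairs start at zero
  lines.each do |line|
    m = REQUEST_PATTERN.match(line)   # MatchData, or nil when nothing matches
    next unless m                     # ignore lines without a request
    counts[[m[:method], m[:url]]] += 1
  end
  counts
end

# The first line is from the question; the others are invented samples.
lines = [
  '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"',
  '2010/08/23 15:25:36 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "POST /login.php HTTP/1.1", host: "www.example.com"',
  '2010/08/23 15:25:37 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"',
  'a line with no request in it'
]

counts = count_requests(lines)
counts[['GET', '/logschecks/scripts/setup1.php']] # => 2
counts[['POST', '/login.php']]                    # => 1
```

Going through the MatchData object keeps the captures out of local variables, which avoids shadowing the built-in method named method.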