How do I write a bash script to group together the Rails log lines of parallel, interleaved requests?

Date: 2016-02-19 19:47:33

Tags: bash, text

Suppose I have a list of Ruby on Rails 3 log files with names of the form:

production.log.CCYYMMDD

I am using log tagging, so every line is prefixed with a hash that is unique to its request. For example:

[1a23f343a5331aeb03dc2461895d66d7] Completed 200 OK in 43.2ms (Views: 0.2ms | ActiveRecord: 25.0ms | Solr: 0.0ms)
[3fb5d184493aea1f7637aa5a442d166a] Started GET "/fp.js?_=1455251108526" for 27.55.132.119 at 2016-02-12 06:25:10 +0200
[d730a47ee957fb4b12b01c3b03357ba6] Started POST "/api/d" for 183.88.158.125 at 2016-02-12 06:25:10 +0200
[3fb5d184493aea1f7637aa5a442d166a] Processing by Api::DevicesController#fp as JS
[3fb5d184493aea1f7637aa5a442d166a]   Parameters: {"_"=>"1455251108526"}
[3fb5d184493aea1f7637aa5a442d166a]   Rendered api/devices/fp.js.erb (5.3ms)
[3fb5d184493aea1f7637aa5a442d166a] Completed 200 OK in 6.4ms (Views: 5.0ms | ActiveRecord: 1.2ms | Solr: 0.0ms)

What happens here is that many processes log lines to the same file, which means that the lines of different requests are interleaved. You can see in the example above that the line of the request identified by d730a47ee957fb4b12b01c3b03357ba6 sits between lines of the request with ID 3fb5d184493aea1f7637aa5a442d166a.

I wrote a Ruby script that solves this by putting the lines of the same request together, in correct timestamp order. So my Ruby script manages to take the input above and produce:

[1a23f343a5331aeb03dc2461895d66d7] Completed 200 OK in 43.2ms (Views: 0.2ms | ActiveRecord: 25.0ms | Solr: 0.0ms)
[3fb5d184493aea1f7637aa5a442d166a] Started GET "/fp.js?_=1455251108526" for 27.55.132.119 at 2016-02-12 06:25:10 +0200
[3fb5d184493aea1f7637aa5a442d166a] Processing by Api::DevicesController#fp as JS
[3fb5d184493aea1f7637aa5a442d166a]   Parameters: {"_"=>"1455251108526"}
[3fb5d184493aea1f7637aa5a442d166a]   Rendered api/devices/fp.js.erb (5.3ms)
[3fb5d184493aea1f7637aa5a442d166a] Completed 200 OK in 6.4ms (Views: 5.0ms | ActiveRecord: 1.2ms | Solr: 0.0ms)
[d730a47ee957fb4b12b01c3b03357ba6] Started POST "/api/d" for 183.88.158.125 at 2016-02-12 06:25:10 +0200

Is there a way to do this with standard bash commands?

1 answer:

Answer 0 (score: 0)

Sample input

I have used this file as my sample input:

[abc1] Text abc_01
[abc1] Text abc_02
[def2] Text def_01
[abc1] Text abc_03
[def2] Text def_02
[def2] Text def_03
[xyz3] Text xyz_01
[abc1] Text abc_04
[xyz3] Text xyz_02
[def2] Text def_04

The first field acts as the hash, and the rest of the line is the log file entry.

The output we would like to see:

[abc1] Text abc_01
[abc1] Text abc_02
[abc1] Text abc_03
[abc1] Text abc_04
[def2] Text def_01
[def2] Text def_02
[def2] Text def_03
[def2] Text def_04
[xyz3] Text xyz_01
[xyz3] Text xyz_02

Lines with the same hash keep their relative order, but the hashes are now sorted by first appearance.

Pure Bash solution

Here is a pure Bash solution using only built-ins:

#!/bin/bash

# Read entire file into array
mapfile -t loglines < "$1"

# Associative array to map hashes to line numbers
declare -A hashes

# Array to keep track of sequence of hashes
declare -a hash_seq

# Current line number
ln_no=0

while read -r req_hash _; do
    if [[ -z ${hashes[$req_hash]} ]]; then # If we haven't seen this hash yet
        # Add hash to hash sequence
        hash_seq+=("$req_hash")
        # Add current line number as first line of hash
        hashes[$req_hash]=$ln_no
    else
        # Append current line number to hash
        hashes[$req_hash]+=" $ln_no"
    fi
    ((++ln_no)) # Increment line number
done < "$1"

# Loop over hashes in sequence
for i in "${hash_seq[@]}"; do
    # Loop over line number for each hash
    for ln in ${hashes[$i]}; do # Must not be quoted - shell word splitting!
        # Print next line for this hash
        echo "${loglines[$ln]}"
    done
done

I have added comments that should make the following understandable:

  • First, the whole file is read into an array (the mapfile line).
  • Then the file is traversed once more to find the sequence in which the hashes occur (the hash_seq array) and which line numbers belong to each hash (the hashes associative array).
  • Finally, we loop over all hashes in sequence and, for each line number of a hash, look up that line in the loglines array and print it.

This is slow, as we shall see later.
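As a quick sanity check, here is the same logic inlined into a self-contained snippet that writes the sample input to a temporary file and prints the grouped result (note that read -r req_hash captures the hash including its square brackets):

```shell
#!/bin/bash
# Write the sample input from above to a temporary file
infile=$(mktemp)
cat > "$infile" <<'EOF'
[abc1] Text abc_01
[abc1] Text abc_02
[def2] Text def_01
[abc1] Text abc_03
[def2] Text def_02
[def2] Text def_03
[xyz3] Text xyz_01
[abc1] Text abc_04
[xyz3] Text xyz_02
[def2] Text def_04
EOF

# Same logic as the script above, inlined
mapfile -t loglines < "$infile"
declare -A hashes
declare -a hash_seq
ln_no=0
while read -r req_hash _; do
    if [[ -z ${hashes[$req_hash]} ]]; then
        hash_seq+=("$req_hash")
        hashes[$req_hash]=$ln_no
    else
        hashes[$req_hash]+=" $ln_no"
    fi
    ((++ln_no))
done < "$infile"

# Collect the grouped output and print it
result=$(
    for i in "${hash_seq[@]}"; do
        for ln in ${hashes[$i]}; do
            echo "${loglines[$ln]}"
        done
    done
)
printf '%s\n' "$result"
rm -f "$infile"
```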

Bash with grep

If we are not limited to just Bash, we don't have to store the whole file in an array and can use grep to extract the lines we want:

#!/bin/bash

# Associative array to keep track of hashes seen
declare -A hashes

# Array to keep track of sequence of hashes
declare -a hash_seq

while read -r req_hash _; do
    if [[ -z ${hashes[$req_hash]} ]]; then # If we haven't seen this hash yet
        # Mark as "seen"
        hashes[$req_hash]=1
        # Add hash to hash sequence
        hash_seq+=("$req_hash")
    fi
done < "$1"

# Loop over hashes in sequence
for cur_hash in "${hash_seq[@]}"; do
    # Print all lines containing the hash
    grep -F "$cur_hash" "$1"
done

This does away with storing the file, the line numbers in the associative array, and the line counter:

  • The while loop captures the order in which the hashes appear.
  • The for loop takes each hash in sequence and uses grep to fetch all lines belonging to the corresponding request. The -F option is used so the square brackets are not interpreted as metacharacters.
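A quick end-to-end run against a shortened sample (written to a temporary file here; since read -r req_hash keeps the brackets as part of the hash, the fixed-string grep matches exactly the tagged lines):

```shell
#!/bin/bash
# Shortened sample input in a temporary file
infile=$(mktemp)
cat > "$infile" <<'EOF'
[abc1] Text abc_01
[def2] Text def_01
[abc1] Text abc_02
EOF

# First pass: record the order in which hashes appear
declare -A hashes
declare -a hash_seq
while read -r req_hash _; do
    if [[ -z ${hashes[$req_hash]} ]]; then
        hashes[$req_hash]=1
        hash_seq+=("$req_hash")
    fi
done < "$infile"

# Second pass: one grep per hash, in order of first appearance
out=$(
    for cur_hash in "${hash_seq[@]}"; do
        grep -F "$cur_hash" "$infile"
    done
)
printf '%s\n' "$out"
rm -f "$infile"
```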

awk solution

Technically not "Bash" any more, but usually available wherever Bash is.

#!/usr/bin/awk -f

{
    loglines[NR] = $0   # Store all lines
    if ($1 in hashes) { # Append line number if we've seen the hash before
        hashes[$1] = hashes[$1] " " NR
    }
    else {                      # New hash
        hashes[$1] = NR         # Add line number, mark as seen
        hash_seq[++ctr] = $1    # Add to hash sequence
    }
}

END {
    for (i = 1; i <= ctr; ++i) {                # Loop over hashes in sequence
        nl = split(hashes[hash_seq[i]], arr)    # Get line numbers
        for (j = 1; j <= nl; ++j)               # Loop over line numbers
            print loglines[arr[j]]              # Print corresponding line
    }
}

This works like the first Bash solution, just much faster.
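The same program can also be run inline from the shell instead of as an executable script; here it is against a shortened sample fed through a here-document:

```shell
#!/bin/sh
# The awk program from above, run inline on a three-line sample
out=$(awk '
{
    loglines[NR] = $0   # Store all lines
    if ($1 in hashes)   # Append line number if hash was seen before
        hashes[$1] = hashes[$1] " " NR
    else {              # New hash
        hashes[$1] = NR
        hash_seq[++ctr] = $1
    }
}
END {
    for (i = 1; i <= ctr; ++i) {
        nl = split(hashes[hash_seq[i]], arr)
        for (j = 1; j <= nl; ++j)
            print loglines[arr[j]]
    }
}' <<'EOF'
[abc1] Text abc_01
[def2] Text def_01
[abc1] Text abc_02
EOF
)
printf '%s\n' "$out"
```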

Benchmark

Which one to choose? For the sample input extended to 100,000 lines, the solutions perform as follows (output redirected to a file, as in ./SO.sh infile > outfile):

  • Pure Bash:

    real    1m7.021s
    user    1m2.265s
    sys     0m4.468s
    
  • Bash and grep:

    real    0m3.882s
    user    0m1.967s
    sys     0m1.890s
    
  • awk:

    real    0m0.915s
    user    0m0.906s
    sys     0m0.015s
    

Unsurprisingly, awk is the fastest, but Bash with grep is not unbearable either. Pure Bash is slow.

Limitations

Associative arrays were added in Bash 4.0, so the Bash solutions obviously require at least that version. I don't think I have used any GNU extensions of Bash, grep, or awk, so these solutions should be fairly portable.
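A small guard can make the version requirement explicit; this sketch only checks the major version via BASH_VERSINFO (the has_assoc_arrays variable name is just for illustration):

```shell
#!/bin/bash
# Detect whether the running Bash supports associative arrays (>= 4.0)
if ((BASH_VERSINFO[0] >= 4)); then
    has_assoc_arrays=yes
else
    has_assoc_arrays=no
fi
echo "associative arrays available: $has_assoc_arrays"
```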