Suppose I have a set of Ruby on Rails 3 log files named in the following format:
production.log.CCYYMMDD
I am using log tagging, so every line is prefixed with the request's unique hash. For example:
[1a23f343a5331aeb03dc2461895d66d7] Completed 200 OK in 43.2ms (Views: 0.2ms | ActiveRecord: 25.0ms | Solr: 0.0ms)
[3fb5d184493aea1f7637aa5a442d166a] Started GET "/fp.js?_=1455251108526" for 27.55.132.119 at 2016-02-12 06:25:10 +0200
[d730a47ee957fb4b12b01c3b03357ba6] Started POST "/api/d" for 183.88.158.125 at 2016-02-12 06:25:10 +0200
[3fb5d184493aea1f7637aa5a442d166a] Processing by Api::DevicesController#fp as JS
[3fb5d184493aea1f7637aa5a442d166a] Parameters: {"_"=>"1455251108526"}
[3fb5d184493aea1f7637aa5a442d166a] Rendered api/devices/fp.js.erb (5.3ms)
[3fb5d184493aea1f7637aa5a442d166a] Completed 200 OK in 6.4ms (Views: 5.0ms | ActiveRecord: 1.2ms | Solr: 0.0ms)
What is happening here is that many processes log lines to the same file, which means lines from different requests are interleaved. You can see in the example above that the line for the request with ID d730a47ee957fb4b12b01c3b03357ba6 sits between lines of the request with ID 3fb5d184493aea1f7637aa5a442d166a.
I wrote a Ruby script to fix this, grouping the lines of the same request together, in correct timestamp order. So my Ruby script manages to take the input above and produce:
[1a23f343a5331aeb03dc2461895d66d7] Completed 200 OK in 43.2ms (Views: 0.2ms | ActiveRecord: 25.0ms | Solr: 0.0ms)
[3fb5d184493aea1f7637aa5a442d166a] Started GET "/fp.js?_=1455251108526" for 27.55.132.119 at 2016-02-12 06:25:10 +0200
[3fb5d184493aea1f7637aa5a442d166a] Processing by Api::DevicesController#fp as JS
[3fb5d184493aea1f7637aa5a442d166a] Parameters: {"_"=>"1455251108526"}
[3fb5d184493aea1f7637aa5a442d166a] Rendered api/devices/fp.js.erb (5.3ms)
[3fb5d184493aea1f7637aa5a442d166a] Completed 200 OK in 6.4ms (Views: 5.0ms | ActiveRecord: 1.2ms | Solr: 0.0ms)
[d730a47ee957fb4b12b01c3b03357ba6] Started POST "/api/d" for 183.88.158.125 at 2016-02-12 06:25:10 +0200
Is there a way to do this with standard bash commands?
Answer 0 (score: 0)
I have used this file as my sample input:
[abc1] Text abc_01
[abc1] Text abc_02
[def2] Text def_01
[abc1] Text abc_03
[def2] Text def_02
[def2] Text def_03
[xyz3] Text xyz_01
[abc1] Text abc_04
[xyz3] Text xyz_02
[def2] Text def_04
The first field acts as the hash, and the rest of the line is the log file entry.
The output we would like to see:
[abc1] Text abc_01
[abc1] Text abc_02
[abc1] Text abc_03
[abc1] Text abc_04
[def2] Text def_01
[def2] Text def_02
[def2] Text def_03
[def2] Text def_04
[xyz3] Text xyz_01
[xyz3] Text xyz_02
Lines with the same hash keep their relative order, but the hashes are now grouped in order of first appearance.
Here is a pure Bash solution using only builtins:
#!/bin/bash

# Read entire file into array
mapfile -t loglines < "$1"

# Associative array to map hashes to line numbers
declare -A hashes
# Array to keep track of sequence of hashes
declare -a hash_seq

# Current line number
ln_no=0
while read -r req_hash _; do
    if [[ -z ${hashes[$req_hash]} ]]; then  # If we haven't seen this hash yet
        # Add hash to hash sequence
        hash_seq+=("$req_hash")
        # Add current line number as first line of hash
        hashes[$req_hash]=$ln_no
    else
        # Append current line number to hash
        hashes[$req_hash]+=" $ln_no"
    fi
    ((++ln_no))  # Increment line number
done < "$1"

# Loop over hashes in sequence
for i in "${hash_seq[@]}"; do
    # Loop over line numbers for each hash
    for ln in ${hashes[$i]}; do  # Must not be quoted - shell word splitting!
        # Print next line for this hash
        echo "${loglines[$ln]}"
    done
done
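Saved as, say, SO.sh (the name used in the timings below) and made executable, the script takes the input file as its only argument:

chmod +x SO.sh
./SO.sh infile > outfile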
I have added some comments which I think should make the following understandable:
- Read the complete file into the loglines array using the mapfile builtin, one element per line (a short standalone sketch of mapfile follows this list).
- In the first loop, keep track of the order in which the hashes appear (the hash_seq array) and which line numbers belong to each hash (the hashes associative array).
- Loop over the hashes in sequence; for each line number, look up that line in the loglines array and print it.
This is slow, as we will see later.
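If mapfile is unfamiliar: it is a Bash builtin (since 4.0) that reads input into an indexed array; a minimal standalone sketch:

# Read infile into the 'lines' array, one element per line;
# -t strips the trailing newline from each element
mapfile -t lines < infile
echo "Read ${#lines[@]} lines; first line: ${lines[0]}"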
If we are not limited to pure Bash, we don't have to store the whole file in an array and can use grep to extract the lines we want:
#!/bin/bash

# Associative array to keep track of hashes seen
declare -A hashes
# Array to keep track of sequence of hashes
declare -a hash_seq

while read -r req_hash _; do
    if [[ -z ${hashes[$req_hash]} ]]; then  # If we haven't seen this hash yet
        # Mark as "seen"
        hashes[$req_hash]=1
        # Add hash to hash sequence
        hash_seq+=("$req_hash")
    fi
done < "$1"

# Loop over hashes in sequence
for cur_hash in "${hash_seq[@]}"; do
    # Print all lines containing the hash
    grep -F "$cur_hash" "$1"
done
This does away with storing the file in an array and tracking line numbers; the associative array is now only used to mark hashes as "seen":
- The while loop records the order in which the hashes appear.
- The for loop takes each hash in sequence and uses grep to fetch all lines belonging to the corresponding request. The -F option is used so that the square brackets are not interpreted as regex metacharacters (illustrated right after this list).
This is technically no longer pure Bash, but grep is usually available wherever Bash is.
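To see why -F matters here (using the sample input from above): without it, the bracketed hash becomes a character class.

# Without -F, '[abc1]' is a regular expression: the brackets form a
# character class matching any single one of a, b, c or 1, so this
# matches far more lines than intended:
grep '[abc1]' infile
# With -F the pattern is taken literally, matching only lines that
# actually contain the string '[abc1]':
grep -F '[abc1]' infile

The same bookkeeping can also be done in awk: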
#!/usr/bin/awk -f

{
    loglines[NR] = $0          # Store all lines
    if ($1 in hashes) {        # Append line number if we've seen the hash before
        hashes[$1] = hashes[$1] " " NR
    }
    else {                     # New hash
        hashes[$1] = NR        # Add line number, mark as seen
        hash_seq[++ctr] = $1   # Add to hash sequence
    }
}

END {
    for (i = 1; i <= ctr; ++i) {                # Loop over hashes in sequence
        nl = split(hashes[hash_seq[i]], arr)    # Get line numbers
        for (j = 1; j <= nl; ++j)               # Loop over line numbers
            print loglines[arr[j]]              # Print corresponding line
    }
}
This does the same as the first Bash solution, just faster.
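Thanks to the shebang line, it can be run just like the Bash scripts once saved and made executable (SO.awk is an assumed name):

chmod +x SO.awk
./SO.awk infile > outfile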
Which one to choose? For the sample input scaled up to 100,000 lines, the solutions time as follows (output redirected to a file, as in ./SO.sh infile > outfile):
Pure Bash:
real 1m7.021s
user 1m2.265s
sys 0m4.468s
Bash and grep:
real 0m3.882s
user 0m1.967s
sys 0m1.890s
awk:
real 0m0.915s
user 0m0.906s
sys 0m0.015s
Unsurprisingly, awk is the fastest, but Bash with grep is not unbearable. Pure Bash is slow.
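For reference, each measurement was taken along these lines, using the shell's time keyword (which produces the real/user/sys breakdown shown above):

time ./SO.sh infile > outfile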
Associative arrays were added in Bash 4.0, so the Bash solutions obviously require at least that version. I don't think I have used any GNU extensions to Bash, grep or awk, so these solutions should be fairly portable.
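If in doubt, a small guard near the top of the Bash scripts can fail early on older shells; a sketch using the standard BASH_VERSINFO array:

# Associative arrays require Bash 4.0 or newer
if ((BASH_VERSINFO[0] < 4)); then
    echo "This script requires Bash >= 4.0" >&2
    exit 1
fi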