Suppose I have a set of Ruby on Rails 3 log files named in the following format:
production.log.CCYYMMDD
I am using log tagging, so every line is prefixed with the request's unique hash. For example:
[1a23f343a5331aeb03dc2461895d66d7] Completed 200 OK in 43.2ms (Views: 0.2ms | ActiveRecord: 25.0ms | Solr: 0.0ms)
[3fb5d184493aea1f7637aa5a442d166a] Started GET "/fp.js?_=1455251108526" for 27.55.132.119 at 2016-02-12 06:25:10 +0200
[d730a47ee957fb4b12b01c3b03357ba6] Started POST "/api/d" for 183.88.158.125 at 2016-02-12 06:25:10 +0200
[3fb5d184493aea1f7637aa5a442d166a] Processing by Api::DevicesController#fp as JS
[3fb5d184493aea1f7637aa5a442d166a] Parameters: {"_"=>"1455251108526"}
[3fb5d184493aea1f7637aa5a442d166a] Rendered api/devices/fp.js.erb (5.3ms)
[3fb5d184493aea1f7637aa5a442d166a] Completed 200 OK in 6.4ms (Views: 5.0ms | ActiveRecord: 1.2ms | Solr: 0.0ms)
What is happening here is that many processes log lines to the same file, which means lines from different requests are interleaved. You can see in the example above that the line for the request with ID d730a47ee957fb4b12b01c3b03357ba6 sits between lines of the request with ID 3fb5d184493aea1f7637aa5a442d166a.
I wrote a Ruby script to fix this, grouping the lines of the same request together, in correct timestamp order. So my Ruby script manages to take the input above and produce:
[1a23f343a5331aeb03dc2461895d66d7] Completed 200 OK in 43.2ms (Views: 0.2ms | ActiveRecord: 25.0ms | Solr: 0.0ms)
[3fb5d184493aea1f7637aa5a442d166a] Started GET "/fp.js?_=1455251108526" for 27.55.132.119 at 2016-02-12 06:25:10 +0200
[3fb5d184493aea1f7637aa5a442d166a] Processing by Api::DevicesController#fp as JS
[3fb5d184493aea1f7637aa5a442d166a] Parameters: {"_"=>"1455251108526"}
[3fb5d184493aea1f7637aa5a442d166a] Rendered api/devices/fp.js.erb (5.3ms)
[3fb5d184493aea1f7637aa5a442d166a] Completed 200 OK in 6.4ms (Views: 5.0ms | ActiveRecord: 1.2ms | Solr: 0.0ms)
[d730a47ee957fb4b12b01c3b03357ba6] Started POST "/api/d" for 183.88.158.125 at 2016-02-12 06:25:10 +0200
Is there a way to do this with standard bash commands?
Answer 0 (score: 0)
I have used this file as my sample input:
[abc1] Text abc_01
[abc1] Text abc_02
[def2] Text def_01
[abc1] Text abc_03
[def2] Text def_02
[def2] Text def_03
[xyz3] Text xyz_01
[abc1] Text abc_04
[xyz3] Text xyz_02
[def2] Text def_04
The first field acts as the hash, and the rest of the line is the log file entry.
The output we would like to see:
[abc1] Text abc_01
[abc1] Text abc_02
[abc1] Text abc_03
[abc1] Text abc_04
[def2] Text def_01
[def2] Text def_02
[def2] Text def_03
[def2] Text def_04
[xyz3] Text xyz_01
[xyz3] Text xyz_02
Lines with the same hash keep their relative order, but the hashes are now grouped in order of first appearance.
Here is a pure Bash solution using only builtins:
#!/bin/bash

# Read entire file into array
mapfile -t loglines < "$1"

# Associative array to map hashes to line numbers
declare -A hashes
# Array to keep track of sequence of hashes
declare -a hash_seq

# Current line number
ln_no=0
while read -r req_hash _; do
    if [[ -z ${hashes[$req_hash]} ]]; then  # If we haven't seen this hash yet
        # Add hash to hash sequence
        hash_seq+=("$req_hash")
        # Add current line number as first line of hash
        hashes[$req_hash]=$ln_no
    else
        # Append current line number to hash
        hashes[$req_hash]+=" $ln_no"
    fi
    ((++ln_no))  # Increment line number
done < "$1"

# Loop over hashes in sequence
for i in "${hash_seq[@]}"; do
    # Loop over line numbers for each hash
    for ln in ${hashes[$i]}; do  # Must not be quoted - shell word splitting!
        # Print next line for this hash
        echo "${loglines[$ln]}"
    done
done
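Saved as, say, SO.sh (the name used in the timings below) and made executable, the script takes the input file as its only argument:

chmod +x SO.sh
./SO.sh infile > outfile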
I have added some comments which I think should make the following understandable:
- Read the complete file into the loglines array using the mapfile builtin, one element per line (a short standalone sketch of mapfile follows this list).
- In the first loop, keep track of the order in which the hashes appear (the hash_seq array) and which line numbers belong to each hash (the hashes associative array).
- Loop over the hashes in sequence; for each line number, look up that line in the loglines array and print it.
This is slow, as we will see later.
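If mapfile is unfamiliar: it is a Bash builtin (since 4.0) that reads input into an indexed array; a minimal standalone sketch:

# Read infile into the 'lines' array, one element per line;
# -t strips the trailing newline from each element
mapfile -t lines < infile
echo "Read ${#lines[@]} lines; first line: ${lines[0]}"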
If we are not limited to pure Bash, we don't have to store the whole file in an array and can use grep to extract the lines we want:
#!/bin/bash

# Associative array to keep track of hashes seen
declare -A hashes
# Array to keep track of sequence of hashes
declare -a hash_seq

while read -r req_hash _; do
    if [[ -z ${hashes[$req_hash]} ]]; then  # If we haven't seen this hash yet
        # Mark as "seen"
        hashes[$req_hash]=1
        # Add hash to hash sequence
        hash_seq+=("$req_hash")
    fi
done < "$1"

# Loop over hashes in sequence
for cur_hash in "${hash_seq[@]}"; do
    # Print all lines containing the hash
    grep -F "$cur_hash" "$1"
done
This does away with storing the file in an array and tracking line numbers; the associative array is now only used to mark hashes as "seen":
- The while loop records the order in which the hashes appear.
- The for loop takes each hash in sequence and uses grep to fetch all lines belonging to the corresponding request. The -F option is used so that the square brackets are not interpreted as regex metacharacters (illustrated right after this list).
This is technically no longer pure Bash, but grep is usually available wherever Bash is.
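To see why -F matters here (using the sample input from above): without it, the bracketed hash becomes a character class.

# Without -F, '[abc1]' is a regular expression: the brackets form a
# character class matching any single one of a, b, c or 1, so this
# matches far more lines than intended:
grep '[abc1]' infile
# With -F the pattern is taken literally, matching only lines that
# actually contain the string '[abc1]':
grep -F '[abc1]' infile

The same bookkeeping can also be done in awk: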
#!/usr/bin/awk -f

{
    loglines[NR] = $0          # Store all lines
    if ($1 in hashes) {        # Append line number if we've seen the hash before
        hashes[$1] = hashes[$1] " " NR
    }
    else {                     # New hash
        hashes[$1] = NR        # Add line number, mark as seen
        hash_seq[++ctr] = $1   # Add to hash sequence
    }
}

END {
    for (i = 1; i <= ctr; ++i) {                # Loop over hashes in sequence
        nl = split(hashes[hash_seq[i]], arr)    # Get line numbers
        for (j = 1; j <= nl; ++j)               # Loop over line numbers
            print loglines[arr[j]]              # Print corresponding line
    }
}
This does the same as the first Bash solution, just faster.
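Thanks to the shebang line, it can be run just like the Bash scripts once saved and made executable (SO.awk is an assumed name):

chmod +x SO.awk
./SO.awk infile > outfile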
Which one to choose? For the sample input scaled up to 100,000 lines, the solutions time as follows (output redirected to a file, as in ./SO.sh infile > outfile):
Pure Bash:
real 1m7.021s
user 1m2.265s
sys 0m4.468s
Bash and grep:
real 0m3.882s
user 0m1.967s
sys 0m1.890s
awk:
real 0m0.915s
user 0m0.906s
sys 0m0.015s
Unsurprisingly, awk is the fastest, but Bash with grep is not unbearable. Pure Bash is slow.
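For reference, each measurement was taken along these lines, using the shell's time keyword (which produces the real/user/sys breakdown shown above):

time ./SO.sh infile > outfile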
Associative arrays were added in Bash 4.0, so the Bash solutions obviously require at least that version. I don't think I have used any GNU extensions to Bash, grep or awk, so these solutions should be fairly portable.
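If in doubt, a small guard near the top of the Bash scripts can fail early on older shells; a sketch using the standard BASH_VERSINFO array:

# Associative arrays require Bash 4.0 or newer
if ((BASH_VERSINFO[0] < 4)); then
    echo "This script requires Bash >= 4.0" >&2
    exit 1
fi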