Searching multiple files with grep is too slow

Time: 2019-04-03 12:48:08

Tags: bash shell scripting grep

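I am trying to search multiple files for 607,526 integer entries (stored in an array), add up the values found for each entry, and store the totals in a file. Processing 32,470 entries has taken 1 hour 45 minutes and it still has not finished. Could you help me improve the script? The script is as follows: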

#!/bin/bash

my_array=( `grep Curr a.txt  | sed -e 's/Time:\(.*\).Num.*/\1/'` )
my_array_length=${#my_array[@]}
echo $my_array_length

rm -rf output
touch output

for element in "${my_array[@]}"
do
#   echo "${element}"
   toggles=`grep -w "time: ${element}" file_* | awk '{ sum += $6}; END {print sum }'`
   echo "Time:"${element}".Num - "$toggles >> output
done

The input and output are:

a.txt

Curr Time:0.Num - 6274
Curr Time:500.Num - 2
Curr Time:1500.Num - 62
Curr Time:2000.Num - 3
Curr Time:2500.Num - 2
Curr Time:3000.Num - 214
Curr Time:3500.Num - 205
Curr Time:4500.Num - 2
Curr Time:5000.Num - 211
Curr Time:5500.Num - 231


file_0

time: 0 count: 517
time: 2000 count: 9
time: 2500 count: 30
time: 4500 count: 14
time: 5000 count: 2


file_1

time: 0 count: 1500
time: 500 count: 10
time: 1500 count: 25
time: 2500 count: 39
time: 4500 count: 26
time: 5500 count: 154

output

Curr Time:0.NumToggles - 2017
Curr Time:500.NumToggles - 11
Curr Time:1500.NumToggles - 25
Curr Time:2000.NumToggles - 9
Curr Time:2500.NumToggles - 69
Curr Time:3000.NumToggles - 0
Curr Time:3500.NumToggles - 0
Curr Time:4500.NumToggles - 40
Curr Time:5000.NumToggles - 2
Curr Time:5500.NumToggles - 154

If needed, a picture is available at https://i.stack.imgur.com/kFxt8.jpg.

1 answer:

Answer 0 (score: 1)

This works in my git bash emulation. Let me know if it chokes on the full dataset.

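# Needs GNU awk (gawk): the three-argument form of match() is a gawk extension.
awk -v keyfile=a.txt '
  { sum[$2] += $4 }    # data lines look like: time: 0 count: 517
  END {
    # Read the key file in its original order; "> 0" stops at EOF and on read errors.
    while ((getline line < keyfile) > 0) {
      if (match(line, /^Curr Time:(.*)\.Num/, key))
        printf "Curr Time:%d.NumToggles - %d\n", key[1], sum[key[1]]
    }
    close(keyfile)
  }
' file_*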

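Logic: make one pass over all the data files to sum the values for each key. Then make one pass through the main file to get the complete set of keys, printing the sum for each key. This starts a single process that reads each file once. The original script starts two processes for the initial load and then two more (grep and awk) to do a full scan of all the data files for *every* key, which means hundreds of thousands of passes over the files.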
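If gawk is not available, the same single-pass idea can probably be written in plain POSIX awk. The following is only a rough sketch, not something I have run against the full dataset. The function name sum_toggles is made up for illustration, and it assumes that the key file is non-empty and is passed first, that its lines look like "Curr Time:<n>.Num - <m>", and that the data lines look like "time: <n> count: <m>" as in the samples above.

```shell
#!/bin/sh
# sum_toggles KEYFILE DATAFILE...   (hypothetical helper name)
# Prints one "Curr Time:<n>.NumToggles - <sum>" line per key in KEYFILE,
# in the order the keys appear there. Keys with no data print 0.
sum_toggles() {
  awk '
    # First file on the command line (the key file): remember keys in order.
    # Assumption: the key file is not empty, otherwise FNR == NR would also
    # match the first data file.
    FNR == NR {
      if ($0 ~ /^Curr Time:/) {
        key = $0
        sub(/^Curr Time:/, "", key)   # drop the prefix
        sub(/\.Num.*$/, "", key)      # drop everything from ".Num" onward
        keys[++n] = key
      }
      next
    }
    # Remaining files (the data files): accumulate the count per time value.
    $1 == "time:" { sum[$2] += $4 }
    END {
      for (i = 1; i <= n; i++)
        printf "Curr Time:%s.NumToggles - %d\n", keys[i], sum[keys[i]]
    }
  ' "$@"
}

# Usage (file names taken from the question):
#   sum_toggles a.txt file_* > output
```

Here the key file is read as the first input file instead of through getline, so the keys are stored in an array and printed in the END block. That costs some memory for the key list, which should be acceptable for a few hundred thousand integers.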

Questions welcome.