Question

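While trying to optimize my bash script, I loaded a file into an array and tried to grep from there. I noticed that grep on the in-memory data is much slower than a standard grep on the file, even though disk I/O is taken out of the equation.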

1) OK, so I have a large file (about 3000 lines) of name=value pairs; this is my "cache". I load it from the file into an array (straightforward):

# load to array
l_i_array_index=1
while read line
  do
  g_a_cache[$l_i_array_index]=$line
  let "l_i_array_index += 1"
  done < $g_f_cache

2) Then I ran a small benchmark of search performance:

time for i in `seq 1 100`
  do
  for l_i_array_index in `seq 1 ${#g_a_cache[@]}`
    do
      echo ${g_a_cache[$l_i_array_index]}
    done | grep -c $l_s_search_string > /dev/null
  done

real    0m14.387s
user    0m13.846s
sys     0m1.781s

3) The same, but directly against the file on disk:

time for i in `seq 1 100`
  do
  grep -c $l_s_search_string $g_f_cache > /dev/null
  done
real    0m0.347s
user    0m0.161s
sys     0m0.136s

So performance is 13 to 40 times worse, when it should be better.

My questions are: 1) what is the reason for this odd behavior, and 2) can it be solved in bash, or should I bite the bullet and finally redo it in Python?

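P.S. The test was done on a Mac (bash v4). In Cygwin, each search takes more than a second with normal grep (the faster option) and more than 10 seconds with the array approach. The script is almost unusable.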

Answer 1

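The grep program has been heavily optimized over many years by top experts in search and algorithm design. You are simply not going to beat it with a shell script. It's a ridiculous notion.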

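Honestly, I can't imagine why you would expect to get anywhere near grep's speed. Perhaps you think that all of grep's disk I/O actually has to touch the physical disk. It doesn't. Every modern operating system has a disk cache; after the first read the file is in the cache, and reading it again takes only a small amount of time.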
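To make the point concrete, here is a minimal sketch, assuming bash 4+ (for `mapfile`) and a throwaway temp file as hypothetical sample data in place of the real cache. It hands the whole array to a single grep through one `printf` call instead of running `echo` once per element in a shell loop. It is not expected to beat grep on the cached file, only to avoid the per-line loop. Variable names follow the question; note that `mapfile` fills the array from index 0, not 1.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the ~3000-line cache file.
g_f_cache=$(mktemp)
for i in $(seq 1 3000); do
  printf 'name%d=value%d\n' "$i" "$i"
done > "$g_f_cache"

# Load the file into an array in one call (mapfile needs bash 4+).
# Indices start at 0 here, unlike the 1-based loop in the question.
mapfile -t g_a_cache < "$g_f_cache"

# Hypothetical search string, chosen to match exactly one line.
l_s_search_string='name42='

# Print every element with a single printf and let one grep do the search.
count_from_array=$(printf '%s\n' "${g_a_cache[@]}" | grep -c -- "$l_s_search_string")

# The same search straight from the file, for comparison.
count_from_file=$(grep -c -- "$l_s_search_string" "$g_f_cache")

echo "array: $count_from_array, file: $count_from_file"
rm -f "$g_f_cache"
```

Wrapping either search in `time for i in $(seq 1 100); do ...; done`, as in the question, gives a like-for-like comparison on your own machine.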

Answer 2

This is probably related to the file system cache, which means the "optimized" implementation actually adds overhead.

You can see the file system cache at work in the following:

localhost elhigu$ time grep -r "testnottofind" * 
real    0m22.468s
user    0m0.172s
sys 0m0.828s
localhost elhigu$ time grep -r "testnottofind" * 
real    0m0.826s
user    0m0.087s
sys 0m0.232s
localhost elhigu$ time grep -r "testnottofind2" * 
real    0m0.285s
user    0m0.084s
sys 0m0.190s
localhost elhigu$ time grep -r "moretesting" * 
real    0m0.285s
user    0m0.086s
sys 0m0.185s
localhost elhigu$

There is more information at https://unix.stackexchange.com/questions/8914/does-grep-use-a-cache-to-speed-up-the-searches.

Grep in an array is much slower than grep in a file