How do I grep in parallel?

Asked: 2012-08-10 09:33:49

Tags: linux grep

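I usually run grep -rIn pattern_str big_source_code_dir to find things. But grep does not run in parallel. How can I make it parallel? My system has 4 cores, and grep would be faster if it could use all of them.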

4 answers:

Answer 0 (score: 11)

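If the directory you are searching is stored on an HDD, you will not see a speed improvement: the search is then limited by disk I/O rather than CPU, and hard drives are pretty much single-threaded access units.

But if you really want a parallel grep, this website gives two hints on how to do it with find and xargs. E.g.: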

find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar
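
As a minimal sketch of the same find/xargs approach, the function below wraps the pipeline so that the job count follows the number of online cores. The name par_grep and its argument order are hypothetical (they do not come from the linked site), and -H and -I are GNU/BSD grep options rather than POSIX ones.

```shell
#!/usr/bin/env bash
# par_grep PATTERN DIR [JOBS]  (hypothetical helper, for illustration only)
par_grep() {
  local pattern=$1 dir=$2
  # Default to the number of online cores; fall back to 4 if getconf cannot tell.
  local jobs=${3:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)}
  # -print0 / -0 : NUL-separated names survive spaces and quotes.
  # -n 40        : each grep process receives up to 40 files.
  # -P "$jobs"   : run up to $jobs grep processes at once.
  # -I -H -n     : skip binary files, always print file name and line number.
  find "$dir" -type f -print0 |
    xargs -0 -P "$jobs" -n 40 grep -I -H -n -e "$pattern"
}
```

It could be called as par_grep pattern_str big_source_code_dir. The order of output lines across batches is not deterministic, and GNU xargs exits with status 123 when any batch finds no match, so a non-zero status does not by itself indicate an error.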

Answer 1 (score: 1)

Note that you need to escape special characters in your parallel grep search term, for example:

parallel --pipe --block 10M --ungroup LC_ALL=C grep -F 'PostTypeId=\"1\"' < ~/Downloads/Posts.xml > questions.xml

Using standalone grep, grep -F 'PostTypeId="1"' would work without escaping the double quotes. It took me a while to figure that out!

Also note the use of LC_ALL=C and the -F flag (if you're just searching full strings) for additional speed-ups.
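
The escaping issue can be reproduced without parallel installed. The sketch below assumes the cause is that the command arguments are joined into one string and parsed by a shell a second time; reparse is a hypothetical stand-in for that behaviour, not part of GNU parallel.

```shell
#!/usr/bin/env bash
# reparse: hypothetical stand-in for a tool that joins its arguments into one
# string and hands that string to a shell, which parses the quotes a second time.
reparse() {
  sh -c "$*"
}

line='<row Id="4" PostTypeId="1" />'

# Direct call: the single quotes are enough, grep receives PostTypeId="1".
direct=$(printf '%s\n' "$line" | LC_ALL=C grep -F 'PostTypeId="1"')

# Re-parsed, unescaped: the inner shell strips the double quotes, so grep
# searches for PostTypeId=1 and finds nothing.
unescaped=$(printf '%s\n' "$line" | reparse LC_ALL=C grep -F 'PostTypeId="1"' || true)

# Re-parsed, escaped: the backslashes survive the first parse and protect the
# double quotes during the second one.
escaped=$(printf '%s\n' "$line" | reparse LC_ALL=C grep -F 'PostTypeId=\"1\"')

printf 'direct=[%s]\nunescaped=[%s]\nescaped=[%s]\n' "$direct" "$unescaped" "$escaped"
```

Only the unescaped, re-parsed variant comes back empty, which matches the behaviour described above.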

Answer 2 (score: 0)

The GNU parallel command is very useful for this.

sudo apt-get install parallel # if not available on debian based systems

Then, the parallel man page provides an example:

EXAMPLE: Parallel grep
       grep -r greps recursively through directories. 
       On multicore CPUs GNU parallel can often speed this up.

       find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

       This will run 1.5 job per core, and give 1000 arguments to grep.

In your case, it could be:

find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}
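
A variant that also copes with unusual file names might look like the sketch below. It assumes GNU parallel is installed (not the moreutils tool of the same name); the wrapper name parallel_grep is hypothetical. It swaps the newline-separated file list for a NUL-separated one via find -print0 and parallel -0, and drops -n 1000 so that -m alone decides how many files each grep receives.

```shell
#!/usr/bin/env bash
# parallel_grep PATTERN DIR  (hypothetical wrapper; assumes GNU parallel)
parallel_grep() {
  local pattern=$1 dir=$2
  # -0      : read NUL-separated file names, matching find -print0.
  # -k      : keep output in the same order as the input.
  # -j150%  : run 1.5 jobs per core, as in the man page example.
  # -m      : pack as many file names into each grep call as will fit.
  # Note: parallel passes the command through a shell, so PATTERN is parsed
  # again; keep it free of shell metacharacters or escape them.
  find "$dir" -type f -print0 |
    parallel -0 -k -j150% -m grep -I -H -n -e "$pattern" {}
}
```

Usage would be parallel_grep pattern_str big_source_code_dir.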

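Finally, the GNU parallel man page also has a section describing the differences between the xargs and parallel commands, which helps explain why parallel looks like the better fit in your case: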

DIFFERENCES BETWEEN xargs AND GNU Parallel
       xargs offer some of the same possibilities as GNU parallel.

       xargs deals badly with special characters (such as space, ' and "). To see the problem try this:

         touch important_file
         touch 'not important_file'
         ls not* | xargs rm
         mkdir -p "My brother's 12\" records"
         ls | xargs rmdir

       You can specify -0 or -d "\n", but many input generators are not optimized for using NUL as separator but are optimized for newline as separator. E.g head, tail, awk, ls, echo, sed, tar -v, perl (-0 and \0 instead of \n),
       locate (requires using -0), find (requires using -print0), grep (requires user to use -z or -Z), sort (requires using -z).

       So GNU parallel's newline separation can be emulated with:

       cat | xargs -d "\n" -n1 command

       xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel.

       xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process. The example Parallel grep cannot be
       done reliably with xargs because of this.
       ...

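Answer 3 (score: 0)

Here are 3 ways to do it, but with two of them you cannot get line numbers.

(1) Run grep on multiple files in parallel, in this case all files in a directory and its subdirectories. Add /dev/null to force grep to prefix each matching line with the file name, since you will want to know which file matched. Adjust the number of processes (-P) for your machine.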

find . -type f | xargs -n 1 -P 4 grep -n <grep-args> /dev/null

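(2) Run grep on multiple files serially, but process each file in parallel in 10M blocks. Adjust the block size for your machine and files. Here are two ways to do this.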

# for-loop
for filename in `find . -type f`
do 
  parallel --pipepart --block 10M -a $filename -k "grep <grep-args> | awk -v OFS=: '{print \"$filename\",\$0}'"
done

# using xargs
find . -type f | xargs -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"

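(3) Combine (1) and (2): run grep on multiple files in parallel and also process the blocks of each file in parallel. Adjust the block size and the xargs parallelism for your machine.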

find . -type f | xargs -n 1 -P 4 -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"

Note that (3) may not be the best use of resources.
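
The /dev/null trick from method (1) matters because xargs -n 1 gives each grep a single file, and grep omits the file name when it has only one file operand. A small sketch to check this on a throwaway directory follows; the variable names are illustrative, and -H is a GNU/BSD grep option rather than a POSIX one.

```shell
#!/usr/bin/env bash
demo_dir=$(mktemp -d)
printf 'foo\nbar\n' > "$demo_dir/one.txt"

# One file operand: grep omits the file name.
plain=$(grep -n bar "$demo_dir/one.txt")

# Two operands (/dev/null never matches): grep prefixes the file name.
with_devnull=$(grep -n bar /dev/null "$demo_dir/one.txt")

# -H requests the file name directly, without the extra operand.
with_H=$(grep -H -n bar "$demo_dir/one.txt")

printf '%s\n%s\n%s\n' "$plain" "$with_devnull" "$with_H"
rm -rf "$demo_dir"
```

The first line printed carries only the line number, while the other two carry the path as well.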

I have a longer write-up, but this is the basic idea.