
Date: 2012-08-10 09:33:49

Tags: linux grep

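I usually use grep -rIn pattern_str big_source_code_dir to find things. But grep does not run in parallel; how can I make it parallel? My system has 4 cores, and if grep could use all of them it would be faster.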

4 answers:

Answer 0 (score: 11)


If you really want to run grep in parallel, this website gives two hints on how to do it using find and xargs. For example:

find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar
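
A minimal sketch of the same idea that can be run as-is. It assumes a throwaway tree (demo_dir and the file names are made up for illustration) and reads the core count from getconf instead of hard-coding 4. -H -n is added so each hit carries its filename and line number, as with the question's grep -rIn, and -I skips binary files.

```shell
# Hypothetical demo tree; none of these names come from the original post
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src/sub dir"
printf 'alpha\nfoobar here\n' > "$demo_dir/src/a.c"
printf 'FooBar again\n' > "$demo_dir/src/sub dir/b.c"
printf 'nothing\n' > "$demo_dir/src/c.c"

# Number of online cores, falling back to 4 if getconf cannot tell
jobs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)

# NUL-separated names survive spaces; up to $jobs greps run at once,
# each receiving at most 40 files. sort makes the order deterministic.
matches=$(find "$demo_dir" -type f -print0 |
  xargs -0 -P "$jobs" -n 40 grep -I -H -n -i foobar | sort)

printf '%s\n' "$matches"
rm -rf "$demo_dir"
```

The sort at the end is only there because parallel jobs finish in no particular order; drop it if ordering does not matter.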

Answer 1 (score: 1)

Note that you need to escape special characters in your parallel grep search term, for example:

parallel --pipe --block 10M --ungroup LC_ALL=C grep -F 'PostTypeId=\"1\"' < ~/Downloads/Posts.xml > questions.xml

Using standalone grep, grep -F 'PostTypeId="1"' would work without escaping the double quotes. It took me a while to figure that out!

Also note the use of LC_ALL=C and the -F flag (if you're just searching full strings) for additional speed-ups.
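
The escaping is needed because the command line gets parsed by a shell a second time before grep runs. The sketch below does not call parallel at all; it emulates that second parse with sh -c, which is an assumption made so the effect can be reproduced with standard tools only. The two-line sample file is hypothetical and merely stands in for Posts.xml.

```shell
# Hypothetical two-line sample standing in for Posts.xml
sample=$(mktemp)
printf '%s\n' '<row Id="1" PostTypeId="1" />' '<row Id="2" PostTypeId="2" />' > "$sample"

# Standalone grep: the shell parses the pattern once, no escaping needed
direct=$(LC_ALL=C grep -c -F 'PostTypeId="1"' "$sample")

# Command strings as they look after the first round of shell parsing
unescaped='LC_ALL=C grep -c -F PostTypeId="1" '"$sample"
escaped='LC_ALL=C grep -c -F PostTypeId=\"1\" '"$sample"

# Second round of parsing, emulated with sh -c
count_unescaped=$(sh -c "$unescaped" || true)  # quotes stripped: searches for PostTypeId=1
count_escaped=$(sh -c "$escaped")              # quotes survive: searches for PostTypeId="1"

echo "direct=$direct unescaped=$count_unescaped escaped=$count_escaped"
rm -f "$sample"
```

Only the escaped form finds the row once the string has been re-parsed; the unescaped form silently searches for a different string and counts zero matches.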

Answer 2 (score: 0)

The GNU parallel command is very useful for this.

sudo apt-get install parallel # if not available on debian based systems


EXAMPLE: Parallel grep
       grep -r greps recursively through directories. 
       On multicore CPUs GNU parallel can often speed this up.

       find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

       This will run 1.5 jobs per core, and give 1000 arguments to grep.


find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}
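
A hedged variant of the command above that passes file names NUL-separated (find -print0 with parallel -0), so names containing spaces or quotes are not split. The tree and its contents are invented for illustration, and the fallback to xargs is only there so the sketch still runs on a machine without GNU parallel; it is not part of the man page example.

```shell
# Hypothetical source tree for the demo
tree=$(mktemp -d)
mkdir -p "$tree/lib"
printf 'int pattern_str = 1;\n' > "$tree/main.c"
printf 'skip\nuse pattern_str\n' > "$tree/lib/util.c"

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # -0 reads NUL-separated names; -k keeps output in input order
  hits=$(find "$tree" -type f -print0 |
    parallel -0 -k -j150% -n 1000 -m grep -H -n pattern_str {} | sort)
else
  # Assumed fallback when GNU parallel is not installed: plain xargs
  hits=$(find "$tree" -type f -print0 |
    xargs -0 -n 1000 grep -H -n pattern_str | sort)
fi

printf '%s\n' "$hits"
rm -rf "$tree"
```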


       xargs offers some of the same possibilities as GNU parallel.

       xargs deals badly with special characters (such as space, ' and "). To see the problem try this:

         touch important_file
         touch 'not important_file'
         ls not* | xargs rm
         mkdir -p "My brother's 12\" records"
         ls | xargs rmdir

       You can specify -0 or -d "\n", but many input generators are not optimized for using NUL as separator but are optimized for newline as separator. E.g. head, tail, awk, ls, echo, sed, tar -v, perl (-0 and \0 instead of \n),
       locate (requires using -0), find (requires using -print0), grep (requires user to use -z or -Z), sort (requires using -z).

       So GNU parallel's newline separation can be emulated with:

       cat | xargs -d "\n" -n1 command

       xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel.

       xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process. The example Parallel grep cannot be
       done reliably with xargs because of this.
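
For the special-character problem specifically, the NUL-separated route mentioned in the excerpt can be sketched as follows. It reuses the file names from the excerpt inside a scratch directory (work is a made-up name), and relies on find's -maxdepth and -mindepth, which GNU and BSD find provide but POSIX does not.

```shell
# Scratch directory with the awkward names from the excerpt
work=$(mktemp -d)
touch "$work/important_file" "$work/not important_file"
mkdir -p "$work/My brother's 12\" records"

# Delete only the file whose name starts with "not"; with -print0 / -0
# the space in the name is not treated as a separator
find "$work" -maxdepth 1 -type f -name 'not*' -print0 | xargs -0 rm

# Remove the directory whose name contains both ' and "
find "$work" -mindepth 1 -maxdepth 1 -type d -print0 | xargs -0 rmdir

remaining=$(ls "$work")
echo "$remaining"
rm -rf "$work"
```

With newline- or whitespace-separated input, the first rm would instead have been asked to delete "not" and "important_file" as two separate names.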

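Answer 3 (score: 0)

Here are 3 ways to do it, but with two of them you cannot get line numbers.

(1) Run grep in parallel on multiple files, in this case all files in a directory and its subdirectories. Add /dev/null to force grep to prepend the filename to each matching line, because you will want to know which file matched. Adjust the number of processes -P for your machine.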

find . -type f | xargs -n 1 -P 4 grep -n <grep-args> /dev/null
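
The /dev/null trick works because grep prints filenames only when it is given more than one file, and with xargs -n 1 each grep sees exactly one. A small sketch with a hypothetical one-file input shows the difference:

```shell
# Hypothetical input file
f=$(mktemp)
printf 'one\nneedle\n' > "$f"

# One file argument: grep prints line number and text, but no filename
without_name=$(grep -n needle "$f")

# Two file arguments (the second is empty): grep prefixes the filename
with_name=$(grep -n needle "$f" /dev/null)

echo "$without_name"
echo "$with_name"
rm -f "$f"
```

On GNU grep, passing -H should have the same effect without the extra argument.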

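(2) Run grep serially on multiple files, but process each file in parallel in 10M blocks. Adjust the block size for your machine and files. Here are two ways to do this.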

# for-loop (word-splits find's output, so it assumes filenames without whitespace)
for filename in `find . -type f`
do
  parallel --pipepart --block 10M -a "$filename" -k "grep <grep-args> | awk -v OFS=: '{print \"$filename\",\$0}'"
done

# using xargs
find . -type f | xargs -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"

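(3) Combine (1) and (2): run grep in parallel on multiple files and also process their contents in parallel blocks. Adjust the block size and the xargs parallelism for your machine.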

find . -type f | xargs -n 1 -P 4 -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"

Note that (3) may not be the best use of resources.

I have a longer write-up, but this is the basic idea.