Question

I am writing a function that applies several greps to each line of a file. I run it like this:

 cat file.txt | agrep string1 string2 ... stringN

The idea is to print every line that contains all of the strings string1, string2, ..., stringN. I tried two approaches. The first is recursive:

agrep () {
    if [ $# = 0 ]; then
        cat
    else
        pattern="$1"
        shift
        grep -e "$pattern" | agrep "$@"
    fi
}

The second is iterative, since it uses a for loop:

function agrep () {  
  for a in $@;  do
    cmd+=" | grep '$a'";
  done ;
  while read line ; do
    eval "echo "\'"$line"\'" $cmd";
  done;
}

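Both approaches work well, but I would like to know whether someone can tell me which one is more efficient. A way to measure this in bash would also be welcome, because I don't think I have enough experience to judge: I don't know whether bash, as a programming language, works better with an iterative or a recursive approach, or whether using eval is expensive.

Both functions are meant to process large texts, line by line. I would really appreciate any explanation or advice.

Here is a sample text file called risk: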

1960’s. Until the 1990’s it was a purely theoretical analysis of the
problem of function estimation from a given collection of data.
In the middle of the 1990’s new types of learning algorithms
(called support vector machines) based on the developed t

Then if I run:

cat risk | agrep Until

I get:

1960.s. Until the 1990.s it was a purely theoretical analysis of the

On the other hand, if I run:

cat risk | agrep Until new

nothing is printed, because no line contains both strings. This example is only meant to clarify how the function is used.

Answer 1

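I completely agree with the comments and answers that have already pointed out the pitfalls of your current approaches.

Building on the suggestion made by karakfa, I would suggest a function that calls awk, along these lines: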

agrep() {
    awk 'BEGIN {
        # read command line arguments and unset them
        for (i = 1; i < ARGC; ++i) {
            strings[i] = ARGV[i]
            ARGV[i] = ""
        }
    }
    {
        for (i in strings) {
            # if the line does not match, skip it
            if ($0 !~ strings[i]) next
        }
        # print remaining lines
        print
    }' "$@"
}

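This passes all of the function's arguments to awk as arguments, which awk would normally treat as file names. Before any input line is processed, each argument is copied into a new array, strings, and blanked out in ARGV so that awk does not try to open it as a file.

Use it like this: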

agrep string1 string2 string3 < file
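One point worth keeping in mind with the function above: the `!~` operator treats each argument as a regular expression, so a search string such as `a.b` would also match `axb`. If literal matching is what you want, a variant along the following lines might do. This is only a sketch; the name `agrep_fixed` is hypothetical, and swapping the regex test for awk's `index()` is an assumption about the desired behaviour, not part of the original answer.

```shell
#!/usr/bin/env bash

# Hypothetical variant of agrep that matches its arguments as literal strings.
agrep_fixed() {
    awk 'BEGIN {
        # copy the command line arguments, then blank them in ARGV
        for (i = 1; i < ARGC; ++i) {
            strings[i] = ARGV[i]
            ARGV[i] = ""
        }
    }
    {
        for (i in strings) {
            # index() returns 0 when the literal substring is absent
            if (index($0, strings[i]) == 0) next
        }
        # only lines containing every string reach this point
        print
    }' "$@"
}

# The dot is matched literally here, so only the first input line is printed.
printf 'a.b c\naxb c\n' | agrep_fixed a.b c
```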

Answer 2

Both are quite inefficient, but because grep is very fast you may not have noticed. A better approach is to switch to awk:

awk '/string1/ && /string2/ && ...  && /stringN/' file

which does the same thing in a single pass over the input.
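As a concrete illustration, the one-liner can be tried against the sample text from the question. The sketch below keeps the sample in a shell variable (the name `sample` is just a placeholder for this example) instead of a file, so it runs without creating anything on disk.

```shell
#!/usr/bin/env bash

# Sample text copied from the question, held in a variable for the demo.
sample=$(cat <<'EOF'
1960’s. Until the 1990’s it was a purely theoretical analysis of the
problem of function estimation from a given collection of data.
In the middle of the 1990’s new types of learning algorithms
(called support vector machines) based on the developed t
EOF
)

# One awk process checks both patterns on each line in a single pass.
# Only the first line contains both "Until" and "theoretical".
printf '%s\n' "$sample" | awk '/Until/ && /theoretical/'

# No line contains both "Until" and "new", so this prints nothing.
printf '%s\n' "$sample" | awk '/Until/ && /new/'
```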

Answer 3

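Security

The eval-based approach has a serious flaw: it allows code injection through maliciously crafted search strings. So, of the two approaches given, the recursive one is the only reasonable choice for a real production scenario.

Why is the eval approach insecure? Take a look at this line: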

cmd+=" | grep '$a'";

What happens if a=$'\'"$(rm -rf ~)"\''?

A corrected implementation might change this line as follows:

printf -v cmd '%s | grep -e %q' "$cmd" "$a"
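To see how that corrected line might fit into a whole function, here is a minimal sketch. The name `agrep_safe` is hypothetical, and running the generated pipeline once over standard input (rather than once per line, as in the question) is an assumption of this sketch. It is done that way because the original per-line `eval "echo '$line' ..."` also splices input text into the evaluated string, so a line containing a single quote could break out of the quoting as well.

```shell
#!/usr/bin/env bash

# Hypothetical safer take on the iterative approach.
agrep_safe() {
    local cmd='cat' a
    for a in "$@"; do
        # %q quotes the pattern so eval sees it as one literal word
        printf -v cmd '%s | grep -e %q' "$cmd" "$a"
    done
    # the pipeline reads the function's standard input exactly once
    eval "$cmd"
}

# Prints only the line that contains both strings.
printf 'foo bar\nfoo\nbar baz\n' | agrep_safe foo bar
```

With this quoting, a search string shaped like the one shown earlier is handed to grep as a literal pattern and is not executed by the shell.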

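Performance

Your recursive approach does all of its recursion while setting up a pipeline whose length is proportional to the number of arguments passed to agrep. Once that pipeline is set up, the shell itself is out of the way (all ongoing work is done by the grep processes), and the performance overhead is exactly that of the pipeline itself.

So, for a sufficiently large input file, the cost of the setup phase becomes effectively zero, and the relevant performance difference is the one between cat and a while read loop, a contest that cat will win easily once the input is large enough to overcome its startup cost.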
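As for measuring this in bash, one simple option is the shell's built-in `time` keyword. The sketch below is only an illustration: the function names `agrep_rec` and `agrep_eval`, the generated input, and its size are all assumptions chosen for the example, and the eval version is reproduced from the question with `cmd` made local so that it can be called more than once. Actual timings depend on the machine, so none are quoted here.

```shell
#!/usr/bin/env bash

# Recursive version from the question (pattern made local).
agrep_rec() {
    if [ $# = 0 ]; then
        cat
    else
        local pattern="$1"
        shift
        grep -e "$pattern" | agrep_rec "$@"
    fi
}

# Iterative eval version from the question. Shown for timing only:
# it is open to code injection and should not be used on untrusted data.
agrep_eval() {
    local cmd='' a line
    for a in $@; do
        cmd+=" | grep '$a'"
    done
    while read line; do
        eval "echo "\'"$line"\'" $cmd"
    done
}

# Generate a small throwaway input file (400 lines, half of them matching).
input=$(mktemp)
for ((i = 0; i < 200; i++)); do
    printf 'line %d alpha\nline %d alpha beta\n' "$i" "$i"
done > "$input"

# `time` reports real, user and sys time on standard error.
time agrep_rec alpha beta < "$input" > /dev/null
time agrep_eval alpha beta < "$input" > /dev/null

rm -f "$input"
```

Raising the iteration count should make the gap between the two more visible, since the eval version starts a fresh set of grep processes for every input line.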

Recursive bash function vs iterative "eval" string building: which performs better?
