When a program like awk takes its input from a pipe, does it read it line by line?

Asked: 2015-06-30 11:55:18

Tags: bash unix awk

So I usually perform unix tasks in the form cat text.csv | awk '{print $1}', and the output is a set of results separated by newlines. I'm wondering: since cat streams the csv file out (I assume it does so sequentially), does awk also process that output line by line, executing sequentially? I feel this is obviously true, but given how bash commands can be applied to multiple items, I'd like to know whether bash processes these commands in some way other than reading line by line.

For example, here are two ways I could write the code:

while IFS=, read a b c; do
    echo $a $b $c
done < textfile.txt

OR

cat textfile.txt | awk '{print $1 $2 $3}'

How do they differ in runtime, or in how they process the data?

3 Answers:

Answer 0 (score: 5)

It depends entirely on the program reading the pipe and how it processes the input. By default, most tools use \n as the "record separator" and therefore work line by line.
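A quick way to see that line-by-line behavior for yourself (a minimal sketch; the sample data is arbitrary):

```shell
# awk treats each newline-terminated chunk as one record;
# NR counts records, so the numbering shows record-by-record processing.
printf 'first\nsecond\n' | awk '{print NR ": " $0}'
# prints:
#   1: first
#   2: second
```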

That is a choice each tool makes, though it applies in most cases. The pipe itself has no concept of a record separator - you can send anything you like through it, even raw binary content.

e.g.

tar cvf - . | gzip -c | ssh $somehost "cat > file.tgz"

Edit: regarding your update:

How you write the code doesn't much matter. Make it clear, clean and elegant, and don't worry about efficiency until you have to. That moment usually never comes, so any time spent on optimization is time wasted.

The thing that costs more time than anything else is moving data off the disk - by far. You can't make that faster, so for the most part there is no need to worry.


"Premature optimization is the root of all evil" - Donald Knuth

So write clear code; that comes first. If you really do need to worry about performance, profile it and concentrate your efforts there (and probably don't use the shell).

Answer 1 (score: 1)

It partly depends on how the source code is written, but it is almost certainly using a buffered interface. The size of that buffer depends how the pipe is used in the program, the C Runtime Library and the operating system in use.

Typically the constants used are PIPE_SIZE (BSD) and PIPE_BUF (POSIX). Assuming a POSIX system, the minimum size is 512 bytes, but it could be 4096 bytes (a common page size).
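If you are curious about the value on your own system, getconf can report the POSIX pipe-buffer limit (a sketch; the actual result varies by OS, with 4096 common on Linux):

```shell
# PIPE_BUF is a pathname variable, so getconf needs a path argument;
# writes up to this many bytes to a pipe are guaranteed to be atomic.
getconf PIPE_BUF /
# POSIX requires the value to be at least 512.
```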

The code itself might be using a higher level interface which slices on newlines, and the lower level will manage the buffer. There are several ways to do that.

You compare pipes and file IO. The overhead of using pipes the way you show (particularly in bash) is that each component runs in a child process. While bash IO is not particularly efficient, its cost is unlikely to exceed that of creating child processes to run things like cat.

cat textfile.txt | awk '{print $1 $2 $3}'

This will create two child processes. Although cat is very efficient, it is still overhead. Whether that overhead exceeds the inefficiencies of bash file IO will be data dependent. You really should benchmark it yourself with your own meaningful data (not trivial snippets). However, most would say you should avoid unnecessary child processes. See also the Useless use of cat award.
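A minimal way to run such a benchmark (a sketch; the generated file and its size are arbitrary assumptions, not from the original post):

```shell
# Build a reproducible CSV-like input so both variants read identical data.
seq 100000 | awk '{print $1 "," $1 "," $1}' > /tmp/bench.csv

# Compare the extra-cat pipeline against letting awk open the file itself;
# output goes to /dev/null so mostly process and IO overhead is measured.
time cat /tmp/bench.csv | awk -F, '{print $1, $2, $3}' > /dev/null
time awk -F, '{print $1, $2, $3}' /tmp/bench.csv > /dev/null
```

Note that in bash the time keyword times the whole pipeline, not just the first command.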

The read command in bash has a number of complications which make answering your question quite difficult. There are differences depending on whether read is reading from a pipe, the command line, or a file. It even supports unbuffered input. Also, you can ignore the newline delimiters with:

read -N number_of_characters variable ...

and you can change the record delimiter so that it is not a newline:

read -d delimiter variable ...

The -d option causes read to continue until the first character of delimiter is read, rather than newline.
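For example (a sketch assuming bash; the input strings are purely illustrative):

```shell
# -N reads exactly 5 characters, ignoring newlines as delimiters.
printf 'hello world' | { read -N 5 chunk; echo "$chunk"; }
# prints: hello

# -d : makes ':' the record delimiter instead of newline.
printf 'usr:x:0' | { read -d : field; echo "$field"; }
# prints: usr
```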

Answer 2 (score: 1)

The two ways you've written the code:

while IFS=, read a b c; do
    echo $a $b $c
done < textfile.txt

OR

cat textfile.txt | awk '{print $1 $2 $3}'

are wrong. The shell loop will be very slow and will produce surprising results depending on the contents of the input file. The correct way to write it so as to avoid those surprising results is (and you should be using printf rather than echo):

while IFS=, read -r a b c; do
    echo "$a $b $c"
done < textfile.txt

but it will still be extremely slow. The shell is an environment from which to call tools, with a language to sequence those calls; it is not a tool for text processing - the UNIX tool for text processing is awk.

The cat | awk command should be written as:

awk '{print $1, $2, $3}' textfile.txt

because awk is perfectly capable of opening files itself, and no UNIX command EVER needs cat to open a file for it - they can all open files themselves (cmd file) or have the shell open the file for them (cmd < file).

awk processes one input record at a time, where an input record is any chunk of text separated by the value of awk's RS variable (a newline by default). That is true regardless of where the records come from or go to. The only thing you [rarely] need to think about is buffering - see your awk and shell man pages for details.
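For instance, changing RS makes awk split records on something other than newlines (a quick sketch; the data is made up):

```shell
# With RS set to ',', every comma-delimited chunk is its own record.
printf 'a,b,c' | awk -v RS=',' '{print NR, $0}'
# prints:
#   1 a
#   2 b
#   3 c
```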

One way to set shell variables from awk output:

$ cat file
the quick brown fox

$ array=( $(awk '{print $1, $2, $3}' file) )

$ echo "${array[0]}"                        
the
$ echo "${array[1]}"                        
quick
$ echo "${array[2]}"
brown

You can set individual shell variables from the array contents if you like, or just use the array.
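Pulling individual variables out of that array might look like this (a sketch; the variable names are arbitrary, and the file contents from above are inlined with printf to keep it self-contained):

```shell
# Split awk's output into a bash array, then name the pieces.
array=( $(printf 'the quick brown fox\n' | awk '{print $1, $2, $3}') )
first="${array[0]}" second="${array[1]}" third="${array[2]}"
echo "$first $second $third"
# prints: the quick brown
```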

Another way:

$ set -- $(awk '{print $1, $2, $3}' file)

$ echo "$1"
the
$ echo "$2"
quick
$ echo "$3"
brown