Optimizing an awk command for a large file

Time: 2018-08-25 03:34:07

Tags: linux bash awk

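I have these functions to process a 2GB text file. I split it into 6 parts and process them in parallel, but it still takes more than 4 hours.

What else can I try to make the script faster?

Some details:

  1. I feed the input CSV into a while loop to read it line by line.
  2. In the read2col function, I extract the values of 5 fields (7, 25, 26, 27 and 30) from the CSV line.
  3. The awk in my mainf function takes the values from read2col and does some arithmetic. I round the results to 2 decimal places, then print the line to a text file.

Sample data: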

"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"

Script:

read2col()
{
is_one_way=$(echo "$line"| awk -F'","' '{print $7}')
price_outbound=$(echo "$line"| awk -F'","' '{print $30}')
price_exc=$(echo "$line"| awk -F'","' '{print $25}')
tax=$(echo "$line"| awk -F'","' '{print $27}')
price_inc=$(echo "$line"| awk -F'","' '{print $26}')
}


#################################################
#for each line in the csv
mainf()
{
cd $infarepath

while read -r line; do
        #read the value of csv fields into variables
        read2col

        if [[ $is_one_way == 0 ]]; then
                if [[ $price_outbound > 0 ]]; then
                        #calculate price inc and print the entire line to txt file
                        echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax -F'","' 'BEGIN {OFS = FS} {$25=pout;$26=(pout+(tax / 2)); print}' >>"$csvsplitfile".tmp
                else
                        #divide price ecx and inc by 2 if price outbound is not greater than 0
                        echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2);$26=(pinc /2); print}' >>"$csvsplitfile".tmp
                fi
        else
                echo $line >>"$csvsplitfile".tmp
        fi

done < $csvsplitfile
}

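1 answer:

Answer 0 (score: 11)

The first thing you should do is stop invoking six subshells to run awk for every single line of input. Let's do a quick back-of-the-envelope calculation.

Assuming your input lines are about 292 characters long (as in your example), a 2G file consists of a little over 7.3 million lines. That means you are starting and stopping a whopping forty-four million processes.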
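The estimate above can be reproduced with shell arithmetic. This is only a sketch: it assumes "2G" means 2^31 bytes and counts six awk invocations per line (five in read2col plus one in mainf); the variable names are made up for illustration.

```shell
# Assumption: "2G" is taken as 2 GiB (2^31 bytes).
file_bytes=$((2 * 1024 * 1024 * 1024))
# Approximate length of one record, taken from the sample line.
line_bytes=292
# Five awk calls in read2col plus one in mainf.
awk_calls_per_line=6

lines=$((file_bytes / line_bytes))
processes=$((lines * awk_calls_per_line))

echo "lines:     $lines"
echo "processes: $processes"
```

With these assumptions the script reports roughly 7.35 million lines and about 44.1 million awk invocations, which matches the figures quoted above.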

And, while Linux handles fork and exec about as efficiently as possible, it's not without cost:

pax$ time for i in {1..44000000} ; do true ; done
real 1m0.946s

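In addition, bash isn't really optimised for this sort of processing; its design leads to sub-optimal behaviour for this specific use case. For details, see this excellent answer on one of our sister sites.

The following compares two ways of processing a file: a single program reading the entire file (each line contains just hello), and bash reading it one line at a time. The two commands used to get the timings were: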

time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )

For varying file sizes (user+sys time, averaged over a few runs), the results are quite interesting:

# of lines   cat-method   while-method
----------   ----------   ------------
     1,000       0.375s         0.031s
    10,000       0.391s         0.234s
   100,000       0.406s         1.994s
 1,000,000       0.391s        19.844s
10,000,000       0.375s       205.583s
44,000,000       0.453s       889.402s

From this, it looks like the while method can hold its own for smaller data sets, but it really does not scale well.
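To try this comparison on your own machine, a test file of a chosen size is needed first. The helper below is a hypothetical sketch (make_sample is not part of the original answer) that writes N lines of hello to a file; the exact timings you then measure will differ from the table.

```shell
# make_sample N FILE: write N lines, each containing "hello", to FILE.
# Hypothetical helper, named here only for illustration.
make_sample() {
    awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) print "hello" }' > "$2"
}

# Build a 100,000-line file and confirm its size.
make_sample 100000 somefile
wc -l < somefile
```

The two timing commands shown earlier can then be run against somefile, repeating with larger values of N to see how each method scales.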


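Since awk itself can do the calculations and format the output, processing the whole file with a single awk script, rather than your combination of bash and multiple awk calls per line, makes the cost of creating all those processes, and the line-by-line delays, go away.

This script would be a good first attempt; let's call it prog.awk: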

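BEGIN {
    FMT = "%.2f"    # two decimal places for the recalculated prices
    OFS = FS        # rebuild modified records with the separator given by -F
}
{
    # Capture the original field values before any of them are overwritten.
    isOneWay=$7
    priceOutbound=$30
    priceExc=$25
    tax=$27
    priceInc=$26

    # Only records whose field 7 is 0 are recalculated.
    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }
    print           # every record is printed, modified or not
}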

You then run that single awk script with:

awk -F'","' -f prog.awk data.txt
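For a quick check that does not need the full 2GB input, the sample record from the question can be piped through the same logic. The sketch below inlines the program text in a shell variable so that it is self-contained; the names prog and fields_25_26 are illustrative and not part of the original answer.

```shell
# Sample record from the question, single-quoted so the inner double quotes survive.
line='"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"'

# Same logic as prog.awk, kept in a variable instead of a file.
prog='
BEGIN { FMT = "%.2f"; OFS = FS }
{
    isOneWay = $7; priceOutbound = $30
    priceExc = $25; tax = $27; priceInc = $26
    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }
    print
}'

# Illustrative helper: print only fields 25 and 26 of each record on stdin.
fields_25_26() {
    awk -F'","' '{ printf "%s %s\n", $25, $26 }'
}

before=$(printf '%s\n' "$line" | fields_25_26)
after=$(printf '%s\n' "$line" | awk -F'","' "$prog" | fields_25_26)

echo "before: $before"   # before: 146.00 222.26
echo "after:  $after"    # after:  100.50 138.63
```

Because field 7 is 0 and field 30 (100.50) is greater than 0, field 25 becomes 100.50 and field 26 becomes 100.50 + 76.26 / 2 = 138.63.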

Using the test data you provided, here are the before and after lines, with markers for field numbers 25 and 26:

                                                                                                                                                                                      <-25->   <-26->
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"