I have these functions to process a 2GB text file. I've split it into six parts for simultaneous processing, but it still takes over 4 hours.
What else can I try to make the script faster?
Some details:
Sample data:
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
Script:
read2col()
{
    is_one_way=$(echo "$line" | awk -F'","' '{print $7}')
    price_outbound=$(echo "$line" | awk -F'","' '{print $30}')
    price_exc=$(echo "$line" | awk -F'","' '{print $25}')
    tax=$(echo "$line" | awk -F'","' '{print $27}')
    price_inc=$(echo "$line" | awk -F'","' '{print $26}')
}
#################################################
#for each line in the csv
mainf()
{
    cd $infarepath

    while read -r line; do
        #read the value of csv fields into variables
        read2col
        if [[ $is_one_way == 0 ]]; then
            if [[ $price_outbound > 0 ]]; then
                #calculate price inc and print the entire line to txt file
                echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax -F'","' 'BEGIN {OFS = FS} {$25=pout;$26=(pout+(tax / 2)); print}' >>"$csvsplitfile".tmp
            else
                #divide price exc and inc by 2 if price outbound is not greater than 0
                echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2);$26=(pinc /2); print}' >>"$csvsplitfile".tmp
            fi
        else
            echo $line >>"$csvsplitfile".tmp
        fi
    done < $csvsplitfile
}
Answer (score: 11)
The first thing you should do is stop invoking six sub-shells to run awk for every single line of input. Let's do some quick, back-of-the-envelope calculations.

Assuming your input lines are about 292 characters (as per your sample), a 2G file will consist of a little over 7.3 million lines. That means you are starting and stopping a whopping forty-four million processes.
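You can sanity-check that arithmetic with bash itself; a minimal sketch (integer maths, so the figures are approximate, and the factor of six counts the five awk calls in read2col plus the one in mainf):

echo $(( 2 * 1024 * 1024 * 1024 / 292 ))       # roughly 7.35 million lines in 2GiB
echo $(( 2 * 1024 * 1024 * 1024 / 292 * 6 ))   # roughly 44 million awk invocations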
And, while Linux handles fork and exec as efficiently as it can, it's not without cost:
pax$ time for i in {1..44000000} ; do true ; done
real 1m0.946s
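Note that true there is a bash builtin, so the loop above never actually forks; it is effectively a lower bound. Forcing a real fork and exec on each iteration by calling the external binary instead (available as /bin/true on most Linux systems) is slower by orders of magnitude, so try it with a much smaller count:

pax$ time for i in {1..10000} ; do /bin/true ; done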
In addition, bash hasn't really been optimised for this sort of processing; its design results in sub-optimal behaviour for this specific use case. For details, see this excellent answer over on one of our sister sites.
An analysis of the two methods of file processing is shown below: one program reading an entire file (each line being just hello), versus bash reading the file a line at a time. The two commands used to get the timings were:
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )
For varying file sizes (user+sys time, averaged over a few runs), the results are quite interesting:
# of lines cat-method while-method
---------- ---------- ------------
1,000 0.375s 0.031s
10,000 0.391s 0.234s
100,000 0.406s 1.994s
1,000,000 0.391s 19.844s
10,000,000 0.375s 205.583s
44,000,000 0.453s 889.402s
From that, it appears the while method can hold its own for smaller data sets, but it really does not scale well.
Since awk can do the calculations and the output formatting itself, replacing your bash / multi-awk-per-line combination with a single awk script makes the cost of creating all those processes, and the line-based delays, go away.
This script would be a good first attempt; let's call it prog.awk:
BEGIN {
    FMT = "%.2f"    # two-decimal format for the recalculated prices
    OFS = FS        # keep the same "," separator on output
}
{
    # the same five fields read2col extracted, but with no sub-shells
    isOneWay      = $7
    priceOutbound = $30
    priceExc      = $25
    tax           = $27
    priceInc      = $26

    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            # use the outbound price, adding half the tax for the inc price
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            # otherwise halve both the exc and inc prices
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }
    print
}
You then simply run that single awk script with:
awk -F'","' -f prog.awk data.txt
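If you want the result in a file rather than on standard output, just redirect it (the output file name here is purely an example):

awk -F'","' -f prog.awk data.txt > data_fixed.txt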
Using your test data, here's the before and after, with markers above field numbers 25 and 26:
                                                                                                                                                                                      <-25->   <-26->
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"