Optimizing an awk script for large files

Time: 2018-08-03 19:43:15

Tags: bash performance csv optimization awk

TL;DR version

How can I optimize the following script to process large files faster?

  • ./Processed/Ranked_endpoints_*.csv: roughly 1,200 lines each
  • Dates.csv: roughly 1 million lines

A working shell script (I think? It's still running...):

for RFGAUGE in ./Processed/Ranked_endpoints_*.csv
do
    echo "Processing: $RFGAUGE"
    mkIndex="$(basename "$RFGAUGE")"
    echo "$mkIndex"

    ##Duplicate the dates file to be able to edit it in place later on
    cp Dates.csv Timed_${mkIndex%.*}.csv

    ##Remove header
    tail -n +2 "$RFGAUGE" > noHead_${mkIndex%.*}.csv

    ##The slow part, go line by line and find if the time matches an event and copy to additional columns
    while read LINE
    do
        ##Assign variables from columns and replace , with space
        read RankDUR RankRF RankPEAK RankAVG END START <<< `echo $LINE | cut -d, -f9,10,11,13,14,15 | sed 's/,/ /g'`

        ##Tried using sed and line numbers, failed 
        #STARTLINE=`grep -nr $START Dates.csv | cut -d: -f1`
        #ENDLINE=`grep -nr $END Dates.csv | cut -d: -f1`

        ##Gawk only so can edit file in place
        ##Assigning AWK variables from UNIX variables
        gawk -i inplace -v start="$START" -v end="$END" -v rankdur="$RankDUR" -v rankrf="$RankRF" -v rankpeak="$RankPEAK" -v rankavg="$RankAVG" 'BEGIN{FS=OFS=","}{if($2>=start && $2<=end) print $0,rankdur,rankrf,rankpeak,rankavg; else print $0}' Timed_${mkIndex%.*}.csv

    done < noHead_${mkIndex%.*}.csv

    rm noHead_${mkIndex%.*}.csv

done

Long version

I'm trying to rank the worst rainfall events based on a few metrics. The problem with the data is that rainfall events don't start/stop at exactly the same time and are often offset from one another by a few hours.

I've already written a script that extracts what could be called "events" from many years of per-gauge data and then ranks different parameters of those events. An example of what I currently have:

./Processed/Ranked_endpoints_*.csv

Date,D,M,Y,WOY, Duration (h),Total RF (mm),Max RF (mm),Rank Duration,Rank Total RF,Rank Max RF,AVG Rank,Rank AVG,EndTime EPOCH, StartTime EPOCH
04/12/2010 05:15:00,4,11,2010,48,7.0,22.599999999999994,8.2,71,39,12,40.6667,1,1291439700,1291414500
17/12/2004 08:00:00,17,11,2004,50,6.5,32.6,5.0,89,12,40,47,2,1103270400,1103247000
25/08/2010 18:00:00,25,7,2010,34,6.5,28.6,4.8,83,20,46,49.6667,3,1282759200,1282735800
...

The important columns in the CSV above are:

  • Columns 9, 10, 11 and 13 - rankings of the different parameters
  • Column 14 - event end time (epoch seconds)
  • Column 15 - event start time (epoch seconds)

I've also created a 15-minute date/time CSV containing the date/time and the corresponding epoch seconds, in a format similar to the one I used to extract the "event" data:

Dates.csv

...
03/12/2010 21:45:00,1291412700 
03/12/2010 22:00:00,1291413600 
03/12/2010 22:15:00,1291414500 
03/12/2010 22:30:00,1291415400 
03/12/2010 22:45:00,1291416300 
03/12/2010 23:00:00,1291417200 
03/12/2010 23:15:00,1291418100 
03/12/2010 23:30:00,1291419000 
03/12/2010 23:45:00,1291419900 
04/12/2010 00:00:00,1291420800 
04/12/2010 00:15:00,1291421700 
04/12/2010 00:30:00,1291422600 
04/12/2010 00:45:00,1291423500 
04/12/2010 01:00:00,1291424400 
04/12/2010 01:15:00,1291425300 
04/12/2010 01:30:00,1291426200 
04/12/2010 01:45:00,1291427100 
04/12/2010 02:00:00,1291428000 
04/12/2010 02:15:00,1291428900 
04/12/2010 02:30:00,1291429800 
04/12/2010 02:45:00,1291430700 
04/12/2010 03:00:00,1291431600 
04/12/2010 03:15:00,1291432500 
04/12/2010 03:30:00,1291433400 
04/12/2010 03:45:00,1291434300 
04/12/2010 04:00:00,1291435200 
04/12/2010 04:15:00,1291436100 
04/12/2010 04:30:00,1291437000 
04/12/2010 04:45:00,1291437900 
04/12/2010 05:00:00,1291438800 
04/12/2010 05:15:00,1291439700 
04/12/2010 05:30:00,1291440600
...

Considering that I have roughly 20 years of 15-minute data per gauge, and potentially many more gauges, what would be the best way to copy columns 9, 10, 11 and 13 into Dates.csv whenever the time matches one of the "events"? The current script above doesn't merge the different gauges into one CSV, but that's easy enough to cut/paste together.

So the final output would look something like the following, assuming the rainfall reached gauge 2 an hour after gauge 1 and lasted an hour less:

03/12/2010 22:00:00,1291413600
03/12/2010 22:15:00,1291414500 ,71,39,12,1
03/12/2010 22:30:00,1291415400 ,71,39,12,1
03/12/2010 22:45:00,1291416300 ,71,39,12,1
03/12/2010 23:00:00,1291417200 ,71,39,12,1
03/12/2010 23:15:00,1291418100 ,71,39,12,1,13,25,35,4
03/12/2010 23:30:00,1291419000 ,71,39,12,1,13,25,35,4
...
04/12/2010 05:00:00,1291438800 ,71,39,12,1,13,25,35,4
04/12/2010 05:15:00,1291439700 ,71,39,12,1,13,25,35,4
04/12/2010 05:30:00,1291440600

1 Answer:

Answer 0 (score: 3)

It sounds like what you might want to do is run this (using GNU awk for true multidimensional arrays and sorted array traversal):

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == 1 { next }                       # skip the header line
{
    ranks    = $9 OFS $10 OFS $11 OFS $13   # the four rank columns
    endEpoch = $14                           # event end time (epoch seconds)
    begEpoch = $15                           # event start time (epoch seconds)

    # step from the event's start to its end in 15-minute (900-second) increments
    for ( epoch=begEpoch; epoch<=endEpoch; epoch+=(15*60) ) {
        epoch2ranks[epoch][++numRanks[epoch]] = ranks
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # visit the epochs in ascending numeric order
    for ( epoch in epoch2ranks ) {
        printf "%s", epoch
        for ( rankNr in epoch2ranks[epoch] ) {
            ranks = epoch2ranks[epoch][rankNr]
            printf "%s%s", OFS, ranks
        }
        print ""
    }
}

Run it as:

$ awk -f tst.awk Ranked_endpoints_*.csv

and then use the UNIX tool join to combine its output with Dates.csv.

FWIW, given the input you provided in your question:

$ cat file
Date,D,M,Y,WOY, Duration (h),Total RF (mm),Max RF (mm),Rank Duration,Rank Total RF,Rank Max RF,AVG Rank,Rank AVG,EndTime EPOCH, StartTime EPOCH
04/12/2010 05:15:00,4,11,2010,48,7.0,22.599999999999994,8.2,71,39,12,40.6667,1,1291439700,1291414500
17/12/2004 08:00:00,17,11,2004,50,6.5,32.6,5.0,89,12,40,47,2,1103270400,1103247000
25/08/2010 18:00:00,25,7,2010,34,6.5,28.6,4.8,83,20,46,49.6667,3,1282759200,1282735800

it would produce this output:

$ awk -f tst.awk file
1103247000,89,12,40,2
1103247900,89,12,40,2
1103248800,89,12,40,2
1103249700,89,12,40,2
1103250600,89,12,40,2
1103251500,89,12,40,2
1103252400,89,12,40,2
1103253300,89,12,40,2
1103254200,89,12,40,2
1103255100,89,12,40,2
1103256000,89,12,40,2
1103256900,89,12,40,2
1103257800,89,12,40,2
1103258700,89,12,40,2
1103259600,89,12,40,2
1103260500,89,12,40,2
1103261400,89,12,40,2
1103262300,89,12,40,2
1103263200,89,12,40,2
1103264100,89,12,40,2
1103265000,89,12,40,2
1103265900,89,12,40,2
1103266800,89,12,40,2
1103267700,89,12,40,2
1103268600,89,12,40,2
1103269500,89,12,40,2
1103270400,89,12,40,2
1282735800,83,20,46,3
1282736700,83,20,46,3
1282737600,83,20,46,3
1282738500,83,20,46,3
1282739400,83,20,46,3
1282740300,83,20,46,3
1282741200,83,20,46,3
1282742100,83,20,46,3
1282743000,83,20,46,3
1282743900,83,20,46,3
1282744800,83,20,46,3
1282745700,83,20,46,3
1282746600,83,20,46,3
1282747500,83,20,46,3
1282748400,83,20,46,3
1282749300,83,20,46,3
1282750200,83,20,46,3
1282751100,83,20,46,3
1282752000,83,20,46,3
1282752900,83,20,46,3
1282753800,83,20,46,3
1282754700,83,20,46,3
1282755600,83,20,46,3
1282756500,83,20,46,3
1282757400,83,20,46,3
1282758300,83,20,46,3
1282759200,83,20,46,3
1291414500,71,39,12,1
1291415400,71,39,12,1
1291416300,71,39,12,1
1291417200,71,39,12,1
1291418100,71,39,12,1
1291419000,71,39,12,1
1291419900,71,39,12,1
1291420800,71,39,12,1
1291421700,71,39,12,1
1291422600,71,39,12,1
1291423500,71,39,12,1
1291424400,71,39,12,1
1291425300,71,39,12,1
1291426200,71,39,12,1
1291427100,71,39,12,1
1291428000,71,39,12,1
1291428900,71,39,12,1
1291429800,71,39,12,1
1291430700,71,39,12,1
1291431600,71,39,12,1
1291432500,71,39,12,1
1291433400,71,39,12,1
1291434300,71,39,12,1
1291435200,71,39,12,1
1291436100,71,39,12,1
1291437000,71,39,12,1
1291437900,71,39,12,1
1291438800,71,39,12,1
1291439700,71,39,12,1

But idk if that's what you actually want, since the sample output in your question doesn't seem to match the sample input. If it is, then just run join using the 2nd field of Dates.csv and the 1st field of the output above as the fields to match on, with comma as the field separator.
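
For example, here is one minimal sketch of that join step. It assumes the awk output is saved to a hypothetical file named ranks.csv, that both files are already sorted on the epoch field (the samples above are), and that the epoch keys match exactly (the trailing space after the epoch shown in the Dates.csv sample would need to be stripped first):

$ awk -f tst.awk Ranked_endpoints_*.csv > ranks.csv
$ # -t,  : fields are comma-separated
$ # -1 2 : join on field 2 of Dates.csv (the epoch)
$ # -2 1 : join on field 1 of ranks.csv (the epoch)
$ # -a 1 : also print Dates.csv lines that have no matching event
$ join -t, -1 2 -2 1 -a 1 Dates.csv ranks.csv

Note that join prints the join field (the epoch) first, so the date/time moves to the second column, and Dates.csv lines with no matching event are printed with just those two fields.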