How can I optimize the following script to process large files faster?
./Processed/Ranked_endpoints_*.csv: ~1200 rows each
Dates.csv: ~1,000,000 rows

Working shell script (I think? It's still running...):
for RFGAUGE in ./Processed/Ranked_endpoints_*.csv
do
    echo "Processing: $RFGAUGE"
    mkIndex="$(basename "$RFGAUGE")"
    echo "$mkIndex"
    ## Duplicate the dates file so it can be edited in place later on
    cp Dates.csv "Timed_${mkIndex%.*}.csv"
    ## Remove header
    tail -n +2 "$RFGAUGE" > "noHead_${mkIndex%.*}.csv"
    ## The slow part: go line by line, check whether the time matches an event, and copy to additional columns
    while read -r LINE
    do
        ## Assign variables from columns, replacing , with space
        read -r RankDUR RankRF RankPEAK RankAVG END START <<< "$(echo "$LINE" | cut -d, -f9,10,11,13,14,15 | sed 's/,/ /g')"
        ## Tried using sed and line numbers, failed
        #STARTLINE=$(grep -nr "$START" Dates.csv | cut -d: -f1)
        #ENDLINE=$(grep -nr "$END" Dates.csv | cut -d: -f1)
        ## Gawk only, so the file can be edited in place
        ## Passing shell variables in as awk variables
        gawk -i inplace -v start="$START" -v end="$END" -v rankdur="$RankDUR" -v rankrf="$RankRF" -v rankpeak="$RankPEAK" -v rankavg="$RankAVG" 'BEGIN{FS=OFS=","}{if($2>=start && $2<=end) print $0,rankdur,rankrf,rankpeak,rankavg; else print $0}' "Timed_${mkIndex%.*}.csv"
    done < "noHead_${mkIndex%.*}.csv"
    rm "noHead_${mkIndex%.*}.csv"
done
I'm trying to rank the worst rainfall events based on a few metrics. The problem with the data is that rainfall events don't start/stop at exactly the same time and are often offset from one another by a few hours.

I've already written a script that pulls what could be called "events" out of many years of per-gauge data, and then ranks the events on their different parameters. An example of what I currently have:
./Processed/Ranked_endpoints_*.csv
Date,D,M,Y,WOY, Duration (h),Total RF (mm),Max RF (mm),Rank Duration,Rank Total RF,Rank Max RF,AVG Rank,Rank AVG,EndTime EPOCH, StartTime EPOCH
04/12/2010 05:15:00,4,11,2010,48,7.0,22.599999999999994,8.2,71,39,12,40.6667,1,1291439700,1291414500
17/12/2004 08:00:00,17,11,2004,50,6.5,32.6,5.0,89,12,40,47,2,1103270400,1103247000
25/08/2010 18:00:00,25,7,2010,34,6.5,28.6,4.8,83,20,46,49.6667,3,1282759200,1282735800
...
The important columns in the CSV above are 9, 10, 11 and 13 (the rank columns) and 14 and 15 (the event end/start times as seconds since the epoch).

I've also created a 15-minute date/time CSV containing the date and the time as seconds since the epoch, similar in format to what I used to extract the "event" data:
Dates.csv
...
03/12/2010 21:45:00,1291412700
03/12/2010 22:00:00,1291413600
03/12/2010 22:15:00,1291414500
03/12/2010 22:30:00,1291415400
03/12/2010 22:45:00,1291416300
03/12/2010 23:00:00,1291417200
03/12/2010 23:15:00,1291418100
03/12/2010 23:30:00,1291419000
03/12/2010 23:45:00,1291419900
04/12/2010 00:00:00,1291420800
04/12/2010 00:15:00,1291421700
04/12/2010 00:30:00,1291422600
04/12/2010 00:45:00,1291423500
04/12/2010 01:00:00,1291424400
04/12/2010 01:15:00,1291425300
04/12/2010 01:30:00,1291426200
04/12/2010 01:45:00,1291427100
04/12/2010 02:00:00,1291428000
04/12/2010 02:15:00,1291428900
04/12/2010 02:30:00,1291429800
04/12/2010 02:45:00,1291430700
04/12/2010 03:00:00,1291431600
04/12/2010 03:15:00,1291432500
04/12/2010 03:30:00,1291433400
04/12/2010 03:45:00,1291434300
04/12/2010 04:00:00,1291435200
04/12/2010 04:15:00,1291436100
04/12/2010 04:30:00,1291437000
04/12/2010 04:45:00,1291437900
04/12/2010 05:00:00,1291438800
04/12/2010 05:15:00,1291439700
04/12/2010 05:30:00,1291440600
...
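Incidentally, a Dates.csv like this doesn't have to be built by hand. Here is a minimal sketch that reproduces the excerpt above with GNU awk's strftime(), assuming the timestamps are UTC (the epoch range used is just the excerpt's):

$ gawk 'BEGIN {
    for (t = 1291412700; t <= 1291440600; t += 15*60)
        print strftime("%d/%m/%Y %H:%M:%S", t, 1) "," t
}'

The third argument to strftime() requests UTC formatting, so the output doesn't jump around at DST changes.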
Considering I have roughly 20 years of 15-minute data per gauge, and potentially many gauges: what would be the best way to copy columns 9, 10, 11 and 13 over to Dates.csv whenever the time matches one of the "events"? The current script above doesn't merge the different gauges into one CSV, but cutting/pasting them together is easy.

So the final output would look something like the following, assuming the rain hit gauge 2 an hour after gauge 1 and lasted an hour less:
03/12/2010 22:00:00,1291413600
03/12/2010 22:15:00,1291414500,71,39,12,1
03/12/2010 22:30:00,1291415400,71,39,12,1
03/12/2010 22:45:00,1291416300,71,39,12,1
03/12/2010 23:00:00,1291417200,71,39,12,1
03/12/2010 23:15:00,1291418100,71,39,12,1,13,25,35,4
03/12/2010 23:30:00,1291419000,71,39,12,1,13,25,35,4
...
04/12/2010 05:00:00,1291438800,71,39,12,1,13,25,35,4
04/12/2010 05:15:00,1291439700,71,39,12,1,13,25,35,4
04/12/2010 05:30:00,1291440600
Answer (score: 3):
It sounds like what you might want to do is run this (uses GNU awk for true multi-dimensional arrays and sorted array traversal):
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR == 1 { next }                            # skip each input file's header line
{
    ranks    = $9 OFS $10 OFS $11 OFS $13    # the four rank columns
    endEpoch = $14
    begEpoch = $15
    # index the ranks under every 15-minute timestamp the event spans
    for ( epoch=begEpoch; epoch<=endEpoch; epoch+=(15*60) ) {
        epoch2ranks[epoch][++numRanks[epoch]] = ranks
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # traverse in ascending numeric epoch order
    for ( epoch in epoch2ranks ) {
        printf "%s", epoch
        for ( rankNr in epoch2ranks[epoch] ) {
            ranks = epoch2ranks[epoch][rankNr]
            printf "%s%s", OFS, ranks
        }
        print ""
    }
}
which you'd run as:

$ awk -f tst.awk Ranked_endpoints_*.csv

and then join its output with Dates.csv using the UNIX tool join.
FWIW, given the input you provided in the question:
$ cat file
Date,D,M,Y,WOY, Duration (h),Total RF (mm),Max RF (mm),Rank Duration,Rank Total RF,Rank Max RF,AVG Rank,Rank AVG,EndTime EPOCH, StartTime EPOCH
04/12/2010 05:15:00,4,11,2010,48,7.0,22.599999999999994,8.2,71,39,12,40.6667,1,1291439700,1291414500
17/12/2004 08:00:00,17,11,2004,50,6.5,32.6,5.0,89,12,40,47,2,1103270400,1103247000
25/08/2010 18:00:00,25,7,2010,34,6.5,28.6,4.8,83,20,46,49.6667,3,1282759200,1282735800
it would produce the following output:
$ awk -f tst.awk file
1103247000,89,12,40,2
1103247900,89,12,40,2
1103248800,89,12,40,2
1103249700,89,12,40,2
1103250600,89,12,40,2
1103251500,89,12,40,2
1103252400,89,12,40,2
1103253300,89,12,40,2
1103254200,89,12,40,2
1103255100,89,12,40,2
1103256000,89,12,40,2
1103256900,89,12,40,2
1103257800,89,12,40,2
1103258700,89,12,40,2
1103259600,89,12,40,2
1103260500,89,12,40,2
1103261400,89,12,40,2
1103262300,89,12,40,2
1103263200,89,12,40,2
1103264100,89,12,40,2
1103265000,89,12,40,2
1103265900,89,12,40,2
1103266800,89,12,40,2
1103267700,89,12,40,2
1103268600,89,12,40,2
1103269500,89,12,40,2
1103270400,89,12,40,2
1282735800,83,20,46,3
1282736700,83,20,46,3
1282737600,83,20,46,3
1282738500,83,20,46,3
1282739400,83,20,46,3
1282740300,83,20,46,3
1282741200,83,20,46,3
1282742100,83,20,46,3
1282743000,83,20,46,3
1282743900,83,20,46,3
1282744800,83,20,46,3
1282745700,83,20,46,3
1282746600,83,20,46,3
1282747500,83,20,46,3
1282748400,83,20,46,3
1282749300,83,20,46,3
1282750200,83,20,46,3
1282751100,83,20,46,3
1282752000,83,20,46,3
1282752900,83,20,46,3
1282753800,83,20,46,3
1282754700,83,20,46,3
1282755600,83,20,46,3
1282756500,83,20,46,3
1282757400,83,20,46,3
1282758300,83,20,46,3
1282759200,83,20,46,3
1291414500,71,39,12,1
1291415400,71,39,12,1
1291416300,71,39,12,1
1291417200,71,39,12,1
1291418100,71,39,12,1
1291419000,71,39,12,1
1291419900,71,39,12,1
1291420800,71,39,12,1
1291421700,71,39,12,1
1291422600,71,39,12,1
1291423500,71,39,12,1
1291424400,71,39,12,1
1291425300,71,39,12,1
1291426200,71,39,12,1
1291427100,71,39,12,1
1291428000,71,39,12,1
1291428900,71,39,12,1
1291429800,71,39,12,1
1291430700,71,39,12,1
1291431600,71,39,12,1
1291432500,71,39,12,1
1291433400,71,39,12,1
1291434300,71,39,12,1
1291435200,71,39,12,1
1291436100,71,39,12,1
1291437000,71,39,12,1
1291437900,71,39,12,1
1291438800,71,39,12,1
1291439700,71,39,12,1
idk if that's what you want though, since the expected output in your question doesn't seem to match the sample input. If it is, then just run join using the 2nd field of Dates.csv and the 1st field of the above output as the fields to match on, with comma as the field separator.
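A minimal sketch of that final step, assuming GNU join and using ranks.csv as a placeholder name for the awk output:

$ awk -f tst.awk Ranked_endpoints_*.csv > ranks.csv
$ join -t, -1 2 -2 1 -a 1 Dates.csv ranks.csv

Here -a 1 keeps the Dates.csv lines that match no event, and join prints the join field first, so matched lines come out as epoch,date,ranks... rather than date,epoch,ranks...; swapping those two leading fields back is a one-liner in awk. Also note that join expects both inputs to be sorted the same way on the join field: the awk output is in ascending numeric epoch order, which matches join's lexical ordering only while every epoch has the same number of digits (true for dates after Sep 2001), so sort accordingly if your record goes back further.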