Analyzing a CSV file with awk

Time: 2019-04-02 14:52:24

Tags: linux bash shell csv awk

I need to analyze a CSV file against another CSV file that contains some "keywords".

The CSV file I need to analyze looks like this (X rows, 17 comma-separated fields per line):

NO1,NF_B000014,81920,23290,28,20480,22211,108,-16,0,100000000,none,online,fds,45501,none,Alo1,
NO1,N_000000,81920,63,0,20480,68,0,0,0,2464966,none,online,fds,131,none,Alo1,
NO1,NO_VM31_GRERW0I2_B000002,203162,87,0,50790,142,0,0,0,100000000,none,online,fds,229,none,Alo1,
NO1,NF_B000014,81920,23290,28,20480,22211,108,-16,0,100000000,none,online,fds,45501,none,Alo1,
NO1,NG_VM31_B000001,2347,54,0,432,69,0,0,0,4397642,none,online,fds,98,none,Alo2,
NO1,NG_VM31_B000001,2342,61,0,64532,69,0,0,0,2346,none,online,fds,90,none,Alo2,
NO1,NG_VM31_B000001,78692,61,0,432,69,0,0,0,23498765,none,online,fds,23,none,Alo2,
NO1,NG_VM31_B000001,98725,61,0,2080357,69,0,0,0,98643,none,online,fds,4330,none,Alo2,
NO1,NG_VM31_B000001,2351,61,0,3424,69,0,0,0,5673,none,online,fds,43,none,Alo2,
NO1,NL_098JD,51551,7,0,234,31,1,0,0,100000001,none,online,fds,99,none,Alo3,
NO1,NL_098JD,5145622,7,0,542,31,1,0,0,100000002,none,online,fds,99,none,Alo3,
NO1,NL_098JD,5123453,7,0,2714,31,1,0,0,100000003,none,online,fds,99,none,Alo3,
NO1,NL_098JD,51454,7,0,8567,31,1,0,0,100000004,none,online,fds,38,none,Alo3,
NO1,NL_098JD,515,7,0,532,31,1,0,0,100000005,none,online,fds,31,none,Alo3,
NO1,NL_098JD,51554,7,0,9347,31,1,0,0,100000006,none,online,fds,3812,none,Alo3,
NO1,NV_IUDS,19873,234,0,543,14,3,0,0,9869324,none,online,fds,54,none,Alo4,
NO1,NV_IUDS,32981,654,0,543,14,3,0,0,2346,none,online,fds,57,none,Alo4,
NO1,NV_IUDS,123554,634,0,543,14,3,0,0,2347642,none,online,fds,86,none,Alo4,
NO1,NV_IUDS,124432,846,0,543,14,3,0,0,1434326,none,online,fds,12,none,Alo4,
NO1,NV_IUDS,234531,402,0,543,14,3,0,0,234645234,none,online,fds,62,none,Alo4,
NO1,NJ_000004,305562,57467,19,76390,102,0,0,0,100000000,none,online,oiu,57569,none,NA,
NO1,NK_O09AE8,421888,221682,53,105472,200,0,0,0,100000000,none,online,oiu,12345,none,NA,
NO1,NK_O09AE8,42188,221682,53,105472,200,0,0,0,100000000,none,online,oiu,221882,none,NA,
NO1,NK_O09AE8,421488,221682,53,105472,200,0,0,0,100000000,none,online,oiu,4325,none,NA,
NO1,NK_O09AE8,421845,221682,53,105472,200,0,0,0,100000000,none,online,oiu,9877634,none,NA,
NO1,NK_O09AE8,421234,221682,53,105472,200,0,0,0,100000000,none,online,oiu,22324882,none,NA,
NO1,NK_O09AE8,421643,221682,53,105472,200,0,0,0,100000000,none,online,oiu,234,none,NA,
NO1,NK_O09AE8,421231,221682,53,105472,200,0,0,0,100000000,none,online,oiu,9834,none,NA,
NO1,NK_O09AE8,421324,221682,53,105472,200,0,0,0,100000000,none,online,oiu,234,none,NA,
NO1,NK_O09AE8,421987,221682,53,105472,200,0,0,0,100000000,none,online,oiu,2345,none,NA,
NO1,NK_O09AE8,42134,221682,53,105472,200,0,0,0,100000000,none,online,oiu,6542,none,NA,
NO1,NF_B000014,81920,23290,28,20480,22211,108,-16,0,100000000,none,online,fds,45501,none,Alo5,
NO1,N_000000,81920,63,0,20480,68,0,0,0,2464966,none,online,fds,131,none,Alo5,
NO1,NO_VM31_GRERW0I2_B000002,203162,87,0,50790,142,0,0,0,100000000,none,online,fds,229,none,Alo5,
NO1,NF_B000014,81920,23290,28,20480,22211,108,-16,0,100000000,none,online,fds,45501,none,Alo5,
NO1,NG_VM31_B000001,2347,54,0,432,69,0,0,0,4397642,none,online,fds,98,none,Alo6,
NO1,NG_VM31_B000001,2342,61,0,64532,69,0,0,0,2346,none,online,fds,90,none,Alo6,
NO1,NG_VM31_B000001,78692,61,0,432,69,0,0,0,23498765,none,online,fds,23,none,Alo6,
NO1,NG_VM31_B000001,98725,61,0,2080357,69,0,0,0,98643,none,online,fds,4330,none,Alo6,
NO1,NG_VM31_B000001,2351,61,0,3424,69,0,0,0,5673,none,online,fds,43,none,Alo6,
NO1,NL_098JD,51551,7,0,234,31,1,0,0,100000001,none,online,fds,99,none,Alo7,
NO1,NL_098JD,5145622,7,0,542,31,1,0,0,100000002,none,online,fds,99,none,Alo7,
NO1,NL_098JD,5123453,7,0,2714,31,1,0,0,100000003,none,online,fds,99,none,Alo7,
NO1,NL_098JD,51454,7,0,8567,31,1,0,0,100000004,none,online,fds,38,none,Alo7,
NO1,NL_098JD,515,7,0,532,31,1,0,0,100000005,none,online,fds,31,none,Alo7,
NO1,NL_098JD,51554,7,0,9347,31,1,0,0,100000006,none,online,fds,3812,none,Alo7,
NO1,NV_IUDS,19873,234,0,543,14,3,0,0,9869324,none,online,fds,54,none,Alo8,
NO1,NV_IUDS,32981,654,0,543,14,3,0,0,2346,none,online,fds,57,none,Alo8,
NO1,NV_IUDS,123554,634,0,543,14,3,0,0,2347642,none,online,fds,86,none,Alo8,
NO1,NV_IUDS,124432,846,0,543,14,3,0,0,1434326,none,online,fds,12,none,Alo8,
NO1,NV_IUDS,234531,402,0,543,14,3,0,0,234645234,none,online,fds,62,none,Alo8,
NO1,NJ_000004,305562,57467,19,76390,102,0,0,0,100000000,none,online,oiu,57569,none,NA,
NO1,NK_O09AE8,421888,221682,53,105472,200,0,0,0,100000000,none,online,oiu,12345,none,NA,
NO1,NK_O09AE8,42188,221682,53,105472,200,0,0,0,100000000,none,online,oiu,221882,none,NA,
NO1,NK_O09AE8,421488,221682,53,105472,200,0,0,0,100000000,none,online,oiu,4325,none,NA,
NO1,NK_O09AE8,421845,221682,53,105472,200,0,0,0,100000000,none,online,oiu,9877634,none,NA,
NO1,NK_O09AE8,421234,221682,53,105472,200,0,0,0,100000000,none,online,oiu,22324882,none,NA,
NO1,NK_O09AE8,421643,221682,53,105472,200,0,0,0,100000000,none,online,oiu,234,none,NA,
NO1,NK_O09AE8,421231,221682,53,105472,200,0,0,0,100000000,none,online,oiu,9834,none,NA,
NO1,NK_O09AE8,421324,221682,53,105472,200,0,0,0,100000000,none,online,oiu,234,none,NA,
NO1,NK_O09AE8,421987,221682,53,105472,200,0,0,0,100000000,none,online,oiu,2345,none,NA,
NO1,NK_O09AE8,42134,221682,53,105472,200,0,0,0,100000000,none,online,oiu,6542,none,NA,

The CSV file with the "keywords" looks like this:

Alo1
Alo2
Alo3
Alo4

I need to develop a script that extracts every line of the data CSV file in which one of the words from the keywords CSV file appears.

I can do this with the following script:

while read jour
do
        grep -wf "$1" "$2" | awk -F',' '{if(f!=$1)print"\n"; f=$1; print $0;}' | awk -F',' '{print $1","$2","$15","$17}' > test1.csv

done <"$1"

The result is:

NO1,NF_B000014,45501,Alo1
NO1,N_000000,131,Alo1
NO1,NO_VM31_GRERW0I2_B000002,229,Alo1
NO1,NF_B000014,45501,Alo1
NO1,NG_VM31_B000001,98,Alo2
NO1,NG_VM31_B000001,90,Alo2
NO1,NG_VM31_B000001,23,Alo2
NO1,NG_VM31_B000001,4330,Alo2
NO1,NG_VM31_B000001,43,Alo2
NO1,NL_098JD,99,Alo3
NO1,NL_098JD,99,Alo3
NO1,NL_098JD,99,Alo3
NO1,NL_098JD,38,Alo3
NO1,NL_098JD,31,Alo3
NO1,NL_098JD,3812,Alo3
NO1,NV_IUDS,54,Alo4
NO1,NV_IUDS,57,Alo4
NO1,NV_IUDS,86,Alo4
NO1,NV_IUDS,12,Alo4
NO1,NV_IUDS,62,Alo4

Can you show me how to do this with awk, using NR, FNR, etc.?

Thanks!

2 answers:

Answer 0 (score: 1)

It sounds like this is what you are trying to do:

$ awk -F', *' 'NR==FNR{words[$1];next} $NF in words' words file
ANA1,#DEFF24,99460, Alo1
ANA1,#DEFF43,15654,Alo1
ANA1,VM30_0009,587290,Alo4
ANA1,#DEFF29,99RS0, Alo2
ANA1,#DEFF43,18NCSO,Alo3
ANA1,VEZK_IOP,587290,Alo4
ANA1,#DEFF98,9846, Alo2
ANA1,#DEFF47,3476,Alo3
ANA1,VM323_LOp9,49862,Alo4
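
Note that in the sample data now shown in the question every line ends with a trailing comma, so the keyword sits in the second-to-last field rather than in $NF. A minimal adaptation for that case (a sketch, not part of the original answer) would be:

$ awk -F', *' 'NR==FNR{words[$1];next} $(NF-1) in words' words file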

If that's not what you're after, please edit your question to clarify your requirements and provide more truly representative sample input and expected output.

Answer 1 (score: 0)


"I need to develop a script that extracts all lines of the CSV file in which we can find words present in the keywords CSV file."

So there is no restriction on where the word appears:

$ grep -Fwf keywords.csv file.csv
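
If you also want the four-column output shown in the question (fields 1, 2, 15 and 17), the filtered lines can be piped straight into an awk projection — a sketch reusing the keywords.csv/file.csv names from above:

$ grep -Fwf keywords.csv file.csv | awk -F',' -v OFS=',' '{print $1, $2, $15, $17}' > test1.csv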

After your edit:

The following line in your script:

grep -wf "$1" "$2" | awk -F',' '{if(f!=$1)print"\n"; f=$1; print $0;}' | awk -F',' '{print $1","$2","$15","$17}' > test1.csv

can be replaced with:

awk '### processing the first file (keywords) starts here
     # read the keywords file and store each word as a key of array a
     (NR==FNR){a[$1];next}
     ### processing the second file (data) starts here
     # on the first line of the data file:
     # - set FS and OFS to "," for file 2
     # - re-split $0 with the new FS ($0=$0)
     (FNR==1) { FS=OFS=","; $0=$0 }
     # if the 17th field is in array a:
     # - initialize f on the first match
     # - print the required fields, prefixed with ORS (\n) when $1
     #   changes and with the empty string otherwise
     ($17 in a) { if (c++==0) f=$1
                  print (f != $1 ? ORS : "") $1,$2,$15,$17
                  f=$1
     }' "$1" "$2" > test1.csv

More details on FS, OFS, NR, FNR, and ORS can be found here.
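
Since the question explicitly asks about NR and FNR: a quick, throwaway way to watch the two counters diverge when awk reads two files (not part of the original answer) is:

$ awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' keywords.csv file.csv

NR keeps counting across all input files while FNR restarts at 1 for each file, which is why NR==FNR is only true while the first file is being read.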