Use AWK to extract specific cells from CSV files and sort them according to a predefined order

Date: 2017-07-26 16:19:37

Tags: linux bash csv awk

I have a set of CSV files. For each file I need to:

  • Extract specific cells.
  • Sort them according to a predefined order, which is given in another file.
  • Append the result to a new file (concatenating the results from all files into the same output file).

Example of a file (values1.csv):

Item, avg, max
TT, 3, 5
DD, 3, 6
ZZ, 6, 8
UU, 3, 3
JJ, 1, 5

Example of the predefined order (order.csv). I need the avg values, plus max for some of the items:

DD_avg
ZZ_avg
ZZ_max
TT_avg
TT_max
UU_avg
JJ_avg

Output:

  file_name, DD_avg, ZZ_avg, ZZ_max, TT_avg, TT_max, UU_avg, JJ_avg
  values1.csv, 3, 6, 8, 3, 5, 3, 1
  values2.csv, ...................
  values3.csv, ...................

Is this possible with AWK (or any other Linux command)? My AWK skills are very limited and I do not know how to approach this case. I would appreciate some help and guidance.

Edit: real data

cat values1.csv

item,avg,max
System/CPU/User/percent,4.8,
System/Memory/Used/bytes,57300000000,
System/Filesystem/^data/Used/bytes,859000000,
System/Disk/disk/Reads/count/sec,37.8,730
System/Disk/disk/Writes/Utilization/percent,7.24,
System/Disk/disk/Reads/bytes/sec,849000,42100000
System/Disk/disk/Writes,0.0026,
System/Disk/disk/Writes/bytes/sec,520000,33500000
System/Disk/disk/Writes/count/sec,46.2,903
System/Disk/disk/Utilization/percent,22.4,
System/Disk/disk/Reads/Utilization/percent,15.2,

cat order.csv

System/CPU/User/percent_avg
System/Memory/Used/bytes_avg
System/Filesystem/^data/Used/bytes_avg
System/Disk/disk/Reads/count/sec_avg
System/Disk/disk/Writes/count/sec_avg
System/Disk/disk/Reads/count/sec_max
System/Disk/disk/Writes/count/sec_max
System/Disk/disk/Reads/bytes/sec_avg
System/Disk/disk/Writes/bytes/sec_avg
System/Disk/disk/Writes/Utilization/percent_avg
System/Disk/disk/Reads/Utilization/percent_avg

4 Answers:

Answer 0 (score: 3)

Using GNU awk for ARGIND:

$ cat tst.awk
BEGIN { FS=", *"; OFS=", " }
NR==FNR {
    colNames[++numCols] = $0
    next
}
{
    val[ARGIND,$1"_avg"] = $2
    val[ARGIND,$1"_max"] = $3
}
END {
    printf "file_name"
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", OFS, colNames[colNr]
    }
    print ""
    for (fileNr=2; fileNr<=ARGIND; fileNr++) {
        printf "%s", ARGV[fileNr]
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", OFS, val[fileNr,colNames[colNr]]
        }
        print ""
    }
}

$ gawk -f tst.awk order.csv values1.csv
file_name, DD_avg, ZZ_avg, ZZ_max, TT_avg, TT_max, UU_avg, JJ_avg
values1.csv, 3, 6, 8, 3, 5, 3, 1

With other awks, just add a line FNR==1{++ARGIND} after the BEGIN line. If memory is an issue, you can use gawk's ENDFILE statement instead of END, and there are other options as well; let us know if that is a concern.
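For illustration, here is a minimal sketch of that portable variant (the file name tst_portable.awk is just for illustration; only the top of the script is shown, the remaining rules are identical to tst.awk above):

$ cat tst_portable.awk
BEGIN { FS=", *"; OFS=", " }
FNR==1 { ++ARGIND }   # emulate gawk's ARGIND in non-GNU awks
NR==FNR {
    colNames[++numCols] = $0
    next
}
# ... rest unchanged from tst.awk ...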

Answer 1 (score: 2)

akshay@db-3325:/tmp$ cat order 
DD_avg
ZZ_avg
ZZ_max
TT_avg
TT_max
UU_avg
JJ_avg

akshay@db-3325:/tmp$ cat values
Item, avg, max
TT, 3, 5
DD, 3, 6
ZZ, 6, 8
UU, 3, 3
JJ, 1, 5

akshay@db-3325:/tmp$ cat values1 
Item, avg, max
TT, 1, 3
DD, 2, 4

akshay@db-3325:/tmp$ awk  'BEGIN{FS=OFS=","}FNR==NR{o[oh[FNR]=$1];next}function p(){s="";for(i=1; i in oh; i++){ if(!hp){ hr=(hr?hr OFS:"") oh[i] }  s = (s ? s OFS:"")o[oh[i]]; o[oh[i]]="" } if(!hp){print "filename",hr; hp=1} print pf,s}k && FNR==1{p()}{gsub(/ /,""); for(i=2; i<=NF; i++){if(FNR==1){ h[i]=$i }else{ k = $1"_"h[i]; if(k in o)o[k]=$i } } pf=FILENAME }END{p()}' order values values1 
filename,DD_avg,ZZ_avg,ZZ_max,TT_avg,TT_max,UU_avg,JJ_avg
values,3,6,8,3,5,3,1
values1,2,,,1,3,,

For better readability:

awk '
 BEGIN{
     FS=OFS=","
 }
 # first file (order): remember the requested keys and the order they were listed in
 FNR==NR{
        o[oh[FNR]=$1];        # oh[n] = n-th key (e.g. "DD_avg"); o[key] will hold its value
        next
 }
 # print one output row for the file that was just finished (its name is kept in pf)
 function p(){
        s="";
        for(i=1; i in oh; i++){
           if(!hp){hr=(hr?hr OFS:"") oh[i]}   # build the header string only once
           s = (s ? s OFS:"")o[oh[i]];        # append this key's value (may be empty)
           o[oh[i]]=""                        # clear it for the next file
        }
        if(!hp){ print "filename",hr; hp=1}   # print the header before the first data row
        print pf,s
 }
 # a new values file starts: flush the row of the previous one
 k && FNR==1{ p() }
 {
    gsub(/ /,"");                             # strip blanks so ", " splits cleanly on ","
    for(i=2; i<=NF; i++)
    {
       if(FNR==1){
          h[i]=$i                             # header line: remember the column name (avg, max)
       }
       else{
          k = $1"_"h[i];                      # build a key such as "TT_avg"
          if(k in o)o[k]=$i                   # store the value only if that key was requested
       }
    }
       pf=FILENAME                            # remember the current file name
 }
 END{
   p()                                        # flush the last file
 }
' order values values1 

Answer 2 (score: 2)

awk to the rescue!

awk -F_ -v OFS=', ' '
         NR==FNR {h[++c]=$1; t[c]=$2; next}
         FNR==1  {if(!data) {
                    printf "%s", "file_name";
                    for(i=1;i<=c;i++)  printf "%s", OFS h[i]"_"t[i];
                    print ""}
                  else pr()}

         FNR>1   {avg[$1]=$2; max[$1]=$3; data=1}

         END     {pr()}

         function pr() {
             printf "%s", FILENAME;
             for(i=1;i<=c;i++)  printf "%s", OFS (t[i]=="avg"?avg[h[i]]:max[h[i]])
             print ""}' order.csv FS=', *' values1.csv 

file_name, DD_avg, ZZ_avg, ZZ_max, TT_avg, TT_max, UU_avg, JJ_avg
values1.csv, 3, 6, 8, 3, 5, 3, 1

Additional file names can simply be appended after values1.csv on the command line.
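For example, to process all three files from the question in one pass (a sketch: '...' stands for the same awk program shown above, and FS=', *' must still come right before the values files):

$ awk -F_ -v OFS=', ' '...' order.csv FS=', *' values1.csv values2.csv values3.csv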

Answer 3 (score: 1)

This looks like a job for Python, at least if you want to parse CSV properly (quoted fields, multi-line fields, fields containing commas, etc.), handle missing columns gracefully, and support a variable number of columns per file, a different column order per file, a different subset of columns per file, and so on.

Here is a Python 2/3 script that reads the column selection and ordering from the file given as the script's first argument, and then reads the "values files" from the remaining arguments. The selected rows and columns (in that order) are printed to standard output, so you can redirect them to a file. For better handling of unusual field values (e.g. multi-line fields that need quoting), you would need to use csv.writer instead of the plain print/join at the end.

The script:

#!/usr/bin/python
import sys
import csv
from collections import defaultdict

with open(sys.argv[1], 'r') as csvfile:
    # AA_avg, BB_max lines -> [['AA', 'avg'], ['BB', 'max']]
    order = list(csv.reader(csvfile, delimiter='_'))

# output header
print(','.join(["file_name"] + ["{}_{}".format(*o) for o in order]))

for filename in sys.argv[2:]:
    with open(filename, 'r') as csvfile:
        # read all values in a 2D associative map
        reader = csv.DictReader(csvfile, skipinitialspace=True)
        values = defaultdict(dict)
        for row in reader:
            item = row[reader.fieldnames[0]]
            for field in reader.fieldnames[1:]:
                values[item][field] = row[field]

    # select and print only the requested item/field pairs, in the requested order;
    # missing items or fields fall back to 'N/A' so the columns stay aligned with the header
    line = [filename] + [values[item].get(field, 'N/A') for item, field in order]
    print(','.join(line))
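
A usage sketch (the script name extract.py and the output file name are assumptions, not part of the original answer):

$ python extract.py order.csv values1.csv values2.csv values3.csv > report.csv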