How to get the cardinality of each field with AWK?

Date: 2018-01-11 12:16:02

Tags: awk

I'm trying to count the unique occurrences of each field in a txt file.

Sample:

2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA

Full dataset here: https://github.com/markgrover/cloudcon-hive (the 2008 flight-delay dataset).

For a single column we can do:

 for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head | sort | uniq | wc -l ; done | tr '\n' ':' ; echo

Is there a way to do it for all columns in one pass?

I think the expected output would look like this:

 1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:

And for the entire dataset:

1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:

3 answers:

Answer 0 (score: 3)

With GNU awk, which supports true multi-dimensional arrays:

$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        vals[i][$i]
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)
    }
}

$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3

And with any awk:

$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        if ( !seen[i,$i]++ ) {
            cnt[i]++
        }
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", cnt[i], (i<NF?OFS:ORS)
    }
}

$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
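As a quick sanity check, the portable version can be condensed into a one-liner and run against a tiny made-up sample (the three records below are illustrative only, not from the flight dataset):

```shell
# Three made-up CSV records: column 1 has 1 unique value,
# column 2 has 2, and column 3 has 3.
printf 'a,x,1\na,y,2\na,x,3\n' |
awk 'BEGIN { FS=","; OFS=":" }
     { for (i=1; i<=NF; i++) if (!seen[i,$i]++) cnt[i]++ }
     END { for (i=1; i<=NF; i++) printf "%s%s", cnt[i], (i<NF?OFS:ORS) }'
# prints 1:2:3
```

The seen[i,$i]++ test fires only on the first occurrence of a value in a column, so cnt[i] ends up holding that column's cardinality without needing GNU-specific arrays of arrays.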

Answer 1 (score: 2)

In GNU awk:

$ awk '
BEGIN { FS=OFS="," }                        # input and output delimiters both ,
{
    for(i=1;i<=NF;i++)                      # iterate over every field
        a[i][$i]                            # store unique values in a 2d hash
}
END {                                       # after all the records
    for(i=1;i<=NF;i++)                      # for each field,
        for(j in a[i])
            c[i]++                          # count its unique values
    for(i=1;i<=NF;i++)
        printf "%s%s",c[i], (i==NF?ORS:OFS) # output the counts
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3

The output isn't exactly the same, and I'm not sure whether the mistake is yours or mine. Well, the last column takes the values 79, 0 and NA, so my count is the accurate one.
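The last-column discrepancy is easy to verify by hand: in the ten sample rows the final field is 79, 0, NA, 0, and then NA six more times, so its cardinality is 3:

```shell
# Values of the last field in the ten sample rows, one per line;
# sort -u keeps one copy of each distinct value, wc -l counts them.
printf '79\n0\nNA\n0\nNA\nNA\nNA\nNA\nNA\nNA\n' | sort -u | wc -l
# the unique values are 0, 79 and NA, so the count is 3
```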

Answer 2 (score: 1)

Another awk:

This gives you a rolling count; pipe it to tail -1 to get the last line, which holds the total counts.

$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++) 
                       printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file

1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
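For example, keeping only the final totals from the rolling output on a small made-up input (not the flight data):

```shell
# Made-up input: column 1 has 2 unique values, column 2 has 3.
# NR minus the running duplicate count c[i] yields the unique count so far;
# tail -1 keeps only the totals from the last record.
printf '1,a\n1,b\n2,c\n' |
awk -F, -v OFS=: '{for(i=1;i<=NF;i++)
                     printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' |
tail -1
# prints 2:3
```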