How to compare the number of delimiters on each line of a Unix file with the number in the top row (header fields)?

Date: 2016-12-09 12:41:37

Tags: bash unix awk text-processing

I am setting up a var as

var=$(cat ip.txt | head -1 | sed 's/[^|]//g' | awk '{ print length }')

which stores the number of '|' in the top row.

Then, I can get the number of delimiters in each line using

awk -F\| '{print NF-1}' ip.txt

and I need to compare each of these per-line counts with $var.

The final output should be the number of lines that have more delimiters than the header. For example, if lines 2 to 20 have more delimiters than the header, the output should be "19 lines have more delimiters than the top row from a total of 6000 rows", where 6000 is the number of rows in the file.

Example:

$ cat ip.txt
DeptID|EmpFName|EmpLName|Salary
Engg|Sam|Lewis|1000
Engg|Smith|Davis|2000|||
HR|Denis|Lillie|1500
HR|Danny|Borrinson|3000|
IT|David|Letterman|2000||
IT|John|Newman|3000

The header has 3 '|', but lines 3, 5 and 6 have extra delimiters, so I want output like "3 lines have more delimiters than the top row from a total of 7 rows".

2 answers:

Answer 0 (score: 2)

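$ awk -F'|' 'NR==1{n=NF} NF>n{c++} END{printf "%d lines have more delimiters than the top row from a total of %d rows\n", c, NR}' ip.txt
3 lines have more delimiters than the top row from a total of 7 rows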
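
The same check can also be driven by the shell variable from the question, passed into awk with -v. The following is only a minimal sketch: the file name ip_sample.txt and the awk variable name want are illustrative assumptions, and the sample data is recreated from the question so the snippet runs on its own.

```shell
# Recreate the sample data from the question (ip_sample.txt is a hypothetical name).
cat > ip_sample.txt <<'EOF'
DeptID|EmpFName|EmpLName|Salary
Engg|Sam|Lewis|1000
Engg|Smith|Davis|2000|||
HR|Denis|Lillie|1500
HR|Danny|Borrinson|3000|
IT|David|Letterman|2000||
IT|John|Newman|3000
EOF

# Number of '|' in the header row: field count minus one.
var=$(awk -F'|' 'NR == 1 { print NF - 1; exit }' ip_sample.txt)

# Count the lines whose delimiter count exceeds the header's.
# "want" is an illustrative awk variable holding the shell value of $var.
result=$(awk -F'|' -v want="$var" '
    NF - 1 > want { c++ }
    END { printf "%d lines have more delimiters than the top row from a total of %d rows\n", c, NR }
' ip_sample.txt)

echo "$result"
```

Passing the value with -v avoids splicing $var into the awk program text, so the quoting stays simple.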

Answer 1 (score: 1)

awk -F '|' '
  NR == 1 {
    # use the separator count of the header as the reference
    RefCount = NF - 1
    # skip the header itself
    next
  }
  {
    # tally lines by their number of separators (one counter per distinct count)
    LinesWith[NF - 1]++
    # uncomment the next line to print the bad lines
    # if (NF - 1 != RefCount) print
  }

  # at the end of the file
  END {
    # report each non-matching count first, then the matching one
    for (LineWith in LinesWith)
      if (LineWith != RefCount)
        print "There is/are " LinesWith[LineWith] " line(s) with " LineWith " separators"
    # adding 0 prints 0 rather than an empty string when no line matches the header
    print "There is/are " (LinesWith[RefCount] + 0) " correct line(s) with " RefCount " separators"
  }
' ip.txt

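Notes:

  • This is not a one-liner (though it could be), but it does everything with a single awk invocation; no shell variable is assigned unless the result is needed later in a script.
  • The code is self-commented to explain the concepts used (so it looks a bit long).
  • I changed the request slightly (counting lines for each distinct number of separators), but a few simple modifications would give a single total instead of the details.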
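
One possible form of the "simple modification" from the last note is sketched below. It replaces the per-count array with a single counter and captures the total in a shell variable for later use in a script. The names Bad, bad and ip_sample.txt are illustrative assumptions, and the sample data is copied from the question.

```shell
# Recreate the sample data from the question (ip_sample.txt is a hypothetical name).
cat > ip_sample.txt <<'EOF'
DeptID|EmpFName|EmpLName|Salary
Engg|Sam|Lewis|1000
Engg|Smith|Davis|2000|||
HR|Denis|Lillie|1500
HR|Danny|Borrinson|3000|
IT|David|Letterman|2000||
IT|John|Newman|3000
EOF

# Count the data lines whose separator count differs from the header's.
bad=$(awk -F '|' '
    NR == 1 { RefCount = NF - 1; next }   # header: remember its separator count
    NF - 1 != RefCount { Bad++ }          # data line that does not match the header
    END { print Bad + 0 }                 # adding 0 prints 0 when nothing mismatched
' ip_sample.txt)

echo "$bad line(s) do not match the header"
```

Using != counts lines with either too many or too few separators; changing it to > would count only the lines with extra delimiters, which is what the question asks for.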