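Count keywords from one file in another file

Date: 2017-06-05 12:17:26

Tags: bash shell awk

I have two files, File1 and File2. I have to find the keywords from File2 in File1 and count them. Lines in File1 that contain none of the keywords from File2 should be counted as OTHERS and, if possible, saved to File3 (for verification).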

File1

000001111YYYY0000
122334YYYY9999
89898989AAAA89899
AAAA7678989812234
ZZZZ878098098098
0000000000000000

File2

YYYY
AAAA
ZZZZ

Output

YYYY: 2
AAAA: 2
ZZZZ: 1
OTHERS: 1

File3 (OTHERS)

0000000000000000

The approach I know is to use grep and wc -l to count each keyword, but that is not ideal, especially when I have many keywords to look for.
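For comparison, the grep-based approach mentioned above might look roughly like the sketch below. It is only an illustration, not code from the question: the temporary directory and the variable names (dir, counts, others) are assumptions, and grep -c is used in place of grep piped to wc -l. It needs one pass over File1 per keyword, which is the cost the question wants to avoid.

```shell
# Sample data copied from the question; the temporary directory is an assumption.
dir=$(mktemp -d)
printf '%s\n' 000001111YYYY0000 122334YYYY9999 89898989AAAA89899 \
    AAAA7678989812234 ZZZZ878098098098 0000000000000000 > "$dir/File1"
printf '%s\n' YYYY AAAA ZZZZ > "$dir/File2"

# One grep pass over File1 per keyword: count the lines that contain it.
counts=$(while IFS= read -r kw; do
    printf '%s: %s\n' "$kw" "$(grep -c -F -e "$kw" "$dir/File1")"
done < "$dir/File2")

# Lines that contain none of the keywords (-v inverts, -f reads the patterns).
others=$(grep -v -F -f "$dir/File2" "$dir/File1")

printf '%s\n' "$counts"
printf 'OTHERS: %s\n' "$(printf '%s\n' "$others" | grep -c .)"
```

Note that grep -c counts matching lines, not occurrences, so a line containing the same keyword twice still counts once.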

4 answers:

Answer 0: (score: 2)

Using awk.

Command line:

awk 'FNR==NR{a[$1];next}\
{b=1;for(i in a)if(z=gsub(i,"&")){x[i]+=z;b=0}}\
b{x["Others"]++;print > "file3"}\
END{for(i in x)print i, x[i]}' file2 file1

Because of its length, this is probably better suited to a script:

FNR==NR{
    Strings[$1]
    next
}
{
    Found=0
    for(Regex in Strings)
        if(matches=gsub(Regex,"&")){
            Sums[Regex]+=matches
            Found=1
        }
}
!Found{
    Sums["Others"]++ 
    print > "file3"
}
END{
     for(Regex in Sums)
         print Regex, Sums[Regex]
}

Save it as

awkscript.awk

and run

awk -f awkscript.awk file2 file1
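The counting in this answer relies on gsub() returning the number of substitutions it made, with "&" in the replacement standing for the matched text, so the line itself is left unchanged. A minimal sketch of that behaviour, using made-up input lines and a hypothetical variable named result:

```shell
# gsub() returns how many replacements it made; replacing each match
# with "&" (the matched text itself) leaves the line unchanged.
result=$(printf '%s\n' 'AAAA12AAAA34' '0000' |
    awk '{ n = gsub(/AAAA/, "&"); print n, $0 }')
printf '%s\n' "$result"
```

This prints "2 AAAA12AAAA34" and "0 0000". It also shows that the answer counts occurrences rather than lines: a line that contains a keyword twice adds 2 to that keyword's total.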

Answer 1: (score: 0)

Try this: if you do not care whether the output follows the order of file1 or file2, the following may help you.

awk 'FNR==NR{A[$0];next} {VAL=$0;gsub(/[0-9]/,"")} ($0 in A){B[$0]++;next} !($0 in A){OTHERS[VAL]++} END{for(i in B){print i": "B[i]};for(j in OTHERS){print j": "OTHERS[j]}}' file2  file1

I will add an explanation soon too.

EDIT1: Adding the code in non-one-liner form, with a proper explanation.

awk 'FNR==NR{            #### TRUE only while the first file (file2) is read: FNR restarts at 1 for each file, NR keeps counting across all files.
        A[$0]            #### store each keyword line of file2 as an index of array A.
        next             #### skip the remaining rules for this line.
    }
    {
        VAL=$0           #### keep the original line in the variable VAL.
        gsub(/[0-9]/,"") #### gsub is a built-in awk function; here it deletes every digit from the current file1 line.
    }
    ($0 in A){           #### the digit-free line is exactly one of the keywords.
        B[$0]++          #### count it in array B, indexed by the keyword.
        next             #### skip the remaining rules for this line.
    }
    !($0 in A){          #### the digit-free line is NOT a keyword.
        OTHERS[VAL]++    #### count the original line in array OTHERS, indexed by VAL.
    }
    END{                 #### the END section runs after all input has been read.
        for(i in B){
            print i": "B[i]            #### print each keyword and its count.
        }
        for(j in OTHERS){
            print j": "OTHERS[j]       #### print each unmatched line and its count.
        }
    }
    ' file2  file1       #### the input files: keywords first, then data.
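The FNR==NR test described in the comments can be seen in isolation with two small throwaway files. This is only a sketch: the file names first and second, their contents, and the variable result are made up for illustration.

```shell
# Two throwaway files; their names and contents are assumptions.
dir=$(mktemp -d)
printf '%s\n' a b > "$dir/first"
printf '%s\n' c d e > "$dir/second"

# NR counts lines across all files, FNR restarts at 1 for every file,
# so NR == FNR holds only while the first file is being read.
result=$(awk '{ print NR, FNR, (NR == FNR ? "first" : "later") }' \
    "$dir/first" "$dir/second")
printf '%s\n' "$result"
```

One caveat: if the first file is empty, NR==FNR is also true for every line of the second file, so the keywords file should not be empty when this idiom is used.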

Answer 2: (score: 0)

awk solution (including saving the "others" to a separate file, file3.txt):

awk 'NR==FNR{ group=(group)?group"|"$0 : $0; next }
     { if(match($0,group)){ a[substr($0,RSTART,RLENGTH)]++ } 
       else { a["OTHERS"]++; print >> "file3.txt" } 
     } END { for(i in a) print i": "a[i] }' file2 file1

Output:

ZZZZ: 1
AAAA: 2
YYYY: 2
OTHERS: 1

Others:

cat file3.txt
0000000000000000
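This answer joins the keywords into one alternation regex and uses match(), which sets RSTART and RLENGTH to the position and length of the leftmost match. A small sketch of that step on its own, with the alternation passed in directly and a hypothetical variable named result:

```shell
# match() sets RSTART/RLENGTH for the leftmost match of the alternation;
# substr() then extracts the keyword that was found.
result=$(printf '%s\n' '89898989AAAA89899' 'AAAA12YYYY' '0000' |
    awk -v group='YYYY|AAAA|ZZZZ' '{
        if (match($0, group))
            print substr($0, RSTART, RLENGTH)
        else
            print "OTHERS"
    }')
printf '%s\n' "$result"
```

The second input line contains two keywords but is reported only as AAAA, because match() stops at the leftmost match. With this approach each line is therefore counted once, under the first keyword found. The keywords are also treated as regular expressions, so any regex metacharacters in them would need escaping.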

Answer 3: (score: 0)

awk 'BEGIN{a["OTHERS"]=0}
  (NR==FNR) {a[$0]=0;next}
  {b=0}{for(i in a) if( match($0,i) !=0 ){a[i]++;b=1} }
  {if(b==0) a["OTHERS"]++}
  END{for(i in a) print i": "a[i]}' File2 File1