Percentage of matching lines across multiple files

Date: 2018-02-22 16:52:07

Tags: bash grep comm

I need to find the lines that are common to multiple files: more than 100 files, each with millions of lines. This is similar to: Shell: Find Matching Lines Across Many Files

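However, I want to find not only the lines shared by all files, but also the lines found in all files but one, all files but two, and so on. I'm interested in doing this with percentages: for example, which entries appear in 90% of the files, 80%, 70%, and so on. As an example: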

File1

lineA
lineB
lineC

File2

lineB
lineC
lineD

File3

lineC
lineE
lineF

Hypothetical output, for demonstration:

<lineC> is found in 3 out of 3 files (100.00%)

<lineB> is found in 2 out of 3 files (66.67%)

<lineF> is found in 1 out of 3 files (33.33%)

Does anyone know how to do this?

Thanks a lot!

1 answer:

Answer 0: (score: 2)

Using GNU awk for its multidimensional arrays (arrays of arrays, available since gawk 4.0):

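gawk '
    BEGIN {nfiles = ARGC-1}    # number of input files on the command line
    # Record each (line, file) pair once; repeats within a file are not double counted
    { lines[$0][FILENAME] = 1 }
    END {
        for (line in lines) {
            n = length(lines[line])    # number of distinct files containing this line
            printf "<%s> is found in %d of %d files (%.2f%%)\n", line, n, nfiles, 100*n/nfiles
        }
    }
' file{1,2,3}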
<lineA> is found in 1 of 3 files (33.33%)
<lineB> is found in 2 of 3 files (66.67%)
<lineC> is found in 3 of 3 files (100.00%)
<lineD> is found in 1 of 3 files (33.33%)
<lineE> is found in 1 of 3 files (33.33%)
<lineF> is found in 1 of 3 files (33.33%)

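The output order is not deterministic, since awk's for (line in lines) loop visits array elements in an unspecified order.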
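Since the question asks which entries show up in 90%, 80%, 70% of the files, a stable order and a cutoff may be useful. The following is only a sketch, not part of the original answer: the function name lines_in_at_least and the tab-separated output layout (percentage, file count, line) are made up for illustration. It assumes any POSIX awk, and uses a combined (FILENAME, $0) key instead of gawk's arrays of arrays to count each line once per file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and output format are illustrative assumptions).
# Usage: lines_in_at_least PCT FILE...
# Prints "percentage<TAB>file count<TAB>line" for every line found in at
# least PCT percent of the files, most widely shared lines first.
lines_in_at_least() {
    local pct=$1
    shift
    awk -v pct="$pct" '
        BEGIN { nfiles = ARGC - 1 }    # options are not counted in ARGC
        # Count a line once per file, even if it repeats within that file.
        !seen[FILENAME, $0]++ { count[$0]++ }
        END {
            for (line in count) {
                share = 100 * count[line] / nfiles
                if (share >= pct)
                    printf "%.2f\t%d\t%s\n", share, count[line], line
            }
        }
    ' "$@" | sort -t "$(printf '\t')" -k2,2nr -k3
    # sort: file count descending, then the line text, for a reproducible order
}
```

With the three sample files from the question, lines_in_at_least 50 file1 file2 file3 prints the lineC row (100.00, 3 files) followed by the lineB row (66.67, 2 files). Like the gawk version, this keeps one entry per distinct (file, line) pair in memory, so with more than 100 files of millions of lines each the memory footprint could become a concern.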