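I need to find lines that are common to multiple files; there are more than 100 files, each with millions of lines. This is similar to: Shell: Find Matching Lines Across Many Files.
However, I want to find not only the lines shared by all files, but also the lines found in all files but one, in all files but two, and so on. I'd like to express this as percentages: for example, which entries appear in 90% of the files, in 80%, in 70%, and so on. An example: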
File1:
lineA
lineB
lineC
File2:
lineB
lineC
lineD
File3:
lineC
lineE
lineF
Hypothetical output, for illustration:
<lineC> is found in 3 out of 3 files (100.00%)
<lineB> is found in 2 out of 3 files (66.67%)
<lineF> is found in 1 out of 3 files (33.33%)
Does anyone know how to do this?
Many thanks!
Answer 0 (score: 2)
Using GNU awk for its multidimensional arrays:
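gawk '
    # One command-line argument per input file
    BEGIN { nfiles = ARGC - 1 }
    # Record each file a line occurs in; repeats within a file collapse to one entry
    { lines[$0][FILENAME] = 1 }
    END {
        for (line in lines) {
            # Number of distinct files that contain this line
            n = length(lines[line])
            printf "<%s> is found in %d of %d files (%.2f%%)\n", line, n, nfiles, 100*n/nfiles
        }
    }
' file{1,2,3}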
<lineA> is found in 1 of 3 files (33.33%)
<lineB> is found in 2 of 3 files (66.67%)
<lineC> is found in 3 of 3 files (100.00%)
<lineD> is found in 1 of 3 files (33.33%)
<lineE> is found in 1 of 3 files (33.33%)
<lineF> is found in 1 of 3 files (33.33%)
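The output order is not deterministic, because awk's for (line in lines) loop visits array elements in an unspecified order.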
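If a percentage cutoff and a stable order are wanted, the same counts can also be produced with sort and uniq instead of an in-memory awk array. The following is only a sketch under stated assumptions, not part of the answer above: shared_lines is a hypothetical helper name, its first argument is assumed to be the minimum percentage, and it swaps the gawk array-of-arrays for a sort -u | sort | uniq -c pipeline, which lets sort spill to disk when the files are large.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer above):
# print the lines found in at least MIN_PCT percent of the given files.
# usage: shared_lines MIN_PCT file...
# The body runs in a subshell so LC_ALL and the variables stay local.
shared_lines() (
    export LC_ALL=C          # byte-wise ordering: consistent between sort and uniq
    min_pct=$1
    shift
    nfiles=$#
    for f in "$@"; do
        # sort -u drops repeats inside one file, so each file counts a line once
        sort -u -- "$f"
    done |
        sort |
        uniq -c |
        awk -v nfiles="$nfiles" -v min="$min_pct" '
            {
                n = $1 + 0
                line = $0
                # remove the count that uniq -c prepends, keeping the line intact
                sub(/^ *[0-9]+ /, "", line)
                # integer comparison avoids rounding problems at the cutoff
                if (100 * n >= min * nfiles)
                    printf "%d\t<%s> is found in %d of %d files (%.2f%%)\n",
                           n, line, n, nfiles, 100 * n / nfiles
            }' |
        # most widely shared lines first, ties broken by the line text
        sort -k1,1nr -k2 |
        cut -f2-
)
```

With the three sample files from the question, running shared_lines 60 file1 file2 file3 should report only lineC (100.00%) and lineB (66.67%), in that order; a cutoff of 0 lists every line.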