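Does anyone have a particularly elegant command line (Linux, OS X) for identifying "textually similar" files in a given directory?
By "textually similar" I mean files that differ in no more than N lines.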
Answer 0: (score: 1)
Here is a rough way to count the differing lines using a unified diff and wc. grep is used to filter out the diff header and hunk-marker lines:
diff -U 0 file1 file2 | grep -v ^@ | grep -v ^--- | grep -v ^+++ | wc -l
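To get from a single pair to "every similar pair in a directory", the count can be wrapped in a loop. The following is only a sketch: the function names `count_diff_lines` and `similar_pairs` are made up for illustration, and it counts `+`/`-` lines after skipping the two header lines with `tail` instead of chaining several `grep -v` calls, so that a changed line which itself starts with `--` or `++` is still counted. It assumes the entries in the directory are text files.

```shell
#!/bin/bash
# Hypothetical helper: number of added plus removed lines between two files.
count_diff_lines() {
    # tail -n +3 drops the "---" and "+++" header lines of the unified diff;
    # grep -c then counts the remaining lines that start with + or -.
    diff -U 0 "$1" "$2" | tail -n +3 | grep -c '^[-+]' || true
}

# Hypothetical helper: print every pair of regular files in directory $1
# whose diff has at most $2 changed lines, as "count<TAB>fileA<TAB>fileB".
similar_pairs() {
    local dir=$1 max=$2 i j n
    local files
    files=("$dir"/*)
    for ((i = 0; i < ${#files[@]}; i++)); do
        [ -f "${files[i]}" ] || continue
        for ((j = i + 1; j < ${#files[@]}; j++)); do
            [ -f "${files[j]}" ] || continue
            n=$(count_diff_lines "${files[i]}" "${files[j]}")
            if [ "$n" -le "$max" ]; then
                printf '%s\t%s\t%s\n' "$n" "${files[i]}" "${files[j]}"
            fi
        done
    done
}
```

For example, `similar_pairs . 4` would list the pairs in the current directory that differ by at most 4 added or removed lines. Note that a modified line counts twice here (once as removed, once as added), and that the loop compares every pair, so it gets slow for large directories.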
Answer 1: (score: 1)
Using awk:
diff file1 file2 |awk '!/^<|^>|^-/{a=$0;lt[a]=0;gt[a]=0;next} # hunk header (a line not starting with <, > or ---): start new counters lt and gt
/</{lt[a]++} # a "<" line: count it in lt for the current hunk
/>/{gt[a]++} # a ">" line: count it in gt for the current hunk
END{for (i in lt)
sum+=lt[i]>gt[i]?lt[i]:gt[i] # for each hunk, add the larger of the "<" and ">" counts to sum
printf "Files have differs in %d lines\n",sum # print the total
if (sum<3) {print "So files are similar" }
else{print "So files are not similar"}
}'
You can set the threshold yourself. In my command the check is "if (sum<3)", so files that differ in fewer than 3 lines are considered similar.
Test results:
$ cat file1
a
b
a
d
b
c
c
$ cat file2
a
b
d
b
d
c
d
f
$ diff file1 file2
3d2
< a
5a5
> d
7,8c7,8
< c
<
---
> d
> f
$ diff file1 file2 |awk '!/^<|^>|^-/{a=$0;lt[a]=0;gt[a]=0;next}/</{lt[a]++}/>/{gt[a]++}END{for (i in lt) sum+=lt[i]>gt[i]?lt[i]:gt[i];printf "Files have differs in %d lines\n",sum;if (sum<3) {print "So files are similar" }else{print "So files are not similar"}}'
Files have differs in 4 lines
So files are not similar
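If the threshold needs to change often, the same counting idea can be wrapped in a function that takes it as an argument. This is a sketch rather than a replacement for the command above: the name `similar_by_hunks` and the `limit` variable are made up for illustration, and the patterns are anchored to the start of the line so that a hunk header is recognised by its leading digit and a `<` or `>` inside the text of a line is not counted.

```shell
#!/bin/bash
# Hypothetical helper: print the per-hunk maximum of "<" and ">" line counts,
# summed over all hunks, and succeed only if that sum is below the threshold.
# Usage: similar_by_hunks file1 file2 [threshold]   (threshold defaults to 3)
similar_by_hunks() {
    local threshold=${3:-3}
    diff "$1" "$2" | awk -v limit="$threshold" '
        /^[0-9]/ { hunk = $0; lt[hunk] = 0; gt[hunk] = 0; next }  # hunk header such as 7,8c7,8
        /^</     { lt[hunk]++ }                                   # line only in the first file
        /^>/     { gt[hunk]++ }                                   # line only in the second file
        END {
            for (h in lt) sum += (lt[h] > gt[h] ? lt[h] : gt[h])
            printf "%d\n", sum
            exit (sum < limit ? 0 : 1)   # exit status 0 means similar
        }'
}
```

Because the result is also reported through the exit status, it can be used directly in a condition, for example `if similar_by_hunks file1 file2 5 >/dev/null; then echo similar; fi`.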
Answer 2: (score: 0)
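Perhaps PMD is what you are looking for: https://pmd.github.io
It is maintained and easy to use.
You probably want its duplicate code detection: https://pmd.github.io/pmd-5.5.5/usage/cpd-usage.html (your question does not make clear whether you are targeting code or plain text, but I don't see why it wouldn't work in both cases).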
Answer 3: (score: 0)
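Using Terraform means having many files that were copied from other files with only minor changes. When you want to understand what is special about a file, it is really frustrating to work out which file it was copied from. I made a tool called similarities.sh to help me see how similar a file is to each file in a set of other files.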
#!/bin/bash
fileA="$1"
shift
for fileB in "$@"; do
(
# diff once grep twice with the help of tee and stderr
diff "$fileA" "$fileB" | \
tee >(grep -cE '^< ' >&2) | \
grep -cE '^> ' >&2
# recapture stderr
) 2>&1 | (
read -d '' diffA diffB;
printf "The files %s and %s have %s:%s diffs out of %s:%s lines.\n" \
"$fileA" "$fileB" "$diffA" "$diffB" $(wc -l < "$fileA") $(wc -l < "$fileB")
)
done | column -t
Here it is in action:
$ similarities.sh terraform.tfvars ../*/terraform.tfvars
The files terraform.tfvars and ../api_proxy/terraform.tfvars have 3:3 diffs out of 51:51 lines.
The files terraform.tfvars and ../cf-ip-location-lookup/terraform.tfvars have 4:12 diffs out of 51:59 lines.
The files terraform.tfvars and ../cf-region-cookie-setter/terraform.tfvars have 4:8 diffs out of 51:55 lines.
The files terraform.tfvars and ../cf-switch-region-origin/terraform.tfvars have 4:10 diffs out of 51:57 lines.
The files terraform.tfvars and ../reformat_devops_alerts/terraform.tfvars have 0:0 diffs out of 51:51 lines.
The files terraform.tfvars and ../restart_location/terraform.tfvars have 17:3 diffs out of 51:37 lines.
The files terraform.tfvars and ../warehouse-availability-etl/terraform.tfvars have 3:3 diffs out of 51:51 lines.
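One caveat with the tee trick: the two grep processes write their counts to stderr independently, so the order in which the counts arrive is not guaranteed. A possible variation, sketched below, counts both kinds of lines in a single awk pass instead. The function names `compare_one` and `similarities` are made up for illustration; the output format follows the script above.

```shell
#!/bin/bash
# Hypothetical helper: report how many "<" and ">" lines diff prints for two files.
compare_one() {
    local a=$1 b=$2 counts
    # One awk pass produces "left:right", so the order of the counts is fixed.
    counts=$(diff "$a" "$b" | awk '/^</ { l++ } /^>/ { r++ } END { printf "%d:%d", l, r }')
    # tr strips the padding that some wc implementations put before the number.
    printf 'The files %s and %s have %s diffs out of %s:%s lines.\n' \
        "$a" "$b" "$counts" \
        "$(wc -l < "$a" | tr -d ' ')" "$(wc -l < "$b" | tr -d ' ')"
}

# Hypothetical wrapper: compare the first argument against all remaining ones
# and align the result in columns, as the original script does.
similarities() {
    local fileA=$1 fileB
    shift
    for fileB in "$@"; do
        compare_one "$fileA" "$fileB"
    done | column -t
}
```

To use it as a script, add a final line `similarities "$@"` and call it the same way as above, for example `similarities.sh terraform.tfvars ../*/terraform.tfvars`.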