Question

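I need to work with large files and have to find the differences between two of them. I don't need the differing bits themselves, only the number of differences.

To find the number of differing lines I came up with: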

diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l

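It works, but is there a better way?

And how can I count the exact number of differences (using standard tools such as bash, diff, awk, sed, or some older version of perl)?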

Answer 1

If you want to count the number of lines that differ, use:

diff -U 0 file1 file2 | grep ^@ | wc -l

Doesn't John's answer double-count the differing lines?

Answer 2

diff -U 0 file1 file2 | grep -v ^@ | wc -l

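Subtract 2 for the two file names at the top of the diff listing. The unified format is probably a bit faster than the side-by-side format.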
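To illustrate the counting logic, here is a minimal Python sketch that uses the standard-library difflib module in place of diff -U 0. The function name count_changed_lines is made up for this example, and difflib is much slower than diff on large files, so treat it as an illustration rather than a replacement. It skips the two file-name header lines by position and ignores the @@ hunk headers, each of which can cover several changed lines.

```python
import difflib
from itertools import islice


def count_changed_lines(old_lines, new_lines):
    """Return (removed, added) line counts from a zero-context unified diff."""
    rows = difflib.unified_diff(
        old_lines, new_lines, "file1", "file2", n=0, lineterm=""
    )
    removed = added = 0
    # The first two rows are the "--- file1" / "+++ file2" headers.
    # Skipping them by position avoids confusing them with a removed
    # line whose own text happens to start with "--".
    for row in islice(rows, 2, None):
        if row.startswith("-"):
            removed += 1
        elif row.startswith("+"):
            added += 1
        # Rows starting with "@@" are hunk headers and are not counted.
    return removed, added
```

With real files, the two arguments could be the results of readlines() on each file. A replaced line shows up once as removed and once as added, which is why a plain wc -l over the diff output counts it twice.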

Answer 3

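If you are on Linux / Unix, how about comm -23 file1 file2 to print the lines in file1 that are not in file2 (comm expects both files to be sorted), comm -23 file1 file2 | wc -l to count them, and similarly comm -13 ... for the other direction?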
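The idea of counting lines unique to each file can also be sketched in Python without requiring sorted input. This is only an illustration under the assumption that both files fit in memory; the function name unique_line_counts is hypothetical. It treats each file as a multiset of lines, so duplicated lines are matched one-to-one.

```python
from collections import Counter


def unique_line_counts(lines1, lines2):
    """Return (only_in_first, only_in_second) using multiset semantics."""
    c1, c2 = Counter(lines1), Counter(lines2)
    # Counter subtraction keeps only positive counts, i.e. the lines
    # that occur more often on the left side than on the right side.
    only_in_first = sum((c1 - c2).values())
    only_in_second = sum((c2 - c1).values())
    return only_in_first, only_in_second
```

Unlike diff, this ignores line order entirely, so two files with the same lines in a different order are reported as having no differences.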

Answer 4

Since every differing output line starts with a < or > character, I would suggest the following:

diff file1 file2 | grep ^[\>\<] | wc -l

If you use only \< or only \> in the command, you count the differences in just one of the files.

Answer 5

I believe the correct solution is the one in this answer, that is:

$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1

Answer 6

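If you are dealing with files with analogous content that should line up line by line (e.g. CSV files describing similar things), and you want, for example, to find the 2 differences in the following files:

File a:    File b:
min,max    min,max
1,5        2,5
3,4        3,4
-2,10      -1,1

you could implement it in Python like this:

different_lines = 0
with open(file1) as a, open(file2) as b:
    for line in a:
        other_line = b.readline()
        if line != other_line:
            different_lines += 1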
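Note that the loop above stops when the first file ends, so any extra trailing lines in the second file go uncounted. If that is not wanted, one possible variant (a sketch, with the hypothetical function name count_positional_differences) pairs the lines with itertools.zip_longest, which pads the shorter input with None:

```python
from itertools import zip_longest


def count_positional_differences(lines_a, lines_b):
    """Count positions at which the two line sequences differ."""
    different_lines = 0
    # zip_longest pads the shorter input with None, so surplus lines
    # in either input each count as one difference.
    for line_a, line_b in zip_longest(lines_a, lines_b):
        if line_a != line_b:
            different_lines += 1
    return different_lines
```

Open file objects can be passed directly, for example inside with open(file1) as a, open(file2) as b: call count_positional_differences(a, b). The files are then read lazily, line by line, which matters for large inputs.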

Answer 7

Here is a way to count any kind of difference between two files, with a regular expression specifying what counts as a unit of difference. Here . is used, matching any character except newline:

git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l

Excerpt from man git-diff:

--patience
    Generate a diff using the "patience diff" algorithm.

--word-diff[=<mode>]
    Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.

    porcelain
        Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input are represented by a tilde ~ on a line of its own.

--word-diff-regex=<regex>
    Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it was already enabled.

    Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!) for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.

    For example, --word-diff-regex=. will treat each character as a word and, correspondingly, show differences character by character.

pcre2grep is part of the pcre2-utils package on Ubuntu 20.04.
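For small inputs, a roughly analogous character-level count can be sketched in Python with difflib.SequenceMatcher. This is not what git's word diff does internally and its matching may group changes differently; it is also far too slow for large files. The function name count_changed_regions is made up for this example. It counts each contiguous changed region once, whether it is a deletion, an insertion, or a replacement.

```python
import difflib


def count_changed_regions(text_a, text_b):
    """Count contiguous character runs that differ between two strings."""
    # autojunk=False disables the popularity heuristic so that frequent
    # characters are still considered when matching.
    matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
    # get_opcodes() yields (tag, i1, i2, j1, j2); every tag other than
    # "equal" marks one changed region.
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")
```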

How to count the differences between two files on Linux?
