Question

I have diffs that I post-process, and I want to flatten equal lines. Here is an example:

Foo
-Bar
+Bar
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n
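As a quick illustration of that pattern, here is a minimal Python sketch. It assumes the pattern is anchored at the start of a line and that a matched pair should become an unchanged context line (replacing with the empty string would drop the pair instead). The helper name squash_pairs is made up for this example.

```python
import re

# The pattern quoted above, anchored with ^ so that '-' must start a line.
PAIR = re.compile(r"^-(.*)\n\+\1\n", re.MULTILINE)

def squash_pairs(diff_text):
    """Turn each '-X' line that is directly followed by '+X' into a context line ' X'."""
    return PAIR.sub(lambda m: " " + m.group(1) + "\n", diff_text)

sample = " Foo\n-Bar\n+Bar\n Baz\n"
print(squash_pairs(sample), end="")
```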

The problems start when I have multi-line matches like:

-Foo
-Bar
+Foo
+Bar
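To make the failure concrete: the single-pair pattern finds nothing in the grouped form, because no - line is directly followed by its + twin. One possible middle ground between a pure regex and a full parser, sketched in Python, is to let a regex find each run of - lines followed by a run of + lines and compare the two runs in a callback. This assumes the + run always follows its - run immediately; squash_runs is a hypothetical name.

```python
import re

# The single-pair pattern, anchored at line starts.
PAIR = re.compile(r"^-(.*)\n\+\1\n", re.MULTILINE)
# A run of '-' lines immediately followed by a run of '+' lines.
RUN = re.compile(r"^((?:-.*\n)+)((?:\+.*\n)+)", re.MULTILINE)

def squash_runs(diff_text):
    """Replace each -run/+run pair whose contents are identical with context lines."""
    def replace_run(match):
        minus = [line[1:] for line in match.group(1).splitlines()]
        plus = [line[1:] for line in match.group(2).splitlines()]
        if minus != plus:
            return match.group(0)  # a real difference: leave it untouched
        return "".join(" " + line + "\n" for line in minus)
    return RUN.sub(replace_run, diff_text)

grouped = "-Foo\n-Bar\n+Foo\n+Bar\n"
print(PAIR.search(grouped))          # None: the single-pair pattern does not match
print(squash_runs(grouped), end="")  # both lines come back as context lines
```

The --- and +++ header lines also look like a one-line run to this pattern, but their contents differ, so they are left alone.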

Any ideas? Or should I not be doing this with a regex and write a simple parser instead? Or does one already exist?

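Some backstory, in case there is a better solution: I am diffing two files to see whether they are the same. Sadly, the output is nearly the same but requires some post-processing, for example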

-on line %d
+on line 8

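So I go through and convert known strings to other known strings, and then I try to check whether the diff is empty or still shows differences.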
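If both files are available, an alternative to post-processing the diff is to normalize each file first and then diff the results. A minimal sketch using Python's difflib follows; the rule table is an assumption based on the "on line %d" example, and the function names are invented for illustration.

```python
import difflib
import re

# Assumed rule: concrete line numbers are rewritten to the placeholder
# that the other file uses. Real rule sets would be project-specific.
RULES = [(re.compile(r"on line \d+"), "on line %d")]

def normalize(lines):
    """Apply every known-string conversion to every line."""
    out = []
    for line in lines:
        for pattern, replacement in RULES:
            line = pattern.sub(replacement, line)
        out.append(line)
    return out

def remaining_diff(a_lines, b_lines):
    """Unified diff of the normalized inputs; an empty list means 'same'."""
    return list(difflib.unified_diff(normalize(a_lines), normalize(b_lines), lineterm=""))

print(remaining_diff(["Error on line %d"], ["Error on line 8"]))  # []
```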

Answer 1

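I have done some simple analysis of diff output before, so I had a Perl script that gave me a basis to start from. Consider the following two data files, file.1 and file.2.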

file.1

Data

Foo
Bar 1
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n

The problems start when I have multi-line matches like:

Foo 2
Bar 2

Etc.

file.2

Data

Foo
Bar 10
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n

The problems start when I have multi-line matches like:

Foo 20
Bar 20

Etc.

Raw diff output

The raw unified diff output is:

--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar 1
+Bar 10
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo 2
-Bar 2
+Foo 20
+Bar 20

 Etc.

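Post-processed output

Now, after post-processing, all the digit strings have been replaced with ##, so the post-processed file looks like this: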

--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar ##
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo ##
-Bar ##
+Foo ##
+Bar ##

 Etc.

This is the input to the program that analyzes whether any differences remain.
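The masking step itself is not shown in this answer. One way it might be done, sketched in Python, is to rewrite digit strings only on - and + lines while leaving the file headers and @@ lines alone. The function name mask_numbers is hypothetical, and the sketch assumes that no changed line has content that makes it look like a "--- " or "+++ " header.

```python
import re

def mask_numbers(diff_lines):
    """Replace each run of digits with '##' on '-' and '+' lines only."""
    out = []
    for line in diff_lines:
        is_header = line.startswith(("--- ", "+++ ", "@@ "))
        if not is_header and line.startswith(("-", "+")):
            # Keep the marker character, mask the rest of the line.
            line = line[0] + re.sub(r"\d+", "##", line[1:])
        out.append(line)
    return out

raw = ["@@ -1,7 +1,7 @@", " Foo", "-Bar 1", "+Bar 10", " Baz"]
print("\n".join(mask_numbers(raw)))
```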

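To be really useful, we have to isolate the header lines (--- and +++) and keep them. For each block of differences starting with @@, we need to capture the adjacent runs of - and + lines, and:

Check that there are the same number of + and - lines.
Check that the content of the - lines is the same as the content of the + lines.
Note that, although the sample data does not show it, you can have multiple runs of - and + lines in a single @@ block.
If there are no differences left in the @@ block, the whole block can be discarded.
If there are differences, then we need to output the header lines, if they have not already been output.
If there are differences, output the whole block of differences.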

Rinse and repeat.
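The steps above can be sketched compactly in Python. This is an illustrative outline under stated assumptions, not a port of the script in this answer: it assumes every line starting with "--- " or "+++ " is a file header, and all function names are invented for the example.

```python
def runs_differ(minus, plus):
    """True if a run of '-' lines and its '+' lines differ in count or content."""
    return [line[1:] for line in minus] != [line[1:] for line in plus]

def hunk_has_differences(body):
    """Check every run of -/+ lines in one @@ block; context lines end a run."""
    minus, plus = [], []
    for line in body:
        if line.startswith("-"):
            minus.append(line)
        elif line.startswith("+"):
            plus.append(line)
        else:
            if runs_differ(minus, plus):
                return True
            minus, plus = [], []
    return runs_differ(minus, plus)

def remaining_differences(diff_lines):
    """Return only the @@ blocks that still differ, preceded by their file header."""
    out = []
    header, header_done = [], False
    at_line, body = None, []

    def flush():
        nonlocal header_done
        if at_line is not None and hunk_has_differences(body):
            if not header_done:
                out.extend(header)
                header_done = True
            out.append(at_line)
            out.extend(body)

    for line in diff_lines:
        if line.startswith(("--- ", "+++ ")):
            flush()
            at_line, body = None, []
            if line.startswith("--- "):
                header, header_done = [line], False
            else:
                header.append(line)
        elif line.startswith("@@ "):
            flush()
            at_line, body = line, []
        elif at_line is not None:
            body.append(line)
    flush()
    return out
```

An empty result means no differences are left after post-processing.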

My programming language of choice for this is Perl.

checkdiffs.pl

#!/usr/bin/env perl
use strict;
use warnings;
use constant debug => 0;

my $file1;
my $file2;
my $header = 0;

OUTER:
while (my $line = <>)
{
    chomp $line;
    print "[$line]\n" if debug;
    if ($line =~ m/^--- /)
    {
        $file1 = $line;
        $file2 = <>;
        chomp $file2;
        print "[$file2]\n" if debug;
        if ($file2 !~ m/^\+\+\+ /)
        {
            print STDERR "Unexpected file identification lines\n";
            print STDERR "$file1\n";
            print STDERR "$file2\n";
            next OUTER;
        }
        $header = 0;    # Have not output file header yet

        my @lines;
        my $atline;

        last OUTER unless defined($line = <>);
INNER:
        while ($line =~ m/^@@ /)
        {
            chomp $line;
            print "@[$line]\n" if debug;
            $atline = $line;
            @lines  = ();

            while (defined($line = <>) && $line =~ m/^[- +]/)
            {
                chomp $line;
                print ":[$line]\n" if debug;
                push @lines, $line;
            }
            # Got a complete @@ block of diffs
            post_process($atline, @lines);

            last OUTER if !defined($line);
            next INNER if ($line =~ m/^@@ /);
            print STDERR "Unexpected input line: [$line]\n";
            last OUTER;
        }
    }
}

sub differences
{
    my($pref, $mref) = @_;
    my $pnum = scalar(@$pref);
    my $mnum = scalar(@$mref);
    print "-->> differences\n" if debug;
    return 0 if ($pnum == 0 && $mnum == 0);
    return 1 if ($pnum != $mnum);
    foreach my $i (0..($pnum-1))
    {
        my $pline = substr(${$pref}[$i], 1);
        my $mline = substr(${$mref}[$i], 1);
        return 1 if ($pline ne $mline);
    }
    print "<<-- differences\n" if debug;
    return 0;
}

sub post_process
{
    my($atline, @lines) = @_;

    print "-->> post_process\n" if debug;
    # Work out whether there are any differences left
    my @plines = ();    # +lines
    my @mlines = ();    # -lines
    my $diffs  = 0;
    my $ptype  = ' ';   # Previous line type

    foreach my $line (@lines)
    {
        print "---- $line\n" if debug;
        my ($ctype) = ($line =~ m/^(.)/);
        if ($ctype eq ' ')
        {
            if (($ptype eq '-' || $ptype eq '+') && differences(\@plines, \@mlines))
            {
                $diffs = 1;
                last;
            }
            @plines = ();
            @mlines = ();
        }
        elsif ($ctype eq '-')
        {
            push @mlines, $line;
        }
        elsif ($ctype eq '+')
        {
            push @plines, $line;
        }
        else
        {
            print STDERR "Unexpected input line format: $line\n";
            exit 1;
        }
        $ptype = $ctype;
    }

    $diffs = 1 if differences(\@plines, \@mlines);

    if ($diffs != 0)
    {
        # Print the block of differences, preceded by file header if necessary
        if ($header == 0)
        {
            print "$file1\n";
            print "$file2\n";
            $header = 1;
        }
        print "$atline\n";
        foreach my $line (@lines)
        {
            print "$line\n";
        }
    }

    print "<<-- post_process\n" if debug;
    return;
}

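Tested with the data file and with three variants of it (data.0, data.1 and data.2). The first run prints nothing because no differences remain: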

$ perl checkdiffs.pl data
$ perl checkdiffs.pl data.0
--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar #0
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
$ perl checkdiffs.pl data.1
--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo #0
-Bar ##
+Foo ##
+Bar ##

 Etc.
$ perl checkdiffs.pl data.2
--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar #0
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo ##
-Bar #0
+Foo ##
+Bar ##

 Etc.
$

Does this do what you want?

Answer 2

I think this might work (unless you have duplicate pairs):

   sed 's/^[-+]//' filename | perl -ne 'print unless $seen{$_}++'

This replaces the leading +/- with the empty string, and then keeps only the first occurrence of each line.
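To see what that pipeline does and where it falls short, here is an equivalent sketch in Python (strip_and_dedupe is a made-up name). It shows that matching -/+ lines collapse to one copy, while a genuinely changed pair survives as two lines rather than being flagged as a difference.

```python
def strip_and_dedupe(lines):
    """Drop one leading '-' or '+' from each line, then keep first occurrences only."""
    seen = set()
    out = []
    for line in lines:
        stripped = line[1:] if line.startswith(("-", "+")) else line
        if stripped not in seen:
            seen.add(stripped)
            out.append(stripped)
    return out

print(strip_and_dedupe(["-Foo", "-Bar", "+Foo", "+Bar"]))  # ['Foo', 'Bar']
print(strip_and_dedupe(["-Bar 1", "+Bar 2"]))              # ['Bar 1', 'Bar 2']
```

Note that the result no longer says which lines were added or removed, and that any line legitimately repeated in the input is also dropped.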

Answer 3

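You can use the s modifier and a positive lookahead:

With the s modifier, the dot also matches newlines.
With a positive lookahead, you can find a later occurrence of the match without making it part of the match (skipping over everything in between).

Here is a sample match on regexpal.

This is a C# regex sample that should be close to what you need: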

using System;
using System.Text.RegularExpressions;

var sourceString = @"-Foo
    +Foo
    la
    -Bar
    +Foo
    la
    -Ko
    +Bar
    la
    +Ko
    -Ena
    asdsda
    -Dva
    +Ena
    +Dva
    ";
Regex ItemRegex = new Regex(@"(?s)\-(.*?)\n(?=(.*?)(\+\1))", RegexOptions.Compiled);
foreach (Match ItemMatch in ItemRegex.Matches(sourceString))
{
    Console.WriteLine(ItemMatch);
}
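For comparison, a rough Python port of the same pattern follows, as a sketch rather than a drop-in replacement. The flag re.DOTALL plays the role of the (?s) modifier, and paired_removals is a hypothetical helper name.

```python
import re

# Port of the C# pattern: match a '-' line, then look ahead (across lines)
# for '+' followed by the same text, without consuming it.
PAIRED = re.compile(r"-(.*?)\n(?=(.*?)(\+\1))", re.DOTALL)

def paired_removals(diff_text):
    """Return the text of each '-' line for which '+<same text>' occurs later."""
    return [m.group(1) for m in PAIRED.finditer(diff_text)]

print(paired_removals("-Foo\n-Bar\n+Foo\n+Bar\n"))  # ['Foo', 'Bar']
print(paired_removals("-Foo\n-Baz\n+Foo\n"))        # ['Foo']
```

Two limitations are worth keeping in mind. The pattern is not anchored, so a - in the middle of a line can start a match, and "+Foo" in the lookahead also matches a longer line such as "+Foobar". It also only finds the - side of each pair; removing the matching + lines needs a second pass.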

Regular expression to match equal diff lines