Regex to match equal diff lines

Time: 2013-03-31 01:03:07

Tags: regex diff

I have a diff that I post-process, and I want to squash the lines that are equal. Here is an example:

Foo
-Bar
+Bar
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore.

This is pretty simple with
-(.*)\n\+\1\n
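
As a minimal sketch of that single-line case, here is the same pattern applied with Python's `re` module. The `^` anchor and the `re.MULTILINE` flag are additions not present in the pattern above, assumed here so that the `-` is only matched as a diff marker at the start of a line; the function name `squash_single_pairs` is hypothetical.

```python
import re

# A removed line immediately followed by an identical added line.
# re.MULTILINE makes '^' match at the start of every line.
PAIR = re.compile(r"^-(.*)\n\+\1\n", re.MULTILINE)

def squash_single_pairs(diff_text: str) -> str:
    """Delete every '-X' line that is directly followed by '+X'."""
    return PAIR.sub("", diff_text)

sample = " Foo\n-Bar\n+Bar\n Baz\n"
print(squash_single_pairs(sample), end="")
```

On this sample only the two context lines remain. A run such as `-Foo`, `-Bar`, `+Foo`, `+Bar` is returned unchanged, because no `-` line is directly followed by its own `+` counterpart.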

The problems start when I have multi-line matches like:

-Foo
-Bar
+Foo
+Bar

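Any ideas? Or should I not be doing this with a RegEx and write a simple parser instead? Or does one already exist?

Some backstory, in case there is a better solution. I am diffing two files to see whether they are the same. Sadly, the output is almost the same but needs some post-processing, e.g.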

-on line %d
+on line 8

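So I go through the diff and convert known strings to other known strings, and then I try to check whether the diff is empty or still contains differences.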
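
One alternative to repairing the diff afterwards, assuming both inputs are available as lists of lines, is to normalize the known strings first and only then take the diff. This is a different technique from the one described above, shown only as a sketch; the rule table `RULES` and the token `on line N` are hypothetical, and `difflib.unified_diff` from the Python standard library stands in for the external diff tool.

```python
import difflib
import re

# Hypothetical normalization rules: each pattern is rewritten to a fixed token
# before comparison, so known-harmless variations never reach the diff.
RULES = [
    (re.compile(r"on line (?:%d|\d+)"), "on line N"),
]

def normalize(line: str) -> str:
    """Apply every rule in RULES to one line."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line

def remaining_diff(a_lines, b_lines):
    """Return the unified diff of the normalized inputs (empty if they agree)."""
    a = [normalize(line) for line in a_lines]
    b = [normalize(line) for line in b_lines]
    return list(difflib.unified_diff(a, b, lineterm=""))
```

With this ordering the question "is the diff empty?" becomes a plain emptiness check on the returned list, and no equal `-`/`+` pairs are produced in the first place.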

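3 Answers:

Answer 0: (score: 0)

I have done some simple analysis of diff output before, so I had a Perl script that gave me a basis to start from. Consider the following two data files, file.1 and file.2.

file.1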

Data

Foo
Bar 1
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n

The problems start when I have multi-line matches like:

Foo 2
Bar 2

Etc.

file.2

Data

Foo
Bar 10
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n

The problems start when I have multi-line matches like:

Foo 20
Bar 20

Etc.

Raw diff output

The raw unified diff output is:

--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar 1
+Bar 10
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo 2
-Bar 2
+Foo 20
+Bar 20

 Etc.

Post-processed output

Now, after post-processing, all the digit strings have been replaced by ##, so the post-processed file looks like:

--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar ##
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo ##
-Bar ##
+Foo ##
+Bar ##

 Etc.

This is the input to the program that analyzes whether any differences remain.
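
How the ## substitution was performed is not shown. One way to produce that kind of input, assuming the diff is held as a list of lines without trailing newlines, is sketched below; the function name `mask_numbers` is hypothetical. Header lines are left alone, which matches the sample above where the timestamps and @@ ranges keep their digits.

```python
import re

DIGITS = re.compile(r"\d+")

def mask_numbers(diff_lines):
    """Replace each run of digits with '##' in hunk body lines only."""
    masked = []
    for line in diff_lines:
        # Assumption: lines starting with '--- ', '+++ ' or '@@ ' are headers.
        # A removed line whose own text starts with '-- ' would be misread.
        if line.startswith(("--- ", "+++ ", "@@ ")):
            masked.append(line)
        else:
            masked.append(DIGITS.sub("##", line))
    return masked
```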

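To be really useful, we have to isolate the header lines (--- and +++) and preserve them. For each block of differences starting with @@, we need to capture the adjacent runs of - and + lines, and:

  1. Check that there are the same number of + and - lines.
  2. Check that the content of the - lines is the same as the content of the + lines.
  3. Note that, although the sample data does not show it, a single @@ block can contain more than one run of -/+ lines.
  4. If no differences remain in the @@ block, the whole block can be discarded.
  5. If differences remain, output the header lines, unless they have already been output.
  6. If differences remain, output the whole diff block.
  7. Rinse and repeat.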
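
The steps above can be sketched compactly in Python. This is an illustration of the procedure, not the answer's own program; the names `hunk_still_differs` and `surviving_lines` are hypothetical, the input is assumed to be a list of lines without trailing newlines, and the diff is assumed to cover a single pair of files.

```python
def hunk_still_differs(hunk_lines):
    """Return True if any run of -/+ lines in one @@ block is a real change."""
    minus, plus = [], []

    def run_differs():
        # Compare contents without the leading marker; unequal counts also differ.
        return [l[1:] for l in minus] != [l[1:] for l in plus]

    for line in hunk_lines:
        marker = line[:1]
        if marker == "-":
            minus.append(line)
        elif marker == "+":
            plus.append(line)
        else:
            # A context line (or blank line) ends the current run of changes.
            if run_differs():
                return True
            minus.clear()
            plus.clear()
    return run_differs()

def surviving_lines(diff_lines):
    """Keep the file header and only those @@ blocks that still differ."""
    out, header, hunk = [], [], None

    def flush():
        if hunk and hunk_still_differs(hunk[1:]):
            out.extend(header)   # header is emitted once, before the first kept block
            header.clear()
            out.extend(hunk)

    for line in diff_lines:
        if line.startswith("@@ "):
            flush()
            hunk = [line]
        elif hunk is None and line.startswith(("--- ", "+++ ")):
            header.append(line)
        elif hunk is not None:
            hunk.append(line)
    flush()
    return out
```

An empty result from `surviving_lines` means that every `-` run was matched by an identical `+` run, so the two files can be treated as equal after post-processing.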

My programming language of choice is Perl.

checkdiffs.pl

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use constant debug => 0;
    
    my $file1;
    my $file2;
    my $header = 0;
    
    OUTER:
    while (my $line = <>)
    {
        chomp $line;
        print "[$line]\n" if debug;
        if ($line =~ m/^--- /)
        {
            $file1 = $line;
            $file2 = <>;
            chomp $file2;
            print "[$file2]\n" if debug;
            if ($file2 !~ m/^\+\+\+ /)
            {
                print STDERR "Unexpected file identification lines\n";
                print STDERR "$file1\n";
                print STDERR "$file2\n";
                next OUTER;
            }
            $header = 0;    # Have not output file header yet
    
            my @lines;
            my $atline;
    
            last OUTER unless defined($line = <>);
    INNER:
            while ($line =~ m/^@@ /)
            {
                chomp $line;
                print "@[$line]\n" if debug;
                $atline = $line;
                @lines  = ();
    
                while (defined($line = <>) && $line =~ m/^[- +]/)
                {
                    chomp $line;
                    print ":[$line]\n" if debug;
                    push @lines, $line;
                }
                # Got a complete @@ block of diffs
                post_process($atline, @lines);
    
                last OUTER if !defined($line);
                next INNER if ($line =~ m/^@@ /);
                print STDERR "Unexpected input line: [$line]\n";
                last OUTER;
            }
        }
    }
    
    sub differences
    {
        my($pref, $mref) = @_;
        my $pnum = scalar(@$pref);
        my $mnum = scalar(@$mref);
        print "-->> differences\n" if debug;
        return 0 if ($pnum == 0 && $mnum == 0);
        return 1 if ($pnum != $mnum);
        foreach my $i (0..($pnum-1))
        {
            my $pline = substr(${$pref}[$i], 1);
            my $mline = substr(${$mref}[$i], 1);
            return 1 if ($pline ne $mline);
        }
        print "<<-- differences\n" if debug;
        return 0;
    }
    
    sub post_process
    {
        my($atline, @lines) = @_;
    
        print "-->> post_process\n" if debug;
        # Work out whether there are any differences left
        my @plines = ();    # +lines
        my @mlines = ();    # -lines
        my $diffs  = 0;
        my $ptype  = ' ';   # Previous line type
    
        foreach my $line (@lines)
        {
            print "---- $line\n" if debug;
            my ($ctype) = ($line =~ m/^(.)/);
            if ($ctype eq ' ')
            {
                if (($ptype eq '-' || $ptype eq '+') && differences(\@plines, \@mlines))
                {
                    $diffs = 1;
                    last;
                }
                @plines = ();
                @mlines = ();
            }
            elsif ($ctype eq '-')
            {
                push @mlines, $line;
            }
            elsif ($ctype eq '+')
            {
                push @plines, $line;
            }
            else
            {
                print STDERR "Unexpected input line format: $line\n";
                exit 1;
            }
            $ptype = $ctype;
        }
    
        $diffs = 1 if differences(\@plines, \@mlines);
    
        if ($diffs != 0)
        {
            # Print the block of differences, preceded by file header if necessary
            if ($header == 0)
            {
                print "$file1\n";
                print "$file2\n";
                $header = 1;
            }
            print "$atline\n";
            foreach my $line (@lines)
            {
                print "$line\n";
            }
        }
    
        print "<<-- post_process\n" if debug;
        return;
    }
    

Tested with the data file and with three variants of it:

    $ perl checkdiffs.pl data
    $ perl checkdiffs.pl data.0
    --- file.1  2013-03-30 18:58:35.000000000 -0700
    +++ file.2  2013-03-30 18:58:48.000000000 -0700
    @@ -1,7 +1,7 @@
     Data
    
     Foo
    -Bar #0
    +Bar ##
     Baz
    
     I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
    $ perl checkdiffs.pl data.1
    --- file.1  2013-03-30 18:58:35.000000000 -0700
    +++ file.2  2013-03-30 18:58:48.000000000 -0700
    @@ -10,7 +10,7 @@
    
     The problems start when I have multi-line matches like:
    
    -Foo #0
    -Bar ##
    +Foo ##
    +Bar ##
    
     Etc.
    $ perl checkdiffs.pl data.2
    --- file.1  2013-03-30 18:58:35.000000000 -0700
    +++ file.2  2013-03-30 18:58:48.000000000 -0700
    @@ -1,7 +1,7 @@
     Data
    
     Foo
    -Bar #0
    +Bar ##
     Baz
    
     I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
    @@ -10,7 +10,7 @@
    
     The problems start when I have multi-line matches like:
    
    -Foo ##
    -Bar #0
    +Foo ##
    +Bar ##
    
     Etc.
    $ 
    

Does this meet your requirements?

Answer 1: (score: 0)

I think this might work (unless you have duplicate pairs):

   sed 's/^[-+]//' filename | perl -ne 'print unless $seen{$_}++'

It replaces the leading +/- with an empty string, then keeps only the unique lines.
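
To make the behaviour of that pipeline easier to inspect, here is a rough Python equivalent operating on a list of lines without trailing newlines; the function name `strip_and_dedupe` is hypothetical.

```python
def strip_and_dedupe(lines):
    """Drop a leading '-' or '+' from each line, then keep first occurrences only."""
    seen = set()
    result = []
    for line in lines:
        if line[:1] in ("-", "+"):
            line = line[1:]          # same effect as sed 's/^[-+]//'
        if line not in seen:         # same effect as perl's %seen filter
            seen.add(line)
            result.append(line)
    return result
```

For the input `-Foo`, `-Bar`, `+Foo`, `+Bar` this yields `Foo` and `Bar`: each matched pair collapses to one unmarked line rather than disappearing, so deciding whether any real difference remains still needs a separate check.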

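Answer 2: (score: 0)

You can use the s modifier and a positive lookahead.

  • With the s modifier, the dot also matches newlines.
  • With a positive lookahead, you can check that a match occurs later in the text without making it part of the match (skipping everything in between).

Here is a sample match on regexpal.

Here is a C# regex sample that should be close to what you need: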

var sourceString = @"-Foo
    +Foo
    la
    -Bar
    +Foo
    la
    -Ko
    +Bar
    la
    +Ko
    -Ena
    asdsda
    -Dva
    +Ena
    +Dva
    ";
Regex ItemRegex = new Regex(@"(?s)\-(.*?)\n(?=(.*?)(\+\1))", RegexOptions.Compiled);
foreach (Match ItemMatch in ItemRegex.Matches(sourceString))
{
    Console.WriteLine(ItemMatch);
}
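
For comparison, the same pattern can be tried with Python's `re` module, which also accepts the inline `(?s)` flag and a backreference inside a lookahead. This is only a sketch for experimenting with the pattern; the function name `removed_with_later_addition` is hypothetical.

```python
import re

# Same pattern as in the C# sample: a '-' line whose text reappears later
# after a '+'. (?s) lets '.' match newlines so the lookahead can skip lines.
PATTERN = re.compile(r"(?s)-(.*?)\n(?=(.*?)(\+\1))")

def removed_with_later_addition(text: str):
    """Return the text of each '-' line that has a matching '+' later on."""
    return [m.group(1) for m in PATTERN.finditer(text)]

print(removed_with_later_addition("-Foo\n-Bar\n+Foo\n+Bar\n"))
```

The pattern is not anchored to line starts and the lookahead does not require the `+` text to end at a newline, so `-Foo` would also be paired with a later `+Foobar`; tightening that is left open here.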