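Regular expression to remove duplicate paths from a delimited string

Date: 2018-02-24 09:45:59

Tags: c# regex perl

I am trying to use a regular expression to remove duplicate file paths from a semicolon-delimited string. The order of the resulting paths does not matter.

Sample input: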

C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;

Desired output:

C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;

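I have the following regular expression, but it is very slow when the input string gets long. On top of that, it runs against thousands of lines, so the total time is terrible.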

\b([^;]+)(?=.*;\1;);

Any tips on how to improve the performance would be much appreciated!

7 answers:

Answer 0: (score: 8)

Or a C# version:

using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        var paths = @"C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;";

        // The trailing ';' in the input yields one empty entry,
        // which is why the joined output also ends with ';'.
        var cleaned = string.Join(";", new HashSet<string>(paths.Split(';')));

        Console.WriteLine(cleaned);
    }
}

Output:

C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path3;

This splits the input at ;, puts the pieces into a HashSet<string>(..) to get rid of the duplicates, and joins them again with ;.

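Caveat: if your paths contain ; as part of a directory name, this will break. You would have to get more creative for that case, but the same applies to any regex you use.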
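One way to "get more creative" is sketched below in JavaScript. It assumes a convention that the question does not state: any entry containing ; is wrapped in double quotes. The helper name splitQuotedPaths is hypothetical, and the sketch does not handle unbalanced quotes.

```javascript
// Assumption: entries that contain ';' are wrapped in double quotes.
// splitQuotedPaths is a hypothetical helper, not taken from the answer above.
function splitQuotedPaths(input) {
  // Each match is a run of quoted segments and/or characters that are
  // neither ';' nor '"', so a ';' inside quotes does not end an entry.
  return input.match(/(?:"[^"]*"|[^;"])+/g) || [];
}

const input = 'C:\\a;"C:\\b;c";C:\\a;"C:\\b;c";';

// A Set keeps the first occurrence of each entry, in insertion order.
const cleaned = [...new Set(splitQuotedPaths(input))].join(";");

console.log(cleaned); // C:\a;"C:\b;c"
```

Because empty pieces never match the pattern, the trailing ; in the input does not produce an empty entry here.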

Answer 1: (score: 7)

The typical way to remove duplicates in Perl is to use a hash. See also perlfaq4: How can I remove duplicate elements from a list or array?

my $str = q{C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3};
my %seen;
my $out = join ';', sort grep { !$seen{$_}++ } split /;/, $str;
print $out, "\n";
__END__
# Output:
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6

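I threw a sort in there, but you can remove it if you don't need it.

Although you have not specified whether the implementation should be in C# or Perl, the same idea should apply to C# as well. (Update: see Patrick Artner's answer.)

Note that your regex is slow because for every match of \b([^;]+), the engine has to scan the entire rest of the string to satisfy the lookahead .*;\1;, so it essentially behaves like a nested loop.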
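To make the nested-loop comparison concrete, here is a rough JavaScript sketch that counts the work done per entry. It is a simplification: a real regex engine works character by character, so the actual step counts are higher. Both helper names are hypothetical.

```javascript
// Hypothetical helper: counts equality checks made when every entry is
// compared against the entries after it, as the lookahead effectively does.
function countRescanComparisons(parts) {
  let comparisons = 0;
  for (let i = 0; i < parts.length; i++) {
    for (let j = i + 1; j < parts.length; j++) {
      comparisons++;
      if (parts[j] === parts[i]) break; // a later duplicate ends the scan
    }
  }
  return comparisons;
}

// Hypothetical helper: counts lookups made by a single pass with a Set,
// which is the idea behind the hash-based answers.
function countSetLookups(parts) {
  const seen = new Set();
  let lookups = 0;
  for (const part of parts) {
    lookups++;
    if (!seen.has(part)) seen.add(part);
  }
  return lookups;
}

// Worst case: 1000 distinct entries, so no scan can stop early.
const distinct = Array.from({ length: 1000 }, (_, i) => "C:\\path" + i);

console.log(countRescanComparisons(distinct)); // 499500
console.log(countSetLookups(distinct)); // 1000
```

The first count grows with the square of the number of entries, the second only linearly, which is why the hash-based approaches scale so much better on long inputs.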

Answer 2: (score: 1)

Try the following code.

var inputStr = "C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path5;C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path6;C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path3;C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path3"

var urlArr = inputStr.split(";");
var uniqueUrlList = [];

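// For each entry, look for a case-insensitive match among the entries kept so far.
// find() rescans the kept list every time, so this is quadratic in the worst case.
urlArr.forEach(function (elem) {
    let foundElem = uniqueUrlList.find((x) => {
        return x.toUpperCase() === elem.toUpperCase();
    });

    // Keep the entry only if no earlier entry matched it.
    if (foundElem === undefined) {
        uniqueUrlList.push(elem);
    }
});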

console.log(uniqueUrlList);

Answer 3: (score: 1)

Perl, the most optimized single-line regex version:

(?<![^;])([^;]++;)(?=(?>[^;]*;)*?\1)

On your own input string, your own regex needs about 114,000 steps to find all matches, while this one needs only 567 steps.

More than 40,000 matches were found in ~4 seconds.

Live demo

Regex breakdown:

(?<!    # A Negative lookbehind
    [^;]    # Should be anything other than `;`
)   # End of lookbehind
(   # Capturing group #1
    [^;]++; # Match anything up to first `;`
)   # End of CG #1
(?= # A Positive lookahead
    (?>[^;]*;)*?    # Skip over next path, don't backtrack
    \1  # Until an occurrence
)   # End of lookahead

Answer 4: (score: 1)

In Perl:

#!/usr/bin/env perl

# always use these two
use strict;
use warnings;

my $paths = 'C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';

print "$paths\n";
{
    my %temporary_hash = map { $_ => 1 } split( q{;}, $paths );
    $paths = join( q{;}, keys %temporary_hash );
}
print "$paths\n";

See perldoc -q duplicate.

Answer 5: (score: 0)

In Perl, it takes one line with the List::Util library, which is a core module and highly optimized:

my $newpaths = join ';', uniq split /;/, $paths;

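How does it work? split creates a list of paths by splitting the string around the ; character; uniq ensures there are no duplicates; join builds a string of paths again, separated by ;.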
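For readers who do not use Perl, the same three steps can be sketched in JavaScript, with a Set standing in for uniq. The short sample paths are made up for illustration.

```javascript
// Made-up sample input, not the data from the question.
const paths = "C:\\a;C:\\b;C:\\a;C:\\c;C:\\b";

// Step 1: split the string around the ';' character.
const parts = paths.split(";");

// Step 2: drop duplicates. A Set keeps the first occurrence of each value
// in insertion order.
const unique = [...new Set(parts)];

// Step 3: join the remaining paths with ';' again.
const newPaths = unique.join(";");

console.log(newPaths); // C:\a;C:\b;C:\c
```

Unlike Perl's split, JavaScript's split keeps a trailing empty string when the input ends with ;, so such input would need the empty entry filtered out first.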

If the case of the paths does not matter, then:

my $newpaths = join ';', uniq split /;/, lc $paths;

A complete program might be:

use strict;
use warnings;

use List::Util qw( uniq );

my $paths = 'C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';

my $newpaths = join ';', uniq split /;/, $paths;

print $newpaths, "\n";

To make things interesting, let's benchmark this against the proposed solution that uses a temporary hash. Here is the timing program:

use strict;
use warnings;

use List::Util qw( uniq );
use Time::HiRes qw( time );

my @p;
for( my $i = 0; $i < 1000000; $i++ ) {
  push @p, 'C:\This\is\a\random\path' . int(rand(250000));
}
my $paths = join ';', @p;

my $t = time();
my $newpaths = join ';', uniq split /;/, $paths;
$t = time() - $t;
print 'Time with uniq: ', $t, "\n";

$t = time();
my %temp = map { $_ => 1 } split /;/, $paths;
$newpaths = join ';', keys %temp;
$t = time() - $t;
print 'Time with temporaty hash: ', $t, "\n";

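It generates one million random paths with a duplication ratio of 5:1 (each path repeated about 5 times). Note that with the 250000 distinct values used in the listing above, the average works out closer to 4 repeats per path. The timings on the server I tested this on are: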

Time with uniq: 0.849196910858154
Time with temporaty hash: 1.29486703872681

This makes uniq from the library faster than the temporary hash. With 100:1 duplication:

Time with uniq: 0.526581048965454
Time with temporaty hash: 0.823433876037598

With 10000:1 duplication:

Time with uniq: 0.423808097839355
Time with temporaty hash: 0.736939907073975

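Both algorithms do less work the more duplicates there are, and uniq's advantage grows as the number of duplicates increases.

Feel free to play with the numbers passed to the random generator.

Answer 6: (score: -2)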

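Since these are Windows paths, which are case-insensitive, you probably want to remove elements that are identical apart from case.

(The next step would be to push each element through File::Spec::canonpath to find paths that are the same but expressed differently, and then perhaps to consider links, but this covers only the case-insensitive situation.)

I don't know whether your request to "use a regular expression" is a hard requirement, but as you have discovered, it is a very inefficient approach.

I recommend a simple split on semicolons, plus List::UtilsBy to do the case-independent uniqueness.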

use strict;
use warnings 'all';
use feature 'say';

use List::UtilsBy 'uniq_by';

my $p = 'C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';

my $newp = join "", map { "$_;" } uniq_by { lc } split /;/, $p;

say $newp;

Output

C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path3;
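The same case-independent idea can be sketched in JavaScript with a Map keyed by the lowercased path. The helper name uniqByLowerCase and the sample paths are made up for illustration; like uniq_by { lc }, it keeps the first spelling it sees.

```javascript
// Hypothetical helper, not part of List::UtilsBy: keeps the first item
// seen for each lowercased key and preserves its original spelling.
function uniqByLowerCase(items) {
  const firstSeen = new Map();
  for (const item of items) {
    const key = item.toLowerCase();
    if (!firstSeen.has(key)) firstSeen.set(key, item);
  }
  return [...firstSeen.values()];
}

// Made-up sample input with entries that differ only in case.
const p = "C:\\Temp\\A;c:\\temp\\a;C:\\Temp\\B;";

// JavaScript's split keeps the trailing empty string, so filter it out
// before re-appending ';' to every entry.
const newP = uniqByLowerCase(p.split(";").filter((s) => s !== ""))
  .map((s) => s + ";")
  .join("");

console.log(newP); // C:\Temp\A;C:\Temp\B;
```

This differs from lowercasing the whole string before splitting, which would also remove the duplicates but would change the case of every path in the output.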