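I am trying to use a regular expression to remove duplicate file paths from a semicolon-separated string. The order of the final paths does not matter.
Example input: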
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;
Desired output:
C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;
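I have the following regex, but it is very slow when the input string gets long. On top of that, it runs over thousands of lines, so the total time is terrible.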
\b([^;]+)(?=.*;\1;);
Any tips on how to improve the performance would be greatly appreciated!
Answer 0 (score: 8)
Or a C# version:
using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        var paths = @"C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;";
        // Split on ';', let the HashSet drop the duplicates, then join again.
        var cleaned = string.Join(";", new HashSet<string>(paths.Split(';')));
        Console.WriteLine(cleaned);
    }
}
Output:
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path3;
This splits the input at ;, puts the pieces into a HashSet<string>(..) to get rid of duplicates, and joins them again with ;.
Caveat: if your paths contain ; as part of a directory name, this will break. You would have to get more creative for that case, but the same applies to any regex you use.
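For comparison, here is a minimal JavaScript sketch of the same split / set / join idea. The name dedupePaths and the shortened sample paths are made up for illustration and are not part of the answer above. It assumes empty pieces (such as the one after a trailing ;) should be dropped and that every path in the result should end with ;.

```javascript
// Hypothetical helper (not from the answer above): de-duplicate a
// semicolon-separated path list, keeping the first occurrence of each path.
function dedupePaths(paths) {
  const unique = new Set(
    // Drop empty pieces, e.g. the one produced by a trailing ";".
    paths.split(";").filter((p) => p.length > 0)
  );
  // A Set iterates in insertion order; re-append ";" to every path.
  return [...unique].map((p) => p + ";").join("");
}

// Shortened sample paths, assumed for illustration only.
const input = "C:\\TESTING\\path1;C:\\TESTING\\path5;C:\\TESTING\\path1;C:\\TESTING\\path3;";
console.log(dedupePaths(input));
```

Unlike the regex in the question, this keeps the first copy of each path rather than the last, which should be fine since the order does not matter.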
Answer 1 (score: 7)
The typical way to remove duplicates in Perl is to use a hash. See also perlfaq4: How can I remove duplicate elements from a list or array?
my $str = q{C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3};
my %seen;
my $out = join ';', sort grep { !$seen{$_}++ } split /;/, $str;
print $out, "\n";
__END__
# Output:
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6
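I threw a sort in there, but you can remove it if you don't need it.
Although you have not specified whether the implementation should be in C# or Perl, the same idea applies in C# as well. (Update: see Patrick Artner's answer.)
Note that the regex is slow because, for every match of \b([^;]+), the engine has to scan the entire rest of the string to check the lookahead .*;\1;, so it essentially behaves like a nested loop.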
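To make the nested-loop remark concrete, here is a rough JavaScript sketch that counts work per item. It is a simplification: a real regex engine counts steps per character, not per path. The function names are made up for illustration.

```javascript
// Mimics what the regex does: for each item, scan everything after it for an
// identical item, and drop the current item if a later copy exists.
function dedupeByScanning(items) {
  let comparisons = 0;
  const kept = [];
  for (let i = 0; i < items.length; i++) {
    let hasLaterCopy = false;
    for (let j = i + 1; j < items.length; j++) {
      comparisons++;
      if (items[j] === items[i]) {
        hasLaterCopy = true;
        break;
      }
    }
    if (!hasLaterCopy) kept.push(items[i]);
  }
  return { kept, comparisons };
}

// One pass with a Set: each item costs a single lookup.
function dedupeWithSet(items) {
  const seen = new Set();
  const kept = [];
  let lookups = 0;
  for (const item of items) {
    lookups++;
    if (!seen.has(item)) {
      seen.add(item);
      kept.push(item);
    }
  }
  return { kept, lookups };
}

// Worst case for the scanning approach: no duplicates at all.
const items = Array.from({ length: 1000 }, (_, i) => "path" + i);
console.log(dedupeByScanning(items).comparisons); // 499500 = 1000 * 999 / 2
console.log(dedupeWithSet(items).lookups);        // 1000
```

The scanning version grows quadratically with the number of paths, while the set-based version grows linearly, which is the same gap you see between the regex and the hash approach.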
Answer 2 (score: 1)
Please try the following code.
var inputStr = "C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path5;C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path6;C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path3;C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path3";
var urlArr = inputStr.split(";");
var uniqueUrlList = [];
urlArr.forEach(function (elem) {
    // Case-insensitive search among the paths kept so far.
    // (Comparing indexes of the two arrays is not needed: anything already
    // in uniqueUrlList came from an earlier position in urlArr.)
    let foundElem = uniqueUrlList.find((x) => {
        return x.toUpperCase() === elem.toUpperCase();
    });
    if (foundElem === undefined) {
        uniqueUrlList.push(elem);
    }
});
console.log(uniqueUrlList);
Answer 3 (score: 1)
Perl, the most optimized single-line regex version:
(?<![^;])([^;]++;)(?=(?>[^;]*;)*?\1)
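On your own input string, your own regex takes about 114,000 steps to find all matches, whereas this one takes 567 steps.
It finds more than 40,000 matches in about 4 seconds.
Regex breakdown:
(?<!              # Negative lookbehind: the match must not be preceded by
    [^;]          #   a character other than `;` (so it starts at the beginning or right after a `;`)
)                 # End of lookbehind
(                 # Capturing group #1
    [^;]++;       #   Match one path up to and including the first `;` (possessive)
)                 # End of CG #1
(?=               # Positive lookahead
    (?>[^;]*;)*?  #   Lazily skip over following paths, don't backtrack into them
    \1            #   Until an occurrence of the captured path
)                 # End of lookahead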
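If you want to try the same idea outside Perl, here is a JavaScript sketch of the pattern. JavaScript has no possessive quantifiers (++) or atomic groups (?>...), so this port uses plain equivalents. That should not change what matches here, because [^;] can never consume a ;, but it does lose the backtracking protection. It also assumes an engine with lookbehind support (ES2018 or later). The function name and the shortened sample paths are made up for illustration.

```javascript
// Assumed JavaScript port of the Perl pattern above:
//   (?<![^;])        start of the string or right after a ";"
//   ([^;]+;)         one whole path including its ";"
//   (?=(?:[^;]*;)*?\1)  some later path is identical
const duplicateWithLaterCopy = /(?<![^;])([^;]+;)(?=(?:[^;]*;)*?\1)/g;

// Removes every path that occurs again later, so the last copy survives.
function removeEarlierDuplicates(paths) {
  return paths.replace(duplicateWithLaterCopy, "");
}

// Shortened stand-ins for the paths in the question.
const sample = "p1;p5;p1;p6;p1;p3;p1;p3;";
console.log(removeEarlierDuplicates(sample)); // p5;p6;p1;p3;
```

As with the original pattern, every path (including the last one) has to end with ; for this to work.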
Answer 4 (score: 1)
In Perl:
#!/usr/bin/env perl
# always use these two
use strict;
use warnings;
my $paths = 'C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';
print "$paths\n";
{
    my %temporary_hash = map { $_ => 1 } split( q{;}, $paths );
    $paths = join( q{;}, keys %temporary_hash );
}
print "$paths\n";
Answer 5 (score: 0)
In Perl it takes one line, using the List::Util library, which is a core module and highly optimized:
my $newpaths = join ';', uniq split /;/, $paths;
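How does it work? split creates a list of paths by splitting around the ; character; uniq makes sure there are no duplicates; join builds a string of paths, again separated by ;.
If the case of the paths does not matter (note that this also lowercases the output), then: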
my $newpaths = join ';', uniq split /;/, lc $paths;
A complete program might be:
use strict;
use warnings;
use List::Util qw( uniq );
my $paths = 'C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';
my $newpaths = join ';', uniq split /;/, $paths;
print $newpaths, "\n";
To make things interesting, let's time it against the proposed solution that uses a temporary hash. Here is the timing program:
use strict;
use warnings;
use List::Util qw( uniq );
use Time::HiRes qw( time );
my @p;
for( my $i = 0; $i < 1000000; $i++ ) {
    push @p, 'C:\This\is\a\random\path' . int(rand(250000));
}
my $paths = join ';', @p;
my $t = time();
my $newpaths = join ';', uniq split /;/, $paths;
$t = time() - $t;
print 'Time with uniq: ', $t, "\n";
$t = time();
my %temp = map { $_ => 1 } split /;/, $paths;
$newpaths = join ';', keys %temp;
$t = time() - $t;
print 'Time with temporaty hash: ', $t, "\n";
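It generates 1 million random paths with a duplication ratio of about 5:1 (each path repeated about 5 times; note that rand(250000) as listed gives about 4 copies of each path on average, and rand(200000) gives 5). The times on the server I tested this on are: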
Time with uniq: 0.849196910858154
Time with temporaty hash: 1.29486703872681
So uniq from the library is faster than the temporary hash. With 100:1 duplication:
Time with uniq: 0.526581048965454
Time with temporaty hash: 0.823433876037598
With 10000:1 duplication:
Time with uniq: 0.423808097839355
Time with temporaty hash: 0.736939907073975
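Both algorithms do less work the more duplicates there are, and uniq does comparatively better as the number of duplicates increases.
Feel free to play with the number passed to the random generator.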
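Answer 6 (score: -2)
Since these are Windows paths, which are case-insensitive, you may want to remove elements that are identical except for case.
(The next step would be to push each element through File::Spec::canonpath to find paths that are the same but expressed differently, and then perhaps to consider links, but this handles only the case-insensitive situation.)
I don't know whether "using a regular expression" is a requirement of yours, but as you have discovered, it is a very inefficient way to do this.
I recommend a simple split on semicolons, and List::UtilsBy to do the case-independent uniqueness.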
use strict;
use warnings 'all';
use feature 'say';
use List::UtilsBy 'uniq_by';
my $p = 'C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';
my $newp = join "", map { "$_;" } uniq_by { lc } split /;/, $p;
say $newp;
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path3;
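The same case-independent idea can be sketched in JavaScript. The uniqBy helper below is hypothetical: it is only modeled on what uniq_by does and is not a built-in. The sample paths are made up to show the case handling. One difference to watch for is that JavaScript's split keeps the empty piece after a trailing ;, so it is filtered out here.

```javascript
// Hypothetical helper modeled on uniq_by: keep the first item seen for each key.
function uniqBy(items, keyOf) {
  const seen = new Set();
  return items.filter((item) => {
    const key = keyOf(item);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Made-up sample: the first two paths differ only in case.
const p = "C:\\TESTING\\Path1;c:\\testing\\path1;C:\\TESTING\\path5;";
const newp = uniqBy(
  p.split(";").filter((s) => s !== ""), // drop the empty trailing piece
  (s) => s.toLowerCase()                // compare case-insensitively
)
  .map((s) => s + ";")
  .join("");
console.log(newp); // C:\TESTING\Path1;C:\TESTING\path5;
```

Because only the comparison key is lowercased, the surviving paths keep their original spelling.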