Question

I need to match the following patterns:

#/a/b/c/d
#/a/b/&1/d
#/a/b/c[&1]/d

The following rules apply:

a) # is the number sign, followed by a path. Pretty much anything can appear in the path segments, but &1 and [] follow specific rules.
b) &1 (or & followed by any number) has to be in a path segment by itself.
c) [&1] has to follow at least one character and has to end the segment. Only [&1] is allowed for now.

I came up with a regular expression for this. It seems to work fine, but my profiler shows it is a bottleneck. Is there a way to improve its performance or restructure it so that it runs faster? I don't need to capture or group anything; I only need to know whether a path is valid.

Answer 1

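In some quick tests, this was the fastest. I added brackets to the negated character classes so that paths with stray brackets in them are rejected. It would be faster without them.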

var pattern = "^#(?:/(?:&\\d+|[^/&[\\]]+\\[&1]|[^/&[\\]]+))+$";
var REc = new Regex(pattern, RegexOptions.Compiled);
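
As a quick sanity check, the pattern above can be run against the three example paths from the question plus a few paths that should be rejected. This is only a sketch: the invalid inputs are hypothetical cases I made up from rules b) and c), not data from the original tests.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern copied from the answer: '#' followed by one or more '/'-prefixed segments.
var pattern = "^#(?:/(?:&\\d+|[^/&[\\]]+\\[&1]|[^/&[\\]]+))+$";
var REc = new Regex(pattern, RegexOptions.Compiled);

// The three example paths from the question.
string[] valid = { "#/a/b/c/d", "#/a/b/&1/d", "#/a/b/c[&1]/d" };

// Hypothetical inputs that break rule b) or c), or are not paths at all.
string[] invalid = { "#/a/b/&1x/d", "#/a/[&1]/d", "#/a/b/c[&1]x/d", "#", "/a/b" };

foreach (var s in valid)
    Console.WriteLine($"{s} -> {REc.IsMatch(s)}");   // expected: True

foreach (var s in invalid)
    Console.WriteLine($"{s} -> {REc.IsMatch(s)}");   // expected: False
```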

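Reordering the alternatives so that the most common segment type comes first may be faster. This was faster for my test data, which consisted mostly of alphanumeric segments: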

var pattern2 = "^#(?:/(?:[^/&[\\]]+|&\\d+|[^&/[\\]]+\\[&1]))+$";

I tested using REc.IsMatch(bs).
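
A rough way to compare the two orderings on your own data is a Stopwatch loop around IsMatch. This is a minimal sketch, not the harness used for the numbers above: the input string bs, the iteration count, and the helper TimeIsMatch are all assumptions made for illustration, and results will depend heavily on what your real paths look like.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// The two patterns from this answer.
var pattern = "^#(?:/(?:&\\d+|[^/&[\\]]+\\[&1]|[^/&[\\]]+))+$";
var pattern2 = "^#(?:/(?:[^/&[\\]]+|&\\d+|[^&/[\\]]+\\[&1]))+$";

// Hypothetical test input and iteration count; substitute your own data.
var bs = "#/alpha/beta/&1/gamma[&1]/delta";
const int iterations = 200_000;

// Hypothetical helper: times n calls to IsMatch after one warm-up call.
long TimeIsMatch(Regex re, string input, int n)
{
    re.IsMatch(input); // warm-up so one-time compilation cost is not measured
    var sw = Stopwatch.StartNew();
    for (var i = 0; i < n; i++)
        re.IsMatch(input);
    sw.Stop();
    return sw.ElapsedMilliseconds;
}

var first = new Regex(pattern, RegexOptions.Compiled);
var second = new Regex(pattern2, RegexOptions.Compiled);

Console.WriteLine($"pattern:  {TimeIsMatch(first, bs, iterations)} ms");
Console.WriteLine($"pattern2: {TimeIsMatch(second, bs, iterations)} ms");
```

Run it several times and with a representative mix of paths before drawing conclusions; a single short input like the one here says little about a real workload.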

If brackets inside segments are acceptable, this is faster still:

var pattern = "^#(?:/(?:&\\d+|[^/]+\\[&1]|[^/&]+))+$";

Answer 2

One thing you can try is to tell the regex engine not to capture anything:

^#(?:(?:/[^/&]+)|(?:/&\\d+)|(?:/[^/]+\\[&1\\]))+

Marking each group as (?: ... ) makes it non-capturing, so the engine does not record what that group matched.
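
To see the difference between capturing and non-capturing groups, compare how many groups a match reports. This is a small sketch using a simplified pattern (plain segments only) and a made-up input, not the full pattern from the question. It also shows RegexOptions.ExplicitCapture, a .NET option that treats every unnamed group as non-capturing without rewriting the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Simplified patterns for illustration: plain segments only.
var capturing = new Regex("^#(/[^/&]+)+$");
var nonCapturing = new Regex("^#(?:/[^/&]+)+$");

// Same pattern as 'capturing', but unnamed groups are not captured.
var explicitOnly = new Regex("^#(/[^/&]+)+$", RegexOptions.ExplicitCapture);

var input = "#/a/b/c"; // hypothetical input

var m1 = capturing.Match(input);
var m2 = nonCapturing.Match(input);
var m3 = explicitOnly.Match(input);

// Group 0 is always the whole match; capturing groups are counted on top of it.
Console.WriteLine(m1.Groups.Count);             // 2: whole match plus one capturing group
Console.WriteLine(m1.Groups[1].Captures.Count); // 3: one capture stored per segment
Console.WriteLine(m2.Groups.Count);             // 1: whole match only
Console.WriteLine(m3.Groups.Count);             // 1: whole match only
```

Since only IsMatch is needed here, the stored captures in the first variant are pure overhead.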

Can this regex be optimized?
