Question

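I have run into a performance problem when using Regex. The method I am using works as expected, but processing large text files takes a long time.

I need to extract the words from each line of the file: "tjdj47 *** ss__s * 47 djj ___ s_sd4 4"

It should return a list of words (any sequence of letters, or of letters and digits, that is more than 1 character long):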

tjdj47
ss
47
djj
sd4

I use this Regex pattern:

 pattern = new Regex(
            @"([A-Za-z0-9]
            ([A-Za-z0-9])*
            [A-Za-z0-9])",
            RegexOptions.IgnorePatternWhitespace);

The method that filters and splits the words:

public List<string> SplitLineIntoWords(string lineText)
    {
        List<string> lineWords = new List<string>();

        foreach (Match m in pattern.Matches(lineText))
        {
            lineWords.Add(m.Groups[1].Value.ToLower());
        }

        return lineWords;
    }

How can I optimize the method so it runs faster? (Right now, splitting the text of a 350 MB file into words takes up to 25 seconds.)

Answer 1

Your expression basically matches substrings that consist of at least 2 alphanumeric characters.

Use

var results = Regex.Matches(s, @"[A-Za-z0-9]{2,}")
        .Cast<Match>()
        .Select(m => m.Value)
        .ToList();

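See the regex demo.

See the benchmark at RegexHero.net:

Because of backtracking (and because it fills the capture stack of the second group), ([A-Za-z0-9]([A-Za-z0-9])*[A-Za-z0-9]) takes more time to match:

[A-Za-z0-9] matches one alphanumeric character, then
([A-Za-z0-9])* matches as many alphanumeric characters as possible, capturing each one, and
[A-Za-z0-9] still needs to match a character, so the engine steps back one position and lets this last subpattern match that alphanumeric character.

[A-Za-z0-9]{2,} does not backtrack, because there is only one way to match the string.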
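As a quick sanity check that the two patterns pick out the same words, here is a minimal sketch that runs both over the sample line from the question. The class and method names (PatternComparison, Words) are mine, made up for illustration, and the original pattern is flattened onto one line so that IgnorePatternWhitespace is not needed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Pattern from the question, written on a single line.
    public const string Original = @"([A-Za-z0-9]([A-Za-z0-9])*[A-Za-z0-9])";

    // Simplified pattern from this answer.
    public const string Simplified = @"[A-Za-z0-9]{2,}";

    // Returns the full text of every match of the pattern in the line.
    public static string[] Words(string pattern, string line)
    {
        return Regex.Matches(line, pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        string line = "tjdj47 *** ss__s * 47 djj ___ s_sd4 4";

        string[] fromOriginal = Words(Original, line);
        string[] fromSimplified = Words(Simplified, line);

        // Prints: tjdj47,ss,47,djj,sd4
        Console.WriteLine(string.Join(",", fromSimplified));

        // Prints: True (both patterns yield the same sequence of words)
        Console.WriteLine(fromOriginal.SequenceEqual(fromSimplified));
    }
}
```

This only shows that the results agree on one input; it says nothing about timing, so measure both patterns on your real file.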

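Here is a comparison of how the two patterns arrive at just the first match (done with the PCRE option, but it is very close to what .NET does): 1) your regex and 2) my solution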


Answer 2

You can simplify your regex so that it matches the character class at least 2 times, and as many times as possible beyond that.

Answer 3

I think the main reason you are running into this problem is backtracking. Change your regex to:

@"[a-zA-Z0-9]{2,}"

As for your regex, this part:

([A-Za-z0-9])*[A-Za-z0-9]

backtracks once for every word, so you basically scan each word twice. The engine first matches ([A-Za-z0-9])* all the way to the end of the word, then fails on the trailing [A-Za-z0-9], backtracks by one character, and only then matches it.
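Dropped into the method from the question, the simplified pattern could look roughly like the sketch below. The WordSplitter class name and the WordPattern field are placeholders of mine. RegexOptions.Compiled is also my own assumption rather than something the answers here suggest, so benchmark with and without it before keeping it.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class WordSplitter
{
    // Built once and reused for every line.
    // RegexOptions.Compiled is an assumption, not part of the answers above.
    private static readonly Regex WordPattern =
        new Regex(@"[A-Za-z0-9]{2,}", RegexOptions.Compiled);

    public List<string> SplitLineIntoWords(string lineText)
    {
        var lineWords = new List<string>();

        foreach (Match m in WordPattern.Matches(lineText))
        {
            // The pattern has no capture group, so the whole match is the word.
            lineWords.Add(m.Value.ToLower());
        }

        return lineWords;
    }
}
```

Because the pattern no longer contains a capturing group, m.Value takes the place of m.Groups[1].Value and the engine has no per-character captures to record.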

Optimize a Regex method (splitting alphanumeric words from a line of text)

3 answers: