Question

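I have a Python script that processes more than 1M lines of varying length. The script is very slow: in the past 12 hours it has only gotten through a little over 30,000 lines. The file has already been split, so splitting it further is not an option. My code looks like this: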

import re
import sys

regex1 = re.compile(r"(\{\{.*?\}\})", flags=re.IGNORECASE)
regex2 = re.compile(r"(<ref.*?</ref>)", flags=re.IGNORECASE)
regex3 = re.compile(r"(<ref.*?\/>)", flags=re.IGNORECASE)
regex4 = re.compile(r"(==External links==.*?)", flags=re.IGNORECASE)
regex5 = re.compile(r"(<!--.*?-->)", flags=re.IGNORECASE)
regex6 = re.compile(r"(File:[^ ]*? )", flags=re.IGNORECASE)
regex7 = re.compile(r" [0-9]+ ", flags=re.IGNORECASE)
regex8 = re.compile(r"(\[\[File:.*?\]\])", flags=re.IGNORECASE)
regex9 = re.compile(r"(\[\[.*?\.JPG.*?\]\])", flags=re.IGNORECASE)
regex10 = re.compile(r"(\[\[Image:.*?\]\])", flags=re.IGNORECASE)
regex11 = re.compile(r"^[^_].*(\) )", flags=re.IGNORECASE)

fout = open(sys.argv[2],'a+')

with open(sys.argv[1]) as f:
    for line in f:
        parts=line.split("\t")
        label=parts[0].replace(" ","_").lower()
        line=parts[1].lower()
        try:
            line = regex1.sub("",line )
        except:
            pass
        try:
            line = regex2.sub("",line )
        except:
            pass
        try:
            line = regex3.sub("",line )
        except:
            pass
        try:
            line = regex4.sub("",line )
        except:
            pass
        try:
            line = regex5.sub("",line )
        except:
            pass
        try:
            line = regex6.sub("",line )
        except:
            pass
        try:
            line = regex8.sub("",line )
        except:
            pass
        try:
            line = regex9.sub("",line )
        except:
            pass
        try:
            line = regex10.sub("",line )
        except:
            pass

        try:     
            for match in re.finditer(r"(\[\[.*?\]\])", line):
                replacement_list=match.group(0).replace("[","").replace("]","").split("|")
                replacement_list = [w.replace(" ","_") for w in replacement_list]
                replacement_for_links=' '.join(replacement_list)
                line = line.replace(match.group(0),replacement_for_links)
        except:
            pass
        try:
            line = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', line, flags=re.MULTILINE)  
        except:
            pass    
        try:
            line = line.translate(None, '!"#$%&\'*+,./:;<=>?@[\\]^`{|}~')
        except:
            pass        
        try:
            line = line.replace(' (',' ')   
            line=' '.join([word.rstrip(")") if not '(' in word else word for word in line.split(" ")])
            line=re.sub(' isbn [\w-]+ ',' ' ,line)
            line=re.sub(' [p]+ [\w-]+ ',' ' ,line)
            line = re.sub( ' \d+ ', ' ', line)
            line= re.sub("^\d+\s|\s\d+\s|\s\d+$", " ", line)
            line = re.sub( '\s+', ' ', line).strip()
            line=re.sub(' isbn [\w-]+ ',' ' ,line)
        except:
            pass    
        out_string=label+"\t"+line
        fout.write(out_string)
        fout.write("\n")

fout.close()

Is there any change I can make that would give a significant improvement over the current version?

Update 1: After profiling as @fearless_fool suggested, I realized that regex3, regex9, and the http removal are the least efficient.

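Update 2: Interestingly, using .* adds many more steps to the regex patterns. I tried replacing it with [^X]*, where X is a character that I know never occurs in the string. This gives roughly a 20x improvement on 1000 long lines. For example, regex1 is now regex1 = re.compile(r"(\{\{[^\}]*?\}\})", flags=re.IGNORECASE). What I don't know is how to do this when I want to exclude a two-character sequence. For example, changing (\{\{[^\}]*?\}\}) to (\{\{[^\}\}]*?\}\}) is wrong, I know, because everything inside [] is treated as individual characters.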
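For the two-character case raised in Update 2, one common approach is a negative lookahead rather than a character class. The sketch below is an illustration only, not something taken from the thread: the names `lazy`, `tempered`, and `sample` are made up for this example, and whether the lookahead form is actually faster on the real data would have to be measured.

```python
import re

# Lazy-dot version, as in the question (without the capture group).
lazy = re.compile(r"\{\{.*?\}\}")

# "Anything except the two-character sequence }}": each step consumes either
# a non-brace character, or a single } that is NOT followed by another }.
tempered = re.compile(r"\{\{(?:[^}]|\}(?!\}))*\}\}")

# Hypothetical input containing a lone } inside a template.
sample = "keep {{cite|a}b}} this {{x}} too"

print(lazy.sub("", sample))
print(tempered.sub("", sample))
```

On this sample both patterns remove the same spans. The difference is that the second one spells out what may appear between the delimiters, so it also keeps working when a single } occurs inside a template.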

Answer 1

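(Promoting a comment to an answer): I suggest you use the elegant and useful Regex 101 tool to profile your individual regexes and see whether any of them take an excessive amount of time.

While you're there, you can post a complete example on the site so that others can see the typical input you're working with. (I realize you've already done that. Great!)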
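As a local complement to profiling in the browser, each compiled pattern can also be timed with the standard-library `timeit` module. This is a minimal sketch, not the method used in the thread: `time_patterns` is a hypothetical helper, and `sample` is a made-up line that should be replaced with a real line from the input file.

```python
import re
import timeit

# Hypothetical sample line; substitute a real line from the input file.
sample = "text {{infobox|x}} <ref name=a>cite</ref> [[File:a.jpg|thumb]] more " * 200

# A few of the patterns from the question.
patterns = {
    "regex1": re.compile(r"(\{\{.*?\}\})", flags=re.IGNORECASE),
    "regex3": re.compile(r"(<ref.*?\/>)", flags=re.IGNORECASE),
    "regex9": re.compile(r"(\[\[.*?\.JPG.*?\]\])", flags=re.IGNORECASE),
}

def time_patterns(patterns, text, repeat=5):
    """Return {name: seconds} for `repeat` substitutions with each pattern."""
    return {
        name: timeit.timeit(lambda rx=rx: rx.sub("", text), number=repeat)
        for name, rx in patterns.items()
    }

if __name__ == "__main__":
    timings = time_patterns(patterns, sample)
    # Slowest pattern first.
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
        print("%-8s %.4f s" % (name, seconds))
```

Timing on real lines matters here, because the cost of a lazy `.*?` depends heavily on how far away the closing delimiter is, or whether it is present at all.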

Answer 2

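After using the useful regex tool that @fearless_fool recommended, I replaced .* with more restrictive versions of it, such as [^\]]*, which sped things up considerably. Making these changes throughout the script significantly improved performance.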
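To illustrate the kind of replacement described above, here is a small sketch comparing lazy-dot patterns with stricter negated-class variants. The strict variants (`strict_link`, `strict_ref`) and the sample strings are assumptions made for this example; the exact patterns used in the final script are not shown in the thread.

```python
import re

# Lazy-dot patterns in the style of the question.
lazy_link = re.compile(r"\[\[.*?\]\]")
lazy_ref = re.compile(r"<ref.*?/>")

# Stricter variants (assumed): the middle part may not contain the
# character that starts the closing delimiter.
strict_link = re.compile(r"\[\[[^\]]*\]\]")
strict_ref = re.compile(r"<ref[^>]*/>")

links = "see [[a|b]] and [[c d]] here"
print(lazy_link.findall(links))
print(strict_link.findall(links))

refs = 'x <ref name="a">cite</ref> y <ref name="b"/> z'
print(lazy_ref.findall(refs))    # runs from the first <ref up to the first "/>"
print(strict_ref.findall(refs))  # only the self-closing tag
```

On well-formed links the two link patterns find the same matches. The `<ref` case shows why the lazy form can be both slow and wrong: with no nearby `/>`, `.*?` keeps scanning forward and can swallow unrelated text, whereas the negated class stops at the first `>`. Note that the two forms are not equivalent in general; for instance, `[^\]]*` will not match across a lone `]` inside a link.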
