How can I improve the performance of np.where with regular expressions?

Time: 2016-08-05 05:34:40

Tags: python regex performance pandas numpy

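I am new to Python, NumPy, and regular expressions. I am trying to extract a pattern from each row of a pandas text column. My requirements cover many possible cases, so I wrote the different regular expressions below. To iterate over the patterns and search for each one I am using NumPy's np.where, but I am running into performance problems. Is there a way to improve the performance, or an alternative approach that produces the output below?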

x_train['Description'] is my pandas column.

54672 rows in my dataset.


Code:

import re
import time

import numpy as np

# x_train is assumed to be an existing pandas DataFrame with a 'Description' column.

pattern1 = re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)',re.I)
pattern2 = re.compile(r'\bfor\b[\s]*age[s]?\W+\d+\W+(?:month[s]?|year[s]?)',re.I)
pattern3 = re.compile(r'\badult[s]?.[\w\s]\d+',re.I)
pattern4 = re.compile(r'\b\d+\W+(?:month[s]?|year[s]?)\W+of\W+age[a-z]?',re.I)
pattern5 = re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?',re.I)
pattern6 = re.compile(r'\bage.*?\s\d+[\s]*\+',re.I)
pattern7 = re.compile(r'\bbetween[\s]*age[s]?[\s]*\d+.*(?:month[s]?|year[s]?)',re.I)
pattern8 = re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)',re.I)

np_time = time.time()

# Try each pattern in turn; the first one that matches a row supplies its findall() result.
x_train['pattern'] = np.where(x_train['Description'].str.contains(pattern1), x_train['Description'].str.findall(pattern1),
                     np.where(x_train['Description'].str.contains(pattern2), x_train['Description'].str.findall(pattern2),
                     np.where(x_train['Description'].str.contains(pattern3), x_train['Description'].str.findall(pattern3),
                     np.where(x_train['Description'].str.contains(pattern4), x_train['Description'].str.findall(pattern4),
                     np.where(x_train['Description'].str.contains(pattern5), x_train['Description'].str.findall(pattern5),
                     np.where(x_train['Description'].str.contains(pattern6), x_train['Description'].str.findall(pattern6),
                     np.where(x_train['Description'].str.contains(pattern7), x_train['Description'].str.findall(pattern7),
                     np.where(x_train['Description'].str.contains(pattern8), x_train['Description'].str.findall(pattern8),
                              'NO PATTERN'))))))))

print("pattern extraction ran in = ")
print("--- %s seconds ---" % (time.time() - np_time))



pattern extraction ran in = 
--- 99.5106501579 seconds ---

Sample input and output

        Description                                     pattern

    0   **AGE RANGE: 6 YEARS** AND UP 10' LONG          AGE RANGE: 6 YEARS
        STRING OF BEAUTIFUL LIGHTS MULTIPLE
        LIGHT EFFECTS FADE IN AND OUT

    1   DIMENSIONS   OVERALL HEIGHT - TOP               AGE GROUP: -2 YEARS/3 TO 4
        TO BOTTOM: 34.5'' OVERALL WIDTH - SIDE          YEARS/5 TO 6 YEARS/7 TO 8
        TO SIDE: 20''  OVERALL DEPTH -                  YEARS/7 TO 8 YEARS.
        FRONT TO BACK:      15''  COUNTER TOP
        HEIGHT - TOP TO BOTTOM: 23''  OVERALL
        PRODUCT WEIGHT: 38 LBS "
        **"AGE GROUP: -2 YEARS/3 TO 4 YEARS/5 TO 6
        YEARS/7 TO 8 YEARS**.

    2   THE FLAME-RETARDANT FOAM ALSO CONTAINS          AGED 1-5 YEARS
        ANTIMICROBIAL PROTECTION, SO IT WON'T GROW
        MOLD OR BACTERIA IF IT GETS WET. THE
        BRIGHTLY-COLORED
        VINYL EXTERIOR IS EASY TO WIPE CLEAN. FOAMMAN
        IS DESIGNED FOR KIDS **AGED 1-5 YEARS**

1 answer:

Answer 0 (score: 0)

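Here are a few things you can try:

First, you need to identify the slower regular expressions. One way to do this is to paste each one into https://regex101.com/ and watch the "steps" value.

I checked your regular expressions; numbers 5 and 8 are the slowest.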

27800 steps = [a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?
 4404 steps = \b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)
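
As a complement to the step counts above, the patterns can also be timed locally. The following is only a minimal sketch: the `time_patterns` helper, the `SAMPLE` rows, and the `repeat` count are hypothetical and not part of the original question, and only patterns 5 and 8 are included.

```python
import re
import timeit

# Two of the question's patterns; the others can be added the same way.
PATTERNS = {
    "pattern5": re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?', re.I),
    "pattern8": re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I),
}

# Hypothetical sample rows; in practice use a slice of the real column.
SAMPLE = [
    "AGE RANGE: 6 YEARS AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS",
    "FOAMMAN IS DESIGNED FOR KIDS AGED 1-5 YEARS",
]


def time_patterns(patterns, rows, repeat=200):
    """Return {name: seconds} for running findall over all rows `repeat` times."""
    timings = {}
    for name, pattern in patterns.items():
        timings[name] = timeit.timeit(
            lambda: [pattern.findall(row) for row in rows], number=repeat
        )
    return timings


if __name__ == "__main__":
    results = time_patterns(PATTERNS, SAMPLE)
    # Print the slowest pattern first.
    for name, seconds in sorted(results.items(), key=lambda kv: -kv[1]):
        print("%s: %.4f s" % (name, seconds))
```

Wall-clock timings depend on the machine and on the sample, so they are best read as a ranking of the patterns rather than as absolute numbers.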

You could consider optimizing these two regular expressions.

For example, you can rewrite this: \b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)

into this: \b\d+[\w+\s]*?(?:\band\s(?:up|above|old[a-z]*\b)), which uses about 50% fewer steps.
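
Before swapping one pattern for the other, it may be worth checking that both return the same matches on real data. This is a small sketch with hypothetical test strings; `differing_rows` is an illustrative helper, not something from the original answer.

```python
import re

# pattern8 as written in the question.
original = re.compile(
    r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I)
# The factored version suggested above.
rewritten = re.compile(
    r'\b\d+[\w+\s]*?(?:\band\s(?:up|above|old[a-z]*\b))', re.I)

# Hypothetical test strings, not taken from the real dataset.
SAMPLES = [
    "AGE RANGE: 6 YEARS AND UP 10' LONG",
    "FOR KIDS 3 YEARS AND ABOVE",
    "RECOMMENDED FOR 12 MONTHS AND OLDER",
    "NO AGE INFORMATION HERE",
    "SEAT FOR 8 AND UPPER BODY STRAP",
]


def differing_rows(rows):
    """Return the rows on which the two patterns give different findall() results."""
    return [row for row in rows if original.findall(row) != rewritten.findall(row)]


if __name__ == "__main__":
    for row in differing_rows(SAMPLES):
        print(row, original.findall(row), rewritten.findall(row))
```

The factored pattern drops the `\b` after `up` and `above`, so the two are not strictly equivalent: the last sample ("AND UPPER") matches only the rewritten version. If that matters for the data, the `\b` can be moved outside the inner group so that it applies to all three alternatives.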

For the other regular expression, there are a few options. You could rewrite it as:

[A-Z][A-LN-XZ\s]+(?:(?:Y(?!EARS?)|M(?!ONTHS?))[A-LN-XZ\s]+)*(?:MONTHS?|YEARS?)[\w\s]+AGE[S]?

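This is faster, although not by much (27800 vs 23800 steps).

What really seems to speed it up, however, is making it case-sensitive.

The original regular expression, made case-sensitive, takes only 3700 steps, and the optimized one 1470.

So you could simply convert the whole string to upper or lower case and then run a case-sensitive regular expression on it. You may not even need to convert the strings, since the ones in your sample all appear to be upper case anyway.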
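
A minimal sketch of that idea, using pattern 5 from the question: the upper-case spelling of the pattern and the `find_upper` helper are assumptions made for illustration, and the sample sentence is hypothetical.

```python
import re

# pattern5 from the question, case-insensitive as originally written.
pattern5_ci = re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?', re.I)

# The same pattern spelled in upper case and compiled WITHOUT re.I.
pattern5_cs = re.compile(r'[A-Z][A-Z\s]+(?:MONTH[S]?|YEAR[S]?)[\w\s]+AGE[S]?')


def find_upper(pattern, text):
    """Upper-case the text once, then run a case-sensitive pattern on it."""
    return pattern.findall(text.upper())


if __name__ == "__main__":
    text = "Suitable for children from six months of age and older"
    print(pattern5_ci.findall(text))
    print(find_upper(pattern5_cs, text))
```

Note that the second call returns the matches in upper case, because it searches the converted string. If the column is already upper case, as in the sample rows, the conversion step can be skipped.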

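Another thing to look at is the order in which the regular expressions are tested. If some regular expressions are more likely to match than others, test those first.

If you do not know these probabilities and assume they are roughly equal, consider putting the simpler regular expressions first. Always testing a complex regular expression that rarely matches first wastes time.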
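
One way to make the order of the tests explicit is to try the patterns one by one on each row and stop at the first hit. The sketch below is not the approach used in the question (which evaluates every pattern on every row through nested np.where calls); `first_match` and `ORDERED_PATTERNS` are hypothetical names, the order shown is only illustrative, and just three of the eight patterns are included.

```python
import re

# Patterns in the order they should be tried (cheapest or most likely first).
ORDERED_PATTERNS = [
    re.compile(r'\bfor\b[\s]*age[s]?\W+\d+\W+(?:month[s]?|year[s]?)', re.I),
    re.compile(r'\b\d+\W+(?:month[s]?|year[s]?)\W+of\W+age[a-z]?', re.I),
    re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)', re.I),
]


def first_match(text, patterns=ORDERED_PATTERNS, default='NO PATTERN'):
    """Return findall() of the first pattern that matches, else the default."""
    for pattern in patterns:
        found = pattern.findall(text)
        if found:
            return found  # later patterns are never run on this row
    return default


if __name__ == "__main__":
    print(first_match("AGE RANGE: 6 YEARS AND UP 10' LONG"))
    print(first_match("PLAIN TEXT"))
```

With pandas this could be applied as `x_train['Description'].apply(first_match)`. Whether it is actually faster than the nested np.where version should be measured on the real data. Keep in mind that changing the order also changes which pattern wins when more than one matches a row.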

Finally, when you have alternatives such as (a|b|c), consider putting the most likely one first, for the same reason.
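
If the likelihood of each alternative is not known in advance, it can be estimated from a sample. The following sketch counts how many rows contain each alternative and builds the group with the most frequent one first; `order_alternatives` and the sample rows are hypothetical.

```python
import re


def order_alternatives(alternatives, rows):
    """Return the alternatives sorted by how many rows contain each, most frequent first."""
    counts = {}
    for alt in alternatives:
        regex = re.compile(alt, re.I)
        counts[alt] = sum(1 for row in rows if regex.search(row))
    # sorted() is stable, so ties keep their original relative order.
    return sorted(alternatives, key=lambda alt: -counts[alt])


# Hypothetical sample rows; in practice use a slice of the real column.
SAMPLE = [
    "AGE RANGE: 6 YEARS AND UP",
    "AGED 1-5 YEARS",
    "FOR AGES 18 MONTHS AND OLDER",
    "3 YEARS OF AGE",
]

ordered = order_alternatives([r'month[s]?', r'year[s]?'], SAMPLE)
unit_group = '(?:%s)' % '|'.join(ordered)

if __name__ == "__main__":
    print(unit_group)
```

Reordering is only safe when the alternatives cannot match at the same position with different lengths, as is the case for `month` and `year`. For overlapping alternatives the order can change what is matched.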