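I'm new to Python, NumPy, and regular expressions. I'm trying to extract patterns from a pandas text column, row by row. My requirements cover many possible cases, so I wrote the different regular expressions below. To iterate over them and search for a given pattern, I'm using np.where, but I'm running into performance problems. Is there a way to improve the performance, or an alternative approach that produces the output below?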
x_train['Description'] is my pandas column.
54672 rows in my dataset.
Code:
import re
import time
import numpy as np
pattern1 = re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)',re.I)
pattern2 = re.compile(r'\bfor\b[\s]*age[s]?\W+\d+\W+(?:month[s]?|year[s]?)',re.I)
pattern3 = re.compile(r'\badult[s]?.[\w\s]\d+',re.I)
pattern4 = re.compile(r'\b\d+\W+(?:month[s]?|year[s]?)\W+of\W+age[a-z]?',re.I)
pattern5 = re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?',re.I)
pattern6 = re.compile(r'\bage.*?\s\d+[\s]*\+',re.I)
pattern7 = re.compile(r'\bbetween[\s]*age[s]?[\s]*\d+.*(?:month[s]?|year[s]?)',re.I)
pattern8 = re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)',re.I)
np_time = time.time()
x_train['pattern'] = np.where(x_train['Description'].str.contains(pattern1), x_train['Description'].str.findall(pattern1),
np.where (x_train['Description'].str.contains(pattern2), x_train['Description'].str.findall(pattern2),
np.where (x_train['Description'].str.contains(pattern3), x_train['Description'].str.findall(pattern3),
np.where (x_train['Description'].str.contains(pattern4), x_train['Description'].str.findall(pattern4),
np.where (x_train['Description'].str.contains(pattern5), x_train['Description'].str.findall(pattern5),
np.where (x_train['Description'].str.contains(pattern6), x_train['Description'].str.findall(pattern6),
np.where (x_train['Description'].str.contains(pattern7), x_train['Description'].str.findall(pattern7),
np.where (x_train['Description'].str.contains(pattern8), x_train['Description'].str.findall(pattern8),
'NO PATTERN')
)))))))
print("pattern extraction ran in = ")
print("--- %s seconds ---" % (time.time() - np_time))
pattern extraction ran in =
--- 99.5106501579 seconds ---
Sample input and output:
Description pattern
0 **AGE RANGE: 6 YEARS** AND UP 10' LONG AGE RANGE: 6 YEARS
STRING OF BEAUTIFUL LIGHTS MULTIPLE
LIGHT EFFECTS FADE IN AND OUT
1 DIMENSIONS OVERALL HEIGHT - TOP AGE GROUP: -2 YEARS/3 TO 4
TO BOTTOM: 34.5'' OVERALL WIDTH - SIDE YEARS/5 TO 6 YEARS/7 TO 8
YEARS/7 TO 8 YEARS.
TO SIDE: 20'' OVERALL DEPTH -
FRONT TO BACK: 15'' COUNTER TOP
HEIGHT - TOP TO BOTTOM: 23'' OVERALL
PRODUCT WEIGHT: 38 LBS "
**"AGE GROUP: -2 YEARS/3 TO 4 YEARS/5 TO 6
YEARS/7 TO 8 YEARS**.
2 THE FLAME-RETARDANT FOAM ALSO CONTAINS AGED 1-5 YEARS
ANTIMICROBIAL PROTECTION, SO IT WON'T GROW
MOLD OR BACTERIA IF IT GETS WET. THE
BRIGHTLY-COLORED
VINYL EXTERIOR IS EASY TO WIPE CLEAN. FOAMMAN
IS DESIGNED FOR KIDS **AGED 1-5 YEARS**
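Answer 0 (score: 0)
Here are a few things you can try:
First, you need to identify which regular expressions are slow. One way to do that is to paste each one into https://regex101.com/ and watch the "steps" value.
I checked your regular expressions, and numbers 5 and 8 are the slowest.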
27800 steps = [a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?
4404 steps= \b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)
You could consider optimizing these two regular expressions.
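As a local complement to the regex101 step counts, the patterns can also be timed directly against a sample of the data. The following is only a minimal sketch in Python 3: `time_patterns` and `SAMPLE_TEXTS` are hypothetical names introduced here, the sample strings stand in for a slice of `x_train['Description']`, and only patterns 5 and 8 from the question are included.

```python
import re
import time

# Two of the question's patterns; the others can be added the same way.
PATTERNS = {
    'pattern5': re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?', re.I),
    'pattern8': re.compile(
        r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I),
}

# Hypothetical stand-in for a sample of x_train['Description'].
SAMPLE_TEXTS = [
    "AGE RANGE: 6 YEARS AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS",
    "FOAMMAN IS DESIGNED FOR KIDS AGED 1-5 YEARS",
] * 100


def time_patterns(patterns, texts, repeat=3):
    """Return (name, best_seconds) pairs, slowest pattern first."""
    timings = {}
    for name, pattern in patterns.items():
        best = float('inf')
        for _ in range(repeat):
            start = time.perf_counter()
            for text in texts:
                pattern.search(text)
            # Keep the fastest run to reduce noise from other processes.
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)


ranking = time_patterns(PATTERNS, SAMPLE_TEXTS)
for name, seconds in ranking:
    print(name, round(seconds, 4))
```

The absolute numbers depend on the machine and on the sample, so they are only useful for comparing patterns against each other.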
For example, you can rewrite regex 8 from this:
\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)
to this:
\b\d+[\w+\s]*?(?:\band\s(?:up|above|old[a-z]*\b))
which uses about 50% fewer steps.
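Before swapping in a rewritten pattern, it is worth checking that it still matches the same text. The sketch below (Python 3, with made-up sample strings) compares the original pattern 8 with the rewrite as posted, and with a variant that is my own assumption rather than part of the answer: `factored`, which moves the trailing `\b` outside the inner group. In the posted rewrite the `\b` applies only to the `old[a-z]*` branch, so `up` and `above` are no longer required to end at a word boundary.

```python
import re

original = re.compile(
    r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I)
# The rewrite exactly as posted in the answer.
as_posted = re.compile(
    r'\b\d+[\w+\s]*?(?:\band\s(?:up|above|old[a-z]*\b))', re.I)
# Assumed variant: same factoring, but \b applies to every alternative.
factored = re.compile(
    r'\b\d+[\w+\s]*?\band\s(?:up|above|old[a-z]*)\b', re.I)

# Made-up sample strings for illustration.
samples = [
    "AGE RANGE: 6 YEARS AND UP 10' LONG",
    "FOR KIDS 3 AND ABOVE",
    "12 MONTHS AND OLDER",
    "NO AGE HERE",
    "6 AND UPPER",
]


def first_match(pattern, text):
    """Return the first matched substring, or None if there is no match."""
    match = pattern.search(text)
    return match.group(0) if match else None


for text in samples:
    print(repr(text),
          first_match(original, text),
          first_match(as_posted, text),
          first_match(factored, text))
```

On these samples `original` and `factored` agree everywhere, while `as_posted` additionally matches `6 AND UP` inside `6 AND UPPER`. Whether that difference matters depends on the data.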
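For the other regular expression (number 5), there are a few options. You could rewrite it as:
[A-Z][A-LN-XZ\s]+(?:(?:Y(?!EARS?)|M(?!ONTHS?))[A-LN-XZ\s]+)*(?:MONTHS?|YEARS?)[\w\s]+AGE[S]?
which is faster, though not by much (27800 vs 23800 steps).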
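What really seems to speed it up, however, is making it case-sensitive.
The original regex, made case-sensitive, takes only 3700 steps, and the optimized one 1470.
So you can simply convert the whole string to upper or lower case and then run the (case-sensitive) regex on it. You may not even need to convert the strings, since the ones in your sample all appear to be uppercase anyway.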
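A minimal sketch of the case-sensitive approach (Python 3) is shown below. `PATTERN5_CS` is simply pattern 5 written in uppercase without `re.I`. The helper names and the sample strings are hypothetical.

```python
import re

# Pattern 5 as in the question (case-insensitive).
PATTERN5_CI = re.compile(
    r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?', re.I)
# The same pattern written in uppercase, without the re.I flag.
PATTERN5_CS = re.compile(
    r'[A-Z][A-Z\s]+(?:MONTH[S]?|YEAR[S]?)[\w\s]+AGE[S]?')


def find_ci(text):
    """Search the text as-is with the case-insensitive pattern."""
    return PATTERN5_CI.findall(text)


def find_cs(text):
    """Uppercase the text once, then use the case-sensitive pattern."""
    return PATTERN5_CS.findall(text.upper())


# Made-up sample strings for illustration.
samples = [
    "Suitable for children two years of age",
    "NO UNIT HERE",
]

for text in samples:
    print(find_ci(text), find_cs(text))
```

Note that `find_cs` returns the uppercased text rather than the original casing. If the column is already uppercase, as in the question's sample, the `.upper()` call can be dropped.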
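Another thing to look at is the order in which the regular expressions are tested. If some of them are more likely to match than others, test those first.
If you don't know those probabilities and assume they are roughly equal, consider putting the simpler regular expressions first. Always testing a complex regex that rarely matches before anything else wastes time.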
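Ordering only pays off if the search stops at the first pattern that matches. The nested `np.where` in the question runs `str.contains` and `str.findall` over the whole column for all eight patterns regardless. The sketch below (Python 3) is one possible alternative, not the asker's code: `extract_first` is a hypothetical helper that tries the patterns in list order and returns the first non-empty `findall` result. The list keeps the question's original order, because reordering it changes which pattern wins when several match the same row.

```python
import re

# The question's patterns, in their original priority order.
PATTERNS = [
    re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)', re.I),
    re.compile(r'\bfor\b[\s]*age[s]?\W+\d+\W+(?:month[s]?|year[s]?)', re.I),
    re.compile(r'\badult[s]?.[\w\s]\d+', re.I),
    re.compile(r'\b\d+\W+(?:month[s]?|year[s]?)\W+of\W+age[a-z]?', re.I),
    re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?', re.I),
    re.compile(r'\bage.*?\s\d+[\s]*\+', re.I),
    re.compile(r'\bbetween[\s]*age[s]?[\s]*\d+.*(?:month[s]?|year[s]?)', re.I),
    re.compile(
        r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I),
]


def extract_first(text, patterns=PATTERNS, default='NO PATTERN'):
    """Return the findall() result of the first pattern that matches."""
    for pattern in patterns:
        found = pattern.findall(text)
        if found:
            # Stop here: later patterns are never run for this row.
            return found
    return default


print(extract_first("AGE RANGE: 6 YEARS AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS"))
print(extract_first("FOAMMAN IS DESIGNED FOR KIDS AGED 1-5 YEARS"))
print(extract_first("PLAIN TEXT WITHOUT NUMBERS"))
```

On the question's DataFrame this could be applied with `x_train['pattern'] = x_train['Description'].map(extract_first)`, assuming the column contains only strings (no missing values).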
Finally, when you have alternatives such as (a|b|c), you might, for the same reason as before, put the most likely one first.