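I am trying to extract a person's age from a sentence; this is a bit of a simplification, but it is all for a research project. I know that in the sentence the age is always preceded by a colon followed by 0 or more spaces, or by a colon, a space, a few words and some spaces (e.g. "character: a lovely eighty year old grandma"). I want a regex that lets me extract "eighty" from one of its groups. I am using Python's 're' library, and my code hangs on this example (code and example below):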
import re

# Note: the nested quantifiers in this pattern are what make the search hang.
regex_age_string = r'([:]*[ ]*)?((([a-z]*)([ -]*))+)([ -]+)(year)'
regex_age_string = re.compile(regex_age_string, re.DOTALL)
sentence = ('history: four year-old boy was really sad when he found '
            'out the toy was broken')
age_extract_string = re.search(regex_age_string, sentence)
print(age_extract_string.group())
print(age_extract_string.group(2))
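However, it works when I shorten the sentence by cutting off some of the trailing words. I have read about regex searches hanging because of catastrophic backtracking, but I am not sure how that applies here or how to fix it.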
Answer 0 (score: 1)
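The regex slows down because of catastrophic backtracking. It is caused by a sequence of optional patterns inside a quantified group: (([a-z]*)([ -]*))+. Both inner patterns can match the empty string, so when the rest of the pattern fails, the engine retries the same text split across the group's iterations in a very large number of ways.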
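As a rough intuition for why the tail of the sentence matters, the sketch below counts how many ways a run of n letters can be cut into consecutive non-empty chunks, which is one simplified view of how iterations of (([a-z]*)([ -]*))+ can share the same word. The function name count_splits is hypothetical, and the model is an assumption: it ignores empty iterations, separators and any engine optimizations, so it illustrates the growth rate rather than the exact number of steps 're' performs.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count the ways to cut a run of n letters into consecutive non-empty chunks.

    Simplified model (an assumption, not the real engine): each iteration of
    the quantified group takes one non-empty chunk of the run.
    """
    if n == 0:
        return 1  # nothing left to cut: exactly one way (stop here)
    # The first chunk takes k letters; the remaining n - k are cut recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (5, 10, 20, 30):
    # The count doubles with every extra letter: 2 ** (n - 1).
    print(n, count_splits(n))
```

Every additional letter doubles the count, which is consistent with the search finishing on a shortened sentence but hanging on the full one.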
You can instead match everything from : up to year, that is, any letters, whitespace or hyphens that come before year:
r':\s*([a-z\s-]*?)\s*-*year'
See the regex demo.
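A minimal sketch of using this pattern from Python, assuming the sentences are lowercase as in the question. The names age_pattern, sentences and captured_groups are illustrative only, and the second sentence is adapted from the example in the question.

```python
import re

# Pattern from the answer: capture everything between ':' and 'year'.
age_pattern = re.compile(r':\s*([a-z\s-]*?)\s*-*year')

sentences = [
    'history: four year-old boy was really sad when he found '
    'out the toy was broken',
    'character: a lovely eighty year old grandma',
]

captured_groups = []
for sentence in sentences:
    match = age_pattern.search(sentence)
    # Guard against sentences without a match before calling .group().
    captured_groups.append(match.group(1) if match else None)

print(captured_groups)  # ['four', 'a lovely eighty']
```

Note that group 1 holds everything between the colon and year, so for the second sentence it also contains the leading words "a lovely"; some post-processing is still needed if only the number word is wanted.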
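Details
- : - a colon
- \s* - 0+ whitespaces
- ([a-z\s-]*?) - Group 1: 0+ lowercase ASCII letters, whitespaces or hyphens, as few as possible
- \s* - 0+ whitespaces
- -* - 0+ - characters
- year - the substring year.

Answer 1 (score: 0)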
Based on your description, you can use the following regex to get the age (it covers ages from 0 to 999; older people are not considered):
(?i)\b(?:zero|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty)\b(?=\s*year)|\b(?:(?:one|two|three|four|five|six|seven|eight|nine)? hundred(?:\sand)?\s)?(?:(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)[\s-]?)?\b(?:one|two|three|four|five|six|seven|eight|nine)?(?=\syear)
It works with the following sentences:
history: Zero year-old baby
history: FOUR year-old boy was really sad when he found
out the toy was broken
character: a lovely eighty-three year old grandma
test: a nice eighty year-old father
character: a lovely eighty years old grandma
character: a lovely ninety-nine year old grandma
research: a great eight year-old brother
character: a lovely one hundred ninety-nine year old incredible grandma
character: a lovely one hundred and ninety-nine year old really incredible grandma
character: a lovely one hundred one year old super incredible grandma
character: a lovely nine hundred and ninety-nine year old super super incredible grandma
character: a lovely nine hundred ninety nine year old super super incredible grandma
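The pattern above can be tried from Python roughly as follows. This is only a sketch: the names AGE_WORDS, samples and results are hypothetical, the pattern is split across several string literals purely for readability, and only a few of the listed sentences are checked.

```python
import re

# Same pattern as above, split into adjacent raw strings for readability.
# The inline (?i) flag stays at the very start of the expression.
AGE_WORDS = re.compile(
    r'(?i)\b(?:zero|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen'
    r'|seventeen|eighteen|nineteen|twenty)\b(?=\s*year)'
    r'|\b(?:(?:one|two|three|four|five|six|seven|eight|nine)?'
    r' hundred(?:\sand)?\s)?'
    r'(?:(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)[\s-]?)?'
    r'\b(?:one|two|three|four|five|six|seven|eight|nine)?(?=\syear)'
)

samples = [
    'history: Zero year-old baby',
    'history: FOUR year-old boy was really sad when he found '
    'out the toy was broken',
    'character: a lovely eighty-three year old grandma',
    'test: a nice eighty year-old father',
    'character: a lovely one hundred and ninety-nine year old grandma',
    'character: a lovely nine hundred ninety nine year old grandma',
]

results = []
for sample in samples:
    match = AGE_WORDS.search(sample)
    # The age is the whole match; this pattern has no capturing groups.
    results.append(match.group() if match else None)

for sample, age in zip(samples, results):
    print(f'{age!r:35} <- {sample}')
```

Because the second alternative consists entirely of optional parts, it can also produce an empty match right before " year" when no number word precedes it, so callers may want to treat an empty result as "no age found".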
Feel free to adapt it for grandmas who are thousands or millions of years old.
DEMO on regex101.com