Regex to match an age written out in words

Time: 2018-05-02 07:52:09

Tags: python regex

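I am trying to extract a person's age from a sentence; this is somewhat simplified, but it is all for a research project. I know that in the sentence the age is always preceded by a colon followed by zero or more spaces, or by a colon, spaces, a few words, and some spaces (for example: "character: a lovely eighty year old grandma"). I want a regex that lets me extract "eighty" from one of its groups. I am using Python's 're' library, and my code hangs on this example (code and example below):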

import re

regex_age_string = r'([:]*[ ]*)?((([a-z]*)([ -]*))+)([ -]+)(year)'
regex_age_string = re.compile(regex_age_string, re.DOTALL)
sentence = 'history:   four year-old boy was really sad when he found out the toy was broken'
# The search below hangs on this sentence.
age_extract_string = re.search(regex_age_string, sentence)
print(age_extract_string.group())
print(age_extract_string.group(2))

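However, it works when I shorten the sentence by cutting off some of the trailing words. I have read that regex searches can hang because of catastrophic backtracking, but I am not sure how that applies here or how to fix it.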

2 answers:

Answer 0: (score: 1)

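The regex slows down because of catastrophic backtracking. It is caused by a sequence of optional patterns inside a quantified group: (([a-z]*)([ -]*))+. Because both inner parts can match nothing, the engine can split the same run of letters across repetitions of the group in a huge number of ways, and it tries them all before giving up.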
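To see the effect without hanging the interpreter, here is a minimal sketch that times the nested-quantifier core of the question's pattern on short inputs that can never match. The helper name time_failed_search and the chosen lengths are assumptions made for illustration; exact timings depend on the machine and Python version, so none are shown.

```python
import re
import time

# Nested-quantifier core of the question's pattern (leading optional group omitted).
NESTED = re.compile(r'(([a-z]*)([ -]*))+[ -]+year')


def time_failed_search(n):
    """Search a string of n letters that cannot match and time the attempt."""
    text = 'a' * n  # no space, hyphen, or "year", so the search must fail
    start = time.perf_counter()
    result = NESTED.search(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Lengths are kept small on purpose; the time tends to grow quickly with n.
for n in (10, 12, 14, 16):
    result, elapsed = time_failed_search(n)
    print(f'n={n:2d} match={result} time={elapsed:.4f}s')
```

Each search fails, and the elapsed time usually rises sharply as n grows, which is why trimming words off the sentence makes the original code appear to work.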

You can instead match all letters, whitespace, or hyphens that come after a : and before year:

r':\s*([a-z\s-]*?)\s*-*year'

See the regex demo.

Details

  • : - a colon
  • \s* - 0+ whitespaces
  • ([a-z\s-]*?) - Group 1: 0+ lowercase ASCII letters, whitespaces, or hyphens, as few as possible
  • \s* - 0+ whitespaces
  • -* - 0+ hyphens
  • year - the literal substring year.
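As a quick check, the pattern above can be run against the sentences from the question. This is a minimal sketch; the helper name words_before_year is an assumption made for illustration, not part of the answer.

```python
import re

# Pattern from this answer: everything between the colon and "year" lands in group 1.
AGE_CONTEXT = re.compile(r':\s*([a-z\s-]*?)\s*-*year')


def words_before_year(sentence):
    """Return group 1 (the text between ':' and 'year'), or None if nothing matches."""
    match = AGE_CONTEXT.search(sentence)
    return match.group(1) if match else None


sentences = [
    'history:   four year-old boy was really sad when he found out the toy was broken',
    'character: a lovely eighty year old grandma',
]
for sentence in sentences:
    print(repr(words_before_year(sentence)))
```

Note that group 1 holds every word between the colon and year, so the second sentence yields 'a lovely eighty' rather than 'eighty' alone. The character class is lowercase only, so uppercase input needs re.IGNORECASE or a wider class.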

Answer 1: (score: 0)

Based on your description, you can use the following regex to get the age (it covers people between 0 and 999 years old):

(?i)\b(?:zero|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty)\b(?=\s*year)|\b(?:(?:one|two|three|four|five|six|seven|eight|nine)? hundred(?:\sand)?\s)?(?:(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)[\s-]?)?\b(?:one|two|three|four|five|six|seven|eight|nine)?(?=\syear)

It works with the following sentences:

history:   Zero year-old baby
history:   FOUR year-old boy was really sad when he found out the toy was broken
character: a lovely eighty-three year old grandma
test: a nice eighty year-old father
character: a lovely eighty years old grandma
character: a lovely ninety-nine year old grandma
research: a great eight year-old brother
character: a lovely one hundred ninety-nine year old increadible grandma
character: a lovely one hundred and ninety-nine year old really increadible grandma
character: a lovely one hundred one year old super increadible grandma
character: a lovely nine hundred and ninety-nine year old super super increadible grandma
character: a lovely nine hundred ninety nine year old super super increadible grandma

Feel free to adapt it for grandmas who are thousands or millions of years old.
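For use from Python, the pattern can be compiled once and wrapped in a small helper. This is a minimal sketch: the pattern is copied from this answer and only split across lines, while the names AGE_WORDS and find_age are assumptions made for illustration.

```python
import re

# Pattern copied from the answer; adjacent raw strings are concatenated unchanged.
AGE_WORDS = re.compile(
    r'(?i)\b(?:zero|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen'
    r'|seventeen|eighteen|nineteen|twenty)\b(?=\s*year)'
    r'|\b(?:(?:one|two|three|four|five|six|seven|eight|nine)? hundred(?:\sand)?\s)?'
    r'(?:(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)[\s-]?)?'
    r'\b(?:one|two|three|four|five|six|seven|eight|nine)?(?=\syear)'
)


def find_age(sentence):
    """Return the number words in front of 'year', or None if there are none."""
    match = AGE_WORDS.search(sentence)
    # Every part of the second alternative is optional, so it can match an
    # empty string right before " year"; treat that as "no age found".
    if match is None or not match.group(0):
        return None
    return match.group(0)


for sentence in [
    'history:   Zero year-old baby',
    'character: a lovely eighty-three year old grandma',
    'character: a lovely one hundred and ninety-nine year old grandma',
]:
    print(find_age(sentence))
```

The empty-match guard is a design choice of this sketch: without it, a sentence such as 'since last year' would produce a match object whose text is the empty string.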

DEMO on regex101.com