How to write a regular expression to extract years

Time: 2018-02-06 17:07:00

Tags: regex python-3.x information-extraction

How can we write a regular expression to extract years from text, where the years may appear in any of the following forms?

Case 1:
1970 - 1980 --> 1970, 1980
January 1920 - Feb 1930 --> 1920, 1930
May 1920 to September 1930 --> 1920, 1930
Case 2:
July 1945 --> 1945

Writing a regular expression for Case 1 is simple, but how do I handle Case 2?

\d{4} \s? (?: [^a-zA-Z0-9] | to) \s? \w+? \d{4}

2 answers:

Answer 0: (score: 1)

Given your requirements, simply match every standalone 4-digit number.

import re
s = '''1970 - 1980
January 1920 - Feb 1930
May 1920 to September 1930
July 1945'''

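# \b word boundaries ensure that only standalone 4-digit numbers match
p = re.compile(r'\b\d{4}\b')

# Process each line separately so the years stay grouped per line
for x in s.splitlines():
    result = p.findall(x)
    print(result)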

Output

['1970', '1980']
['1920', '1930']
['1920', '1930']
['1945']
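
Note that \b\d{4}\b matches any standalone 4-digit number, not only years. If the text may contain other 4-digit numbers, one option is to narrow the pattern. The following is a minimal sketch, assuming years fall between 1000 and 2099; that range, the name YEAR_RE, and the helper extract_years are illustrative assumptions and do not come from the question.

```python
import re

# Assumption: a "year" is a standalone 4-digit number from 1000 to 2099.
YEAR_RE = re.compile(r'\b(?:1\d{3}|20\d{2})\b')

def extract_years(text):
    """Return all plausible years found in text, in order, as ints."""
    return [int(y) for y in YEAR_RE.findall(text)]

print(extract_years('May 1920 to September 1930'))
print(extract_years('July 1945'))
```

Adjust the range to whatever your data actually contains.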

Answer 1: (score: 0)

Regex: .*?([0-9]{4})(?:.*?([0-9]{4}))?

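Details, with the same pattern written using \d in place of [0-9], that is .*?(\d{4})(?:.*?(\d{4}))?:

  • () capturing group
  • (?:) non-capturing group
  • {n} matches exactly n times
  • .*? matches any character between zero and unlimited times (lazy)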
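
To make the roles of the two kinds of groups concrete, here is a small sketch using Python's re module. The variable names are illustrative only. It shows that the non-capturing group is not counted among the groups, and that the second capturing group is simply unset when only one year is present.

```python
import re

pattern = re.compile(r'.*?(\d{4})(?:.*?(\d{4}))?')

# Only the two (...) groups are counted; (?:...) groups but does not capture.
group_count = pattern.groups

# Two years: both capturing groups participate in the match.
both = pattern.match('January 1920 - Feb 1930').groups()

# One year: the optional non-capturing group is skipped,
# so the second capturing group is None.
single = pattern.match('July 1945').groups()

print(group_count, both, single)
```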

Python code

import re

def Years(text):
    return re.findall(r'.*?([0-9]{4})(?:.*?([0-9]{4}))?', text)

print(Years('January 1920 - Feb 1930'))

Output:

[('1920', '1930')]
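
Because the pattern has two capturing groups, re.findall returns a list of tuples, and for a single-year input such as 'July 1945' the second element of the tuple is an empty string. If a flat list of years is more convenient, the result can be post-processed. This is a minimal sketch; the names PAIR_RE and years_flat are hypothetical and not part of the answer above.

```python
import re

PAIR_RE = re.compile(r'.*?([0-9]{4})(?:.*?([0-9]{4}))?')

def years_flat(text):
    """Flatten findall's tuples into one list, skipping empty groups."""
    return [y for pair in PAIR_RE.findall(text) for y in pair if y]

print(years_flat('January 1920 - Feb 1930'))
print(years_flat('July 1945'))
```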