Capturing a date with a regular expression

Time: 2014-09-02 13:14:47

Tags: python regex scrapy

I have a regular expression for parsing dates:

(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))

It is being run against the following string:

The owners of this address received a permit on Wednesday, July 31, 2014

The item output in scrapy is:

[u'June', u'31', u'2014', u'', u'', u'', u'', u'', u'', u'']

I would like the scrapy item to be:

[u'June 31, 2014']

Here is my scrapy code:

date_scrape = response.css('#ctl00_MasterDiv > div.Divwidth100 td.content_panel_middle > div > p:contains("The owners of this address") > b ::text')

permit_date = date_scrape.re(r'(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))')

Any ideas on how to solve this?

2 answers:

Answer 0 (score: 1)

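If you don't want to dive into the wonderful world of regular expressions, here is an alternative solution.

Use dateutil.parser.parse() with fuzzy=True, which lets the parser skip the tokens it cannot interpret as part of a date. A demo from the scrapy shell: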

$ scrapy shell index.html
>>> text = response.xpath('//body/b/text()').extract()[0]
>>> text
u'The owners of this address received a permit on Wednesday, July 31, 2014'

>>> from dateutil.parser import parse
>>> parse(text, fuzzy=True)
datetime.datetime(2014, 7, 31, 0, 0)

where index.html contains the test HTML data:

<body>
    <b>The owners of this address received a permit on Wednesday, July 31, 2014</b>
</body>
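
The question asks for a string such as 'July 31, 2014' rather than a datetime object. A minimal sketch of converting the parsed value back, assuming the datetime shown in the session above and the default C/English locale (the variable names `parsed` and `permit_date` are only illustrative):

```python
from datetime import datetime

# The value that parse(text, fuzzy=True) returned in the session above,
# written out by hand so that this sketch needs only the standard library.
parsed = datetime(2014, 7, 31, 0, 0)

# %B is the full month name (locale-dependent), %d the zero-padded day of
# the month, %Y the four-digit year.
permit_date = parsed.strftime('%B %d, %Y')
print(permit_date)
```

Note that %d zero-pads, so a date on the 5th would come out as 'July 05, 2014'; if that matters, the day can be formatted separately from parsed.day.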

Answer 1 (score: 1)

import re

s = 'The owners of this address received a permit on Wednesday, July 31, 2014'

# Capture "<word> <digits>, <digits>" as a single group, e.g. "July 31, 2014"
words = re.findall(r'(\w+ \d+, \d+)', s)
print(words)

Result:

['July 31, 2014']
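
The pattern above accepts any word followed by numbers, so it would also match something like 'Foo 99, 1'. If the day-per-month validation from the question matters, one option is to keep that logic but make every inner group non-capturing and wrap the whole date in a single capturing group, so that only one string per match comes back. The following is a sketch, not the asker's code: the names `YEAR`, `LEAP_YEAR`, `DAY_MONTH` and `DATE_RE` are made up for illustration, and it assumes that scrapy's .re() returns groups the same way re.findall does, which is consistent with the ten-element list in the question.

```python
import re

# Any year from 1900 to 2099.
YEAR = r'(?:19|20)\d\d'

# Leap years as listed in the question's pattern (plus 2000).
LEAP_YEAR = (r'(?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60'
             r'|64|68|72|76|80|84|88|92|96)|2000)')

# Month name plus a day that is valid for that month (non-capturing groups only).
DAY_MONTH = (
    r'(?:September|April|June|November) +(?:0?[1-9]|[12]\d|30)'
    r'|(?:January|March|May|July|August|October|December) +(?:0?[1-9]|[12]\d|3[01])'
    r'|February +(?:0?[1-9]|1\d|2[0-8])'
)

# One outer capturing group, so findall returns the whole date as one string.
DATE_RE = re.compile(
    r'((?:' + DAY_MONTH + r'), *' + YEAR
    + r'|February +29, *' + LEAP_YEAR + r')'
)

s = 'The owners of this address received a permit on Wednesday, July 31, 2014'
dates = DATE_RE.findall(s)
print(dates)
```

Under the same assumption, passing DATE_RE.pattern to date_scrape.re() in the spider should give a one-element list per match instead of ten mostly empty strings.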