I have a regular expression that parses dates like the one below:
(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))
It is being applied to the following string:
The owners of this address received a permit on Wednesday, July 31, 2014
The output in the Scrapy item is:
[u'July', u'31', u'2014', u'', u'', u'', u'', u'', u'', u'']
I would like the Scrapy item to be:
[u'July 31, 2014']
Here is my Scrapy code:
date_scrape = response.css('#ctl00_MasterDiv > div.Divwidth100 td.content_panel_middle > div > p:contains("The owners of this address") > b ::text')
permit_date = date_scrape.re(r'(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))')
Any ideas on how to fix this?
Answer 0: (score: 1)
If you would rather not dive into the wonderful world of regular expressions, here is an alternative solution: use dateutil.parser.parse() with fuzzy=True. Demo from the Scrapy shell:
$ scrapy shell index.html
>>> text = response.xpath('//body/b/text()').extract()[0]
>>> text
u'The owners of this address received a permit on Wednesday, July 31, 2014'
>>> from dateutil.parser import parse
>>> parse(text, fuzzy=True)
datetime.datetime(2014, 7, 31, 0, 0)
where index.html contains the test HTML data:
<body>
<b>The owners of this address received a permit on Wednesday, July 31, 2014</b>
</body>
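To get from the datetime above to the single string the question asks for, one option is strftime. This is a minimal sketch: it assumes the default (English) locale for month names and builds the parsed value directly so it runs without dateutil. The names permit_dt and permit_date are illustrative only.

```python
from datetime import datetime

# Stand-in for the value returned by parse(text, fuzzy=True) in the demo above.
permit_dt = datetime(2014, 7, 31, 0, 0)

# %B is the full month name (locale-dependent), %d the zero-padded day,
# %Y the four-digit year.
permit_date = permit_dt.strftime('%B %d, %Y')
print(permit_date)
```

Note that %d zero-pads, so a date on the first of the month would come out as "July 01, 2014".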
Answer 1: (score: 1)
import re

s = 'The owners of this address received a permit on Wednesday, July 31, 2014'
# Match "<word> <digits>, <digits>", e.g. "July 31, 2014"
words = re.findall(r'(\w+ \d+, \d+)', s)
print(words)
Result:
['July 31, 2014']
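The short pattern above accepts any word followed by digits, so it does no validation. If the month and day checks from the question should be kept, a possible variant makes every inner group non-capturing, so that re.findall returns the whole match instead of one entry per group. The sketch below is a simplification and not the original pattern: it omits the leap-year branch for February 29, and DATE_RE is an illustrative name.

```python
import re

# Every group is non-capturing (?:...), so findall yields whole matches only.
# Assumption: the February 29 leap-year branch from the question is left out.
DATE_RE = re.compile(
    r'(?:(?:September|April|June|November) +(?:0?[1-9]|[12]\d|30)'
    r'|(?:January|March|May|July|August|October|December) +(?:0?[1-9]|[12]\d|3[01])'
    r'|February +(?:0?[1-9]|1\d|2[0-8])'
    r'), *(?:19|20)\d\d'
)

s = 'The owners of this address received a permit on Wednesday, July 31, 2014'
dates = DATE_RE.findall(s)
print(dates)
```

As far as I know, Scrapy's .re() also returns captured groups when the pattern has any, which would explain the extra empty strings in the question; a pattern without capturing groups should therefore give the whole match there too, but that is worth checking against your Scrapy version.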