Question

I want to extract some data from a website using Scrapy. This is the parsing part of my spider:

item['title'] = sel.xpath('//div[@class="box"]/h3/text()').extract()
item['date'] = sel.xpath('//div[@class="date"]/text()').extract()
item['text'] = sel.xpath('//span[@class="usercontent"]/p/text()').extract()

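This works fine. However, I want to restrict the second item to just the date, matching the regular expression (\d\d\.\d\d\.\d\d\d\d). Following the manual, I wrote this: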

item['date'] = sel.xpath('//div[@class="date"]/text()').re(r'\d\d\.\d\d\.\d\d\d\d').extract()

With that change it no longer works. I get the following error:

Error caught on signal handler: <bound method ?.close_spider of <scrapy.contrib.feedexport.FeedExporter object at ......

If I test it in the shell, the regular expression works fine. Any suggestions? Thanks a lot! I am using Windows 7 64-bit and Python 2.7.

Answer 1

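You simply don't need to call extract() after re(). re() already returns a plain list of unicode strings, not a selector list, and a Python list has no extract() method, so the chained call raises an AttributeError: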

item['date'] = sel.xpath('//div[@class="date"]/text()').re(r'\d\d\.\d\d\.\d\d\d\d')
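To see why the extra extract() fails, here is a minimal sketch that uses only the standard library re module to mimic what .re() appears to do: apply the pattern to each extracted text node and flatten the matches into one list. The sample strings and the helper name regex_over_nodes are made up for illustration and are not part of Scrapy.

```python
import re

# Hypothetical text nodes, standing in for what
# sel.xpath('//div[@class="date"]/text()').extract() might return.
text_nodes = [
    "Posted on 12.03.2014 by admin",
    "no date in this node",
    "Updated 01.04.2014",
]

DATE_PATTERN = re.compile(r"\d\d\.\d\d\.\d\d\d\d")


def regex_over_nodes(nodes, pattern):
    """Apply the pattern to every string and flatten all matches into one list.

    This is an illustrative stand-in for the selector's .re() method,
    not Scrapy's actual implementation.
    """
    matches = []
    for node in nodes:
        matches.extend(pattern.findall(node))
    return matches


dates = regex_over_nodes(text_nodes, DATE_PATTERN)
print(dates)

# The result is already a plain list of strings, so there is nothing
# left to extract; calling .extract() on it fails.
try:
    dates.extract()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

Nodes without a match simply contribute nothing, so item['date'] ends up holding only the date strings. If a single value is wanted rather than a list, indexing the result (for example dates[0]) is one option, provided the list is checked for emptiness first.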
