I get the following log_count/ERROR when crawling a website with Scrapy. I can see that it made 43 requests and received 43 responses, so everything looks fine. So what is the error?:
2018-03-19 00:31:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 18455,
'downloader/request_count': 43,
'downloader/request_method_count/GET': 43,
'downloader/response_bytes': 349500,
'downloader/response_count': 43,
'downloader/response_status_count/200': 38,
'downloader/response_status_count/301': 5,
'dupefilter/filtered': 39,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 18, 15, 31, 30, 227072),
'item_scraped_count': 11,
'log_count/DEBUG': 56,
'log_count/ERROR': 21,
'log_count/INFO': 8,
'memusage/max': 53444608,
'memusage/startup': 53444608,
'request_depth_max': 1,
'response_received_count': 38,
'scheduler/dequeued': 40,
'scheduler/dequeued/memory': 40,
'scheduler/enqueued': 40,
'scheduler/enqueued/memory': 40,
'spider_exceptions/AttributeError': 21,
'start_time': datetime.datetime(2018, 3, 18, 15, 31, 20, 91856)}
2018-03-19 00:31:30 [scrapy.core.engine] INFO: Spider closed (finished)
Here is my spider code:
from scrapy import Spider
from scrapy.http import Request
import re
class EventSpider(Spider):
    name = 'event'  # name of the spider
    allowed_domains = ['.....com']
    start_urls = ['http://.....com',
                  'http://.....com',
                  'http://.....com',
                  'http://.....com',]

    def parse(self, response):
        events = response.xpath('//h2/a/@href').extract()
        #events = response.xpath('//a[@class = "event-overly"]').extract()
        for event in events:
            absolute_url = response.urljoin(event)
            yield Request(absolute_url, callback=self.parse_event)

    def parse_event(self, response):
        title = response.xpath('//h1/text()').extract_first()
        start_date = response.xpath('//div/p/text()')[0].extract()
        start_date_final = re.search("^[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", start_date)
        #start_date_final2 = start_date_final.group(0)
        end_date = response.xpath('//div/p/text()')[0].extract()
        end_date_final = re.search("\s[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", end_date)
        email = response.xpath('//*[@id="more-email-with-dots"]/@value').extract_first()
        email_final = re.findall("[a-zA-Z0-9_.+-]+@(?!....)[\.[a-zA-Z0-9-.]+", email)
        description = response.xpath('//*[@class = "events-discription-block"]//p//text()').extract()
        start_time = response.xpath('//div/p/text()')[1].extract()
        venue = response.xpath('//*[@id ="more-text-with-dots"]/@value').extract()
        yield {
            'title': title,
            'start_date': start_date_final.group(0),
            'end_date': end_date_final.group(0),
            'start_time': start_time,
            'venue': venue,
            'email': email_final,
            'description': description
        }
I am completely new to the world of scraping. How do I get past this error?
Answer 0 (score: 0)
That output shows that 21 errors were logged; you can also see that all of them are AttributeErrors.
If you look at the rest of the log output, you will see the errors themselves:
Traceback (most recent call last):
(...)
'end_date': end_date_final.group(0),
AttributeError: 'NoneType' object has no attribute 'group'
From this you can see that the regular expression for end_date_final never finds a match: re.search() returns None when the pattern does not match, and calling .group(0) on None raises exactly this AttributeError.
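A straightforward fix is to check whether re.search() actually returned a match before calling .group(0). Below is a minimal sketch of that guard for parse_event, assuming the same XPaths and date regexes as in the question; the fallback value of None is illustrative, not the original author's code:

def parse_event(self, response):
    title = response.xpath('//h1/text()').extract_first()
    date_text = response.xpath('//div/p/text()')[0].extract()
    # re.search() returns None when the pattern does not match,
    # so guard before calling .group(0)
    start_match = re.search(r"^[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", date_text)
    end_match = re.search(r"\s[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", date_text)
    yield {
        'title': title,
        'start_date': start_match.group(0) if start_match else None,
        'end_date': end_match.group(0) if end_match else None,
    }

With this guard, pages whose first //div/p/text() node does not contain a date in the expected format yield None for those fields instead of crashing the callback; logging such URLs (for example with self.logger.warning) makes it easier to find the pages whose markup differs and adjust the XPath or regex accordingly.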