log_count/ERROR when crawling a site with Scrapy

Asked: 2018-03-18 19:22:45

Tags: python web-scraping scrapy

I get the following log_count/ERROR when crawling a website with Scrapy. I can see that it made 43 requests and got 43 responses, so everything looks fine. What, then, is the error?

2018-03-19 00:31:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 18455,
 'downloader/request_count': 43,
 'downloader/request_method_count/GET': 43,
 'downloader/response_bytes': 349500,
 'downloader/response_count': 43,
 'downloader/response_status_count/200': 38,
 'downloader/response_status_count/301': 5,
 'dupefilter/filtered': 39,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 18, 15, 31, 30, 227072),
 'item_scraped_count': 11,
 'log_count/DEBUG': 56,
 'log_count/ERROR': 21,
 'log_count/INFO': 8,
 'memusage/max': 53444608,
 'memusage/startup': 53444608,
 'request_depth_max': 1,
 'response_received_count': 38,
 'scheduler/dequeued': 40,
 'scheduler/dequeued/memory': 40,
 'scheduler/enqueued': 40,
 'scheduler/enqueued/memory': 40,
 'spider_exceptions/AttributeError': 21,
 'start_time': datetime.datetime(2018, 3, 18, 15, 31, 20, 91856)}
2018-03-19 00:31:30 [scrapy.core.engine] INFO: Spider closed (finished)

Here is my spider code:

from scrapy import Spider
from scrapy.http import Request
import re

class EventSpider(Spider):
    name = 'event' #name of the spider
    allowed_domains = ['.....com']
    start_urls = ['http://.....com',
                  'http://.....com',
                  'http://.....com',
                  'http://.....com',]

    def parse(self, response):
        events = response.xpath('//h2/a/@href').extract()
        #events = response.xpath('//a[@class = "event-overly"]').extract()

        for event in events:
            absolute_url = response.urljoin(event)
            yield Request(absolute_url, callback=self.parse_event)

    def parse_event(self, response):
        title = response.xpath('//h1/text()').extract_first()
        start_date = response.xpath('//div/p/text()')[0].extract()
        start_date_final = re.search(r"^[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", start_date)
        #start_date_final2 = start_date_final.group(0)
        end_date = response.xpath('//div/p/text()')[0].extract()
        end_date_final = re.search(r"\s[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", end_date)
        email = response.xpath('//*[@id="more-email-with-dots"]/@value').extract_first()
        email_final = re.findall(r"[a-zA-Z0-9_.+-]+@(?!....)[\.[a-zA-Z0-9-.]+", email)
        description = response.xpath('//*[@class = "events-discription-block"]//p//text()').extract()
        start_time = response.xpath('//div/p/text()')[1].extract()
        venue = response.xpath('//*[@id ="more-text-with-dots"]/@value').extract()
        yield {
            'title': title,
            'start_date': start_date_final.group(0),
            'end_date': end_date_final.group(0),
            'start_time': start_time,
            'venue': venue,
            'email': email_final,
            'description': description
        }

I am completely new to the scraping world. How can I get past this error?

1 Answer:

Answer 0 (score: 0):

That output shows that 21 errors were logged; you can also see that all of them are AttributeErrors (note the 'spider_exceptions/AttributeError': 21 entry in the stats).

If you look at the rest of the log output, you will see the errors themselves:

Traceback (most recent call last):
  (...)
    'end_date': end_date_final.group(0),
AttributeError: 'NoneType' object has no attribute 'group'

From this, you can see that the regex for end_date_final never finds a match, so re.search returns None and calling .group(0) on it raises the AttributeError.
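
The answer stops at the diagnosis; one minimal way to avoid the crash (a sketch — `extract_date` is a hypothetical helper, not part of the original spider) is to check whether `re.search` found a match before calling `.group(0)`, yielding `None` for the field instead of raising:

```python
import re

def extract_date(text, pattern):
    """Return the matched date string, or None when the regex finds nothing."""
    match = re.search(pattern, text or "")
    return match.group(0) if match else None

# Same pattern the spider uses for end_date; the string here stands in for
# the scraped paragraph text, which in the failing pages never matched it.
end_date_pattern = r"\s[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}"
end_date = extract_date("Workshop runs 5th Mar 2018 onward", end_date_pattern)
missing = extract_date("no date on this page", end_date_pattern)  # None, no crash
```

Inside `parse_event` you would then yield `'end_date': extract_date(end_date, ...)` (and the same for `start_date`), so pages without a parsable date produce an item with a `None` field instead of 21 dropped items.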