scrapy: error: exceptions.AttributeError: 'Response' object has no attribute 'xpath'

Date: 2015-06-11 10:06:35

Tags: python xpath web-scraping scrapy scrapy-spider

I wrote a spider to crawl some pages. Sometimes it works, but sometimes it doesn't.

Here is the problem:

Traceback (most recent call last):
  File "/home/hadoop/scrapy/myapp/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/hadoop/scrapy/myapp/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
    taskObj._oneWorkUnit()
  File "/home/hadoop/scrapy/myapp/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
    result = next(self._iterator)
  File "/home/hadoop/scrapy/myapp/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/home/hadoop/scrapy/myapp/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/home/hadoop/scrapy/myapp/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
    for x in result:
  File "/home/hadoop/scrapy/myapp/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/hadoop/scrapy/myapp/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/hadoop/scrapy/myapp/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/hadoop/scrapy/myapp/scrapy-redis-master/soufang/soufang/spiders/soufang_spider.py", line 28, in parse_community
    temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/div[@class='ewmBoxTitle']/span[@class='floatl']/text()").extract()
exceptions.AttributeError: 'Response' object has no attribute 'xpath'

I have two questions:

1. What causes the AttributeError?

2. I wrote a downloader middleware to log all exceptions and record the related URLs in a text file named outurl_record.txt. Why didn't it catch this AttributeError and the URL?

Here are my middleware.py and settings:

from scrapy import log  # assumed import; the snippet calls Scrapy's (pre-1.0) log module


class CustomRecordMiddleware(object):

    def process_exception(self, request, exception, spider):
        # Record the failing URL and the proxy it was routed through.
        url = request.url
        proxy = request.meta['proxy']
        myfile = open('outurl_record.txt', 'a')
        myfile.write(url + '\n')
        myfile.write(proxy + '\n')
        myfile.close()
        log.msg('Fail to request url %s with exception %s' % (url, str(exception)))

Here are the settings:

DOWNLOADER_MIDDLEWARES = {'soufang.misc.middleware.CustomRecordMiddleware': 860,}
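
(An aside on question 2: process_exception in a downloader middleware only fires for exceptions raised while the request is being downloaded. The AttributeError above is raised later, inside the spider callback, so this middleware never sees it. A minimal sketch of logging such parse-time failures from the callback itself, using a hypothetical safe_xpath helper:)

def safe_xpath(response, query):
    # A plain Response (e.g. from an empty body) has no .xpath(); log the URL
    # to the same record file and return an empty list instead of crashing.
    try:
        return response.xpath(query).extract()
    except AttributeError:
        myfile = open('outurl_record.txt', 'a')
        myfile.write(response.url + '\n')
        myfile.close()
        return []

Each response.xpath(...).extract() call in the spider below could then become safe_xpath(response, ...).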

And here is spider.py:

# -*- coding: utf-8 -*-

import scrapy
from soufang.items import Community_info
import sys
import re
from scrapy_redis.spiders import RedisSpider

# reload() is a builtin in Python 2 (imp.reload is the Python 3 spelling),
# so no extra import is needed for the setdefaultencoding hack below.
reload(sys)
sys.setdefaultencoding("utf-8")


class soufangSpider(RedisSpider):

    name = 'soufang_redis'
    redis_key = 'soufangSpider:start_urls'


    def parse_community(self,response):     

        item = response.meta['item']

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/div[@class='ewmBoxTitle']/span[@class='floatl']/text()").extract()
        item['community'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='开 发 商:']/../text()").extract()
        item['developer'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='所属区域:']/../text()").extract()
        if temp :
            item['district'] = temp[0]
        else:
            item['district'] = ''
            myfile = open('outurl_item.txt', 'a')
            myfile.write(response.url)
            myfile.write('\n')
            myfile.close()

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='小区地址:']/../text()").extract()
        item['address'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='邮${nbsp}${nbsp}${nbsp}${nbsp}编:']/../text()").extract()
        item['postcode'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='竣工时间:']/../text()").extract()
        item['yearOfDev'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='firstpic']/dd[text()='本月均价:']/span[1]/text()").extract()
        item['price'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='总 户 数:']/../text()").extract()
        item['household_no'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='物业类别:']/../text()").extract()
        item['community_type'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='物 业 费:']/../text()").extract()
        item['property_fee'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='建筑面积:']/../text()").extract()
        item['total_area'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='占地面积:']/../text()").extract()
        item['area'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='绿 化 率:']/../text()").extract()
        item['greening_rate'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/dl[@class='lbox']/dd/strong[text()='容 积 率:']/../text()").extract()
        item['volumn_rate'] = temp[0] if temp else ''

        temp = response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/div[@class='yihang']/h3[text()='交通状况']/../following-sibling::dl[1]/dt[1]/text()").extract()
        item['transportation'] = temp[0] if temp else ''

        temp = "".join(response.xpath(u"//div[@class='maininfo']/div[@class='leftinfo']/div[@class='yihang']/h3[text()='周边信息']/../following-sibling::dl[1]//text()").extract())
        item['periphery'] = temp if temp else ''

        yield item



    def parse(self,response):

        flag = response.xpath(u"//div[@class='wid1000']/div[@class='listBox floatl']/div[@class='houseList']").extract()
        city = response.xpath(u"//div[@class='wid1000']/div[@class='bread']/a[2]/text()").extract()

        if flag and city :
            item = Community_info()
            item['city'] = city[0][:-3]
            urls = response.xpath(u"//div[@class='info rel floatl ml15']/dl/dd[last()]/a[1]/@href").extract()
            if len(urls)==0:
                myfile = open('urls0.txt','a')
                myfile.write(response.url+'\n')
                myfile.close()
            next_page = response.xpath(u"//div[@class='listBox floatl']/div[@class='fanye gray6']/a[text()='下一页']/@href").extract()
            if next_page:
                pageurl = next_page[0]
                # response.urljoin(pageurl) (Scrapy 1.0+) would be a more robust way to build this
                fullpage = re.match(r'http://.+com', response.url).group() + pageurl
                yield scrapy.Request(fullpage, callback=self.parse)
            for url in urls :
                try:
                    request = scrapy.Request(url,callback=self.parse_community)
                    request.meta['item'] = item
                    yield request
                except Exception,e:
                    myfile = open('badurl_item.txt','a')
                    myfile.write(response.url+'\n')
                    myfile.write(url+'\n')
                    myfile.close()
        else:
            myfile = open('outurl_break', 'a')
            myfile.write(response.url + '\n')
            myfile.close()
            # without dont_filter=True this retry would be dropped by the dupefilter
            yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)

1 Answer:

Answer 0 (score: 1)

I'll answer the first question: what causes the AttributeError?

It's because the page is empty. When the response body is empty (or the Content-Type doesn't identify it as text), Scrapy builds a plain Response object instead of an HtmlResponse, and the plain Response class has no .xpath() method.

For example, in scrapy shell:

In [1]: fetch("http://diachiso.vn/Shop/CityPage_LoadShopBySubServiceIdAndFilterId?pageindex=6&pagesize=20&sid=256&fid=&cityid=3&parentSid=520")
2015-10-11 11:22:55 [scrapy] INFO: Spider opened
2015-10-11 11:22:55 [scrapy] DEBUG: Crawled (200) <GET http://diachiso.vn/Shop/CityPage_LoadShopBySubServiceIdAndFilterId?pageindex=6&pagesize=20&sid=256&fid=&cityid=3&parentSid=520> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fadf4f9cb50>
[s]   item       {}
[s]   request    <GET http://diachiso.vn/Shop/CityPage_LoadShopBySubServiceIdAndFilterId?pageindex=6&pagesize=20&sid=256&fid=&cityid=3&parentSid=520>
[s]   response   <200 http://diachiso.vn/Shop/CityPage_LoadShopBySubServiceIdAndFilterId?pageindex=6&pagesize=20&sid=256&fid=&cityid=3&parentSid=520>
[s]   settings   <scrapy.settings.Settings object at 0x7fadf4f9cad0>
[s]   spider     <DefaultSpider 'default' at 0x7fadf232a710>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [2]: response.
response.body     response.flags    response.meta     response.request  response.url      
response.copy     response.headers  response.replace  response.status   response.urljoin  

In [2]: response.status
Out[2]: 200

In [6]: response.body
Out[6]: ''
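
Given that, one possible guard at the top of the callback, sketched under the assumption that simply retrying the URL is acceptable: check that Scrapy actually built a text response before calling .xpath().

from scrapy.http import TextResponse

def parse_community(self, response):
    # .xpath() is only available on TextResponse subclasses such as
    # HtmlResponse; an empty body makes Scrapy fall back to a plain Response.
    if not isinstance(response, TextResponse):
        yield scrapy.Request(response.url, callback=self.parse_community,
                             meta=response.meta, dont_filter=True)
        return
    # ... original extraction logic continues here ...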