Why does Scrapy raise "Response content isn't text"?

Date: 2017-06-25 06:18:44

Tags: python scrapy

While using Scrapy to crawl a second-hand housing site (qhd.58.com), it frequently raises AttributeError: Response content isn't text, and I don't know how to fix it.

import scrapy
from scrapy.http import Request
from ershoufang.items import ErshoufangItem
from scrapy.http import HtmlResponse

from scrapy.selector import Selector

class HouseSpider(scrapy.Spider):
    name = 'house'
    allowed_domains = ['qhd.58.com', 'short.58.com', 'jxjump.58.com']
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    }

    def start_requests(self):
        pages = []

        for i in range(1, 11):
            url = 'http://qhd.58.com/ershoufang/pn' + str(i)
            page = scrapy.Request(url, headers=self.header)
            pages.append(page)
        return pages

    def parse(self, response):
        urls = response.xpath('//div[@class="list-info"]/h2[@class="title"]/a/@href').extract()
        for url in urls:
            yield scrapy.Request(url, headers=self.header, callback=self.parse_page)

    def parse_page(self, response):
        url = [response.url]
        sel = Selector(response)

        title = sel.xpath('//div[@class="house-title"]/h1[1]/text()').extract()
        updatetime = sel.xpath('//p[@class="house-update-info"]/span[@class="up"][1]/text()').extract()
        totalcount = sel.xpath('//p[@class="house-update-info"]/span[@class="up"][2]/em[@id="totalcount"]/text()').extract()
        totalprice = sel.xpath('//div[@class="general-item general-situation"]/div[@class="general-item-wrap"][1]/ul[@class="general-item-left"]/li[1]/span[2]/text()').extract()
        area = sel.xpath('//div[@class="general-item general-situation"]/div[@class="general-item-wrap"][1]/ul[@class="general-item-left"]/li[3]/span[2]/text()').extract()
        orientation = sel.xpath('//div[@class="general-item general-situation"]/div[@class="general-item-wrap"][1]/ul[@class="general-item-left"]/li[4]/span[2]/text()').extract()
    ...

I then set DOWNLOAD_DELAY, but it didn't help.
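(For context, DOWNLOAD_DELAY is normally configured in the project's settings.py. The snippet below is only an illustrative sketch; the exact values are not taken from my project.)

# settings.py -- illustrative values only
DOWNLOAD_DELAY = 2               # pause roughly 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # Scrapy then jitters the delay between 0.5x and 1.5x of DOWNLOAD_DELAY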

Here is part of my spider's log file:



2017-06-25 17:26:15 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ershoufang)
2017-06-25 17:26:15 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'output.log', 'SPIDER_MODULES': ['ershoufang.spiders'], 'BOT_NAME': 'ershoufang', 'NEWSPIDER_MODULE': 'ershoufang.spiders'}
2017-06-25 17:26:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2017-06-25 17:26:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-25 17:26:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-25 17:26:22 [scrapy.middleware] INFO: Enabled item pipelines:
['ershoufang.pipelines.ErshoufangPipeline']
2017-06-25 17:26:22 [scrapy.core.engine] INFO: Spider opened
2017-06-25 17:26:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-25 17:26:22 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-25 17:26:23 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://qhd.58.com/ershoufang/pn1/> from <GET http://qhd.58.com/ershoufang/pn1>
2017-06-25 17:26:23 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://qhd.58.com/ershoufang/pn3/> from <GET http://qhd.58.com/ershoufang/pn3>
2017-06-25 17:26:23 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://qhd.58.com/ershoufang/pn2/> from <GET http://qhd.58.com/ershoufang/pn2>
2017-06-25 17:26:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/pn2/> (referer: None)
2017-06-25 17:26:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/pn1/> (referer: None)
2017-06-25 17:26:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30421994065981x.shtml> (referer: http://qhd.58.com/ershoufang/pn2/)
2017-06-25 17:26:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/pn3/> (referer: None)
2017-06-25 17:26:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30501551684529x.shtml> (referer: http://qhd.58.com/ershoufang/pn1/)
2017-06-25 17:26:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30501320705462x.shtml> (referer: http://qhd.58.com/ershoufang/pn1/)
2017-06-25 17:26:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30501196128333x.shtml> (referer: http://qhd.58.com/ershoufang/pn1/)
2017-06-25 17:26:25 [scrapy.core.scraper] DEBUG: Scraped from <200 http://qhd.58.com/ershoufang/30421994065981x.shtml>

{'area': ['91㎡'],
 'averageprice': ['-1'],
 'buildtime': ['2013年'],
 'businesslocation': ['-1', '-1'],
 'carpark': ['-1'],
 'change_rate': ['-1'],
 'decoration': ['精装修'],
 'firstprice': ['10.5                                万(月供1005元/月)'],
 'floor': ['高层/共18层'],
 'forestrate': ['-1'],
 'manageprice': ['-1'],
 'orientation': ['南北'],
 'style': ['2室2厅1卫'],
 'time': ['2017-06-17更新'],
 'title': ['东山新天地南区两室两厅'],
 'totalprice': ['35万(单价 3846元/㎡)'],
 'url': ['http://qhd.58.com/ershoufang/30421994065981x.shtml'],
 'viewnum': ['0'],
 'volumnrate': ['-1']}
2017-06-25 17:26:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30443314256710x.shtml> (referer: http://qhd.58.com/ershoufang/pn2/)
2017-06-25 17:26:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30444218000066x.shtml> (referer: http://qhd.58.com/ershoufang/pn2/)
2017-06-25 17:26:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30500142161204x.shtml> (referer: http://qhd.58.com/ershoufang/pn1/)
2017-06-25 17:26:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30147536387657x.shtml> (referer: http://qhd.58.com/ershoufang/pn2/)
2017-06-25 17:26:26 [scrapy.core.scraper] DEBUG: Scraped from <200 http://qhd.58.com/ershoufang/30501551684529x.shtml>
...
...
2017-06-25 17:26:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30501215521224x.shtml?adtype=3> (referer: http://qhd.58.com/ershoufang/pn1/)
2017-06-25 17:26:31 [scrapy.core.scraper] ERROR: Spider error processing <GET http://qhd.58.com/ershoufang/30512902895416x.shtml?adtype=3> (referer: http://qhd.58.com/ershoufang/pn1/)
Traceback (most recent call last):
  File "c:\python\python35-32\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Python spider\ershoufang\ershoufang\spiders\house.py", line 32, in parse_page
    sel = Selector(response)
  File "c:\python\python35-32\lib\site-packages\scrapy\selector\unified.py", line 67, in __init__
    text = response.text
  File "c:\python\python35-32\lib\site-packages\scrapy\http\response\__init__.py", line 93, in text
    raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text
2017-06-25 17:26:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://jing.58.com/adJump?adType=3&target=pZwY0jCfsL7Cua3draOWUvYfugF1pAqduh78uzt1njEknjTzPH9dPj0Org980v6YUyk_nH0YnWbvP1TknHbvn1bvnjc3nHDvPjEkn10YsjNOnjE1PjcksakhUAqMuv-8gLR1ugFxpyEqnaubpgPkgvP6IANqnHchuA-107q_UvP6UjYQnj03FhP_pyR8I7qd0vRzgv-b5iu-UMwGIZ-GujY1njEknjTzPH9dPj0Oriud0vRzpyEqnWEdP19vn10hpyd-pHYhuyOYpgwOIZ-kuHYkFhR8IA-YXRqWmgw-5iu-UMwGIZ-xUAqWmykqFhwG0LKxIA-VuHYQPjb3n19zP19zrHTOFMKf0v-Ypyq85HDkriuWUA-Wpv-b5HbOn16BmH6BsyndPjcVPAEzPBdBmh7hsH7bryuBuHwbm1TOPBukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LwluaukmyI-gLwO0ANqnikQnk&end=end> from <GET http://short.58.com/zd_p/9938ba8b-c542-4d26-bbaf-1d9fbe4dc096/?target=reu-16-xgk_psfegvimob_84544431615022q-feykn&end=end>
2017-06-25 17:26:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://qhd.58.com/ershoufang/30509970982594x.shtml?adtype=3> (referer: http://qhd.58.com/ershoufang/pn1/)
2017-06-25 17:26:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://jing.58.com/adJump?adType=3&target=pZwY0jCfsL7Cua3draOWUvYfugF1pAqduh78uzt1njE3P1T1rHTdnHT3PZ980v6YUyk_nH0YnWbvP1TknHbvn1bvnjc3nHDvPjEkn10YsjNOnjNdrjbYsakhUAqMuv-8gLR1ugFxpyEqnaubpgPkgvP6IANqnHchuA-107q_UvP6UjYQnj03FhP_pyR8I7qd0vRzgv-b5iu-UMwGIZ-GujY1njE3P1T1rHTdnHT3Paud0vRzpyEqPjNOPH0QnWDLP19QrHNhpyd-pHYhuyOYpgwOIZ-kuHYkFhR8IA-YXRqWmgw-5iu-UMwGIZ-xUAqWmykqFhwG0LKxIA-VuHYQPjb3n19zP19zrHTOFMKf0v-Ypyq85HDQniuWUA-Wpv-b5HRbmWmLuHm3sHEOPHnVPjn3Pzd6uj6-sHIbuH7WnHNvP1--PzukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LwluaukmyI-gLwO0ANqnikQnk&end=end> from <GET http://short.58.com/zd_p/5db67e68-4953-4387-ad8e-7de1c15679e7/?target=reu-16-xgk_psfegvimob_84560482417465q-feykn&end=end>
2017-06-25 17:26:31 [scrapy.core.scraper] ERROR: Spider error processing <GET http://qhd.58.com/ershoufang/30512968966456x.shtml?adtype=3> (referer: http://qhd.58.com/ershoufang/pn1/)
Traceback (most recent call last):
  File "c:\python\python35-32\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Python spider\ershoufang\ershoufang\spiders\house.py", line 32, in parse_page
    sel = Selector(response)
  File "c:\python\python35-32\lib\site-packages\scrapy\selector\unified.py", line 67, in __init__
    text = response.text
  File "c:\python\python35-32\lib\site-packages\scrapy\http\response\__init__.py", line 93, in text
    raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text
2017-06-25 17:26:31 [scrapy.core.scraper] ERROR: Spider error processing <GET http://qhd.58.com/ershoufang/30497621522098x.shtml?adtype=3> (referer: http://qhd.58.com/ershoufang/pn1/)
Traceback (most recent call last):
  File "c:\python\python35-32\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Python spider\ershoufang\ershoufang\spiders\house.py", line 32, in parse_page
    sel = Selector(response)
  File "c:\python\python35-32\lib\site-packages\scrapy\selector\unified.py", line 67, in __init__
    text = response.text
  File "c:\python\python35-32\lib\site-packages\scrapy\http\response\__init__.py", line 93, in text
    raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text
2017-06-25 17:26:31 [scrapy.core.scraper] ERROR: Spider error processing <GET http://qhd.58.com/ershoufang/30488972065477x.shtml?adtype=3> (referer: http://qhd.58.com/ershoufang/pn1/)
Traceback (most recent call last):
  File "c:\python\python35-32\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python\python35-32\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Python spider\ershoufang\ershoufang\spiders\house.py", line 32, in parse_page
    sel = Selector(response)
  File "c:\python\python35-32\lib\site-packages\scrapy\selector\unified.py", line 67, in __init__
    text = response.text
  File "c:\python\python35-32\lib\site-packages\scrapy\http\response\__init__.py", line 93, in text
    raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text
...
...
2017-06-25 17:54:34 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-25 17:54:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 43658,
 'downloader/request_count': 104,
 'downloader/request_method_count/GET': 104,
 'downloader/response_bytes': 1098928,
 'downloader/response_count': 104,
 'downloader/response_status_count/200': 99,
 'downloader/response_status_count/301': 2,
 'downloader/response_status_count/403': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 25, 9, 54, 34, 774987),
 'item_scraped_count': 100,
 'log_count/DEBUG': 205,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 102,
 'scheduler/dequeued': 104,
 'scheduler/dequeued/memory': 104,
 'scheduler/enqueued': 104,
 'scheduler/enqueued/memory': 104,
 'start_time': datetime.datetime(2017, 6, 25, 9, 54, 23, 140286)}
2017-06-25 17:54:34 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 1)

If you are dealing with a compressed response, it can sometimes help to inspect the raw response.body and decompress it manually:

import zlib

zlib.decompress(response.body, 15 + 32)
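
In the log above, every traceback is raised for an ad listing (a URL ending in ?adtype=3), whose content Scrapy apparently does not expose as text, so Selector(response) fails the moment it reads response.text. A defensive workaround, sketched below on the assumption that the regular listing pages are fine, is to skip any response that is not a TextResponse before parsing (the logger call and the single xpath line are illustrative, the rest of the spider stays as in the question):

import scrapy
from scrapy.http import TextResponse

class HouseSpider(scrapy.Spider):
    # ... name, allowed_domains, header, start_requests and parse as in the question ...

    def parse_page(self, response):
        # Only TextResponse objects expose response.text; building a Selector
        # from a plain Response raises "Response content isn't text".
        if not isinstance(response, TextResponse):
            self.logger.warning('Skipping non-text response: %s', response.url)
            return
        # TextResponse supports .xpath() directly, so Selector(response) is unnecessary.
        title = response.xpath('//div[@class="house-title"]/h1[1]/text()').extract()
        # ... remaining fields exactly as in the original parse_page ...

Alternatively, the ad URLs could be filtered out in parse() before the request is ever yielded, for example by skipping any href that contains 'adtype'.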