Scrapy does not return the correct XPath result for img

Time: 2016-10-28 07:09:44

Tags: xpath scrapy scrapy-spider

Chrome's XPath Helper returns the correct image links on http://tieba.baidu.com/f?kw=dota2&fr=index. The same XPath in my Scrapy spider returns only empty strings, as this log shows:

> E:\ladder\tieba\tieba\spiders\tiebaSpiber.py:11: ScrapyDeprecationWarning: tieba.spiders.tiebaSpiber.tiebaSpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
  class tiebaSpider(BaseSpider):
img_url:
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

Spider code:

class tiebaSpider(BaseSpider):
    name = "tiebaSpider"
    allowed_domains = ["tieba.baidu.com"]
    download_delay = 1
    start_urls = ["http://tieba.baidu.com/f?ie=utf-8&kw=dota2", ]

    rules = (
        Rule(LinkExtractor(allow=(r'http://tieba.baidu.com/f?kw=dota2&ie=utf-8&pn=')), callback='parse_tieba',
             follow=True),
    )

    def parse_tieba(self, response):
        self.log("Fetch Dota2 Tieba Page:%s" % response.url)
        sel = Selector(response)

        rep_num = sel.xpath('//span[@class="threadlist_rep_num center_text"]/text()').extract()
        title = sel.xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a/text()').extract()
        author = sel.xpath('//span[@class="frs-author-name-wrap"]/a/text()').extract()
        img_url = sel.xpath('//div[@class="threadlist_text pull_left"]//div[@class="small_wrap j_small_wrap"]//a[@class="thumbnail vpic_wrap"]/img/@src').extract()

        item = TiebaItem()
        item['rep_num'] = [n for n in rep_num]
        item['title'] = [n for n in title]
        item['author'] = [n for n in author]
        item['img_url'] = [n for n in img_url]

        print("img_url:\n")
        print(img_url)
        yield item

1 answer:

Answer 0 (score: 0)

If you inspect the HTML actually received from the web server, you will notice that the src attribute of the <img> elements is empty:

$ scrapy shell 'http://tieba.baidu.com/f?kw=dota2&fr=index'
2016-10-28 11:13:58 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)

2016-10-28 11:14:00 [scrapy] DEBUG: Crawled (200) <GET http://tieba.baidu.com/f?kw=dota2&fr=index> (referer: None)

>>> print(response.xpath('//div[@class="threadlist_text pull_left"]//div[@class="small_wrap j_small_wrap"]//a[@class="thumbnail vpic_wrap"]').extract_first())
<a class="thumbnail vpic_wrap"><img src="" attr="71814" data-original="http://imgsrc.baidu.com/forum/wh%3D135%2C90/sign=d25862d404d79123e0b59c759e0175bb/a92cb751f3deb48f948c9302f81f3a292ff5785e.jpg" bpic="http://imgsrc.baidu.com/forum/pic/item/a92cb751f3deb48f948c9302f81f3a292ff5785e.jpg" class="threadlist_pic j_m_pic "></a>
>>> 

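But you may also notice that the data-original attribute looks more interesting. It holds the real thumbnail URL, which the page presumably copies into src with JavaScript once it runs in a browser; that would explain why XPath Helper sees a value and Scrapy does not: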

>>> from pprint import pprint
>>> pprint(response.xpath('//div[@class="threadlist_text pull_left"]//div[@class="small_wrap j_small_wrap"]//a[@class="thumbnail vpic_wrap"]/img/@data-original').extract())
[u'http://imgsrc.baidu.com/forum/wh%3D135%2C90/sign=d25862d404d79123e0b59c759e0175bb/a92cb751f3deb48f948c9302f81f3a292ff5785e.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C90%3Bcrop%3D0%2C0%2C90%2C90/sign=4909678ffe246b607b5bba7ddbd4237c/9f396e094b36acafd9ddaf2074d98d1000e99c07.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C180%3Bcrop%3D0%2C0%2C90%2C90/sign=6d1bc479d943ad4ba67b4ec9b22e6b97/5c2c493d269759ee89455917bafb43166c22df2f.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C90/sign=46c1cc9483d4b31cf0699cb2b7fa1e4f/bd862d2ac65c10385f6f1915ba119313b17e892e.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D91%2C90/sign=de722bda78cf3bc7e855c5e5e02c8391/accf9e18367adab4f396cc9483d4b31c8501e4fe.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C90/sign=9549bad85182b2b7a7ca31cd0181f2df/9dc1673e6709c93d44c22c2b973df8dcd000540b.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C160%3Bcrop%3D0%2C0%2C90%2C90/sign=1361b72e751ed21b799c26ec9d42ecf2/caf91f0828381f307dd1ab75a1014c086c06f07c.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C160%3Bcrop%3D0%2C0%2C90%2C90/sign=003bc7ff7bf082022dc799367bd7cadb/0d38256d55fbb2fbee667bce474a20a44423dcf7.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C160%3Bcrop%3D0%2C0%2C90%2C90/sign=c30aaadd546034a829b7b088fb3f7862/c3fdcc0735fae6cd21a688bd07b30f2443a70f35.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C159%3Bcrop%3D0%2C0%2C90%2C90/sign=8ff4a1d85182b2b7a7ca31cd0181fada/3857980a19d8bc3e9f853f168a8ba61eaad345b6.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C159%3Bcrop%3D0%2C0%2C90%2C90/sign=fd928ccac7fc1e17fdea84387abcc736/5d2188529822720eb2c8d92673cb0a46f31fab3a.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C159%3Bcrop%3D0%2C0%2C90%2C90/sign=4cb4bdf006f41bd5da06e0fd61f6b0fe/6410b912c8fcc3cef25793e89a45d688d53f2051.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C159%3Bcrop%3D0%2C0%2C90%2C90/sign=f5025694962f07085f502209d90889ac/7ce22c9b033b5bb5bb3e64253ed3d539b400bc52.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D145%2C90/sign=86be70ceb4315c6043c063eeb984e72a/241923c79f3df8dcb740534ac511728b451028c6.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C90%3Bcrop%3D0%2C0%2C90%2C90/sign=f5d7eef34934970a47261826a5e6e8f8/c3fdcc0735fae6cd268d8dbd07b30f2443a70f02.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C120%3Bcrop%3D0%2C0%2C90%2C90/sign=2d5511d753b5c9ea62a60beae5158732/08b62ca85edf8db1d2dd5bb80123dd54544e7454.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D136%2C90/sign=24d6709da751f3dec3e7b165a7d8dc26/64983d1f95cad1c81e464470773e6709c83d513a.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C106%3Bcrop%3D0%2C0%2C90%2C90/sign=b985bccac3ef76093c5e91961ef192fc/fc05e51f4134970a853fa8789dcad1c8a6865d6b.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C90%3Bcrop%3D0%2C0%2C90%2C90/sign=04db575ec1ea15ce41bbe800862c03c3/edee83504fc2d56282d5e936ef1190ef74c66c65.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C90%3Bcrop%3D0%2C0%2C90%2C90/sign=e97ea5f2a0d3fd1f365caa3300621c2f/5df2b318972bd4075e0fe52173899e510db30973.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C110%3Bcrop%3D0%2C0%2C90%2C90/sign=e9f47f005ce736d158468401ab7c7ef3/99c76a8b4710b9129906c722cbfdfc0390452278.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D159%2C90/sign=0034aae273ec54e741b9121f8c01b769/02988a58d109b3debd89c3b4c4bf6c81810a4c09.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D160%2C90/sign=19b0661bf4039245a1e0e90eb1a488fb/d6d442afa40f4bfb21930e820b4f78f0f53618ff.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D160%2C90/sign=06e4f2d0a4af2eddd4a441e8bb202dd0/cdf3a4315c6034a82e323457c31349540b23766e.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C159%3Bcrop%3D0%2C0%2C90%2C90/sign=743f9e9fa98b87d65017a3163724190d/348f3d2dd42a2834f3b81aa553b5c9ea14cebf5c.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C159%3Bcrop%3D0%2C0%2C90%2C90/sign=a3aea53fc511728b3078842bf8d0f2fb/4ac19282b9014a9084ad6c13a1773912b11beee7.jpg',
 u'http://imgsrc.baidu.com/forum/wh%3D90%2C159%3Bcrop%3D0%2C0%2C90%2C90/sign=e82781cf07b30f2435cfe40af8b9e076/56de63f40ad162d9617d48b219dfa9ec8813cde7.jpg']
>>> 

Try selecting @data-original instead of @src: img_url = sel.xpath('//div[@class="threadlist_text pull_left"]//div[@class="small_wrap j_small_wrap"]//a[@class="thumbnail vpic_wrap"]/img/@data-original').extract()
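
If some thumbnails on a page carry a normal src while others are lazy-loaded, it may help to prefer data-original and fall back to src. Below is a minimal, dependency-free sketch of that idea using the standard library's html.parser rather than Scrapy's selectors, so it can be run without a crawl. The names ThumbnailCollector, collect_image_urls and SAMPLE_HTML, as well as the example.com URLs, are made up for illustration and do not come from the page above.

```python
from html.parser import HTMLParser


class ThumbnailCollector(HTMLParser):
    """Collect one URL per <img>, preferring data-original over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Lazy-loaded markup leaves src empty and keeps the real URL in
        # data-original, so try that attribute first and fall back to src.
        url = attr_map.get("data-original") or attr_map.get("src")
        if url:
            self.urls.append(url)


def collect_image_urls(html):
    """Return the image URLs found in an HTML string (hypothetical helper)."""
    parser = ThumbnailCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


# Hypothetical markup shaped like the <img> tag shown in the shell session.
SAMPLE_HTML = (
    '<a class="thumbnail vpic_wrap">'
    '<img src="" data-original="http://example.com/lazy.jpg" class="threadlist_pic j_m_pic ">'
    '</a>'
    '<a class="thumbnail vpic_wrap"><img src="http://example.com/plain.jpg"></a>'
)

print(collect_image_urls(SAMPLE_HTML))
```

Inside the spider itself the one-line XPath change above should be enough; this sketch only illustrates the fallback logic, which skips any <img> where both attributes are empty.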