Selector returns nothing in Scrapy for Python

Asked: 2015-07-27 08:44:20

Tags: python web-scraping scrapy

I am using HtmlResponse together with Selector. HtmlResponse returns <200 "site"> for the website, but when I inspect Selector(response) it shows <Selector xpath=None data=u'<html></html>'>, even though HtmlResponse returns this:

<200 http://www.tripadvisor.in/Hotel_Review-g3581633-d2290190-Reviews-Corbett_Treetop_Riverview-Marchula_Jim_Corbett_National_Park_Uttarakhand.htmlhttp://www.tripadvisor.in/Hotel_Review-g297600-d8029162-Reviews-Daman_Casa_Tesoro-Daman_Daman_and_Diu.html>

Code:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from collections import OrderedDict
import json
from scrapy.selector.lxmlsel import HtmlXPathSelector
import csv
import scrapy
from scrapy.http import HtmlResponse

class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    # base_uri = ["tripadvisor.in"]

    def start_requests(self):
        # NOTE: there is no comma after the first URL, so Python concatenates
        # it with the second one (visible in the output shown above).
        site_array=["http://www.tripadvisor.in/Hotel_Review-g3581633-d2290190-Reviews-Corbett_Treetop_Riverview-Marchula_Jim_Corbett_National_Park_Uttarakhand.html"
                    "http://www.tripadvisor.in/Hotel_Review-g297600-d8029162-Reviews-Daman_Casa_Tesoro-Daman_Daman_and_Diu.html",
                    "http://www.tripadvisor.in/Hotel_Review-g304557-d2519662-Reviews-Darjeeling_Khushalaya_Sterling_Holidays_Resort-Darjeeling_West_Bengal.html",
                    "http://www.tripadvisor.in/Hotel_Review-g319724-d3795261-Reviews-Dharamshala_The_Sanctuary_A_Sterling_Holidays_Resort-Dharamsala_Himachal_Pradesh.html",
                    "http://www.tripadvisor.in/Hotel_Review-g1544623-d8029274-Reviews-Dindi_By_The_Godavari-Nalgonda_Andhra_Pradesh.html"]

        for i in range(len(site_array)):
            # HtmlResponse is built by hand here; nothing is downloaded.
            response = HtmlResponse(site_array[i])
            sels = Selector(response)
            sites = sels.xpath('//a[contains(text(), "Next")]/@href').extract()
            print "________________________________________________________________"
            print sels
            print "________________________________________________________________"
            if(sites and len(sites) > 0):
                for site in sites:
                    yield Request(site_array[i],self.parse)

1 answer:

Answer 0 (score: 1)

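As explained here, you never set the body of the Response object. HtmlResponse(url) only builds a response object in memory; it does not download anything, so the selector has nothing to parse.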

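Why not yield new Request objects with the URLs from site_array and let Scrapy fetch them? What you are doing at the moment cannot work.

Of course, in that case you need to adapt your parse method, or write a new method and pass it as the callback to the Request (I would go with the second option).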