I am using HtmlResponse and Selector. The HtmlResponse shows the site as <200 "site">, but when I inspect Selector(response) it shows <Selector xpath=None data=u'<html></html>'>, even though the HtmlResponse itself prints this:
<200 http://www.tripadvisor.in/Hotel_Review-g3581633-d2290190-Reviews-Corbett_Tr
eetop_Riverview-Marchula_Jim_Corbett_National_Park_Uttarakhand.htmlhttp://www.tr
ipadvisor.in/Hotel_Review-g297600-d8029162-Reviews-Daman_Casa_Tesoro-Daman_Daman
_and_Diu.html>
Code:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from collections import OrderedDict
import json
from scrapy.selector.lxmlsel import HtmlXPathSelector
import csv
import scrapy
from scrapy.http import HtmlResponse

class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    # base_uri = ["tripadvisor.in"]

    def start_requests(self):
        site_array=["http://www.tripadvisor.in/Hotel_Review-g3581633-d2290190-Reviews-Corbett_Treetop_Riverview-Marchula_Jim_Corbett_National_Park_Uttarakhand.html"
                    "http://www.tripadvisor.in/Hotel_Review-g297600-d8029162-Reviews-Daman_Casa_Tesoro-Daman_Daman_and_Diu.html",
                    "http://www.tripadvisor.in/Hotel_Review-g304557-d2519662-Reviews-Darjeeling_Khushalaya_Sterling_Holidays_Resort-Darjeeling_West_Bengal.html",
                    "http://www.tripadvisor.in/Hotel_Review-g319724-d3795261-Reviews-Dharamshala_The_Sanctuary_A_Sterling_Holidays_Resort-Dharamsala_Himachal_Pradesh.html",
                    "http://www.tripadvisor.in/Hotel_Review-g1544623-d8029274-Reviews-Dindi_By_The_Godavari-Nalgonda_Andhra_Pradesh.html"]
        for i in range(len(site_array)):
            response = HtmlResponse(site_array[i])
            sels = Selector(response)
            sites = sels.xpath('//a[contains(text(), "Next")]/@href').extract()
            print "________________________________________________________________"
            print sels
            print "________________________________________________________________"
            if(sites and len(sites) > 0):
                for site in sites:
                    yield Request(site_array[i],self.parse)
Answer 0 (score: 1)
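As mentioned here, you are not setting the body of the Response object. Constructing an HtmlResponse from a URL alone does not download anything, so the Selector has only an empty document to parse.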
Why not yield new Request objects with the URLs from site_array and let Scrapy fetch them? What you are currently doing will not work.
Of course, in that case you need to adapt your parse method or write a new one and pass it as the callback to the Request (I would go with the second option).