How do I get the "/zufang/dongcheng/pg2/" link from the "下一页" (next page) anchor?
<a href="/zufang/dongcheng/pg2/" data-page="2">下一页</a>
I tried the following, but it returned nothing.
The URL is "http://bj.lianjia.com/zufang/dongcheng/".
start_urls = (
    'http://bj.lianjia.com/zufang/dongcheng/',
)
rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=('/html/body/div[4]/div[2]/div/div[2]/div[2]/a[5]',)),
         callback="parse", follow=True),
)
def parse(self, response):
    l = ItemLoader(item = ItjuziItem(),response=response)
    for i in range(0,len(response.xpath("//div[@class='info-panel']/h2/a/text()").extract())):
        info = response.xpath("//div[@class='info-panel']/h2/a/text()").extract()[i].encode('utf-8')
        local = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='region']/text()").extract()[i].encode('utf-8')
        house_layout = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='zone']//text()").extract()[i].encode('utf-8')
        house_square = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='meters']/text()").extract()[i].encode('utf-8')
        house_orientation = response.xpath("//div[@class='info-panel']").xpath(".//div[@class='where']//span/text()").extract()[(i + 1) * 4 - 1].encode('utf-8')
        district = response.xpath("//div[@class='info-panel']").xpath(".//div[@class='con']/a/text()").extract()[i].encode('utf-8')[:-6]
        floor = response.xpath("//div[@class='info-panel']").xpath(".//div[@class='con']//text()").extract()[(i + 1) * 5 - 3].encode('utf-8')
        building_year = response.xpath("//div[@class='info-panel']").xpath(".//div[@class='con']//text()").extract()[(i + 1) * 5 - 1].encode('utf-8')
        price_month = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='num']//text()").extract()[(i + 1) * 2 - 2].encode('utf-8')
        person_views = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='num']//text()").extract()[(i + 1) * 2 - 1].encode('utf-8')
        tags = []
        for j in range(0,len(response.xpath("//div[@class='view-label left']")[i].xpath(".//span//text()").extract())):
            tags.append(response.xpath("//div[@class='view-label left']")[i].xpath(".//span//text()").extract()[j].encode("utf-8"))
        l.add_value('info',info)
        l.add_value('local',local)
        l.add_value('house_layout',house_layout)
        l.add_value('house_square',house_square)
        l.add_value('house_orientation',house_orientation)
        l.add_value('district',district)
        l.add_value('floor',floor)
        l.add_value('building_year',building_year)
        l.add_value('price_month',price_month)
        l.add_value('person_views',person_views)
        l.add_value('tags',tags)
    print l
    return l.load_item()
Answer 0 (score: 0)
Extract the link and join it with the current URL:
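from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin
next_href = response.xpath("//a[@data-page]/@href").extract_first()
if next_href:  # extract_first() returns None when no link matches
    yield scrapy.Request(urljoin(response.url, next_href), callback=self.parse)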
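To see what "extract the link and join it" produces without running a crawl, here is a minimal sketch that uses only the Python 3 standard library instead of Scrapy. It assumes the next-page anchor can be recognised by its text, 下一页, as in the snippet from the question; `NextLinkParser` and `next_page_url` are names made up for this illustration. In Scrapy itself the same idea would normally be an XPath on the link text rather than a hand-written parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Remembers the href of the first <a> whose text equals `label`."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.next_href = None
        self._current_href = None  # href of the <a> that is currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Keep only the first anchor whose text matches the label.
        if self._current_href and self.next_href is None and data.strip() == self.label:
            self.next_href = self._current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


def next_page_url(page_url, html, label="下一页"):
    """Return the absolute URL of the next page, or None if no such link exists."""
    parser = NextLinkParser(label)
    parser.feed(html)
    parser.close()
    if parser.next_href is None:
        return None
    # urljoin resolves the site-relative href against the current page URL.
    return urljoin(page_url, parser.next_href)


html = '<a href="/zufang/dongcheng/pg2/" data-page="2">下一页</a>'
print(next_page_url("http://bj.lianjia.com/zufang/dongcheng/", html))
```

Matching on the link text avoids depending on the anchor's position in the page, which is what the absolute XPath in the question relies on.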
Or increment the page number:
for i in range(2, 20):
    yield scrapy.Request("http://bj.lianjia.com/zufang/dongcheng/pg" + str(i), callback=self.parse)
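Two details of the loop are easy to miss: `range(2, 20)` stops at page 19, and the href shown in the question ends with a trailing slash. A small sketch of a helper that makes both explicit follows; `page_urls` is a hypothetical name, and the last page number is an assumption the caller has to supply.

```python
def page_urls(base_url, first_page, last_page):
    """Return the listing URLs for pages first_page..last_page, inclusive."""
    base = base_url.rstrip("/") + "/"  # tolerate a base URL with or without "/"
    return [base + "pg" + str(n) + "/" for n in range(first_page, last_page + 1)]


for url in page_urls("http://bj.lianjia.com/zufang/dongcheng/", 2, 4):
    print(url)
```

Each returned URL could then be passed to `scrapy.Request` in place of the string concatenation in the loop.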
Answer 1 (score: -1)
Why not just keep appending an incrementing page number to the URL? That works for this link.
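url = 'http://bj.lianjia.com/zufang/dongcheng/'
counter = 1
while True:
    url_to_crawl = url + "pg" + str(counter) + "/"  # url already ends with "/"
    counter = counter + 1
    ok = crawl(url_to_crawl)  # crawl() is a placeholder for your own fetch-and-parse step
    if not ok:  # stop once the request returns an HTTP error
        break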
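The loop above can be sketched as a small function with the fetching step passed in, so the stopping rule can be tried without touching the network. Everything named here is an assumption for illustration: `crawl_until_error`, the `fetch` callable (which returns the page body, or None on an HTTP error), the `max_pages` safety cap, and the stand-in `fake_fetch`. Whether the site actually answers out-of-range pages with an HTTP error is not confirmed in this thread, which is why the cap is there.

```python
def crawl_until_error(base_url, fetch, max_pages=100):
    """Fetch pg1, pg2, ... until fetch() returns None or max_pages is reached."""
    base = base_url.rstrip("/") + "/"
    pages = []
    for counter in range(1, max_pages + 1):
        body = fetch(base + "pg" + str(counter) + "/")
        if body is None:  # treated as "HTTP error": stop paging
            break
        pages.append(body)
    return pages


def fake_fetch(url):
    """Stand-in fetcher for illustration: pretends only pages 1 to 3 exist."""
    existing = {"http://bj.lianjia.com/zufang/dongcheng/pg%d/" % n for n in (1, 2, 3)}
    return "body of " + url if url in existing else None


pages = crawl_until_error("http://bj.lianjia.com/zufang/dongcheng/", fake_fetch)
print(len(pages))
```

In a real crawl, `fetch` would wrap whatever HTTP client is in use and return None when the request fails.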