How to get the "next page" link in Scrapy

Date: 2016-09-03 13:30:16

Tags: python regex scrapy web-crawler

How do I get the "/zufang/dongcheng/pg2/" link from the "next page" (下一页) anchor?

<a href="/zufang/dongcheng/pg2/" data-page="2">下一页</a>

I tried the following, but it returned nothing.

The URL is http://bj.lianjia.com/zufang/dongcheng/

start_urls = (
    'http://bj.lianjia.com/zufang/dongcheng/',
)

rules = (
    Rule(
        LinkExtractor(allow=(), restrict_xpaths=('/html/body/div[4]/div[2]/div/div[2]/div[2]/a[5]',)),
        callback="parse",
        follow=True,
    ),
)


def parse(self, response):
            l = ItemLoader(item = ItjuziItem(),response=response)
            for i in range(0,len(response.xpath("//div[@class='info-panel']/h2/a/text()").extract())):
                    info = response.xpath("//div[@class='info-panel']/h2/a/text()").extract()[i].encode('utf-8')
                    local = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='region']/text()").extract()[i].encode('utf-8') 
                    house_layout = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='zone']//text()").extract()[i].encode('utf-8')
                    house_square = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='meters']/text()").extract()[i].encode('utf-8')
                    house_orientation = response.xpath("//div[@class='info-panel']").xpath(".//div[@class='where']//span/text()").extract()[(i + 1) * 4 - 1].encode('utf-8')
                    district = response.xpath("//div[@class='info-panel']").xpath(".//div[@class='con']/a/text()").extract()[i].encode('utf-8')[:-6]
                    floor = response.xpath("//div[@class='info-panel']").xpath(".//div[@class='con']//text()").extract()[(i + 1) * 5 - 3].encode('utf-8')
                    building_year = response.xpath("//div[@class='info-panel']").xpath(".//div[@class='con']//text()").extract()[(i + 1) * 5 - 1].encode('utf-8')
                    price_month = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='num']//text()").extract()[(i + 1) * 2 - 2].encode('utf-8')
                    person_views = response.xpath("//div[@class='info-panel']").xpath(".//span[@class='num']//text()").extract()[(i + 1) * 2 - 1].encode('utf-8')
                    tags = []
                    for j in range(0,len(response.xpath("//div[@class='view-label left']")[i].xpath(".//span//text()").extract())):
                            tags.append(response.xpath("//div[@class='view-label left']")[i].xpath(".//span//text()").extract()[j].encode("utf-8"))
                    l.add_value('info',info)
                    l.add_value('local',local)
                    l.add_value('house_layout',house_layout)
                    l.add_value('house_square',house_square)
                    l.add_value('house_orientation',house_orientation)
                    l.add_value('district',district)
                    l.add_value('floor',floor)
                    l.add_value('building_year',building_year)
                    l.add_value('price_month',price_month)
                    l.add_value('person_views',person_views)
                    l.add_value('tags',tags)
                    print l
                    return l.load_item()

2 Answers:

Answer 0 (score: 0)

Extract the link and join it with the current URL:

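from urlparse import urljoin  # Python 2; on Python 3: from urllib.parse import urljoin
next_href = response.xpath("//a[@data-page]/@href").extract_first()  # None if no such link
if next_href:
    yield scrapy.Request(urljoin(response.url, next_href), callback=self.parse)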
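
To illustrate the extract-and-join step, here is a minimal sketch that uses only the Python 3 standard library so it runs without Scrapy. It looks up the anchor by its visible text (下一页) rather than by position, then resolves the relative href against the page URL. The names `NextPageFinder` and `next_page_url` are hypothetical, and the sample HTML (including the page-1 anchor) is an assumption about what the pager might look like. Inside a spider, the equivalent would presumably be an XPath such as `//a[text()="下一页"]/@href` combined with `response.urljoin(...)`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextPageFinder(HTMLParser):
    """Records the href of the first <a> whose text equals `label`."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.next_href = None
        self._current_href = None  # href of the <a> currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        inside_link = self._current_href is not None
        if inside_link and self.next_href is None and data.strip() == self.label:
            self.next_href = self._current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


def next_page_url(page_url, html, label="下一页"):
    """Return the absolute URL of the 'next page' link, or None if absent."""
    finder = NextPageFinder(label)
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(page_url, finder.next_href)


# Hypothetical pager markup; only the second anchor comes from the question.
sample_html = (
    '<a href="/zufang/dongcheng/" data-page="1">1</a>'
    '<a href="/zufang/dongcheng/pg2/" data-page="2">下一页</a>'
)
print(next_page_url("http://bj.lianjia.com/zufang/dongcheng/", sample_html))
```

Matching on the link text avoids depending on how many `data-page` anchors precede the one you want, which a positional XPath like `a[5]` or a bare `//a[@data-page]` cannot guarantee.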

Or increment the page number:

for i in range(2,20):
    yield scrapy.Request("http://bj.lianjia.com/zufang/dongcheng/pg"+ str(i), callback=self.parse)
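
The page-number approach can also be expressed as a small helper that builds the URL list up front. This is only a sketch: `page_urls` is a hypothetical name, the trailing slash mirrors the href shown in the question, and the last page number is something you would have to know or guess (the loop above simply assumes 19).

```python
def page_urls(base_url, first_page, last_page):
    """Return listing URLs for pages first_page..last_page (inclusive)."""
    base = base_url.rstrip("/")  # avoid a double slash before "pg"
    return ["{}/pg{}/".format(base, n) for n in range(first_page, last_page + 1)]


for url in page_urls("http://bj.lianjia.com/zufang/dongcheng/", 2, 4):
    print(url)
```

In a spider, each of these URLs could then be passed to `scrapy.Request(url, callback=self.parse)`.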

Answer 1 (score: -1)

Why not just keep appending an incrementing page number to the URL? That works for this link.

url = 'http://bj.lianjia.com/zufang/dongcheng/'

counter = 1
while <http error is null>:
    # url already ends with "/", and counter is an int, so convert it with str()
    url_to_crawl = url + "pg" + str(counter) + "/"
    counter = counter + 1
    <crawl the url>
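
A runnable version of this loop might look like the sketch below. It assumes the site answers with a non-200 status once the page number runs past the last page, which the answer does not confirm. The names `crawl_until_error` and `fake_fetch` are hypothetical; `fetch` stands in for whatever actually downloads the page and returns its HTTP status code, and `max_pages` is a safety cap in case no error ever comes back.

```python
def crawl_until_error(base_url, fetch, max_pages=100):
    """Request pg1, pg2, ... until fetch() reports a non-200 status."""
    base = base_url.rstrip("/")
    crawled = []
    for counter in range(1, max_pages + 1):
        url_to_crawl = "{}/pg{}/".format(base, counter)
        status = fetch(url_to_crawl)
        if status != 200:
            break  # treat the first error as "no more pages"
        crawled.append(url_to_crawl)
    return crawled


def fake_fetch(url):
    # Stand-in for a real HTTP request: pretend only pages 1 to 3 exist.
    return 200 if url.rstrip("/").endswith(("pg1", "pg2", "pg3")) else 404


for url in crawl_until_error("http://bj.lianjia.com/zufang/dongcheng/", fake_fetch):
    print(url)
```

Passing the fetch function in as a parameter keeps the loop logic separate from the network call, so the stopping condition can be checked without hitting the site.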