Scrapy spider does not recursively crawl the next page

Date: 2017-08-25 01:24:14

Tags: python-2.7 scrapy

I am trying to build this crawler to get housing data from craigslist, but the scraper stops after fetching the first page and does not go on to the next page.

Here is the code. It works for the first page, but for the love of god I don't understand why it does not get to the next page. Any insight is really appreciated. I followed this part from the scrapy tutorial:

import scrapy
import re

from scrapy.linkextractors import LinkExtractor


class QuotesSpider(scrapy.Spider):
    name = "craigslistmm"
    start_urls = [
        "https://vancouver.craigslist.ca/search/hhh"
    ]

    def parse_second(self, response):
        # Gather everything passed along via the request meta into one dict.
        meta_dict = response.meta
        for q in response.css("section.page-container"):
            meta_dict["post_details"] = {
                "location": {
                    "longitude": q.css("div.mapAndAttrs div.mapbox div.viewposting::attr(data-longitude)").extract(),
                    "latitude": q.css("div.mapAndAttrs div.mapbox div.viewposting::attr(data-latitude)").extract()
                },
                "detailed_info": ' '.join(q.css('section#postingbody::text').extract()).strip()
            }
        return meta_dict

    def parse(self, response):
        pattern = re.compile(r"\/([a-z]+)\/([a-z]+)\/.+")
        for q in response.css("li.result-row"):
            post_urls = q.css("p.result-info a::attr(href)").extract_first()
            mm = re.match(pattern, post_urls)
            neighborhood = q.css("p.result-info span.result-meta span.result-hood::text").extract_first()

            next_url = "https://vancouver.craigslist.ca/" + post_urls
            request = scrapy.Request(next_url, callback=self.parse_second)
            #next_page = response.xpath('.//a[@class="button next"]/@href').extract_first()
            #follow_url = "https://vancouver.craigslist.ca/" + next_page
            #request1 = scrapy.Request(follow_url, callback=self.parse)
            #yield response.follow(next_page, callback=self.parse)

            request.meta['id'] = q.css("li.result-row::attr(data-pid)").extract_first()
            request.meta['pricevaluation'] = q.css("p.result-info span.result-meta span.result-price::text").extract_first()
            request.meta["information"] = q.css("p.result-info span.result-meta span.housing::text").extract_first()
            request.meta["neighborhood"] = q.css("p.result-info span.result-meta span.result-hood::text").extract_first()
            request.meta["area"] = mm.group(1)
            request.meta["adtype"] = mm.group(2)

            yield request
            #yield scrapy.Request(follow_url, callback=self.parse)

        next_page = LinkExtractor(allow=r"s=\d+").extract_links(response)[0]
        yield response.follow(next_page.url, callback=self.parse)

1 Answer:

Answer 0 (score: 0)

The problem seems to be related to the next_page extraction using LinkExtractor. If you look at the crawl log, you will see that duplicate requests are being filtered. There are more links on the page that satisfy your extraction rule, and they may not be extracted in any particular order (or not in the order you expect).
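
You can see this for yourself by inspecting what the extractor returns, for instance from a scrapy shell session on the search page (a minimal sketch; the exact URLs depend on the live page):

from scrapy.linkextractors import LinkExtractor

# Inside `scrapy shell "https://vancouver.craigslist.ca/search/hhh"`:
links = LinkExtractor(allow=r"s=\d+").extract_links(response)
for link in links:
    print(link.url)
# Multiple result-page URLs match allow=r"s=\d+", so links[0] is not
# guaranteed to be the "next page" link, and a request for a URL that
# was already visited is silently dropped by Scrapy's dupefilter.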

I think a better approach is to extract exactly the information you want. Try it with this:

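A minimal sketch of that approach, reusing the "button next" selector that already appears in the commented-out XPath in your question; the last two lines of parse become:

        # Extract the single "next page" link directly instead of taking
        # the first LinkExtractor match, which may be an already-seen URL.
        next_page = response.xpath('//a[@class="button next"]/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Note that response.follow accepts a relative href and joins it against the current page URL, so there is no need to prepend the domain by hand.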