Scrapy Craigslist script

Date: 2016-03-12 16:35:55

Tags: python html xpath scrapy craigslist

I want to create a Scrapy script that scrapes all the results for computer gigs from any Craigslist subdomain, for example: http://losangeles.craigslist.org/search/cpg/. This query returns a long list of postings, and I tried to use CrawlSpider and LinkExtractor to grab the title and href of every result (not only those on the first page), but the script returns nothing. I'll paste my script below. Thanks.

    import scrapy
    from scrapy.spiders import Rule,CrawlSpider
    from scrapy.linkextractors import LinkExtractor

    class CraigspiderSpider(CrawlSpider):
        name = "CraigSpider"
        allowed_domains = ["http://losangeles.craigslist.org"]
        start_urls = (
                    'http://losangeles.craigslist.org/search/cpg/',
        )

        rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_page", follow= True),)

        def parse_page(self, response):
            items = response.selector.xpath("//p[@class='row']")
        for i in items:
            link = i.xpath("./span[@class='txt']/span[@class='pl']/a/@href").extract()
            title = i.xpath("./span[@class='txt']/span[@class='pl']/a/span[@id='titletextonly']/text()").extract()
            print link,title

1 Answer:

Answer 0 (score: 0)

Based on the code you pasted, parse_page:

  1. does not return/yield anything, and
  2. contains only one statement: "items = response.selector ...".
  3. The cause of #2 is that the for loop is not indented correctly: it sits outside the method body.
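The effect of the missing indentation can be sketched with a minimal, hypothetical pair of classes (plain Python, no Scrapy involved): a method whose loop has been dedented out of it just binds a local variable and implicitly returns None, while the correctly indented version becomes a generator that yields items:

```python
class Broken:
    def parse_page(self):
        items = ["a", "b"]  # last statement in the method; nothing is yielded

class Fixed:
    def parse_page(self):
        items = ["a", "b"]
        for i in items:     # loop indented inside the method
            yield i         # the method is now a generator

print(Broken().parse_page())       # None: Scrapy would collect nothing
print(list(Fixed().parse_page()))  # ['a', 'b']
```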

Try indenting the for loop:

    class CraigspiderSpider(CrawlSpider):
        name = "CraigSpider"
        # allowed_domains takes bare domain names, not URLs with a scheme
        allowed_domains = ["losangeles.craigslist.org"]
        start_urls = ('http://losangeles.craigslist.org/search/cpg/',)

        # note the trailing comma: rules must be an iterable of Rule objects
        rules = (Rule(
            LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
            callback="parse_page", follow=True),)

        def parse_page(self, response):
            items = response.selector.xpath("//p[@class='row']")

            for i in items:  # indented inside parse_page
                link = i.xpath("./span[@class='txt']/span[@class='pl']/a/@href").extract()
                title = i.xpath("./span[@class='txt']/span[@class='pl']/a/span[@id='titletextonly']/text()").extract()
                print link, title
                yield dict(link=link, title=title)