No data is scraped when crawling recursively

Date: 2015-10-19 03:18:42

Tags: python scrapy

I am trying to recursively scrape job titles from https://iowacity.craigslist.org/search/jjj. That is, I want the spider to scrape all the job titles on page 1, then follow the "next >" link at the bottom to scrape the next page, and so on. I modeled my spider on Michael Herman's tutorial: http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.ViJ6rPmrTIU

Here is my code:

import scrapy
from craig_rec.items import CraigRecItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class CraigslistSpider(CrawlSpider):
    name = "craig_rec"
    allowed_domains = ["https://craigslist.org"]
    start_urls = ["https://iowacity.craigslist.org/search/jjj"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        for sel in response.xpath("//span[@class = 'pl']"):
            item = CraigRecItem()
            item['title'] = sel.xpath("a/text()").extract()
            items.append(item)
        return items  

I ran the spider, but no data was scraped. Any help? Thanks!

2 Answers:

Answer 0 (score: 1)

When you set allowed_domains to "https://craigslist.org", the crawl stops because requests to the subdomain "iowacity.craigslist.org" are filtered out as offsite requests.
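
If you run the spider with DEBUG logging, the offsite middleware reports each request it drops; the exact wording varies by Scrapy version, but the message looks roughly like this:

DEBUG: Filtered offsite request to 'iowacity.craigslist.org': <GET https://iowacity.craigslist.org/search/jjj>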

You have to set it to:

allowed_domains = ["craigslist.org"]

According to the docs, allowed_domains is a list of strings containing the domains this spider is allowed to crawl. It expects the bare domain.com format, which allows both the domain itself and all of its subdomains.

You can also allow just a few specific subdomains, or allow all requests by leaving the attribute unset.
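
A minimal sketch of those options (the spider name and log line below are purely illustrative, not part of the original code):

import scrapy


class DomainsDemoSpider(scrapy.Spider):
    name = "domains_demo"  # hypothetical name, for illustration only

    # Bare domain: allows craigslist.org plus every subdomain,
    # including iowacity.craigslist.org.
    allowed_domains = ["craigslist.org"]

    # A specific subdomain would restrict the crawl to that host only:
    # allowed_domains = ["iowacity.craigslist.org"]

    # Omitting allowed_domains entirely disables offsite filtering,
    # so requests to any domain are followed.

    start_urls = ["https://iowacity.craigslist.org/search/jjj"]

    def parse(self, response):
        self.logger.info("Visited %s", response.url)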

Answer 1 (score: 1)

Michael Herman's tutorial is great, but it was written for an older version of Scrapy. This snippet avoids some deprecation warnings and turns parse_page into a generator:

import scrapy
from craig_rec.items import CraigRecItem
# These import paths replace the deprecated scrapy.contrib.* modules
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CraiglistSpider(CrawlSpider):
    name = "craiglist"
    allowed_domains = ["craigslist.org"]
    start_urls = (
        'https://iowacity.craigslist.org/search/jjj/',
    )

    rules = (
        # Follow the "next >" pagination link and parse each page it leads to
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Yield one item per job-title span instead of collecting a list
        for sel in response.xpath("//span[@class = 'pl']"):
            item = CraigRecItem()
            item['title'] = sel.xpath(".//a/text()").extract()
            yield item
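
Assuming a standard Scrapy project layout, you could then run the spider and export the scraped titles with, for example:

scrapy crawl craiglist -o items.json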

That post also has some good advice on scraping Craigslist.