Automating crawl depth

Date: 2014-04-10 16:12:19

Tags: python web-scraping scrapy

My website has 3 levels:

  • Country
  • City
  • Street

I want to scrape data from all the street pages. For that I built a spider. Now how do I get from country down to street without adding a million URLs to the start_urls field?

Should I build one spider for countries, one for cities and one for streets? Isn't the whole idea of a crawl spider that it follows all links down to a certain depth?

Adding DEPTH_LIMIT = 3 to the settings.py file did not change anything.
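For reference, DEPTH_LIMIT only caps how deep Scrapy will follow links; it does not by itself make a spider follow any links, which is why setting it has no visible effect without crawl rules. A minimal sketch of the setting (assuming a standard Scrapy project settings.py):

```python
# settings.py -- a minimal sketch.
# DEPTH_LIMIT caps how many links deep Scrapy will follow requests.
# It does NOT cause a spider to follow links at all; that requires
# CrawlSpider rules or manually yielding Requests from callbacks.
DEPTH_LIMIT = 3
```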

I start the crawl with: scrapy crawl spidername


EDIT

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from winkel.items import WinkelItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["mydomain.nl"]
    start_urls = [
        "http://www.mydomain.nl/Zuid-Holland"
    ]

    # allow= takes regular expressions, not glob patterns:
    # '*Zuid-Holland*' is an invalid regex ('*' has nothing to repeat).
    rules = (Rule(SgmlLinkExtractor(allow=('Zuid-Holland', )), callback='parse_winkel', follow=True),)

    def parse_winkel(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@id="itemsList"]/li')
        items = []

        for site in sites:
            item = WinkelItem()
            # Note: this stores a 3-tuple (link texts, node texts, h1 match) in one field
            item['adres'] = (site.xpath('.//a/text()').extract(),
                             site.xpath('text()').extract(),
                             sel.xpath('//h1/text()').re(r'winkel\s*(.*)'))
            items.append(item)
        return items

1 Answer:

Answer 0 (score: 2):

You need to use a CrawlSpider, with Rules that define Link Extractors for the country, city and street pages.

For example:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('country', )), callback='parse_country'),
        Rule(SgmlLinkExtractor(allow=('city', )), callback='parse_city'),
        Rule(SgmlLinkExtractor(allow=('street', )), callback='parse_street'),
    )

    def parse_country(self, response):
        self.log('Hi, this is a country page! %s' % response.url)

    def parse_city(self, response):
        self.log('Hi, this is a city page! %s' % response.url)

    def parse_street(self, response):
        self.log('Hi, this is a street page! %s' % response.url)
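The allow= patterns in the rules above are regular expressions matched against each extracted URL, not glob patterns. A small self-contained sketch (the URLs are hypothetical, for illustration only) of how such regex matching routes pages to per-level callbacks:

```python
import re

# Each allow= pattern is a regex; the first rule whose pattern matches
# the URL decides which callback handles the page.
PATTERNS = [
    ('country', 'parse_country'),
    ('city', 'parse_city'),
    ('street', 'parse_street'),
]

def pick_callback(url):
    """Return the callback name for the first pattern found in the URL."""
    for pattern, callback in PATTERNS:
        if re.search(pattern, url):
            return callback
    return None  # URL matched no rule; it would not be dispatched

print(pick_callback('http://www.example.com/country/nl'))       # -> parse_country
print(pick_callback('http://www.example.com/city/rotterdam'))   # -> parse_city
print(pick_callback('http://www.example.com/street/hoofdstraat'))  # -> parse_street
```

With follow=True (the default when no callback is given), the spider keeps extracting links from matched pages, which is what walks the crawl from country pages down to street pages without listing every URL in start_urls.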