Jumping between crawled URLs with Scrapy

Asked: 2015-04-15 14:04:45

Tags: python web-scraping scrapy

I want to create a bot that can jump between the URLs it finds on each website.

If there are two URLs on website 1, I want to start two new instances crawling those URLs,

ideally while limiting the number of instances running at the same time.

My code currently works on only one website; it never jumps to the URLs it has fetched.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ExampleSpider(CrawlSpider):
    name = "crawler"
    allowed_domains = ["*"]
    start_urls = ["http://domainA.com"]
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)

1 Answer:

Answer 0 (score: 0):

You can define an argument that, when passed, becomes the starting URL, and then spawn a new process with subprocess (https://docs.python.org/2/library/subprocess.html) for every link you find.

Maybe something like this (N.B. untested):

import subprocess

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ExampleSpider(CrawlSpider):
    name = "test_crawler"
    # Leave allowed_domains unset so links to any domain are followed;
    # "*" is not a supported wildcard and would filter everything out.
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]

    def __init__(self, starting_url=None, *args, **kwargs):
        # Use the URL passed via "-a starting_url=..." if there is one,
        # otherwise fall back to the default start page.
        if starting_url:
            self.start_urls = [starting_url]
        else:
            self.start_urls = ["http://www.domainA.com"]
        super(ExampleSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        # Spawn a fresh crawler for every page this instance visits.
        # Note: call() blocks until the child crawl finishes; see the
        # note below for a non-blocking variant.
        subprocess.call(
            'scrapy crawl {0} -a starting_url={1}'.format(
                self.name, response.url).split())
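You would launch the first instance from the project directory with something like this (a usage sketch; the spider name test_crawler comes from the code above):

scrapy crawl test_crawler -a starting_url=http://domainA.com

Since subprocess.call blocks until the child crawl finishes, subprocess.Popen is a better fit if the parent should keep crawling while children run. To limit the number of simultaneous instances, as the question asks, one option is to count running crawler processes before spawning a new one. A minimal, untested sketch, assuming a POSIX system where pgrep is available; MAX_INSTANCES and spawn_crawler are illustrative names, not part of Scrapy:

import subprocess

MAX_INSTANCES = 10  # arbitrary cap on simultaneous crawlers; tune as needed

def spawn_crawler(spider_name, url):
    # Count processes whose full command line matches "scrapy crawl"
    # (the count includes the current spider itself).
    try:
        running = int(subprocess.check_output(['pgrep', '-fc', 'scrapy crawl']))
    except subprocess.CalledProcessError:
        running = 0  # pgrep exits non-zero when nothing matches
    if running < MAX_INSTANCES:
        # Popen does not wait for the child, so the parent keeps crawling.
        subprocess.Popen(['scrapy', 'crawl', spider_name,
                          '-a', 'starting_url={0}'.format(url)])

parse_item could then call spawn_crawler(self.name, response.url) instead of invoking subprocess.call directly, dropping any link that would push the process count over the cap.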