Broad crawl - different xpaths - Scrapy

Asked: 2017-03-28 22:26:13

Tags: python web-scraping scrapy web-crawler

I am new to Scrapy. I have thousands of (url, xpath) tuples and values in a database. The urls come from many different domains (though not entirely: the same domain may account for around 100 urls).

x.com/a //h1
y.com/a //div[@class='1']
z.com/a //div[@href='...']
x.com/b //h1
x.com/c //h1
...

Now I want to fetch these values every 2 hours, as fast as possible, while making sure I do not overload any of these domains.

I can't figure out how to do this.

My idea:

I could create one Spider for every distinct domain, set its parsing rules, and run them all at once.

Is this good practice?

EDIT: I am not sure how to output the data into my database with respect to concurrency.

EDIT2:

I could do it like this: a new spider for every domain. But that is not feasible with thousands of different urls and xpaths.

import scrapy

class WikiScraper(scrapy.Spider):
    name = "wiki_headers"

    def start_requests(self):
        urls = [
            'https://en.wikipedia.org/wiki/Spider',
            'https://en.wikipedia.org/wiki/Data_scraping',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # response.xpath replaces the deprecated HtmlXPathSelector
        header = response.xpath('//h1/text()').extract()
        print(header)
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)

class CraigslistScraper(scrapy.Spider):
    name = "craigslist_headers"

    def start_requests(self):
        urls = [
            'https://columbusga.craigslist.org/act/6062657418.html',
            'https://columbusga.craigslist.org/acc/6060297390.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        header = response.xpath('//span[@id="titletextonly"]/text()').extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)

3 Answers:

Answer 0 (score: 1)

From the example you posted in EDIT2, it looks like all of your classes can easily be abstracted one level. How about this?

from urllib.parse import urlparse

import scrapy

class GenericScraper(scrapy.Spider):
    def __init__(self, urls, xpath):
        # Set the name before calling Spider.__init__, which requires it.
        self.name = self._create_scraper_name_from_url(urls[0])
        super().__init__()
        self.urls = urls
        self.xpath = xpath

    @staticmethod
    def _create_scraper_name_from_url(url):
        '''Generate scraper name from url
           www.example.com/foobar/bar -> www_example_com'''
        netloc = urlparse(url).netloc
        return netloc.replace('.','_')

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # response.xpath replaces the deprecated HtmlXPathSelector
        header = response.xpath(self.xpath).extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)

Next, you could group the data from your database by xpath:

for urls, xpath in grouped_data:
    scraper = GenericScraper(urls, xpath)
    # do whatever you need with scraper
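
How grouped_data gets built depends on your schema, but as a rough sketch, assuming the rows come out of the database as (url, xpath) pairs (both the rows variable and its layout below are assumptions), you could bucket them with a defaultdict:

from collections import defaultdict

# Hypothetical rows fetched from the database as (url, xpath) pairs.
rows = [
    ('http://x.com/a', '//h1'),
    ('http://y.com/a', "//div[@class='1']"),
    ('http://x.com/b', '//h1'),
]

buckets = defaultdict(list)
for url, xpath in rows:
    buckets[xpath].append(url)

# Same shape as grouped_data above: one (urls, xpath) pair per distinct xpath.
grouped_data = [(urls, xpath) for xpath, urls in buckets.items()]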

Regarding concurrency: your database should be able to handle concurrent writes, so I do not see a problem there.
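
If you would rather keep the writes inside Scrapy instead of appending to result.txt, an item pipeline is the usual place for them. The sketch below is only an illustration, assuming SQLite and items carrying hypothetical url and value fields:

import sqlite3

class SQLiteWriterPipeline:
    # Hypothetical pipeline: collects scraped values and writes them to SQLite.

    def open_spider(self, spider):
        self.conn = sqlite3.connect('results.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS results (url TEXT, value TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO results (url, value) VALUES (?, ?)',
                          (item['url'], item['value']))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

You would enable it via ITEM_PIPELINES in the settings and have the parse callbacks yield items instead of writing to a file.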

EDIT: Regarding the timeouts: I do not know how Scrapy works under the hood, i.e. whether it uses some sort of parallelization and whether it runs asynchronously in the background. But from what you wrote I guess it does, and when you fire up 1k scrapers, each issuing multiple requests at a time, your hardware cannot handle that much traffic (disclaimer: this is just a guess!).

There may be a native way to do this, but a possible workaround is to use multiprocessing + a queue:

from multiprocessing import JoinableQueue, Process

NUMBER_OF_CPU = 4  # change this to your number.
SENTINEL = None


class Worker(Process):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        while True:
            # Blocking wait! You have to use sentinels if you use blocking waits!
            item = self.queue.get()
            if item is SENTINEL:
                # We got the sentinel: there are no more scrapers to process.
                self.queue.task_done()
                return
            # item is a scraper, run it
            item.run_spider()  # or however you run your scrapers
            # This assumes that each scraper is **not** running in the background!

            # Tell the JoinableQueue we have processed one more item.
            # In the main process, queue.join() waits until task_done() has
            # been called once for every item taken from the queue.
            self.queue.task_done()


def run():
    queue = JoinableQueue()
    # If putting that many things in the queue gets slow (I imagine it can),
    # you can fire up a separate Thread/Process to fill the queue in the
    # background while the workers are already consuming it.
    for urls, xpath in grouped_data:
        scraper = GenericScraper(urls, xpath)
        queue.put(scraper)
    for _ in range(NUMBER_OF_CPU):
        # None, or a sentinel of your choice, to tell the workers there are
        # no more scrapers to process
        queue.put(SENTINEL)
    workers = []
    for _ in range(NUMBER_OF_CPU):
        worker = Worker(queue)
        workers.append(worker)
        worker.start()

    # We have to wait until the queue is fully processed
    queue.join()

But bear in mind that this is a vanilla approach that completely disregards Scrapy's own capabilities for parallel execution. I found This blogpost, which uses Twisted to achieve (I think) the same thing, but since I have never used Twisted I cannot comment on it.
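
For what it's worth, one native-looking route is Scrapy's CrawlerProcess, which schedules many spider instances on a single Twisted reactor. The sketch below reuses the GenericScraper and grouped_data from above; treat it as an untested outline:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # Politeness knobs so thousands of urls do not hammer any single domain.
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
    'AUTOTHROTTLE_ENABLED': True,
})

for urls, xpath in grouped_data:
    # Keyword arguments are forwarded to the spider's __init__.
    process.crawl(GenericScraper, urls=urls, xpath=xpath)

process.start()  # blocks until all scheduled crawls have finished

Whether a single process copes with thousands of spider instances is something you would have to measure; batching the groups across a few processes is another option.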

Answer 1 (score: 0)

If you think Scrapy cannot handle multiple domains at once because of the allowed_domains parameter, remember that it is optional.

If the allowed_domains attribute is not set on the spider, it will work with every domain it is given.
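
A single spider fed urls from many domains can then lean on Scrapy's built-in per-domain throttling so that no individual site gets overloaded; a minimal settings sketch (the numbers are placeholders to tune):

# settings.py -- values are illustrative, tune them for your crawl
CONCURRENT_REQUESTS = 64            # overall parallelism across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # politeness cap per individual site
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True         # back off automatically when a site slows down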

Answer 2 (score: -1)

If I understand correctly, you have domains mapped to xpath values, and you want to pull the right xpath depending on the domain you are crawling?
Try something like:

import logging

DOMAIN_DATA = [('domain.com', '//div')]
def get_domain(url):
    for domain, xpath in DOMAIN_DATA:
        if domain in url: 
            return xpath


def parse(self, response):
    xpath = get_domain(response.url)
    if not xpath:
        logging.error('no xpath for url: {}; unknown domain'.format(response.url))
        return
    item = dict()
    item['some_field'] = response.xpath(xpath).extract()
    yield item
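
One caveat: the substring check above can misfire (for example, 'domain.com' also matches 'notdomain.com'). A slightly stricter lookup keyed on the parsed hostname might look like this (DOMAIN_XPATHS and xpath_for_url are made-up names):

from urllib.parse import urlparse

# Hypothetical mapping from hostname to xpath, loaded from the database.
DOMAIN_XPATHS = {
    'x.com': '//h1',
    'y.com': "//div[@class='1']",
}

def xpath_for_url(url):
    # Return the xpath registered for the url's hostname, or None if unknown.
    # Hostnames with a www. prefix or an explicit port would need extra normalization.
    hostname = urlparse(url).netloc.lower()
    return DOMAIN_XPATHS.get(hostname)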