Using Scrapy as an item generator

Date: 2016-09-15 09:53:20

Tags: python scrapy scrapy-pipeline

I have an existing script (main.py) that needs scraped data.

I started a Scrapy project to retrieve this data. Now, is there any way for main.py to retrieve the data from Scrapy as an item generator, rather than persisting it with an item pipeline?

Something like this would be really convenient, but I couldn't find out how to do it, if it is feasible at all:

for item in scrapy.process():

I found a potential solution here: https://tryolabs.com/blog/2011/09/27/calling-scrapy-python-script/, using a multithreaded queue; a sketch of that idea follows below.

Even though I understand that this behaviour is not compatible with distributed crawling, which is what Scrapy is meant for, I'm still a bit surprised that this feature isn't available for small projects.
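For reference, the idea from that post boils down to running the crawl in a separate process and pushing scraped items onto a queue that the calling script iterates over. A rough sketch of what I have in mind (the helpers _crawl and iter_items are just illustrative names, not Scrapy API; it assumes scrapy >= 1.1 for create_crawler):

import multiprocessing

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def _crawl(queue, spider_cls):
    # runs in a child process: crawl and forward every scraped item to the queue
    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(lambda item: queue.put(item),
                            signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()   # blocks until the crawl is finished
    queue.put(None)   # sentinel: no more items


def iter_items(spider_cls):
    # generator the main script can loop over
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=_crawl, args=(queue, spider_cls))
    worker.start()
    while True:
        item = queue.get()
        if item is None:
            break
        yield item
    worker.join()


# roughly the usage I'm after:
# for item in iter_items(MySpider):
#     ...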

2 Answers:

Answer 0 (score: 0)

You can have the crawler emit its data as JSON and collect the results from its output. It can be done as follows:

Given a spider like:

import json

import scrapy


class MySpider(scrapy.Spider):
    # some attributes
    accumulated = []

    def parse(self, response):
        # do your logic here
        page_text = response.xpath('//text()').extract()
        for text in page_text:
            if conditionsAreOk(text):
                self.accumulated.append(text)

    def closed(self, reason):
        # called when the crawl ends; dump the collected items as JSON
        # to stdout so the calling process can read them
        print(json.dumps(self.accumulated))

Write a runner.py script like:

import sys

from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from spiders import MySpider


def main(argv):

    url = argv[0]

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s', 'LOG_ENABLED': False})
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl(MySpider, url=url)

    # To run multiple spiders in the same process:
    #
    # runner.crawl('craw')
    # runner.crawl('craw2')
    # d = runner.join()

    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished


if __name__ == "__main__":
    main(sys.argv[1:])

Then call it from your main.py:

import json
import subprocess
import sys
import time


def main(argv):

    # urlArray holds http:// or https:// urls to crawl
    for url in urlArray:
        p = subprocess.Popen(['python', 'runner.py', url],
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = p.communicate()

        # do something with your data
        print(out)
        print(json.loads(out))

        # This just helps to watch logs
        time.sleep(0.5)


if __name__ == "__main__":
    main(sys.argv[1:])

Note

This is not the best way to use Scrapy, but for quick results that don't require complex post-processing, this solution can provide what you need.

I hope it helps.

Answer 1 (score: 0)

You can do it like this in a Twisted or Tornado application:

import collections

from twisted.internet.defer import Deferred
from scrapy.crawler import Crawler
from scrapy import signals


def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
    """
    Start a crawl and return an object (an ItemCursor instance)
    which allows you to retrieve scraped items and to wait for
    items to become available.

    Example:

    .. code-block:: python

        @inlineCallbacks
        def f():
            runner = CrawlerRunner()
            async_items = scrape_items(runner, my_spider)
            while (yield async_items.fetch_next):
                item = async_items.next_item()
                # ...
            # ...

    This convoluted way to write a loop should become unnecessary
    in Python 3.5 because of ``async for``.
    """
    # this requires scrapy >= 1.1rc1
    crawler = crawler_runner.create_crawler(crawler_or_spidercls)
    # for scrapy < 1.1rc1 the following code is needed:
    # crawler = crawler_or_spidercls
    # if not isinstance(crawler_or_spidercls, Crawler):
    #    crawler = crawler_runner._create_crawler(crawler_or_spidercls)

    d = crawler_runner.crawl(crawler, *args, **kwargs)
    return ItemCursor(d, crawler)


class ItemCursor(object):
    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d
        self.crawler = crawler

        crawler.signals.connect(self._on_item_scraped, signals.item_scraped)

        crawl_d.addCallback(self._on_finished)
        crawl_d.addErrback(self._on_error)

        self.closed = False
        self._items_available = Deferred()
        self._items = collections.deque()

    def _on_item_scraped(self, item):
        self._items.append(item)
        self._items_available.callback(True)
        self._items_available = Deferred()

    def _on_finished(self, result):
        self.closed = True
        self._items_available.callback(False)

    def _on_error(self, failure):
        self.closed = True
        self._items_available.errback(failure)

    @property
    def fetch_next(self):
        """
        A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
        asynchronously retrieve the next item, waiting for an item to be
        crawled if necessary. Resolves to ``False`` if the crawl is finished,
        otherwise :meth:`next_item` is guaranteed to return an item
        (a dict or a scrapy.Item instance).
        """
        if self.closed:
            # crawl is finished
            d = Deferred()
            d.callback(False)
            return d

        if self._items:
            # result is ready
            d = Deferred()
            d.callback(True)
            return d

        # We're active, but item is not ready yet. Return a Deferred which
        # resolves to True if item is scraped or to False if crawl is stopped.
        return self._items_available

    def next_item(self):
        """Get a document from the most recently fetched batch, or ``None``.
        See :attr:`fetch_next`.
        """
        if not self._items:
            return None
        return self._items.popleft()

The main idea is to listen for the item_scraped signal and then wrap it in an object with a nicer API.

Note that you need an event loop in your main.py script for this to work; the example above works with twisted.internet.defer.inlineCallbacks or tornado.gen.coroutine.
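For example, a minimal main.py driven by Twisted could look roughly like the sketch below; MySpider and the spiders module are assumed from your project, and scrape_items / ItemCursor are the ones defined above (imported or placed in the same module):

from twisted.internet import defer, task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from spiders import MySpider   # assumed: your spider module
# scrape_items / ItemCursor as defined above


@defer.inlineCallbacks
def crawl(reactor):
    configure_logging()
    runner = CrawlerRunner(get_project_settings())
    async_items = scrape_items(runner, MySpider)
    while (yield async_items.fetch_next):
        item = async_items.next_item()
        print(item)                # or feed it to the rest of main.py


if __name__ == '__main__':
    # task.react starts the reactor, runs crawl() and stops when it finishes
    task.react(crawl)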