Question

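What I'm doing now:

Every time the page is refreshed or loaded, my backend catches the GET request sent by the front-end page and runs my Scrapy spider. The scraped data is then displayed on my home page. Here is the code, in which I call a subprocess to run the spider: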

from subprocess import run

@get('/api/get_presentcode')
def api_get_presentcode():
    if os.path.exists("/static/presentcodes.json"):
        run("rm presentcodes.json", shell=True)

    run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
    with open("/static/presentcodes.json") as data_file:
        data = json.load(data_file)

    logging.info(data)
    return data

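This works well.

What I want:

However, the site the spider crawls rarely changes, so it doesn't need to be scraped that often.

So I'd like to use a coroutine in the backend to run the Scrapy spider every 30 minutes.

What I tried that worked: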

from subprocess import run

# init of my web application
async def init(loop):
....

async def run_spider():
    while True:
        print("Run spider...")
        await asyncio.sleep(10)  #  to check results more obviously 

loop = asyncio.get_event_loop()
tasks = [run_spider(), init(loop)]
loop.run_until_complete(asyncio.wait(tasks))
loop.run_forever()

This also works well.

But when I change the code of run_spider() to the following (basically the same as the first snippet):

async def run_spider():
    while True:
        if os.path.exists("/static/presentcodes.json"):
            run("rm presentcodes.json", shell=True)

        run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
        await asyncio.sleep(20)

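The spider runs only the first time. The crawled data is stored in presentcodes.json successfully, but the spider is never called again after 20 seconds.

Questions

What is wrong with my program? Is it because I call a subprocess inside a coroutine, and that is invalid?
Is there a better way to run a spider while the main application is running?

Edit:

Let me first put the code of my web application's init function here: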

async def init(loop):
    logging.info("App started at {0}".format(datetime.now()))
    await orm.create_pool(loop=loop, user='root', password='', db='myBlog')
    app = web.Application(loop=loop, middlewares=[
        logger_factory, auth_factory, response_factory
    ])
    init_jinja2(app, filters=dict(datetime=datetime_filter))
    add_routes(app, 'handlers')
    add_static(app)
    srv = await loop.create_server(app.make_handler(), '127.0.0.1', 9000)  # It seems something happened here.
    logging.info('server started at http://127.0.0.1:9000') # this log didn't show up.
    return srv

My guess is that the main application makes the coroutine event loop "stuck", so the spider can't be called back later.

Let me look into create_server and run_until_complete...

Answer 1

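This is probably not a complete answer, and I wouldn't do it the way you do. But calling subprocess from within an asyncio coroutine is definitely not correct. Coroutines provide cooperative multitasking, so when you call subprocess from within a coroutine, that coroutine effectively stops the whole application until the called process finishes.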

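One thing you need to understand when working with asyncio is that control flow can switch from one coroutine to another only when you call await (or yield from, async for, async with and other shortcuts). If you perform a long action without calling any of those, you block every other coroutine until that action finishes.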
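A minimal sketch of that rule, assuming Python 3.7+ for asyncio.run. The names ticker, blocking_job, cooperative_job, longest_gap and measure are made up for this illustration, and time.sleep stands in for the blocking subprocess call. The ticker plays the role of the web server: when it runs next to a job that never awaits, its ticks stall for the job's whole duration.

```python
import asyncio
import time


async def ticker(log, count=5, interval=0.05):
    """Stand-in for the web server: records a timestamp at each tick."""
    for _ in range(count):
        log.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_job(duration=0.3):
    # time.sleep() never awaits, so the event loop cannot switch away.
    time.sleep(duration)


async def cooperative_job(duration=0.3):
    # asyncio.sleep() suspends this coroutine and lets the others run.
    await asyncio.sleep(duration)


async def longest_gap(job):
    """Run the ticker next to `job`; return the longest pause between ticks."""
    log = []
    await asyncio.gather(ticker(log), job())
    return max(later - earlier for earlier, later in zip(log, log[1:]))


def measure():
    """Return (gap with the blocking job, gap with the cooperative job)."""
    blocked = asyncio.run(longest_gap(blocking_job))
    cooperative = asyncio.run(longest_gap(cooperative_job))
    return blocked, cooperative
```

Calling measure() should show a pause of about the full job duration in the blocking case and one close to the tick interval in the cooperative case.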

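What you need to use is asyncio.subprocess, which properly returns control flow to the other parts of your application (namely the web server) while the subprocess is running.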

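The actual run_spider() coroutine should therefore launch the crawl through asyncio.subprocess and await its completion instead of calling subprocess.run.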
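The answer's own snippet is not available here, so the following is only a sketch of what such a coroutine might look like, not the original code. The crawl command and working directory are copied from the question. The helper run_command and the parameters interval and max_runs are assumptions added for illustration (max_runs exists only so the loop can be stopped when trying it out).

```python
import asyncio

# Command and directory as given in the question.
CRAWL_CMD = "scrapy crawl presentcodespider -o ../static/presentcodes.json"
SPIDER_DIR = "./presentcodeSpider"


async def run_command(cmd, cwd=None):
    """Run a shell command without blocking the event loop; return its exit code."""
    proc = await asyncio.create_subprocess_shell(cmd, cwd=cwd)
    # Awaiting wait() hands control back to the loop until the child exits.
    return await proc.wait()


async def run_spider(cmd=CRAWL_CMD, cwd=SPIDER_DIR, interval=30 * 60, max_runs=None):
    """Re-run the crawl, sleeping `interval` seconds after each run.

    `interval` and `max_runs` are hypothetical parameters; with
    max_runs=None the loop never ends, as in the question.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        returncode = await run_command(cmd, cwd=cwd)
        print("crawl finished with exit code", returncode)
        runs += 1
        await asyncio.sleep(interval)
    return runs
```

Because both the crawl and the pause are awaited, the web server started by init(loop) can keep handling requests in between. The sketch leaves out the question's step of deleting the old output file before each crawl.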

Running a spider while running a web application based on Python's asyncio module
