CherryPy: How to stop and buffer incoming requests while data is updated

Time: 2015-09-18 01:54:56

Tags: python cherrypy

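I'm using CherryPy in a server that implements a RESTful API. The responses involve some heavy computation that takes about 2 seconds per request. This computation uses some data that is updated once a day.

The data is updated in the background (which takes about half an hour), and once it is updated, the references to the new data are passed to the functions that respond to requests. This takes just a millisecond.

What I need is to make sure that each request is answered either with the old data or with the new data, and that no request processing takes place while the data references are being changed. Ideally, I would like to find a way to buffer incoming requests while the data references are changed, and to make sure that the references are changed only after all in-process requests have finished.

My current (non-)working minimal example is as follows: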

import time
import cherrypy
from cherrypy.process import plugins

theData = 0

def processData():
    """Background task that works for half an hour three times a day
        and, when it finishes, publishes the result on the engine bus."""
    global theData # using global variables to simplify the example
    theData += 1
    cherrypy.engine.publish("doChangeData", theData)

class DataPublisher(object):

    def __init__(self):
        self.data = 'initData'
        cherrypy.engine.subscribe('doChangeData', self.changeData)

    def changeData(self, newData):
        cherrypy.engine.log("Changing data, buffering should start!")
        self.data = newData
        time.sleep(1)  # exaggerates the ~1 ms reference update to make the problem visible
        cherrypy.engine.log("Continue serving buffered and new requests.")

    @cherrypy.expose
    def index(self):
        result = "I get "+str(self.data)
        cherrypy.engine.log(result)
        time.sleep(3) 
        return result

if __name__ == '__main__':
    conf = {
        # Server settings belong in the 'global' section, not under a path.
        'global': {
            'server.socket_host': '127.0.0.1',
            'server.socket_port': 8080,
        }
    }
    cherrypy.config.update(conf)

    btask = plugins.BackgroundTask(5, processData)  # 5 seconds for the example
    btask.start()

    cherrypy.quickstart(DataPublisher())

If I run this script, open a browser, go to localhost:8080 and refresh the page many times, I get:

...
[17/Sep/2015:21:32:41] ENGINE Changing data, buffering should start!
127.0.0.1 - - [17/Sep/2015:21:32:41] "GET / HTTP/1.1" 200 7 "... 
[17/Sep/2015:21:32:42] ENGINE I get 3
[17/Sep/2015:21:32:42] ENGINE Continue serving buffered and new requests.
127.0.0.1 - - [17/Sep/2015:21:24:44] "GET / HTTP/1.1" 200 7 "...
...

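This means that some request processing started before, and ended after, the moment the data references started or finished changing. I want to avoid both cases. Something like: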

...
127.0.0.1 - - [17/Sep/2015:21:32:41] "GET / HTTP/1.1" 200 7 "... 
[17/Sep/2015:21:32:41] ENGINE Changing data, buffering should start!
[17/Sep/2015:21:32:42] ENGINE Continue serving buffered and new requests.
[17/Sep/2015:21:32:42] ENGINE I get 3
127.0.0.1 - - [17/Sep/2015:21:24:44] "GET / HTTP/1.1" 200 7 "...
...

I searched the documentation and the web and found these references, which do not completely cover this case:

http://www.defuze.org/archives/198-managing-your-process-with-the-cherrypy-bus.html

How to execute asynchronous post-processing in CherryPy?

http://tools.cherrypy.org/wiki/BackgroundTaskQueue

Cherrypy : which solutions for pages with large processing time

How to stop request processing in Cherrypy?

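Update (with a simple solution):

After giving it more thought, I think the question is misleading, because it includes some implementation requirements in the question itself, namely: stop processing and start buffering. For the problem at hand, the requirement can be simplified to: make sure that each request is processed either with the old data or with the new data.

For the latter, it is enough to store a temporary local reference to the data in use. This reference can be used throughout the request processing, and it is not a problem if another thread changes self.data. For Python objects, the garbage collector will take care of the old data.

Specifically, it is enough to change the index function as follows: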

@cherrypy.expose
def index(self):
    tempData = self.data
    result = "I started with %s"%str(tempData)
    time.sleep(3) # Heavy use of tempData
    result += " that changed to %s"%str(self.data)
    result += " but I am still using %s"%str(tempData)
    cherrypy.engine.log(result)
    return result

As a result we will see:

[21/Sep/2015:10:06:00] ENGINE I started with 1 that changed to 2 but I am still using 1
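
The same idea can be reproduced without a web server. The following is a minimal sketch, not part of the original post: `DataHolder`, `handle_request` and `run_demo` are hypothetical names, and two `threading.Event` objects stand in for the heavy computation so that the reference swap is guaranteed to happen while the request is in flight.

```python
import threading


class DataHolder:
    """Hypothetical stand-in for the DataPublisher in the question."""

    def __init__(self, data):
        self.data = data

    def handle_request(self, started, may_finish):
        snapshot = self.data   # bind one local reference up front
        started.set()          # tell the demo the request is in flight
        may_finish.wait()      # stands in for the heavy computation
        # The request only ever used `snapshot`; self.data may differ by now.
        return snapshot, self.data


def run_demo():
    holder = DataHolder("old")
    started, may_finish = threading.Event(), threading.Event()
    results = []
    worker = threading.Thread(
        target=lambda: results.append(holder.handle_request(started, may_finish))
    )
    worker.start()
    started.wait()         # the request has taken its snapshot
    holder.data = "new"    # the background update rebinds the attribute
    may_finish.set()
    worker.join()
    return results[0]


if __name__ == "__main__":
    print(run_demo())  # ('old', 'new')
```

The request keeps working with the object it captured at the start, while `self.data` already points at the new one. This only holds if the update rebinds the attribute to a new object; mutating the old object in place would still be visible through the snapshot.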

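I still want to keep the original (more restrictive) question and cyraxjoe's answer, because I find those solutions very useful.

1 Answer:

Answer 0: (score: 3)

I'll explain two approaches that solve the problem.

The first one is plugin-based.

Plugin based

This still needs some kind of synchronization. It only works because there is just one BackgroundTask making the modification (and the modification is just an atomic operation).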

import time
import threading

import cherrypy
from cherrypy.process import plugins

UPDATE_INTERVAL = 0.5
REQUEST_DELAY = 0.1
UPDATE_DELAY = 0.1
THREAD_POOL_SIZE = 20

next_data = 1

class DataGateway(plugins.SimplePlugin):

    def __init__(self, bus):
        super(DataGateway, self).__init__(bus)
        self.data = next_data

    def start(self):
        self.bus.log("Starting DataGateway")
        self.bus.subscribe('dg:get', self._get_data)
        self.bus.subscribe('dg:update', self._update_data)
        self.bus.log("DataGateway has been started")

    def stop(self):
        self.bus.log("Stopping DataGateway")
        self.bus.unsubscribe('dg:get', self._get_data)
        self.bus.unsubscribe('dg:update', self._update_data)
        self.bus.log("DataGateway has been stopped")

    def _update_data(self, new_val):
        self.bus.log("Changing data, buffering should start!")
        self.data = new_val
        time.sleep(UPDATE_DELAY)
        self.bus.log("Continue serving buffered and new requests.")

    def _get_data(self):
        return self.data


def processData():
    """Background task that works for half an hour three times a day
        and, when it finishes, publishes the result on the engine bus."""
    global next_data
    cherrypy.engine.publish("dg:update", next_data)
    next_data += 1


class DataPublisher(object):

    @property
    def data(self):
        return cherrypy.engine.publish('dg:get').pop()

    @cherrypy.expose
    def index(self):
        result = "I get " + str(self.data)
        cherrypy.engine.log(result)
        time.sleep(REQUEST_DELAY)
        return result

if __name__ == '__main__':
    conf = {
        'global': {
            'server.thread_pool': THREAD_POOL_SIZE,
            'server.socket_host': '127.0.0.1',
            'server.socket_port': 8080,
        }
    }
    cherrypy.config.update(conf)
    DataGateway(cherrypy.engine).subscribe()
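    # Run the update every UPDATE_INTERVAL seconds (UPDATE_DELAY is the
    # simulated duration of the reference swap, not the schedule).
    plugins.BackgroundTask(UPDATE_INTERVAL, processData).start()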
    cherrypy.quickstart(DataPublisher())

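In this version, the synchronization comes from the fact that both read & write operations are executed on the cherrypy.engine thread. Everything is abstracted away in the DataGateway plugin; you just operate by publishing to the engine.

The second approach uses threading.Event objects. It is a more manual approach, with the added benefit that it will probably be faster, because reads are not executed on the cherrypy.engine thread.

threading.Event based (a.k.a. manual)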

import time
import cherrypy
import threading
from cherrypy.process import plugins

UPDATE_INTERVAL = 0.5
REQUEST_DELAY = 0.1
UPDATE_DELAY = 0.1
THREAD_POOL_SIZE = 20

next_data = 1

def processData():
    """Background task that works for half an hour three times a day
        and, when it finishes, publishes the result on the engine bus."""
    global next_data
    cherrypy.engine.publish("doChangeData", next_data)
    next_data += 1


class DataPublisher(object):

    def __init__(self):
        self._data = next_data
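        self._data_readable = threading.Event()
        # Start in the readable state; otherwise every request would block
        # until the first update has run.
        self._data_readable.set()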
        cherrypy.engine.subscribe('doChangeData', self.changeData)

    @property
    def data(self):
        if self._data_readable.is_set():
            return self._data
        else:
            self._data_readable.wait()
            return self.data

    @data.setter
    def data(self, value):
        self._data_readable.clear()
        time.sleep(UPDATE_DELAY)
        self._data = value
        self._data_readable.set()

    def changeData(self, newData):
        cherrypy.engine.log("Changing data, buffering should start!")
        self.data = newData
        cherrypy.engine.log("Continue serving buffered and new requests.")

    @cherrypy.expose
    def index(self):
        result = "I get " + str(self.data)
        cherrypy.engine.log(result)
        time.sleep(REQUEST_DELAY)
        return result

if __name__ == '__main__':
    conf = {
        'global': {
            'server.thread_pool': THREAD_POOL_SIZE,
            'server.socket_host': '127.0.0.1',
            'server.socket_port': 8080,
        }
    }
    cherrypy.config.update(conf)
    plugins.BackgroundTask(UPDATE_INTERVAL, processData).start()
    cherrypy.quickstart(DataPublisher())

I added some niceties with the @property decorator, but the real gist is threading.Event and the fact that the DataPublisher object is shared among the worker threads.
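
Stripped of CherryPy, the gating idea looks roughly like the sketch below. It is an illustration only: `GatedValue`, `begin_update`, `finish_update` and `run_demo` are hypothetical names that do not appear in the code above, and the event is set in the constructor on the assumption that reads should work before the first update.

```python
import threading


class GatedValue:
    """Hypothetical helper: readers block while an update is in progress."""

    def __init__(self, value):
        self._value = value
        self._readable = threading.Event()
        self._readable.set()       # readable until an update begins

    def get(self):
        self._readable.wait()      # returns at once when the gate is open
        return self._value

    def begin_update(self):
        self._readable.clear()     # new readers now wait in get()

    def finish_update(self, value):
        self._value = value
        self._readable.set()       # release every waiting reader


def run_demo():
    gate = GatedValue("old")
    seen = []
    gate.begin_update()
    reader = threading.Thread(target=lambda: seen.append(gate.get()))
    reader.start()
    reader.join(timeout=0.2)       # the reader should still be waiting
    blocked = reader.is_alive()
    gate.finish_update("new")
    reader.join()
    return blocked, seen


if __name__ == "__main__":
    print(run_demo())  # (True, ['new'])
```

Note that a gate like this only holds back reads that start during the update. It does not wait for requests that already obtained the old value before `begin_update()` was called.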

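In both examples I also added the thread pool configuration needed to increase the thread pool size. The default is 10.

As a way to test what I just said, you can run this Python 3 script (if you don't have python3 yet, now you have an excuse to install it). It makes 100 requests more or less concurrently, given the thread pool.

Test script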

import time
import urllib.request
import concurrent.futures


URL = 'http://localhost:8080/'
TIMEOUT = 60
DELAY = 0.05
MAX_WORKERS = 20
REQ_RANGE = range(1, 101)

def load_url():
    with urllib.request.urlopen(URL, timeout=TIMEOUT) as conn:
        return conn.read()


with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = {}
    for i in REQ_RANGE:
        print("Sending req {}".format(i))
        futures[executor.submit(load_url)] = i
        time.sleep(DELAY)
    results = []
    for future in concurrent.futures.as_completed(futures):
        try:
            data = future.result().decode()
        except Exception as exc:
            print(exc)
        else:
            results.append((futures[future], data))
    curr_max = 0
    for i, data in sorted(results, key=lambda r: r[0]):
        new_max = int(data.split()[-1])
        assert new_max >= curr_max, "The data was not updated correctly"
        print("Req {}: {}".format(i, data))
        curr_max = new_max

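Don't decide that you have a problem based on the logs; they are not trustworthy for this kind of issue, especially given that you have no control over when a request gets written to the "access" log. I couldn't make your code fail with my test code, but there is indeed a race condition in the general case. In this example it should work every time, because the code only performs an atomic operation: a single attribute assignment, made periodically from one central point.

I hope the code is self-explanatory; if you have any questions, leave a comment.

EDIT: I edited the plugin-based approach because it only works as long as there is a single place executing the update. If you create another background task that updates the data, you could run into problems as soon as the update does something more than a plain assignment. Regardless, the code may be what you are looking for if you update from a single BackgroundTask.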
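
For the general case, the stricter requirement from the question (hold back new requests during the swap, and swap only after in-flight requests have finished) can be sketched with a `threading.Condition`. This is one possible design and not something CherryPy provides: `UpdateGate`, `reading`, `updating` and `run_demo` are hypothetical names, and the sketch assumes each request handler wraps its work in `with gate.reading():`.

```python
import threading
from contextlib import contextmanager


class UpdateGate:
    """Hypothetical gate: many concurrent readers, one exclusive updater."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._updating = False

    @contextmanager
    def reading(self):
        with self._cond:
            while self._updating:          # buffer requests during an update
                self._cond.wait()
            self._active_readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._active_readers -= 1
                self._cond.notify_all()    # may wake a waiting updater

    @contextmanager
    def updating(self):
        with self._cond:
            while self._updating:          # one updater at a time
                self._cond.wait()
            self._updating = True          # from now on new readers queue up
            while self._active_readers:    # drain the in-flight requests
                self._cond.wait()
        try:
            yield
        finally:
            with self._cond:
                self._updating = False
                self._cond.notify_all()    # release the buffered readers


def run_demo():
    gate = UpdateGate()
    state = {"data": "old"}
    events = []
    reader_started = threading.Event()
    release_reader = threading.Event()

    def slow_request():
        with gate.reading():
            reader_started.set()
            release_reader.wait()          # stands in for the heavy work
            events.append("request done with " + state["data"])

    def update():
        with gate.updating():
            state["data"] = "new"
            events.append("data swapped")

    reader = threading.Thread(target=slow_request)
    writer = threading.Thread(target=update)
    reader.start()
    reader_started.wait()
    writer.start()
    writer.join(timeout=0.2)               # the updater must still be waiting
    writer_blocked = writer.is_alive()
    release_reader.set()
    reader.join()
    writer.join()
    return writer_blocked, events


if __name__ == "__main__":
    print(run_demo())
```

In the demo the update is started while a request is in flight, so it waits until that request has finished with the old data and only then swaps the reference. The cost is that a long-running request delays the update and every request queued behind it.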
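
If more than one task ever updates the data, or the update is a read-modify-write rather than a plain assignment, a `threading.Lock` around the update is one way to keep it safe. The following is a minimal sketch under that assumption; `SharedCounter` and `run_demo` are hypothetical names and are not part of the code above.

```python
import threading


class SharedCounter:
    """Hypothetical shared state updated by several background tasks."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Read, add and write back as one step; without the lock, two
        # threads could read the same value and one update would be lost.
        with self._lock:
            self.value += 1


def run_demo(workers=4, per_worker=10000):
    counter = SharedCounter()

    def work():
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_demo())  # 40000
```

With a single BackgroundTask doing one attribute assignment, as in the examples above, the lock is not needed.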