Python - collecting data from APIs in parallel, caching it, and pushing it to a database

Date: 2017-09-21 22:43:40

Tags: python multithreading algorithm

I'm having a hard time figuring out how to implement stage 3 of this algorithm:

  1. Fetch data from a series of APIs
  2. Store the data inside the script until a condition is met (cache it, and don't bother the database)
  3. Push the structured data to the database and, at the same time, continue with step 1 (the database upload should not block step 1 from restarting; the two should run in parallel)

    import requests
    import time
    from sqlalchemy import schema, types
    from sqlalchemy.engine import create_engine
    import threading 
    
    # I usually work on postgres
    meta = schema.MetaData(schema="example")
    
    # table one
    table_api_one =    schema.Table('api_one', meta,
                       schema.Column('id', types.Integer, primary_key=True),                        
                       schema.Column('field_one', types.Unicode(255), default=u''),
                       schema.Column('field_two', types.BigInteger()),
                  )
    # table two
    table_api_two =    schema.Table('api_two', meta,
                       schema.Column('id', types.Integer, primary_key=True),                        
                       schema.Column('field_one', types.Unicode(255), default=u''),
                       schema.Column('field_two', types.BigInteger()),
                  )
    
    # create tables
    engine = create_engine("postgres://......", echo=False, pool_size=15, max_overflow=15)
    meta.bind = engine
    meta.create_all(checkfirst=True)
    
    # get the data from the API and return data as JSON
    def getdatafrom(url):
        data = requests.get(url)
        structured = data.json()    
        return structured 
    
    # push the data to the DB
    def flush(list_one,list_two):
        connection = engine.connect()
        # both lists are list of json
        connection.execute(table_api_one.insert(),list_one) 
        connection.execute(table_api_two.insert(),list_two) 
        connection.close()
    
    # start doing something
    def main():
        timesleep = 30
        flush_limit = 10
        threading.Timer(timesleep * flush_limit, main).start()
        data_api_one = []
        data_api_two = []
    
        # repeat the fetch flush_limit (10) times before flushing, to avoid keeping the DB busy
        while len(data_api_one) < flush_limit and len(data_api_two) < flush_limit:
             data_api_one.append(getdatafrom("http://www.apiurlone.com/api...").copy())
             data_api_two.append(getdatafrom("http://www.apiurltwo.com/api...").copy())
             time.sleep(timesleep)
    
        # push the data when the limit is reached
        flush(data_api_one,data_api_two)
    
    # start the example
    main()
    
  4. In this example script, a thread restarts main() every 10 * 30 seconds (to avoid overlapping threads). However, with this algorithm, the script stops collecting data from the APIs while flush() is running.

    How can I flush and keep fetching data from the APIs at the same time?

    Thanks!

1 Answer:

Answer 0 (score: 0)

The usual approach is a Queue object (from the module named Queue or queue, depending on your Python version).
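A minimal sketch of the version-dependent import the answer refers to (the module was renamed from `Queue` to `queue` in Python 3):

```python
try:
    import queue  # Python 3 name
except ImportError:
    import Queue as queue  # Python 2 name

# Either way, the FIFO class is available as queue.Queue
q = queue.Queue()
```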

Create a producer function (running in one thread) that collects the API data and puts it on the queue when it is time to flush, and a consumer function (running in another thread) that waits to get data from the queue and stores it in the database.
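The producer/consumer scheme above can be sketched as follows. This is a minimal, self-contained illustration, not the asker's actual script: `fake_getdatafrom` is a hypothetical stand-in for the real `getdatafrom(url)` API call, and the `stored` list stands in for the database insert (`connection.execute(table.insert(), batch)` in the question's code):

```python
import threading
import queue  # named "Queue" on Python 2

flush_limit = 3
q = queue.Queue()

# Hypothetical stand-in for the real API call (getdatafrom in the question).
def fake_getdatafrom(i):
    return {"field_one": u"item-%d" % i, "field_two": i}

stored = []  # stands in for the database table

def producer():
    batch = []
    for i in range(2 * flush_limit):
        batch.append(fake_getdatafrom(i))
        if len(batch) >= flush_limit:
            q.put(batch)  # hand the batch to the consumer; keep collecting
            batch = []
    q.put(None)  # sentinel: no more data

def consumer():
    while True:
        batch = q.get()
        if batch is None:
            break
        # real script: connection.execute(table_api_one.insert(), batch)
        stored.extend(batch)
        q.task_done()

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(len(stored))  # 6
```

Because `q.put()` returns immediately, the producer goes straight back to fetching while the consumer writes the previous batch to the database, which is exactly the non-blocking flush the question asks for.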