I'm having a hard time figuring out how to develop phase 3 of this algorithm:
push the structured data to the DB while continuing with step 1 (no waiting for the DB upload to finish before restarting step 1 — the two should run in parallel)
import requests
import time
from sqlalchemy import schema, types
from sqlalchemy.engine import create_engine
import threading
# I usually work on postgres
meta = schema.MetaData(schema="example")
# table one
table_api_one = schema.Table('api_one', meta,
    schema.Column('id', types.Integer, primary_key=True),
    schema.Column('field_one', types.Unicode(255), default=u''),
    schema.Column('field_two', types.BigInteger()),
)
# table two
table_api_two = schema.Table('api_two', meta,
    schema.Column('id', types.Integer, primary_key=True),
    schema.Column('field_one', types.Unicode(255), default=u''),
    schema.Column('field_two', types.BigInteger()),
)
# create tables
engine = create_engine("postgresql://......", echo=False, pool_size=15, max_overflow=15)
meta.bind = engine
meta.create_all(checkfirst=True)
# get the data from the API and return data as JSON
def getdatafrom(url):
    data = requests.get(url)
    structured = data.json()
    return structured
# push the data to the DB
def flush(list_one, list_two):
    connection = engine.connect()
    # both lists are lists of JSON records
    connection.execute(table_api_one.insert(), list_one)
    connection.execute(table_api_two.insert(), list_two)
    connection.close()
# start doing something
def main():
    timesleep = 30
    flush_limit = 10
    threading.Timer(timesleep * flush_limit, main).start()
    data_api_one = []
    data_api_two = []
    # repeat the process 10 times (flush_limit), avoiding keeping the DB too busy
    while len(data_api_one) < flush_limit and len(data_api_two) < flush_limit:
        data_api_one.append(getdatafrom("http://www.apiurlone.com/api...").copy())
        data_api_two.append(getdatafrom("http://www.apiurltwo.com/api...").copy())
        time.sleep(timesleep)
    # push the data when the limit is reached
    flush(data_api_one, data_api_two)
# start the example
main()
In this example script, a thread starts main() every 10 * 30 seconds (to avoid overlapping threads). However, with this algorithm the script stops collecting data from the APIs while flush() is running.
How can I flush and keep fetching data from the APIs at the same time?
Thanks!
Answer 0 (score: 0)
The usual approach is a Queue object (from the module named Queue or queue, depending on the Python version).
Create a producer function, running in one thread, that collects the API data and puts it on the queue, and a consumer function, running in another thread, that waits to get data from the queue and stores it in the database.
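A minimal sketch of that producer/consumer pattern, assuming Python 3 (where the module is `queue`; on Python 2 it is `Queue`). The `fetch` and `store` functions here are hypothetical stand-ins for the question's `getdatafrom` API call and the `flush` database insert; a `None` sentinel is one common way to signal end of data:

```python
import queue
import threading

q = queue.Queue()
flushed_batches = []  # records what the consumer "stored", to make the flow visible

def fetch(i):
    # hypothetical stand-in for requests.get(url).json()
    return {"id": i, "field_one": "value", "field_two": i * 100}

def store(batch):
    # hypothetical stand-in for connection.execute(table.insert(), batch)
    flushed_batches.append(list(batch))

def producer(n_items):
    # runs in one thread: collects API data and puts each record on the queue
    for i in range(n_items):
        q.put(fetch(i))
    q.put(None)  # sentinel: tells the consumer there is no more data

def consumer(flush_limit=10):
    # runs in another thread: waits for data and stores it in batches,
    # so the producer is never blocked by the database insert
    batch = []
    while True:
        item = q.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= flush_limit:
            store(batch)
            batch = []
    if batch:
        store(batch)  # flush whatever is left over

t_prod = threading.Thread(target=producer, args=(25,))
t_cons = threading.Thread(target=consumer)
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()
print([len(b) for b in flushed_batches])  # → [10, 10, 5]
```

Because `q.get()` blocks until an item is available, the consumer simply sleeps while the producer is fetching, and the producer keeps fetching while the consumer is writing to the database, which is exactly the parallelism the question asks for.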