Is there a way to execute the SQL queries concurrently using threads so that I can reduce the processing time of the code below? And is there a better way to produce the same result as the code below without using the pandas module? Given the size of the dataset I am working with, I cannot hold the whole dataset in memory, and I have found that looping over the rows of a SELECT * FROM statement and comparing each one against the list of ids I am querying for adds to the processing time.
# DATABASE layout
# _____________________________________________________________
# | id | name | description |
# |_____________|____________________|__________________________|
# | 1 | John | Credit Analyst |
# | 2 | Jane | Doctor |
# | ... | ... | ... |
# | 5000000 | Mohammed | Dentist |
# |_____________|____________________|__________________________|
import sqlite3
SEARCH_IDS = [x for x in range(15000)]
DATABASE_NAME = 'db.db'
def chunks(wholeList, chunkSize=999):
    """Yield successive chunkSize-sized chunks from wholeList."""
    # 999 matches SQLite's default limit on the number of host
    # parameters allowed in a single statement.
    for i in range(0, len(wholeList), chunkSize):
        yield wholeList[i:i + chunkSize]

def search_database_for_matches(listOfIdsToMatch):
    """Take a list of ids and return the matching rows."""
    conn = sqlite3.connect(DATABASE_NAME)
    cursor = conn.cursor()
    placeholders = ', '.join('?' for _ in listOfIdsToMatch)
    sql = "SELECT id, name, description FROM datatable WHERE id IN ({})".format(placeholders)
    cursor.execute(sql, tuple(listOfIdsToMatch))
    rows = cursor.fetchall()
    conn.close()  # avoid leaking a connection per chunk
    return rows

def arrange(orderOnList, listToBeOrdered, defaultReturnValue='N/A'):
    """Take a list of ids in the desired order and a list of tuples whose
    first item is an id; arrange the tuples into a new list that follows
    the order of the ids in orderOnList."""
    from collections import OrderedDict
    resultList = [defaultReturnValue for _ in orderOnList]
    indexLookUp = OrderedDict((value, index) for index, value in enumerate(orderOnList))
    for item in listToBeOrdered:
        resultList[indexLookUp[item[0]]] = item
    return resultList

def main():
    results = []
    for chunk in chunks(SEARCH_IDS, 999):
        results += search_database_for_matches(chunk)
    results = arrange(SEARCH_IDS, results)
    print(results)

if __name__ == '__main__':
    main()
Answer 0 (score: 4):
A few suggestions:
Instead of reading the records chunk by chunk with an iterator, you should use pagination; see this related question on paginating SQLite results.
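For illustration, here is a minimal keyset-pagination sketch (the iter_pages helper is hypothetical, not from the linked question; it assumes the question's datatable schema with an indexed integer id column). Walking the table with WHERE id > last_id keeps each page cheap, unlike a growing OFFSET:

import sqlite3

def iter_pages(db_path, page_size=1000):
    """Yield pages of rows in id order using keyset pagination."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    last_id = 0
    while True:
        cursor.execute(
            "SELECT id, name, description FROM datatable"
            " WHERE id > ? ORDER BY id LIMIT ?", (last_id, page_size))
        rows = cursor.fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # resume after the last id seen
    conn.close()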
If you are using multithreading/multiprocessing, make sure your database can support it. See: SQLite And Multiple Threads.
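By default the sqlite3 module raises ProgrammingError if a connection created in one thread is used from another. One common pattern, sketched here (the get_connection helper is mine, not from the answer), is to give every thread its own connection via thread-local storage:

import sqlite3
import threading

_local = threading.local()  # each thread sees its own attributes

def get_connection(db_path='db.db'):
    """Return this thread's private connection, creating it on first use."""
    if not hasattr(_local, 'conn'):
        _local.conn = sqlite3.connect(db_path)
    return _local.conn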
To achieve what you want, you can use a pool of workers that handles one chunk each. See Using a pool of workers in the Python documentation.
Example:
import multiprocessing

with multiprocessing.Pool(processes=4) as pool:
    # pool.map returns one list of rows per chunk, in input order.
    chunk_results = pool.map(search_database_for_matches, list(chunks(SEARCH_IDS, 999)))
results = [row for rows in chunk_results for row in rows]
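Since each call to search_database_for_matches opens its own connection, no sqlite3 connection object ever crosses a process boundary, and because pool.map returns results in input order, the flattened rows can still be passed to arrange(SEARCH_IDS, results) exactly as in the original main().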