I have a list of Excel 2010 files (xlsx) that I need to search for a specific value. Since xlsx is a binary format, a plain text editor won't do, so I perform the steps below for each file.
This lends itself to multiprocessing because the task is not I/O bound: the pandas parsing and the array conversion are what take the time. So I set up a multiprocessing version of my script (see below).
The problem is the memory consumption of the worker processes. Although each xlsx file is only about 100 kB, each worker's memory keeps growing, up to 2 GB. I don't understand why the memory is not released before the next file is processed; this way I run out of memory before the file list is finished.
The problem does not seem to be the queue, but the pandas part.
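To confirm that the growth really happens inside the worker loop, one can log each process's peak resident set size after every file. This is only a diagnostic sketch (the `log_peak_rss` helper name is my own, and the `resource` module is Unix-only):

```python
import os
import resource

def log_peak_rss(tag):
    # ru_maxrss is this process's peak resident set size:
    # kilobytes on Linux, bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('pid {}, {}: peak RSS {}'.format(os.getpid(), tag, peak))
```

Calling `log_peak_rss(task)` at the end of each loop iteration in the worker would show whether the peak keeps climbing from file to file.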
Here is my code. It can be tested with any xlsx files on your system.
import pandas as pd
import multiprocessing as mp
import glob

path = r'c:\temp'
fileFilter = 'serial.xlsx'
searchString = '804.486'

def searchFile(tasks, results, searchString):
    """Iterates over the files in tasks and searches each file for
    occurrences of 'searchString'.

    Args:
    -----
    tasks: queue of strings
        Files to look in
    results: queue of strings
        Files where the searchString was found
    searchString: str
        the string to be searched
    """
    # for files in the queue
    for task in iter(tasks.get, 'STOP'):
        # read the file structure into memory
        xfile = pd.ExcelFile(task)
        # iterate over the first three sheets
        for sheet in xfile.sheet_names[:3]:
            # read the sheet
            data = pd.read_excel(xfile, sheet)
            # check if searchString is in the numpy representation of the dataframe
            if searchString in data.values.astype(str):
                # put the filename in the results queue
                results.put(task)
                break
        xfile.close()

if __name__ == "__main__":
    # get all files matching the filter below the root path
    print('gathering files')
    files = glob.glob(path + '\**\{}'.format(fileFilter), recursive=True)
    # set up queues and variables
    n_proc = 2
    tasks = mp.Queue()
    results = mp.Queue()
    print('Start processing')
    # set up the processes and start them
    procs = [mp.Process(target=searchFile,
                        args=(tasks, results, searchString))
             for x in range(n_proc)]
    for p in procs:
        p.daemon = True
        p.start()
    # populate the queue
    for file in files:
        tasks.put(file)
    for proc in procs:
        tasks.put('STOP')
    for p in procs:
        p.join()
    # print the results
    for result in range(results.qsize()):
        print(results.get())
    print('Done')
Answer 0 (score: 0)
The problem seems to be that the gc cannot collect the pandas frames while you never leave the function context. You can use multiprocessing.Pool.map (or starmap, when the worker takes several arguments) to do the queue handling for you: the worker function is called once per item, which lets the gc do its work between calls. You can also use the maxtasksperchild pool constructor argument to limit the number of items each worker processes, so that the worker is replaced by a fresh process afterwards.
import glob
import multiprocessing

import pandas as pd

def searchFile(task, searchString):
    xfile = pd.ExcelFile(task)
    ...
    if found:
        return task

if __name__ == '__main__':
    files = glob.glob(path + '\**\{}'.format(fileFilter), recursive=True)
    searchString = '804.486'
    # recycle each worker after 10 tasks so its leaked memory is reclaimed
    pool = multiprocessing.Pool(2, maxtasksperchild=10)
    args = ((fname, searchString) for fname in files)
    # starmap unpacks each (fname, searchString) tuple into the two arguments
    matchedFiles = filter(None, pool.starmap(searchFile, args))
    pool.close()
    pool.join()
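The effect of maxtasksperchild can be seen without any Excel files at all: with chunksize=1, a single worker limited to two tasks gets replaced twice over six tasks, so three distinct process ids show up in the results. A standalone sketch (the names `report_pid` and `run_demo` are illustrative):

```python
import multiprocessing
import os

def report_pid(_task):
    # each call just returns the id of the worker process that handled it
    return os.getpid()

def run_demo():
    # one worker, recycled after every 2 tasks:
    # 6 tasks with chunksize=1 therefore pass through 3 fresh processes
    with multiprocessing.Pool(processes=1, maxtasksperchild=2) as pool:
        return pool.map(report_pid, range(6), chunksize=1)

if __name__ == '__main__':
    print(sorted(set(run_demo())))
```

Each replacement process starts with a clean heap, which is what caps the per-worker memory growth in the Excel search above.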