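I wrote a script that has to read a large number of Excel files (about 10,000) from a folder. The script loads each Excel file (some have more than 2,000 rows) and reads one column to count the rows (checking the contents). If the row count does not equal a given number, it writes a warning to a log.
The problem appears once the script has read more than 1,000 Excel files: it throws a memory error, and I don't know where the problem is. Before that, the script reads two CSV files of 14,000 rows each and stores them in lists. These lists hold the identifier of each Excel file and its expected row count. If that count does not equal the row count of the Excel file, a warning is written. Could reading these lists be the problem?
I am using openpyxl to load the workbooks. Do I need to close them before opening the next one?
This is my code: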
# -*- coding: utf-8 -*-
import os
from openpyxl import Workbook
import glob
import time
import csv
import datetime  # needed by the datetime.datetime.today() calls below
from time import gmtime,strftime
from openpyxl import load_workbook
folder = ''
conditions = 0
a = 0
flight_error = 0
condition_error = 0
typical_flight_error = 0
SP_error = 0
cond_numbers = []
with open('Conditions.csv','rb') as csv_name: # Open the csv file that holds the equivalences
    csv_read = csv.reader(csv_name,delimiter='\t')
    for reads in csv_read:
        cond_numbers.append(reads)
flight_TF = []
with open('vuelo-TF.csv','rb') as vuelo_TF:
    csv_read = csv.reader(vuelo_TF,delimiter=';')
    for reads in csv_read:
        flight_TF.append(reads)
excel_files = glob.glob('*.xlsx')
for excel in excel_files:
    print "Leyendo excel: "+excel
    wb = load_workbook(excel)
    ws = wb.get_sheet_by_name('Control System')
    flight = ws.cell('A7').value
    typical_flight = ws.cell('B7').value
    a = 0
    for row in range(6,ws.get_highest_row()):
        conditions = conditions + 1
        value_flight = int(ws.cell(row=row,column=0).value)
        value_TF = ws.cell(row=row,column=1).value
        value_SP = int(ws.cell(row=row,column=4).value)
        if value_flight == '':
            break
        if value_flight != flight:
            flight_error = 1 # Not all flight numbers within the flight are equal
        if value_TF != typical_flight:
            typical_flight_error = 2 # Not all typical flights within the flight are equal
        if value_SP != 100:
            SP_error = 1
    for cond in cond_numbers:
        if int(flight) == int(cond[0]):
            conds = int(cond[1])
            if conds != int(conditions):
                condition_error = 1 # The number of conditions does not match the expected one
    for vuelo_TF in flight_TF:
        if int(vuelo_TF[0]) == int(flight):
            TF = vuelo_TF[1]
            if typical_flight != TF:
                typical_flight_error = 1 # The flight does not match its typical flight
    if flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': Los flight numbers del vuelo '+str(flight)+' no coinciden.\n'
        log.write(message)
        log.close()
        flight_error = 0
    if condition_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': El número de condiciones del vuelo '+str(flight)+' no coincide. Condiciones esperadas: '+str(int(conds))+'. Condiciones obtenidas: '+str(int(conditions))+'.\n'
        log.write(message)
        log.close()
        condition_error = 0
    if typical_flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': El vuelo '+str(flight)+' no coincide con el typical flight. Typical flight respectivo: '+TF+'. Typical flight obtenido: '+typical_flight+'.\n'
        log.write(message)
        log.close()
        typical_flight_error = 0
    if typical_flight_error == 2:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': Los typical flight del vuelo '+str(flight)+' no son todos iguales.\n'
        log.write(message)
        log.close()
        typical_flight_error = 0
    if SP_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': Hay algún Step Percentage del vuelo '+str(flight)+' menor que 100.\n'
        log.write(message)
        log.close()
        SP_error = 0
    conditions = 0
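The if statements at the end check the error flags and write the warnings to the log.
I am using Windows XP with 8 GB of RAM and an Intel Xeon W3505 (two cores, 2.53 GHz).
Answer 0 (score: 10)
By default, openpyxl keeps every cell it accesses in memory. I suggest using the optimized reader instead (link: https://openpyxl.readthedocs.org/en/latest/optimized.html).
In code:
wb = load_workbook(file_path, use_iterators = True)
Pass use_iterators = True when loading the workbook, then access the sheet and cells like this: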
for row in sheet.iter_rows():
    for cell in row:
        cell_text = cell.value
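This reduces the memory footprint to 5-10% of what the default reader uses.
Update: the use_iterators = True option was removed in version 2.4.0. Newer versions introduce openpyxl.writer.write_only.WriteOnlyWorksheet for dumping large amounts of data.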
from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet()
# now we'll fill it with 100 rows x 200 columns
for irow in range(100):
    ws.append(['%d' % i for i in range(200)])
# save the file
wb.save('new_big_file.xlsx')
I have not tested the code above; it is copied from the link given earlier.
Thanks to @SdaliM for the information.
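For the reading side in current openpyxl releases, the streaming behaviour described in this answer is available through read_only=True. The following is a minimal sketch, not the asker's script: the helper name count_filled_rows and the demo workbook are made up for illustration, and values_only=True assumes openpyxl 2.6 or later. It counts the filled cells in column A without loading the whole sheet, and closes the workbook explicitly, which read-only mode needs in order to release the file.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def count_filled_rows(path, sheet_name, first_row=1):
    """Count consecutive rows, from first_row on, whose first cell is non-empty."""
    # read_only=True streams the sheet instead of keeping every cell in memory.
    wb = load_workbook(path, read_only=True)
    try:
        ws = wb[sheet_name]
        count = 0
        for row in ws.iter_rows(min_row=first_row, max_col=1, values_only=True):
            value = row[0] if row else None
            if value is None or value == "":
                break  # stop at the first empty cell in column A
            count += 1
        return count
    finally:
        # Read-only workbooks keep the file open until closed explicitly.
        wb.close()


# Build a small throwaway workbook so the sketch can be run end to end.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
demo_wb = Workbook()
demo_ws = demo_wb.active
demo_ws.title = "Control System"
for flight_number in (101, 101, 101):
    demo_ws.append([flight_number, "TF-A"])
demo_wb.save(demo_path)

print(count_filled_rows(demo_path, "Control System"))
```

Calling a helper like this once per file keeps only one workbook open at a time, so memory use should not grow with the number of files processed.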
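Answer 1 (score: 1)
As @anuragal said:
openpyxl stores all accessed cells in memory
Another way to deal with this huge memory problem while looping over every cell is divide and conquer. The key is that, after reading enough cells, you save the workbook with wb.save() and reload it, so that the values already processed are dropped from memory.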
import openpyxl

# excelPath is the path of the workbook to process; set it before running.
checkPointLine = 100 # choose a better number in your case.
excel = openpyxl.load_workbook(excelPath,data_only= True)
ws = excel.active
max_row = ws.max_row  # last row that contains data
isDirty = False       # becomes True once a row has been modified
readingLine = 1
for rowNum in range(readingLine,max_row + 1):
    row = ws[rowNum]
    first = row[0]
    currentRow = first.row
    # do the work on this row's content, then mark `isDirty = True`
    if currentRow%checkPointLine == 0:
        if isDirty:
            # write back only changed content
            excel.save(excelPath)
            isDirty = False
        # reload so the cells read so far can be released
        excel = openpyxl.load_workbook(excelPath)
        ws = excel.active
        readingLine = first.row
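Answer 2 (score: 0)
With recent versions of openpyxl, large source workbooks have to be loaded and read with the read_only=True argument, and large target workbooks have to be created and written in write_only=True mode.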
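A sketch of how the two modes can be combined, assuming openpyxl 2.6 or later for values_only=True. The function name copy_sheet_streaming and the demo files are hypothetical and only there to make the example runnable; it copies the values of the first sheet row by row, so neither workbook is held in memory in full.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def copy_sheet_streaming(src_path, dst_path):
    """Copy the first sheet's values from src_path to dst_path, row by row."""
    src_wb = load_workbook(src_path, read_only=True)  # streaming reader
    dst_wb = Workbook(write_only=True)                # streaming writer
    try:
        src_ws = src_wb.worksheets[0]
        dst_ws = dst_wb.create_sheet(title=src_ws.title)
        rows_copied = 0
        for row in src_ws.iter_rows(values_only=True):
            dst_ws.append(list(row))  # write-only sheets accept append() only
            rows_copied += 1
        dst_wb.save(dst_path)
        return rows_copied
    finally:
        src_wb.close()  # release the source file handle


# Build a small source workbook so the sketch can be run end to end.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "source.xlsx")
dst_path = os.path.join(work_dir, "target.xlsx")
source = Workbook()
for i in range(5):
    source.active.append([i, "row %d" % i])
source.save(src_path)

copied = copy_sheet_streaming(src_path, dst_path)
print(copied)
```

Only cell values are carried over in this sketch; styles, formulas evaluated by Excel, and additional sheets would need separate handling.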
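Answer 3 (score: 0)
This approach worked for me when copying data from a SQLite database into one worksheet per table. Some of the tables have more than 250,000 rows, and I was running into memory errors from OpenPyXL. The trick is to save incrementally every 100,000 rows and then reopen the workbook, which seems to reduce memory usage. What I do is very similar to what @sakiM did above. Here is the part of my code that does this: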
row_num = 2 # row 1 previously populated with column names
session = self.CreateDBSession() # SQL Alchemy connection to SQLite
for item in session.query(ormClass):
    col_num = 1
    for col_name in sorted(fieldsInDB): # list of columns from the table being put into XL columns
        if col_name != "__mapper__": # Something SQL Alchemy apparently adds...
            val = getattr(item, col_name)
            sheet.cell(row=row_num, column=col_num).value = val
            col_num += 1
    row_num += 1
    if row_num % self.MAX_ROW_CHUNK == 0: # MAX_ROW_CHUNK = 100000
        self.WriteChunk()

# Write this chunk and reload the workbook to work around OpenPyXL memory issues
def WriteChunk(self):
    print("Incremental save of %s" % self.XLSPath)
    self.SaveXLWorkbook()
    print("Reopening %s" % self.XLSPath)
    self.OpenXLWorkbook()

# Open the XL Workbook we are updating
def OpenXLWorkbook(self):
    if not self.workbook:
        self.workbook = openpyxl.load_workbook(self.XLSPath)
    return self.workbook

# Save the workbook
def SaveXLWorkbook(self):
    if self.workbook:
        self.workbook.save(self.XLSPath)
        self.workbook = None