I have a separate spreadsheet for each month of the year, 12 in total. Each workbook contains 200k-500k rows.

For example:

January
| name | course | grade |
|-------|---------|-------|
| dave | math | 90 |
| chris | math | 80 |
| dave | english | 75 |
February
| name | course | grade |
|-------|---------|-------|
| dave | science | 72 |
| chris | art | 58 |
| dave | music | 62 |
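I use openpyxl to open each monthly workbook, iterate over every row and every cell, and write the relevant data to a per-person workbook. That is, all rows belonging to Chris go into "Chris.xlsx", and all rows belonging to Dave go into "Dave.xlsx".

The problem I'm having is that openpyxl is very slow. I'm sure this is because my code is very procedural and does nothing to optimize the iteration and writing.

Any ideas would be greatly appreciated.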

    import os

    from openpyxl import load_workbook
    from openpyxl.utils import column_index_from_string


    def appendToWorkbooks():
        print("Appending workbooks")
        je_dump_path = "C:/test/"
        # define list of files in path
        je_dump_files = os.listdir(je_dump_path)
        # define path for resultant files
        results_path = "C:/test/output/"
        max_row = 0
        input_row = 1
        for file in je_dump_files:
            current_row = 1
            # load each workbook in the directory
            load_file = je_dump_path + file
            print("Loading workbook: " + file)
            wb = load_workbook(filename=load_file, read_only=True)
            print("Loaded workbook: " + file)
            # select the worksheet with the name Sheet in each workbook
            ws = wb['Sheet']
            print("Loaded worksheet")
            # iterate through the rows in the currently open workbook
            for row in ws.iter_rows():
                # determine the person this row of data relates to
                person = ws.cell(row=current_row, column=1).value
                # set output workbook to that person
                output_wb_file = results_path + person + ".xlsx"
                output_wb = load_workbook(output_wb_file)
                output_ws = output_wb["Sheet"]
                # increment the current row
                current_row = current_row + 1
                print("Currently on row: " + str(current_row))
                # determine the last row in the current output workbook
                max_row = output_ws.max_row
                # set the output row to the row after the last row in the current output workbook
                output_row = max_row + 1
                for cell in row:
                    output_ws.cell(row=output_row, column=column_index_from_string(cell.column)).value = cell.value
                # save the output workbook after copying this row
                output_wb.save(output_wb_file)

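Answer 0 (score: 0)

This line is very expensive inside the loop:

    max_row = output_ws.max_row

But you really need to provide more details about your files and the performance you're seeing. How large are the individual files? How long does each one take to load on its own? And so on.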
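
Beyond `max_row`, the loop in the question also appears to call `load_workbook` and `save` on an output workbook once per input row, which means hundreds of thousands of full loads and saves per month. Here is a minimal sketch of one way to avoid that: read every row once, group the rows by person in memory, then write each person's workbook a single time using openpyxl's write-only mode. The function names (`group_rows_by_person`, `read_rows`, `write_person_workbook`, `split_by_person`) are made up for illustration. The sketch assumes the name is in the first column, that all rows fit in memory, and that creating fresh output files is acceptable (the original code appends to existing ones).

```python
import os
from collections import defaultdict


def group_rows_by_person(rows, name_column=0):
    """Group row tuples by the value in name_column (assumed to hold the name)."""
    grouped = defaultdict(list)
    for row in rows:
        # Skip empty rows and rows with no name.
        if not row or row[name_column] is None:
            continue
        grouped[row[name_column]].append(row)
    return grouped


def read_rows(path, sheet_name="Sheet"):
    """Yield each row of one monthly workbook as a tuple of plain values."""
    from openpyxl import load_workbook  # third-party, imported where it is used

    wb = load_workbook(filename=path, read_only=True)
    try:
        # values_only=True yields cell values rather than cell objects.
        # Pass min_row=2 as well if the first row is a header.
        yield from wb[sheet_name].iter_rows(values_only=True)
    finally:
        wb.close()


def write_person_workbook(path, rows, sheet_name="Sheet"):
    """Write all rows for one person in a single pass and save once."""
    from openpyxl import Workbook

    wb = Workbook(write_only=True)
    ws = wb.create_sheet(title=sheet_name)
    for row in rows:
        ws.append(row)
    wb.save(path)


def split_by_person(input_dir, output_dir):
    """Read every .xlsx file in input_dir once, then write one file per person."""
    paths = [
        os.path.join(input_dir, name)
        for name in sorted(os.listdir(input_dir))  # alphabetical, not by month
        if name.endswith(".xlsx")
    ]
    all_rows = (row for path in paths for row in read_rows(path))
    for person, rows in group_rows_by_person(all_rows).items():
        write_person_workbook(os.path.join(output_dir, f"{person}.xlsx"), rows)
```

With the paths from the question this would be called as `split_by_person("C:/test/", "C:/test/output/")`. Each input workbook is opened once and each output workbook is saved once, so the per-row `load_workbook`, `max_row`, and `save` calls disappear. If the data does not fit in memory, the same idea could be applied one month at a time.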