Question

I have a separate spreadsheet for each month of the year, 12 spreadsheets in total. Each workbook contains 200k-500k rows.

For example:

January

| name  | course  | grade |
|-------|---------|-------|
| dave  | math    | 90    |
| chris | math    | 80    |
| dave  | english | 75    |

February

| name  | course  | grade |
|-------|---------|-------|
| dave  | science | 72    |
| chris | art     | 58    |
| dave  | music   | 62    |

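I use openpyxl to open each monthly workbook, iterate over every row and cell, and write the relevant data to a per-person workbook. That is, all rows belonging to Chris go into "Chris.xlsx" and all rows belonging to Dave go into "Dave.xlsx".

The problem I'm having is that openpyxl is very slow. I'm sure this is because my code is very procedural and does not optimise the iteration and writing.

Any ideas would be much appreciated.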

import os

from openpyxl import load_workbook
from openpyxl.utils import column_index_from_string

def appendToWorkbooks():
    print("Appending workbooks")
    je_dump_path = "C:/test/"

    # define list of files in path
    je_dump_files = os.listdir( je_dump_path )

    # define path for resultant files
    results_path = "C:/test/output/"

    max_row = 0
    input_row = 1

    for file in je_dump_files:
        current_row = 1

        # load each workbook in the directory
        load_file = je_dump_path + file
        print("Loading workbook: " + file)
        wb = load_workbook(filename=load_file, read_only=True)
        print("Loaded workbook: " + file)

        # select the worksheet with the name Sheet in each workbook
        ws = wb['Sheet']
        print("Loaded worksheet")

        # iterate through the rows in the currently open workbook
        for row in ws.iter_rows():

            # determine the person this row of data relates to
            person = ws.cell(row=current_row, column=1).value

            # set output workbook to that person
            output_wb_file = results_path + person + ".xlsx"
            output_wb = load_workbook(output_wb_file)
            output_ws = output_wb["Sheet"]

            # increment the current row
            current_row = current_row + 1

            print("Currently on row: " + str(current_row))

            # determine the last row in the current output workbook
            max_row = output_ws.max_row

            # set the output row to the row after the last row in the current output workbook
            output_row = max_row + 1

            for cell in row:
                output_ws.cell(row=output_row, column=column_index_from_string(cell.column)).value = cell.value
            output_wb.save(output_wb_file)

Answer 1

This line is very expensive inside the loop: max_row = output_ws.max_row
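One way to avoid that per-row cost is sketched below. It is only an illustration, not the asker's code and not a measured fix: it assumes the person's name is in the first column, that every input sheet is named "Sheet", and that the per-person files can be created from scratch rather than appended to existing ones. The function name `split_rows_by_person` and its parameters are hypothetical. Each output workbook is opened once in openpyxl's write-only mode, rows are added with `append()` so `max_row` is never queried, and each file is saved a single time at the end.

```python
import os

from openpyxl import Workbook, load_workbook


def split_rows_by_person(input_dir, output_dir, sheet_name="Sheet", header_rows=0):
    """Copy every row into <person>.xlsx, keyed on the value in column 1.

    Hypothetical helper: output files are created fresh, not appended to.
    Returns a dict mapping each person to the number of rows written.
    """
    writers = {}  # person -> (write-only workbook, worksheet)
    counts = {}   # person -> number of rows appended so far

    for file_name in sorted(os.listdir(input_dir)):
        # Skip sub-directories, non-Excel files and Excel lock files.
        if not file_name.endswith(".xlsx") or file_name.startswith("~$"):
            continue

        wb = load_workbook(os.path.join(input_dir, file_name), read_only=True)
        ws = wb[sheet_name]

        # values_only=True yields plain tuples instead of cell objects.
        for row in ws.iter_rows(min_row=header_rows + 1, values_only=True):
            if not row or row[0] is None:
                continue  # ignore empty rows
            person = row[0]

            if person not in writers:
                out_wb = Workbook(write_only=True)
                out_ws = out_wb.create_sheet("Sheet")
                writers[person] = (out_wb, out_ws)
                counts[person] = 0

            # append() writes after the last row; no max_row lookup needed.
            writers[person][1].append(row)
            counts[person] += 1

        wb.close()

    # Save each output workbook exactly once, after all input is processed.
    os.makedirs(output_dir, exist_ok=True)
    for person, (out_wb, _) in writers.items():
        out_wb.save(os.path.join(output_dir, f"{person}.xlsx"))

    return counts
```

A call such as `split_rows_by_person("C:/test/", "C:/test/output/")` would mirror the paths in the question. Keeping one write-only workbook open per person is a trade-off: it avoids holding every row in memory, but with a very large number of distinct people it may be better to process them in batches.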

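That said, you really need to provide more details about your files and the performance you're seeing. How big are the individual files? How long does each one take to load on its own? And so on.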
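A rough way to collect those numbers is sketched below, assuming the same folder layout and sheet name as in the question. The helper name `profile_workbooks` is hypothetical. For each workbook it records the file size, the number of rows, and the time taken to open the file in read-only mode and stream through every row once.

```python
import os
import time

from openpyxl import load_workbook


def profile_workbooks(input_dir, sheet_name="Sheet"):
    """Return (file name, size in MB, row count, seconds) for each .xlsx file."""
    results = []
    for file_name in sorted(os.listdir(input_dir)):
        if not file_name.endswith(".xlsx") or file_name.startswith("~$"):
            continue

        path = os.path.join(input_dir, file_name)
        size_mb = os.path.getsize(path) / (1024 * 1024)

        start = time.perf_counter()
        wb = load_workbook(path, read_only=True)
        ws = wb[sheet_name]
        # Stream through the sheet once, counting rows without keeping them.
        row_count = sum(1 for _ in ws.iter_rows(values_only=True))
        elapsed = time.perf_counter() - start
        wb.close()

        results.append((file_name, size_mb, row_count, elapsed))
    return results
```

Printing the returned tuples gives a baseline for reading alone, which helps show whether the time is going into loading the monthly files or into writing the per-person files.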
