Question

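I regularly need to convert a running log/text file (300K+ lines, 20 MB+) into many separate xlsx files (roughly 2K+ files of varying length). At the same time, I build a directory tree based on the data in the file.

The idea is to create a script that can be rerun as needed. Reading the whole file to collect all the information I need takes only 6-7 seconds, but generating all of the xlsx files the first time currently takes about 2 minutes. I'm only a few weeks into Python, so I'm not sure whether that is reasonable.

I'm using openpyxl (on Windows 7) because it supports sheet protection, but I'm not sure whether there is a faster way to do the same thing. Any replacement would need to support sheet protection and column width adjustment.

I tried "write_only" mode but didn't notice a significant difference in speed. Removing the cell protection didn't seem to make a difference either.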

with open(file_name, "r") as f:
    close_prev_file = False
    # Turn off additional regex lookups until needed (speeds up read process)
    read_body = False
    data = f.readlines()
    for line in data:
        # Header handling
        if re.search(<large regex pattern>,line,re.IGNORECASE) is not None:
            # Activate dormant regex lookups below (slows down read process)
            read_body = True
            # If header information is found, finish writing any previous files and start a new one
            if close_prev_file is True:
                path = <pattern from concatenated variable results>
                new_file = path + <variable results> + ".xlsx"
                # If the new_file doesn't already exist, create it
                if os.path.exists(new_file) is False:
                    distutils.dir_util.mkpath(path)
                    print("Generating Excel file: " + new_file)
                    wb.save(new_file)
            close_prev_file = True
            wb = Workbook()
            ws = wb["Sheet"]  # get_sheet_by_name() is deprecated in newer openpyxl releases
            <apply sheet protection>
        # Body text handling
        elif read_body is True:
            # Read current line, decide how to format the output
            < if/then code>
                # Format example: pull data from the line, split into two columns
                #ws.cell(row=row, column=1, value=re.sub("<pattern>","",line,flags=re.IGNORECASE))
                #ws.cell(row=row, column=2, value=re.search("<pattern>",line,re.IGNORECASE).group(0))
            <build variables from regex searches of subsequent lines>
            # If the intended file already exists, skip further regex searches and resume looking for header info
            if <all variables established>:
                path = <pattern from concatenated variable results>
                new_file = path + <variable results> + ".xlsx"
                if os.path.exists(new_file) is True:
                    <turn off reading, reset variables>
                    close_prev_file = False
                    read_body = False

Answer 1

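You could try offloading the IO work of writing the xlsx files onto threads. Because of the GIL (https://wiki.python.org/moin/GlobalInterpreterLock), Python's threading module does not execute Python code in parallel.

However, performance still improves when IO and non-IO operations are interleaved across different threads.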
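To make that point concrete, here is a minimal sketch that compares running a blocking call sequentially against running it on several threads. It assumes time.sleep is a fair stand-in for a blocking write (both release the GIL while waiting); fake_io, run_sequential and run_threaded are hypothetical names used only for illustration, and the gain for real xlsx writes will depend on how much of the save is CPU work.

```python
import threading
import time


def fake_io(seconds):
    # Stand-in for a blocking write: time.sleep releases the GIL,
    # much like waiting on a disk call does.
    time.sleep(seconds)


def run_sequential(n, seconds):
    # Run n blocking calls one after another and return the elapsed time.
    start = time.perf_counter()
    for _ in range(n):
        fake_io(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    # Run n blocking calls on n threads and return the elapsed time.
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(seconds,))
               for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential:", run_sequential(4, 0.2))
    print("threaded:  ", run_threaded(4, 0.2))
```

The threaded run should finish in roughly the time of one call, because the waits overlap even though only one thread executes Python bytecode at a time.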

Here I generate 100 random NumPy arrays and save them to disk with np.savetxt. Offloading the IO onto threads gives a noticeable performance gain.

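%%time
# IPython/Jupyter cell; %%time reports CPU and wall time for the whole cell
import uuid
import numpy as np

count = 0
while count < 100:
    array = np.random.randint(1, size=(200, 600))
    # Write each array to a uniquely named file, one after another
    np.savetxt(str(uuid.uuid1(count)), array)
    count += 1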

CPU times: user 5.24 s, sys: 307 ms, total: 5.55 s

Wall time: 28.9 s

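%%time
import threading
import uuid
import numpy as np

thread_list = []
count = 0
while count < 100:
    # Busy-wait while more than 8 threads are alive (the spinning itself costs CPU time)
    if threading.active_count() > 8:
        continue

    array = np.random.randint(1, size=(200, 600))

    # Hand the write off to a worker thread instead of blocking the loop
    thread = threading.Thread(target=np.savetxt,
                              args=[str(uuid.uuid1(count)), array])

    thread_list.append(thread)
    thread.start()

    count += 1

# Wait for all outstanding writes to finish
for thread in thread_list:
    thread.join()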

CPU times: user 18 s, sys: 660 ms, total: 18.6 s

Wall time: 18.5 s

You could try offloading the call to wb.save(new_file) onto a separate thread. Perhaps something like:

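import threading

# Choose a limit suitable for your PC.
if threading.active_count() > 8:
    # Too many saves in flight: wait for the ones started so far
    for thread in thread_list:
        thread.join()

# Start the save on a new thread in either case, so no file is skipped
thread = threading.Thread(target=wb.save,
                          args=[new_file])

thread_list.append(thread)
thread.start()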

Of course, you need to define "thread_list = []" somewhere outside the main loop, and so on. Also join the remaining threads before exiting.

Edit: don't forget to import the threading library :)
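The bookkeeping described above (a thread list kept outside the loop, a cap on active threads, a final join) can also be left to the standard library's concurrent.futures.ThreadPoolExecutor. This is a swapped-in alternative, not the approach used in this answer. The sketch below is only illustrative: save_all, make_writer and the jobs list are hypothetical names, and a plain text writer stands in for a workbook so the example runs without openpyxl. With openpyxl, each job would presumably be (wb.save, new_file).

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def save_all(jobs, max_workers=8):
    """Run each (save_callable, path) pair on a bounded pool of threads.

    `jobs` is a hypothetical iterable; with openpyxl each entry would be
    something like (wb.save, new_file).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(save, path) for save, path in jobs]
        # result() re-raises any exception raised inside a worker thread.
        for future in futures:
            future.result()
    # Leaving the with-block waits for every submitted save to finish.


def make_writer(text):
    # Hypothetical stand-in for a workbook: returns a callable that
    # writes `text` to the given path, mimicking wb.save(path).
    def save(path):
        with open(path, "w") as f:
            f.write(text)
    return save


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    jobs = [(make_writer("report %d" % i),
             os.path.join(out_dir, "%d.txt" % i))
            for i in range(20)]
    save_all(jobs)
    print(len(os.listdir(out_dir)), "files written to", out_dir)
```

The pool caps the number of concurrent saves at max_workers without a busy-wait, and a failed save surfaces as an exception instead of being lost in a background thread.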
