Speeding up the conversion of tens of thousands of doc files to docx with Python

Time: 2018-11-07 08:38:38

Tags: python-3.x

I have more than 44K doc files waiting to be converted to docx. The code I use to convert a single doc file is as follows:

from win32com import client

def doc2docx(doc_name):
    word = client.Dispatch("Word.Application")
    doc = word.Documents.Open(doc_name)
    docx_name = doc_name.replace(".doc", ".docx")
    doc.SaveAs(docx_name, 16)
    doc.Close()
    word.Quit()

I tried the following code to convert a subset of 10 doc files:

from glob import glob
from time import time

paths = glob("U:\\WordDocuments\\*.doc")
start = time()
counter = 0
for i in paths:
    doc2docx(i)
    counter += 1
    print(counter)
end = time()
duration = end - start
print("It took", duration, "seconds to process 10 doc files.")

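The code above runs without errors. However, it took more than 3 minutes to convert 10 doc files. How can I speed this up? I can think of multithreading or multiprocessing, but I don't know how to implement them. Thanks!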
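One thing that stands out in the code above is that doc2docx starts and quits Word for every single file, which is likely a large share of the three minutes. Below is a minimal sketch that opens Word once and reuses it for the whole batch. It assumes pywin32 and Word are installed on a Windows machine; convert_all and docx_path_for are hypothetical names introduced here, not part of the original code. The helper also replaces only the final extension, since doc_name.replace(".doc", ".docx") would rewrite every ".doc" that appears anywhere in the path.

```python
import os


def docx_path_for(doc_path):
    """Return the .docx path next to doc_path, changing only the final extension."""
    root, _ext = os.path.splitext(doc_path)
    return root + ".docx"


def convert_all(paths):
    """Convert every file in `paths` with a single shared Word instance."""
    # Imported lazily so docx_path_for stays usable without pywin32 installed.
    from win32com import client

    word = client.Dispatch("Word.Application")
    try:
        for doc_path in paths:
            doc = word.Documents.Open(doc_path)
            try:
                # 16 = wdFormatDocumentDefault, i.e. the .docx format
                doc.SaveAs(docx_path_for(doc_path), 16)
            finally:
                doc.Close()
    finally:
        # Quit Word once, after the whole batch, even if a file failed.
        word.Quit()
```

Calling convert_all(glob("U:\\WordDocuments\\*.doc")) would then replace the loop over doc2docx. How much time this saves depends on the machine and the documents, so it is worth timing on the same 10 files first.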

1 Answer:

Answer 0: (score: 0)

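import os
from win32com import client
from glob import glob
from time import time
from multiprocessing import Pool


def doc2docx(doc_name):
    # Each call starts its own Word instance inside a worker process.
    word = client.Dispatch("Word.Application")
    doc = word.Documents.Open(doc_name)
    # Replace only the final extension, not every ".doc" in the path.
    docx_name = os.path.splitext(doc_name)[0] + ".docx"
    doc.SaveAs(docx_name, 16)  # 16 = wdFormatDocumentDefault (.docx)
    doc.Close()
    word.Quit()


def pool_processing_complete(results):
    # Runs in the parent process once every file has been handled.
    # It must be defined before map_async refers to it.
    duration = time() - start
    print("It took", duration, "seconds to process", len(results), "doc files.")


# The guard is required on Windows, where worker processes re-import this module.
if __name__ == "__main__":
    paths = glob("U:\\WordDocuments\\*.doc")
    start = time()
    pool = Pool()
    r = pool.map_async(doc2docx, paths, callback=pool_processing_complete)
    r.wait()
    pool.close()
    pool.join()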

This is an example that uses a multiprocessing pool.
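The pool approach can be combined with reusing Word inside each worker, so that 44K files do not mean 44K Word launches. The following is only a sketch under stated assumptions: pywin32 and Word are available, several Word instances can run side by side on the machine, and the names split_into_chunks, convert_chunk, and convert_in_parallel as well as the default of 4 workers are hypothetical choices made for illustration.

```python
import os
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Split `items` into at most n_chunks lists of near-equal size."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, len(items)))
    return [items[i::n_chunks] for i in range(n_chunks)]


def convert_chunk(paths):
    """Convert one chunk in a worker process, reusing one Word instance."""
    # Lazy import: needs pywin32 and Word, which exist only on Windows.
    from win32com import client

    word = client.Dispatch("Word.Application")
    converted = 0
    try:
        for doc_path in paths:
            doc = word.Documents.Open(doc_path)
            try:
                docx_path = os.path.splitext(doc_path)[0] + ".docx"
                doc.SaveAs(docx_path, 16)  # 16 = wdFormatDocumentDefault (.docx)
                converted += 1
            finally:
                doc.Close()
    finally:
        word.Quit()
    return converted


def convert_in_parallel(paths, workers=4):
    """Give each worker one chunk and return the number of converted files."""
    chunks = split_into_chunks(paths, workers)
    if not chunks:
        return 0
    with Pool(processes=len(chunks)) as pool:
        return sum(pool.map(convert_chunk, chunks))


# The guard is required on Windows, where workers re-import this module.
if __name__ == "__main__":
    from glob import glob

    doc_paths = glob("U:\\WordDocuments\\*.doc")
    print(convert_in_parallel(doc_paths), "files converted")
```

Each worker receives one chunk, so the number of Word launches equals the number of workers rather than the number of files. The trade-off is that an exception on a single file aborts that worker's whole chunk; for a run of this size it may be worth catching errors per file and logging the failing path instead.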