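How do I write a Python script that reads .doc/.docx files and converts them to .txt?

Time: 2017-06-26 12:59:25

Tags: python anaconda

Basically, I have a folder containing a large number of .doc/.docx files, and I need them in .txt format. The script should iterate over all the files in a directory, convert them to .txt files, and store them in another folder.

How can I do this?

Is there a module that can do this?

3 answers:

Answer 0 (score: 3)

I figured this would make an interesting quick programming project. It has only been tested on a simple .docx file containing "Hello, world!", but the train of logic should give you a place to work from when parsing more complex documents.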

from shutil import copyfile, rmtree
import sys
import os
import zipfile
from lxml import etree

# command format: python3 docx_to_txt.py Hello.docx

# let's get the file name
zip_dir = sys.argv[1]
# cut off the .docx, make it a .zip
zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
# make a copy of the .docx and put it in .zip
copyfile(zip_dir, zip_dir_zip_ext)
# unzip the .zip
zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
zip_ref.extractall('./temp')
# get the xml out of /word/document.xml
data = etree.parse('./temp/word/document.xml')
# we'll want to go over all 't' elements in the xml node tree.
# note that MS office uses namespaces and that the w must be defined in the namespaces dictionary args
# each :t element is the "text" of the file. that's what we're looking for
# result is a list filled with the text of each t node in the xml document model
# node.text is None for an empty w:t element, so fall back to an empty string
result = [(node.text or '').strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})]
# dump result into a new .txt file
with open(os.path.splitext(zip_dir)[0]+'.txt', 'w', encoding='utf-8') as txt:
    # join the elements of result together since txt.write can't take lists
    joined_result = '\n'.join(result)
    # write it into the new file
    txt.write(joined_result)
# close the zip_ref file
zip_ref.close()
# get rid of our mess of working directories
rmtree('./temp')
os.remove(zip_dir_zip_ext)

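I'm sure there is a more elegant or Pythonic way to accomplish this. You will need to have the file you want to convert in the same directory as the Python file. The command format is python3 docx_to_txt.py file_name.docx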
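As one possible variation on the script above, here is a minimal sketch that reads word/document.xml straight out of the archive (no copy, no temp directory) and loops over a whole folder, which is what the question asks for. It uses only the standard library. The names docx_to_text and convert_folder are made up for this example, and it assumes simple documents: text boxes and other nested content may come out duplicated or missing.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace, in the "{uri}tag" form ElementTree expects.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the text of a .docx file, one line per paragraph (hypothetical helper)."""
    # A .docx is a zip archive; read the main document part directly.
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W + "p"):
        pieces = []
        # Only look at the direct children of runs, so tab-stop definitions
        # in the paragraph properties are not mistaken for tab characters.
        for run in para.iter(W + "r"):
            for node in run:
                if node.tag == W + "t":
                    pieces.append(node.text or "")
                elif node.tag == W + "tab":
                    pieces.append("\t")
                elif node.tag in (W + "br", W + "cr"):
                    pieces.append("\n")
        lines.append("".join(pieces))
    return "\n".join(lines)


def convert_folder(src_dir, dst_dir):
    """Write one .txt into dst_dir for every .docx directly inside src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        if docx_path.name.startswith("~$"):
            continue  # skip the lock files Word leaves next to open documents
        out_path = dst / (docx_path.stem + ".txt")
        out_path.write_text(docx_to_text(docx_path), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling convert_folder("docs", "txt_out") would then process every .docx in docs; both folder names are placeholders. Unlike the script above, this joins the runs of a paragraph into a single line instead of writing each w:t element on its own line, and it does not handle legacy .doc files.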

Answer 1 (score: 0)

conda install -c conda-forge python-docx

from docx import Document

# file is the path to a .docx document
doc = Document(file)

for p in doc.paragraphs:
    print(p.text)

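Answer 2 (score: 0)

I'd like to share my approach, which basically boils down to two commands that convert a .doc or .docx file to a string. Both options require a specific package: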

import docx
import os
import glob
import subprocess
import sys

# infile is the path to the document to convert
# .docx (pip3 install python-docx)
doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
# .doc (apt-get install antiword)
doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")

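I then wrapped these solutions into a function that can either return the result as a Python string or write it to a file (with the option of appending or replacing).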

import docx
import os
import glob
import subprocess
import sys

def doc2txt(infile, outfile, return_string=False, append=False):
    if os.path.exists(infile):
        if infile.endswith(".docx"):
            try:
                doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
            except Exception as e:
                print("Exception in converting .docx to str: ", e)
                return None
        elif infile.endswith(".doc"):
            try:
                doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
            except Exception as e:
                print("Exception in converting .doc to str: ", e)
                return None
        else:
            print("{0} is not .doc or .docx".format(infile))
            return None

        if return_string:
            return doctext
        else:
            writemode = "a" if append else "w"
            # the with block closes the file, so no explicit close() is needed
            with open(outfile, writemode) as f:
                f.write(doctext)
    else:
        print("{0} does not exist".format(infile))
        return None

I would then call the function with something like this:

files = glob.glob("/path/to/filedir/**/*.doc*", recursive=True)
outfile = "/path/to/out.txt"
for file in files:
    doc2txt(file, outfile, return_string=False, append=True)

I don't need to do this very often, but so far the script has worked for all my needs. If you find a bug in this function, please let me know in the comments.
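The loop in this answer appends everything to a single out.txt, whereas the question asks for one .txt per document in a separate folder. A rough sketch of how the output paths could be derived for that case follows. plan_outputs and convert_all are hypothetical names invented here, and the converter is passed in as a parameter, so nothing beyond the standard library is assumed.

```python
from pathlib import Path


def plan_outputs(src_dir, dst_dir):
    """Pair every .doc/.docx under src_dir with a .txt path under dst_dir."""
    src_root = Path(src_dir)
    dst_root = Path(dst_dir)
    pairs = []
    for path in sorted(src_root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in (".doc", ".docx"):
            continue
        if path.name.startswith("~$"):
            continue  # skip the lock files Word leaves next to open documents
        relative = path.relative_to(src_root)
        # Keep the original extension in the name so that report.doc and
        # report.docx do not both map to report.txt.
        target = dst_root / relative.parent / (relative.name + ".txt")
        pairs.append((path, target))
    return pairs


def convert_all(src_dir, dst_dir, convert):
    """Call convert(infile, outfile) once per document; convert is supplied by the caller."""
    for infile, outfile in plan_outputs(src_dir, dst_dir):
        # Mirror the source sub-folder structure inside dst_dir.
        outfile.parent.mkdir(parents=True, exist_ok=True)
        convert(str(infile), str(outfile))
```

With the doc2txt function from this answer, the call would be convert_all("/path/to/filedir", "/path/to/outdir", doc2txt), since its defaults write to outfile and replace any existing contents.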