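Basically, I have a folder containing a large number of .doc/.docx files. I need them in .txt format. The script should iterate over all files in a directory, convert them to .txt files, and store them in another folder.
How can I do this?
Is there a module that can do this?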
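Answer 0 (score: 3)
I thought this would make an interesting quick programming project. It has only been tested on a simple .docx file containing "Hello, world!", but the logic should give you a starting point for parsing more complex documents.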
from shutil import copyfile, rmtree
import sys
import os
import zipfile
from lxml import etree
# command format: python3 docx_to_txt.py Hello.docx
# let's get the file name
zip_dir = sys.argv[1]
# cut off the .docx, make it a .zip
zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
# make a copy of the .docx and put it in .zip
copyfile(zip_dir, zip_dir_zip_ext)
# unzip the .zip
zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
zip_ref.extractall('./temp')
# get the xml out of /word/document.xml
data = etree.parse('./temp/word/document.xml')
# we'll want to go over all 't' elements in the xml node tree.
# note that MS office uses namespaces and that the w must be defined in the namespaces dictionary args
# each :t element is the "text" of the file. that's what we're looking for
# result is a list filled with the text of each t node in the xml document model
# (node.text can be None for an empty element, so fall back to an empty string)
result = [(node.text or '').strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})]
# dump result into a new .txt file
with open(os.path.splitext(zip_dir)[0]+'.txt', 'w', encoding='utf-8') as txt:
    # join the elements of result together since txt.write can't take lists
    joined_result = '\n'.join(result)
    # write it into the new file
    txt.write(joined_result)
# close the zip_ref file
zip_ref.close()
# get rid of our mess of working directories
rmtree('./temp')
os.remove(zip_dir_zip_ext)
I'm sure there are more elegant or Pythonic ways to do this. The file to convert needs to be in the same directory as the Python script. The command format is python3 docx_to_txt.py file_name.docx
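As a possible simplification of the script above, the XML can be read straight out of the archive, with no .zip copy and no temp directory. The following is only a sketch using the standard library: the function name docx_to_text is made up for illustration, and it assumes the text of interest lives in w:t elements inside w:p paragraphs of word/document.xml. It joins the runs of each paragraph, so a paragraph split across several w:t nodes stays on one line. Headers, footnotes, and text boxes are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, the same one used in the script above
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: reads word/document.xml directly from the archive.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph may be split into several runs; glue their text together.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Because nothing is extracted to disk, there is no cleanup step and no risk of deleting an unrelated ./temp folder.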
Answer 1 (score: 0)
conda install -c conda-forge python-docx
from docx import Document
# `file` is the path to a .docx file
doc = Document(file)
for p in doc.paragraphs:
    print(p.text)
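The snippet above prints a single document. To cover what the question asks for (every file in one folder, with the results stored in another), a loop along the following lines might work. This is a sketch, not part of the original answer: read_docx and convert_folder are hypothetical names, only .docx files in the top level of the source folder are considered, and files starting with "~$" are skipped on the assumption that they are Word lock files.

```python
from pathlib import Path


def read_docx(path):
    """Extract paragraph text with python-docx (imported only when called)."""
    from docx import Document  # provided by the python-docx package
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def convert_folder(src_dir, dst_dir, extract=read_docx):
    """Write one .txt per .docx in src_dir into dst_dir; return the output paths.

    `extract` is any callable that maps a file path to a string.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob("*.docx")):
        if src.name.startswith("~$"):  # assumed to be a Word lock file
            continue
        out = dst / (src.stem + ".txt")
        out.write_text(extract(src), encoding="utf-8")
        written.append(out)
    return written
```

A call such as convert_folder("docs", "txt_out") would then write docs/report.docx to txt_out/report.txt. Passing the extractor as a parameter keeps the folder loop independent of the library that does the actual conversion.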
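Answer 2 (score: 0)
I'd like to share my approach, which basically boils down to two commands that convert a .doc or .docx file to a string. Each option requires a specific package: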
import docx
import os
import glob
import subprocess
import sys
# .docx (pip3 install python-docx)
# (p.text is already a str in Python 3, so no encode/decode round trip is needed)
doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
# .doc (apt-get install antiword)
doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
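Since antiword is an external program rather than a Python package, the .doc branch fails if it is not installed. One way to guard against that is sketched below. The function name doc_to_text and its tool parameter are invented for this example, and decoding the output as UTF-8 follows the snippet above rather than anything guaranteed for every system.

```python
import shutil
import subprocess


def doc_to_text(infile, tool="antiword"):
    """Return the text of a legacy .doc file, or None if `tool` is not on PATH."""
    if shutil.which(tool) is None:
        return None
    # check=True raises CalledProcessError if the tool exits with an error
    completed = subprocess.run([tool, infile], capture_output=True, check=True)
    # Assumption: the tool emits UTF-8; undecodable bytes are replaced
    return completed.stdout.decode("utf-8", errors="replace")
```

Returning None for a missing tool lets the caller decide whether to skip .doc files or stop with an installation hint.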
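I then wrapped these solutions in a function that can either return the result as a Python string or write it to a file (with the option to append or overwrite).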
import docx
import os
import glob
import subprocess
import sys
def doc2txt(infile, outfile, return_string=False, append=False):
    """Convert a .doc/.docx file to text; return it or write it to outfile."""
    if not os.path.exists(infile):
        print("{0} does not exist".format(infile))
        return None
    if infile.endswith(".docx"):
        try:
            doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
        except Exception as e:
            print("Exception in converting .docx to str: ", e)
            return None
    elif infile.endswith(".doc"):
        try:
            doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
        except Exception as e:
            print("Exception in converting .doc to str: ", e)
            return None
    else:
        print("{0} is not .doc or .docx".format(infile))
        return None
    if return_string:
        return doctext
    # append to or overwrite the output file; the with block closes it
    writemode = "a" if append else "w"
    with open(outfile, writemode, encoding="utf-8") as f:
        f.write(doctext)
I would then call this function with something like the following:
files = glob.glob("/path/to/filedir/**/*.doc*", recursive=True)
outfile = "/path/to/out.txt"
for file in files:
    doc2txt(file, outfile, return_string=False, append=True)
I don't need to do this very often, but so far this script has met all my needs. If you find a bug in this function, please let me know in the comments.