我正在将docx文件转换为pdf。目前我正在将docx转换为txt文件,然后将txt文件写入pdf。
但我想将docx从解析后的lxml直接转换为pdf(维护lxml结构/格式化)。
有简化的方法吗?
当前docx转换为pdf:
from shutil import copyfile, rmtree
import sys
import os
import zipfile
from lxml import etree
zip_dir = sys.argv[1]
zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
copyfile(zip_dir, zip_dir_zip_ext)
zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
zip_ref.extractall('./temp')
data = etree.parse('./temp/word/document.xml')
result = [node.text.strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})]
import codecs
with codecs.open(os.path.splitext(zip_dir)[0]+'_converted_temp.txt', 'w', 'UTF-8') as txt:
joined_result = '\n'.join(result)
txt.write(joined_result)
zip_ref.close()
rmtree('./temp')
os.remove(zip_dir_zip_ext)
灵感:How do I write a python script that can read doc/docx files and convert them to txt?