I want to remove all personal information from the comments in a Word file.
Removing the author name works fine; I use the following:
from docx import Document

document = Document('sampleFile.docx')
core_properties = document.core_properties
core_properties.author = ""
document.save('new-filename.docx')
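But that is not what I need: I want to remove the name of anyone who has commented in that Word file.
The way we do it manually is to go to Preferences -> Security -> Remove personal information from this file on save.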
Answer 0 (score: 4)
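The core properties recognised by the CoreProperties class are listed in the official documentation: http://python-docx.readthedocs.io/en/latest/api/document.html#coreproperties-objects
To clear them, overwrite each one the same way you overwrote the author metadata: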
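from docx import Document

document = Document('sampleFile.docx')
core_properties = document.core_properties
# Only the text-valued properties accept an empty string. "created" expects a
# datetime and "revision" a positive int, so both are left out of this list.
meta_fields = ["author", "category", "comments", "content_status", "identifier", "keywords", "language", "last_modified_by", "subject", "title", "version"]
for meta_field in meta_fields:
    setattr(core_properties, meta_field, "")
document.save('new-filename.docx')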
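One caveat when clearing core properties in a loop: as far as I know, python-docx only accepts a `datetime` for `created`, `modified` and `last_printed`, and a positive `int` for `revision`, so those need a neutral value rather than an empty string. A minimal sketch of that idea follows; `scrub_core_properties`, `scrub_file` and the placeholder date are hypothetical names and values of my own, not part of python-docx.

```python
import datetime as dt

# Text-valued core properties, cleared with an empty string.
TEXT_FIELDS = [
    "author", "category", "comments", "content_status", "identifier",
    "keywords", "language", "last_modified_by", "subject", "title", "version",
]
# Date-valued properties, reset to an arbitrary placeholder date (assumption).
DATE_FIELDS = ["created", "modified", "last_printed"]
PLACEHOLDER_DATE = dt.datetime(2000, 1, 1)


def scrub_core_properties(props):
    """Overwrite the metadata on a CoreProperties-like object in place."""
    for name in TEXT_FIELDS:
        setattr(props, name, "")
    for name in DATE_FIELDS:
        setattr(props, name, PLACEHOLDER_DATE)
    props.revision = 1  # revision has to be a positive int
    return props


def scrub_file(src, dst):
    """Hypothetical wrapper: load src with python-docx, scrub it, save to dst."""
    from docx import Document  # requires the python-docx package

    document = Document(src)
    scrub_core_properties(document.core_properties)
    document.save(dst)
```

Calling `scrub_file('sampleFile.docx', 'new-filename.docx')` would rewrite the document-level metadata only; the names attached to individual comments are stored elsewhere in the file and are not touched by this.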
Answer 1 (score: 3)
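If you want to remove personal information from the comments in a .docx file, you have to dig into the file itself.
A .docx is just a .zip archive containing Word-specific files. We need to overwrite some of its internal files, and the easiest way I could find is to copy all of the files into memory, change what we need to change and put everything into a new file.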
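To see the archive structure for yourself before changing anything, you can list what is inside the .docx. This is only a small sketch; `list_docx_parts` and `has_comments` are hypothetical helper names, and the only assumption taken from the answer is that comments live in `word/comments.xml`.

```python
from zipfile import ZipFile


def list_docx_parts(docx_path):
    """Return the names of the files stored inside a .docx archive."""
    with ZipFile(docx_path) as archive:
        return archive.namelist()


def has_comments(docx_path):
    """True if the archive contains the part that holds the comments."""
    return "word/comments.xml" in list_docx_parts(docx_path)
```

If `has_comments(...)` returns False, the document has no comments part and there are no comment authors to rewrite.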
import re
from zipfile import ZipFile, ZIP_DEFLATED

docx_file_name = '/path/to/your/document.docx'
files = dict()

# We read all of the files and store them in the "files" dictionary as raw bytes.
# Reading bytes (instead of decoded lines) keeps binary parts such as images intact.
with ZipFile(docx_file_name, 'r') as document_as_zip:
    for internal_file in document_as_zip.infolist():
        files[internal_file.filename] = document_as_zip.read(internal_file.filename)

# If there are any comments.
if "word/comments.xml" in files:
    # We will be working on the comments file, the only part we decode to text.
    comments = files["word/comments.xml"].decode("utf-8")
    # Change every author to "Unknown Author".
    comments = re.sub(r'w:author="[^"]*"', 'w:author="Unknown Author"', comments)
    files["word/comments.xml"] = comments.encode("utf-8")

# Now we write everything back. Opening in 'w' mode replaces the old .docx,
# and it only happens after all of the contents are safely in memory.
with ZipFile(docx_file_name, 'w', ZIP_DEFLATED) as document_as_zip:
    for internal_file_name, data in files.items():
        # We write the (possibly edited) file contents to the new .docx.
        document_as_zip.writestr(internal_file_name, data)
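The same approach can be wrapped in a function that writes to a separate output file instead of overwriting the original, which is a bit safer while you are testing. This is a sketch under a few assumptions of mine: the function names `anonymize_comments` and `comment_authors` are hypothetical, the replacement name and initials are plain text with no characters that need XML escaping, and besides `w:author` it also rewrites the `w:initials` attribute that Word stores next to the author on each comment.

```python
import re
from zipfile import ZipFile, ZIP_DEFLATED

AUTHOR_RE = re.compile(r'w:author="[^"]*"')
INITIALS_RE = re.compile(r'w:initials="[^"]*"')


def anonymize_comments(src_path, dst_path, name="Unknown Author", initials="UA"):
    """Copy src_path to dst_path, replacing comment authors and initials.

    src_path and dst_path must be different files.
    """
    with ZipFile(src_path) as src, ZipFile(dst_path, "w", ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)  # raw bytes, safe for binary parts
            if info.filename == "word/comments.xml":
                text = data.decode("utf-8")
                # A function replacement avoids backslash handling in re.sub.
                text = AUTHOR_RE.sub(lambda m: f'w:author="{name}"', text)
                text = INITIALS_RE.sub(lambda m: f'w:initials="{initials}"', text)
                data = text.encode("utf-8")
            dst.writestr(info.filename, data)


def comment_authors(docx_path):
    """Return the set of author names present in the comments part."""
    with ZipFile(docx_path) as archive:
        if "word/comments.xml" not in archive.namelist():
            return set()
        text = archive.read("word/comments.xml").decode("utf-8")
    return set(re.findall(r'w:author="([^"]*)"', text))
```

After running `anonymize_comments('in.docx', 'out.docx')`, calling `comment_authors('out.docx')` is a quick way to check that only the replacement name is left. Other parts of the archive may also carry names (tracked changes in `word/document.xml` use a `w:author` attribute too), so this covers the comments only.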