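I need help replacing a string in a Word document while keeping the formatting of the entire document.
I'm using python-docx. After reading the documentation, I see that it works on whole paragraphs, so I lose formatting such as bold or italics. The text to be replaced is itself bold, and I would like to keep it that way. I'm using this code: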
from docx import Document

def replace_string2(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'Text to find and replace' in p.text:
            print('SEARCH FOUND!!')
            text = p.text.replace('Text to find and replace', 'new text')
            style = p.style
            # Assigning p.text rebuilds the paragraph as a single run,
            # which is where the bold/italic formatting gets lost.
            p.text = text
            p.style = style
    # doc.save(filename)
    doc.save('test.docx')
    return 1
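So when I run it, I want something like the following, but the paragraph containing the string to be replaced loses its formatting:
This is paragraph 1, and this is a text in bold.
This is paragraph 2, and I will replace old text
The current result is:
This is paragraph 1, and this is a text in bold.
This is paragraph 2, and I will replace new text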
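To see where the formatting lives, it can help to list a paragraph's runs before replacing anything. In python-docx, character formatting such as bold and italic is stored per run, and assigning to `paragraph.text` replaces the paragraph's content with a single run. The helper below is only a sketch: `describe_runs` is a hypothetical name, and the paragraph is a stand-in built with `SimpleNamespace` so the snippet runs without a .docx file. With a real document you would pass a paragraph from `doc.paragraphs`.

```python
from types import SimpleNamespace


def describe_runs(paragraph):
    """Return (text, bold, italic) for every run in a paragraph."""
    return [(run.text, run.bold, run.italic) for run in paragraph.runs]


# Stand-in for paragraph 2 of the example (hypothetical data, not a real docx).
# In python-docx, bold/italic are True, False, or None (None = inherited).
paragraph = SimpleNamespace(runs=[
    SimpleNamespace(text="This is paragraph 2, and I will replace ", bold=None, italic=None),
    SimpleNamespace(text="old text", bold=True, italic=None),
])

for text, bold, italic in describe_runs(paragraph):
    print(repr(text), bold, italic)
```

If the bold text shows up as its own run, replacing the text of that run alone leaves its formatting untouched.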
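Answer 0 (score: 10)
I posted this question (even though I saw several similar ones here) because none of them, as far as I could tell, solved the problem. One answer used the oodocx library; I tried it, but it did not work. So I found a workaround.
The code is very similar, but the logic is: when I find a paragraph that contains the string I want to replace, I add another loop over its runs. (This only works if the string I want to replace has uniform formatting, that is, it sits inside a single run.)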
from docx import Document

def replace_string(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'old text' in p.text:
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if 'old text' in inline[i].text:
                    text = inline[i].text.replace('old text', 'new text')
                    inline[i].text = text
            print(p.text)
    doc.save('dest1.docx')
    return 1
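The limitation mentioned above (the string must sit inside one run) can be made visible with a small sketch. `replace_in_runs` and `make_paragraph` are hypothetical names; the paragraphs are stand-ins that only model the `.runs` and `.text` attributes, so the snippet runs without python-docx installed.

```python
from types import SimpleNamespace


def replace_in_runs(paragraph, old, new):
    """Replace `old` with `new` inside each run; return how many runs changed."""
    changed = 0
    for run in paragraph.runs:
        if old in run.text:
            run.text = run.text.replace(old, new)
            changed += 1
    return changed


def make_paragraph(*texts):
    # Hypothetical stand-in: only `.runs` and each run's `.text` are modelled.
    return SimpleNamespace(runs=[SimpleNamespace(text=t) for t in texts])


whole = make_paragraph("I will replace ", "old text")
split = make_paragraph("I will replace ", "old ", "text")

print(replace_in_runs(whole, "old text", "new text"))  # the string sits in one run
print(replace_in_runs(split, "old text", "new text"))  # the string is split over two runs
```

A return value of zero for a paragraph whose full text does contain the string is a sign that Word has split it across runs, which this approach cannot handle.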
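Answer 1 (score: 2)
This is how I keep the text style when replacing text.
Based on Alo's answer, and on the fact that the search text can be split across several runs, this is how I replace placeholder text in a template docx file. It checks all document paragraphs and all table cell contents for the placeholders.
Once the search text is found in a paragraph, it loops over the runs to identify which runs contain part of the search text, then inserts the replacement text in the first of those runs and blanks out the remaining search-text characters in the other runs.
I hope this helps someone. Here is the gist, if anyone wants to improve it.
Edit:
I have subsequently discovered python-docx-template, which allows jinja2-style templating within a docx template. Here is a link to the documentation.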
def docx_replace(doc, data):
    paragraphs = list(doc.paragraphs)
    for t in doc.tables:
        for row in t.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    paragraphs.append(paragraph)
    for p in paragraphs:
        for key, val in data.items():
            key_name = '${{{}}}'.format(key)  # I'm using placeholders in the form ${PlaceholderName}
            if key_name in p.text:
                inline = p.runs
                # Replace strings and retain the same style.
                # The text to be replaced can be split over several runs so
                # search through, identify which runs need to have text replaced
                # then replace the text in those identified
                started = False
                key_index = 0
                # found_runs is a list of (inline index, index of match, length of match)
                found_runs = list()
                found_all = False
                replace_done = False
                for i in range(len(inline)):
                    # case 1: found in single run so short circuit the replace
                    if key_name in inline[i].text and not started:
                        found_runs.append((i, inline[i].text.find(key_name), len(key_name)))
                        text = inline[i].text.replace(key_name, str(val))
                        inline[i].text = text
                        replace_done = True
                        found_all = True
                        break
                    if key_name[key_index] not in inline[i].text and not started:
                        # keep looking ...
                        continue
                    # case 2: search for partial text, find first run
                    if key_name[key_index] in inline[i].text and inline[i].text[-1] in key_name and not started:
                        # check sequence
                        start_index = inline[i].text.find(key_name[key_index])
                        check_length = len(inline[i].text)
                        for text_index in range(start_index, check_length):
                            if inline[i].text[text_index] != key_name[key_index]:
                                # no match so must be false positive
                                break
                        if key_index == 0:
                            started = True
                        chars_found = check_length - start_index
                        key_index += chars_found
                        found_runs.append((i, start_index, chars_found))
                        if key_index != len(key_name):
                            continue
                        else:
                            # found all chars in key_name
                            found_all = True
                            break
                    # case 2: search for partial text, find subsequent run
                    if key_name[key_index] in inline[i].text and started and not found_all:
                        # check sequence
                        chars_found = 0
                        check_length = len(inline[i].text)
                        for text_index in range(0, check_length):
                            # the length guard avoids an IndexError when the run
                            # continues past the end of the placeholder
                            if key_index < len(key_name) and inline[i].text[text_index] == key_name[key_index]:
                                key_index += 1
                                chars_found += 1
                            else:
                                break
                        # no match so must be end
                        found_runs.append((i, 0, chars_found))
                        if key_index == len(key_name):
                            found_all = True
                            break
                if found_all and not replace_done:
                    for i, item in enumerate(found_runs):
                        index, start, length = [t for t in item]
                        if i == 0:
                            text = inline[index].text.replace(inline[index].text[start:start + length], str(val))
                            inline[index].text = text
                        else:
                            text = inline[index].text.replace(inline[index].text[start:start + length], '')
                            inline[index].text = text
                # print(p.text)
# usage
import docx

doc = docx.Document('path/to/template.docx')
docx_replace(doc, dict(ItemOne='replacement text', ItemTwo="Some replacement text\nand some more"))
doc.save('path/to/destination.docx')
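For comparison, the same idea (put the replacement into the first affected run, blank the matched characters in the following runs) can be written more compactly by working with character offsets in the joined paragraph text. This is a sketch under assumptions, not the code from the gist: `replace_across_runs` is a hypothetical name, it handles only the first occurrence, and it assumes the paragraph text is exactly the concatenation of its runs' text. The demo uses stand-in objects so it runs without a .docx file.

```python
from types import SimpleNamespace


def replace_across_runs(paragraph, old, new):
    """Replace the first occurrence of `old`, even if it spans several runs.

    The replacement is written into the run where the match starts, so it
    takes on that run's formatting. Returns True if a replacement was made.
    """
    if not old:
        return False
    runs = paragraph.runs
    full_text = "".join(run.text for run in runs)
    start = full_text.find(old)
    if start < 0:
        return False
    end = start + len(old)

    offset = 0        # where the current run begins inside full_text
    inserted = False  # becomes True once `new` has been written
    for run in runs:
        run_start, run_end = offset, offset + len(run.text)
        offset = run_end
        if run_end <= start or run_start >= end:
            continue  # this run does not overlap the match
        lo = max(start, run_start) - run_start
        hi = min(end, run_end) - run_start
        replacement = "" if inserted else new
        run.text = run.text[:lo] + replacement + run.text[hi:]
        inserted = True
    return True


# Stand-in paragraph whose placeholder is split over three runs (hypothetical data).
paragraph = SimpleNamespace(runs=[
    SimpleNamespace(text="Dear ${"),
    SimpleNamespace(text="Name"),
    SimpleNamespace(text="}, hello"),
])
replace_across_runs(paragraph, "${Name}", "Alice")
print([run.text for run in paragraph.runs])
```

Calling it in a loop until it returns False would handle repeated placeholders, provided the replacement text does not itself contain the placeholder.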
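Answer 2 (score: 2)
According to the structure of a DOCX document, footers work the same way as headers. We could traverse the paragraphs directly to find and replace our keywords, but that resets the text formatting, so we can only walk through the text of each run and replace it there. However, because a keyword may extend beyond a single run, that replacement will not always succeed.
So here is an idea: first, taking the paragraph as the unit, record the position of every character in the paragraph in a list; then record, for each character in that list, its position within its run; finally, find the keywords in the paragraph and, using that correspondence, delete and replace them character by character.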
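The indexing idea can be sketched in isolation before looking at the full script. `build_char_map` and `find_all` are hypothetical helper names, and the runs are stand-ins that only carry a `.text` attribute; the sketch assumes the paragraph text is the concatenation of its runs.

```python
from types import SimpleNamespace


def build_char_map(runs):
    """Map each character position in the paragraph to (run index, char index)."""
    return [
        (run_index, char_index)
        for run_index, run in enumerate(runs)
        for char_index in range(len(run.text))
    ]


def find_all(text, key):
    """Start positions of every occurrence of `key` in `text`."""
    return [i for i in range(len(text)) if text.startswith(key, i)]


# Stand-in runs: the second "MG-" is split across two runs (hypothetical data).
runs = [
    SimpleNamespace(text="MG-"),
    SimpleNamespace(text="2017 and M"),
    SimpleNamespace(text="G-2018"),
]
text = "".join(run.text for run in runs)
char_map = build_char_map(runs)

for start in find_all(text, "MG-"):
    # Each slice lists exactly which (run, char) positions hold this occurrence.
    print(start, char_map[start:start + len("MG-")])
```

Processing the occurrences from last to first keeps the earlier positions valid while characters are being removed.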
'''
-*- coding: utf-8 -*-
@Time : 2021/4/19 13:13
@Author : ZCG
@Site :
@File : Batch DOCX document keyword replacement.py
@Software: PyCharm
'''
from docx import Document
import os
import tqdm
def get_docx_list(dir_path):
    '''
    :param dir_path: directory to search (subdirectories included)
    :return: List of docx files found under dir_path
    '''
    file_list = []
    for path, dirs, files in os.walk(dir_path):
        for file in files:
            # Locate the docx documents and exclude Word's temporary "~" files
            if file.endswith(".docx") and not file.startswith("~"):
                file_list.append(os.path.join(path, file))
    print("The directory found a total of {0} related files!".format(len(file_list)))
    return file_list
class ParagraphsKeyWordsReplace:
    '''
    self: paragraph
    '''
    def paragraph_keywords_replace(self, x, key, value):
        '''
        :param x: paragraph index
        :param key: Keyword to be replaced
        :param value: Replacement text
        :return:
        '''
        # Record the start position of every occurrence of key in this paragraph.
        # A `while self.text.find(key) >= 0` loop is avoided here because it would
        # never terminate when the replacement text contains the key again.
        keywords_list = [s for s in range(len(self.text)) if self.text.find(key, s) == s]
        while len(keywords_list) > 0:  # The paragraph may contain the key more than once
            index_list = []  # (run, char) position of every character in this paragraph
            for y, run in enumerate(self.runs):  # Index of the run
                for z, char in enumerate(list(run.text)):  # Index of the char within the run
                    position = {"run": y, "char": z}
                    index_list.append(position)
            start_i = keywords_list.pop()  # Work from the last occurrence to the first
            end_i = start_i + len(key)  # Where the keyword ends in the paragraph
            keywords_index_list = index_list[start_i:end_i]  # Positions covered by the keyword
            ParagraphsKeyWordsReplace.character_replace(self, keywords_index_list, value)
            # print(f"Successful replacement:{key}===>{value}")

    def character_replace(self, keywords_index_list, value):
        '''
        :param keywords_index_list: A list of indexed dictionaries covering the keyword
        :param value: The new text after the replacement
        :return:
        Delete the characters listed in keywords_index_list from back to front,
        keeping the first one so that it can be replaced with value.
        Note: the deletion must run from back to front; deleting front to back
        would shift the remaining indexes and cause a string index out of range error
        '''
        while len(keywords_index_list) > 0:
            position = keywords_index_list.pop()  # Removes the last element and returns it
            y = position["run"]
            z = position["char"]
            run = self.runs[y]
            # Slice by position: str.replace(char, ...) would hit every occurrence
            # of that character in the run, not only the one at index z.
            if len(keywords_index_list) > 0:
                run.text = run.text[:z] + run.text[z + 1:]  # Delete characters [1:] of the keyword
            else:
                run.text = run.text[:z] + value + run.text[z + 1:]  # Replace the 0th character
class DocxKeyWordsReplace:
    '''
    self: docx
    '''
    def content(self, replace_dict):
        print("Please wait for a moment, the body content is processed...")
        for key, value in tqdm.tqdm(replace_dict.items()):
            for x, paragraph in enumerate(self.paragraphs):
                ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

    def tables(self, replace_dict):
        print("Please wait for a moment, the body tables is processed...")
        for key, value in tqdm.tqdm(replace_dict.items()):
            for table in self.tables:
                for row in table.rows:
                    for cell in row.cells:
                        for x, paragraph in enumerate(cell.paragraphs):
                            ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

    def header_content(self, replace_dict):
        print("Please wait for a moment, the header body content is processed...")
        for key, value in tqdm.tqdm(replace_dict.items()):
            for section in self.sections:
                for x, paragraph in enumerate(section.header.paragraphs):
                    ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

    def header_tables(self, replace_dict):
        print("Please wait for a moment, the header body tables is processed...")
        for key, value in tqdm.tqdm(replace_dict.items()):
            for section in self.sections:
                # Each item is already a table, so it is not indexed again,
                # and row.cells is iterated directly (no tuple unpacking).
                for table in section.header.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

    def footer_content(self, replace_dict):
        print("Please wait for a moment, the footer body content is processed...")
        for key, value in tqdm.tqdm(replace_dict.items()):
            for section in self.sections:
                for x, paragraph in enumerate(section.footer.paragraphs):
                    ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

    def footer_tables(self, replace_dict):
        print("Please wait for a moment, the footer body tables is processed...")
        for key, value in tqdm.tqdm(replace_dict.items()):
            for section in self.sections:
                for table in section.footer.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)
def main():
    '''
    How to use it: Modify the values in replace_dict and file_dir
    replace_dict: key is the content to be replaced, value is the new content
    file_dir: The directory where the docx files reside. Subdirectories are supported
    '''
    # Input part
    replace_dict = {
        "MG life technology (shenzhen) co., LTD": "Shenzhen YW medical technology co., LTD",
        "MG-": "YW-",
        "2017-": "2020-",
        "Z18": "Z20",
    }
    file_dir = r"D:\Working Files\SVN"  # a raw string literal cannot end with a backslash
    # Call processing part
    for i, file in enumerate(get_docx_list(file_dir), start=1):
        print(f"{i}. Files in progress: {file}")
        docx = Document(file)
        DocxKeyWordsReplace.content(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.tables(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.header_content(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.header_tables(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.footer_content(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.footer_tables(docx, replace_dict=replace_dict)
        docx.save(file)
        print("This document has been processed!\n")

if __name__ == "__main__":
    main()
    print("All complete processing!")
Answer 3 (score: 0)
from docx import Document

document = Document('old.docx')
dic = {'name': 'ahmed', 'me': 'zain'}
for p in document.paragraphs:
    inline = p.runs
    for i in range(len(inline)):
        text = inline[i].text
        # Only matches a run whose entire text is exactly one of the keys
        if text in dic.keys():
            text = text.replace(text, dic[text])
            inline[i].text = text
document.save('new.docx')